Re: skip setting output path for a sequential MR job..

2009-03-31 Thread Aaron Kimball
You must remove the existing output directory before running the job. This
check is put in to prevent you from inadvertently destroying or muddling
your existing output data.

You can remove the output directory in advance programmatically with code
similar to:

FileSystem fs = FileSystem.get(conf); // use your JobConf here
fs.delete(new Path("/path/to/output/dir"), true);

See
http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/fs/FileSystem.html for
more details.
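
A fuller sketch along the same lines (the class name and paths below are
placeholders, not from the original message):

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class RerunnableJob {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(RerunnableJob.class);
        Path output = new Path("/path/to/output/dir"); // placeholder path

        // Remove any output left over from a previous run; 'true' = recursive.
        FileSystem fs = FileSystem.get(conf);
        if (fs.exists(output)) {
            fs.delete(output, true);
        }

        FileOutputFormat.setOutputPath(conf, output);
        // ... set mapper, reducer, input path, etc. ...
        JobClient.runJob(conf);
    }
}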

- Aaron


On Mon, Mar 30, 2009 at 9:25 PM, some speed  wrote:

> Hello everyone,
>
> Is it necessary to redirect the output of reduce to a file? When I am trying
> to run the same M-R job more than once, it throws an error that the output
> file already exists. I don't want to use command line args so I hard coded
> the file name into the program.
>
> So, is there a way I could delete a file on HDFS programmatically?
> or can I skip setting an output file path and just have my output print to
> console?
> or can I just append to an existing file?
>
>
> Any help is appreciated. Thanks.
>
> -Sharath
>


Re: ANN: Hadoop UI beta

2009-03-31 Thread Dave butlerdi
+1

On Tue, Mar 31, 2009 at 1:11 PM, Stefan Podkowinski wrote:

> Hello,
>
> I'd like to invite you to take a look at the recently released first
> beta of Hadoop UI, a graphical Flex/Java based client for Hadoop Core.
> Hadoop UI currently includes an HDFS file explorer and basic job
> tracking features.
>
> Get it here:
> http://code.google.com/p/hadoop-ui/
>
> As this is the first release it may (and does) still contain bugs, but
> I'd like to give everyone the chance to send feedback as early as
> possible.
> Give it a try :)
>
> - Stefan
>



-- 
Regards

Dave Butler
butlerdi-at-gmail-dot-com

Also on Skype as butlerdi

Get Skype here http://www.skype.com/download.html




ANN: Hadoop UI beta

2009-03-31 Thread Stefan Podkowinski
Hello,

I'd like to invite you to take a look at the recently released first
beta of Hadoop UI, a graphical Flex/Java based client for Hadoop Core.
Hadoop UI currently includes an HDFS file explorer and basic job
tracking features.

Get it here:
http://code.google.com/p/hadoop-ui/

As this is the first release it may (and does) still contain bugs, but
I'd like to give everyone the chance to send feedback as early as
possible.
Give it a try :)

- Stefan


Re: ANN: Hadoop UI beta

2009-03-31 Thread Mikhail Yakshin
Hi, Stefan,

> I'd like to invite you to take a look at the recently released first
> beta of Hadoop UI, a graphical Flex/Java based client for Hadoop Core.
> Hadoop UI currently includes an HDFS file explorer and basic job
> tracking features.

Could you please explain what it does, or at least what you want it to
do? Why is it better than the default Hadoop web UI?

I've peeked here and there, and so far, as I understand it, it's a
somewhat underfeatured copy of the default Hadoop web UI, created using
a closed Adobe technology. Maybe I'm missing the point here and it is
or will be something completely different (with a different
focus/emphasis)?

-- 
WBR, Mikhail Yakshin


Re: ANN: Hadoop UI beta

2009-03-31 Thread W
+1 wow, looks fantastic :)

The summary says it works only with 0.19. Just curious, does it also
work with the Hadoop trunk?

Thanks!

Best Regards,
Wildan

---
OpenThink Labs
www.tobethink.com

Aligning IT and Education

>> 021-99325243
Y! : hawking_123
LinkedIn : http://www.linkedin.com/in/wildanmaulana



On Tue, Mar 31, 2009 at 6:11 PM, Stefan Podkowinski  wrote:
> Hello,
>
> I'd like to invite you to take a look at the recently released first
> beta of Hadoop UI, a graphical Flex/Java based client for Hadoop Core.
> Hadoop UI currently includes an HDFS file explorer and basic job
> tracking features.
>
> Get it here:
> http://code.google.com/p/hadoop-ui/
>
> As this is the first release it may (and does) still contain bugs, but
> I'd like to give everyone the chance to send feedback as early as
> possible.
> Give it a try :)
>
> - Stefan
>


Re: Typical hardware configurations

2009-03-31 Thread Steve Loughran

Scott Carey wrote:

On 3/30/09 4:41 AM, "Steve Loughran"  wrote:


Ryan Rawson wrote:


You should also be getting 64-bit systems and running a 64 bit distro on it
and a jvm that has -d64 available.

For the namenode yes. For the others, you will take a fairly big memory
hit (1.5X object size) due to the longer pointers. JRockit has special
compressed pointers, so will JDK 7, apparently.



Sun Java 6 update 14 has "Ordinary Object Pointer" compression as well:
-XX:+UseCompressedOops.  I've been testing out the pre-release of that with
great success.


Nice. Have you tried Hadoop with it yet?



JRockit has virtually no 64 bit overhead up to 4GB, Sun Java 6u14 has small
overhead up to 32GB with the new compression scheme.  IBM's VM also has some
sort of pointer compression but I don't have experience with it myself.


I use the JRockit JVM as it is what our customers use and we need to
test on the same JVM. It is interesting in that recursive calls never
seem to run out of stack; the way it manages memory doesn't have separate
spaces for the stack, the permanent generation heap and the like.



That doesn't mean apps are light: a freshly started IDE consumes more 
physical memory than a VMWare image running XP and outlook. But it is 
fairly responsive, which is good for a UI:

2295m 650m  22m S2 10.9   0:43.80 java
855m 543m 530m S   11  9.1   4:40.40 vmware-vmx




http://wikis.sun.com/display/HotSpotInternals/CompressedOops
http://blog.juma.me.uk/tag/compressed-oops/
 
With pointer compression, there may be gains to be had with running 64 bit

JVMs smaller than 4GB on x86 since then the runtime has access to native 64
bit integer operations and registers (as well as 2x the register count).  It
will be highly use-case dependent.


That would certainly benefit atomic operations on longs; for floating
point math it would be less useful, as JVMs have long made use of the SSE
register set for FP work. 64 bit registers would make it easier to move
stuff in and out of those registers.


I will try and set up a hudson server with this update and see how well 
it behaves.
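
For anyone who wants to try the flag on Hadoop task JVMs, a minimal sketch
(mapred.child.java.opts is the standard property for task JVM options; the
heap size below is only an example):

import org.apache.hadoop.mapred.JobConf;

public class CompressedOopsJob {
    public static void main(String[] args) {
        JobConf conf = new JobConf(CompressedOopsJob.class);
        // Ask the child (task) JVMs to use compressed ordinary object pointers.
        // -Xmx is illustrative; tune it for your cluster and JVM version.
        conf.set("mapred.child.java.opts", "-Xmx1024m -XX:+UseCompressedOops");
        // ... configure mapper/reducer/paths and submit as usual ...
    }
}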


Re: ANN: Hadoop UI beta

2009-03-31 Thread vishal s. ghawate
hi,
It's really helpful in the sense that it lets me directly delete or rename
files on HDFS. But is there any problem with creating directories using the
mkdir option?
Also, can it show the history of tasktrackers?



Re: ANN: Hadoop UI beta

2009-03-31 Thread Stefan Podkowinski
On Tue, Mar 31, 2009 at 1:23 PM, Mikhail Yakshin
 wrote:
>
> Could you please explain what it does, or at least what you want it to
> do? Why is it better than the default Hadoop web UI?
>

Mikhail, we needed a full-featured HDFS file manager for end users
that could be distributed over the web. It's something we haven't found
out there (WebDAV or FUSE not being an option for us) and it may also be
useful for other Hadoop users. It's an offspring of a commercial
platform we're about to develop, along with the job tracker for
internal use and other modules, related and unrelated to Hadoop.

@W:
Thanks :) I haven't tried it with trunk. I'm going to create a custom
branch for 0.18 and trunk when there's time for it.

@vishal:
Job history is not implemented yet. I haven't yet figured out exactly
how to do this. It's on the list.

- Stefan


Re: ANN: Hadoop UI beta

2009-03-31 Thread Brian Bockelman

Hey Stefan,

I like it.  I would like to hear a bit about how the security policies
work.  If I open this up to "the world", how does "the world"
authenticate/authorize with my cluster?

I'd love nothing more than to be able to give my users a dead-simple way to
move files on and off the cluster.  This appears to be a step in the
right direction.

I'm not familiar with Adobe Flex -- how will this affect others'
abilities to use it (i.e., Linux & Mac folks?) and how will this
affect the ability to contribute (i.e., if you get a new job, are the
users of this project screwed?).  Gah, I sound like my boss.


Brian

On Mar 31, 2009, at 8:41 AM, Stefan Podkowinski wrote:


On Tue, Mar 31, 2009 at 1:23 PM, Mikhail Yakshin
 wrote:


Could you please explain what it does, or at least what you want it to
do? Why is it better than the default Hadoop web UI?



Mikhail, we needed a full-featured HDFS file manager for end users
that could be distributed over the web. It's something we haven't found
out there (WebDAV or FUSE not being an option for us) and it may also be
useful for other Hadoop users. It's an offspring of a commercial
platform we're about to develop, along with the job tracker for
internal use and other modules, related and unrelated to Hadoop.

@W:
Thanks :) I haven't tried it with trunk. I'm going to create a custom
branch for 0.18 and trunk when there's time for it.

@vishal:
Job history is not implemented yet. I haven't yet figured out exactly
how to do this. It's on the list.

- Stefan




Re: Problem: Some blocks remain under replicated

2009-03-31 Thread ilayaraja

We are using hadoop-0.15.
Let me explain the scenario:
 We have around 6 TB of data in our cluster on a couple of data
directories (/mnt, /mnt2) with a replication factor of 1. When we increased
the replication to 2 for the entire data set, we observed that /mnt was 100%
used while /mnt2 was underutilized. So we wanted to balance the utilization
of space across both data directories by changing the Hadoop code for the
getNextVolume(..) API. The new algorithm checks which data directory has
more space available and returns that as the volume for the block to be
written. This updated version of Hadoop was then used to set the
replication to 2 for the entire DFS. However, when the replication was
over, it reported that more than 100 GB of data blocks were missing and that
some blocks were under-replicated. We also observed many blocks of size
zero in the cluster; we do not know how these blocks were created.
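
The selection rule described above, as a minimal sketch (this is not the
actual hadoop-0.15 FSDataset code; the Volume interface here is a stand-in
for the real data-directory abstraction):

import java.util.List;

class MostFreeSpacePolicy {
    interface Volume {
        long getAvailable(); // stand-in accessor: free space on a data directory
    }

    // Pick the data directory with the most available space that can hold the block.
    Volume getNextVolume(List<Volume> volumes, long blockSize) {
        Volume best = null;
        for (Volume v : volumes) {
            if (v.getAvailable() < blockSize) {
                continue; // cannot hold this block
            }
            if (best == null || v.getAvailable() > best.getAvailable()) {
                best = v;
            }
        }
        return best; // null if no directory has room
    }
}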



- Original Message - 
From: "Hairong Kuang" 
To: "hadoop-dev" ; "ilayaraja" 
; "hadoop-user" 

Sent: Tuesday, March 31, 2009 3:30 AM
Subject: Re: Problem: Some blocks remain under replicated



Which version of HADOOP are you running? Your cluster might have hit
HADOOP-5465.

Hairong


On 3/29/09 10:24 PM, "ilayaraja"  wrote:


Hello !

I am trying to increase the replication factor of a directory in our hadoop
dfs from 1 to 2.
I observe that some of the blocks (12 out of 400) always remain under
replicated, throwing the following message when I do an 'fsck':

Under replicated blk_9084408236031628003. Target Replicas
is 2 but found 1 replica(s).

I thought it could be a problem with a specific data node in the cluster,
however I observe that the under-replicated blocks belong to different data
nodes.

Please give me your thoughts.

Thanks.
Ilay










Re: skip setting output path for a sequential MR job..

2009-03-31 Thread Owen O'Malley


On Mar 30, 2009, at 9:25 PM, some speed wrote:


So, is there a way I could delete a file on HDFS programmatically?
or can I skip setting an output file path and just have my output print to
console?
or can I just append to an existing file?


I wouldn't suggest using append yet. If you really just want side effects
from a job, you can use the NullOutputFormat, which just ignores the
output and throws it away.

If you want it to come back out to the launching program, you could
just print it to stderr in the task and set
JobClient.setTaskOutputFilter to SUCCEEDED, and the output will be
printed. (Don't try this at home on a real cluster, or your client
will be swamped!)
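
A minimal sketch of both suggestions with the old mapred API (the class and
job names are placeholders):

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.NullOutputFormat;

public class SideEffectJob {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(SideEffectJob.class);
        conf.setJobName("side-effect-only"); // placeholder name

        // Discard the job's output entirely; the job runs only for its side effects.
        conf.setOutputFormat(NullOutputFormat.class);

        // Have the submitting client print the task logs of successful tasks.
        // Fine for small experiments; on a real cluster it will swamp the client.
        JobClient.setTaskOutputFilter(conf, JobClient.TaskStatusFilter.SUCCEEDED);

        // ... set mapper, input path, etc. ...
        JobClient.runJob(conf);
    }
}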


-- Owen


Re: X not in slaves, yet still being used as a tasktracker

2009-03-31 Thread Saptarshi Guha
Aha, there isn't a tasktracker running on X, yet jps shows 4 children
and when jobs fail I see failures on X too.
So what exactly can I stop?
jps output on X
20044 org.apache.hadoop.mapred.JobTracker
12871 org.apache.hadoop.mapred.Child 127.0.0.1 51590
attempt_200903302220_0036_m_20_0 -541073721
12819 org.apache.hadoop.mapred.Child 127.0.0.1 51590
attempt_200903302220_0036_m_03_0 1185151254
19768 org.apache.hadoop.hdfs.server.namenode.NameNode
13004 org.apache.hadoop.mapred.Child 127.0.0.1 51590
attempt_200903302220_0036_r_10_0 1044080068
11396 org.saptarshiguha.rhipe.rhipeserver.RHIPEMain --tsp=mimosa:8200
--listen=4445 --quiet --no-save --max-nsize=1G --max-ppsize=10
19961 org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode
13131 sun.tools.jps.Jps -ml
22054
12931 org.apache.hadoop.mapred.Child 127.0.0.1 51590
attempt_200903302220_0036_m_32_0 -690673928
12962 org.apache.hadoop.mapred.Child 127.0.0.1 51590
attempt_200903302220_0036_m_33_0 203873369

Saptarshi Guha



On Mon, Mar 30, 2009 at 11:37 AM, Saptarshi Guha
 wrote:
> thanks
> Saptarshi Guha
>
>
>
> On Sun, Mar 29, 2009 at 7:42 PM, Bill Au  wrote:
>> The jobtracker does not have to be a tasktracker.  Just stop and don't start
>> the tasktracker process.
>> Bill
>>
>> On Sun, Mar 29, 2009 at 12:00 PM, Saptarshi Guha 
>> wrote:
>>>
>>> Hello,
>>> A machine X which is the master: it is the jobtracker, namenode and
>>> secondary namenode.
>>> It is not in the slaves file and is not part of the HDFS. However, in
>>> the mapreduce web page, I notice it is being used as a tasktracker.
>>> Is the jobtracker always a tasktracker? I'd rather not place too much
>>> load on the one machine which plays so many roles.
>>> (Hadoop 0.19.0)
>>>
>>> Thanks
>>> Saptarshi
>>
>>
>


Re: ANN: Hadoop UI beta

2009-03-31 Thread Stefan Podkowinski
Hi Brian

On Tue, Mar 31, 2009 at 3:46 PM, Brian Bockelman  wrote:
> Hey Stefan,
>
> I like it.  I would like to hear a bit about how the security policies work.  If I
> open this up to "the world", how does "the world" authenticate/authorize
> with my cluster?

Not at all. The daemon part of Hadoop UI runs under a
configurable user and will issue calls to Hadoop on behalf of this
user. It's not much different from the standard web UI in this respect.
The plan is to introduce an authentication layer in one of the next
releases. It will be based on Spring Security and thus enables you to
use many different authentication providers. So downloading all those
Spring libraries along with the project will finally pay off ;)

> I'd love nothing more than to be able to give my users a dead-simple way to
> move files on and off the cluster.  This appears to be a step in the right
> direction.
>
> I'm not familiar with Adobe Flex -- how will this affect others' abilities
> to use it (i.e., Linux & Mac folks?) and how will this affect the ability to
> contribute (i.e., if you get a new job, are the users of this project
> screwed?).  Gah, I sound like my boss.

There's nothing arcane about Flex, but please don't tell anybody. You
can get the recently open-sourced (MPL) SDK for any platform
supporting Java and compile Hadoop UI using Ant. Other libraries used
are flexlib (MIT license), Spring (Apache License) and BlazeDS (LGPL). In
case I would have to look for a new job, as you suggest, other people
would be able to fork as long as they know some ActionScript, the
actual language used in Flex, and some XML.

- Stefan


Re: swap hard drives between datanodes

2009-03-31 Thread Ian Soboroff

Or if you have a node blow a motherboard but the disks are fine...

Ian

On Mar 30, 2009, at 10:03 PM, Mike Andrews wrote:


I tried swapping two hot-swap SATA drives between two nodes in a
cluster, but it didn't work: after restart, one of the datanodes shut
down because the namenode said it reported a block belonging to another
node, which I guess the namenode treats as a fatal error. Is this caused
by the hadoop/datanode/current/VERSION file having the IP address and
other ID information of the datanode hard-coded? It'd be great to be
able to do a manual gross cluster rebalance by just physically
swapping hard drives, but it seems like this is not possible in the
current version, 0.18.3.

--
permanent contact information at http://mikerandrews.com




One quick question

2009-03-31 Thread Sameer Tilak
Hi All,

I wrote my WordCount program using NetBeans (on both the Linux and OS X
platforms). When I try running the program from the dist directory (where
NetBeans outputs the jar file after compilation) I get the following error:

 ~/software/hadoop-0.19.1/bin/hadoop jar WCount.jar wcount.WCount
~/software/hadoop-0.19.1/input/ ~/software/hadoop-0.19.1/output
shell-init: error retrieving current directory: getcwd: cannot access parent
directories: No such file or directory
chdir: error retrieving current directory: getcwd: cannot access parent
directories: No such file or directory
chdir: error retrieving current directory: getcwd: cannot access parent
directories: No such file or directory
Error occurred during initialization of VM
java.lang.Error: Properties init: Could not determine current working
directory.
at java.lang.System.initProperties(Native Method)
at java.lang.System.initializeSystemClass(System.java:1072)



However, if I go to my Hadoop install directory, I get the following error:

 bin/hadoop jar WCount.jar wordcount.WCount input/ output/
java.lang.ClassNotFoundException: wcount.Main
at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
at java.lang.ClassLoader.loadClass(ClassLoader.java:252)
at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:247)
at org.apache.hadoop.util.RunJar.main(RunJar.java:158)
at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)


Any help on this would be great!


Please help!

2009-03-31 Thread Hadooper
Dear developers,

Is there any detailed example of how Hadoop processes input?
Article http://hadoop.apache.org/core/docs/r0.19.1/mapred_tutorial.html gives
a good idea, but I want to see input data being passed from class to
class, and how each class manipulates data. The purpose is to analyze the
time and space complexity of Hadoop as a generalized computational
model/algorithm. I tried to search the web and could not find more detail.
Any pointer/hint?
Thanks a million.

-- 
Cheers! Hadoop core


Please help!

2009-03-31 Thread Hadooper
Dear developers,

Is there any detailed example of how Hadoop processes input?
Article http://hadoop.apache.org/core/docs/r0.19.1/mapred_tutorial.html gives
a good idea, but I want to see input data being passed from class to
class, and how each class manipulates data. The purpose is to analyze the
time and space complexity of Hadoop as a generalized computational
model/algorithm. I tried to search the web and could not find more detail.
Any pointer/hint?
Thanks a million.

-- 
Cheers! Hadoop core


Please help

2009-03-31 Thread Hadooper
Dear developers,

Is there any detailed example of how Hadoop processes input?
Article http://hadoop.apache.org/core/docs/r0.19.1/mapred_tutorial.html gives
a good idea, but I want to see input data being passed from class to
class, and how each class manipulates data. The purpose is to analyze the
time and space complexity of Hadoop as a generalized computational
model/algorithm. I tried to search the web and could not find more detail.
Any pointer/hint?
Thanks a million.

-- 
Cheers! Hadoop core


Please help

2009-03-31 Thread Hadooper
Dear developers,

Is there any detailed example of how Hadoop processes input?
Article http://hadoop.apache.org/core/docs/r0.19.1/mapred_tutorial.html gives
a good idea, but I want to see input data being passed from class to
class, and how each class manipulates data. The purpose is to analyze the
time and space complexity of Hadoop as a generalized computational
model/algorithm. I tried to search the web and could not find more detail.
Any pointer/hint?
Thanks a million.

-- 
Cheers! Hadoop core


Re: Please help!

2009-03-31 Thread Jim Twensky
See the original Map Reduce paper by Google at
http://labs.google.com/papers/mapreduce.html and please don't spam the list.

-jim

On Tue, Mar 31, 2009 at 6:15 PM, Hadooper wrote:

> Dear developers,
>
> Is there any detailed example of how Hadoop processes input?
> Article
> http://hadoop.apache.org/core/docs/r0.19.1/mapred_tutorial.html gives
> a good idea, but I want to see input data being passed from class to
> class, and how each class manipulates data. The purpose is to analyze the
> time and space complexity of Hadoop as a generalized computational
> model/algorithm. I tried to search the web and could not find more detail.
> Any pointer/hint?
> Thanks a million.
>
> --
> Cheers! Hadoop core
>


Re: Please help!

2009-03-31 Thread Hadooper
Thanks, Jim.
I am very familiar with Google's original publication.

On Tue, Mar 31, 2009 at 4:31 PM, Jim Twensky  wrote:

> See the original Map Reduce paper by Google at
> http://labs.google.com/papers/mapreduce.html and please don't spam the
> list.
>
> -jim
>
> On Tue, Mar 31, 2009 at 6:15 PM, Hadooper  >wrote:
>
> > Dear developers,
> >
> > Is there any detailed example of how Hadoop processes input?
> > Article
> > http://hadoop.apache.org/core/docs/r0.19.1/mapred_tutorial.html gives
> > a good idea, but I want to see input data being passed from class to
> > class, and how each class manipulates data. The purpose is to analyze the
> > time and space complexity of Hadoop as a generalized computational
> > model/algorithm. I tried to search the web and could not find more
> detail.
> > Any pointer/hint?
> > Thanks a million.
> >
> > --
> > Cheers! Hadoop core
> >
>



-- 
Cheers! Hadoop core


Re: swap hard drives between datanodes

2009-03-31 Thread Raghu Angadi


IP address mismatch should not matter. What is the actual error you saw?
The mismatch might be unintentional.


Raghu.

Mike Andrews wrote:

I tried swapping two hot-swap SATA drives between two nodes in a
cluster, but it didn't work: after restart, one of the datanodes shut
down because the namenode said it reported a block belonging to another
node, which I guess the namenode treats as a fatal error. Is this caused
by the hadoop/datanode/current/VERSION file having the IP address and
other ID information of the datanode hard-coded? It'd be great to be
able to do a manual gross cluster rebalance by just physically
swapping hard drives, but it seems like this is not possible in the
current version, 0.18.3.





Re: swap hard drives between datanodes

2009-03-31 Thread Raghu Angadi

Raghu Angadi wrote:


IP address mismatch should not matter. What is the actual error you saw?
The mismatch might be unintentional.


The reason I say IP address should not matter is that if you change the
IP address of a datanode, it should still work correctly.


Raghu.


Raghu.

Mike Andrews wrote:

I tried swapping two hot-swap SATA drives between two nodes in a
cluster, but it didn't work: after restart, one of the datanodes shut
down because the namenode said it reported a block belonging to another
node, which I guess the namenode treats as a fatal error. Is this caused
by the hadoop/datanode/current/VERSION file having the IP address and
other ID information of the datanode hard-coded? It'd be great to be
able to do a manual gross cluster rebalance by just physically
swapping hard drives, but it seems like this is not possible in the
current version, 0.18.3.







Re: Please help

2009-03-31 Thread Amandeep Khurana
Have you read the Map Reduce paper? You might be able to find some pointers
there for your analysis.



Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz


On Tue, Mar 31, 2009 at 4:28 PM, Hadooper wrote:

> Dear developers,
>
> Is there any detailed example of how Hadoop processes input?
> Article
> http://hadoop.apache.org/core/docs/r0.19.1/mapred_tutorial.html gives
> a good idea, but I want to see input data being passed from class to
> class, and how each class manipulates data. The purpose is to analyze the
> time and space complexity of Hadoop as a generalized computational
> model/algorithm. I tried to search the web and could not find more detail.
> Any pointer/hint?
> Thanks a million.
>
> --
> Cheers! Hadoop core
>


A bizarre problem in reduce method

2009-03-31 Thread Farhan Husain
Hello All,

I am facing some problems with a reduce method I have written which I cannot
understand. Here is the method:

@Override
public void reduce(Text key, Iterator<Text> values,
        OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {
    String sValues = "";
    int iCount = 0;
    String sValue;
    while (values.hasNext()) {
        sValue = values.next().toString();
        iCount++;
        sValues += "\t" + sValue;
    }
    sValues += "\t" + iCount;
    //if (iCount == 2)
    output.collect(key, new Text(sValues));
}

The output of the code is like the following:

D0U0:GraduateStudent0lehigh:GraduateStudent111
D0U0:GraduateStudent1lehigh:GraduateStudent111
D0U0:GraduateStudent10lehigh:GraduateStudent111
D0U0:GraduateStudent100lehigh:GraduateStudent111
D0U0:GraduateStudent101lehigh:GraduateStudent1
D0U0:GraduateCourse0121
D0U0:GraduateStudent102lehigh:GraduateStudent111
D0U0:GraduateStudent103lehigh:GraduateStudent111
D0U0:GraduateStudent104lehigh:GraduateStudent111
D0U0:GraduateStudent105lehigh:GraduateStudent111

The problem is there cannot be so many 1's in the output value. The output
which I expect should be like this:

D0U0:GraduateStudent0lehigh:GraduateStudent1
D0U0:GraduateStudent1lehigh:GraduateStudent1
D0U0:GraduateStudent10lehigh:GraduateStudent1
D0U0:GraduateStudent100lehigh:GraduateStudent1
D0U0:GraduateStudent101lehigh:GraduateStudent
D0U0:GraduateCourse02
D0U0:GraduateStudent102lehigh:GraduateStudent1
D0U0:GraduateStudent103lehigh:GraduateStudent1
D0U0:GraduateStudent104lehigh:GraduateStudent1
D0U0:GraduateStudent105lehigh:GraduateStudent1

If I do not append the iCount variable to the sValues string, I get the
following output:

D0U0:GraduateStudent0lehigh:GraduateStudent
D0U0:GraduateStudent1lehigh:GraduateStudent
D0U0:GraduateStudent10lehigh:GraduateStudent
D0U0:GraduateStudent100lehigh:GraduateStudent
D0U0:GraduateStudent101lehigh:GraduateStudent
D0U0:GraduateCourse0
D0U0:GraduateStudent102lehigh:GraduateStudent
D0U0:GraduateStudent103lehigh:GraduateStudent
D0U0:GraduateStudent104lehigh:GraduateStudent
D0U0:GraduateStudent105lehigh:GraduateStudent

This confirms that there are no 1's after each of those values (which I
already know from the input data). I do not know why the output is
distorted like that when I append iCount to sValues (as in the given
code). Can anyone help in this regard?

Now comes the second problem which is equally perplexing. Actually, the
reduce method which I want to run is like the following:

@Override
public void reduce(Text key, Iterator<Text> values,
        OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {
    String sValues = "";
    int iCount = 0;
    String sValue;
    while (values.hasNext()) {
        sValue = values.next().toString();
        iCount++;
        sValues += "\t" + sValue;
    }
    sValues += "\t" + iCount;
    if (iCount == 2)
        output.collect(key, new Text(sValues));
}

I want to produce output only if "values" contains exactly two elements. By
looking at the output above you can see that there is at least one such
key-value pair where values has exactly two elements. But when I run the code
I get an empty output file. Can anyone solve this?

I have tried many versions of the code (e.g. using StringBuffer instead of
String, using flags instead of integer count) but nothing works. Are these
problems due to bugs in Hadoop? Please let me know any kind of solution you
can think of.

Thanks,

-- 
Mohammad Farhan Husain
Research Assistant
Department of Computer Science
Erik Jonsson School of Engineering and Computer Science
University of Texas at Dallas


RE: Please help!

2009-03-31 Thread Ricky Ho
I wrote a blog post about Hadoop's implementation a couple of months back,
here:
http://horicky.blogspot.com/2008/11/hadoop-mapreduce-implementation.html

Note that Hadoop is not about reducing latency.  It is about increasing 
throughput (not throughput per resource) by adding more machines in case your 
problem is "data parallel".

Time-wise:
If it takes T seconds to process B amount of data, then by using Hadoop with N 
machines, you can process it within cT/N seconds where constant c > 1 accounts 
for the overhead.

Space-wise:
If it takes M amount of memory during the processing, then by using Hadoop with 
N machines, you need M/N + c

Bandwidth-wise:
You definitely need more bandwidth because a distributed file system is used.  
And it also depends on your read / write ratio and how many ways of 
replication.  ... Need more time to think of the formula...
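
To make the time formula concrete with made-up numbers: if one machine needs
T = 1000 seconds and the overhead constant is c = 1.2, then N = 10 machines
would take roughly cT/N = 1.2 * 1000 / 10 = 120 seconds.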

Rgds,
Ricky

-Original Message-
From: Hadooper [mailto:kusanagiyang.had...@gmail.com] 
Sent: Tuesday, March 31, 2009 3:35 PM
To: core-user@hadoop.apache.org
Subject: Re: Please help!

Thanks, Jim.
I am very familiar with Google's original publication.

On Tue, Mar 31, 2009 at 4:31 PM, Jim Twensky  wrote:

> See the original Map Reduce paper by Google at
> http://labs.google.com/papers/mapreduce.html and please don't spam the
> list.
>
> -jim
>
> On Tue, Mar 31, 2009 at 6:15 PM, Hadooper  >wrote:
>
> > Dear developers,
> >
> > Is there any detailed example of how Hadoop processes input?
> > Article
> > http://hadoop.apache.org/core/docs/r0.19.1/mapred_tutorial.html gives
> > a good idea, but I want to see input data being passed from class to
> > class, and how each class manipulates data. The purpose is to analyze the
> > time and space complexity of Hadoop as a generalized computational
> > model/algorithm. I tried to search the web and could not find more
> detail.
> > Any pointer/hint?
> > Thanks a million.
> >
> > --
> > Cheers! Hadoop core
> >
>



-- 
Cheers! Hadoop core