Re: Confusing NameNodeFailover page in Hadoop Wiki

2008-08-07 Thread Steve Loughran

Doug Cutting wrote:

Konstantin Shvachko wrote:

Imho we either need to correct it or remove.


+1

Doug


I added some pages there on namenode/jobtracker, etc., linking to the 
failover doc, which I didn't compare to the SVN docs to see what was 
correct. Perhaps the failover page could be set up to say you can do 
some things here, and point to the full docs in SVN or on the Hadoop site.


--
Steve Loughran  http://www.1060.org/blogxter/publish/5
Author: Ant in Action   http://antbook.org/


Re: DFS. How to read from a specific datanode

2008-08-07 Thread Steve Loughran

Kevin wrote:

Thank you for the suggestion. I looked at DFSClient. It appears that
the chooseDataNode method decides which data node to connect to. Currently
it chooses the first non-dead data node returned by the namenode, which
has sorted the nodes by proximity to the client. However,
chooseDataNode is private, so overriding it seems infeasible. Nor
are the callers of chooseDataNode public or protected.

I need this because I do not want to trust the namenode's ordering. For
applications where network congestion is rare, we should let the
client decide which data node to load from.



Dangerous. What happens when network congestion arrives and the apps are 
out there? Maybe it should be negotiated: the namenode provides an ordered 
list and the client can choose from it based on its own measurements. If 
the namenode provides only one, that's the one you get to use.
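
For anyone who wants to at least see the ordering the namenode hands back, the
block placement is visible from the client side. Below is a minimal sketch,
assuming a release that exposes FileSystem#getFileBlockLocations (newer builds;
I believe older releases surface similar information through the deprecated
getFileCacheHints call). Note that this only lets you inspect where the
replicas live and measure them yourself; as discussed above, the stock
DFSClient still picks the datanode on its own.

import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationDump {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path file = new Path(args[0]);
    FileStatus stat = fs.getFileStatus(file);

    // One BlockLocation per block, each listing the datanodes that hold a
    // replica, in the order the namenode returned them.
    BlockLocation[] blocks = fs.getFileBlockLocations(stat, 0, stat.getLen());
    for (int i = 0; i < blocks.length; i++) {
      System.out.println("block " + i
          + " offset=" + blocks[i].getOffset()
          + " hosts=" + Arrays.toString(blocks[i].getHosts()));
    }
  }
}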


Re: Configuration: I need help.

2008-08-07 Thread Steve Loughran

Allen Wittenauer wrote:

On 8/6/08 11:52 AM, Otis Gospodnetic [EMAIL PROTECTED] wrote:

You can put the same hadoop-site.xml on all machines.  Yes, you do want a
secondary NN - a single NN is a SPOF.  Browse the archives a few days back to
find an email from Paul about DRBD (disk replication) to avoid this SPOF.


Keep in mind that even with a secondary name node, you still have a
SPOF.  If the NameNode process dies, so does your HDFS. 



There's always a SPOF; it just moves. Sometimes it moves out of your own 
infrastructure, and then you have big problems :)


Re: fuse-dfs

2008-08-07 Thread Sebastian Vieira
Thanks. After a lot of experimenting (and of course, right before you sent
this reply) I figured it out. I also had to include the path to libhdfs.so
in my ld.so.conf and update it before I was able to successfully compile
fuse_dfs. However, when I try to mount the HDFS, it fails. I have tried both
the wrapper script and the single binary. Both display the following error:

fuse-dfs didn't recognize /mnt/hadoop,-2
fuse-dfs ignoring option -d

regards,

Sebastian

On Wed, Aug 6, 2008 at 5:29 PM, Pete Wyckoff [EMAIL PROTECTED] wrote:


 Sorry - I see the problem now: should be:

 ant compile-contrib -Dlibhdfs=1 -Dfusedfs=1

 compile-contrib depends on compile-libhdfs, which also requires the
 -Dlibhdfs=1 property to be set.

 pete


 On 8/6/08 5:04 AM, Sebastian Vieira [EMAIL PROTECTED] wrote:

  Hi,
 
  I have installed Hadoop on 20 nodes (data storage) and one master (namenode)
  to which I want to add data. I have learned that this is possible through
  a Java API or via the Hadoop shell. However, I would like to mount the HDFS
  using FUSE, and I discovered that there's a contrib/fuse-dfs within the
  Hadoop tar.gz package. Now I read the README file and noticed that I was
  unable to compile using this command:
 
  ant compile-contrib -Dcompile.c++=1 -Dfusedfs=1
 
  If I change the line to:
 
  ant compile-contrib -Dcompile.c++=1 -Dlibhdfs-fuse=1
 
  It goes a little bit further. It will now start the configure script, but
  still fails. I've tried a lot of different things but I'm unable to compile
  fuse-dfs. This is a piece of the error I get from ant:
 
  compile:
   [echo] contrib: fuse-dfs
  -snip-
   [exec] Making all in src
   [exec] make[1]: Entering directory
  `/usr/local/src/hadoop-core-trunk/src/contrib/fuse-dfs/src'
   [exec] gcc  -Wall -O3
 -L/usr/local/src/hadoop-core-trunk/build/libhdfs
  -lhdfs -L/usr/lib -lfuse -L/usr/java/jdk1.6.0_07/jre/lib/i386/server
 -ljvm
  -o fuse_dfs  fuse_dfs.o
   [exec] /usr/bin/ld: cannot find -lhdfs
   [exec] collect2: ld returned 1 exit status
   [exec] make[1]: *** [fuse_dfs] Error 1
   [exec] make[1]: Leaving directory
  `/usr/local/src/hadoop-core-trunk/src/contrib/fuse-dfs/src'
   [exec] make: *** [all-recursive] Error 1
 
  BUILD FAILED
  /usr/local/src/hadoop-core-trunk/build.xml:413: The following error
 occurred
  while executing this line:
  /usr/local/src/hadoop-core-trunk/src/contrib/build.xml:30: The following
  error occurred while executing this line:
  /usr/local/src/hadoop-core-trunk/src/contrib/fuse-dfs/build.xml:40: exec
  returned: 2
 
 
  Could somebody shed some light on this?
 
 
  thanks,
 
  Sebastian.




Re: How to run hadoop without DNS server?

2008-08-07 Thread Torsten Curdt
While configuring and using the Hadoop framework, it seems that a DNS server
must be used for hostname resolution (even if I configure the IP address
rather than the hostname in the config/slaves and config/masters files).
Because we don't have a local DNS server on our local Ethernet, I have to add
the hostname-to-IP mappings to the /etc/hosts file.


Yeah ...annoying isn't it? :)


I have two questions about the hostname configuration:
  1) Can we do some configuration in Hadoop to avoid hostname
resolution and use IP addresses directly?


We tried, failed, and gave up. That said, that was quite some time ago 
(0.13?).


I know some fixes went in but...

  2) If I add a new machine to the cluster, it seems that I have to add
the new machine's hostname or IP address to each node's config/slaves file.
If the cluster is very large, this could be impossible to maintain.
Is there any simple way to add a node dynamically without modifying all
the other cluster nodes?


Good question! Would love to see a somewhat more dynamic discovery as 
well.


That said, for a big cluster you will probably have central 
configuration management anyway. So for us it's just a matter of changing 
one file, and Puppet will roll it out to the nodes.


cheers
--
Torsten


Why is scaling HBase much simpler than scaling a relational db?

2008-08-07 Thread Mork0075

Hello,

can someone please explain or point me to some documentation or 
papers where I can read well-proven facts about why scaling a relational db 
is so hard and scaling a document-oriented db isn't?


So perhaps if I got lots of requests to my relational db, I would 
duplicate it to several servers and partition the requests. So why doesn't 
this scale, and why can HBase, for instance, manage it?


I'm really new to this topic and would like to dive in deeper.

Thanks a lot


Re: Why is scaling HBase much simpler than scaling a relational db?

2008-08-07 Thread Steve Loughran

Mork0075 wrote:

Hello,

can someone please explain or point me to some documentation or 
papers where I can read well-proven facts about why scaling a relational db 
is so hard and scaling a document-oriented db isn't?




http://labs.google.com/papers/bigtable.html


Relational DBs are great for having lots of structured data, where you 
can run SELECT operations, do O/R mapping to make rows look like 
objects, etc. It's one thing to back up, and you get transactions. 
They're bad places to store binary data, or, say, billions and billions 
of rows of web server log data.


By relaxing some of the expectations of a relational DB, things like 
BigTable, HBase and others can scale well; but as they have relaxed the 
rules, they may not do everything you want.



So perhaps if I got lots of requests to my relational db, I would 
duplicate it to several servers and partition the requests. So why doesn't 
this scale, and why can HBase, for instance, manage it?


That's called sharding/horizontal partitioning.

It works well if you can partition all your data so that different users 
go to different places, though once you've done that, you can't think 
about JOIN-ing stuff from multiple machines.


The alternative option (which is apparently common in places like 
MySpace and IMDb) is to have one r/w master and a number of read-only 
slaves. All changes go into the master; the slaves pick up the changes later.





I'm really new to this topic and would like to dive in deeper.


check out the articles in http://highscalability.com/

-steve


Re: DFS. How to read from a specific datanode

2008-08-07 Thread Kevin
Yes, I agree with you that it should be negotiated. That is, the namenode
provides an ordered list and the client can choose some based on its
own measurements. But I am afraid 0.17.1 does not provide an easy
interface for this.


-Kevin



On Thu, Aug 7, 2008 at 3:40 AM, Steve Loughran [EMAIL PROTECTED] wrote:
 Kevin wrote:

 Thank you for the suggestion. I looked at DFSClient. It appears that
 the chooseDataNode method decides which data node to connect to. Currently
 it chooses the first non-dead data node returned by the namenode, which
 has sorted the nodes by proximity to the client. However,
 chooseDataNode is private, so overriding it seems infeasible. Nor
 are the callers of chooseDataNode public or protected.

 I need this because I do not want to trust the namenode's ordering. For
 applications where network congestion is rare, we should let the
 client decide which data node to load from.


 Dangerous. What happens when network congestion arrives and the apps are out
 there? Maybe it should be negotiated: the namenode provides an ordered list and
 the client can choose from it based on its own measurements. If the namenode
 provides only one, that's the one you get to use.



Re: hdfs question

2008-08-07 Thread Pete Wyckoff

One way to get all Unix commands to work as-is is to mount HDFS as a normal
Unix filesystem with either fuse-dfs (in contrib) or hdfs-fuse (on Google
Code).

Pete

On 8/6/08 5:08 PM, Mori Bellamy [EMAIL PROTECTED] wrote:

 Hey all,
 Often I find it would be convenient to run conventional Unix
 commands on HDFS, such as using the following to delete the contents
 of my HDFS:
 hadoop dfs -rm *
 
 or moving files from one folder to another:
 hadoop dfs -mv /path/one/* path/two/
 
 Does anyone know of a way to do this?
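
If mounting is not an option, rough equivalents of those shell commands can
also be scripted against the Java FileSystem API. A sketch under the usual
caveats: the paths here are placeholders, and the exact listStatus/delete
signatures shifted a little across the 0.16-0.18 releases.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsShellOps {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // picks up hadoop-site.xml from the classpath
    FileSystem fs = FileSystem.get(conf);

    // Rough equivalent of "hadoop dfs -rm /some/dir/*"
    for (FileStatus stat : fs.listStatus(new Path("/some/dir"))) {
      fs.delete(stat.getPath(), true);          // true = recursive
    }

    // Rough equivalent of "hadoop dfs -mv /path/one/* /path/two/"
    for (FileStatus stat : fs.listStatus(new Path("/path/one"))) {
      fs.rename(stat.getPath(), new Path("/path/two", stat.getPath().getName()));
    }
  }
}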



extracting input to a task from a (streaming) job?

2008-08-07 Thread John Heidemann

I have a large Hadoop streaming job that generally works fine,
but a few (2-4) of the ~3000 maps and reduces have problems.
To make matters worse, the problems are system-dependent (we run on a
cluster with machines of slightly different OS versions).
I'd of course like to debug these problems, but they are embedded in a
large job.

Is there a way to extract the input given to a reducer from a job, given
the task identity?  (This would also be helpful for mappers.)

This is clearly technically *possible*, since Hadoop can rerun the tasks
if they fail.  But is there an external program that actually does it?
Or are there instructions for poking around on the compute nodes' local
disks to assemble it by hand?  Or better suggestions?

It would be a real boon for people developing map and reduce user code.

Thanks for any pointers.
   -John Heidemann


Re: hadoop question

2008-08-07 Thread Khanh Nguyen
Can you also post your hadoop-site.xml and hadoop-default.xml?

-k

On Thu, Aug 7, 2008 at 3:52 AM, Mr.Thien [EMAIL PROTECTED] wrote:
 Hi everyone,

 I am trying to use hadoop.

 I set up my computer (thientd-desktop) as master (jobtracker and
 namenode). Two other computers: trunght-desktop and quanglt-desktop as
 slave.

 When I execute the example below, the map operation seems to be OK
 (it always succeeds, and quickly).
 However, the reduce operation always fails at 16.66%.
 If I run Hadoop on only one computer, it runs fine.

 Below is the screen output when the problem arises.

 /
 [EMAIL PROTECTED]:~/projects/hadoop-0.17.1$ bin/hadoop jar
 hadoop-0.17.1-examples.jar wordcount gutenberg thien-out
 08/08/07 14:17:15 INFO mapred.FileInputFormat: Total input paths to
 process : 1
 08/08/07 14:17:15 INFO mapred.JobClient: Running job:
 job_200808071415_0001
 08/08/07 14:17:16 INFO mapred.JobClient:  map 0% reduce 0%
 08/08/07 14:17:23 INFO mapred.JobClient:  map 50% reduce 0%
 08/08/07 14:17:24 INFO mapred.JobClient:  map 100% reduce 0%
 08/08/07 14:17:28 INFO mapred.JobClient:  map 100% reduce 16%
 08/08/07 14:25:34 INFO mapred.JobClient: Task Id :
 task_200808071415_0001_m_01_0, Status : FAILED
 Too many fetch-failures
 08/08/07 14:25:34 WARN mapred.JobClient: Error reading task
 outputquanglt-desktop
 08/08/07 14:25:34 WARN mapred.JobClient: Error reading task
 outputquanglt-desktop
 //

 Could anyone tell me the possible reason for the error?
 Thanks in advance.

 thientd.




Re: extracting input to a task from a (streaming) job?

2008-08-07 Thread Leon Mergen
Hello John,

On Thu, Aug 7, 2008 at 6:30 PM, John Heidemann [EMAIL PROTECTED] wrote:


 I have a large Hadoop streaming job that generally works fine,
 but a few (2-4) of the ~3000 maps and reduces have problems.
 To make matters worse, the problems are system-dependent (we run on a
 cluster with machines of slightly different OS versions).
 I'd of course like to debug these problems, but they are embedded in a
 large job.

 Is there a way to extract the input given to a reducer from a job, given
 the task identity?  (This would also be helpful for mappers.)


I believe you should set keep.failed.tasks.files to true -- this way, given
a task id, you can see what input files it has in
~/taskTracker/${taskid}/work (source:
http://hadoop.apache.org/core/docs/r0.17.0/mapred_tutorial.html#IsolationRunner
)
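
For reference, a minimal job driver that sets the flag programmatically might
look like the sketch below. The class name is made up for illustration, and
the property key is spelled as in the tutorial linked above; some releases
spell it keep.failed.task.files, so check hadoop-default.xml for the exact
name in your version.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class KeepFailedFilesDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(KeepFailedFilesDriver.class);
    conf.setJobName("keep-failed-files-demo");

    // Keep the local files of failed task attempts on the tasktrackers so the
    // attempt can later be replayed with IsolationRunner.  The key is spelled
    // "keep.failed.tasks.files" in the 0.17 tutorial; some releases use
    // "keep.failed.task.files" -- check hadoop-default.xml for your version.
    conf.setBoolean("keep.failed.tasks.files", true);

    // Mapper/reducer setup omitted; the identity defaults are used here.
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}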

On top of that, you can always use the debugging facilities:

http://hadoop.apache.org/core/docs/r0.17.0/mapred_tutorial.html#Debugging

When map/reduce task fails, user can run script for doing post-processing
on task logs i.e task's stdout, stderr, syslog and jobconf. The stdout and
stderr of the user-provided debug script are printed on the diagnostics. 

I hope this helps.

Regards,

Leon Mergen


Re: reduce job did not complete in a long time

2008-08-07 Thread Karl Anderson


On 28-Jul-08, at 6:33 PM, charles du wrote:


Hi:

I tried to run one of my map/reduce jobs on a cluster (hadoop 0.17.0).
I used 10 reducers. 9 of them return quickly (in a few seconds), but
one has been running for several hours, and still no sign of
completion. Do you know how I can debug it or find out what is going
on with this reducer?


You can log, and set the status message.  If you're using streaming, I  
think you're limited to writing to stderr.  The only way I've found to  
read the logs on a distributed run is by sshing to the actual task box  
and looking at the log directory.  I've almost gotten frustrated  
enough to have my tasks send email, but not quite.


Debugging is easier on a single pseudo-distributed box because all the 
logs and stderr are right there, so try that if you can.
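
For Java (non-streaming) jobs, the status message Karl mentions is set through
the Reporter passed to each reduce call, and counters show up in the web UI the
same way; a small sketch follows (the class name is illustrative). For
streaming, I believe later releases also recognize lines of the form
reporter:status:message written to stderr, but check the streaming docs for
your version.

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class ProgressReportingReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  enum Progress { KEYS_SEEN }

  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    // Both of these show up in the jobtracker web UI, which makes a stuck
    // or slow reducer much easier to spot than digging through task logs.
    reporter.setStatus("last key: " + key);
    reporter.incrCounter(Progress.KEYS_SEEN, 1);
    output.collect(key, new IntWritable(sum));
  }
}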


Re: reduce job did not complete in a long time

2008-08-07 Thread Miles Osborne
You should use the web UI -- each mapper/reducer can be inspected and there
is no need to ssh in.

Miles

2008/8/7 Karl Anderson [EMAIL PROTECTED]


 On 28-Jul-08, at 6:33 PM, charles du wrote:

  Hi:

 I tried to run one of my map/reduce jobs on a cluster (hadoop 0.17.0).
 I used 10 reducers. 9 of them return quickly (in a few seconds), but
 one has been running for several hours, and still no sign of
 completion. Do you know how I can debug it or find out what is going
 on with this reducer?


 You can log, and set the status message.  If you're using streaming, I
 think you're limited to writing to stderr.  The only way I've found to read
 the logs on a distributed run is by sshing to the actual task box and
 looking at the log directory.  I've almost gotten frustrated enough to have
 my tasks send email, but not quite.

 Debugging is easier on a single pseudo-distributed box because all the logs
 and stderr are right there, so try that if you can.




-- 
The University of Edinburgh is a charitable body, registered in Scotland,
with registration number SC005336.


Join example

2008-08-07 Thread John DeTreville
Hadoop ships with a few example programs. One of these is "join", which
I believe demonstrates map-side joins. I'm finding its usage
instructions a little impenetrable; could anyone send me instructions
that are more like "type this, then type this, then type this"?

Thanks in advance.

Cheers,
John


Re: fuse-dfs

2008-08-07 Thread Sebastian Vieira
On Thu, Aug 7, 2008 at 4:25 PM, Pete Wyckoff [EMAIL PROTECTED] wrote:


 Hi Sebastian,

 Those 2 things are just warnings and shouldn't cause any problems.  What
 happens when you ls /mnt/hadoop ?


[EMAIL PROTECTED] fuse-dfs]# ls /mnt/hadoop
ls: /mnt/hadoop: Transport endpoint is not connected

Also, this happens when I start fuse-dfs in one terminal and do a df -h in
another:

[EMAIL PROTECTED] fuse-dfs]# ./fuse_dfs_wrapper.sh dfs://master:9000 /mnt/hadoop
-d
port=9000,server=master
fuse-dfs didn't recognize /mnt/hadoop,-2
fuse-dfs ignoring option -d
unique: 1, opcode: INIT (26), nodeid: 0, insize: 56
INIT: 7.8
flags=0x0003
max_readahead=0x0002
   INIT: 7.8
   flags=0x0001
   max_readahead=0x0002
   max_write=0x0010
   unique: 1, error: 0 (Success), outsize: 40
unique: 2, opcode: STATFS (17), nodeid: 1, insize: 40

-- now I do a df -h in the other terminal --

Exception in thread main java.lang.NoClassDefFoundError:
org/apache/hadoop/conf/Configuration
Caused by: java.lang.ClassNotFoundException:
org.apache.hadoop.conf.Configuration
at java.net.URLClassLoader$1.run(Unknown Source)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
at java.lang.ClassLoader.loadClassInternal(Unknown Source)

Then the output from df is:

df: `/mnt/hadoop': Software caused connection abort



  And also what version of fuse-dfs are you
 using? The handling of options is different in trunk than in the last
 release.


[EMAIL PROTECTED] fuse-dfs]# ./fuse_dfs --version
./fuse_dfs 0.1.0

I did a checkout of the latest svn and compiled using the command you gave
in one of your previous mails.



 You can also look in /var/log/messages.


Only one line:
Aug  7 20:21:05 master fuse_dfs: mounting dfs://master:9000/


Thanks for your time,


Sebastian


Re: fuse-dfs

2008-08-07 Thread Pete Wyckoff

This just means your classpath is not set properly, so when fuse-dfs uses
libhdfs to try to connect to your server, it cannot instantiate Hadoop
objects.

I have a JIRA open to improve error messaging when this happens:

https://issues.apache.org/jira/browse/HADOOP-3918

If you use the fuse_dfs_wrapper.sh, you should be able to set HADOOP_HOME
and it will create the classpath for you.

In retrospect, fuse_dfs_wrapper.sh should probably complain and exit if
HADOOP_HOME is not set.

-- pete


On 8/7/08 2:35 PM, Sebastian Vieira [EMAIL PROTECTED] wrote:

 On Thu, Aug 7, 2008 at 4:25 PM, Pete Wyckoff [EMAIL PROTECTED] wrote:
 
 
 Hi Sebastian,
 
 Those 2 things are just warnings and shouldn't cause any problems.  What
 happens when you ls /mnt/hadoop ?
 
 
 [EMAIL PROTECTED] fuse-dfs]# ls /mnt/hadoop
 ls: /mnt/hadoop: Transport endpoint is not connected
 
 Also, this happens when I start fuse-dfs in one terminal and do a df -h in
 another:
 
 [EMAIL PROTECTED] fuse-dfs]# ./fuse_dfs_wrapper.sh dfs://master:9000 
 /mnt/hadoop
 -d
 port=9000,server=master
 fuse-dfs didn't recognize /mnt/hadoop,-2
 fuse-dfs ignoring option -d
 unique: 1, opcode: INIT (26), nodeid: 0, insize: 56
 INIT: 7.8
 flags=0x0003
 max_readahead=0x0002
INIT: 7.8
flags=0x0001
max_readahead=0x0002
max_write=0x0010
unique: 1, error: 0 (Success), outsize: 40
 unique: 2, opcode: STATFS (17), nodeid: 1, insize: 40
 
 -- now I do a df -h in the other terminal --
 
 Exception in thread main java.lang.NoClassDefFoundError:
 org/apache/hadoop/conf/Configuration
 Caused by: java.lang.ClassNotFoundException:
 org.apache.hadoop.conf.Configuration
 at java.net.URLClassLoader$1.run(Unknown Source)
 at java.security.AccessController.doPrivileged(Native Method)
 at java.net.URLClassLoader.findClass(Unknown Source)
 at java.lang.ClassLoader.loadClass(Unknown Source)
 at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
 at java.lang.ClassLoader.loadClass(Unknown Source)
 at java.lang.ClassLoader.loadClassInternal(Unknown Source)
 
 Then the output from df is:
 
 df: `/mnt/hadoop': Software caused connection abort
 
 
 
  And also what version of fuse-dfs are you
 using? The handling of options is different in trunk than in the last
 release.
 
 
 [EMAIL PROTECTED] fuse-dfs]# ./fuse_dfs --version
 ./fuse_dfs 0.1.0
 
 I did a checkout of the latest svn and compiled using the command you gave
 in one of your previous mails.
 
 
 
 You can also look in /var/log/messages.
 
 
 Only one line:
 Aug  7 20:21:05 master fuse_dfs: mounting dfs://master:9000/
 
 
 Thanks for your time,
 
 
 Sebastian



java.io.IOException: Could not get block locations. Aborting...

2008-08-07 Thread Piotr Kozikowski
Hi there:

We would like to know the most likely causes of this sort of
error:

Exception closing
file 
/data1/hdfs/tmp/person_url_pipe_59984_3405334/_temporary/_task_200807311534_0055_m_22_0/part-00022
java.io.IOException: Could not get block locations. Aborting...
at 
org.apache.hadoop.dfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2080)
at 
org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1300(DFSClient.java:1702)
at 
org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1818)

Our map-reduce job does not fail completely, but over 50% of the map tasks fail 
with this same error.

We recently migrated our cluster from 0.16.4 to 0.17.1; previously we didn't 
have this problem using the same input data in a similar map-reduce job.

Thank you,

Piotr



RE: Distributed Lucene - from hadoop contrib

2008-08-07 Thread Deepika Khera
Hey guys,

I would appreciate any feedback on this

Deepika

-Original Message-
From: Deepika Khera [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, August 06, 2008 5:39 PM
To: core-user@hadoop.apache.org
Subject: Distributed Lucene - from hadoop contrib

Hi,

 

I am planning to use distributed lucene from hadoop.contrib.index for
indexing. Has anyone used this or tested it? Any issues or comments?

 

I see that the design described is different from HDFS (the Namenode is
stateless, stores no information regarding blocks for files, etc.). Does
anyone know how hard it will be to set up this kind of system, or is there
something that can be reused?

 

A reference link -

 

http://wiki.apache.org/hadoop/DistributedLucene

 

Thanks,
Deepika



Re: Are lines broken in dfs and/or in InputSplit

2008-08-07 Thread Doug Cutting

Kevin wrote:

Yes, I have looked at the block files and it matches what you said. I
am just wondering if there is some property or flag that would turn
this feature on, if it exists.


No.  If you required this then you'd need to pad your data, but I'm not 
sure why you'd ever require it.  Running off the end of a block in 
mapreduce makes for a small amount of non-local i/o, but it's generally 
insignificant.


Doug


Re: Distributed Lucene - from hadoop contrib

2008-08-07 Thread Ning Li
http://wiki.apache.org/hadoop/DistributedLucene
and hadoop.contrib.index are two different things.

For information on hadoop.contrib.index, see the README file in the package.

I believe you can find code for http://wiki.apache.org/hadoop/DistributedLucene
at http://katta.wiki.sourceforge.net/.

Ning


On 8/7/08, Deepika Khera [EMAIL PROTECTED] wrote:
 Hey guys,

 I would appreciate any feedback on this

 Deepika

 -Original Message-
 From: Deepika Khera [mailto:[EMAIL PROTECTED]
 Sent: Wednesday, August 06, 2008 5:39 PM
 To: core-user@hadoop.apache.org
 Subject: Distributed Lucene - from hadoop contrib

 Hi,



 I am planning to use distributed lucene from hadoop.contrib.index for
 indexing. Has anyone used this or tested it? Any issues or comments?



 I see that the design described is different from HDFS (the Namenode is
 stateless, stores no information regarding blocks for files, etc.). Does
 anyone know how hard it will be to set up this kind of system, or is there
 something that can be reused?



 A reference link -



 http://wiki.apache.org/hadoop/DistributedLucene



 Thanks,
 Deepika




Passing TupleWritable between map and reduce

2008-08-07 Thread Michael Andrews
Hi,

I am a new Hadoop developer and am struggling to understand why I cannot pass 
TupleWritable between a map and reduce function.  I have modified the wordcount 
example to demonstrate the issue.  Also, I am using Hadoop 0.17.1.

package wordcount;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.mapred.join.*;

public class WordCount {

    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, TupleWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value,
                OutputCollector<Text, TupleWritable> output, Reporter reporter)
                throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            TupleWritable tuple = new TupleWritable(new Writable[] { one });
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, tuple);
            }
        }
    }

    public static class Reduce extends MapReduceBase
            implements Reducer<Text, TupleWritable, Text, TupleWritable> {
        public void reduce(Text key, Iterator<TupleWritable> values,
                OutputCollector<Text, TupleWritable> output, Reporter reporter)
                throws IOException {
            IntWritable i = new IntWritable();
            int sum = 0;
            while (values.hasNext()) {
                i = (IntWritable) values.next().get(0);
                sum += i.get();
            }
            TupleWritable tuple = new TupleWritable(new Writable[] { new IntWritable(sum) });
            output.collect(key, tuple);
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(TupleWritable.class);
        conf.setMapperClass(Map.class);
        conf.setReducerClass(Reduce.class);
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}
The output is always empty tuples ('[]').  Using the debugger, I have 
determined that the line:
TupleWritable tuple = new TupleWritable(new Writable[] { one } );

is properly constructing the desired tuple.  I am not sure whether it is being 
output correctly by output.collect, as I cannot find the field in the 
OutputCollector data structure.  When I check in the reduce method, the values 
are always empty tuples.  I have a feeling it has something to do with this 
line in the JavaDoc:

TupleWritable(Writable[] vals)
  Initialize tuple with storage; unknown whether any of them contain 
written values.

Thanks in advance for any and all help,

Michael




Re: Passing TupleWritable between map and reduce

2008-08-07 Thread Michael Andrews

Sorry about the massive code chunk; I am not used to this mail client. I have 
attached the file instead.

On 8/7/08 4:18 PM, Michael Andrews [EMAIL PROTECTED] wrote:

Hi,

I am a new hadoop developer and am struggling to understand why I cannot pass 
TupleWritable between a map and reduce function.  I have modified the wordcount 
example to demonstrate the issue.  Also I am using hadoop 0.17.1.

Is properly constructing the desired tuple.  I am not sure whether it is being 
output correctly by output.collect, as I cannot find the field in the 
OutputCollector data structure.  When I check in the reduce method, the values 
are always empty tuples.  I have a feeling it has something to do with this 
line in the JavaDoc:

TupleWritable(Writable[] vals)
  Initialize tuple with storage; unknown whether any of them contain 
written values.

Thanks in advance for any and all help,

Michael





Re: Passing TupleWritable between map and reduce

2008-08-07 Thread Chris Douglas
You need access to TupleWritable::setWritten(int). If you want to use  
TupleWritable outside the join package, then you need to make this  
(and probably related methods, like clearWritten(int)) public and  
recompile.


Please file a JIRA if you think it should be more general. -C

On Aug 7, 2008, at 4:18 PM, Michael Andrews wrote:


Hi,

I am a new hadoop developer and am struggling to understand why I  
cannot pass TupleWritable between a map and reduce function.  I have  
modified the wordcount example to demonstrate the issue.  Also I am  
using hadoop 0.17.1.


[WordCount code quoted from the original message snipped; see the original post above]
The output is always empty tuples ('[]').  Using the debugger, I  
have determined that the line:

   TupleWritable tuple = new TupleWritable(new Writable[] { one } );

Is properly constructing the desired tuple.  I am not sure if it is  
being outputed correctly by output.collect as I cannot find the  
field in the OutputCollector data structure.  When I check in the  
reduce method the values are always empty tuples.  I have a feeling  
it has something to do with this line in the JavaDoc:


TupleWritable(Writable[] vals)
 Initialize tuple with storage; unknown whether any of them  
contain written values.


Thanks in advance for any and all help,

Michael






Re: extracting input to a task from a (streaming) job?

2008-08-07 Thread John Heidemann

On Thu, 07 Aug 2008 19:42:05 +0200, Leon Mergen wrote: 
Hello John,

On Thu, Aug 7, 2008 at 6:30 PM, John Heidemann [EMAIL PROTECTED] wrote:


 I have a large Hadoop streaming job that generally works fine,
 but a few (2-4) of the ~3000 maps and reduces have problems.
 To make matters worse, the problems are system-dependent (we run on a
 cluster with machines of slightly different OS versions).
 I'd of course like to debug these problems, but they are embedded in a
 large job.

 Is there a way to extract the input given to a reducer from a job, given
 the task identity?  (This would also be helpful for mappers.)


I believe you should set keep.failed.tasks.files to true -- this way, given
a task id, you can see what input files it has in
~/taskTracker/${taskid}/work (source:
http://hadoop.apache.org/core/docs/r0.17.0/mapred_tutorial.html#IsolationRunner
)

On top of that, you can always use the debugging facilities:

http://hadoop.apache.org/core/docs/r0.17.0/mapred_tutorial.html#Debugging

When map/reduce task fails, user can run script for doing post-processing
on task logs i.e task's stdout, stderr, syslog and jobconf. The stdout and
stderr of the user-provided debug script are printed on the diagnostics. 

I hope this helps.

Thanks.

It looks like IsolationRunner is what I'm asking for.  I'll try it out.

I was aware of the logs, but unfortunately I have problems where inputs hang
or don't log meaningful information.


Separately, I found the output from the map stage
(in our config, in:
.../hadoop-hadoop/mapred/local/taskTracker/jobcache/job_200808051739_0005/attempt_200808051739_0005_r_09_0/output/

which is a bit different from taskTracker/${taskid}/work.  There's a
work dir parallel to output, but it's empty.)

Hopefully IsolationRunner will deal with this layout.

   -John


Re: Passing TupleWritable between map and reduce

2008-08-07 Thread Michael Andrews
OK, thanks for the information.  I guess it seems strange to want to use 
TupleWritable in this way, but it just seemed like the right thing to do 
based on the API docs. Is it more idiomatic to inherit from Writable when 
processing structured data?  Again, I am really new to the Hadoop community, but 
I will try to file something in JIRA on this. I'm not really sure how to proceed 
with a patch; maybe I could just try to clarify the docs?

On 8/7/08 4:38 PM, Chris Douglas [EMAIL PROTECTED] wrote:

You need access to TupleWritable::setWritten(int). If you want to use
TupleWritable outside the join package, then you need to make this
(and probably related methods, like clearWritten(int)) public and
recompile.

Please file a JIRA if you think it should be more general. -C

On Aug 7, 2008, at 4:18 PM, Michael Andrews wrote:

 Hi,

 I am a new hadoop developer and am struggling to understand why I
 cannot pass TupleWritable between a map and reduce function.  I have
 modified the wordcount example to demonstrate the issue.  Also I am
 using hadoop 0.17.1.

 [WordCount code quoted from the original message snipped; see the original post above]
 The output is always empty tuples ('[]').  Using the debugger, I
 have determined that the line:
TupleWritable tuple = new TupleWritable(new Writable[] { one } );

 Is properly constructing the desired tuple.  I am not sure if it is
 being outputed correctly by output.collect as I cannot find the
 field in the OutputCollector data structure.  When I check in the
 reduce method the values are always empty tuples.  I have a feeling
 it has something to do with this line in the JavaDoc:

 TupleWritable(Writable[] vals)
  Initialize tuple with storage; unknown whether any of them
 contain written values.

 Thanks in advance for any and all help,

 Michael






mapred/map only at 2, always?

2008-08-07 Thread James Graham (Greywolf)

hadoop 0.16.4

Why are mapred.reduce.tasks and mapred.map.tasks always showing up
as 2?

I have the same config on all nodes.
hadoop-site.xml contains the following parameters:

<property>
   <name>mapred.map.tasks</name>
   <value>67</value>
   <description>The default number of map tasks per job.  Typically set
   to a prime several times greater than number of available hosts.
   Ignored when mapred.job.tracker is local.
   </description>
</property>

<property>
   <name>mapred.reduce.tasks</name>
   <value>23</value>
   <description>The default number of reduce tasks per job.  Typically set
   to a prime close to the number of available hosts.  Ignored when
   mapred.job.tracker is local.
   </description>
</property>

<property>
  <name>mapred.job.tracker</name>
  <value>idx1-r70:50030</value>  <!-- mapred.job.tracker -->
  <description>The host and port that the MapReduce job tracker runs
  at.  If local, then jobs are run in-process as a single map
  and reduce task.
  </description>
</property>
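
For what it's worth, the per-job values can also be set from the job driver
itself, which takes the submitting client's hadoop-site.xml out of the
equation. A sketch against the 0.17-style API used elsewhere in this digest
(on 0.16.4 the input/output path setters live on JobConf instead); the class
name is illustrative. Keep in mind that mapred.map.tasks is only a hint, since
the actual number of maps follows the input splits, while the reduce count is
honored as set.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class TaskCountDemo {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(TaskCountDemo.class);
    conf.setJobName("task-count-demo");

    conf.setNumMapTasks(67);     // a hint: the real map count follows the number of input splits
    conf.setNumReduceTasks(23);  // reduces are created exactly as requested

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}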

--
James Graham (Greywolf)   |
650.930.1138|925.768.4053 *
[EMAIL PROTECTED] |
Check out what people are saying about SearchMe! -- click below
http://www.searchme.com/stack/109aa


Re: Passing TupleWritable between map and reduce

2008-08-07 Thread Chris Douglas
Particularly if you know which types to expect in your structured  
data, rolling your own Writable is strongly preferred to  
TupleWritable. The latter serializes to a comically verbose format and  
should only be used when the types and nesting depth are unknown. -C
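
As a concrete illustration of the roll-your-own suggestion, a fixed-shape
Writable for the modified WordCount might look like the sketch below. The
class name is made up; it would be wired in with
conf.setOutputValueClass(CountValue.class) and used as the map/reduce value
type in place of TupleWritable.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

/**
 * A fixed-shape value type for the modified WordCount: just a count today,
 * but extra fields can be added alongside it later.
 */
public class CountValue implements Writable {
  private int count;

  public CountValue() {}                      // no-arg constructor required for deserialization

  public CountValue(int count) { this.count = count; }

  public int getCount() { return count; }

  public void write(DataOutput out) throws IOException {
    out.writeInt(count);
  }

  public void readFields(DataInput in) throws IOException {
    count = in.readInt();
  }

  public String toString() {                  // what TextOutputFormat prints
    return Integer.toString(count);
  }
}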


On Aug 7, 2008, at 5:45 PM, Michael Andrews wrote:

OK thanks for the information.  I guess it seems strange to want to  
use TupleWritable in this way, but this just seemed like the right  
thing to do this based on the API docs. Is it more idiomatic to  
inherit from Writable when processing structured data?  Again, I am  
really new to the hadoop community but I will try to file something  
with JIRA on this. Not really sure how to proceed with a patch,  
maybe I could just try and clarify the docs?


On 8/7/08 4:38 PM, Chris Douglas [EMAIL PROTECTED] wrote:

You need access to TupleWritable::setWritten(int). If you want to use
TupleWritable outside the join package, then you need to make this
(and probably related methods, like clearWritten(int)) public and
recompile.

Please file a JIRA if you think it should be more general. -C

On Aug 7, 2008, at 4:18 PM, Michael Andrews wrote:


Hi,

I am a new hadoop developer and am struggling to understand why I
cannot pass TupleWritable between a map and reduce function.  I have
modified the wordcount example to demonstrate the issue.  Also I am
using hadoop 0.17.1.

[WordCount code quoted from the original message snipped; see the original post above]
The output is always empty tuples ('[]').  Using the debugger, I
have determined that the line:
  TupleWritable tuple = new TupleWritable(new Writable[] { one } );

Is properly constructing the desired tuple.  I am not sure if it is
being outputed correctly by output.collect as I cannot find the
field in the OutputCollector data structure.  When I check in the
reduce method the values are always empty tuples.  I have a feeling
it has something to do with this line in the JavaDoc:

TupleWritable(Writable[] vals)
Initialize tuple with storage; unknown whether any of them
contain written values.

Thanks in advance for any and all help,

Michael









Setting up a Hadoop cluster where nodes are spread over the Internet

2008-08-07 Thread Lucas Nazário dos Santos
Hello,

Can someone point out the extra tasks that need to be performed in order to
set up a cluster where the nodes are spread over the Internet, in different
LANs?

Do I need to open any datanode/namenode ports? How do I get the datanodes to
know the valid namenode IP, and not something like 10.1.1.1?

Any help is appreciated.

Lucas