Re: Confusing NameNodeFailover page in Hadoop Wiki

2008-08-07 Thread Steve Loughran
Doug Cutting wrote: Konstantin Shvachko wrote: IMHO we either need to correct it or remove it. +1 Doug I added some pages there on namenode/jobtracker, etc., linking to the failover doc, which I didn't compare to the svn docs to see what was correct. Perhaps the failover page could be set up

Re: DFS. How to read from a specific datanode

2008-08-07 Thread Steve Loughran
Kevin wrote: Thank you for the suggestion. I looked at DFSClient. It appears that the chooseDataNode method decides which data node to connect to. Currently it chooses the first non-dead data node returned by the namenode, which has sorted the nodes by proximity to the client. However, chooseDataNode
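
For context on what the client actually gets to see: the namenode hands back, per block, the list of replica hosts already ordered by proximity, and DFSClient's internal chooseDataNode simply takes the first live entry. The sketch below only lists those replica hosts from application code; it assumes a release that exposes FileSystem.getFileBlockLocations (not the 0.17.1 API discussed in this thread) and does not override which replica is actually read from.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockHosts {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path(args[0]));
        // One entry per block; the hosts are the replica locations the
        // namenode returned, ordered by proximity to this client.
        BlockLocation[] blocks =
            fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation b : blocks) {
          System.out.println("block at offset " + b.getOffset()
              + ", length " + b.getLength());
          for (String host : b.getHosts()) {
            System.out.println("  replica on " + host);
          }
        }
      }
    }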

Re: Configuration: I need help.

2008-08-07 Thread Steve Loughran
Allen Wittenauer wrote: On 8/6/08 11:52 AM, Otis Gospodnetic [EMAIL PROTECTED] wrote: You can put the same hadoop-site.xml on all machines. Yes, you do want a secondary NN - a single NN is a SPOF. Browse the archives a few days back to find an email from Paul about DRBD (disk replication) to

Re: fuse-dfs

2008-08-07 Thread Sebastian Vieira
Thanks. After a lot of experimenting (and of course, right before you sent this reply) I figured it out. I also had to include the path to libhdfs.so in my ld.so.conf and update it before I was able to successfully compile fuse_dfs. However, when I try to mount the HDFS, it fails. I have tried both

Re: How to run hadoop without DNS server?

2008-08-07 Thread Torsten Curdt
While configuring and using the Hadoop framework, it seems that a DNS server must be used for hostname resolution (even if I configure the IP address rather than the hostname in the config/slaves and config/masters files). Because we don't have a local DNS server on our local Ethernet, I have to add

Why is scaling HBase much simpler than scaling a relational db?

2008-08-07 Thread Mork0075
Hello, can someone please explain or point me to some documentation or papers where I can read well-proven facts on why scaling a relational db is so hard and scaling a document-oriented db isn't? So perhaps if I got lots of requests to my relational db, I would duplicate it to several

Re: Why is scaling HBase much simpler than scaling a relational db?

2008-08-07 Thread Steve Loughran
Mork0075 wrote: Hello, can someone please explain or point me to some documentation or papers where I can read well-proven facts on why scaling a relational db is so hard and scaling a document-oriented db isn't? http://labs.google.com/papers/bigtable.html relational dbs are great for

Re: DFS. How to read from a specific datanode

2008-08-07 Thread Kevin
Yes, I agree with you that it should be negotiated. That is, the namenode provides an ordered list and the client can choose from it based on its own measurements. But I am afraid 0.17.1 does not provide an easy interface for this. -Kevin On Thu, Aug 7, 2008 at 3:40 AM, Steve Loughran [EMAIL PROTECTED]

Re: hdfs question

2008-08-07 Thread Pete Wyckoff
One way to get all Unix commands to work as-is is to mount HDFS as a normal Unix filesystem with either fuse-dfs (in contrib) or hdfs-fuse (on Google Code). Pete On 8/6/08 5:08 PM, Mori Bellamy [EMAIL PROTECTED] wrote: hey all, often I find it would be convenient for me to run conventional

extracting input to a task from a (streaming) job?

2008-08-07 Thread John Heidemann
I have a large Hadoop streaming job that generally works fine, but a few (2-4) of the ~3000 maps and reduces have problems. To make matters worse, the problems are system-dependent (we run on a cluster with machines of slightly different OS versions). I'd of course like to debug these problems,
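
One approach often suggested for this kind of problem (a sketch under the classic org.apache.hadoop.mapred API, not necessarily what this thread settled on): tell the framework to keep the working files of failed tasks, so the exact input split and stderr can be inspected on the tasktracker afterwards.

    import org.apache.hadoop.mapred.JobConf;

    public class KeepFailedTaskFiles {
      // Preserve the working directory of any task that fails, so its
      // input, stderr and job.xml remain on the tasktracker for inspection.
      public static JobConf configure(JobConf conf) {
        conf.setKeepFailedTaskFiles(true);   // keep.failed.task.files=true
        return conf;
      }
    }

For a streaming job the same property would be passed on the command line rather than in Java (the 0.17-era streaming jar took -jobconf key=value); the exact flag name should be checked against the release in use.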

Re: hadoop question

2008-08-07 Thread Khanh Nguyen
Can you also post your hadoop-site.xml and hadoop-default.xml? -k On Thu, Aug 7, 2008 at 3:52 AM, Mr.Thien [EMAIL PROTECTED] wrote: Hi everyone, I am trying to use Hadoop. I set up my computer (thientd-desktop) as master (jobtracker and namenode). Two other computers: trunght-desktop and

Re: extracting input to a task from a (streaming) job?

2008-08-07 Thread Leon Mergen
Hello John, On Thu, Aug 7, 2008 at 6:30 PM, John Heidemann [EMAIL PROTECTED] wrote: I have a large Hadoop streaming job that generally works fine, but a few (2-4) of the ~3000 maps and reduces have problems. To make matters worse, the problems are system-dependent (we run on a cluster with

Re: reduce job did not complete in a long time

2008-08-07 Thread Karl Anderson
On 28-Jul-08, at 6:33 PM, charles du wrote: Hi: I tried to run one of my map/reduce jobs on a cluster (Hadoop 0.17.0). I used 10 reducers. 9 of them return quickly (in a few seconds), but one has been running for several hours, with still no sign of completion. Do you know how I can debug it

Re: reduce job did not complete in a long time

2008-08-07 Thread Miles Osborne
You should use the web UI -- each mapper/reducer can be inspected and there is no need to ssh in. Miles 2008/8/7 Karl Anderson [EMAIL PROTECTED] On 28-Jul-08, at 6:33 PM, charles du wrote: Hi: I tried to run one of my map/reduce jobs on a cluster (Hadoop 0.17.0). I used 10 reducers. 9

Join example

2008-08-07 Thread John DeTreville
Hadoop ships with a few example programs. One of these is join, which I believe demonstrates map-side joins. I'm finding its usage instructions a little impenetrable; could anyone send me instructions that are more like "type this, then type this, then type this"? Thanks in advance. Cheers, John
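
Running the example jar with the join argument and no further arguments usually prints its usage string; beyond that, the machinery it demonstrates is CompositeInputFormat from the classic mapred API. A hedged sketch of that plumbing, with made-up input paths, looks roughly like this:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.KeyValueTextInputFormat;
    import org.apache.hadoop.mapred.join.CompositeInputFormat;

    public class MapSideJoinSketch {
      public static void configure(JobConf conf) {
        // Both inputs must already be sorted and identically partitioned;
        // /data/a and /data/b are hypothetical paths used for illustration.
        String expr = CompositeInputFormat.compose(
            "inner", KeyValueTextInputFormat.class,
            new Path("/data/a"), new Path("/data/b"));
        conf.set("mapred.join.expr", expr);
        conf.setInputFormat(CompositeInputFormat.class);
        // ... then set the mapper, output types and output path as usual
        // and submit with JobClient.runJob(conf).
      }
    }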

Re: fuse-dfs

2008-08-07 Thread Sebastian Vieira
On Thu, Aug 7, 2008 at 4:25 PM, Pete Wyckoff [EMAIL PROTECTED] wrote: Hi Sebastian, Those two things are just warnings and shouldn't cause any problems. What happens when you ls /mnt/hadoop? [EMAIL PROTECTED] fuse-dfs]# ls /mnt/hadoop ls: /mnt/hadoop: Transport endpoint is not connected

Re: fuse-dfs

2008-08-07 Thread Pete Wyckoff
This just means your classpath is not set properly, so when fuse-dfs uses libhdfs to try to connect to your server, it cannot instantiate Hadoop objects. I have a JIRA open to improve error messaging when this happens: https://issues.apache.org/jira/browse/HADOOP-3918 If you use the

java.io.IOException: Could not get block locations. Aborting...

2008-08-07 Thread Piotr Kozikowski
Hi there: We would like to know the most likely causes of this sort of error: Exception closing file /data1/hdfs/tmp/person_url_pipe_59984_3405334/_temporary/_task_200807311534_0055_m_22_0/part-00022 java.io.IOException: Could not get block locations. Aborting... at

RE: Distributed Lucene - from hadoop contrib

2008-08-07 Thread Deepika Khera
Hey guys, I would appreciate any feedback on this. Deepika -Original Message- From: Deepika Khera [mailto:[EMAIL PROTECTED] Sent: Wednesday, August 06, 2008 5:39 PM To: core-user@hadoop.apache.org Subject: Distributed Lucene - from hadoop contrib Hi, I am planning to use

Re: Are lines broken in dfs and/or in InputSplit

2008-08-07 Thread Doug Cutting
Kevin wrote: Yes, I have looked at the block files and they match what you said. I am just wondering if there is some property or flag that would turn this feature on, if it exists. No. If you required this, then you'd need to pad your data, but I'm not sure why you'd ever require it.
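
The reason padding is not needed: blocks split the byte stream wherever they happen to fall, and the record reader repairs the damage by a simple convention -- skip the partial line at the start of a split (unless the split starts at offset 0), and read past the end of the split to finish the last line. A standalone illustration of that convention (plain Java, not the actual LineRecordReader source):

    import java.util.ArrayList;
    import java.util.List;

    public class SplitLineDemo {
      // Read the lines belonging to the byte range [start, end):
      // skip the partial first line unless start == 0, and let the
      // last line run past 'end' (the next split will skip it).
      static List<String> readSplit(byte[] data, int start, int end) {
        List<String> lines = new ArrayList<String>();
        int pos = start;
        if (start != 0) {                       // skip partial first line
          while (pos < data.length && data[pos - 1] != '\n') pos++;
        }
        while (pos < end) {
          int lineStart = pos;
          while (pos < data.length && data[pos] != '\n') pos++;
          lines.add(new String(data, lineStart, pos - lineStart));
          pos++;                                // step over the newline
        }
        return lines;
      }

      public static void main(String[] args) {
        byte[] data = "alpha\nbravo\ncharlie\ndelta\n".getBytes();
        int mid = 9;                            // a "block boundary" inside "bravo"
        System.out.println(readSplit(data, 0, mid));            // [alpha, bravo]
        System.out.println(readSplit(data, mid, data.length));  // [charlie, delta]
      }
    }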

Re: Distributed Lucene - from hadoop contrib

2008-08-07 Thread Ning Li
http://wiki.apache.org/hadoop/DistributedLucene and hadoop.contrib.index are two different things. For information on hadoop.contrib.index, see the README file in the package. I believe you can find code for http://wiki.apache.org/hadoop/DistributedLucene at http://katta.wiki.sourceforge.net/.

Passing TupleWritable between map and reduce

2008-08-07 Thread Michael Andrews
Hi, I am a new Hadoop developer and am struggling to understand why I cannot pass TupleWritable between a map and reduce function. I have modified the wordcount example to demonstrate the issue. Also, I am using Hadoop 0.17.1. package wordcount; import java.io.IOException; import java.util.*;

Re: Passing TupleWritable between map and reduce

2008-08-07 Thread Michael Andrews
Sorry about the massive code chunk; I am not used to this mail client, so I attached the file instead. On 8/7/08 4:18 PM, Michael Andrews [EMAIL PROTECTED] wrote: Hi, I am a new Hadoop developer and am struggling to understand why I cannot pass TupleWritable between a map and reduce function.

Re: Passing TupleWritable between map and reduce

2008-08-07 Thread Chris Douglas
You need access to TupleWritable::setWritten(int). If you want to use TupleWritable outside the join package, then you need to make this (and probably related methods, like clearWritten(int)) public and recompile. Please file a JIRA if you think it should be more general. -C On Aug 7,

Re: extracting input to a task from a (streaming) job?

2008-08-07 Thread John Heidemann
On Thu, 07 Aug 2008 19:42:05 +0200, Leon Mergen wrote: Hello John, On Thu, Aug 7, 2008 at 6:30 PM, John Heidemann [EMAIL PROTECTED] wrote: I have a large Hadoop streaming job that generally works fine, but a few (2-4) of the ~3000 maps and reduces have problems. To make matters worse, the

Re: Passing TupleWritable between map and reduce

2008-08-07 Thread Michael Andrews
OK, thanks for the information. I guess it seems strange to want to use TupleWritable in this way, but it just seemed like the right thing to do based on the API docs. Is it more idiomatic to implement Writable when processing structured data? Again, I am really new to the Hadoop

mapred/map only at 2, always?

2008-08-07 Thread James Graham (Greywolf)
Hadoop 0.16.4. Why are mapred.reduce.tasks and mapred.map.tasks always showing up as 2? I have the same config on all nodes. hadoop-site.xml contains the following parameters: <property> <name>mapred.map.tasks</name> <value>67</value> <description>The default number of map tasks per job.
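
One frequent explanation (offered here as a hedged guess, not a diagnosis): the job's configuration is assembled on the submitting client, so values set only in the slaves' hadoop-site.xml never reach it, and mapred.map.tasks is only a hint in any case, since the real number of maps follows the input splits. Setting the counts on the job itself avoids depending on every node's site file:

    import org.apache.hadoop.mapred.JobConf;

    public class TaskCountSketch {
      public static JobConf configure(JobConf conf) {
        conf.setNumMapTasks(67);     // a hint only; actual maps follow the input splits
        conf.setNumReduceTasks(12);  // reduces are used as given (12 is an arbitrary example)
        return conf;
      }
    }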

Re: Passing TupleWritable between map and reduce

2008-08-07 Thread Chris Douglas
Particularly if you know which types to expect in your structured data, rolling your own Writable is strongly preferred to TupleWritable. The latter serializes to a comically verbose format and should only be used when the types and nesting depth are unknown. -C On Aug 7, 2008, at 5:45 PM,
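
A minimal example of the kind of hand-rolled Writable being recommended (the field names here are purely illustrative): two known fields, serialized compactly and in a fixed order.

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.Writable;

    // A value type with fixed, known fields; far more compact on the wire
    // than a TupleWritable carrying the same data.
    public class WordStat implements Writable {
      private Text word = new Text();
      private long count;

      public void set(String w, long c) { word.set(w); count = c; }
      public String getWord() { return word.toString(); }
      public long getCount() { return count; }

      public void write(DataOutput out) throws IOException {
        word.write(out);
        out.writeLong(count);
      }

      public void readFields(DataInput in) throws IOException {
        word.readFields(in);
        count = in.readLong();
      }

      public String toString() { return word + "\t" + count; }
    }

The implicit no-argument constructor matters, since the framework instantiates the class reflectively during deserialization; to use it as a key rather than a value it would also need to implement WritableComparable.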

Setting up a Hadoop cluster where nodes are spread over the Internet

2008-08-07 Thread Lucas Nazário dos Santos
Hello, Can someone tell me what extra tasks need to be performed in order to set up a cluster where nodes are spread over the Internet, in different LANs? Do I need to open up any datanode/namenode ports? How do I get the datanodes to know the valid namenode IP, and not something