Re: exceptions copying files into HDFS

2010-12-11 Thread Sanford Rockowitz
On 12/11/2010 10:48 PM, Varadharajan Mukundan wrote: Hi, org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /user/rock/input/fair-scheduler.xml could only be replicated to 0 nodes, instead of 1 I think none of your datanodes are actually running. why not use jps and make sure whe

Re: exceptions copying files into HDFS

2010-12-11 Thread li ping
That's right. You have to make sure the datanode is running. If you are using a virtual machine, like VirtualBox, you sometimes have to wait a moment until the datanode is active. It seems to be a performance issue; the datanode in a VM becomes active after several minutes. On Sun, Dec 12, 2010 at

Re: exceptions copying files into HDFS

2010-12-11 Thread Varadharajan Mukundan
Hi, > org.apache.hadoop.ipc.RemoteException: java.io.IOException: File > /user/rock/input/fair-scheduler.xml could only be replicated to 0 nodes, > instead of 1 I think none of your datanodes are actually running. Why not use jps and make sure they are running? Also check the datanode log
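A quick way to act on this advice, sketched as a shell transcript (log path is an assumption for a default 0.20-era tarball install; adjust for your setup):

```
# On each slave node, list the running Hadoop JVMs
jps
# A healthy HDFS slave should show a DataNode process among them.
# If DataNode is missing, check the tail of its log for the startup failure:
tail -n 50 $HADOOP_HOME/logs/hadoop-*-datanode-*.log
```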

exceptions copying files into HDFS

2010-12-11 Thread Sanford Rockowitz
Folks, I'm a Hadoop newbie, and I hope this is an appropriate place to post this question. I'm trying to work through the initial examples. When I try to copy files into HDFS, Hadoop throws exceptions. I imagine it's something in my configuration, but I'm at a loss to figure out what. I

Re: Error: ... It is indirectly referenced from required .class files - implements

2010-12-11 Thread Harsh J
Try adding the commons-logging jar to your build path. It is available in the lib/ folder of your Hadoop distribution. If you use the MapReduce eclipse plugin which comes with the Hadoop distro, it would add all required jars to create a Hadoop project automatically (i.e. everything in lib/*.jar +
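In plain Eclipse terms, that advice amounts to a lib entry in the project's .classpath file (the jar version and path here are guesses based on a stock 0.20.2 tarball; check your own lib/ directory for the exact filename):

```xml
<classpathentry kind="lib"
    path="/path/to/hadoop-0.20.2/lib/commons-logging-1.0.4.jar"/>
```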

Re: Error: ... It is indirectly referenced from required .class files - implements

2010-12-11 Thread maha
This is a compilation error I get in Eclipse ... So, I don't see how putting the hadoop-core.jar in the lib/ directory will change the error. Do you suggest another way of running a Hadoop java program? The way I do it is: create an Eclipse project, Build Paths --> Add External Archives: had

Re: Is it possible to write file output in Map phase once and write another file output in Reduce phase?

2010-12-11 Thread Edward Choi
Thanks. Then I should definitely try that. Thanks for all the info :-) Ed From mp2893's iPhone On 2010. 12. 12., at 3:00 AM, Ted Dunning wrote: > Of course. It is just a set of Hadoop programs. > > 2010/12/11 edward choi > >> Can I operate Bixo on a cluster other than Amazon EC2? >>

Re: Error: ... It is indirectly referenced from required .class files - implements

2010-12-11 Thread li ping
Can you try adding the jar file from your Hadoop lib directory? On Sun, Dec 12, 2010 at 8:00 AM, Maha A. Alabduljalil wrote: > > Hi all, > > I extended my project path with the hadoop-0.20.2-core.jar file, but I can > see that some of the classes I need aren't there, so for example an error I > ge

Error: ... It is indirectly referenced from required .class files - implements

2010-12-11 Thread Maha A. Alabduljalil
Hi all, I extended my project path with the hadoop-0.20.2-core.jar file, but I can see that some of the classes I need aren't there, so for example an error I get: " The type org.apache.commons.logging.Log cannot be resolved. It is indirectly referenced from required .class files- i

Re: Slow final few reducers

2010-12-11 Thread Ted Dunning
The job history program tells you this. The syntax is hideous, but there is a parser provided. On Sat, Dec 11, 2010 at 8:23 AM, Mithila Nagendra wrote: > Just curious and off topic :) How do you find the time taken by each > reducer? What command/method do you use? I need that for my research.
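For reference, the command being described is roughly the following (the output path is a placeholder for your job's output directory):

```
# Prints per-task timings, including each reduce task's shuffle/sort/finish times
hadoop job -history /path/to/job/output/dir
# "-history all" adds full per-attempt detail
hadoop job -history all /path/to/job/output/dir
```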

Re: Slow final few reducers

2010-12-11 Thread Ted Dunning
It sounds like your key distribution is being reflected in the size of your reduce tasks, thus making some of them take much longer than the rest. There are three solutions to this: a) down-sample. Particularly for statistical computations, once you have seen a thousand instances, you have seen

Re: Is it possible to write file output in Map phase once and write another file output in Reduce phase?

2010-12-11 Thread Ted Dunning
Of course. It is just a set of Hadoop programs. 2010/12/11 edward choi > Can I operate Bixo on a cluster other than Amazon EC2? >

Re: Slow final few reducers

2010-12-11 Thread Mithila Nagendra
Hi Rob, Just curious and off topic :) How do you find the time taken by each reducer? What command/method do you use? I need that for my research. Thanks, Mithila On Sat, Dec 11, 2010 at 4:05 AM, Rob Stewart wrote: > Hi, > > I have a problem with a MapReduce job I am trying to run on a 32 node

Re: Slow final few reducers

2010-12-11 Thread Harsh J
On Sat, Dec 11, 2010 at 7:41 PM, Rob Stewart wrote: > Sorry my fault - It's someone running a network simulator on the cluster ! > Culprit found? *wide grin* -- Harsh J www.harshj.com

Re: Slow final few reducers

2010-12-11 Thread Rob Stewart
Sorry my fault - It's someone running a network simulator on the cluster ! Rob On 11 December 2010 14:09, Rob Stewart wrote: > OK, slight update: > > Immediately underneath public void reduce(), I have added a: > System.out.println("Key: " + key.toString()); > > And I am logged on a node that is

Re: Slow final few reducers

2010-12-11 Thread Rob Stewart
OK, slight update: Immediately underneath public void reduce(), I have added a: System.out.println("Key: " + key.toString()); And I am logged on a node that is still working on a reducer. However, it stopped printing "Key:" long ago, so it is not processing new keys. But looking more closely at

Re: Slow final few reducers

2010-12-11 Thread Harsh J
On Sat, Dec 11, 2010 at 5:25 PM, Rob Stewart wrote: > Oh, > > I should add, of the Java processes running on the remaining nodes for > the final wave of reducers, the one taking all the CPU is the "Child" > process (not TaskTracker). I log into the Master, and also, the Java > process taking all t

Re: Multicore Nodes

2010-12-11 Thread Harsh J
On Sat, Dec 11, 2010 at 5:10 PM, Rob Stewart wrote: > Ah, > > that is very interesting indeed. > > I am running on a homogeneous cluster, where each node has 8 cores. > > Does that mean that Hadoop would need to be carefully configured, so > that 8 core machines had a max.tasks value of 8, and dua

Re: Slow final few reducers

2010-12-11 Thread Rob Stewart
Oh, I should add, of the Java processes running on the remaining nodes for the final wave of reducers, the one taking all the CPU is the "Child" process (not TaskTracker). I log into the Master, and also, the Java process taking all the CPU is "Child". Is this normal? thanks, Rob On 11 December

Re: Multicore Nodes

2010-12-11 Thread Rob Stewart
Ah, that is very interesting indeed. I am running on a homogeneous cluster, where each node has 8 cores. Does that mean that Hadoop would need to be carefully configured, so that 8 core machines had a max.tasks value of 8, and dual core machines had the value 2 ? Very useful to know, Rob On 1
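For the record, on an 0.20-era cluster those per-node slot counts live in each node's mapred-site.xml, so a heterogeneous cluster would carry different values on different machines. The values below are illustrative for the 8-core nodes:

```xml
<!-- conf/mapred-site.xml on each 8-core node -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>8</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>8</value>
</property>
```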

Re: Slow final few reducers

2010-12-11 Thread Rob Stewart
Hi, many thanks for your response. A few observations: - I know for a fact that my key distribution is quite radically skewed (some keys with *many* values, most keys with few). - I have overlooked the fact that I need a partitioner. I suspect that this will help dramatically. I realize that the n
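To illustrate why a skewed key distribution piles work onto a few reducers, here is a self-contained sketch of the default hash partitioning rule (the same hash-and-mod formula Hadoop's HashPartitioner uses; the key set and reducer count are made up for the demo):

```java
import java.util.HashMap;
import java.util.Map;

public class PartitionSkewDemo {
    // Default hash partitioning: mask off the sign bit, then mod by reducer count
    static int partition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // Hypothetical skewed input: one "hot" key dominates the record stream
        String[] keys = {"hot", "hot", "hot", "hot", "hot", "hot", "cold", "warm"};
        Map<Integer, Integer> recordsPerReducer = new HashMap<>();
        for (String k : keys) {
            recordsPerReducer.merge(partition(k, 4), 1, Integer::sum);
        }
        // Every "hot" record lands on the same reducer, no matter how many
        // reducers you configure, so that one reducer finishes last
        System.out.println(recordsPerReducer);
    }
}
```

A custom Partitioner that spreads the hot keys, or a combiner that pre-aggregates them map-side, is the usual fix.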

Re: Multicore Nodes

2010-12-11 Thread Harsh J
Hi, On Sat, Dec 11, 2010 at 4:39 PM, Rob Stewart wrote: > Hi, > > When trying to compare Hadoop against other parallel paradigms, it is > important to consider heterogeneous systems. Some may have 100 nodes, > each single core. Some may have 100 nodes, with 8 cores on each, and > others may have

Re: Slow final few reducers

2010-12-11 Thread Harsh J
Hi, Certain reducers may receive a higher share of data than others (Depending on your data/key distribution, the partition function, etc.). Compare the longer reduce tasks' counters with the quicker ones. Are you sure that the reducers that take long are definitely the last wave, as in with IDs

Multicore Nodes

2010-12-11 Thread Rob Stewart
Hi, When trying to compare Hadoop against other parallel paradigms, it is important to consider heterogeneous systems. Some may have 100 nodes, each single core. Some may have 100 nodes, with 8 cores on each, and others may have 5 nodes, 32 cores per node. As Hadoop runs on JVMs on each node, a

Slow final few reducers

2010-12-11 Thread Rob Stewart
Hi, I have a problem with a MapReduce job I am trying to run on a 32-node cluster. The final few reducers take a *lot* longer than the rest, e.g. if I specify 100 reducers, the first 90 will complete in 5 minutes, and then the remaining 10 reducers might take 10 minutes. The same is true for any num

Re: Is it possible to write file output in Map phase once and write another file output in Reduce phase?

2010-12-11 Thread edward choi
Excuse me, but could I ask one more question? Can I operate Bixo on a cluster other than Amazon EC2? I am already running a Hadoop cluster of my own, so I'd like to run Bixo on top of my cluster. But I don't see how to do it on Bixo's "Getting Started" page. All I see are "running locally", "runnin

Re: Is it possible to write file output in Map phase once and write another file output in Reduce phase?

2010-12-11 Thread edward choi
I'd start with only a few RSS feeds at first, but I plan to expand it to the scale of thousands of RSS feeds every 30 minutes eventually. That's why I am so eager to implement my system in Hadoop. I skimmed through Nutch and Bixo but I feel that eventually I'm gonna have to build the system from