HOD on Scyld

2010-04-12 Thread Boyu Zhang
Dear All, I have been trying to use HOD on Scyld as a common user (not root), but I am having problems getting it to start. I am wondering whether anyone has used HOD on a Scyld cluster successfully? Any help would be appreciated, thanks! Boyu

Re: Failure to find shared lib in distributed cache

2010-04-12 Thread Keith Wiley
I love working on a problem for an hour, sending an email for help, then solving it myself. The problem was the space after the comma in the -files option: $ hadoop pipes -D hadoop.pipes.java.recordreader=true -D hadoop.pipes.java.recordwriter=true -files \ ...[SEVERAL .so FILES TO DISTRIBUTED CACHE] hdfs

Failure to find shared lib in distributed cache

2010-04-12 Thread Keith Wiley
I've been steadily adding more and more shared libraries to the -files option of my pipes command and have had moderate success in that each time I add a new library the app no longer fails on that library, but rather on the next one. However, I've hit a snag. I'm getting the following error ev

Re: Optimal setup for a test problem

2010-04-12 Thread Andrew Nguyen
I guess my question below can be rephrased as, "What are the absolute minimum hardware requirements for me to still see 'better-than-a-single-machine' performance?" Thanks! On Apr 12, 2010, at 1:45 PM, Andrew Nguyen wrote: > I don't think you can :-). Sorry, they are 100Mbps NICs... I get > 95Mbit

Re: Optimal setup for a test problem

2010-04-12 Thread Andrew Nguyen
I don't think you can :-). Sorry, they are 100Mbps NICs... I get 95Mbit/sec from one node to another with iperf. Should I still be expecting such dismal performance with just 100Mbps? On Apr 12, 2010, at 1:31 PM, Todd Lipcon wrote: > On Mon, Apr 12, 2010 at 1:05 PM, Andrew Nguyen < > andrew-

Re: Optimal setup for a test problem

2010-04-12 Thread Todd Lipcon
On Mon, Apr 12, 2010 at 1:05 PM, Andrew Nguyen < andrew-lists-had...@ucsfcti.org> wrote: > 5 identically spec'ed nodes, each has: > > 2 GB RAM > Pentium 4 3.0G with HT > 250GB HDD on PATA > 10Mbps NIC > This is probably your issue - a 10Mbps NIC? I didn't know you could even get those anymore! Had

Re: Optimal setup for a test problem

2010-04-12 Thread Andrew Nguyen
@Todd: I do need the sorting behavior, eventually. However, I'll try it with zero reduce tasks to see. @Alex: Yes, I was planning on incrementally building my mapper and reducer functions, so currently the mapper takes the value, multiplies it by the gain, adds the offset, and outputs a new
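For the scaling step described above, here is a minimal sketch of what such a mapper might look like with the 0.20-era mapreduce API. The class name, the one-sample-per-line input assumption, and the "scale.gain"/"scale.offset" config keys are all hypothetical, not Andrew's actual code:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Hypothetical mapper: reads one numeric sample per line,
    // applies value * gain + offset, and emits the scaled sample.
    public class ScaleMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
      private double gain;
      private double offset;

      @Override
      protected void setup(Context context) {
        // "scale.gain" and "scale.offset" are made-up config keys.
        gain = context.getConfiguration().getFloat("scale.gain", 1.0f);
        offset = context.getConfiguration().getFloat("scale.offset", 0.0f);
      }

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        double sample = Double.parseDouble(value.toString().trim());
        context.write(key, new Text(String.valueOf(sample * gain + offset)));
      }
    }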

Re: Optimal setup for a test problem

2010-04-12 Thread alex kamil
Andrew, I would also suggest running the DFSIO benchmark to isolate IO-related issues:
hadoop jar hadoop-0.20.2-test.jar TestDFSIO -write -nrFiles 10 -fileSize 1000
hadoop jar hadoop-0.20.2-test.jar TestDFSIO -read -nrFiles 10 -fileSize 1000
There are additional tests specific to MapReduce - run "h

Re: Most effective way to use a lot of shared libraries?

2010-04-12 Thread Brian Bockelman
Hey Keith, The way we (LHC) approach a similar problem (not using Hadoop, but basically the same thing) is that we distribute the common software everywhere (either through a shared file system or an RPM installed as part of the base image) and allow users to fly in changed code with the

Re: feed queue fetcher with hadoop/zookeeper/gearman?

2010-04-12 Thread Patrick Hunt
See this environment http://bit.ly/4ekN8G. Subsequently I used the 3-server setup, each configured with 8 GB of heap in the JVM and 4 CPUs/JVM (I think I used 10-second session timeouts for this) for some additional testing that I've not written up yet. I was able to run ~500 clients (same test
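For reference, the session timeout Patrick mentions is the second argument of the ZooKeeper client constructor, in milliseconds. A minimal sketch, with a hypothetical connect string:

    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;

    public class ZkConnectDemo {
      public static void main(String[] args) throws Exception {
        // 10-second session timeout, matching the setup described above;
        // the three-server connect string here is hypothetical.
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 10000,
            new Watcher() {
              public void process(WatchedEvent event) {
                System.out.println("Event: " + event);
              }
            });
        Thread.sleep(Long.MAX_VALUE); // keep the session alive for the demo
      }
    }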

Most effective way to use a lot of shared libraries?

2010-04-12 Thread Keith Wiley
I am having partial success chipping away at the shared library dependencies of my Hadoop job by submitting them to the distributed cache with the -files option. When I add another library to the -files list, it seems to work in that the run no longer fails on that library, but rather fails on an

Re: How to kill orphaned hadoop job through web UI

2010-04-12 Thread Keith Wiley
On Apr 12, 2010, at 10:22, Edward Capriolo wrote: > On Mon, Apr 12, 2010 at 1:17 PM, Keith Wiley wrote: > >> So I ^C a job from the command line and get my prompt back, but sometimes >> the job remains on the cluster, I can see it on the admin web UI, and >> sometimes it lingers there for hours

Re: How to kill orphaned hadoop job through web UI

2010-04-12 Thread Keith Wiley
Update: on another mailing list it was shown how to use the hadoop binary with the 'job -kill' command to kill a job. On Apr 12, 2010, at 10:17, Keith Wiley wrote: > So I ^C a job from the command line and get my prompt back, but sometimes the > job remains on the cluster, I can see it on the
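The CLI form referenced above is hadoop job -kill <job_id>. The same thing can be done programmatically through the 0.20-era JobClient API; a minimal sketch, with a hypothetical job ID:

    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.JobID;
    import org.apache.hadoop.mapred.RunningJob;

    public class KillJobDemo {
      public static void main(String[] args) throws Exception {
        JobClient client = new JobClient(new JobConf());
        // Look up the lingering job by the ID shown in the admin web UI
        // (this particular ID is made up for the example).
        RunningJob job = client.getJob(JobID.forName("job_201004120000_0001"));
        if (job != null && !job.isComplete()) {
          job.killJob();
        }
      }
    }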

Re: Optimal setup for a test problem

2010-04-12 Thread Todd Lipcon
Hi Andrew, Do you need the sorting behavior that having an identity reducer gives you? If not, set the number of reduce tasks to 0 and you'll end up with a map-only job, which should be significantly faster. -Todd On Mon, Apr 12, 2010 at 9:43 AM, Andrew Nguyen < andrew-lists-had...@ucsfcti.org>
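A minimal sketch of the map-only setup Todd describes, using the 0.20 new API. The driver class name is hypothetical, and it reuses the hypothetical ScaleMapper sketched earlier in this thread:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class MapOnlyJob {
      public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "map-only-demo");
        job.setJarByClass(MapOnlyJob.class);
        job.setMapperClass(ScaleMapper.class); // hypothetical mapper from above
        // Zero reduce tasks: map output is written straight to HDFS and
        // the sort/shuffle phase is skipped entirely.
        job.setNumReduceTasks(0);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }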

Re: feed queue fetcher with hadoop/zookeeper/gearman?

2010-04-12 Thread Thomas Koch
Mahadev Konar: > Hi Thomas, > There are a couple of projects inside Yahoo! that use ZooKeeper as an > event manager for feed processing. > > I am a little bit unclear on your example below. As I understand it: > > 1. There are 1 million feeds that will be stored in HBase. > 2. A map reduce job wi

Re: How to kill orphaned hadoop job through web UI

2010-04-12 Thread Edward Capriolo
On Mon, Apr 12, 2010 at 1:17 PM, Keith Wiley wrote: > So I ^C a job from the command line and get my prompt back, but sometimes > the job remains on the cluster, I can see it on the admin web UI, and > sometimes it lingers there for hours before finally getting flushed. > > Is there a way to kill

How to kill orphaned hadoop job through web UI

2010-04-12 Thread Keith Wiley
So I ^C a job from the command line and get my prompt back, but sometimes the job remains on the cluster, I can see it on the admin web UI, and sometimes it lingers there for hours before finally getting flushed. Is there a way to kill a hadoop job once the command line prompt has returned, onc

Re: Announce: Karmasphere Studio for Hadoop 1.2.0

2010-04-12 Thread Shevek
On Mon, 2010-04-12 at 00:32 +, Allen Wittenauer wrote: > On Apr 10, 2010, at 7:10 PM, Shevek wrote: > > > > * Full cross-platform support > > - Job submission, HDFS and S3 browsing from Windows, MacOS or Linux. > > > If you list three OSes, that isn't cross platform. :) We support quite a

Re: feed queue fetcher with hadoop/zookeeper/gearman?

2010-04-12 Thread Mahadev Konar
Hi Thomas, There are a couple of projects inside Yahoo! that use ZooKeeper as an event manager for feed processing. I am a little bit unclear on your example below. As I understand it: 1. There are 1 million feeds that will be stored in HBase. 2. A map reduce job will be run on these feeds to f
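A minimal sketch of the ZooKeeper-as-queue pattern under discussion: producers enqueue items as persistent sequential znodes, and workers claim the lowest-numbered child. The queue path is hypothetical, error handling is reduced for brevity, and this is not any of the Yahoo! projects' actual code:

    import java.util.Collections;
    import java.util.List;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class FeedQueue {
      private final ZooKeeper zk;
      private final String root = "/feed-queue"; // hypothetical path

      public FeedQueue(ZooKeeper zk) { this.zk = zk; }

      // Producer: enqueue a feed URL as a persistent sequential znode.
      public void enqueue(String feedUrl) throws Exception {
        zk.create(root + "/item-", feedUrl.getBytes("UTF-8"),
            ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT_SEQUENTIAL);
      }

      // Worker: claim the lowest-numbered item. Several workers may read
      // the same znode, but only the one whose delete succeeds owns it.
      public String dequeue() throws Exception {
        while (true) {
          List<String> items = zk.getChildren(root, false);
          if (items.isEmpty()) return null;
          Collections.sort(items);
          String path = root + "/" + items.get(0);
          try {
            byte[] data = zk.getData(path, false, null);
            zk.delete(path, -1);
            return new String(data, "UTF-8");
          } catch (KeeperException.NoNodeException e) {
            // another worker took it; loop and try the next item
          }
        }
      }
    }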

Optimal setup for a test problem

2010-04-12 Thread Andrew Nguyen
Hello, I recently set up a 5-node cluster (1 master, 4 slaves) and am looking to use it to process high volumes of patient physiologic data. As an initial exercise to gain a better understanding, I have attempted to run the following problem (which isn't the type of problem that Hadoop was real

Re: -files flag question

2010-04-12 Thread Keith Wiley
So how does the example work where the second argument is simply separated by a space and indicates some sort of "label" by which to find the file in the distributed cache: ... -files URI_TO_FILE name ... where 'name' is canonically the file name in the URI but without a scheme or path, ju
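On the naming question: the mechanism I'm aware of for giving a cache file a working-directory name is a URI fragment (path#name), not a space-separated label. With the 0.20-era DistributedCache API that looks roughly like the sketch below; the HDFS path and link name are hypothetical:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.filecache.DistributedCache;

    public class CacheSymlinkDemo {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The "#libfoo.so" fragment asks the framework to create a symlink
        // named libfoo.so in each task's working directory (path made up).
        DistributedCache.addCacheFile(
            new URI("hdfs://namenode/user/kwiley/libs/libfoo.so#libfoo.so"), conf);
        DistributedCache.createSymlink(conf);
        // conf would then be used to configure and submit the job as usual.
      }
    }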

Specify Number of Cores in HOD

2010-04-12 Thread Song Liu
Dear all, I'm running HOD (0.20.2) on my cluster. Each machine in my cluster has more than one processor. How can I make use of them? I can see there is an argument which can control the number of nodes to allocate, but I can't see any parameters which can specify the number of processors per

feed queue fetcher with hadoop/zookeeper/gearman?

2010-04-12 Thread Thomas Koch
Hi, I'd like to implement a feed loader with Hadoop and most likely HBase. I've got around 1 million feeds that should be loaded and checked for new entries. However, the feeds have different priorities based on their average update frequency in the past and their relevance. The feeds (url, las