Re: Hadoop for Independent Tasks not using Map/Reduce?

2009-08-19 Thread yang song
Hadoop streaming is a utility that allows you to create and run Map/Reduce jobs with any executable or script as the mapper and/or the reducer. I'm not familiar with it, but I think you can find something useful here: http://hadoop.apache.org/common/docs/current/streaming.html 2009/8/19 Poole, Samuel
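For the record, a minimal streaming invocation looks roughly like the sketch below (the jar location varies by release, and the input/output paths here are made up); any executables can serve as mapper and reducer:

    hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar \
        -input /user/sam/input \
        -output /user/sam/output \
        -mapper /bin/cat \
        -reducer /usr/bin/wc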

Re: Hadoop for Independent Tasks not using Map/Reduce?

2009-08-19 Thread Owen O'Malley
On Aug 18, 2009, at 2:40 PM, Poole, Samuel [USA] wrote: I am new to Hadoop (I have not yet installed or configured it), and I want to make sure that I have the correct tool for the job. I do not currently have a need for the Map/Reduce functionality, but I am interested in using Hadoop for

Running Cloudera's distribution without their support agreement - is that a bad idea?

2009-08-19 Thread Erik Forsberg
Hi! I'm currently evaluating different Hadoop versions for a new project. I'm tempted by the Cloudera distribution, since it's neatly packaged into .deb files and is the stable release with some patches applied, for example the bzip2 support. I understand that I can get a support

Re: Running Cloudera's distribution without their support agreement - is that a bad idea?

2009-08-19 Thread Harish Mallipeddi
On Wed, Aug 19, 2009 at 4:26 PM, Erik Forsberg forsb...@opera.com wrote: I understand that I can get a support agreement from Cloudera to match this distribution, but if that's not an option, will running the Cloudera distribution put me in a position where I won't get any help from the

Re: How to deal with too many fetch failures?

2009-08-19 Thread yang song
I'm sorry, the version is 0.19.1. 2009/8/19 Ted Dunning ted.dunn...@gmail.com Which version of Hadoop are you running? On Tue, Aug 18, 2009 at 10:23 PM, yang song hadoop.ini...@gmail.com wrote: Hello, all. I have met the problem "too many fetch failures" when I submit a big job (e.g.

Re: Running Cloudera's distribution without their support agreement - is that a bad idea?

2009-08-19 Thread Edward Capriolo
Generally, if I have an issue I will bring it up on the forums and just reference the Hadoop version (e.g. 0.18.3); you're likely to get the same level of help. On 8/19/09, Erik Forsberg forsb...@opera.com wrote: Hi! I'm currently evaluating different Hadoop versions for a new project. I'm tempted by the

Re: Running Cloudera's distribution without their support agreement - is that a bad idea?

2009-08-19 Thread Jason Venner
Cloudera submits their patches back to the projects, and people are free to pick them up. It is becoming a normal thing to run a patched distribution, particularly since Yahoo made their version of 0.20 available. On Wed, Aug 19, 2009 at 5:46 AM, Edward Capriolo edlinuxg...@gmail.com wrote:

submitting multiple small jobs simultaneously

2009-08-19 Thread George Jahad
I'm importing a bunch of data into HDFS. It involves running a bunch of small jobs that don't put much load on my cluster, but it would be nice if I could do them all from the same job client. I'd submit them all asynchronously and then wait for the results of each. I imagine this has been

Re: Faster alternative to FSDataInputStream

2009-08-19 Thread Ananth T. Sarathy
I am not saying there is a slowdown caused by Hadoop. I was wondering if there were any other techniques that optimize speed (i.e. reading a little at a time and writing to the local disk). Ananth T Sarathy On Wed, Aug 19, 2009 at 1:26 AM, Raghu Angadi rang...@yahoo-inc.com wrote: Ananth T. Sarathy

Re: Faster alternative to FSDataInputStream

2009-08-19 Thread Edward Capriolo
Ananth, that is your issue really. For example: I have 20 web servers and I wish to download all the weblogs from all of them into Hadoop. If you write a top-down program that uses FSDataOutput, you are using Hadoop only half way. You are using the distributed file system, but you are not doing any
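To make the distinction concrete, here is a minimal sketch of that "half way" pattern (file names and paths are made up): one client funnels every byte through a single output stream, so the file system is distributed but the copying is not:

    import java.io.FileInputStream;
    import java.io.InputStream;
    import java.io.OutputStream;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class SingleClientUpload {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();  // picks up hadoop-site.xml from the classpath
            FileSystem fs = FileSystem.get(conf);
            InputStream in = new FileInputStream("/var/log/access.log");  // hypothetical local file
            OutputStream out = fs.create(new Path("/logs/access.log"));   // hypothetical HDFS path
            // Every byte flows through this one client; HDFS distributes the
            // storage, but nothing distributes the work of copying.
            IOUtils.copyBytes(in, out, 65536, true);  // final 'true' closes both streams
        }
    }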

Re: How to deal with too many fetch failures?

2009-08-19 Thread Arun C Murthy
I'd dig around a bit more to check whether it's caused by a specific set of nodes... i.e. are maps on specific tasktrackers failing in this manner? Arun On Aug 18, 2009, at 10:23 PM, yang song wrote: Hello, all. I have met the problem "too many fetch failures" when I submit a big

Loading data failed with timeout

2009-08-19 Thread Mayuran Yogarajah
Hello, we were importing several TB of data overnight and it seemed one of the loads failed. We're running Hadoop 0.18.3, and there are 6 nodes in the cluster, all dual quad-core with 6 GB of RAM. We were using hadoop dfs -put to load the data from both the namenode server and the

Re: submitting multiple small jobs simultaneously

2009-08-19 Thread Jakob Homan
George- You can certainly submit jobs asynchronously via the JobClient.submitJob() method (http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/JobClient.html). This will return a handle (a RunningJob instance) that you can poll for completion. This is what the
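A rough sketch of that pattern with the old mapred API (the class name, paths, and mapper/reducer setup are made up or elided; only submitJob and RunningJob come from the message above):

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.RunningJob;

    public class AsyncSubmit {
        public static void main(String[] args) throws Exception {
            List<RunningJob> handles = new ArrayList<RunningJob>();
            for (String input : args) {
                JobConf conf = new JobConf(AsyncSubmit.class);
                FileInputFormat.setInputPaths(conf, new Path(input));
                FileOutputFormat.setOutputPath(conf, new Path(input + ".out"));
                // ... mapper/reducer classes for this small job go here ...
                handles.add(new JobClient(conf).submitJob(conf));  // returns immediately
            }
            for (RunningJob job : handles) {
                job.waitForCompletion();  // block until this job is done
                System.out.println(job.getID() + " successful: " + job.isSuccessful());
            }
        }
    }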

Re: Faster alternative to FSDataInputStream

2009-08-19 Thread Scott Carey
On 8/19/09 10:58 AM, Raghu Angadi rang...@yahoo-inc.com wrote: Edward Capriolo wrote: On Wed, Aug 19, 2009 at 11:11 AM, Edward Capriolo edlinuxg...@gmail.com wrote: It would be as fast as the underlying filesystem goes. I would not agree with that statement. There is overhead. You might be

Re: Location of the source code for the fair scheduler

2009-08-19 Thread Aaron Kimball
Hi Mithila, In the MapReduce SVN tree, it's under src/contrib/fairscheduler/. - Aaron On Wed, Aug 19, 2009 at 2:48 PM, Mithila Nagendra mnage...@asu.edu wrote: Hello, I was wondering how I could locate the source code files for the fair scheduler. Thanks, Mithila

syslog-ng and hadoop

2009-08-19 Thread Mike Anderson
Has anybody had any luck setting up the log4j.properties file to send logs to a syslog-ng server? My log4j.properties excerpt:

    log4j.appender.SYSLOG=org.apache.log4j.net.SyslogAppender
    log4j.appender.SYSLOG.syslogHost=10.0.20.164
    log4j.appender.SYSLOG.layout=org.apache.log4j.PatternLayout

Re: How does hadoop deal with hadoop-site.xml?

2009-08-19 Thread Aaron Kimball
Hi Inifok, This is a confusing aspect of Hadoop, I'm afraid. Settings are divided into two categories: per-job and per-node. Unfortunately, which are which isn't documented. Some settings are applied to the node that is being used. So, for example, if you set fs.default.name on a node to be
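For illustration, a hadoop-site.xml fragment mixing the two kinds of settings (the hostname and values below are made up); fs.default.name is per-node, while mapred.reduce.tasks travels with jobs submitted from the node:

    <configuration>
      <property>
        <name>fs.default.name</name>               <!-- per-node -->
        <value>hdfs://namenode.example.com:9000</value>
      </property>
      <property>
        <name>mapred.reduce.tasks</name>           <!-- per-job -->
        <value>10</value>
      </property>
    </configuration>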

File Chunk to Map Thread Association

2009-08-19 Thread roman kolcun
Hello everyone, could anyone please tell me in which class and method Hadoop downloads a file chunk from HDFS, associates it with the thread that executes the Map function on that chunk, and processes it? I would like to extend Hadoop so that one Task may have more chunks associated and

Re: Location of the source code for the fair scheduler

2009-08-19 Thread Mithila Nagendra
Thanks! But how do I know which version to work with? Mithila On Thu, Aug 20, 2009 at 2:30 AM, Ravi Phulari rphul...@yahoo-inc.com wrote: Currently the fair scheduler source is in hadoop-mapreduce/src/contrib/fairscheduler/. Download the MapReduce source from.

Re: How does hadoop deal with hadoop-site.xml?

2009-08-19 Thread yang song
Thank you, Aaron. I've benefited a lot. "Per-node" means settings associated with the node itself, e.g. fs.default.name, mapred.job.tracker, etc. "Per-job" means settings associated with the jobs that are submitted from the node, e.g. mapred.reduce.tasks. That means, if I set per-job

Re: syslog-ng and hadoop

2009-08-19 Thread Brian Bockelman
Hey Mike, Yup. We find the stock log4j needs two things:

1) Set the rootLogger manually. The way 0.19.x has the root logger set up breaks when adding new appenders. I.e., do:

    log4j.rootLogger=INFO,SYSLOG,console,DRFA,EventCounter

2) Add the headers; otherwise log4j is not compatible
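Putting Mike's excerpt together with these two fixes, the relevant log4j.properties section would look roughly like this (the Facility value, the Header property, and the conversion pattern are assumptions, not taken from the thread):

    log4j.rootLogger=INFO,SYSLOG,console,DRFA,EventCounter
    log4j.appender.SYSLOG=org.apache.log4j.net.SyslogAppender
    log4j.appender.SYSLOG.syslogHost=10.0.20.164
    log4j.appender.SYSLOG.Facility=LOCAL1
    log4j.appender.SYSLOG.Header=true
    log4j.appender.SYSLOG.layout=org.apache.log4j.PatternLayout
    log4j.appender.SYSLOG.layout.ConversionPattern=%d{ISO8601} %p %c: %m%n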