Re: TaskTrackers behind NAT

2011-07-18 Thread Allen Wittenauer
On Jul 18, 2011, at 12:53 PM, Ben Clay wrote: > I'd like to spread Hadoop across two physical clusters, one which is > publicly accessible and the other which is behind a NAT. The NAT'd machines > will only run TaskTrackers, not HDFS, and not Reducers either (configured > with 0 Reduce slots). T

Re: How to query a slave node for monitoring information

2011-07-12 Thread Allen Wittenauer
On Jul 12, 2011, at 4:34 PM, wrote: > I am working on deploying Hadoop on a small cluster. For now, I am interested > in > restarting (restart the node or even reboot the OS) the nodes Hadoop detects > as > crashed. There are quite a few scenarios where one service may be up but an

Re: How to query a slave node for monitoring information

2011-07-12 Thread Allen Wittenauer
On Jul 12, 2011, at 3:02 PM, wrote:
> I am new to Hadoop, and I apologize if this was answered before, or if this is not the right list for my question.

common-user@ would likely have been better, but I'm too lazy to forward you there today. :)

> I am trying to do the followi

Re: Lack of data locality in Hadoop-0.20.2

2011-07-12 Thread Allen Wittenauer
On Jul 12, 2011, at 10:27 AM, Virajith Jalaparti wrote: > I agree that the scheduler has lesser leeway when the replication factor is > 1. However, I would still expect the number of data-local tasks to be more > than 10% even when the replication factor is 1. How did you load your data?

Re: Dead data nodes during job excution and failed tasks.

2011-06-30 Thread Allen Wittenauer
On Jun 30, 2011, at 12:36 PM, David Ginzburg wrote:
> Is it possible though the server runs with vm.swappiness =5

That only controls how aggressively the system swaps. If you eat all the RAM in user space, the system is going to start paging memory regardless of swappiness.

Re: How does Hadoop manage memory?

2011-06-30 Thread Allen Wittenauer
On Jun 28, 2011, at 1:43 PM, Peter Wolf wrote: > Hello all, > > I am looking for the right thing to read... > > I am writing a MapReduce Speech Recognition application. I want to run many > Speech Recognizers in parallel. > > Speech Recognizers not only use a large amount of processor, they

Re: Emit an entire file

2011-06-30 Thread Allen Wittenauer
On Jun 28, 2011, at 6:19 AM, Jeremy Cunningham wrote: > I have lots of binary files stored in hdfs. I read them using Apache POI and > can search with no problems. I want to be able to search for keywords (which > I can do) and then copy the file that has the text out to a different > locatio

Re: Dead data nodes during job excution and failed tasks.

2011-06-30 Thread Allen Wittenauer
On Jun 30, 2011, at 10:01 AM, David Ginzburg wrote:
> Hi,
> I am running a certain job which constantly causes dead data nodes (which come back later, spontaneously).

Check your memory usage during the job run. Chances are good the DataNode is getting swapped out.
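
One way to act on "check your memory usage" is to watch how much swap is in use while the job runs. The sketch below is only an illustration, not something from the thread: `swap_used_kb` is a made-up helper name, and it assumes a Linux box where `/proc/meminfo` has `SwapTotal` and `SwapFree` lines.

```shell
#!/bin/sh
# Hypothetical helper (not part of Hadoop): print the amount of swap in use,
# in kB, from a meminfo-style file. Defaults to /proc/meminfo (Linux only).
swap_used_kb() {
    awk '/^SwapTotal:/ {total = $2}
         /^SwapFree:/  {free = $2}
         END           {print total - free}' "${1:-/proc/meminfo}"
}

# Print the current value once, if this machine exposes /proc/meminfo.
if [ -r /proc/meminfo ]; then
    swap_used_kb
fi
```

Running it every few seconds on a DataNode host during the job (for example from a `while sleep 10` loop) should show whether the number climbs just before the node is declared dead.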

Re: tasktracker maximum map tasks for a certain job

2011-06-23 Thread Allen Wittenauer
On Jun 22, 2011, at 2:20 PM, Jonathan Zukerman wrote: > > That way I can set the maximum number of maps and maximum number of reducers > in the configuration of the job ("loadmanager.maximum.maps.per.tasktracker" > will be 1 for these special jobs). > Am I right? Am I missing something?

Re: "No space left on device" and "Could not find any valid local directory for taskTracker/jobcache/"

2011-06-23 Thread Allen Wittenauer
On Jun 23, 2011, at 7:09 AM, Virajith Jalaparti wrote: > Hi, > > I am trying to run a sort job (from hadoop-0.20.2-examples.jar) on 50GB of > data (generated using randomwriter). I am using hadoop-0.20.2 on a cluster > of 3 machines with one machine serving as the master and the other two as > s

Re: Large startup time in remote MapReduce job

2011-06-22 Thread Allen Wittenauer
On Jun 22, 2011, at 10:08 AM, Allen Wittenauer wrote: > > On Jun 21, 2011, at 2:02 PM, Harsh J wrote: >>>> >>>> If your jar does not contain code changes that need to get transmitted >>>> every time, you can consider placing them on the JT/TT classpa

Re: Large startup time in remote MapReduce job

2011-06-22 Thread Allen Wittenauer
On Jun 21, 2011, at 2:02 PM, Harsh J wrote: >>> >>> If your jar does not contain code changes that need to get transmitted >>> every time, you can consider placing them on the JT/TT classpaths >> >>... which means you get to bounce your system every time you change >> code. > > Its ugl

Re: controlling no. of mapper tasks

2011-06-22 Thread Allen Wittenauer
On Jun 20, 2011, at 12:24 PM, wrote: > Hi there, > I know client can send "mapred.reduce.tasks" to specify no. of reduce tasks > and hadoop honours it but "mapred.map.tasks" is not honoured by Hadoop. Is > there any way to control number of map tasks? What I noticed is that Hadoop > is choo

Re: tasktracker maximum map tasks for a certain job

2011-06-22 Thread Allen Wittenauer
On Jun 21, 2011, at 9:52 AM, Jonathan Zukerman wrote: > Hi, > > Is there a way to set the maximum map tasks for all tasktrackers in my > cluster for a certain job? > Most of my tasktrackers are configured to handle 4 maps concurrently, and > most of my jobs don't care where does the map function

Re: concurrent job execution

2011-06-06 Thread Allen Wittenauer
On Jun 3, 2011, at 1:11 AM, Felix Sprick wrote: > Hi, > > We are running MapReduce on Hbase tables and are trying to implement a > scenario with MapReduce where tasks are submitted from a GUI application. > This means that several users (currently 5-10) may use the system in > parallel.

Re: distributed cache exceeding local.cache.size

2011-04-01 Thread Allen Wittenauer
On Apr 1, 2011, at 12:05 PM, Travis Crawford wrote:
> On Thu, Mar 31, 2011 at 3:25 PM, Allen Wittenauer wrote:
>> On Mar 31, 2011, at 11:45 AM, Travis Crawford wrote:
>>> Is anyone familiar with how the distributed cache deals when datasets
>>>

Re: distributed cache exceeding local.cache.size

2011-03-31 Thread Allen Wittenauer
On Mar 31, 2011, at 11:45 AM, Travis Crawford wrote: > Is anyone familiar with how the distributed cache deals when datasets > larger than the total cache size are referenced? I've disabled the job > that caused this situation but am wondering if I can configure things > more defensively.

Re: map tasks vs launched map tasks

2011-03-25 Thread Allen Wittenauer
On Mar 25, 2011, at 10:09 AM, Pedro Costa wrote: > Hi, > > during the setup phase and the cleanup phase of the tasks, the Hadoop > MR uses map tasks to do it. These tasks appears in the counters shown > at the end of an example? > For example, the counter below shows that my example ran 9 map ta

Re: A way to monitor HDFS for a file to come live, and then kick off a job?

2011-03-25 Thread Allen Wittenauer
On Mar 24, 2011, at 10:09 AM, Jonathan Coveney wrote:
> I am not sure if this is the right listserv, forgive me if it is not.

A better choice would likely be hdfs-user@, since this is really about watching files in HDFS.

> My goal is this: monitor HDFS until a file is created, and th

Re: any plans to deploy OSGi bundles on cluster?

2011-01-04 Thread Allen Wittenauer
On Jan 4, 2011, at 10:30 AM, Hiller, Dean (Contractor) wrote: > I guess I meant in the setting for number of tasks in child JVM before > teardown. In that case, it is nice to separate/unload my previous > classes from the child JVM which OSGi does. I was thinking we may do 10 > tasks / JVM sett

Re: large intermediate outputs

2011-01-03 Thread Allen Wittenauer
On Jan 3, 2011, at 5:11 AM, Debbie Fu wrote:
> I think it will cause a disk fill-up, too. Is there any mechanism in Hadoop that handles this situation?

Not in a way that saves the job.

> If my local disk stores too much chunk data, and spare little space for intermediate output, and

Re: any plans to deploy OSGi bundles on cluster?

2011-01-03 Thread Allen Wittenauer
On Jan 2, 2011, at 9:51 AM, Hiller, Dean (Contractor) wrote: > I was looking at distributed cache and how I need to copy local jars to > hdfs. I was wondering if there was any plans to just deploy an OSGi > bundle(ie. Introspect and auto deploy jars from bundle to the > distributed cache and the

Re: Reduce Task Priority / Scheduler

2010-12-20 Thread Allen Wittenauer
This makes sense until you realize:

a) It won't scale.
b) Machines fail.

On Dec 20, 2010, at 5:26 AM, Martin Becker wrote:
> I wrote a little bit much, so I put a summary up front. Sorry about that.
>
> Summary:
> 1) Is there any point in time, where on

Re: Reduce Task Priority / Scheduler

2010-12-19 Thread Allen Wittenauer
On Dec 19, 2010, at 7:39 AM, Martin Becker wrote: > Hello everybody, > > is there a possibility to make sure that certain/all reduce tasks, > i.e. the reducers to certain keys, are executed in a specified order? > This is Job internal, so the Job Scheduler is probably the wrong place to > start

Re: Passing messages

2010-12-19 Thread Allen Wittenauer
On Dec 19, 2010, at 10:21 AM, Eric wrote:
> I don't know if there is such a thing in Hadoop, I'm guessing not since MapReduce is designed to have independent mappers and reducers.

Yup.

> I'm just suggesting something here: you could write a small server yourself.
> Say you start yo

Re: How to Influence Reduce Task Location.

2010-12-19 Thread Allen Wittenauer
On Dec 19, 2010, at 10:23 AM, Jane Chen wrote:
> Suppose that the output is written to a database, that only runs on certain nodes. It will be desirable to schedule the reducer tasks to run on the nodes local or close to the database nodes.

a) That's a side-effect--pretty much "a

Re: Scheduler in Hadoop MR

2010-12-08 Thread Allen Wittenauer
On Dec 7, 2010, at 6:47 PM, Harsh J wrote: >> 1 - When we've two JobTrackers running simultaneously, each JobTracker is >> running in a separate process? > > You can't run simultaneous JobTrackers for the same data-cluster > AFAIK; only one JT process can exist. Did you mean jobs? Sure y

Re: Too large class path for map reduce jobs

2010-09-17 Thread Allen Wittenauer
On Sep 17, 2010, at 4:56 AM, Henning Blohm wrote: > When running map reduce tasks in Hadoop I run into classpath issues. Contrary > to previous posts, my problem is not that I am missing classes on the Task's > class path (we have a perfect solution for that) but rather find too many > (e.g. E

Re: TOTAL_LAUNCHED_MAPS Counter

2010-09-09 Thread Allen Wittenauer
On Sep 9, 2010, at 11:42 AM, Elton Pinto wrote:
> Does anyone know the difference between the Hadoop counter TOTAL_LAUNCHED_MAPS and the "mapred.map.tasks" parameter available in the JobConf?

mapred.map.tasks is what Hadoop thinks you need at a minimum. TOTAL_LAUNCHED_MAPS will be all ma

Re: Map output in the map side is 10 bytes bigger than on the reduce?

2010-08-09 Thread Allen Wittenauer
On Aug 9, 2010, at 1:27 PM, Pedro Costa wrote:
> 2 - If I'm deducing correctly, the reduce will always fetch 10 bytes less than the saved map output?

Why do you care?

Re: specify different number of mapper tasks for different machines

2010-07-14 Thread Allen Wittenauer
On Jul 14, 2010, at 11:50 AM, Shaojun Zhao wrote:
> Is there any way to specify that some machines run, say, 8 mapper tasks, while some machines run only 2 tasks?

A custom mapred-site.xml per machine.
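
As a rough sketch of what "a custom mapred-site.xml per machine" could look like, the helper below prints a minimal file that differs only in its map slot count. The property name mapred.tasktracker.map.tasks.maximum is the one quoted elsewhere in this archive; the function name `write_mapred_site`, the output file names, and the slot counts are assumptions made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper (not part of Hadoop): print a minimal mapred-site.xml
# that sets the number of concurrent map slots for one TaskTracker.
# Usage: write_mapred_site <map_slots>
write_mapred_site() {
    slots="$1"
    cat <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>${slots}</value>
  </property>
</configuration>
EOF
}

# Example: one file for the large machines (8 slots), one for the small (2).
write_mapred_site 8 > mapred-site.big.xml
write_mapred_site 2 > mapred-site.small.xml
```

Each generated file would then be copied into the configuration directory of the matching machines as mapred-site.xml. A real file will normally carry other site settings as well, so treat this as the fragment that varies per host, not as a complete configuration.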

Re: How many ports and processes are open/created in MR?

2010-06-23 Thread Allen Wittenauer
On Jun 23, 2010, at 3:13 PM, Pedro Costa wrote: > 1 - Hadoop uses several ports to run. It exists ports for HDFS, for the > MapReduce JvmTasks, etc. I don't know how I can identify all the ports that a > MapReduce and HDFS uses. I'm running the wordcount example, and I would like > to see what

Re: HDFS Errors

2010-06-22 Thread Allen Wittenauer
On Jun 22, 2010, at 1:58 PM, Steve Lewis wrote:
> train...@hadoop1:~$ hadoop dfsadmin -safemode get
> Safe mode is OFF

OK, so you are out of safemode.

> train...@hadoop1:~$ hadoop dfsadmin -refreshNodes

This just re-reads the list of nodes. hadoop dfsadmin -report might be more useful.

Re: HDFS Errors

2010-06-22 Thread Allen Wittenauer
On Jun 22, 2010, at 12:55 PM, Steve Lewis wrote:
> /user/training/small_yeast/yeast_chrXIV0006.sam.gz could only be replicated to 0 nodes, instead of 1

... almost always means the namenode doesn't think it has any viable datanodes (anymore).

> Anyone seen this and know how to fix it
> I

Re: how to set max map tasks individually for each job?

2010-06-04 Thread Allen Wittenauer
On Jun 3, 2010, at 1:45 AM, Alex Munteanu wrote: > I am running several different mapreduce jobs. For some of them it is > better to have a rather high number of running map tasks per node, > whereas others do very intensive read operations on our database > resulting in read timeouts. So for thes

Re: Do we shoot ourselves by using all task slots?

2010-05-28 Thread Allen Wittenauer
On May 28, 2010, at 11:43 AM, Todd Lipcon wrote: > Hi Allen, > > Recent versions of the fair scheduler have configurations for "delay > scheduling" - essentially, it will wait for a few seconds when a slot opens > up to try to find a local task before assigning a non-local one. This is > spec

Do we shoot ourselves by using all task slots?

2010-05-28 Thread Allen Wittenauer
I've been thinking (which is always a dangerous thing) about data locality lately. If we look at file systems, there is this idea of 'reserved space'. This space is used for a variety of reasons, including to reduce fragmentation on busy file systems. This allows the file s

Re: Separate communications of HDFS and MapReduce

2010-04-26 Thread Allen Wittenauer
On Apr 26, 2010, at 6:23 AM, Druilhe Remi wrote: > For example, when I run "wordcount" example, there is HDFS communications and > MapReduce communications and I am not able to distinguish which packet belong > to HDFS or to MapReduce. This shouldn't be too surprising given that the MapReduce j

Re: Distributed Cache

2010-04-03 Thread Allen Wittenauer
On Apr 2, 2010, at 11:44 PM, Raja Thiruvathuru wrote:
> DistributedCache.addCacheFile(new URI("hdfs://localhost:9000/user/guest/lib/userlib.jar"), conf);
> DistributedCache.addArchiveToClassPath(new Path("hdfs://localhost:9000/user/guest/lib/userlib.jar"), conf);

localhos

Re: How to Recommission?

2010-03-31 Thread Allen Wittenauer
On 3/31/10 8:12 PM, "Zhanlei Ma" wrote:
> But how to Recommission? Wish your help.

Take them out of dfs.exclude and run refreshNodes again.
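
The two steps in that reply can be sketched as below. This is an illustration under assumptions, not the poster's procedure: `remove_from_exclude` is a made-up helper, the file name `dfs.exclude.example` and the host names are placeholders, and the real exclude file is whichever path the cluster's dfs.hosts.exclude setting points at. The `hadoop dfsadmin -refreshNodes` command is the one shown elsewhere in this archive.

```shell
#!/bin/sh
# Hypothetical helper: remove one hostname from an exclude file, in place.
# Usage: remove_from_exclude <exclude_file> <hostname>
remove_from_exclude() {
    file="$1"
    host="$2"
    tmp="${file}.tmp.$$"
    # -F fixed string, -x whole-line match, -v keep the lines that do NOT match.
    # grep exits non-zero when no lines remain, which is acceptable here.
    grep -Fxv -- "$host" "$file" > "$tmp" || true
    mv "$tmp" "$file"
}

# Example on a throwaway file with placeholder host names.
printf 'node07\nnode12\nnode19\n' > dfs.exclude.example
remove_from_exclude dfs.exclude.example node12
cat dfs.exclude.example

# Ask the NameNode to re-read its node lists (skipped if hadoop is not installed).
if command -v hadoop >/dev/null 2>&1; then
    hadoop dfsadmin -refreshNodes
fi
```

Matching whole lines with `-x` keeps a host such as node1 from also removing node12 from the file.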

Re: Setting the group for output files

2010-03-11 Thread Allen Wittenauer
On 3/11/10 11:05 AM, "Gregory Lawrence" wrote: > Is there a way to set the output group for a mapreduce (or hdfs fs operation) > job? For example -Ddfs.umaskmode=027 successfully sets the permissions. I > would think the -Dgroup.name=GROUP would do a similar thing for the file's > group. Howeve

Re: Question about setting the number of mappers.

2010-01-19 Thread Allen Wittenauer
l file. > > Cheers, > > Teryl > > > On Tue, Jan 19, 2010 at 4:32 PM, Allen Wittenauer > wrote: > >> What is the value of: >> >> mapred.tasktracker.map.tasks.maximum >> mapred.tasktracker.reduce.tasks.maximum >> >> >> On 1

Re: Question about setting the number of mappers.

2010-01-19 Thread Allen Wittenauer
What is the value of:

mapred.tasktracker.map.tasks.maximum
mapred.tasktracker.reduce.tasks.maximum

On 1/19/10 10:23 AM, "Teryl Taylor" wrote:
> Hi guys,
>
> Thanks for the answers. Michael, yes you are right, that is what I guess, I'm looking for...how to reduce the number of mappers runn

Re: How to use MultipleTextOutputFormat ?

2009-10-27 Thread Allen Wittenauer
On 10/27/09 3:34 AM, "tim robertson" wrote:
> to create file /user/root/delme2/resource-101 for

I wouldn't recommend running your grid/jobs as root. :)