RE: How does the JobTracker choose DataNodes to run TaskTrackers?

2011-12-15 Thread Shreya.Pal
Hi Praveenesh, The NN will send the list of DNs to the client in sorted order (nodes nearer to the client come first in the list). If one DN takes more time, Hadoop has a mechanism to deal with that: speculative execution. Speculative execution: One problem with the Hadoop system is that by dividing the tasks
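
For anyone who wants to see where speculative execution is controlled, here is a minimal sketch, assuming the stock 0.20.x JobConf API (class, method, and property names below are from that branch):

    import org.apache.hadoop.mapred.JobConf;

    public class SpeculativeDemo {
        public static void main(String[] args) {
            JobConf conf = new JobConf(SpeculativeDemo.class);
            // Equivalent to mapred.map.tasks.speculative.execution=true:
            // a straggling map task may be re-launched on another node.
            conf.setMapSpeculativeExecution(true);
            // Equivalent to mapred.reduce.tasks.speculative.execution=false.
            conf.setReduceSpeculativeExecution(false);
        }
    }

Both flags default to true in 0.20.x, so a slow DN hosting a map task will normally trigger a speculative duplicate without any extra configuration.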

How does the JobTracker choose DataNodes to run TaskTrackers?

2011-12-15 Thread praveenesh kumar
Okay, so I have one question in mind. Suppose I have a replication factor of 3 on my cluster of some N nodes, where N > 3, and there is a data block B1 that exists on some 3 DataNodes --> DD1, DD2, DD3. I want to run some mapper function on this block. My JT will communicate with the NN to know where
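
As an aside, the block-to-host mapping the JT asks the NN for is also visible to ordinary clients. A minimal sketch, assuming the 0.20.x FileSystem API (the file path is a placeholder):

    import java.util.Arrays;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockHosts {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            FileStatus stat = fs.getFileStatus(new Path("/data/input.txt")); // hypothetical file
            BlockLocation[] blocks = fs.getFileBlockLocations(stat, 0, stat.getLen());
            for (BlockLocation b : blocks) {
                // Each block lists the DataNodes holding its replicas,
                // e.g. DD1, DD2, DD3 in the question above.
                System.out.println(b.getOffset() + " -> " + Arrays.toString(b.getHosts()));
            }
        }
    }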

Generating job and topology traces from history folder of multinode cluster using Rumen

2011-12-15 Thread ArunKumar
Hi guys! I have set up a 5-node cluster with each node in a different rack. I have hadoop-0.20.2 set up in my Eclipse Helios. So, I ran TraceBuilder using Main Class: org.apache.hadoop.tools.rumen.TraceBuilder. I ran some jobs on the cluster and used a copy of the /usr/local/hadoop/logs/history folder o
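
For reference, TraceBuilder can also be driven programmatically from Eclipse; a minimal sketch, assuming a build with the Rumen classes on the classpath and the tool's documented argument order (output job trace, output topology, then input history directories), with placeholder paths:

    import org.apache.hadoop.tools.rumen.TraceBuilder;

    public class RunTraceBuilder {
        public static void main(String[] args) throws Exception {
            TraceBuilder.main(new String[] {
                "file:///tmp/job-trace.json", // job trace to generate
                "file:///tmp/topology.json",  // cluster topology to generate
                "file:///tmp/history"         // copy of the JT history folder
            });
        }
    }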

Re: streaming data ingest into HDFS

2011-12-15 Thread Joey Echeverria
You could run the Flume collectors on other machines and write a source which connects to the sockets on the data generators. -Joey On Dec 15, 2011, at 21:27, "Periya.Data" wrote: > Sorry...misworded my statement. What I meant was that the sources are meant > to be untouched and admins do
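
The connecting half of such a source could be as simple as the sketch below (plain Java, kept independent of any particular Flume version's plugin API; host and port are placeholders):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.Socket;

    public class GeneratorTail {
        public static void main(String[] args) throws Exception {
            Socket s = new Socket("generator-host", 9999); // hypothetical endpoint
            try {
                BufferedReader in = new BufferedReader(
                        new InputStreamReader(s.getInputStream(), "UTF-8"));
                String line;
                while ((line = in.readLine()) != null) {
                    // A real Flume source would emit each line as an event here.
                    System.out.println(line);
                }
            } finally {
                s.close();
            }
        }
    }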

Re: streaming data ingest into HDFS

2011-12-15 Thread Periya.Data
Sorry...misworded my statement. What I meant was that the sources are meant to be untouched, and the admins do not want to mess with them or add more tools there. All I've got are source addresses and port numbers. Once I know what technique(s) I will be using, accordingly, I will be given access via fire

Re: streaming data ingest into HDFS

2011-12-15 Thread Russell Jurney
Just curious - what is the situation you're in where no collectors are possible? Sounds interesting. Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com On Dec 15, 2011, at 5:01 PM, "Periya.Data" wrote: > Hi all, > I would like to know what options I have to ingest

streaming data ingest into HDFS

2011-12-15 Thread Periya.Data
Hi all, I would like to know what options I have to ingest terabytes of data that are being generated very fast from a small set of sources. I have thought about: 1. Flume 2. An intermediate staging server(s) where you can offload data and from there use dfs -put to load into H
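
For option 2, the copy from the staging server can be scripted with dfs -put or done programmatically; a minimal sketch of the latter, assuming the stock FileSystem API and placeholder paths:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class StagingLoader {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            // Copy a locally staged file into HDFS, as dfs -put would.
            fs.copyFromLocalFile(new Path("/staging/feed-0001.log"), // hypothetical
                                 new Path("/ingest/feed-0001.log"));
        }
    }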

Re: Map Task Capacity Not Changing

2011-12-15 Thread Raj V
Joey, what is it you really want to do? Increase the number of map slots available in the tasktracker, or increase the number of map tasks for a job? If you want to increase the number of map slots available, what you did will work - as long as you restarted the tasktracker

Re: Map Task Capacity Not Changing

2011-12-15 Thread James Warren
(moving to mapreduce-user@, bcc'ing common-user@) Hi Joey - You'll want to change the value on all of your servers running tasktrackers and then restart each tasktracker to reread the configuration. cheers, -James On Thu, Dec 15, 2011 at 3:30 PM, Joey Krabacher wrote: > I have looked up how to
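
Worth stressing: mapred.tasktracker.map.tasks.maximum is read once from each node's local mapred-site.xml when the TaskTracker starts, which is why a restart is required and why setting it in a job has no effect. A minimal sketch that just inspects what a given config file would yield (the path is a placeholder):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;

    public class SlotCheck {
        public static void main(String[] args) {
            Configuration conf = new Configuration(false); // skip the bundled defaults
            conf.addResource(new Path("/usr/local/hadoop/conf/mapred-site.xml")); // hypothetical path
            System.out.println("map slots per tasktracker: "
                + conf.getInt("mapred.tasktracker.map.tasks.maximum", 2)); // 2 is the 0.20 default
        }
    }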

Map Task Capacity Not Changing

2011-12-15 Thread Joey Krabacher
I have looked up how to increase this value on the web and have tried all the suggestions, to no avail. Any help would be great. Here is some background: Version: 0.20.2, r911707 Compiled: Fri Feb 19 08:07:34 UTC 2010 by chrisdo Nodes: 5 Current Map Task Capacity: 10 <--- this is what I want to increase

DistributedCache in NewAPI on 0.20.X branch

2011-12-15 Thread Shi Yu
Hi, I am using the 0.20.X branch. However, I need to use the new API because it has the cleanup(context) method in Mapper. I am confused about how to load the cached files in the mapper, though. I could load the DistributedCache files using the old API (JobConf), but in the new API it always returns nu
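
A minimal sketch of the usual pattern on 0.20.x with the new API, going through context.getConfiguration() in setup() (class name and file handling here are illustrative). A common cause of the null result is adding the cache file to one Configuration and reading from a different one; both calls must see the same conf:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class CacheMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void setup(Context context) throws IOException {
            // Localized copies of the files registered with the cache.
            Path[] cached = DistributedCache.getLocalCacheFiles(context.getConfiguration());
            if (cached != null && cached.length > 0) {
                BufferedReader r = new BufferedReader(new FileReader(cached[0].toString()));
                // ... read lookup data here ...
                r.close();
            }
        }
    }

On the submission side the file is registered against the same job's conf, e.g. DistributedCache.addCacheFile(new URI("/cache/lookup.txt"), job.getConfiguration()); (path hypothetical).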

RE: Large server recommendations

2011-12-15 Thread GOEKE, MATTHEW (AG/1000)
mapred.map.tasks is a suggestion to the engine and there is really no reason to define it, as it will be driven by the block-level partitioning of your files (e.g. if you have a file that is 30 blocks, then it will by default spawn 30 map tasks). As for mapred.reduce.tasks, just set it to whatever
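
A minimal sketch of the reducer-count half, assuming the 0.20 new-API Job class (the job name is a placeholder); unlike the map-side hint, mapred.reduce.tasks is honored exactly:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class ReducerCount {
        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "example"); // hypothetical job
            job.setNumReduceTasks(15); // equivalent to mapred.reduce.tasks=15
        }
    }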

Re: Large server recommendations

2011-12-15 Thread Dale McDiarmid
Thanks Matt. Assuming therefore I run a single tasktracker and have 48 cores available, based on your recommendation of 2:1 mapper-to-reducer threads I will be assigning: mapred.tasktracker.map.tasks.maximum=30 mapred.tasktracker.reduce.tasks.maximum=15 This brings me onto my question: "Ca

RE: Large server recommendations

2011-12-15 Thread GOEKE, MATTHEW (AG/1000)
Dale, Talking solely about Hadoop core, you will only need to run 4 daemons on that machine: NameNode, JobTracker, DataNode and TaskTracker. There is no reason to run more than one of any of them, as the tasktracker will spawn multiple child JVMs, which is where you will get your task parallelism. When

RE: More cores Vs More Nodes ?

2011-12-15 Thread Michael Segel
Tom, Look, I've said this before and I'm going to say it again. Your knowledge of Hadoop is purely academic. It may be ok to talk to C-level execs who visit the San Jose IM Lab or in Markham, but when you give answers on issues where you don't have first-hand practical experience, you end up doing

Large server recommendations

2011-12-15 Thread Dale McDiarmid
Hi all, I'm new to the community and to Hadoop, and was looking for some advice on optimal configurations for very large servers. I have a single server with 48 cores and 512GB of RAM and am looking to perform an LDA analysis using Mahout across approx. 180 million documents. I have configured

DistributedCache when running locally

2011-12-15 Thread Justin Vincent
Hello, I am trying to package some config data out to my mappers. I was just testing while running locally, and I can't get anything to work for me. ~/hadoop/hadoop-0.20.204.0/bin/hadoop jar the_jar.jar com.bar.ApplyMappings -files data/config_stuff.txt#config_stuff.txt input_dir output_dir confi
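
One thing to check: -files (including the #symlink syntax) is a generic option handled by GenericOptionsParser, so it only takes effect when the driver runs through ToolRunner or parses generic options itself; a bare main() that builds its own conf will treat -files as an ordinary argument. A minimal sketch of a ToolRunner-based driver (the class name mirrors the command above but the body is illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    public class ApplyMappings extends Configured implements Tool {
        public int run(String[] args) throws Exception {
            // args no longer contains -files here; the file has already been
            // registered with the DistributedCache via getConf().
            return 0;
        }

        public static void main(String[] args) throws Exception {
            System.exit(ToolRunner.run(new Configuration(), new ApplyMappings(), args));
        }
    }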

Replacing failed nodes

2011-12-15 Thread Sloot, Hans-Peter
Hi, Are there any standard procedures for replacing failed nodes from a cluster? Regards Hans-Peter

Re: NameNode - didn't persist the edit log

2011-12-15 Thread Guy Doulberg
Hi Todd, you are right, I should be more specific: 1. From the namenode log: 2011-12-11 08:57:23,245 WARN org.apache.hadoop.hdfs.server.common.Storage: rollEdidLog: removing storage /srv/hadoop/hdfs/edit 2011-12-11 08:57:23,311 WARN org.apache.hadoop.hdfs.server.common.Storage: incrementCheckpo

Re: NameNode - didn't persist the edit log

2011-12-15 Thread Todd Lipcon
Hi Guy, Several questions come to mind here: - What was the exact WARN-level message you saw? - Did you have multiple dfs.name.dirs configured, as recommended by most setup guides? - Did you try entering safemode and then running saveNamespace to persist the image before shutting down the NN? This
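
For reference, the safemode-plus-saveNamespace sequence Todd mentions can be driven through the DFSAdmin tool class as well as the shell; a minimal sketch, assuming a build that carries the -saveNamespace command:

    import org.apache.hadoop.hdfs.tools.DFSAdmin;
    import org.apache.hadoop.util.ToolRunner;

    public class PersistImage {
        public static void main(String[] args) throws Exception {
            DFSAdmin admin = new DFSAdmin();
            // Equivalent to: hadoop dfsadmin -safemode enter; hadoop dfsadmin -saveNamespace
            ToolRunner.run(admin, new String[] {"-safemode", "enter"});
            ToolRunner.run(admin, new String[] {"-saveNamespace"});
        }
    }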

NameNode - didn't persist the edit log

2011-12-15 Thread Guy Doulberg
Hi guys, We recently had the following problem on our production cluster: the filesystem containing the edit log and fsimage had no free inodes. As a result, the namenode wasn't able to obtain an inode for the fsimage and edit log after a checkpoint had been reached, while the previous files we