RE: Memory mapped file in DFS?

2008-06-26 Thread Goel, Ankur
In a map-reduce setting, files are read as a sequence of records. In the mappers you process each record to generate an intermediate set of (key, value) pairs. All the values for a particular key are collected, grouped together, and provided as (key, value1, value2...) to the reducers. The input data-set
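
To make that flow concrete, here is a minimal word-count-style sketch against the old org.apache.hadoop.mapred API; the class names are illustrative, not code from this thread:

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Mapper: each input record (offset, line) is turned into intermediate (word, 1) pairs.
public class WordCountMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    for (String token : value.toString().split("\\s+")) {
      if (token.length() > 0) {
        word.set(token);
        output.collect(word, ONE);   // emit intermediate (key, value)
      }
    }
  }
}

// Reducer: the framework hands us one key together with all of its grouped values.
class WordCountReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();    // iterate over (key, value1, value2, ...)
    }
    output.collect(key, new IntWritable(sum));
  }
}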

HDFS blocks

2008-06-27 Thread Goel, Ankur
Hi Folks, I have a setup wherein I am streaming data into HDFS from a remote location and creating a new file every X min. The files generated are very small (512 KB - 6 MB). Since that is the size range, the streaming code sets the block size to 6 MB, whereas the default that we
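
For reference, a per-file block size can be set at create time; a minimal sketch (the path, 6 MB figure, and class name here are illustrative, not the poster's actual streaming code):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SmallFileWriter {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Override the cluster-wide dfs.block.size for just this file:
    // 6 MB blocks, since the files themselves are tiny.
    long blockSize = 6L * 1024 * 1024;
    short replication = fs.getDefaultReplication();
    int bufferSize = conf.getInt("io.file.buffer.size", 4096);

    FSDataOutputStream out = fs.create(
        new Path("/logs/stream-000123.log"),   // hypothetical target path
        true, bufferSize, replication, blockSize);
    out.writeBytes("one streamed record\n");
    out.close();
  }
}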

Using value aggregator framework with MultipleTextOutputFormat

2008-06-27 Thread Goel, Ankur
Hi All, Has anyone used the value aggregator framework with MultipleTextOutputFormat? The javadoc presently lists TextOutputFormat and SequenceFileOutputFormat as the options. What I want to do is specify different aggregator plugins and, based upon the key names, collect output to different
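
For illustration, routing records to different files by key is usually done by subclassing MultipleTextOutputFormat and overriding generateFileNameForKeyValue(); the "plugin:key" naming and the class name below are assumptions, not the aggregator framework's actual conventions:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

// Routes each output record to a file named after a prefix of its key, so
// records produced by different aggregator plugins land in different files.
public class AggregatorOutputFormat extends MultipleTextOutputFormat<Text, Text> {
  @Override
  protected String generateFileNameForKeyValue(Text key, Text value, String name) {
    String k = key.toString();
    int sep = k.indexOf(':');
    String plugin = (sep > 0) ? k.substring(0, sep) : "default";
    return plugin + "-" + name;    // e.g. LongValueSum-part-00000
  }
}

Plugging this in would mean setting it as the job's output format, which (as the follow-up below notes) the aggregator driver does not currently allow.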

RE: Using value aggregator framework with MultipleTextOutputFormat

2008-06-27 Thread Goel, Ankur
I guess I made a mistake. After having a look at the source code, it looks like there is a choice between input formats (sequence or text), but the output format is hardcoded to be text. That is something that can be improved.

RE: one input file per map

2008-07-02 Thread Goel, Ankur
Nope. But if the intent is so, there are 2 ways of doing it: 1. Just extend the input format of your choice and override the isSplitable() method to return false (see the sketch below). 2. Compress your text file using a compression format supported by hadoop (e.g. gzip). This will ensure that one map task processes 1 f
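
A sketch of option 1, assuming TextInputFormat is the format of choice (the class name is made up):

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.TextInputFormat;

// A TextInputFormat whose files are never split, so each input file
// is handled by exactly one map task.
public class NonSplittableTextInputFormat extends TextInputFormat {
  @Override
  protected boolean isSplitable(FileSystem fs, Path file) {
    return false;
  }
}

It is wired into the job with conf.setInputFormat(NonSplittableTextInputFormat.class).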

MultiFileInputFormat - Not enough mappers

2008-07-11 Thread Goel, Ankur
Hi Folks, I am using hadoop to process some temporal data which is split into a lot of small files (~3 - 4 MB). Using TextInputFormat resulted in too many mappers (1 per file), creating a lot of overhead, so I switched to MultiFileInputFormat (MultiFileWordCount.MyInputFormat), which result

RE: MultiFileInputFormat - Not enough mappers

2008-07-11 Thread Goel, Ankur
Use JobConf#setNumMapTasks() or the command-line arg -D mapred.map.tasks=<n>.

RE: How to chain multiple hadoop jobs?

2008-07-15 Thread Goel, Ankur
Hadoop typically complains if you try to re-use a JobConf object by modifying job parameters (Mapper, Reducer, output path, etc.) and re-submitting it to the job client. You should be creating a new JobConf object for every map-reduce job, and if there are some parameters that should be copied from p
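
A sketch of that pattern using the old JobConf/JobClient API; IdentityMapper/IdentityReducer and the paths here merely stand in for real job classes:

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class JobChain {
  public static void main(String[] args) throws IOException {
    Path input = new Path(args[0]);
    Path intermediate = new Path(args[1]);   // output of job 1, input of job 2
    Path output = new Path(args[2]);

    // Job 1 gets its own JobConf.
    JobConf job1 = new JobConf(JobChain.class);
    job1.setJobName("chain-step-1");
    job1.setMapperClass(IdentityMapper.class);     // stand-ins for real classes
    job1.setReducerClass(IdentityReducer.class);
    job1.setOutputKeyClass(LongWritable.class);
    job1.setOutputValueClass(Text.class);
    FileInputFormat.setInputPaths(job1, input);
    FileOutputFormat.setOutputPath(job1, intermediate);
    JobClient.runJob(job1);                        // blocks until job 1 completes

    // Job 2 is a brand-new JobConf, not a re-used and modified job1.
    JobConf job2 = new JobConf(JobChain.class);
    job2.setJobName("chain-step-2");
    job2.setMapperClass(IdentityMapper.class);
    job2.setReducerClass(IdentityReducer.class);
    job2.setOutputKeyClass(LongWritable.class);
    job2.setOutputValueClass(Text.class);
    // Any parameter that should carry over is copied explicitly:
    job2.setNumReduceTasks(job1.getNumReduceTasks());
    FileInputFormat.setInputPaths(job2, intermediate);
    FileOutputFormat.setOutputPath(job2, output);
    JobClient.runJob(job2);
  }
}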

RE: Pulling input from http?

2008-07-15 Thread Goel, Ankur
Assuming the intent is to transfer HTTP logs into hadoop, there are 2 ways of doing it that I can tell: 1. Have a writer plugged into the HTTP server, or simply tail the logs to the writer's input stream, which then writes them into HDFS files (a sketch follows below). 2. Enable FTP access on your server's log directory. This w
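
A rough sketch of option 1, assuming the writer simply reads log lines from stdin (e.g. piped from tail -f); the class name and target path are invented for illustration:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Reads log lines from stdin and writes them into a new HDFS file,
// rolling to a new file name on every run.
public class HdfsLogWriter {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path target = new Path("/logs/access-" + System.currentTimeMillis() + ".log");
    FSDataOutputStream out = fs.create(target);

    BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
    String line;
    while ((line = in.readLine()) != null) {
      out.writeBytes(line + "\n");
    }
    out.close();
    fs.close();
  }
}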

RE: How to add/remove slave nodes on run time

2008-07-15 Thread Goel, Ankur
The newly joining node will need to add its hostname to the "slaves" file and this should be copied to the conf/ dir of all the nodes in the cluster so that it comes up as a part of the next cluster restart (if any). Also doing a start-all.sh from the newly added node could be an issue if passwordl

RE: When does reducer read mapper's intermediate result?

2008-07-15 Thread Goel, Ankur
Just to elaborate a little more on what Chris said: the intermediate map-outputs are sorted and written to disk. In reduce there are 3 phases: copy, sort-merge, and reduce (when the user's reduce function is called). As mappers complete and write their sorted output to disk, the reduce tasks can co

RE: Is there a way to preempt the initial set of reduce tasks?

2008-07-16 Thread Goel, Ankur
I presume that the initial set of reducers of job1 is taking fairly long to complete, thereby denying the reducers of job2 a chance to run. I don't see a provision in hadoop to preempt a running task. This looks like an enhancement to task tracker scheduling where running tasks are preempted (afte

RE: Is there a way to preempt the initial set of reduce tasks?

2008-07-16 Thread Goel, Ankur
get scheduled even though the high priority job has some tasks to run. -Amar

RE: How can I control Number of Mappers of a job?

2008-07-30 Thread Goel, Ankur
How big is your cluster? Assuming you are running a single-node cluster, hadoop-default.xml has a parameter 'mapred.map.tasks' that is set to 2. So by default, no matter how many map tasks are calculated by the framework, only 2 map tasks will execute on a single-node cluster.

RE: How can I control Number of Mappers of a job?

2008-08-04 Thread Goel, Ankur
This can be done very easily by setting the number of mappers you want via jobConf.setNumMapTasks() and using the input format MultiFileWordCount.MyInputFormat.class, which is a concrete implementation of MultiFileInputFormat.
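
A sketch of that wiring, mirroring the MultiFileWordCount example from the Hadoop examples jar; the map count of 20 and the paths are arbitrary, and the example's inner class names may differ slightly between releases:

import org.apache.hadoop.examples.MultiFileWordCount;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.LongSumReducer;

public class ManySmallFilesJob {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(ManySmallFilesJob.class);
    conf.setJobName("many-small-files");

    // Pack many small files into each split instead of one split per file,
    // re-using the concrete MultiFileInputFormat from the examples jar.
    conf.setInputFormat(MultiFileWordCount.MyInputFormat.class);
    conf.setMapperClass(MultiFileWordCount.MapClass.class);
    conf.setCombinerClass(LongSumReducer.class);
    conf.setReducerClass(LongSumReducer.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(LongWritable.class);

    // The knob from this thread: ask the framework for a specific map count.
    conf.setNumMapTasks(20);   // 20 is an arbitrary example value

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}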

RE: 1 file per record

2008-09-26 Thread Goel, Ankur
The way this is done in hadoop-land is you create your custom InputFormat and override the getSplits(), isSplitable() and getRecordReader() APIs. The idea is that the application knows how to construct splits of the data (which is no splits in your case) and how to detect record boundaries and read re
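
A sketch of such an InputFormat that treats each whole file as a single record (key = file name, value = file bytes); the class names are made up and this follows the common whole-file-reader pattern rather than code from this thread:

import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

// One whole file per record: key = file name, value = file contents.
public class WholeFileInputFormat extends FileInputFormat<Text, BytesWritable> {

  @Override
  protected boolean isSplitable(FileSystem fs, Path file) {
    return false;                        // never split a file across map tasks
  }

  @Override
  public RecordReader<Text, BytesWritable> getRecordReader(
      InputSplit split, JobConf job, Reporter reporter) throws IOException {
    return new WholeFileRecordReader((FileSplit) split, job);
  }

  static class WholeFileRecordReader implements RecordReader<Text, BytesWritable> {
    private final FileSplit split;
    private final JobConf job;
    private boolean processed = false;

    WholeFileRecordReader(FileSplit split, JobConf job) {
      this.split = split;
      this.job = job;
    }

    public boolean next(Text key, BytesWritable value) throws IOException {
      if (processed) {
        return false;                    // the single record was already returned
      }
      Path file = split.getPath();
      FileSystem fs = file.getFileSystem(job);
      // Files are small, so reading the whole thing into memory is acceptable.
      byte[] contents = new byte[(int) split.getLength()];
      FSDataInputStream in = fs.open(file);
      try {
        IOUtils.readFully(in, contents, 0, contents.length);
      } finally {
        IOUtils.closeStream(in);
      }
      key.set(file.getName());
      value.set(contents, 0, contents.length);
      processed = true;
      return true;
    }

    public Text createKey() { return new Text(); }
    public BytesWritable createValue() { return new BytesWritable(); }
    public long getPos() { return processed ? split.getLength() : 0; }
    public float getProgress() { return processed ? 1.0f : 0.0f; }
    public void close() { }
  }
}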

IPC Client error | Too many files open

2008-09-26 Thread Goel, Ankur
Hi Folks, We have developed a simple log writer in Java that is plugged into Apache's custom log and writes log entries directly to our hadoop cluster (50 machines, quad core, each with 16 GB RAM and 800 GB hard-disk; 1 machine as a dedicated Namenode, another machine as JobTracker & TaskTracker +

Best way to handle namespace host failures

2008-11-10 Thread Goel, Ankur
Hi Folks, I am looking for some advice on some of the ways / techniques that people are using to get around namenode failures (both disk and host). We have a small cluster with several jobs scheduled for periodic execution on the same host where the name server runs. What we would like to h

RE: Best way to handle namespace host failures

2008-11-10 Thread Goel, Ankur

RE: Best way to handle namespace host failures

2008-11-10 Thread Goel, Ankur
namenode fails. As Hadoop exists today, the namenode is a single point of failure. -Alex On Mon, Nov 10, 2008 at 3:12 AM, Goel, Ankur <[EMAIL PROTECTED]> wrote: > Thanks for the replies folks. We are not seeing this frequently but we > just want to avoid a single point of failure and kee