Re: Is intermediate data produced by mappers always flushed to disk ?

2009-05-19 Thread Billy Pearson
The only way to do something like this is to get the mappers to use something like /dev/shm as their storage folder (the mapred.local.dir setting), since that is 100% memory. Outside of that, everything is flushed, because the mapper exits when it is done; the tasktracker is the one delivering the output to the reduce task. Billy "paula_t

Re: Hadoop & Python

2009-05-19 Thread Billy Pearson
I have used streaming and PHP before to process a data set of about 1TB without any problems at all. Billy "s d" wrote in message news:24b53fa00905191035w41b115c1q94502ee82be43...@mail.gmail.com... Thanks. So in the overall scheme of things, what is the general feeling ab

Re: Regarding Capacity Scheduler

2009-05-14 Thread Billy Pearson
I am seeing the same problem I posted on the list on the 11th and have not gotten any reply. Billy - Original Message - From: "Manish Katyal" Newsgroups: gmane.comp.jakarta.lucene.hadoop.user To: Sent: Wednesday, May 13, 2009 11:48 AM Subject: Regarding Capacity Scheduler I'm exp

Capacity Scheduler?

2009-05-11 Thread Billy Pearson
Does the Capacity Scheduler not reclaim reduce tasks under the setting mapred.capacity-scheduler.queue.{name}.reclaim-time-limit? In my test it only reclaims map tasks when it cannot get its full Guaranteed Capacity. Billy

Re: How to do load control of MapReduce

2009-05-11 Thread Billy Pearson
You might try setting the tasktracker's Linux nice level to, say, 5 or 10, leaving the DFS and HBase processes at 0. Billy "zsongbo" wrote in message news:fa03480d0905110549j7f09be13qd434ca41c9f84...@mail.gmail.com... Hi all, Now, if we have a large dataset to process by MapReduce. The MapReduce will ta

Re: Logging in Hadoop Stream jobs

2009-05-09 Thread Billy Pearson
When I was looking to capture debugging data about my scripts I would just write to the stderr stream; in PHP it looks like fwrite(STDERR, "message you want here"); and it then gets captured in the task logs when you view the details of each task. Billy "Mayuran Yogarajah" wrote in message news:4a049154.607
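A minimal sketch of that approach in a PHP streaming script; the messages and record counting here are illustrative, but anything written to STDERR shows up in the task's log:

#!/usr/bin/php
<?php
// Streaming mapper sketch: pass records through on stdout and write
// debug messages to stderr, which end up in the per-task logs.
$count = 0;
while (($line = fgets(STDIN)) !== false) {
    echo $line;                                        // normal map output
    $count++;
    if ($count % 10000 === 0) {
        fwrite(STDERR, "processed $count records\n");  // debug message
    }
}
fwrite(STDERR, "done, $count records total\n");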

Re: Sequence of Streaming Jobs

2009-05-02 Thread Billy Pearson
In PHP I run the job commands with exec(), which has a variable that stores the exit status code. Billy "Mayuran Yogarajah" wrote in message news:49fc975a.3030...@casalemedia.com... Billy Pearson wrote: I did this with an array of commands for the jobs in a php script
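A sketch of such a driver script; the hadoop commands below are placeholders, not the original jobs, and exec()'s third argument receives the exit status the message refers to:

#!/usr/bin/php
<?php
// Driver sketch: run a sequence of streaming jobs, stopping if one fails.
$jobs = array(
    'hadoop jar hadoop-streaming.jar -input /data/in    -output /data/step1 -mapper map1.php -reducer reduce1.php',
    'hadoop jar hadoop-streaming.jar -input /data/step1 -output /data/step2 -mapper map2.php -reducer reduce2.php',
);
foreach ($jobs as $cmd) {
    $output = array();              // exec() appends, so reset each time
    exec($cmd, $output, $status);   // $status receives the job's exit code
    if ($status !== 0) {
        fwrite(STDERR, "job failed ($status): $cmd\n");
        exit($status);
    }
}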

Re: Sequence of Streaming Jobs

2009-05-02 Thread Billy Pearson
I did this with an array of commands for the jobs in a PHP script, checking the return code of each job to tell whether it failed or not. Billy "Dan Milstein" wrote in message news:58d66a11-b59c-49f8-b72f-7507482c3...@hubteam.com... If I've got a sequence of streaming jobs, each of which depends on the

Re: How to run many jobs at the same time?

2009-04-21 Thread Billy Pearson
The only way I know of is to try using different scheduling queues for each group. Billy "nguyenhuynh.mr" wrote in message news:49ee6e56.7080...@gmail.com... Tom White wrote: You need to start each JobControl in its own thread so they can run concurrently. Something like: Thread t = new

Re: NameNode resilency

2009-04-11 Thread Billy Pearson
Not 100% sure, but I think they plan on using ZooKeeper to help with namenode failover, though that may have changed. Billy "Stas Oskin" wrote in message news:77938bc20904110243u7a2baa6dw6d710e4e51ae0...@mail.gmail.com... Hi. I wonder what the Hadoop community uses in order to make NameNode resili

Re: Reduce task attempt retry strategy

2009-04-06 Thread Billy Pearson
I have seen the same thing happening on the 0.19 branch. When a task fails on the reduce end it always retries on the same node until it kills the job for too many failed tries on one reduce task. I am running a cluster of 7 nodes. Billy "Stefan Will" wrote in message news:c5ff7f91.18c09%stefan.w..

Re: hdfs-doubt

2009-03-29 Thread Billy Pearson
Your client doesn't have to be on the namenode; it can be on any system that can access the namenode and the datanodes. Hadoop uses 64MB blocks to store files, so file sizes >= 64MB should be as efficient as 128MB or 1GB file sizes. More reading and information here: http://wiki.apache.org/hadoop

Re: Reducer hanging at 66%

2009-03-29 Thread Billy Pearson
66% is the start of the reduce function, so it is likely an endless loop there burning the CPU cycles. "Amandeep Khurana" wrote in message news:35a22e220903271631i25ff749bx5814348e66ff4...@mail.gmail.com... I have a MR job running on approximately 15 lines of data in a text file. The reducer

Re: Typical hardware configurations

2009-03-29 Thread Billy Pearson
I run a 10-node cluster with 2 cores at 2.4GHz, 4GB RAM, and dual 250GB drives per node. I run on used 32-bit servers, so I can only give HBase a 2GB heap, but I still have memory left for the tasktracker and datanode. More files in Hadoop = more memory used on the namenode. The HBase master is lightly loaded so I

Re: reduce task failing after 24 hours waiting

2009-03-26 Thread Billy Pearson
...I'm not sure. Thanks Amareshwari Billy Pearson wrote: I am seeing on one of my long-running jobs (about 50-60 hours) that after 24 hours all active reduce tasks fail with the error message java.io.IOException: Task process exit with nonzero status of 255. at org.apache.hadoop.mapred

Re: reduce task failing after 24 hours waiting

2009-03-26 Thread Billy Pearson
adasu wrote: Set mapred.jobtracker.retirejob.interval (this is used to retire completed jobs) and mapred.userlog.retain.hours (this is used to discard user logs) to higher values. By default, their values are 24 hours. These might be the reason for the failure, though I'm not sure. Thanks Amareshwari

reduce task failing after 24 hours waiting

2009-03-25 Thread Billy Pearson
I am seeing on one of my long-running jobs (about 50-60 hours) that after 24 hours all active reduce tasks fail with the error message java.io.IOException: Task process exit with nonzero status of 255. at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:418) Is there something in the confi

Re: intermediate results not getting compressed

2009-03-19 Thread Billy Pearson
tputs, but not any intermediate files produced. The reducer will decompress the map output files after copying them, and then compress its own output only after it has finished. I wonder if this is by design, or just an oversight. -- Stefan From: Billy Pearson Reply-To: Date: Wed, 18 Mar 200

Re: intermediate results not getting compressed

2009-03-19 Thread Billy Pearson
Open issue: https://issues.apache.org/jira/browse/HADOOP-5539 Billy "Billy Pearson" wrote in message news:cecf0598d9ca40a08e777568361de...@billypc... How are you concluding that the intermediate output is compressed from the map, but not in the reduce? -C my hadoo

Re: intermediate results not getting compressed

2009-03-19 Thread Billy Pearson
er outputs, but not any intermediate files produced. The reducer will decompress the map output files after copying them, and then compress its own output only after it has finished. I wonder if this is by design, or just an oversight. -- Stefan From: Billy Pearson Reply-To: Date: Wed, 18 Mar

Re: intermediate results not getting compressed

2009-03-19 Thread Billy Pearson
How are you concluding that the intermediate output is compressed from the map, but not in the reduce? -C
my hadoop-site.xml:
mapred.compress.map.output true
Should the job outputs be compressed?
mapred.output.compression.type BLOCK
If the job outputs are to compressed as Sequenc

Re: intermediate results not getting compressed

2009-03-18 Thread Billy Pearson
I can run head on the map.out files and I get compressed garbage, but when I run head on an intermediate file I can read the data in the file clearly, so compression is not getting passed on, even though I am setting CompressMapOutput to true by default in my hadoop-site.conf file. Billy "Billy Pe

Re: intermediate results not getting compressed

2009-03-18 Thread Billy Pearson
running at the same time. Billy Pearson

Re: intermediate results not getting compressed

2009-03-17 Thread Billy Pearson
Watching a second job with more reduce tasks running, it looks like the in-memory merges are working correctly with compression. The task I was watching failed and was running again; it shuffled all the map output files, then started the merge after all were copied, so none were merged in memory; it was c

Re: intermediate results not getting compressed

2009-03-17 Thread Billy Pearson
I understand that I have CompressMapOutput set and it works; the map outputs are compressed. But on the reduce end it downloads x files, then merges those x files into one intermediate file to keep the number of files to a minimum (<= io.sort.factor). My problem is the output from merging the inte

intermediate results not getting compressed

2009-03-16 Thread Billy Pearson
I am running a large streaming job that processes about 3TB of data. I am seeing large jumps in hard drive space usage in the reduce part of the job, and I tracked the problem down. The job is set to compress map outputs, but looking at the intermediate files on the local drives, the intermediate

Re: Hadoop job using multiple input files

2009-02-06 Thread Billy Pearson
If it were me I would prefix the map output values with a: and n:, a: for address and n: for number. Then on the reduce you could test the value to see if it is the address or the number with if statements (see the sketch below); no need to worry about which one comes first, just make sure they both have been set before outp
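A minimal PHP streaming reducer sketch of that idea, assuming tab-separated key/value lines and the a:/n: prefixes described above (field names are illustrative, not from the original job):

#!/usr/bin/php
<?php
// Reducer sketch: collect the a: (address) and n: (number) values seen
// for each key and emit one merged record once both have been set.
$currentKey = null;
$address = null;
$number = null;

function emit($key, $address, $number) {
    if ($key !== null && $address !== null && $number !== null) {
        echo "$key\t$address\t$number\n";
    }
}

while (($line = fgets(STDIN)) !== false) {
    list($key, $value) = explode("\t", rtrim($line, "\n"), 2);
    if ($key !== $currentKey) {
        emit($currentKey, $address, $number);   // flush the previous key
        $currentKey = $key;
        $address = $number = null;
    }
    if (strncmp($value, "a:", 2) === 0) {
        $address = substr($value, 2);
    } elseif (strncmp($value, "n:", 2) === 0) {
        $number = substr($value, 2);
    }
}
emit($currentKey, $address, $number);           // flush the last key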

Re: hadoop balancing data

2009-01-24 Thread Billy Pearson
the disks with mapred, and mapred tasks may use a lot of disk temporarily. So trying to keep the same % free is impossible most of the time. Hairong On 1/19/09 10:28 PM, "Billy Pearson" wrote: Why do we not use the Remaining % in place of the Used % when we are selecting a datanode f

hadoop balancing data

2009-01-19 Thread Billy Pearson
Why do we not use the Remaining % in place of the Used % when we are selecting a datanode for new data and when running the balancer? From what I can tell we are using the % used, and we do not factor in non-DFS Used at all. I see a datanode with only a 60GB hard drive fill up completely to 100% be

Re: Namenode BlocksMap on Disk

2008-11-26 Thread Billy Pearson
Doug: If we use the heap as a cache and you have a large cluster, then you will have the memory on the NN to handle keeping all of the namespace in memory. We are also looking for a way to support smaller clusters that might overrun their heap size, causing the cluster to crash. So if the NN has the

Re: Namenode BlocksMap on Disk

2008-11-26 Thread Billy Pearson
I would like to see something like this also. I run 32-bit servers, so I am limited on how much memory I can use for heap. Besides just storing to disk, I would like to see some sort of cache, like a block cache, that will cache parts of the BlocksMap; this would help reduce the hits to disk for lookups a

Re: The Case of a Long Running Hadoop System

2008-11-15 Thread Billy Pearson
If I understand correctly, the secondary namenode merges the edits log into the fsimage and reduces the edit log size, which is likely the root of your problem. 8.5GB seems large and is likely putting a strain on your master server's memory and IO bandwidth. Why do you not have a secondary namenode? If you do

Re: Any Way to Skip Mapping?

2008-11-03 Thread Billy Pearson
I need the reduce to sort so I can merge the records and output them in sorted order. I do not need to join any data, just merge rows together, so I do not think the join will be any help. I am storing the data like >> with a sorted map as the value, and on the merge I need to take all the rows tha

Any Way to Skip Mapping?

2008-11-01 Thread Billy Pearson
I have a job that merges multiple output directories of MR jobs that run over time. Their outputs are all in the same format, and the MR job that merges them uses a mapper that just outputs the same key,value it is given, basically the same as the IdentityMapper (see the sketch below). The problem I am seeing is as I add
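A PHP identity mapper for streaming is nothing more than a pass-through of stdin to stdout; a minimal sketch of what the message describes (not the original job's code):

#!/usr/bin/php
<?php
// Identity mapper sketch for streaming: echo every input line unchanged,
// so the existing key,value pairs pass straight through to the sort/merge.
while (($line = fgets(STDIN)) !== false) {
    echo $line;
}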

Re: Improving locality of table access...

2008-10-22 Thread Billy Pearson
Generate a patch and post it here: https://issues.apache.org/jira/browse/HBASE-675 Billy "Arthur van Hoff" <[EMAIL PROTECTED]> wrote in message news:[EMAIL PROTECTED] Hi, Below is some code for improving the read performance of large tables by processing each region on the host holding that re

Re: Maps running after reducers complete successfully?

2008-10-03 Thread Billy Pearson
Do we not have an option to store the map results in hdfs? Billy "Owen O'Malley" <[EMAIL PROTECTED]> wrote in message news:[EMAIL PROTECTED] It isn't optimal, but it is the expected behavior. In general when we lose a TaskTracker, we want the map outputs regenerated so that any reduces that n

Re: Can hadoop sort by values rather than keys?

2008-09-28 Thread Billy Pearson
You might be able to use InverseMapper.class to help flip the key/value to value/key. Billy "Jeremy Chow" <[EMAIL PROTECTED]> wrote in message news:[EMAIL PROTECTED] Hi list, The default way Hadoop does its sorting is by keys; can it sort by values rather than keys? Regards, Jeremy -- My rese
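In a streaming job the same flip can be done by hand; a minimal PHP mapper sketch, assuming tab-separated key/value lines (not code from the original thread):

#!/usr/bin/php
<?php
// Mapper sketch: swap key and value on each tab-separated line so the
// framework's sort phase orders records by the original value.
while (($line = fgets(STDIN)) !== false) {
    list($key, $value) = explode("\t", rtrim($line, "\n"), 2);
    echo "$value\t$key\n";
}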

Re: adding nodes while computing

2008-09-28 Thread Billy Pearson
You should be able to add nodes to the cluster while jobs are running; the jobtracker should start assigning tasks to the new tasktrackers, and DFS should start using the nodes for storage. But map output files are stored on the slaves and copied to the reduce tasks, so if a node goes down during a MR job

Re: Is Hadoop the thing for us ?

2008-06-27 Thread Billy Pearson
I do not totally understand the job you are running, but if each simulation can run independently of the others, then you could run a MapReduce job that spreads the simulations over many servers so each one can run one or more at the same time. This will give you a level of protection on server

Re: how to write data to one file on HDFS by some clients synchronized?

2008-06-24 Thread Billy Pearson
https://issues.apache.org/jira/browse/HADOOP-1700 "过佳" <[EMAIL PROTECTED]> wrote in message news:[EMAIL PROTECTED] Does HDFS support it? I need it to be synchronized, e.g. I have many clients writing lots of IntWritables to one file. Best. Jarvis.

MR input Format Type mismatch

2008-06-21 Thread Billy Pearson
2008-06-21 20:30:18,928 WARN org.apache.hadoop.mapred.TaskTracker: Error running child java.io.IOException: Type mismatch in key from map: expected org.apache.hadoop.io.LongWritable, recieved org.apache.hadoop.io.Text at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:419)

Re: Ec2 and MR Job question

2008-06-14 Thread Billy Pearson
when I need to add extra CPU power to my cluster, and to automatically start the tasktracker via a shell script that can be run at startup. Billy "Billy Pearson" <[EMAIL PROTECTED]> wrote in message news:[EMAIL PROTECTED] I have a question someone may have answered here before bu

Re: Ec2 and MR Job question

2008-06-14 Thread Billy Pearson
/docs/r0.16.4/api/org/apache/hadoop/mapred/lib/IdentityMapper.html ckw On Jun 14, 2008, at 1:31 PM, Billy Pearson wrote: I have a question someone may have answered here before, but I cannot find the answer. Assuming I have a cluster of servers hosting a large amount of data, I want to run a la

Ec2 and MR Job question

2008-06-14 Thread Billy Pearson
I have a question someone may have answered here before, but I cannot find the answer. Assuming I have a cluster of servers hosting a large amount of data, I want to run a large job where the maps take a lot of CPU power to run and the reduces only take a small amount of CPU. I want to run th

Re: Streaming --counters question

2008-06-10 Thread Billy Pearson
Streaming works on stdin and stdout, so unless there were a way to capture the stdout as a counter, I do not see any other way to report them to the jobtracker, unless there were a URL the task could call on the jobtracker to update counters. Billy "Miles Osborne" <[EMAIL PROTECTED]> wrote in m
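For reference, later Hadoop streaming releases do let a task update counters by writing lines of the form reporter:counter:<group>,<counter>,<amount> to stderr; a minimal PHP mapper sketch, assuming such a release (the counter group and names here are made up):

#!/usr/bin/php
<?php
// Mapper sketch: bump job counters through the reporter:counter stderr
// protocol (assumes a streaming release that understands these lines).
while (($line = fgets(STDIN)) !== false) {
    $line = rtrim($line, "\n");
    if ($line === '') {
        fwrite(STDERR, "reporter:counter:MyJob,SkippedRecords,1\n");
        continue;
    }
    echo $line . "\n";   // normal map output
    fwrite(STDERR, "reporter:counter:MyJob,ProcessedRecords,1\n");
}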