Re: Writing a New Aggregate Function

2009-04-24 Thread Runping Qi
thout patching the code within the > aggregator package? > > It sure doesn't look like it, but just to make sure. > > Thanks again, > -Dan M > > > On Apr 24, 2009, at 12:56 PM, Runping Qi wrote: > > A couple of general goals behind the aggregate package:

Re: Writing a New Aggregate Function

2009-04-24 Thread Runping Qi
A couple of general goals behind the aggregate package: 1. If you are an application developer using the aggregate package, you only need to develop your own (user-defined) value aggregator descriptor classes, which are typically subclasses of ValueAggregatorDescriptor. You can use the existing aggregator typ

Re: Output file name

2009-03-18 Thread Runping Qi
You need to implement your own OutputFormat. See MultipleOutputFormat class for examples. Runping On Wed, Mar 18, 2009 at 9:11 PM, Rodrigo Schmidt wrote: > > In a Hadoop job, how do I set the prefix of the output files to something > different than "part-" ? > > I mean, what should I do if I w

Re: Jobs run slower and slower

2009-03-04 Thread Runping Qi
out at the individual map task level. What would be the best > way > for me to determine that? > > -Sean > > On Wed, Mar 4, 2009 at 12:13 PM, Runping Qi wrote: > > > Do you know the breakdown of the time a mapper task takes to initialize > > and to execute the map

Re: Jobs run slower and slower

2009-03-04 Thread Runping Qi
Do you know the breakdown of the time a mapper task takes to initialize and to execute the map function? On Wed, Mar 4, 2009 at 8:44 AM, Sean Laurent wrote: > On Tue, Mar 3, 2009 at 10:14 PM, Amar Kamat wrote: > > > Yeah. Maybe it's not a problem with the JobTracker. Can you check (via > >

Re: Mappers become less utilized as time goes on?

2009-03-03 Thread Runping Qi
Were any TaskTrackers blacklisted? On Tue, Mar 3, 2009 at 3:25 PM, Nathan Marz wrote: > I'm seeing some really bizarre behavior from Hadoop 0.19.1. I have a fairly > large job with about 29000 map tasks and 72 reducers. There are 304 map task > slots in the cluster. When the job starts, it runs 3

Re: Jobs run slower and slower

2009-03-03 Thread Runping Qi
) > 5) Run #4 281.96 (secs) > > I don't think that's the problem here... :( > > -S > > > On Tue, Mar 3, 2009 at 2:33 PM, Runping Qi wrote: > > > The jobtracker's memory increased as you ran more and more jobs because > th

Re: Jobs run slower and slower

2009-03-03 Thread Runping Qi
e greatly appreciated. > > -Sean > > > On Mon, Mar 2, 2009 at 7:50 PM, Runping Qi wrote: > > > Your problem may be related to > > https://issues.apache.org/jira/browse/HADOOP-4766 > > > > Runping > > > > > > On Mon,

Re: OutOfMemory error processing large amounts of gz files

2009-03-02 Thread Runping Qi
Your job tracker out-of-memory problem may be related to https://issues.apache.org/jira/browse/HADOOP-4766 Runping On Mon, Mar 2, 2009 at 4:29 PM, bzheng wrote: > > Thanks for all the info. Upon further investigation, we are dealing with > two > separate issues: > > 1. problem processing a l

Re: Jobs run slower and slower

2009-03-02 Thread Runping Qi
Your problem may be related to https://issues.apache.org/jira/browse/HADOOP-4766 Runping On Mon, Mar 2, 2009 at 4:46 PM, Sean Laurent wrote: > Hi all, > I'm conducting some initial tests with Hadoop to better understand how well > it will handle and scale with some of our specific problems. As

Re: RAID vs. JBOD

2009-01-15 Thread Runping Qi
Yes, all the machines in the tests are new, with the same spec. The 30% to 50% throughput variations of the disks were observed on the disks of the same machines. Runping On 1/15/09 2:41 AM, "Steve Loughran" wrote: > Runping Qi wrote: >> Hi, >> >> We at Yah

Re: RAID vs. JBOD

2009-01-14 Thread Runping Qi
Hi, We at Yahoo did some Hadoop benchmarking experiments on clusters with JBOD and RAID0. We found that under heavy loads (such as gridmix), the JBOD cluster performed better. Gridmix tests: Load: gridmix2 Cluster size: 190 nodes Test results: RAID0: 75 minutes JBOD: 67 minutes Difference: 10% T

Warning on turning on ipv6 on your Hadoop clusters

2008-12-17 Thread Runping Qi
If you have turned on IPv6 on your Hadoop cluster, it may cause a severe performance hit! When I ran the gridmix2 benchmark on a newly constructed cluster, it took 30% more time than the baseline time that was obtained on a similar cluster. I noticed that some task processes on some machines

Re: Is there a way to know the input filename at Hadoop Streaming?

2008-10-26 Thread Runping Qi
Each mapper works on only one file split, which is either from file1 or file2 in your case. So the value for map.input.file gives you the exact information you need. Runping On 10/23/08 11:09 AM, "Steve Gao" <[EMAIL PROTECTED]> wrote: > Thanks, Amogh. But my case is slightly different. The
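
A minimal sketch of a streaming mapper that uses this value, assuming a Python mapper script. Streaming exports job configuration properties to the task's environment with non-alphanumeric characters replaced by underscores, so map.input.file is read here as map_input_file. The function names and the choice to prefix each line with the file's base name are illustrative assumptions, not part of the original thread.

```python
import os
import sys


def input_file_name(environ=os.environ):
    # Streaming passes job configuration through the environment;
    # map.input.file is expected to show up as map_input_file.
    return environ.get("map_input_file", "")


def tag_lines(lines, path):
    # Prefix every input line with the base name of the file it came from,
    # so a later stage can tell file1 records from file2 records.
    tag = os.path.basename(path)
    for line in lines:
        yield tag + "\t" + line.rstrip("\n")


def main():
    # Hypothetical entry point for the mapper script.
    for out in tag_lines(sys.stdin, input_file_name()):
        print(out)
```

A script built this way would call main() at the bottom and be passed to streaming with the -mapper option.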

Re: Custom InputFormat/OutputFormat

2008-07-10 Thread Runping Qi
All this is because you were using streaming. Streaming treats each line in the stream as one "record" and then breaks it into a key/value pair (using '\t' as the separator by default). If you write your mapper class in Java, the values passed to the calls to your map function should be the whole te
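
The line-to-pair rule described above can be sketched as follows. This is an illustration of the default behavior only (split at the first tab; a line with no tab becomes a key with an empty value), and split_record is a hypothetical helper name, not a streaming API.

```python
def split_record(line, separator="\t"):
    # Treat one line as one record and split it at the FIRST separator:
    # everything before it is the key, everything after it is the value.
    line = line.rstrip("\n")
    key, _, value = line.partition(separator)
    # With no separator present, the whole line is the key and the
    # value is empty.
    return key, value
```

For example, split_record("k\tv1\tv2") keeps the second tab inside the value.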

RE: RecordReader Functionality

2008-06-30 Thread Runping Qi
Your record reader must be able to find the beginning of the next record beyond the start position of a given split. Your file format must enable your record reader to detect the beginning of the next record beyond the start pos of a split. It seems to me that is not possible based on the info I s
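
As an illustration of that requirement, here is a small sketch that assumes newline-delimited records, a format in which the start of the next record can always be found. It is not Hadoop's RecordReader API: read_split and its arguments are hypothetical names. A reader for a split that does not start at byte 0 backs up one byte and discards everything through the next newline, and every reader keeps reading until its position reaches the end of its split, so a record that straddles a boundary belongs to the split in which it starts.

```python
import io


def read_split(stream, start, length):
    """Return the newline-delimited records that begin inside
    the byte range [start, start + length)."""
    end = start + length
    if start > 0:
        # Back up one byte and discard through the next newline. If a record
        # begins exactly at `start`, only the previous newline is discarded.
        stream.seek(start - 1)
        stream.readline()
    else:
        stream.seek(0)
    records = []
    while stream.tell() < end:
        line = stream.readline()
        if not line:
            break  # end of file
        records.append(line.rstrip(b"\n"))
    return records


data = io.BytesIO(b"aaa\nbbb\nccc\n")
first = read_split(data, 0, 5)   # boundary falls inside "bbb"
second = read_split(data, 5, 7)  # skips the tail of "bbb"
```

Each record is read exactly once across the two splits, even though the boundary at byte 5 falls in the middle of a record.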

RE: reducers hanging problem

2008-06-30 Thread Runping Qi
It looks like the reducer is stuck in the shuffle phase. What progress percentage do you see for the reducer in the web GUI? It is known that 0.17 does not handle shuffling well. Runping > -Original Message- > From: Andreas Kostyrka [mailto:[EMAIL PROTECTED] > Sent: Monday, June 30, 20

RE: Using value aggregator framework with MultipleTextOutputFormat

2008-06-27 Thread Runping Qi
Right. Please open a Jira for that. Runping > -Original Message- > From: Goel, Ankur [mailto:[EMAIL PROTECTED] > Sent: Friday, June 27, 2008 6:33 AM > To: core-user@hadoop.apache.org > Subject: RE: Using value aggregator framework with > MultipleTextOutputFormat > > I guess I made a m

RE: Stack Overflow When Running Job

2008-06-09 Thread Runping Qi
This is a known problem for 0.17.0: https://issues.apache.org/jira/browse/HADOOP-3442 It should be fixed in 0.17.1 Runping > -Original Message- > From: Colin Freas [mailto:[EMAIL PROTECTED] > Sent: Monday, June 09, 2008 12:56 PM > To: core-user@hadoop.apache.org > Subject: Re: Stack Ov

RE: [core-user] Help deflating output files

2008-06-04 Thread Runping Qi
You can run another map-only job to read the deflated files and write them out in the format you want. Runping > -Original Message- > From: Jim R. Wilson [mailto:[EMAIL PROTECTED] > Sent: Wednesday, June 04, 2008 4:13 PM > To: core-user@hadoop.apache.org > Subject: [core-user] H

RE: Stackoverflow

2008-06-03 Thread Runping Qi
Chris, Your version will use LongWritable as the map output key type, which changes the nature of the job completely. You should use: ${hadoop} jar hadoop-0.17-examples.jar sort -m \ >-r 88 \ >-inFormat org.apache.hadoop.mapred.KeyValueTextInputFormat \ >-outFormat org.apache.hadoop.mapred.

FW: bug on jute?

2008-06-01 Thread Runping Qi
From: Flavio Junqueira [mailto:[EMAIL PROTECTED] Sent: Saturday, May 31, 2008 2:27 AM To: [EMAIL PROTECTED] Subject: bug on jute? Hi, I found a small bug on jute, and I was wondering how to proceed with fixing it. The problem is the following. If I decla

RE: FileSystem.create

2008-05-14 Thread Runping Qi
In my experience, it helps to call Thread.sleep(100) after every N (say 1000) DFS writes. > -Original Message- > From: Xavier Stevens [mailto:[EMAIL PROTECTED] > Sent: Wednesday, May 14, 2008 10:47 AM > To: core-user@hadoop.apache.org > Subject: FileSystem.create > > I'm having some probl
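
The throttling pattern suggested above, sketched in Python rather than Java. write_throttled, write, and the injectable sleep parameter are hypothetical names used for illustration; the 1000-write batch and 100 ms pause are the figures from the reply.

```python
import time


def write_throttled(records, write, batch_size=1000, pause_seconds=0.1,
                    sleep=time.sleep):
    # Call `write` for each record and pause briefly after every
    # `batch_size` writes so the file system can catch up.
    written = 0
    for written, record in enumerate(records, start=1):
        write(record)
        if written % batch_size == 0:
            sleep(pause_seconds)
    return written
```

Passing sleep as a parameter is only a convenience for testing; in real use the default time.sleep applies.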

RE: Performance difference over two map-reduce solutions of same problem in different cluster sizes

2008-05-14 Thread Runping Qi
Your diagnosis sounds reasonable. Since the mappers of your optimized solution output 3 key/value pairs for each input key/value pair, the map output size may be three times the input size for each mapper. That size may exceed the value of io.sort.mb in your configuration. If so, the mappers h

RE: Lease expired on open file

2008-04-18 Thread Runping Qi
Sounds like you also hit this problem: https://issues.apache.org/jira/browse/HADOOP-2669 Runping > -Original Message- > From: Luca [mailto:[EMAIL PROTECTED] > Sent: Friday, April 18, 2008 1:21 AM > To: core-user@hadoop.apache.org > Subject: Re: Lease expired on open file > > dhruba Bor

RE: Counters giving double values

2008-04-16 Thread Runping Qi
Here is a related jira: https://issues.apache.org/jira/browse/HADOOP-3126 > -Original Message- > From: Devaraj Das [mailto:[EMAIL PROTECTED] > Sent: Wednesday, April 16, 2008 3:56 AM > To: core-user@hadoop.apache.org > Subject: RE: Counters giving double values > > Also, in those cases

FW: streaming + binary input/output data?

2008-04-14 Thread Runping Qi
Having observed a few emails on this list, I think the following email exchange between John and me may be of interest to a broader audience. Runping From: Runping Qi Sent: Sunday, April 13, 2008 8:58 AM To: 'JJ' Subject: RE: streaming + bi

RE: streaming + binary input/output data?

2008-04-12 Thread Runping Qi
Actually, there is an old jira about the same issue: https://issues.apache.org/jira/browse/HADOOP-1722 Runping > -Original Message- > From: John Menzer [mailto:[EMAIL PROTECTED] > Sent: Saturday, April 12, 2008 2:45 PM > To: core-user@hadoop.apache.org > Subject: RE: streaming + binary

RE: What's the proper way to use hadoop task side-effect files?

2008-04-11 Thread Runping Qi
It looks like you are using your reducer class as the combiner. The combiner will be called from mappers, potentially multiple times. If you want to create side files in the reducer, you cannot use that class as the combiner. Runping > -Original Message- > From: Zhang, jian [mailto:[EMAIL PROT
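
A small simulation of why this matters, written in plain Python rather than against the Hadoop API. run_job, group, and make_side_file_reducer are hypothetical names, and appending to a list stands in for creating a side file. The summed result is the same however often the combiner runs, but the side effect fires on every call.

```python
from collections import defaultdict


def group(pairs):
    # Collect all values that share a key.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def run_job(map_output, reducer, combiner=None, combiner_passes=0):
    # Run the combiner zero or more times, then the reducer once per key.
    pairs = list(map_output)
    if combiner is not None:
        for _ in range(combiner_passes):
            pairs = [combiner(k, vs) for k, vs in group(pairs).items()]
    return dict(reducer(k, vs) for k, vs in group(pairs).items())


def make_side_file_reducer(created):
    def reducer(key, values):
        created.append(key)  # stands in for creating a side file
        return key, sum(values)
    return reducer


map_output = [("a", 1), ("b", 2), ("a", 3)]

reduce_only = []
result_plain = run_job(map_output, make_side_file_reducer(reduce_only))

with_combiner = []
shared = make_side_file_reducer(with_combiner)
result_combined = run_job(map_output, shared, combiner=shared, combiner_passes=2)
```

Both runs produce the same sums, but the second records three times as many side-file creations as the first, which is the problem described in the reply.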

RE: DFS get blocked when writing a file.

2008-03-29 Thread Runping Qi
This is a known issue: https://issues.apache.org/jira/browse/HADOOP-3033 Your best bet now is to use 0.16.2 release. Runping > -Original Message- > From: Iván de Prado [mailto:[EMAIL PROTECTED] > Sent: Friday, March 28, 2008 6:08 AM > To: core-user@hadoop.apache.org > Subject: DFS get b

RE: Partitioning reduce output by date

2008-03-20 Thread Runping Qi
If you want to output data to different files based on date or any value parts, you may want to check https://issues.apache.org/jira/browse/HADOOP-2906 Runping > -Original Message- > From: Otis Gospodnetic [mailto:[EMAIL PROTECTED] > Sent: Thursday, March 20, 2008 4:00 PM > To: core-us

RE: Calculations involve large datasets

2008-02-22 Thread Runping Qi
There is a package for joining data from multiple sources: contrib/data-join. It implements the basic joining logic and allows the user to provide application specific logic for filtering/projecting and combining multiple records into one. Runping > -Original Message- > From: Ted Dun