Map Output

2010-09-14 Thread Yağız Kargın
Hi All, I have some questions about the map output. As far as I know, the map output is written to the local disk and then shipped to the reducer via the network. Is this correct? Is it read from and written to the disk multiple times, or only once when the map task ends? Which parts of the Hadoop code

Re: Map Output

2010-09-14 Thread Arun C Murthy
Moving to mapreduce-user@, bcc common-u...@. Please use the appropriate project list. The map output is saved to disk at the end of the map. However, if there is insufficient memory, we do intermediate spills. The config knob io.sort.mb controls the memory available to keep map outputs
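
As a concrete, hedged sketch of the knob Arun names (old org.apache.hadoop.mapred API of that era; the 200 MB figure and the spill threshold are illustrative assumptions, not recommendations):

    import org.apache.hadoop.mapred.JobConf;

    public class SpillTuning {
        public static JobConf withBiggerSortBuffer(JobConf conf) {
            // io.sort.mb: size (in MB) of the in-memory buffer that collects
            // map output before it is sorted and spilled to local disk.
            conf.setInt("io.sort.mb", 200);
            // io.sort.spill.percent: how full the buffer may get before a
            // background spill to disk begins.
            conf.setFloat("io.sort.spill.percent", 0.80f);
            return conf;
        }
    }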

Re: Map Output

2010-09-17 Thread Amogh Vasekar
Hi, >>As far as I know, the map output is written to the local disk then shipped to >>reducer via network. Is this correct? Yes. Each reducer picks up its own partition from the map output once the map task completes. However, it's a little more complicated (and very interesting) on
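
To make "each reducer picks up its own partition" concrete, here is a minimal sketch of partition assignment, mirroring the default HashPartitioner logic (newer mapreduce API; the key/value types are chosen for illustration):

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    public class SketchPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numPartitions) {
            // Reducer i later fetches exactly the records that landed in
            // partition i of every map task's output. Masking the sign bit
            // keeps the modulo result non-negative.
            return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }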

Multiple map output value types

2009-09-03 Thread ll_oz_ll
strings and add the doubles. We could have doubles as strings too and we can cast them back to doubles, but I think it would be computationally easier if they are doubles after the map job to start with. Any clues? Thanks -- View this message in context: http://www.nabble.com/Multiple-map-output

OOM Error Map output copy.

2011-12-07 Thread Niranjan Balasubramanian
All, I am encountering the following out-of-memory error during the reduce phase of a large job. Map output copy failure : java.lang.OutOfMemoryError: Java heap space at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1669) at

The value of "Map output records"

2010-05-05 Thread Dan Fundatureanu
Is there a way to get the value of "Map output records" from within the Reducer ? I want to know the total number of the "Map output records" while the Reducer is running and I've noticed this value in the web interface shown for each Map. Or is there a way to read in Red
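
No resolution appears in this thread. As a hedged sketch, the aggregate counter is easy to read once the job completes (newer mapreduce API shown; older releases expose the same counter under a different enum), whereas a live cluster-wide total inside a running Reducer is not directly supported:

    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.TaskCounter;

    public class MapOutputRecords {
        // Call after job.waitForCompletion(true) has returned.
        public static long of(Job completedJob) throws Exception {
            return completedJob.getCounters()
                    .findCounter(TaskCounter.MAP_OUTPUT_RECORDS)
                    .getValue();
        }
    }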

RE: OOM Error Map output copy.

2011-12-08 Thread Devaraj K
which version of hadoop are you using? Devaraj K -Original Message- From: Niranjan Balasubramanian [mailto:niran...@cs.washington.edu] Sent: Thursday, December 08, 2011 12:21 AM To: common-user@hadoop.apache.org Subject: OOM Error Map output copy. All, I am encountering the following out-of-mem

Re: OOM Error Map output copy.

2011-12-08 Thread Niranjan Balasubramanian
Balasubramanian [mailto:niran...@cs.washington.edu] > Sent: Thursday, December 08, 2011 12:21 AM > To: common-user@hadoop.apache.org > Subject: OOM Error Map output copy. > > All > > I am encountering the following out-of-memory error during the reduce phase > of a larg

Re: OOM Error Map output copy.

2011-12-08 Thread Niranjan Balasubramanian
:+UseSerialGC >> >> >> and also can you tell me which version of hadoop using? >> >> >> Devaraj K >> >> -Original Message- >> From: Niranjan Balasubramanian [mailto:niran...@cs.washington.edu] >> Sent: Thursday, December 08,

RE: OOM Error Map output copy.

2011-12-09 Thread Devaraj K
mx1536M -XX:+UseSerialGC >> >> >> and also can you tell me which version of hadoop using? >> >> >> Devaraj K >> >> -Original Message- >> From: Niranjan Balasubramanian [mailto:niran...@cs.washington.edu] >> Sent: Thursday,

Re: OOM Error Map output copy.

2011-12-09 Thread Arun C Murthy
our job to a queue with a small capacity and max-capacity to restrict your job to 10 or 20 concurrent reduces at a given point. Arun On Dec 7, 2011, at 10:51 AM, Niranjan Balasubramanian wrote: > All > > I am encountering the following out-of-memory error during the reduce phase >

Re: OOM Error Map output copy.

2011-12-09 Thread Prashant Kommireddi
, int compressedLength) throws IOException, InterruptedException { // Reserve ram for the map-output . . . . // Copy map-output into an in-memory buffer byte[] shuffleData = new byte[mapOutputLength]; -Prashant Kommireddi On Fri, Dec 9, 2011 at 10:29 AM

Re: OOM Error Map output copy.

2011-12-09 Thread Chandraprakash Bhagtani
throws IOException, InterruptedException { >// Reserve ram for the map-output > . > . > . > . > >// Copy map-output into an in-memory buffer >byte[] shuffleData = new byte[mapOutputLength]; > > > -Prashant Kommireddi > > On Fri, Dec

Re: OOM Error Map output copy.

2011-12-10 Thread Niranjan Balasubramanian
InputStream input, >> int mapOutputLength, >> int compressedLength) >> throws IOException, InterruptedException { >> // Reserve ram for the map-output >> . >> . >> . >> .

map output not equal to reduce input

2009-12-10 Thread Gang Luo
Hi all, after finishing one MapReduce job, the statistics show that the number of records the map generated is not equal to the number of records the reduce took as input. It says: Map output records=15 Reduce input records=93282 I think this is abnormal. Please give me some ideas on how this happened and how

What can cause: Map output copy failure

2010-01-07 Thread Mayuran Yogarajah
I'm seeing this error when a job runs: Shuffling 35338524 bytes (35338524 raw bytes) into RAM from attempt_201001051549_0036_m_03_0 Map output copy failure: java.lang.OutOfMemoryError: Java heap space at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInM

Restricting number of records from map output

2011-01-12 Thread Rakesh Davanum
Hi, I have a sort job consisting of only the Mapper (no Reducer) task. I want my results to contain only the top n records. Is there any way of restricting the number of records that are emitted by the Mappers? Basically I am looking to see if there is an equivalent of achieving the behavior simi

BZip2Codec memory usage for map output compression?

2011-01-17 Thread Attila Csordas
Hi, How can memory usage be calculated in case of BZip2Codec for map output? Cheers, Attila

why quick sort when spill map output?

2011-02-28 Thread elton sky
Hello forumers, Before spilling the data in kvbuffer to local disk in a map task, the k/v pairs are sorted using quicksort. The complexity of quicksort is O(nlogn) on average and O(n^2) in the worst case. Why use quicksort? Regards

Re: map output not equal to reduce input

2009-12-10 Thread Huy Phan
Do you have any combiner implemented in your job? On 12/10/2009 09:11 PM, Gang Luo wrote: Hi all, after finishing one MapReduce job, the statistics show that the number of records the map generated is not equal to the number of records the reduce took as input. It says: Map output records=15 Reduce
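
Huy asks because a combiner is the usual benign reason the two counters differ. A minimal sketch of wiring one in (newer API; the summing combiner is written out here rather than assuming a library class):

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Reducer;

    public class CombinerWiring {
        // A classic summing combiner: collapses many (word, 1) map output
        // records into one (word, partialSum) record per spill.
        public static class SumCombiner
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values,
                                  Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }

        public static void wire(Job job) {
            // With a combiner set, "Reduce input records" can legitimately
            // differ from "Map output records" with no data loss.
            job.setCombinerClass(SumCombiner.class);
        }
    }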

Re: map output not equal to reduce input

2009-12-10 Thread Gang Luo
Reduce shuffle bytes=10266764 Reduce output records=43282 Spilled Records=30 Map output bytes=9966746 Map input bytes=18711537 Combine input records=0 Map output records=15 Reduce input records=93282 -Gang - Original Message - From: Huy Phan To: common-user@hadoop.apache.org Sent: 2009/12/10

Re: What can cause: Map output copy failure

2010-01-08 Thread Amogh Vasekar
Hi, Can you please let us know your system configuration running hadoop? The error you see is when the reducer is copying its respective map output into memory. The parameter mapred.job.shuffle.input.buffer.percent can be manipulated for this (a bunch of others will also help you optimize sort
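
A hedged sketch of the shuffle-side knobs Amogh cites (0.20-era property names; the fractions are illustrative assumptions, not tuned values):

    import org.apache.hadoop.mapred.JobConf;

    public class ShuffleTuning {
        public static JobConf conservativeShuffle(JobConf conf) {
            // Fraction of the reducer's heap used to buffer copied map
            // outputs; lowering it trades memory pressure for disk I/O.
            conf.setFloat("mapred.job.shuffle.input.buffer.percent", 0.50f);
            // Usage level at which buffered map outputs are merged to disk.
            conf.setFloat("mapred.job.shuffle.merge.percent", 0.66f);
            return conf;
        }
    }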

Re: What can cause: Map output copy failure

2010-01-08 Thread Mayuran Yogarajah
Amogh Vasekar wrote: Hi, Can you please let us know your system configuration running hadoop? The error you see is when the reducer is copying its respective map output into memory. The parameter mapred.job.shuffle.input.buffer.percent can be manipulated for this ( a bunch of others will also

LZO compression for Map output in Hadoop 0.20+?

2010-02-16 Thread jiang licht
New to Hadoop (now using 0.20.1), I want to know how to choose and set up compression methods for Map output, especially how to configure and use LZO compression? Specifically, please share your experience for the following 2 scenarios. Thanks! (1) Is there a global setting in some

Use intermediate compression for Map output or not?

2010-02-25 Thread jiang licht
Hi hadoop Gurus, here's a question about intermediate compression. As I understand it, the point of compressing Map output is to reduce the network traffic that occurs when feeding sequence files from Map tasks to Reduce tasks that do not reside on the same boxes. So, depending on various fa

Re: Restricting number of records from map output

2011-01-12 Thread Anthony Urso
Either use an instance variable or a Combiner. The latter is correct if you want the top-n per key from the mapper. On Wed, Jan 12, 2011 at 10:03 AM, Rakesh Davanum wrote: > Hi, > > I have a sort job consisting of only the Mapper (no Reducer) task. I want my > results to contain only the top n r

Re: Restricting number of records from map output

2011-01-14 Thread Hari Sreekumar
Ideally, mappers should be independent of other mappers. Still, you can use counters and start skipping records when counter>some value to achieve similar behavior. It will not be very reliable if you want very exact results though. On Thu, Jan 13, 2011 at 12:43 AM, Anthony Urso wrote: > Either
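
A minimal sketch of the per-task cap Hari and Anthony describe (an instance variable is exact within one map task; a counter shared across tasks is only approximate). The cap value and types are invented for illustration:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class CappedMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
        private static final long MAX_EMITTED = 1000;  // illustrative cap
        private long emitted = 0;

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            if (emitted >= MAX_EMITTED) {
                return;  // skip the rest of this task's records
            }
            context.write(key, value);
            emitted++;
        }
    }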

Re: Restricting number of records from map output

2011-01-14 Thread Alex Kozlov
Hi Rakesh, What do you mean by the top N? The first N ones, or do you need to sort them in memory? You can always output records in the cleanup() method at the end of the mapper run. On Fri, Jan 14, 2011 at 7:05 AM, Hari Sreekumar wrote: > Ideally, mappers should be independent of other mappers. Stil
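
A minimal sketch of the cleanup() idea: each mapper keeps an in-memory top N and emits only those records when the task ends. The scoring function and N are illustrative, and note that a TreeMap keyed on the score collapses ties:

    import java.io.IOException;
    import java.util.Map;
    import java.util.TreeMap;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class TopNMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
        private static final int N = 10;
        private final TreeMap<Long, String> topN = new TreeMap<Long, String>();

        @Override
        protected void map(LongWritable key, Text value, Context context) {
            long score = value.toString().length();  // stand-in scoring function
            topN.put(score, value.toString());
            if (topN.size() > N) {
                topN.remove(topN.firstKey());  // evict the current smallest
            }
        }

        @Override
        protected void cleanup(Context context)
                throws IOException, InterruptedException {
            // Emit only the surviving top-N records at the end of the task.
            for (Map.Entry<Long, String> e : topN.entrySet()) {
                context.write(new LongWritable(e.getKey()), new Text(e.getValue()));
            }
        }
    }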

Re: Restricting number of records from map output

2011-01-14 Thread Niels Basjes
Hi, > I have a sort job consisting of only the Mapper (no Reducer) task. I want my > results to contain only the top n records. Is there any way of restricting > the number of records that are emitted by the Mappers? > > Basically I am looking to see if there is an equivalent of achieving > the be

Re: why quick sort when spill map output?

2011-02-28 Thread James Seigel
Sorting out of the map phase is core to how hadoop works. Are you asking why sort at all? or why did someone use quick sort as opposed to _sort? Cheers James On 2011-02-28, at 3:30 AM, elton sky wrote: > Hello forumers, > > Before spill the data in kvbuffer to local disk in map task, k/

Re: why quick sort when spill map output?

2011-02-28 Thread MANISH SINGLA
One of the major reasons for using quicksort would be that quicksort can easily be parallelized, due to its divide-and-conquer nature. On Mon, Feb 28, 2011 at 6:06 PM, James Seigel wrote: > Sorting out of the map phase is core to how hadoop works.  Are you asking why > sort at all?  or why did so

I got the problem from "Map output lost"

2011-09-01 Thread Tu Tu
Since this week, my Hadoop cluster has hit this problem, with the following information: Lost task tracker: tracker_rsync.host01:localhost/127.0.0.1:40759 Map output lost, rescheduling: getMapOutput(attempt_201108021855_6734_m_97_1,2002) failed : org.apache.hadoop.util.DiskChecker$DiskErrorExcept

Re: Re: map output not equal to reduce input

2009-12-10 Thread Todd Lipcon
10266764 > Reduce output records=43282 > Spilled Records=30 > Map output bytes=9966746 >  Map input bytes=18711537 > Combine input records=0 > Map output records=15 > Reduce input records=93282 > > -Gang

Re: LZO compression for Map output in Hadoop 0.20+?

2010-02-16 Thread himanshu chandola
al Message From: jiang licht To: common-user@hadoop.apache.org Sent: Wed, February 17, 2010 12:26:48 AM Subject: LZO compression for Map output in Hadoop 0.20+? New to Hadoop (now using 0.20.1), I want to know how to choose and set up compression methods for Map output, especially how to configu

Re: LZO compression for Map output in Hadoop 0.20+?

2010-02-17 Thread jiang licht
Thanks Himanshu. Is there a part 2? -- Michael --- On Tue, 2/16/10, himanshu chandola wrote: From: himanshu chandola Subject: Re: LZO compression for Map output in Hadoop 0.20+? To: common-user@hadoop.apache.org Date: Tuesday, February 16, 2010, 11:35 PM You might want to check out this

Re: LZO compression for Map output in Hadoop 0.20+?

2010-02-17 Thread himanshu chandola
: Wed, February 17, 2010 3:26:26 AM Subject: Re: LZO compression for Map output in Hadoop 0.20+? Thanks Himanshu. Is there a part 2? -- Michael --- On Tue, 2/16/10, himanshu chandola wrote: From: himanshu chandola Subject: Re: LZO compression for Map output in Hadoop 0.20+? To: common-user@ha

Re: LZO compression for Map output in Hadoop 0.20+?

2010-02-17 Thread Arun C Murthy
http://code.google.com/p/hadoop-gpl-compression/ Arun On Feb 16, 2010, at 9:26 PM, jiang licht wrote: New to Hadoop (now using 0.20.1), I want to know how to choose and set up compression methods for Map output, especially how to configure and use LZO compression? Specifically, please
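
A hedged sketch of enabling LZO for intermediate output on 0.20.x, assuming the native libraries from the hadoop-gpl-compression project Arun links are installed on every node (the codec class name comes from that project):

    import org.apache.hadoop.mapred.JobConf;

    public class LzoMapOutput {
        public static JobConf enable(JobConf conf) {
            // Compress map output before it is written to local disk and
            // shipped to the reducers over the network.
            conf.setBoolean("mapred.compress.map.output", true);
            conf.set("mapred.map.output.compression.codec",
                     "com.hadoop.compression.lzo.LzoCodec");
            return conf;
        }
    }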

Re: Re: map output not equal to reduce input

2009-12-10 Thread Gang Luo
Hi Todd, I didn't change the partitioner, just used the default one. Will the default partitioner cause the loss of records? -Gang - Original Message - From: Todd Lipcon To: common-user@hadoop.apache.org Sent: 2009/12/10 (Thu) 3:37:18 PM Subject: Re: Re: map output not equal to reduce input

Re: Re: Re: map output not equal to reduce input

2009-12-10 Thread Todd Lipcon
On Thu, Dec 10, 2009 at 1:15 PM, Gang Luo wrote: > Hi Todd, > I didn't change the partitioner, just used the default one. Will the default > partitioner cause the loss of records? > > -Gang > Do the maps output data nondeterministically? Did you experience any task failures in the run of the

Re: Re: Re: map output not equal to reduce input

2009-12-10 Thread Gang Luo
In the mapper of this job, I get something I am interested in for each line and then output all of them. So the number of map input records is equal to the number of map output records. Actually, I am doing a semi join in this job. There was no failure during execution. -Gang - Original Message - From: Todd

java.io.IOException: Spill failed when using w/ GzipCodec for Map output

2010-02-22 Thread jiang licht
I have a pig script. If I don't set any codec for Map output for hadoop cluster, no problem. Now I made the following compression settings, the job failed and the error message is shown below. I guess there are some other settings that should be correctly set together with using the compre
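
The poster's exact settings are truncated above; as an assumption, a typical way to request Gzip for map output looks like the following sketch (GzipCodec ships with Hadoop itself):

    import org.apache.hadoop.mapred.JobConf;

    public class GzipMapOutput {
        public static JobConf enable(JobConf conf) {
            conf.setBoolean("mapred.compress.map.output", true);
            // GzipCodec needs no extra libraries, though the native zlib
            // bindings speed it up considerably when present.
            conf.set("mapred.map.output.compression.codec",
                     "org.apache.hadoop.io.compress.GzipCodec");
            return conf;
        }
    }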

Why is Spilled Records always equal to Map output records

2009-07-12 Thread Mu Qiao
Hi, everyone. I'm a beginner with hadoop. I noticed it from the web console after I've tried to run several jobs. Every one of the jobs has the number of Spilled Records equal to Map output records, even if there are only 5 map output records. In the reduce phase, there are also spill

Re: Re: Re: Re: map output not equal to reduce input

2009-12-11 Thread Gang Luo
Thanks, Amogh. I am not sure whether all the records the mapper generates are consumed by the reducer. But how do you define 'consumed by reducer'? I can set a counter to see how many lines go to my map function, but this is likely the same as the reduce input #, which is less than the map output #.

Re: java.io.IOException: Spill failed when using w/ GzipCodec for Map output

2010-02-22 Thread Amogh Vasekar
ript. If I don't set any codec for Map output for hadoop cluster, no problem. Now I made the following compression settings, the job failed and the error message is shown below. I guess there are some other settings that should be correctly set together with using the compression. I'm using 0.20.

Re: java.io.IOException: Spill failed when using w/ GzipCodec for Map output

2010-02-22 Thread jiang licht
in hadoop wiki :) ) Amogh On 2/23/10 8:16 AM, "jiang licht" wrote: I have a pig script. If I don't set any codec for Map output for hadoop cluster, no problem. Now I made the following compression settings, the job failed and the error message is shown below. I guess there are some

Re: java.io.IOException: Spill failed when using w/ GzipCodec for Map output

2010-02-22 Thread Amogh Vasekar
hael --- On Mon, 2/22/10, Amogh Vasekar wrote: From: Amogh Vasekar Subject: Re: java.io.IOException: Spill failed when using w/ GzipCodec for Map output To: "common-user@hadoop.apache.org" Date: Monday, February 22, 2010, 11:27 PM Hi, Can you please let us know what platform you are ru

Re: java.io.IOException: Spill failed when using w/ GzipCodec for Map output

2010-02-23 Thread jiang licht
Thanks, Amogh. Good to know :) Michael --- On Tue, 2/23/10, Amogh Vasekar wrote: From: Amogh Vasekar Subject: Re: java.io.IOException: Spill failed when using w/ GzipCodec for Map output To: "common-user@hadoop.apache.org" Date: Tuesday, February 23, 2010, 1:45 AM Hi, Certainly

Re: Why is Spilled Records always equal to Map output records

2009-07-13 Thread Owen O'Malley
On Jul 12, 2009, at 3:55 AM, Mu Qiao wrote: I notice it from the web console after I've tried to run several jobs. Every one of the jobs has the number of Spilled Records equal to Map output records, even if there are only 5 map output records This is good. The map outputs need

Re: Why is Spilled Records always equal to Map output records

2009-07-13 Thread Mu Qiao
ote: > > I notice it from the web console after I've tried to run several jobs. >> Every one of the jobs has the number of Spilled Records equal to Map >> output >> records, even if there are only 5 map output records >> > > > This is good. The map ou

Re: Why is Spilled Records always equal to Map output records

2009-07-13 Thread Dali Kilani
wrote: > > > > I notice it from the web console after I've tried to run several jobs. > >> Every one of the jobs has the number of Spilled Records equal to Map > >> output > >> records, even if there are only 5 map output records > >> > >

Re: Why is Spilled Records always equal to Map output records

2009-07-13 Thread Mu Qiao
from the web console after I've tried to run several > jobs. > > >> Every one of the jobs has the number of Spilled Records equal to Map > > >> output > > >> records, even if there are only 5 map output records > > >> > > > >

Re: Why is Spilled Records always equal to Map output records

2009-07-14 Thread Owen O'Malley
There is no requirement that all of the reduces are running while the map is running. The dataflow is that the map writes its output to local disk and that the reduces pull the map outputs when they need them. There are threads handling sorting and spill of the records to disk, but that doesn't rem

Re: Why is Spilled Records always equal to Map output records

2009-07-14 Thread Mu Qiao
Thanks. But when I refer to "Hadoop: The Definitive Guide" chapter 6, I find that the map writes its outputs to a memory buffer (not to local disk) whose size is controlled by io.sort.mb. Only when the buffer reaches its threshold will it spill the outputs to local disk. If that is true, I can't see any

Re: Why is Spilled Records always equal to Map output records

2009-07-14 Thread Jothi Padmanabhan
It is true, map writes its output to a memory buffer. But when the map process is complete, the contents of this buffer are sorted and spilled to the disk so that the Task Tracker running on that node can serve these map outputs to the requesting reducers. On 7/15/09 7:59 AM, "Mu Qiao" wrote: >

Re: Why is Spilled Records always equal to Map output records

2009-07-14 Thread Mu Qiao
Thanks. It's clear now. :) On Wed, Jul 15, 2009 at 11:40 AM, Jothi Padmanabhan wrote: > It is true, map writes its output to a memory buffer. But when the map > process is complete, the contents of this buffer are sorted and spilled to > the disk so that the Task Tracker running on that node can

Re: Re: Re: Re: map output not equal to reduce input

2009-12-10 Thread Amogh Vasekar
"Gang Luo" wrote: In the mapper of this job, I get something I am interested in for each line and then output all of them. So the number of map input records is equal to the number of map output records. Actually, I am doing a semi join in this job. There was no failure during execution. -Gang - Original Message -

Re: Re: Re: Re: Re: map output not equal to reduce input

2009-12-14 Thread Amogh Vasekar
"Gang Luo" wrote: Thanks, Amogh. I am not sure whether all the records the mapper generates are consumed by the reducer. But how do you define 'consumed by reducer'? I can set a counter to see how many lines go to my map function, but this is likely the same as reduce input # which is l

Re: Re: Re: Re: Re: map output not equal to reduce input

2009-12-16 Thread Gang Luo
- From: Amogh Vasekar To: "common-user@hadoop.apache.org" Sent: 2009/12/15 (Tue) 1:59:14 AM Subject: Re: Re: Re: Re: Re: map output not equal to reduce input >>how do you define 'consumed by reducer' Trivially, as long as you have your values iterator go to the end, you shoul

Re: (keytype,valuetype) of map output should match (keytype,valuetype) of reducer input?

2011-08-02 Thread Harsh J
Yes, in terms of 'Java types', they _must_ match. That doesn't mean you can't set them all to just 'Writable' and have fun, I think ;) 2011/8/3 Daniel,Wu : >  should they match exactly? > -- Harsh J
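
A minimal sketch of Harsh's point: the map output key/value classes declared on the Job must agree with the Reducer's input type parameters, or the job fails at runtime with a type-mismatch error:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class TypeMatch {
        public static class M extends Mapper<LongWritable, Text, Text, IntWritable> {
            @Override
            protected void map(LongWritable key, Text value, Context ctx)
                    throws IOException, InterruptedException {
                ctx.write(value, new IntWritable(1));  // map output: (Text, IntWritable)
            }
        }

        // The reducer's input type parameters equal the map output types.
        public static class R extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> vals, Context ctx)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : vals) sum += v.get();
                ctx.write(key, new IntWritable(sum));
            }
        }

        public static void wire(Job job) {
            job.setMapperClass(M.class);
            job.setReducerClass(R.class);
            // These declarations must agree with the generics above.
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(IntWritable.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
        }
    }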

Does the map task push map output to reduce task or reduce task pull it from map task

2009-10-26 Thread Jeff Zhang
Hi all, I'd like to know: does the map task push its output to the reduce task, or does the reduce task pull it from the map task? Which way does Hadoop actually work? Thank you very much. Jeff Zhang

Re: Does the map task push map output to reduce task or reduce task pull it from map task

2009-10-26 Thread Prabhu Hari Dhanapal
bottleneck, but in reality it seems it isn't. BTW, wait for some expert to answer, I'm a beginner too! On Mon, Oct 26, 2009 at 9:05 PM, Jeff Zhang wrote: > Hi all, > > I'd like to know does the map task push map output to reduce task or reduce > task pull it from map task ? Whi

Re: Does the map task push map output to reduce task or reduce task pull it from map task

2009-10-26 Thread dave bayer
On Oct 26, 2009, at 6:05 PM, Jeff Zhang wrote: I'd like to know does the map task push map output to reduce task or reduce task pull it from map task ? Which way is real in hadoop ? In 0.19, it appears to be a pull. Look at the run() method in mapred/org/apache/hadoop/m

Re: Does the map task push map output to reduce task or reduce task pull it from map task

2009-10-26 Thread Amogh Vasekar
O(n), S&S is O(nlogn), so if the amount of intermediate data is huge you will see a relative drop in performance. Amogh On 10/27/09 6:35 AM, "Jeff Zhang" wrote: Hi all, I'd like to know does the map task push map output to reduce task or reduce task pull it from map task

Re: Does the map task push map output to reduce task or reduce task pull it from map task

2009-10-26 Thread Jothi Padmanabhan
Don't know what the equivalent would be in the mapreduce package in 0.20.x. -- dave bayer The framework code that handles fetching of map outputs is the same for both the mapred- and mapreduce-based reducers.