Re: hadoop.tmp.dir with multiple disks
Harsh, thanks for the heads up, that seemed to do the trick. Jay, I am building local files from the input, then compressing them on the local drive, then copying them back to HDFS. So in my case it really is about I/O to the local filesystem.

On Sun, Apr 22, 2012 at 5:44 PM, Edward Capriolo edlinuxg...@gmail.com wrote: Since each Hadoop task is isolated from the others, having more temp directories lets you isolate that disk bandwidth as well. By listing the disks you give more firepower to the shuffle-sort and merge processes. Edward

On Sun, Apr 22, 2012 at 10:02 AM, Jay Vyas jayunit...@gmail.com wrote: I don't understand why multiple disks would be particularly beneficial for a MapReduce job. Wouldn't I/O for a MapReduce job be I/O *as well as CPU* bound? I would think that simply reading and parsing large files would still require dedicated CPU time.

On Sun, Apr 22, 2012 at 3:14 AM, Harsh J ha...@cloudera.com wrote: You can use mapred.local.dir for this purpose. It accepts a list of directories tasks may use, just as dfs.data.dir uses multiple disks for block writes and reads.

On Sun, Apr 22, 2012 at 12:50 PM, mete efk...@gmail.com wrote: Hello folks, I have a job that processes text files from HDFS on the local filesystem (temp directory) and then copies them back to HDFS. I added another drive to each server to get better I/O performance, but as far as I could see hadoop.tmp.dir does not benefit from multiple disks, even if I set up two different folders on different disks (dfs.data.dir works fine). As a result the disk holding the temp folder is highly utilized while the other one sits mostly idle. Does anyone have an idea what to do? (I am using CDH3u3.) Thanks in advance, Mete

-- Harsh J
-- Jay Vyas MMSB/UCHC
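For reference, the multi-directory settings Harsh and Edward mention are plain comma-separated lists in the configuration files. A sketch, with hypothetical mount points (adjust to your own disks):

```xml
<!-- mapred-site.xml: task-local (shuffle/spill/temp) directories -->
<property>
  <name>mapred.local.dir</name>
  <value>/disk1/mapred/local,/disk2/mapred/local</value>
</property>
```

Tasks then spread their intermediate files across the listed directories, much as dfs.data.dir spreads HDFS blocks across its listed disks.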
Re: Reading data output by MapFileOutputFormat
Ali, MapFiles are explained at http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/MapFile.html - please give it a read and it should answer half your questions.

In short, a MapFile is two files: one raw SequenceFile holding the data, and an index file built on top of it. The reason MR does not provide a MapFileInputFormat is that you do not need the index file in MR jobs (there are no lookups for input-driven jobs). Hence SequenceFileInputFormat suffices to read the data (it ignores the index file and reads only the sequence file that carries the data). If you wish to make use of MapFile's index abilities for lookups etc., use the MapFile.Reader class directly in your implementation.

On Mon, Apr 23, 2012 at 4:23 PM, Ali Safdar Kureishy safdar.kurei...@gmail.com wrote: Hi, if I use a *MapFileOutputFormat* to output some data, I see that each reducer's output is a folder (part-0, for example), and inside that folder are two files: data and index. However, there is no corresponding MapFileInputFormat to read this folder (part-0) back. Instead, *SequenceFileInputFormat* seems to read the data. So I have some questions:

- Does SequenceFileInputFormat actually read *all* the data that was output by MapFileOutputFormat? Or is some relationship between the data and index files lost in the process that would have been better handled by another InputFormat class? In other words, is SequenceFileInputFormat the right InputFormat to read data written by MapFileOutputFormat?
- How is it that SequenceFileInputFormat can read the output of *both* MapFileOutputFormat and SequenceFileOutputFormat? That would imply that the two output formats write the same data, or that SequenceFileInputFormat internally handles each differently. What is the reality?

Thanks, Safdar

-- Harsh J
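MapFile itself is a Java construct, but the data-plus-sparse-index idea is easy to model. A toy sketch (plain Python, not the Hadoop API; class and parameter names are made up for illustration) of how a sparse index over sorted data supports lookups:

```python
import bisect

class TinyMapFile:
    """Toy model of a MapFile: sorted data plus a sparse index.
    The real MapFile stores both parts as SequenceFiles on HDFS."""

    def __init__(self, records, index_interval=2):
        # MapFile requires keys to be written in sorted order
        self.data = sorted(records)
        # sparse index: every Nth key -> its offset in the data "file"
        self.index = [(self.data[i][0], i)
                      for i in range(0, len(self.data), index_interval)]

    def get(self, key):
        keys = [k for k, _ in self.index]
        # find the last indexed key <= lookup key, then scan forward
        pos = bisect.bisect_right(keys, key) - 1
        if pos < 0:
            return None
        start = self.index[pos][1]
        for k, v in self.data[start:]:
            if k == key:
                return v
            if k > key:
                return None
        return None

mf = TinyMapFile([("a", 1), ("c", 3), ("e", 5), ("g", 7)])
print(mf.get("e"))  # -> 5
```

An MR job scanning all records never consults the index, which is why SequenceFileInputFormat (reading only the data part) loses nothing.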
isSplitable() problem
I require each input file to be processed as a whole by a single mapper. I subclass c.o.a.h.mapreduce.lib.input.TextInputFormat and override isSplitable() to always return false. The job is configured to use this subclass as the input format class via setInputFormatClass(). The job runs without error, yet the logs reveal the files are still processed line by line by the mappers. Any help would be greatly appreciated. Thanks
Re: isSplitable() problem
Dan, splitting and reading a whole file as one chunk are two slightly different things. The former controls whether your files may be split across mappers (relevant when a file spans multiple HDFS blocks). The latter must be achieved differently.

TextInputFormat provides a LineRecordReader by default, which, as its name suggests, reads whatever stream is handed to it line by line. This is regardless of the file's block splits (a very different thing from line splits). You need to implement your own RecordReader and return it from your InputFormat to do what you want: read the whole stream into one object and pass it to the Mapper.

On Mon, Apr 23, 2012 at 5:10 PM, Dan Drew wirefr...@googlemail.com wrote: I require each input file to be processed by each mapper as a whole. [...]

-- Harsh J
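The Java-side fix Harsh describes is a custom RecordReader. As a language-neutral sketch (plain Python, not Hadoop code) of why overriding isSplitable() alone is not enough: the split policy decides how much data one mapper receives, while the record reader decides how that data is carved into records.

```python
def line_records(stream):
    """What LineRecordReader does: one record per line of the
    input, regardless of how the file was split into mappers."""
    for line in stream.splitlines():
        yield line

def whole_file_record(stream):
    """What a custom whole-file RecordReader would do: emit the
    entire stream as a single record for the mapper."""
    yield stream

doc = "first line\nsecond line\n"
# Same unsplit input, different record granularity:
print(sum(1 for _ in line_records(doc)))       # 2 records
print(sum(1 for _ in whole_file_record(doc)))  # 1 record
```

With isSplitable() returning false, one mapper gets the whole file, but LineRecordReader still feeds it to the map function one line at a time.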
Re: isSplitable() problem
Thanks for the clarification. On 23 April 2012 12:52, Harsh J ha...@cloudera.com wrote: [...]
Re: Algorithms used in fairscheduler 0.20.205
Anyone? On 19 April 2012 17:34, Merto Mertek masmer...@gmail.com wrote: The closest document I could find matching the current implementation of the FairScheduler is this technical report from Matei Zaharia et al.: http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-55.html. Another document, on delay scheduling, dates from 2010.

a) Is there any newer documented version of the implementation?
b) Are there any algorithms besides delay scheduling, the copy-compute splitting algorithm and the fair-share calculation algorithm that are important for cluster performance and fair sharing?
c) Is there any connection between copy-compute splitting and the MapReduce phases (copy-sort-reduce)?

Thank you..
Re: Reading data output by MapFileOutputFormat
Thanks Harsh! This is very helpful. Regards, Ali On Mon, Apr 23, 2012 at 2:08 PM, Harsh J ha...@cloudera.com wrote: [...]
Re: How to set the KeyRange in an Hadoop+Cassandra job
Hey Filippo, I think this question belongs on the Cassandra user list (u...@cassandra.apache.org), as it is quite specific to Cassandra's APIs and implementation.

On Mon, Apr 23, 2012 at 6:37 PM, Filippo Diotalevi fili...@ntoklo.com wrote: Hi, I'm trying to set the KeyRange using the ConfigHelper API for a Cassandra 1.0.8 + Hadoop job. The only API I see is public static void setInputRange(Configuration conf, String startToken, String endToken), which lets me set a key range by startToken and endToken, but I'd like to set the starting and ending row key instead. How can I achieve that? Thanks, -- Filippo Diotalevi fili...@ntoklo.com m: +44(0)7543 286 481 www.ntoklo.com

-- Harsh J
Re: Distributing MapReduce on a computer cluster
Shailesh, there's a lot that goes into distributing work across tasks and nodes. It's not just distributing the work; fault tolerance, data locality, etc. also come into play. It might be good to refer to the Apache Hadoop docs or Tom White's definitive guide. Sent from my iPhone

On Apr 23, 2012, at 11:03 AM, Shailesh Samudrala shailesh2...@gmail.com wrote: Hello, I am trying to design my own MapReduce implementation and I want to know how Hadoop is able to distribute its workload across multiple computers. Can anyone shed more light on this? Thanks!
Design question
I just wanted to ask how people design their storage directories for data that is sent to the system continuously. For example, for a given functionality we get a continuous data feed written to a SequenceFile, which is then converted to a more structured format with MapReduce and stored in tab-separated files. For such a continuous feed, what is the best way to organize directories and names? Should it be based just on timestamps, or is there something better that helps in organizing the data?

Second part of the question: is it better to store output in SequenceFiles so that we can take advantage of per-record compression? This seems to be required, since gzip/snappy compression of an entire file would launch only one map task.

And the last question: when compressing a flat file, should it first be split into multiple files so that we get multiple mappers if we need to run another job on it? LZO is another alternative, but it requires additional configuration; is it preferred? Any articles or suggestions would be very helpful.
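On the directory-naming question, one common convention (an illustration, not the only option) is time-partitioned paths, so that daily jobs and retention policies can address whole directories at once. A sketch with made-up feed and stage names:

```python
from datetime import datetime, timezone

def feed_path(feed, ts, stage="raw"):
    """Build a time-partitioned HDFS-style path such as
    /data/<feed>/<stage>/2012/04/23 so downstream jobs can
    glob a day, a month, or a year of data directly."""
    return "/data/{}/{}/{:%Y/%m/%d}".format(feed, stage, ts)

ts = datetime(2012, 4, 23, tzinfo=timezone.utc)
print(feed_path("clicks", ts))            # /data/clicks/raw/2012/04/23
print(feed_path("clicks", ts, "parsed"))  # /data/clicks/parsed/2012/04/23
```

Keeping the raw and converted outputs as sibling stages under the same date partition makes it easy to reprocess a day when a conversion job fails.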
Determine the key of Map function
Hello everyone! I have a problem with MapReduce like this: I have 4 input files with 3 fields: teacherId, classId, numberOfStudent (numberOfStudent is in descending order for each teacher). The output should be, for each teacher, the top 30 classIds by numberOfStudent. My approach is a MapReduce job like the WordCount example, but I don't know how to determine the key for the map function. I have run the WordCount example and understand its code, but I have no experience programming MapReduce. Can anyone help me solve this problem? Thanks so much! -- Lạc Trung 20083535
Re: Determine the key of Map function
It's somewhat tricky to understand exactly what you need from your explanation, but I believe you want the teachers who have the most students in a given class. So for English, I have 10 teachers teaching the class, and I want the ones with the highest number of students. You can output key = classId and value = (-1 * numberOfStudents, teacherId). The values will then be sorted by number of students. You can thus pick the teacher in the first value of your reducer, and that will be the teacher for class id xyz with the highest number of students. You can also be smart in your mapper by running a combiner to remove the teacherIds that are clearly not maximal.

On Mon, Apr 23, 2012 at 9:38 PM, Lac Trung trungnb3...@gmail.com wrote: [...]

-- Jay Vyas MMSB/UCHC
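A note on the negation trick Jay mentions: multiplying counts by -1 makes an ascending sort come out largest-first. A minimal plain-Python illustration (no Hadoop here; also note that in a real MR job the values arriving at a reducer are not sorted unless you set up a secondary sort):

```python
records = [("Class11", 30), ("Class12", 29), ("Class13", 28), ("Class14", 27)]

# emit (-count, classId): the default ascending sort then yields
# the records in descending order of student count
values = sorted((-n, c) for c, n in records)
top3 = [c for _, c in values[:3]]
print(top3)  # ['Class11', 'Class12', 'Class13']
```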
Re: Determine the key of Map function
Hi Jay! I think it's a bit different here. I want to get, for each teacherId, the 30 classIds that have the most students. For example, to get 3 classIds:

(File1)
1) Teacher1, Class11, 30
2) Teacher1, Class12, 29
3) Teacher1, Class13, 28
4) Teacher1, Class14, 27
... n ...
n+1) Teacher2, Class21, 45
n+2) Teacher2, Class22, 44
n+3) Teacher2, Class23, 43
n+4) Teacher2, Class24, 42
... n+m ...

=> return lines 1, 2, 3 for Teacher1 and lines n+1, n+2, n+3 for Teacher2

On 24 April 2012 at 09:52, Jay Vyas jayunit...@gmail.com wrote: [...]

-- Lạc Trung 20083535
Re: Determine the key of Map function
Ahh... well then the key will be teacher, and the value will simply be (-1 * numberOfStudents, classId). Then you will see in the reducer that the first 3 entries are always the ones you wanted.

On Mon, Apr 23, 2012 at 10:17 PM, Lac Trung trungnb3...@gmail.com wrote: [...]

-- Jay Vyas MMSB/UCHC
Re: Determine the key of Map function
Thanks Jay so much! I will try this. ^^

On 24 April 2012 at 10:52, Jay Vyas jayunit...@gmail.com wrote: [...]

-- Lạc Trung 20083535
Re: Determine the key of Map function
Ah, as I said before, I have no experience programming MapReduce. So can you give me some documents or websites about what you described above? (A thousand things start hard, as we say in Vietnam.) Thanks so much ^^!

On 24 April 2012 at 10:54, Lac Trung trungnb3...@gmail.com wrote: [...]

-- Lạc Trung 20083535
RE: Determine the key of Map function
Hi Lac, as per my understanding of your problem description, you need to do the following:

1. Mapper: write a mapper that emits records from the input files as keys and values. Here the key should contain teacherId, classId and number of students; the value can be empty (or null).
2. Partitioner: write a custom partitioner to send all records for a teacherId to one reducer.
3. Grouping comparator: write a comparator to group the records by teacherId.
4. Sorting comparator: write a comparator to sort the records by teacherId and number of students.
5. Reducer: in the reducer you will get the records for all teachers one after another, sorted (by number of students) within each teacherId. You can keep however many top records you want in the reducer and finally write them out.

You can refer to this document for reference: http://www.inf.ed.ac.uk/publications/thesis/online/IM100859.pdf

Thanks, Devaraj

From: Lac Trung [trungnb3...@gmail.com] Sent: Tuesday, April 24, 2012 10:11 AM To: common-user@hadoop.apache.org Subject: Re: Determine the key of Map function [...]
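Devaraj's five steps can be simulated outside Hadoop to check the logic. A minimal plain-Python sketch (no Hadoop involved; it collapses the partitioner, grouping comparator, sorting comparator and reducer into ordinary grouping and sorting) using the example data from earlier in the thread:

```python
from collections import defaultdict

def top_n_per_teacher(records, n=3):
    """Simulate the shuffle: group by teacherId (what the partitioner
    and grouping comparator achieve), sort each group by student count
    descending (the sorting comparator), then keep the top n classIds
    per teacher (the reducer)."""
    groups = defaultdict(list)
    for teacher, cls, students in records:
        groups[teacher].append((cls, students))
    out = {}
    for teacher, rows in groups.items():
        rows.sort(key=lambda r: -r[1])
        out[teacher] = [cls for cls, _ in rows[:n]]
    return out

data = [
    ("Teacher1", "Class11", 30), ("Teacher1", "Class12", 29),
    ("Teacher1", "Class13", 28), ("Teacher1", "Class14", 27),
    ("Teacher2", "Class21", 45), ("Teacher2", "Class22", 44),
    ("Teacher2", "Class23", 43), ("Teacher2", "Class24", 42),
]
# Teacher1 -> Class11..Class13, Teacher2 -> Class21..Class23
print(top_n_per_teacher(data))
```

In the real job the grouping and sorting are done by the framework during the shuffle, so the reducer only has to count off the first n values per teacher.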