Re: isSplitable() problem
I have chosen to use Jay's suggestion as a quick workaround and am pleased to report that it seems to work well on small test inputs. My question now is: are the mappers guaranteed to receive the file's lines in order? Browsing the source suggests this is so, but I just want to make sure, as my understanding of Hadoop is insubstantial. Thank you for your patience in answering my questions.

On 23 April 2012 14:28, Harsh J ha...@cloudera.com wrote:

Jay,

On Mon, Apr 23, 2012 at 6:43 PM, JAX jayunit...@gmail.com wrote: Curious: it seems like you could aggregate the results in the mapper as a local variable or list of strings; is there a way to know that your mapper has just read the LAST line of an input split?

True. That can be one way to do it (unless aggregation of 'records' needs to happen live and you don't wish to store it all in memory).

Is there a cleanup or finalize method in mappers that is run at the end of a whole stream read, to support these sorts of chunked, in-memory map/reduce operations?

Yes, there is. See:

Old API: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/Mapper.html (see Closeable's close())
New API: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapreduce/Mapper.html#cleanup(org.apache.hadoop.mapreduce.Mapper.Context)

-- Harsh J
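A minimal sketch of the cleanup() approach Harsh points to, assuming the new (org.apache.hadoop.mapreduce) API; the class and field names below are illustrative, not taken from the thread:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Buffers every line of the split in map(), then emits one aggregated record
    // from cleanup(), which the framework calls once after the last map() call.
    public class AggregatingMapper
        extends Mapper<LongWritable, Text, NullWritable, Text> {

      private final List<String> lines = new ArrayList<String>();

      @Override
      protected void map(LongWritable offset, Text line, Context context) {
        lines.add(line.toString());   // accumulate instead of emitting per record
      }

      @Override
      protected void cleanup(Context context)
          throws IOException, InterruptedException {
        StringBuilder joined = new StringBuilder();
        for (String line : lines) {
          joined.append(line).append('\n');
        }
        // One record for the whole split; watch memory use on large splits.
        context.write(NullWritable.get(), new Text(joined.toString()));
      }
    }

In the old API the equivalent hook is close() from Closeable, as the first link above notes.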
Re: Determine the key of Map function
Thanks so much!

On 24 April 2012 at 12:21, Devaraj k devara...@huawei.com wrote:

Hi Lac, as I understand your problem description, you need to do the following things.

1. Mapper: write a mapper which reads records from the input files and converts them into keys and values. Here the key should contain teacher id, class id and number of students; the value can be empty (or null).
2. Partitioner: write a custom partitioner to send all the records for a teacher id to one reducer.
3. Grouping Comparator: write a comparator to group the records based on teacher id.
4. Sorting Comparator: write a comparator to sort the records based on teacher id and number of students.
5. Reducer: in the reducer you will get the records for all teachers one after the other, in sorted order (by number of students) for each teacher id. You can keep however many top records you want in the reducer and finally write them out.

(A code sketch of steps 2-4 appears at the end of this thread.)

You can refer to this doc for reference: http://www.inf.ed.ac.uk/publications/thesis/online/IM100859.pdf

Thanks
Devaraj

From: Lac Trung [trungnb3...@gmail.com]
Sent: Tuesday, April 24, 2012 10:11 AM
To: common-user@hadoop.apache.org
Subject: Re: Determine the key of Map function

Ah, as I said before, I have no experience programming MapReduce. So can you give me some documents or websites or something about programming the things you said above? (A thousand things start hard - Vietnam.) Thanks so much ^^!

On 24 April 2012 at 10:54, Lac Trung trungnb3...@gmail.com wrote:

Thanks Jay so much! I will try this. ^^

On 24 April 2012 at 10:52, Jay Vyas jayunit...@gmail.com wrote:

Ahh... Well then the key will be the teacher, and the value will simply be (-1 * #students, class_id). Then you will see in the reducer that the first 3 entries will always be the ones you wanted.

On Mon, Apr 23, 2012 at 10:17 PM, Lac Trung trungnb3...@gmail.com wrote:

Hi Jay! I think it's a bit different here. I want to get the 30 classIds with the most students for each teacherId. For example, get 3 classIds (File1):

1) Teacher1, Class11, 30
2) Teacher1, Class12, 29
3) Teacher1, Class13, 28
4) Teacher1, Class14, 27
...
n) ...
n+1) Teacher2, Class21, 45
n+2) Teacher2, Class22, 44
n+3) Teacher2, Class23, 43
n+4) Teacher2, Class24, 42
...
n+m) ...

=> return lines 1, 2, 3 for Teacher1 and lines n+1, n+2, n+3 for Teacher2.

On 24 April 2012 at 09:52, Jay Vyas jayunit...@gmail.com wrote:

It's somewhat tricky to understand exactly what you need from your explanation, but I believe you want the teachers who have the most students in a given class. So for English, I have 10 teachers teaching the class, and I want the ones with the highest number of students. You can output key = classid and value = (-1 * #ofstudents, teacherid). The values will then be sorted by number of students. You can thus pick the teacher from the first value in your reducer, and that will be the teacher for class id = xyz with the highest number of students. You can also be smart in your mapper by running a combiner to remove the teacherids that are clearly not maximal.

On Mon, Apr 23, 2012 at 9:38 PM, Lac Trung trungnb3...@gmail.com wrote:

Hello everyone! I have a problem with MapReduce [:(] like this: I have 4 input files with 3 fields: teacherId, classId, numberOfStudent (numberOfStudent is ordered descending for each teacher). The output is the top 30 classIds with the largest numberOfStudent for each teacher. My approach is a MapReduce job like the Wordcount example, but I don't know how to determine the key for the map function.
I ran the Wordcount example and understand its code, but I have no experience programming MapReduce. Can anyone help me resolve this problem? Thanks so much!

--
Lạc Trung
20083535

--
Jay Vyas
MMSB/UCHC
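A compact sketch of Devaraj's steps 2-4 above (partitioner, grouping comparator, sort order), assuming a composite key that carries the teacher id and the student count; the class names and key layout are illustrative only:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.io.WritableComparator;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Composite key: teacher id plus student count. The sort order compares both
    // fields (count descending); the partitioner and grouping comparator look only
    // at the teacher id, so one reduce() call sees all of a teacher's classes,
    // largest class first.
    public class TeacherKey implements WritableComparable<TeacherKey> {
      private String teacherId = "";
      private int students;

      public TeacherKey() {}

      public TeacherKey(String teacherId, int students) {
        this.teacherId = teacherId;
        this.students = students;
      }

      public String getTeacherId() { return teacherId; }

      public void write(DataOutput out) throws IOException {
        out.writeUTF(teacherId);
        out.writeInt(students);
      }

      public void readFields(DataInput in) throws IOException {
        teacherId = in.readUTF();
        students = in.readInt();
      }

      // Sort order (step 4): teacher id ascending, then student count descending.
      public int compareTo(TeacherKey other) {
        int cmp = teacherId.compareTo(other.teacherId);
        return cmp != 0 ? cmp : other.students - students;
      }

      // Step 2: all keys with the same teacher id go to the same reducer.
      public static class TeacherPartitioner extends Partitioner<TeacherKey, Text> {
        @Override
        public int getPartition(TeacherKey key, Text value, int numPartitions) {
          return (key.getTeacherId().hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
      }

      // Step 3: group values by teacher id only, ignoring the student count.
      public static class TeacherGroupingComparator extends WritableComparator {
        protected TeacherGroupingComparator() { super(TeacherKey.class, true); }
        @Override
        public int compare(WritableComparable a, WritableComparable b) {
          return ((TeacherKey) a).getTeacherId().compareTo(((TeacherKey) b).getTeacherId());
        }
      }
    }

In the driver these would be registered with job.setPartitionerClass(...) and job.setGroupingComparatorClass(...); the key's compareTo() supplies the sort. The reducer (step 5) then simply keeps the first 30 values it sees for each teacher.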
Re: isSplitable() problem
The current code guarantees that they will be received in order. There are some patches likely to go in soon that would allow the JVM itself to be reused. In those cases I believe the mapper class would be recreated, so the only thing you would have to worry about is static values that are updated while processing the data.

-- Bobby Evans

On 4/24/12 4:45 AM, Dan Drew wirefr...@googlemail.com wrote:

I have chosen to use Jay's suggestion as a quick workaround and am pleased to report that it seems to work well on small test inputs. My question now is: are the mappers guaranteed to receive the file's lines in order? Browsing the source suggests this is so, but I just want to make sure, as my understanding of Hadoop is insubstantial. Thank you for your patience in answering my questions.
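A tiny illustration of the caveat about static values (the class and field names are made up): with JVM reuse, an instance field starts fresh for each task, while a static field keeps whatever earlier tasks in the same JVM left behind.

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // A fresh Mapper instance is created per task, so instance state is safe;
    // static state lives as long as the (possibly reused) JVM does.
    public class CountingMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

      private long perTaskCount = 0;        // reset for every task
      private static long perJvmCount = 0;  // survives across reused tasks

      @Override
      protected void map(LongWritable key, Text value, Context context) {
        perTaskCount++;
        perJvmCount++;  // after JVM reuse this may include earlier tasks' counts
      }
    }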
Re: hadoop streaming and a directory containing large number of .tgz files
Sorry for re-forwarding this email; I was not sure it actually got through, since I only just got the confirmation of my membership to the mailing list. Thanks, Sunil.

On Tue, Apr 24, 2012 at 7:12 PM, Sunil S Nandihalli sunil.nandiha...@gmail.com wrote:

Hi everybody, I am a newbie to Hadoop. I have about 40K .tgz files, each of approximately 3 MB. I would like to process them as if they were a single large file formed by

    cat list-of-files | gnuparallel 'tar -Oxvf {} | sed 1d' > output.txt

How can I achieve this using hadoop-streaming or some other similar library? Thanks, Sunil.
Re: hadoop streaming and a directory containing large number of .tgz files
Sunil, you could do it with identity mappers, a single identity reducer, and no output compression.

Raj

From: Sunil S Nandihalli sunil.nandiha...@gmail.com
To: common-user@hadoop.apache.org
Sent: Tuesday, April 24, 2012 7:01 AM
Subject: Re: hadoop streaming and a directory containing large number of .tgz files

Sorry for re-forwarding this email; I was not sure it actually got through, since I only just got the confirmation of my membership to the mailing list. Thanks, Sunil.

On Tue, Apr 24, 2012 at 7:12 PM, Sunil S Nandihalli sunil.nandiha...@gmail.com wrote:

Hi everybody, I am a newbie to Hadoop. I have about 40K .tgz files, each of approximately 3 MB. I would like to process them as if they were a single large file formed by

    cat list-of-files | gnuparallel 'tar -Oxvf {} | sed 1d' > output.txt

How can I achieve this using hadoop-streaming or some other similar library? Thanks, Sunil.
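A rough Java (old API) analogue of Raj's suggestion, assuming plain-text input; with streaming the same effect should come from using cat as both mapper and reducer. Note that TextInputFormat's byte-offset keys will show up in the output of this sketch unless they are stripped.

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.IdentityMapper;
    import org.apache.hadoop.mapred.lib.IdentityReducer;

    // Identity map, one identity reduce, no output compression: everything the
    // input format produces is funnelled into a single uncompressed output file.
    public class ConcatJob {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(ConcatJob.class);
        conf.setJobName("concatenate");

        conf.setMapperClass(IdentityMapper.class);
        conf.setReducerClass(IdentityReducer.class);
        conf.setNumReduceTasks(1);                        // single reducer, single part file
        FileOutputFormat.setCompressOutput(conf, false);  // plain, uncompressed output

        conf.setOutputKeyClass(LongWritable.class);       // TextInputFormat keys: byte offsets
        conf.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
      }
    }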
RE: hadoop streaming and a directory containing large number of .tgz files
Hi Sunil, please check HarFileSystem (the Hadoop Archive filesystem); it will be useful for solving your problem.

Thanks
Devaraj

From: Sunil S Nandihalli [sunil.nandiha...@gmail.com]
Sent: Tuesday, April 24, 2012 7:12 PM
To: common-user@hadoop.apache.org
Subject: hadoop streaming and a directory containing large number of .tgz files

Hi everybody, I am a newbie to Hadoop. I have about 40K .tgz files, each of approximately 3 MB. I would like to process them as if they were a single large file formed by

    cat list-of-files | gnuparallel 'tar -Oxvf {} | sed 1d' > output.txt

How can I achieve this using hadoop-streaming or some other similar library? Thanks, Sunil.
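For reference, once the small files are packed into a Hadoop Archive (for example with the hadoop archive command), the archive is addressed through the har:// scheme that HarFileSystem implements. A minimal sketch with a hypothetical archive path:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Lists the contents of a Hadoop Archive as if it were a directory.
    public class ListHar {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical path: replace with the actual archive location.
        Path har = new Path("har:///user/sunil/tgz-files.har");
        FileSystem fs = har.getFileSystem(conf);
        for (FileStatus status : fs.listStatus(har)) {
          System.out.println(status.getPath() + "\t" + status.getLen());
        }
      }
    }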
why does Text.setCapacity not double the array size as in most dynamic array implementations?
    private void setCapacity(int len, boolean keepData) {
      if (bytes == null || bytes.length < len) {
        byte[] newBytes = new byte[len];
        if (bytes != null && keepData) {
          System.arraycopy(bytes, 0, newBytes, 0, length);
        }
        bytes = newBytes;
      }
    }

Why does Text.setCapacity only expand the array to the length of the new data? Why not instead set the length to double the newly requested length, or to 3/2 times the existing length as in ArrayList, so that the array size grows exponentially, as in most dynamic array implementations?
Re: why does Text.setCapacity not double the array size as in most dynamic array implementations?
Sorry, I just stumbled across HADOOP-6109, which made this change in trunk; I was looking at the Text in 1.0.2. Can this fix get backported to the Hadoop 1 versions?

On 04/24/2012 11:01 PM, Jim Donofrio wrote:

    private void setCapacity(int len, boolean keepData) {
      if (bytes == null || bytes.length < len) {
        byte[] newBytes = new byte[len];
        if (bytes != null && keepData) {
          System.arraycopy(bytes, 0, newBytes, 0, length);
        }
        bytes = newBytes;
      }
    }

Why does Text.setCapacity only expand the array to the length of the new data? Why not instead set the length to double the newly requested length, or to 3/2 times the existing length as in ArrayList, so that the array size grows exponentially, as in most dynamic array implementations?
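For comparison, a sketch of the exponential-growth variant the question asks about (and that HADOOP-6109 introduced in trunk); this shows the idea, not the exact committed code:

    private void setCapacity(int len, boolean keepData) {
      if (bytes == null || bytes.length < len) {
        // Grow to at least the requested size, but at least double the current
        // capacity, so repeated appends are amortised O(1) rather than O(n^2).
        int newLength = (bytes == null) ? len : Math.max(len, bytes.length * 2);
        byte[] newBytes = new byte[newLength];
        if (bytes != null && keepData) {
          System.arraycopy(bytes, 0, newBytes, 0, length);
        }
        bytes = newBytes;
      }
    }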