Re: isSplitable() problem

2012-04-24 Thread Dan Drew
I have chosen to use Jay's suggestion as a quick workaround and am pleased
to report that it seems to work well on small test inputs.

My question now is, are the mappers guaranteed to receive the file's lines
in order?

Browsing the source suggests this is so, but I just want to make sure, as my
understanding of Hadoop is still insubstantial.

Thank you for your patience in answering my questions.

On 23 April 2012 14:28, Harsh J ha...@cloudera.com wrote:

 Jay,

 On Mon, Apr 23, 2012 at 6:43 PM, JAX jayunit...@gmail.com wrote:
  Curious : Seems like you could aggregate the results in the mapper as a
 local variable or list of strings--- is there a way to know that your
 mapper has just read the LAST line of an input split?

 True. That can be one way to do it (unless the aggregation of 'records' needs
 to happen live and you don't wish to store it all in memory).

  Is there a cleanup or finalize method in mappers that is run at the
 end of a whole stream read, to support this sort of chunked, in-memory
 map/reduce operation?

 Yes there is. See:

 Old API:
 http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/Mapper.html
 (See Closeable's close())

 New API:
 http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapreduce/Mapper.html#cleanup(org.apache.hadoop.mapreduce.Mapper.Context)
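
 A minimal sketch of the new-API approach, assuming the goal is simply to
 buffer the split's lines and emit one aggregated record at the end (class
 and variable names here are illustrative, not from the original mail):

 import java.io.IOException;
 import java.util.ArrayList;
 import java.util.List;

 import org.apache.hadoop.io.LongWritable;
 import org.apache.hadoop.io.NullWritable;
 import org.apache.hadoop.io.Text;
 import org.apache.hadoop.mapreduce.Mapper;

 public class AggregatingMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
   private final List<String> buffer = new ArrayList<String>();

   @Override
   protected void map(LongWritable key, Text value, Context context) {
     buffer.add(value.toString());   // accumulate this split's lines in memory
   }

   @Override
   protected void cleanup(Context context) throws IOException, InterruptedException {
     // Called once by the framework after the last record of the split.
     StringBuilder sb = new StringBuilder();
     for (String line : buffer) {
       sb.append(line).append('\n');
     }
     context.write(new Text(sb.toString()), NullWritable.get());
   }
 }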


 --
 Harsh J



Re: Determine the key of Map function

2012-04-24 Thread Lac Trung
Thanks so much !


On 24 April 2012 at 12:21, Devaraj k devara...@huawei.com wrote:

 Hi Lac,

  As per my understanding of your problem description, you need to do
 the following things.

 1. Mapper : Write a mapper which reads records from the input files and
 converts them into keys and values. The key should contain the teacher id,
 class id and number of students; the value can be empty (or null).
 2. Partitioner : Write a custom partitioner to send all the records for a
 teacher id to one reducer.
 3. Grouping Comparator : Write a comparator to group the records based on
 teacher id.
 4. Sorting Comparator : Write a comparator to sort the records based on
 teacher id and number of students.
 5. Reducer : In the reducer, you will get the records for each teacher one
 after another, sorted by number of students within a teacher id. Keep
 however many top records you want per teacher and write those out. (A small
 sketch of steps 2 and 3 follows below.)

 You can refer to this document for reference:
 http://www.inf.ed.ac.uk/publications/thesis/online/IM100859.pdf
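
 A minimal sketch of steps 2 and 3 under one possible layout, assuming the
 map output key is a Text of the form teacherId '\t' numberOfStudents '\t'
 classId (class and field names are illustrative, not from this thread; in a
 real job each class would live in its own source file):

 import org.apache.hadoop.io.Text;
 import org.apache.hadoop.io.WritableComparable;
 import org.apache.hadoop.io.WritableComparator;
 import org.apache.hadoop.mapreduce.Partitioner;

 public class TeacherPartitioner extends Partitioner<Text, Text> {
   @Override
   public int getPartition(Text key, Text value, int numPartitions) {
     // Partition on teacher id only, so all of a teacher's records meet in one reducer.
     String teacherId = key.toString().split("\t")[0];
     return (teacherId.hashCode() & Integer.MAX_VALUE) % numPartitions;
   }
 }

 public class TeacherGroupingComparator extends WritableComparator {
   public TeacherGroupingComparator() {
     super(Text.class, true);
   }

   @Override
   public int compare(WritableComparable a, WritableComparable b) {
     // Group purely by teacher id so one reduce() call sees all classes of that teacher.
     String t1 = a.toString().split("\t")[0];
     String t2 = b.toString().split("\t")[0];
     return t1.compareTo(t2);
   }
 }

 These would be wired in with job.setPartitionerClass(TeacherPartitioner.class)
 and job.setGroupingComparatorClass(TeacherGroupingComparator.class); the sort
 comparator of step 4 is registered the same way via job.setSortComparatorClass().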

 Thanks
 Devaraj

 
 From: Lac Trung [trungnb3...@gmail.com]
 Sent: Tuesday, April 24, 2012 10:11 AM
 To: common-user@hadoop.apache.org
 Subject: Re: Determine the key of Map function

 Ah, as I said before, I have no experience in programming MapReduce. So,
 can you give me some documents or websites or something about how to program
 the approach you described above? (A thousand things start hard, as we say in
 Vietnam.)
 Thanks so much ^^!

 On 24 April 2012 at 10:54, Lac Trung trungnb3...@gmail.com wrote:

  Thanks Jay so much !
  I will try this.
  ^^
 
  On 24 April 2012 at 10:52, Jay Vyas jayunit...@gmail.com wrote:
 
  Ahh... Well, then the key will be the teacher, and the value will simply be
  -1 * #students, class_id.

  Then you will see in the reducer that the first 3 entries will always be
  the ones you wanted.
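
  A hedged sketch of that map-side emit (class and field names are
  illustrative only); note that plain MapReduce does not sort values within a
  key, so on its own this still needs either an in-reducer sort or the
  secondary-sort setup Devaraj describes earlier in this thread:

  import java.io.IOException;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;

  public class TeacherTopClassMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      // input record: teacherId, classId, numberOfStudent
      String[] f = line.toString().split(",\\s*");
      int students = Integer.parseInt(f[2]);
      // key = teacherId, value = "-numberOfStudent,classId"
      context.write(new Text(f[0]), new Text((-1 * students) + "," + f[1]));
    }
  }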
 
  On Mon, Apr 23, 2012 at 10:17 PM, Lac Trung trungnb3...@gmail.com
  wrote:
 
   Hi Jay !
   I think it's a bit different here. I want to get the 30 classIds with the
   most students for each teacherId.
   For example : get 3 classIds.
   (File1)
   1) Teacher1, Class11, 30
   2) Teacher1, Class12, 29
   3) Teacher1, Class13, 28
   4) Teacher1, Class14, 27
   ... n ...
  
   n+1) Teacher2, Class21, 45
   n+2) Teacher2, Class22, 44
   n+3) Teacher2, Class23, 43
   n+4) Teacher2, Class24, 42
   ... n+m ...
  
    => return the 3 lines 1, 2, 3 for Teacher1 and lines n+1, n+2, n+3 for
   Teacher2
  
  
    On 24 April 2012 at 09:52, Jay Vyas jayunit...@gmail.com wrote:
  
    It's somewhat tricky to understand exactly what you need from your
    explanation, but I believe you want teachers who have the most students
    in a given class.  So for English, I have 10 teachers teaching the class,
    and I want the ones with the highest # of students.

    You can output key=classid, value=-1*#ofstudents,teacherid as the
    values.

    The values will then be sorted by # of students.  You can thus pick the
    teacher from the first value in your reducer, and that will be the
    teacher for class id = xyz, with the highest number of students.

    You can also be smart in your mapper by running a combiner to remove
    the teacherids that are clearly not maximal.
   
On Mon, Apr 23, 2012 at 9:38 PM, Lac Trung trungnb3...@gmail.com
   wrote:
   
 Hello everyone !

 I have a problem with MapReduce [:(] like this:
 I have 4 input files with 3 fields: teacherId, classId, numberOfStudent
 (numberOfStudent is ordered descending for each teacher).
 The output is the top 30 classIds with the highest numberOfStudent for each
 teacher.
 My approach is a MapReduce job like the WordCount example, but I don't know
 how to determine the key for the map function.
 I have run the WordCount example and understand its code, but I have no
 experience in programming MapReduce.

 Can anyone help me to resolve this problem?
 Thanks so much !


 --
 Lạc Trung
 20083535

   
   
   
--
Jay Vyas
MMSB/UCHC
   
  
  
  
   --
   Lạc Trung
   20083535
  
 
 
 
  --
  Jay Vyas
  MMSB/UCHC
 
 
 
 
  --
  Lạc Trung
  20083535
 
 


 --
 Lạc Trung
 20083535




-- 
Lạc Trung
20083535


Re: isSplitable() problem

2012-04-24 Thread Robert Evans
The current code guarantees that they will be received in order.  There are some
patches that are likely to go in soon that would allow the JVM itself to be
reused.  In those cases I believe that the mapper class would be recreated, so
the only thing you would have to worry about would be static values that are
updated while processing the data.
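
A hedged illustration of that static-state pitfall (class and field names are
illustrative only): with JVM reuse a static field stays in place across task
attempts running in the same JVM, so per-task state is safer as instance state
initialized in setup().

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CountingMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
  private static long riskyCount = 0;  // survives JVM reuse; may carry over between tasks
  private long safeCount;              // per-task instance state

  @Override
  protected void setup(Context context) {
    safeCount = 0;                     // re-initialized for every task attempt
  }

  @Override
  protected void map(LongWritable key, Text value, Context context) {
    riskyCount++;                      // could start non-zero in a reused JVM
    safeCount++;
  }
}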

-- Bobby Evans

On 4/24/12 4:45 AM, Dan Drew wirefr...@googlemail.com wrote:

I have chosen to use Jay's suggestion as a quick workaround and am pleased
to report that it seems to work well on small test inputs.

My question now is, are the mappers guaranteed to receive the file's lines
in order?

Browsing the source suggests this is so, but I just want to make sure, as my
understanding of Hadoop is still insubstantial.

Thank you for your patience in answering my questions.

On 23 April 2012 14:28, Harsh J ha...@cloudera.com wrote:

 Jay,

 On Mon, Apr 23, 2012 at 6:43 PM, JAX jayunit...@gmail.com wrote:
  Curious : Seems like you could aggregate the results in the mapper as a
 local variable or list of strings--- is there a way to know that your
 mapper has just read the LAST line of an input split?

 True. That can be one way to do it (unless the aggregation of 'records' needs
 to happen live and you don't wish to store it all in memory).

  Is there a cleanup or finalize method in mappers that is run at the
 end of a whole stream read, to support this sort of chunked, in-memory
 map/reduce operation?

 Yes there is. See:

 Old API:
 http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/Mapper.html
 (See Closeable's close())

 New API:
 http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapreduce/Mapper.html#cleanup(org.apache.hadoop.mapreduce.Mapper.Context)


 --
 Harsh J




Re: hadoop streaming and a directory containing large number of .tgz files

2012-04-24 Thread Sunil S Nandihalli
Sorry for reforwarding this email. I was not sure if it actually got
through since I just got the confirmation regarding my membership to the
mailing list.
Thanks,
Sunil.

On Tue, Apr 24, 2012 at 7:12 PM, Sunil S Nandihalli 
sunil.nandiha...@gmail.com wrote:

 Hi Everybody,
  I am a newbie to Hadoop. I have about 40K .tgz files, each of
 approximately 3 MB. I would like to process them as if they were a single
 large file formed by
 cat list-of-files | gnuparallel 'tar -Oxvf {} | sed 1d' > output.txt
 How can I achieve this using hadoop-streaming or some other similar
 library?


 thanks,
 Sunil.
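
One hedged, non-streaming sketch of the same idea in the Java API: feed the job a
text file listing one .tgz path per line (e.g. via NLineInputFormat), and have the
mapper open each archive and emit its decompressed lines, skipping the first line
of each archive as the sed 1d step does. Class names are illustrative, and this
assumes Apache Commons Compress is on the job classpath.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.zip.GZIPInputStream;

import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TgzLinesMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    Path tgz = new Path(value.toString().trim());        // one .tgz path per input line
    FileSystem fs = tgz.getFileSystem(context.getConfiguration());
    TarArchiveInputStream tar =
        new TarArchiveInputStream(new GZIPInputStream(fs.open(tgz)));
    try {
      boolean firstLine = true;                          // mimic "sed 1d" per archive
      TarArchiveEntry entry;
      while ((entry = tar.getNextTarEntry()) != null) {
        if (entry.isDirectory()) {
          continue;
        }
        BufferedReader reader = new BufferedReader(new InputStreamReader(tar));
        String line;
        while ((line = reader.readLine()) != null) {
          if (firstLine) {
            firstLine = false;
            continue;
          }
          context.write(new Text(line), NullWritable.get());
        }
      }
    } finally {
      tar.close();
    }
  }
}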



Re: hadoop streaming and a directory containing large number of .tgz files

2012-04-24 Thread Raj Vishwanathan
Sunil

You could use identity mappers, a single identity reducer, and no output
compression.
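
A hedged sketch of that setup in the Java API (in streaming the rough equivalent
is -mapper cat -reducer cat with -D mapred.reduce.tasks=1 and no output
compression); note that with TextInputFormat the identity map keys are byte
offsets, which the default text output would prepend to each line:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ConcatJob {
  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "concatenate");
    job.setMapperClass(Mapper.class);        // base Mapper is the identity map
    job.setReducerClass(Reducer.class);      // base Reducer is the identity reduce
    job.setNumReduceTasks(1);                // single output part file
    FileOutputFormat.setCompressOutput(job, false);
    // input/output paths, key/value classes etc. would be set here as usual
  }
}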

Raj




 From: Sunil S Nandihalli sunil.nandiha...@gmail.com
To: common-user@hadoop.apache.org 
Sent: Tuesday, April 24, 2012 7:01 AM
Subject: Re: hadoop streaming and a directory containing large number of .tgz 
files
 
Sorry for reforwarding this email. I was not sure if it actually got
through since I just got the confirmation regarding my membership to the
mailing list.
Thanks,
Sunil.

On Tue, Apr 24, 2012 at 7:12 PM, Sunil S Nandihalli 
sunil.nandiha...@gmail.com wrote:

 Hi Everybody,
  I am a newbie to Hadoop. I have about 40K .tgz files, each of
 approximately 3 MB. I would like to process them as if they were a single
 large file formed by
 cat list-of-files | gnuparallel 'tar -Oxvf {} | sed 1d' > output.txt
 How can I achieve this using hadoop-streaming or some other similar
 library?


 thanks,
 Sunil.





RE: hadoop streaming and a directory containing large number of .tgz files

2012-04-24 Thread Devaraj k
Hi Sunil,

Please check HarFileSystem (Hadoop Archive FileSystem); it will be useful
for solving your problem.
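
A hedged example of how an existing HAR could then be used as job input (the
archive itself is created beforehand with the hadoop archive tool; the path
below is purely illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class HarInputExample {
  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "read-from-har");
    // Point the job at the archive via the har:// scheme.
    FileInputFormat.addInputPath(job, new Path("har:///user/sunil/tgz-files.har"));
    // mapper, reducer, output path etc. would be configured here as usual
  }
}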

Thanks
Devaraj

From: Sunil S Nandihalli [sunil.nandiha...@gmail.com]
Sent: Tuesday, April 24, 2012 7:12 PM
To: common-user@hadoop.apache.org
Subject: hadoop streaming and a directory containing large number of .tgz files

Hi Everybody,
 I am a newbie to Hadoop. I have about 40K .tgz files, each of approximately
3 MB. I would like to process them as if they were a single large file formed
by
cat list-of-files | gnuparallel 'tar -Oxvf {} | sed 1d' > output.txt
How can I achieve this using hadoop-streaming or some other similar
library?


thanks,
Sunil.


why does Text.setCapacity not double the array size as in most dynamic array implementations?

2012-04-24 Thread Jim Donofrio

  private void setCapacity(int len, boolean keepData) {
    if (bytes == null || bytes.length < len) {
  byte[] newBytes = new byte[len];
      if (bytes != null && keepData) {
System.arraycopy(bytes, 0, newBytes, 0, length);
  }
  bytes = newBytes;
}
  }

Why does Text.setCapacity only expand the array to the length of the new 
data? Why not instead set the length to double the new requested length 
or 3/2 times the existing length as in ArrayList so that the array size 
will grow exponentially, as in most dynamic array implementations?
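
A hedged sketch of the exponential-growth variant being asked about (the general
idea, not necessarily the exact change that was later committed):

  private void setCapacity(int len, boolean keepData) {
    if (bytes == null || bytes.length < len) {
      // Grow to at least len, but by at least 1.5x the current size, so repeated
      // appends cost amortized O(1) instead of a full copy per call.
      int newSize = (bytes == null) ? len : Math.max(len, bytes.length + (bytes.length >> 1));
      byte[] newBytes = new byte[newSize];
      if (bytes != null && keepData) {
        System.arraycopy(bytes, 0, newBytes, 0, length);
      }
      bytes = newBytes;
    }
  }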




Re: why does Text.setCapacity not double the array size as in most dynamic array implementations?

2012-04-24 Thread Jim Donofrio
Sorry, I just stumbled across HADOOP-6109, which made this change in
trunk; I was looking at the Text in 1.0.2. Can this fix be backported
to the Hadoop 1 versions?


On 04/24/2012 11:01 PM, Jim Donofrio wrote:

private void setCapacity(int len, boolean keepData) {
  if (bytes == null || bytes.length < len) {
    byte[] newBytes = new byte[len];
    if (bytes != null && keepData) {
      System.arraycopy(bytes, 0, newBytes, 0, length);
    }
    bytes = newBytes;
  }
}

Why does Text.setCapacity only expand the array to the length of the new
data? Why not instead set the length to double the new requested length
or 3/2 times the existing length as in ArrayList so that the array size
will grow exponentially, as in most dynamic array implementations?