Re: hadoop.tmp.dir with multiple disks

2012-04-23 Thread mete
Harsh, thanks for the heads up, that seemed to do the trick.

Jay, I am building local files from the input, then compressing them on the
local drive, then copying back to HDFS.
So in my case it is really about IO to the local fs.

On Sun, Apr 22, 2012 at 5:44 PM, Edward Capriolo edlinuxg...@gmail.com wrote:

 Since each Hadoop task is isolated from the others, having more tmp
 directories allows you to isolate that disk bandwidth as well. By
 listing the disks you give more firepower to the shuffle-sort and
 merge processes.

 Edward

 On Sun, Apr 22, 2012 at 10:02 AM, Jay Vyas jayunit...@gmail.com wrote:
  I don't understand why multiple disks would be particularly beneficial for
  a Map/Reduce job. Wouldn't a map/reduce job be I/O *as well as CPU*
  bound? I would think that simply reading and parsing large files would
  still require dedicated CPU blocks?
 
  On Sun, Apr 22, 2012 at 3:14 AM, Harsh J ha...@cloudera.com wrote:
 
  You can use mapred.local.dir for this purpose. It accepts a list of
  directories tasks may use, just like dfs.data.dir uses multiple disks
  for block writes/reads.
 
  On Sun, Apr 22, 2012 at 12:50 PM, mete efk...@gmail.com wrote:
   Hello folks,
  
   I have a job that processes text files from hdfs on local fs (temp
   directory) and then copies those back to hdfs.
   I added another drive to each server to have better IO performance, but
   as far as I could see hadoop.tmp.dir will not benefit from multiple
   disks, even if I set up two different folders on different disks
   (dfs.data.dir works fine). As a result the disk with the temp folder
   set is highly utilized, whereas the other one is a little bit idle.
   Does anyone have an idea on what to do? (i am using cdh3u3)
  
   Thanks in advance
   Mete
 
 
 
  --
  Harsh J
 
 
 
 
  --
  Jay Vyas
  MMSB/UCHC
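
Harsh's suggestion above corresponds to a mapred-site.xml entry like the
following (the paths are illustrative; mapred.local.dir takes a
comma-separated list, ideally one directory per physical disk):

```xml
<!-- mapred-site.xml: spread task-local/shuffle data across disks.
     Paths below are examples only; use one directory per physical disk. -->
<property>
  <name>mapred.local.dir</name>
  <value>/disk1/mapred/local,/disk2/mapred/local</value>
</property>
```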



Re: Reading data output by MapFileOutputFormat

2012-04-23 Thread Harsh J
Ali,

MapFiles are explained at
http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/MapFile.html
- Please give it a read and it should solve half your questions. In
short, a MapFile is two files - one raw SequenceFile and an index
file built on top of it.

The reason MR does not provide a MapFileInputFormat is that you don't
need the index file in MR jobs (no lookups for input-driven
jobs). Hence SequenceFileInputFormat suffices to read the data (it
ignores the index file and only reads the sequence files that carry
the data).

If you wish to make use of MapFile's index abilities for lookups/etc.,
use the MapFile.Reader class directly in your implementation.

On Mon, Apr 23, 2012 at 4:23 PM, Ali Safdar Kureishy
safdar.kurei...@gmail.com wrote:
 Hi,

 If I use a *MapFileOutputFormat* to output some data, I see that each
 reducer's output is a folder (part-0, for example), and inside that
 folder are two files: data and index.

 However, there is no corresponding MapFileInputFormat, to read back this
 folder (part-0). Instead, *SequenceFileInputFormat* seems to read the
 data. So, I have some questions:
 - does SequenceFileInputFormat actually read *all* the data that was output
 by MapFileOutputFormat? Or is some relationship data between the data and
 index files lost in this process that would have been better handled by
 another InputFormat class? In other words, is SequenceFileInputFormat the
 right InputFormat to read data written by MapFileOutputFormat?
 - how is it that SequenceFileInputFormat works to read outputs from
 *both* MapFileOutputFormat and SequenceFileOutputFormat? That would
 imply that
 MapFileOutputFormat and SequenceFileOutputFormat output the same data, OR
 that SequenceFileInputFormat internally handles both differently. What is
 the reality?

 Thanks,
 Safdar



-- 
Harsh J
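
The data/index layout described above can be illustrated with a small
Python sketch. This is not Hadoop's on-disk format or API, just the idea:
a sorted data file plus a sparse index of every Nth key that narrows a
lookup to a short scan.

```python
# Sketch of the MapFile idea: a sorted "data" sequence plus a sparse
# "index" of every Nth key, used to narrow a lookup to a small scan range.
# Illustration only, not Hadoop's actual MapFile format.
INDEX_INTERVAL = 2  # Hadoop's default is 128; small here for illustration

def build_index(data):
    """data: list of (key, value) pairs sorted by key."""
    return [(k, pos) for pos, (k, _) in enumerate(data) if pos % INDEX_INTERVAL == 0]

def lookup(data, index, key):
    # Find the last indexed key <= the search key, then scan forward.
    start = 0
    for k, pos in index:
        if k <= key:
            start = pos
        else:
            break
    for k, v in data[start:]:
        if k == key:
            return v
        if k > key:
            break
    return None

data = [("a", 1), ("c", 2), ("e", 3), ("g", 4), ("i", 5)]
index = build_index(data)           # [("a", 0), ("e", 2), ("i", 4)]
print(lookup(data, index, "e"))     # 3
print(lookup(data, index, "b"))     # None (key absent)
```

A plain SequenceFile reader would simply scan the data pairs in order and
ignore the index, which is why SequenceFileInputFormat can read a MapFile's
data file unchanged.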


isSplitable() problem

2012-04-23 Thread Dan Drew
I require each input file to be processed as a whole by a single mapper.

I subclass c.o.a.h.mapreduce.lib.input.TextInputFormat and override
isSplitable() to invariably return false.

The job is configured to use this subclass as the input format class via
setInputFormatClass(). The job runs without error, yet the logs reveal
files are still processed line by line by the mappers.

Any help would be greatly appreciated,
Thanks


Re: isSplitable() problem

2012-04-23 Thread Harsh J
Dan,

Splitting a file and reading a whole file as one chunk are two slightly
different things. The former controls whether your files ought to be split
across mappers (relevant when a file has multiple blocks in HDFS). The
latter needs to be achieved differently.

TextInputFormat provides by default a LineRecordReader, which, as its
name suggests, reads whatever stream is provided to it line by line.
This is regardless of the file's block splits (a very different thing
from line splits).

You need to implement your own RecordReader and return it from your
InputFormat to do what you want, i.e. read the whole stream
into an object and then pass it out to the Mapper.

On Mon, Apr 23, 2012 at 5:10 PM, Dan Drew wirefr...@googlemail.com wrote:
 I require each input file to be processed by each mapper as a whole.

 I subclass c.o.a.h.mapreduce.lib.input.TextInputFormat and override
 isSplitable() to invariably return false.

 The job is configured to use this subclass as the input format class via
 setInputFormatClass(). The job runs without error, yet the logs reveal
 files are still processed line by line by the mappers.

 Any help would be greatly appreciated,
 Thanks



-- 
Harsh J
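
The whole-stream reading Harsh describes is the classic whole-file
RecordReader pattern. A minimal Python sketch of the difference between the
two kinds of readers (illustrative only, not the Hadoop RecordReader API):

```python
import io

def line_records(stream):
    """What LineRecordReader effectively does: one record per line."""
    for line in stream:
        yield line.rstrip("\n")

def whole_file_record(stream):
    """What a custom whole-file RecordReader does: one record per file."""
    yield stream.read()

f = io.StringIO("first line\nsecond line\n")
print(list(line_records(f)))        # ['first line', 'second line']
f.seek(0)
print(list(whole_file_record(f)))   # ['first line\nsecond line\n']
```

In Hadoop terms, isSplitable() returning false only guarantees that one
mapper receives the whole file; the record reader still decides what a
"record" is within that file.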


Re: isSplitable() problem

2012-04-23 Thread Dan Drew
Thanks for the clarification.

On 23 April 2012 12:52, Harsh J ha...@cloudera.com wrote:

 Dan,

 Splitting a file and reading a whole file as one chunk are two slightly
 different things. The former controls whether your files ought to be split
 across mappers (relevant when a file has multiple blocks in HDFS). The
 latter needs to be achieved differently.

 TextInputFormat provides by default a LineRecordReader, which, as its
 name suggests, reads whatever stream is provided to it line by line.
 This is regardless of the file's block splits (a very different thing
 from line splits).

 You need to implement your own RecordReader and return it from your
 InputFormat to do what you want, i.e. read the whole stream
 into an object and then pass it out to the Mapper.

 On Mon, Apr 23, 2012 at 5:10 PM, Dan Drew wirefr...@googlemail.com
 wrote:
  I require each input file to be processed by each mapper as a whole.
 
  I subclass c.o.a.h.mapreduce.lib.input.TextInputFormat and override
  isSplitable() to invariably return false.
 
  The job is configured to use this subclass as the input format class via
  setInputFormatClass(). The job runs without error, yet the logs reveal
  files are still processed line by line by the mappers.
 
  Any help would be greatly appreciated,
  Thanks



 --
 Harsh J



Re: Algorithms used in fairscheduler 0.20.205

2012-04-23 Thread Merto Mertek
Anyone?

On 19 April 2012 17:34, Merto Mertek masmer...@gmail.com wrote:

 I found that the closest doc matching the current implementation of
 the fairscheduler is this document:
 http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-55.html
 from Matei Zaharia et al. Another document, on delay scheduling, is from
 2010.

 a) I am interested in whether any newer documented version of the
 implementation exists.
 b) Are there any other algorithms, in addition to delay scheduling, the
 copy-compute splitting algorithm and the fairshare calculation algorithm,
 that are important for cluster performance and fair sharing?
 c) Is there any connection between copy-compute splitting and the
 mapreduce phases (copy-sort-reduce)?

 Thank you..



Re: Reading data output by MapFileOutputFormat

2012-04-23 Thread Ali Safdar Kureishy
Thanks Harsh! This is very helpful.

Regards,
Ali

On Mon, Apr 23, 2012 at 2:08 PM, Harsh J ha...@cloudera.com wrote:
 Ali,

 MapFiles are explained at
 http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/MapFile.html
 - Please give it a read and it should solve half your questions. In
  short, a MapFile is two files - one raw SequenceFile and an index
  file built on top of it.

 The reason MR does not provide a MapFileInputFormat is that you don't
 need to use the index file in MR jobs (no lookups for input-driven
 jobs). Hence the SequenceFileInputFormat suffices to read the data (it
  ignores the index file and only reads the sequence files that carry
  the data).

 If you wish to make use of MapFile's index abilities for lookups/etc.,
 use the MapFile.Reader class directly in your implementation.

 On Mon, Apr 23, 2012 at 4:23 PM, Ali Safdar Kureishy
 safdar.kurei...@gmail.com wrote:
 Hi,

 If I use a *MapFileOutputFormat* to output some data, I see that each
 reducer's output is a folder (part-0, for example), and inside that
 folder are two files: data and index.

 However, there is no corresponding MapFileInputFormat, to read back this
 folder (part-0). Instead, *SequenceFileInputFormat* seems to read the
 data. So, I have some questions:
 - does SequenceFileInputFormat actually read *all* the data that was output
 by MapFileOutputFormat? Or is some relationship data between the data and
 index files lost in this process that would have been better handled by
 another InputFormat class? In other words, is SequenceFileInputFormat the
 right InputFormat to read data written by MapFileOutputFormat?
  - how is it that SequenceFileInputFormat works to read outputs from
  *both* MapFileOutputFormat and SequenceFileOutputFormat? That would
  imply that
 MapFileOutputFormat and SequenceFileOutputFormat output the same data, OR
 that SequenceFileInputFormat internally handles both differently. What is
 the reality?

 Thanks,
 Safdar



 --
 Harsh J


Re: How to set the KeyRange in an Hadoop+Cassandra job

2012-04-23 Thread Harsh J
Hey Filippo,

I think this question best belongs on the Cassandra user list
(u...@cassandra.apache.org), as it is pretty specific to Cassandra's APIs
and implementation.

On Mon, Apr 23, 2012 at 6:37 PM, Filippo Diotalevi fili...@ntoklo.com wrote:
 Hi,
 I'm trying to set the KeyRange using the ConfigHelper API for a Cassandra 
 1.0.8 + Hadoop job.

 The only API I see is

 public static void setInputRange(Configuration conf, String startToken, 
 String endToken)

 which allows me to set a key range specifying the startToken and endToken, 
 but I'd like to set the starting and ending row key. How can I achieve that?

 Thanks,
 --
 Filippo Diotalevi
 fili...@ntoklo.com
 m: +44(0)7543 286 481
 www.ntoklo.com




-- 
Harsh J


Re: Distributing MapReduce on a computer cluster

2012-04-23 Thread Prashant Kommireddi
Shailesh, there's a lot that goes into distributing work across
tasks/nodes. It's not just distributing work; fault tolerance,
data locality, etc. also come into play. It might be good to refer
to the Hadoop Apache docs or Tom White's definitive guide.

Sent from my iPhone

On Apr 23, 2012, at 11:03 AM, Shailesh Samudrala shailesh2...@gmail.com wrote:

 Hello,

 I am trying to design my own MapReduce implementation and I want to know
 how Hadoop is able to distribute its workload across multiple computers.
 Can anyone shed more light on this? Thanks!


Design question

2012-04-23 Thread Mohit Anchlia
I just wanted to check how people design their storage directories for
data that is sent to the system continuously. For example, for a given
functionality we get a data feed continuously written to a sequence file,
which is then converted to a more structured format using map reduce and
stored in tab-separated files. For such a continuous feed, what's the best
way to organize directories and their names? Should it be based just on
timestamp, or is there something better that helps in organizing the data?

Second part of the question: is it better to store output in sequence
files so that we can take advantage of per-record compression? This seems
to be required, since gzip/snappy compression of an entire file would
launch only one map task.

And the last question: when compressing a flat file, should it first be
split into multiple files so that we get multiple mappers if we need to
run another job on this file? LZO is another alternative, but it requires
additional configuration; is it preferred?

Any articles or suggestions would be very helpful.


Determine the key of Map function

2012-04-23 Thread Lac Trung
Hello everyone !

I have a problem with MapReduce [:(] like this:
I have 4 input files with 3 fields: teacherId, classId, numberOfStudent
(numberOfStudent is sorted descending for each teacher).
The output is the top 30 classIds with the highest numberOfStudent for
each teacher.
My approach is MapReduce like the Wordcount example, but I don't know how
to determine the key for the map function.
I ran the Wordcount example and understood its code, but I have no
experience in programming MapReduce.

Can anyone help me to resolve this problem ?
Thanks so much !


-- 
Lạc Trung
20083535


Re: Determine the key of Map function

2012-04-23 Thread Jay Vyas
It's somewhat tricky to understand exactly what you need from your
explanation, but I believe you want the teachers who have the most students
in a given class. So for English, if I have 10 teachers teaching the class,
I want the ones with the highest # of students.

You can output key = classid, value = (-1 * #ofstudents, teacherid) as the
values.

The values will then be sorted by # of students. You can thus pick the
teacher in the first value of your reducer, and that will be the teacher
for class id = xyz, with the highest number of students.

You can also be smart in your mapper by running a combiner to remove the
teacherids that are clearly not maximal.

On Mon, Apr 23, 2012 at 9:38 PM, Lac Trung trungnb3...@gmail.com wrote:

 Hello everyone !

 I have a problem with MapReduce [:(] like that :
 I have 4 file input with 3 fields : teacherId, classId, numberOfStudent
 (numberOfStudent is ordered by desc for each teach)
 Output is top 30 classId that numberOfStudent is max for each teacher.
 My approach is MapReduce like Wordcount example. But I don't know how to
 determine key for map function.
 I run Wordcount example, understand its code but I have no experience at
 programming MapReduce.

 Can anyone help me to resolve this problem ?
 Thanks so much !


 --
 Lạc Trung
 20083535




-- 
Jay Vyas
MMSB/UCHC


Re: Determine the key of Map function

2012-04-23 Thread Lac Trung
Hi Jay!
I think it's a bit different here. I want to get the 30 classIds with the
most students for each teacherId.
For example : get 3 classId.
(File1)
1) Teacher1, Class11, 30
2) Teacher1, Class12, 29
3) Teacher1, Class13, 28
4) Teacher1, Class14, 27
... n ...

n+1) Teacher2, Class21, 45
n+2) Teacher2, Class22, 44
n+3) Teacher2, Class23, 43
n+4) Teacher2, Class24, 42
... n+m ...

=> return lines 1, 2, 3 for Teacher1 and lines n+1, n+2, n+3 for Teacher2


At 09:52 on 24 April 2012, Jay Vyas jayunit...@gmail.com wrote:

 Its somewhat tricky to understand exactly what you need from your
 explanation, but I believe you want teachers who have the most students in
 a given class.  So for English, i have 10 teachers teaching the class - and
 i want the ones with the highes # of students.

 You can output key= classid, value=-1*#ofstudent,teacherid as the
 values.

 The values will then be sorted, by # of students.  You can thus pick
 teacher in the the first value of your reducer, and that will be the
 teacher for class id = xyz , with the highes number of students.

 You can also be smart in your mapper by running a combiner to remove the
 teacherids who are clearly not maximal.

 On Mon, Apr 23, 2012 at 9:38 PM, Lac Trung trungnb3...@gmail.com wrote:

  Hello everyone !
 
  I have a problem with MapReduce [:(] like that :
  I have 4 file input with 3 fields : teacherId, classId, numberOfStudent
  (numberOfStudent is ordered by desc for each teach)
  Output is top 30 classId that numberOfStudent is max for each teacher.
  My approach is MapReduce like Wordcount example. But I don't know how to
  determine key for map function.
  I run Wordcount example, understand its code but I have no experience at
  programming MapReduce.
 
  Can anyone help me to resolve this problem ?
  Thanks so much !
 
 
  --
  Lạc Trung
  20083535
 



 --
 Jay Vyas
 MMSB/UCHC




-- 
Lạc Trung
20083535


Re: Determine the key of Map function

2012-04-23 Thread Jay Vyas
Ahh... Well then the key will be teacher, and the value will simply be

(-1 * #students, class_id).

Then, you will see in the reducer that the first 3 entries will always be
the ones you wanted.

On Mon, Apr 23, 2012 at 10:17 PM, Lac Trung trungnb3...@gmail.com wrote:

 Hi Jay !
 I think it's a bit difference here. I want to get 30 classId for each
 teacherId that have most students.
 For example : get 3 classId.
 (File1)
 1) Teacher1, Class11, 30
 2) Teacher1, Class12, 29
 3) Teacher1, Class13, 28
 4) Teacher1, Class14, 27
 ... n ...

 n+1) Teacher2, Class21, 45
 n+2) Teacher2, Class22, 44
 n+3) Teacher2, Class23, 43
 n+4) Teacher2, Class24, 42
 ... n+m ...

 => return lines 1, 2, 3 for Teacher1 and lines n+1, n+2, n+3 for Teacher2


 At 09:52 on 24 April 2012, Jay Vyas jayunit...@gmail.com wrote:

  Its somewhat tricky to understand exactly what you need from your
  explanation, but I believe you want teachers who have the most students
 in
  a given class.  So for English, i have 10 teachers teaching the class -
 and
  i want the ones with the highes # of students.
 
  You can output key= classid, value=-1*#ofstudent,teacherid as the
  values.
 
  The values will then be sorted, by # of students.  You can thus pick
  teacher in the the first value of your reducer, and that will be the
  teacher for class id = xyz , with the highes number of students.
 
  You can also be smart in your mapper by running a combiner to remove the
  teacherids who are clearly not maximal.
 
  On Mon, Apr 23, 2012 at 9:38 PM, Lac Trung trungnb3...@gmail.com
 wrote:
 
   Hello everyone !
  
   I have a problem with MapReduce [:(] like that :
   I have 4 file input with 3 fields : teacherId, classId, numberOfStudent
   (numberOfStudent is ordered by desc for each teach)
   Output is top 30 classId that numberOfStudent is max for each teacher.
   My approach is MapReduce like Wordcount example. But I don't know how
 to
   determine key for map function.
   I run Wordcount example, understand its code but I have no experience
 at
   programming MapReduce.
  
   Can anyone help me to resolve this problem ?
   Thanks so much !
  
  
   --
   Lạc Trung
   20083535
  
 
 
 
  --
  Jay Vyas
  MMSB/UCHC
 



 --
 Lạc Trung
 20083535




-- 
Jay Vyas
MMSB/UCHC
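
Jay's scheme can be simulated outside Hadoop. Below is a Python sketch of
the map, shuffle/sort, and reduce steps with the negated student count as
the sort key (top 3 per teacher, matching the example). Note that plain
MapReduce does not sort reducer values by default, so the explicit sort
here stands in for the secondary-sort setup a real job would need.

```python
from itertools import groupby

def map_phase(records):
    # key = teacherId; value sorts by negated count so the largest come first
    for teacher, clazz, students in records:
        yield teacher, (-students, clazz)

def shuffle_sort(pairs):
    # Simulates the framework sorting by key (and here, also by value)
    return sorted(pairs)

def reduce_phase(pairs, top_n=3):
    # Per teacher, the first top_n values are the classes with most students
    for teacher, group in groupby(pairs, key=lambda kv: kv[0]):
        yield teacher, [clazz for _, (_, clazz) in list(group)[:top_n]]

records = [
    ("Teacher1", "Class11", 30), ("Teacher1", "Class12", 29),
    ("Teacher1", "Class13", 28), ("Teacher1", "Class14", 27),
    ("Teacher2", "Class21", 45), ("Teacher2", "Class22", 44),
    ("Teacher2", "Class23", 43), ("Teacher2", "Class24", 42),
]
pairs = shuffle_sort(map_phase(records))
print(dict(reduce_phase(pairs)))
# {'Teacher1': ['Class11', 'Class12', 'Class13'], 'Teacher2': ['Class21', 'Class22', 'Class23']}
```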


Re: Determine the key of Map function

2012-04-23 Thread Lac Trung
Thanks Jay so much !
I will try this.
^^

At 10:52 on 24 April 2012, Jay Vyas jayunit...@gmail.com wrote:

 Ahh... Well than the key will be teacher, and the value will simply be

 -1 * # students, class_id .

 Then, you will see in the reducer that the first 3 entries will always be
 the ones you wanted.

 On Mon, Apr 23, 2012 at 10:17 PM, Lac Trung trungnb3...@gmail.com wrote:

  Hi Jay !
  I think it's a bit difference here. I want to get 30 classId for each
  teacherId that have most students.
  For example : get 3 classId.
  (File1)
  1) Teacher1, Class11, 30
  2) Teacher1, Class12, 29
  3) Teacher1, Class13, 28
  4) Teacher1, Class14, 27
  ... n ...
 
  n+1) Teacher2, Class21, 45
  n+2) Teacher2, Class22, 44
  n+3) Teacher2, Class23, 43
  n+4) Teacher2, Class24, 42
  ... n+m ...
 
  = return 3 line 1, 2, 3 for Teacher1 and line n+1, n+2, n+3 for Teacher2
 
 
  At 09:52 on 24 April 2012, Jay Vyas jayunit...@gmail.com wrote:
 
   Its somewhat tricky to understand exactly what you need from your
   explanation, but I believe you want teachers who have the most students
  in
   a given class.  So for English, i have 10 teachers teaching the class -
  and
   i want the ones with the highes # of students.
  
   You can output key= classid, value=-1*#ofstudent,teacherid as the
   values.
  
   The values will then be sorted, by # of students.  You can thus pick
   teacher in the the first value of your reducer, and that will be the
   teacher for class id = xyz , with the highes number of students.
  
   You can also be smart in your mapper by running a combiner to remove
 the
   teacherids who are clearly not maximal.
  
   On Mon, Apr 23, 2012 at 9:38 PM, Lac Trung trungnb3...@gmail.com
  wrote:
  
Hello everyone !
   
I have a problem with MapReduce [:(] like that :
I have 4 file input with 3 fields : teacherId, classId,
 numberOfStudent
(numberOfStudent is ordered by desc for each teach)
Output is top 30 classId that numberOfStudent is max for each
 teacher.
My approach is MapReduce like Wordcount example. But I don't know how
  to
determine key for map function.
I run Wordcount example, understand its code but I have no experience
  at
programming MapReduce.
   
Can anyone help me to resolve this problem ?
Thanks so much !
   
   
--
Lạc Trung
20083535
   
  
  
  
   --
   Jay Vyas
   MMSB/UCHC
  
 
 
 
  --
  Lạc Trung
  20083535
 



 --
 Jay Vyas
 MMSB/UCHC




-- 
Lạc Trung
20083535


Re: Determine the key of Map function

2012-04-23 Thread Lac Trung
Ah, as I said before, I have no experience in programming MapReduce. So,
can you give me some documents or websites or something about the things
you said above? (Thousand things start hard - Vietnam)
Thanks so much ^^!

At 10:54 on 24 April 2012, Lac Trung trungnb3...@gmail.com wrote:

 Thanks Jay so much !
 I will try this.
 ^^

 At 10:52 on 24 April 2012, Jay Vyas jayunit...@gmail.com wrote:

 Ahh... Well than the key will be teacher, and the value will simply be

 -1 * # students, class_id .

 Then, you will see in the reducer that the first 3 entries will always be
 the ones you wanted.

 On Mon, Apr 23, 2012 at 10:17 PM, Lac Trung trungnb3...@gmail.com
 wrote:

  Hi Jay !
  I think it's a bit difference here. I want to get 30 classId for each
  teacherId that have most students.
  For example : get 3 classId.
  (File1)
  1) Teacher1, Class11, 30
  2) Teacher1, Class12, 29
  3) Teacher1, Class13, 28
  4) Teacher1, Class14, 27
  ... n ...
 
  n+1) Teacher2, Class21, 45
  n+2) Teacher2, Class22, 44
  n+3) Teacher2, Class23, 43
  n+4) Teacher2, Class24, 42
  ... n+m ...
 
  = return 3 line 1, 2, 3 for Teacher1 and line n+1, n+2, n+3 for
 Teacher2
 
 
  At 09:52 on 24 April 2012, Jay Vyas jayunit...@gmail.com wrote:
 
   Its somewhat tricky to understand exactly what you need from your
   explanation, but I believe you want teachers who have the most
 students
  in
   a given class.  So for English, i have 10 teachers teaching the class
 -
  and
   i want the ones with the highes # of students.
  
   You can output key= classid, value=-1*#ofstudent,teacherid as the
   values.
  
   The values will then be sorted, by # of students.  You can thus pick
   teacher in the the first value of your reducer, and that will be the
   teacher for class id = xyz , with the highes number of students.
  
   You can also be smart in your mapper by running a combiner to remove
 the
   teacherids who are clearly not maximal.
  
   On Mon, Apr 23, 2012 at 9:38 PM, Lac Trung trungnb3...@gmail.com
  wrote:
  
Hello everyone !
   
I have a problem with MapReduce [:(] like that :
I have 4 file input with 3 fields : teacherId, classId,
 numberOfStudent
(numberOfStudent is ordered by desc for each teach)
Output is top 30 classId that numberOfStudent is max for each
 teacher.
My approach is MapReduce like Wordcount example. But I don't know
 how
  to
determine key for map function.
I run Wordcount example, understand its code but I have no
 experience
  at
programming MapReduce.
   
Can anyone help me to resolve this problem ?
Thanks so much !
   
   
--
Lạc Trung
20083535
   
  
  
  
   --
   Jay Vyas
   MMSB/UCHC
  
 
 
 
  --
  Lạc Trung
  20083535
 



 --
 Jay Vyas
 MMSB/UCHC




 --
 Lạc Trung
 20083535




-- 
Lạc Trung
20083535



RE: Determine the key of Map function

2012-04-23 Thread Devaraj k
Hi Lac,

 As per my understanding of your problem description, you need to do the
below things.

1. Mapper : Write a mapper which reads records from the input files and
converts them into keys and values. Here the key should contain teacher id,
class id and no. of students; the value can be empty (or null).
2. Partitioner : Write a custom partitioner to send all the records for a
teacher id to one reducer.
3. Grouping Comparator : Write a comparator to group the records based on
teacher id.
4. Sorting Comparator : Write a comparator to sort the records based on
teacher id and no. of students.
5. Reducer : In the reducer, you will get the records for all teachers one
after another, and in sorted order (by no. of students) within each teacher
id. You can keep however many top records you want in the reducer and
finally write them out.

You can refer this doc for reference:
http://www.inf.ed.ac.uk/publications/thesis/online/IM100859.pdf

Thanks
Devaraj
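
The five steps above can be sketched in Python by simulating their effect:
a composite sort on (teacherId, -numberOfStudents), grouping by teacherId
only. The names and the partitioner below are illustrative, not Hadoop API.

```python
from itertools import groupby

NUM_REDUCERS = 2

def partition(teacher_id):
    # Step 2: custom partitioner - all records for a teacher go to one reducer
    return hash(teacher_id) % NUM_REDUCERS

def secondary_sort(records, top_n=3):
    # Step 4: sort comparator effect - order by (teacherId, -students)
    composite = sorted(records, key=lambda r: (r[0], -r[2]))
    # Step 3 + 5: grouping comparator effect - one reduce call per teacherId,
    # with its records already sorted by student count descending
    for teacher, group in groupby(composite, key=lambda r: r[0]):
        yield teacher, [clazz for _, clazz, _ in list(group)[:top_n]]

records = [
    ("Teacher1", "Class13", 28), ("Teacher1", "Class11", 30),
    ("Teacher1", "Class12", 29), ("Teacher1", "Class14", 27),
    ("Teacher2", "Class22", 44), ("Teacher2", "Class21", 45),
]
print(dict(secondary_sort(records)))
# {'Teacher1': ['Class11', 'Class12', 'Class13'], 'Teacher2': ['Class21', 'Class22']}
```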


From: Lac Trung [trungnb3...@gmail.com]
Sent: Tuesday, April 24, 2012 10:11 AM
To: common-user@hadoop.apache.org
Subject: Re: Determine the key of Map function

Ah, as I said before, I have no experience at programming MapReduce. So,
can you give me some documents or websites or something about programming
the thing you said above? (Thousand things start hard - VietNam)
Thanks so much ^^!

At 10:54 on 24 April 2012, Lac Trung trungnb3...@gmail.com wrote:

 Thanks Jay so much !
 I will try this.
 ^^

 At 10:52 on 24 April 2012, Jay Vyas jayunit...@gmail.com wrote:

 Ahh... Well than the key will be teacher, and the value will simply be

 -1 * # students, class_id .

 Then, you will see in the reducer that the first 3 entries will always be
 the ones you wanted.

 On Mon, Apr 23, 2012 at 10:17 PM, Lac Trung trungnb3...@gmail.com
 wrote:

  Hi Jay !
  I think it's a bit difference here. I want to get 30 classId for each
  teacherId that have most students.
  For example : get 3 classId.
  (File1)
  1) Teacher1, Class11, 30
  2) Teacher1, Class12, 29
  3) Teacher1, Class13, 28
  4) Teacher1, Class14, 27
  ... n ...
 
  n+1) Teacher2, Class21, 45
  n+2) Teacher2, Class22, 44
  n+3) Teacher2, Class23, 43
  n+4) Teacher2, Class24, 42
  ... n+m ...
 
  = return 3 line 1, 2, 3 for Teacher1 and line n+1, n+2, n+3 for
 Teacher2
 
 
   At 09:52 on 24 April 2012, Jay Vyas jayunit...@gmail.com wrote:
 
   Its somewhat tricky to understand exactly what you need from your
   explanation, but I believe you want teachers who have the most
 students
  in
   a given class.  So for English, i have 10 teachers teaching the class
 -
  and
   i want the ones with the highes # of students.
  
   You can output key= classid, value=-1*#ofstudent,teacherid as the
   values.
  
   The values will then be sorted, by # of students.  You can thus pick
   teacher in the the first value of your reducer, and that will be the
   teacher for class id = xyz , with the highes number of students.
  
   You can also be smart in your mapper by running a combiner to remove
 the
   teacherids who are clearly not maximal.
  
   On Mon, Apr 23, 2012 at 9:38 PM, Lac Trung trungnb3...@gmail.com
  wrote:
  
Hello everyone !
   
I have a problem with MapReduce [:(] like that :
I have 4 file input with 3 fields : teacherId, classId,
 numberOfStudent
(numberOfStudent is ordered by desc for each teach)
Output is top 30 classId that numberOfStudent is max for each
 teacher.
My approach is MapReduce like Wordcount example. But I don't know
 how
  to
determine key for map function.
I run Wordcount example, understand its code but I have no
 experience
  at
programming MapReduce.
   
Can anyone help me to resolve this problem ?
Thanks so much !
   
   
--
Lạc Trung
20083535
   
  
  
  
   --
   Jay Vyas
   MMSB/UCHC
  
 
 
 
  --
  Lạc Trung
  20083535
 



 --
 Jay Vyas
 MMSB/UCHC




 --
 Lạc Trung
 20083535




--
Lạc Trung
20083535