Re: How to best decide mapper output/reducer input for a huge string?

2013-09-29 Thread Jens Scheidtmann
Dear Pavan,

If it were working well, the runtime would be shorter. What makes you sure this
is HBase or Hadoop related? What percentage of the time is spent in your own
algorithms?

Use System.currentTimeMillis(), run your program on the first 100,000
records single threaded, and print the timings to stdout. See where the time is
spent. Estimate the total runtime from that and see if there is a speed-up.

Jens
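
A minimal, single-threaded sketch of that timing probe, assuming the HBase client
API used elsewhere in this thread; the table name and the 100,000-row cutoff are
placeholders, not details from Pavan's actual job:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;

public class TimingProbe {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "Table1");   // placeholder table name
        Scan scan = new Scan();
        scan.setCaching(750);

        long start = System.currentTimeMillis();
        long rows = 0;
        ResultScanner scanner = table.getScanner(scan);
        for (Result r : scanner) {
            // call your per-row logic here and time it separately if needed
            if (++rows == 100000) {
                break;                               // first 100,000 records only
            }
        }
        scanner.close();
        table.close();

        long elapsed = System.currentTimeMillis() - start;
        System.out.println(rows + " rows in " + elapsed + " ms");
    }
}

Extrapolating from 100,000 rows to the full 19 million gives a rough lower bound on
what any MapReduce version of the same logic can achieve per mapper.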

On Tuesday, 24 September 2013, Pavan Sudheendra wrote:

 No, I'm pretty sure the job is executing fine.. Just that the time it
 takes to complete the whole process, is too much that's all..




Re: How to best decide mapper output/reducer input for a huge string?

2013-09-23 Thread Pavan Sudheendra
@Rahul, Yes you are right. 21 mappers are spawned where all the 21 mappers
are functional at the same time.. Although, @Pradeep, i should do the
compression like you say.. I'll give it a shot.. As far as i can see, i
think i'll need to implement Writable and write out the key of the mapper
using the specific data types instead of writing it out as a string which
might slow the operation down..


On Mon, Sep 23, 2013 at 9:29 AM, Pradeep Gollakota pradeep...@gmail.com wrote:

 Pavan,

 It's hard to tell whether there's anything wrong with your design or not
 since you haven't given us specific enough details. The best thing you can
 do is instrument your code and see what is taking a long time. Rahul
 mentioned a problem that I myself have seen before, with only one region
 (or a couple) having any data. So even if you have 21 regions, only one mapper
 might be doing the heavy lifting.

 A combiner is hugely helpful in terms of reducing the data output of
 mappers. Writing a combiner is a best practice and you should almost always
 have one. Compression can be turned on by setting the following properties
 in your job config.
  <property>
    <name>mapreduce.map.output.compress</name>
    <value>true</value>
  </property>
  <property>
    <name>mapreduce.map.output.compress.codec</name>
    <value>org.apache.hadoop.io.compress.GzipCodec</value>
  </property>
 You can also try other compression codecs such as Lzo, Snappy, Bzip2, etc.
 depending on your use cases. Gzip is really slow but gets the best
 compression ratios. Snappy/Lzo are a lot faster but don't have as good of a
 compression ratio. If your computations are CPU bound, then you'd probably
 want to use Snappy/Lzo. If your computations are I/O bound, and your CPUs
 are idle, you can use Gzip. You'll have to experiment and find the best
 settings for you. There are a lot of other tweaks that you can try to get
 the best performance out of your cluster.
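
 For reference, the same settings can be applied from the driver code instead of an
 XML file, and a combiner can be registered at the same time. This is only a sketch:
 picking Snappy here is an assumption, and reusing Hadoop's IntSumReducer as the
 combiner only makes sense if the map output values are IntWritable counts, which
 the thread does not state.

 import org.apache.hadoop.io.compress.CompressionCodec;
 import org.apache.hadoop.io.compress.SnappyCodec;
 import org.apache.hadoop.mapreduce.Job;
 import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

 public class MapOutputTuning {
     public static void apply(Job job) {
         // compress intermediate map output (same keys as the <property> block above)
         job.getConfiguration().setBoolean("mapreduce.map.output.compress", true);
         job.getConfiguration().setClass("mapreduce.map.output.compress.codec",
                 SnappyCodec.class, CompressionCodec.class);
         // fold IntWritable counts together on the map side before they cross the network
         job.setCombinerClass(IntSumReducer.class);
     }
 }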

 One of the best things you can do is to install Ganglia (or some other
 similar tool) on your cluster and monitor usage of resources while your job
 is running. This will tell you if your job is I/O bound or CPU bound.

 Take a look at this paper by Intel about optimizing your Hadoop cluster
 and see if that fits your deployment.
 http://software.intel.com/sites/default/files/m/f/4/3/2/f/31124-Optimizing_Hadoop_2010_final.pdf

 If your cluster is already optimized and your job is not I/O bound, then
 there might be a problem with your algorithm and might warrant a redesign.

 Hope this helps!
 - Pradeep


 On Sun, Sep 22, 2013 at 8:14 PM, Rahul Bhattacharjee 
 rahul.rec@gmail.com wrote:

 One mapper is spawned per hbase table region. You can use the admin ui of
 hbase to find the number of regions per table. It might happen that all the
 data is sitting in a single region , so a single mapper is spawned and you
 are not getting enough parallel work getting done.

 If that is the case then you can recreate the tables with predefined
 splits to create more regions.

 Thanks,
 Rahul
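
 A sketch of what recreating a table with predefined splits could look like through
 the HBase admin API; the new table name, column family, key range, and region count
 below are illustrative assumptions, not values from Pavan's cluster:

 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.hbase.HBaseConfiguration;
 import org.apache.hadoop.hbase.HColumnDescriptor;
 import org.apache.hadoop.hbase.HTableDescriptor;
 import org.apache.hadoop.hbase.client.HBaseAdmin;
 import org.apache.hadoop.hbase.util.Bytes;

 public class PreSplitTable {
     public static void main(String[] args) throws Exception {
         Configuration conf = HBaseConfiguration.create();
         HBaseAdmin admin = new HBaseAdmin(conf);

         HTableDescriptor desc = new HTableDescriptor("Table1_presplit"); // assumed name
         desc.addFamily(new HColumnDescriptor("cf"));                     // assumed family

         // evenly pre-split an assumed 8-digit row-key range into 21 regions,
         // one per desired mapper
         admin.createTable(desc, Bytes.toBytes("00000000"),
                 Bytes.toBytes("99999999"), 21);
         admin.close();
     }
 }

 The existing data would then be copied or re-imported into the pre-split table.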


 On Sun, Sep 22, 2013 at 4:38 AM, John Lilley john.lil...@redpoint.net wrote:

  Pavan,

 How large are the rows in HBase?  22 million rows is not very much but
 you mentioned “huge strings”.  Can you tell which part of the processing is
 the limiting factor (read from HBase, mapper output, reducers)?

 John


 *From:* Pavan Sudheendra [mailto:pavan0...@gmail.com]
 *Sent:* Saturday, September 21, 2013 2:17 AM
 *To:* user@hadoop.apache.org
 *Subject:* Re: How to best decide mapper output/reducer input for a
 huge string?


 No, I don't have a combiner in place. Is it necessary? How do I make my
 map output compressed? Yes, the Tables in HBase are compressed.

 Although, there's no real bottleneck, the time it takes to process the
 entire table is huge. I have to constantly check if i can optimize it
 somehow.. 

 Oh okay.. I'll implement a Custom Writable.. Apart from that, do you see
 any thing wrong with my design? Does it require any kind of re-work? Thank
 you so much for helping..


 On Sat, Sep 21, 2013 at 1:06 PM, Pradeep Gollakota pradeep...@gmail.com
 wrote:

 One thing that comes to mind is that your keys are Strings which are
 highly inefficient. You might get a lot better performance if you write a
 custom writable for your Key object using the appropriate data types. For
 example, use a long (LongWritable) for timestamps. This should make
 (de)serialization a lot faster. If HouseHoldId is an integer, your speed of
 comparisons for sorting will also go up.
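
 A minimal sketch of such a composite key, using the HouseHoldId (int) and timestamp
 (long) fields named in the thread; the class name, field set, and accessor are
 hypothetical:

 import java.io.DataInput;
 import java.io.DataOutput;
 import java.io.IOException;
 import org.apache.hadoop.io.WritableComparable;

 public class HouseHoldKey implements WritableComparable<HouseHoldKey> {
     private int houseHoldId;
     private long timestamp;

     public HouseHoldKey() {}                          // required no-arg constructor

     public HouseHoldKey(int houseHoldId, long timestamp) {
         this.houseHoldId = houseHoldId;
         this.timestamp = timestamp;
     }

     public int getHouseHoldId() {
         return houseHoldId;
     }

     @Override
     public void write(DataOutput out) throws IOException {
         out.writeInt(houseHoldId);
         out.writeLong(timestamp);
     }

     @Override
     public void readFields(DataInput in) throws IOException {
         houseHoldId = in.readInt();
         timestamp = in.readLong();
     }

     @Override
     public int compareTo(HouseHoldKey other) {        // primary sort on HouseHoldId
         int cmp = Integer.compare(houseHoldId, other.houseHoldId);
         return cmp != 0 ? cmp : Long.compare(timestamp, other.timestamp);
     }

     @Override
     public int hashCode() {                           // used by the default HashPartitioner
         return houseHoldId;
     }
 }

 Binary comparison of fixed-width ints and longs is much cheaper than comparing the
 string form of the whole record.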


 Ensure that your map output's are being compressed. Are your tables in
 HBase compressed? Do you have a combiner?


 Have you been able to profile your code to see where the bottlenecks are?
 


 On Sat, Sep 21, 2013 at 12:04 AM, Pavan Sudheendra pavan0...@gmail.com
 wrote:

 Hi Pradeep,

 Yes.. Basically i'm only writing the key part as the map output.. The V
 of K,V

Re: How to best decide mapper output/reducer input for a huge string?

2013-09-23 Thread Pavan Sudheendra
@John, to be really frank i don't know what the limiting factor is.. It
might be all of them or a subset of them.. Cannot tell..


On Mon, Sep 23, 2013 at 2:58 PM, Pavan Sudheendra pavan0...@gmail.com wrote:

 @Rahul, Yes you are right. 21 mappers are spawned where all the 21 mappers
 are functional at the same time.. Although, @Pradeep, i should do the
 compression like you say.. I'll give it a shot.. As far as i can see, i
 think i'll need to implement Writable and write out the key of the mapper
 using the specific data types instead of writing it out as a string which
 might slow the operation down..


 On Mon, Sep 23, 2013 at 9:29 AM, Pradeep Gollakota 
 pradeep...@gmail.comwrote:

 Pavan,

 It's hard to tell whether there's anything wrong with your design or not
 since you haven't given us specific enough details. The best thing you can
 do is instrument your code and see what is taking a long time. Rahul
 mentioned a problem that I myself have seen before, with only one region
 (or a couple) having any data. So even if you have 21 regions, only one mapper
 might be doing the heavy lifting.

 A combiner is hugely helpful in terms of reducing the data output of
 mappers. Writing a combiner is a best practice and you should almost always
 have one. Compression can be turned on by setting the following properties
 in your job config.
  <property>
    <name>mapreduce.map.output.compress</name>
    <value>true</value>
  </property>
  <property>
    <name>mapreduce.map.output.compress.codec</name>
    <value>org.apache.hadoop.io.compress.GzipCodec</value>
  </property>
 You can also try other compression codecs such as Lzo, Snappy, Bzip2, etc.
 depending on your use cases. Gzip is really slow but gets the best
 compression ratios. Snappy/Lzo are a lot faster but don't have as good of a
 compression ratio. If your computations are CPU bound, then you'd probably
 want to use Snappy/Lzo. If your computations are I/O bound, and your CPUs
 are idle, you can use Gzip. You'll have to experiment and find the best
 settings for you. There are a lot of other tweaks that you can try to get
 the best performance out of your cluster.

 One of the best things you can do is to install Ganglia (or some other
 similar tool) on your cluster and monitor usage of resources while your job
 is running. This will tell you if your job is I/O bound or CPU bound.

 Take a look at this paper by Intel about optimizing your Hadoop cluster
 and see if that fits your deployment.
 http://software.intel.com/sites/default/files/m/f/4/3/2/f/31124-Optimizing_Hadoop_2010_final.pdf

 If your cluster is already optimized and your job is not I/O bound, then
 there might be a problem with your algorithm and might warrant a redesign.

 Hope this helps!
 - Pradeep


 On Sun, Sep 22, 2013 at 8:14 PM, Rahul Bhattacharjee 
 rahul.rec@gmail.com wrote:

 One mapper is spawned per hbase table region. You can use the admin ui
 of hbase to find the number of regions per table. It might happen that all
 the data is sitting in a single region , so a single mapper is spawned and
 you are not getting enough parallel work getting done.

 If that is the case then you can recreate the tables with predefined
 splits to create more regions.

 Thanks,
 Rahul


 On Sun, Sep 22, 2013 at 4:38 AM, John Lilley 
 john.lil...@redpoint.netwrote:

  Pavan,

 How large are the rows in HBase?  22 million rows is not very much but
 you mentioned “huge strings”.  Can you tell which part of the processing is
 the limiting factor (read from HBase, mapper output, reducers)?

 John


 *From:* Pavan Sudheendra [mailto:pavan0...@gmail.com]
 *Sent:* Saturday, September 21, 2013 2:17 AM
 *To:* user@hadoop.apache.org
 *Subject:* Re: How to best decide mapper output/reducer input for a
 huge string?


 No, I don't have a combiner in place. Is it necessary? How do I make my
 map output compressed? Yes, the Tables in HBase are compressed.

 Although, there's no real bottleneck, the time it takes to process the
 entire table is huge. I have to constantly check if i can optimize it
 somehow.. 

 Oh okay.. I'll implement a Custom Writable.. Apart from that, do you
 see any thing wrong with my design? Does it require any kind of re-work?
 Thank you so much for helping..


 On Sat, Sep 21, 2013 at 1:06 PM, Pradeep Gollakota 
 pradeep...@gmail.com wrote:

 One thing that comes to mind is that your keys are Strings which are
 highly inefficient. You might get a lot better performance if you write a
 custom writable for your Key object using the appropriate data types. For
 example, use a long (LongWritable) for timestamps. This should make
 (de)serialization a lot faster. If HouseHoldId is an integer, your speed of
 comparisons for sorting will also go up.


 Ensure that your map output's are being compressed. Are your tables in
 HBase compressed? Do you have a combiner?


 Have you been able to profile your code to see where

RE: How to best decide mapper output/reducer input for a huge string?

2013-09-23 Thread John Lilley
You might try creating a stub MR job in which the mappers produce no output; 
that would isolate the time spent reading from HBase without the trouble of 
instrumenting your code.
John
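
A rough sketch of that stub job: a TableMapper that scans every row but emits
nothing, run map-only with a null output format, so the job time is essentially the
HBase read time. The class name and scan settings are placeholders:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class ScanOnlyJob {
    // Mapper body is empty: rows are read from HBase and then dropped.
    public static class NoOpMapper extends TableMapper<NullWritable, NullWritable> {
        @Override
        protected void map(ImmutableBytesWritable key, Result value, Context context)
                throws IOException, InterruptedException {
            // intentionally no output
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "hbase-scan-only");
        job.setJarByClass(ScanOnlyJob.class);

        Scan scan = new Scan();
        scan.setCaching(750);
        scan.setCacheBlocks(false);

        TableMapReduceUtil.initTableMapperJob("Table1", scan, NoOpMapper.class,
                NullWritable.class, NullWritable.class, job);
        job.setNumReduceTasks(0);                          // map-only
        job.setOutputFormatClass(NullOutputFormat.class);  // discard the (empty) output
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}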


From: Pavan Sudheendra [mailto:pavan0...@gmail.com]
Sent: Monday, September 23, 2013 3:31 AM
To: user@hadoop.apache.org
Subject: Re: How to best decide mapper output/reducer input for a huge string?

@John, to be really frank i don't know what the limiting factor is.. It might 
be all of them or a subset of them.. Cannot tell..

On Mon, Sep 23, 2013 at 2:58 PM, Pavan Sudheendra 
pavan0...@gmail.com wrote:
@Rahul, Yes you are right. 21 mappers are spawned where all the 21 mappers are 
functional at the same time.. Although, @Pradeep, i should do the compression 
like you say.. I'll give it a shot.. As far as i can see, i think i'll need to 
implement Writable and write out the key of the mapper using the specific data 
types instead of writing it out as a string which might slow the operation 
down..

On Mon, Sep 23, 2013 at 9:29 AM, Pradeep Gollakota 
pradeep...@gmail.com wrote:
Pavan,

It's hard to tell whether there's anything wrong with your design or not since 
you haven't given us specific enough details. The best thing you can do is 
instrument your code and see what is taking a long time. Rahul mentioned a 
problem that I myself have seen before, with only one region (or a couple) 
having any data. So even if you have 21 regions, only one mapper might be doing the
heavy lifting.

A combiner is hugely helpful in terms of reducing the data output of mappers. 
Writing a combiner is a best practice and you should almost always have one. 
Compression can be turned on by setting the following properties in your job 
config.
<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.map.output.compress.codec</name>
  <value>org.apache.hadoop.io.compress.GzipCodec</value>
</property>
You can also try other compression codecs such as Lzo, Snappy, Bzip2, etc.
depending on your use cases. Gzip is really slow but gets the best compression 
ratios. Snappy/Lzo are a lot faster but don't have as good of a compression 
ratio. If your computations are CPU bound, then you'd probably want to use 
Snappy/Lzo. If your computations are I/O bound, and your CPUs are idle, you can 
use Gzip. You'll have to experiment and find the best settings for you. There 
are a lot of other tweaks that you can try to get the best performance out of 
your cluster.

One of the best things you can do is to install Ganglia (or some other similar 
tool) on your cluster and monitor usage of resources while your job is running. 
This will tell you if your job is I/O bound or CPU bound.

Take a look at this paper by Intel about optimizing your Hadoop cluster and see 
if that fits your deployment. 
http://software.intel.com/sites/default/files/m/f/4/3/2/f/31124-Optimizing_Hadoop_2010_final.pdf

If your cluster is already optimized and your job is not I/O bound, then there 
might be a problem with your algorithm and might warrant a redesign.

Hope this helps!
- Pradeep

On Sun, Sep 22, 2013 at 8:14 PM, Rahul Bhattacharjee 
rahul.rec@gmail.com wrote:
One mapper is spawned per hbase table region. You can use the admin ui of hbase 
to find the number of regions per table. It might happen that all the data is 
sitting in a single region , so a single mapper is spawned and you are not 
getting enough parallel work getting done.
If that is the case then you can recreate the tables with predefined splits to 
create more regions.
Thanks,
Rahul

On Sun, Sep 22, 2013 at 4:38 AM, John Lilley 
john.lil...@redpoint.net wrote:
Pavan,
How large are the rows in HBase?  22 million rows is not very much but you 
mentioned huge strings.  Can you tell which part of the processing is the 
limiting factor (read from HBase, mapper output, reducers)?
John


From: Pavan Sudheendra [mailto:pavan0...@gmail.com]
Sent: Saturday, September 21, 2013 2:17 AM
To: user@hadoop.apache.org
Subject: Re: How to best decide mapper output/reducer input for a huge string?

No, I don't have a combiner in place. Is it necessary? How do I make my map 
output compressed? Yes, the Tables in HBase are compressed.
Although, there's no real bottleneck, the time it takes to process the entire 
table is huge. I have to constantly check if i can optimize it somehow..
Oh okay.. I'll implement a Custom Writable.. Apart from that, do you see any 
thing wrong with my design? Does it require any kind of re-work? Thank you so 
much for helping..

On Sat, Sep 21, 2013 at 1:06 PM, Pradeep Gollakota 
pradeep...@gmail.com wrote:
One thing that comes to mind is that your keys are Strings which are highly 
inefficient. You might get a lot better performance if you write a custom

RE: How to best decide mapper output/reducer input for a huge string?

2013-09-23 Thread Pavan Sudheendra
No, I'm pretty sure the job is executing fine.. Just that the time it takes
to complete the whole process, is too much that's all..

I didn't mean to say the mapper or the reducer doesn't work.. Just that
it's very slow and I'm trying to figure out where it's happening in my
code.

Regards,
Pavan
On Sep 23, 2013 11:49 PM, John Lilley john.lil...@redpoint.net wrote:

  You might try creating a “stub” MR job in which the mappers produce no
 output; that would isolate the time spent reading from HBase without the
 trouble of instrumenting your code.

 John


 *From:* Pavan Sudheendra [mailto:pavan0...@gmail.com]
 *Sent:* Monday, September 23, 2013 3:31 AM
 *To:* user@hadoop.apache.org
 *Subject:* Re: How to best decide mapper output/reducer input for a huge
 string?


 @John, to be really frank i don't know what the limiting factor is.. It
 might be all of them or a subset of them.. Cannot tell.. 


 On Mon, Sep 23, 2013 at 2:58 PM, Pavan Sudheendra pavan0...@gmail.com
 wrote:

 @Rahul, Yes you are right. 21 mappers are spawned where all the 21 mappers
 are functional at the same time.. Although, @Pradeep, i should do the
 compression like you say.. I'll give it a shot.. As far as i can see, i
 think i'll need to implement Writable and write out the key of the mapper
 using the specific data types instead of writing it out as a string which
 might slow the operation down..


 On Mon, Sep 23, 2013 at 9:29 AM, Pradeep Gollakota pradeep...@gmail.com
 wrote:

 Pavan,


 It's hard to tell whether there's anything wrong with your design or not
 since you haven't given us specific enough details. The best thing you can
 do is instrument your code and see what is taking a long time. Rahul
 mentioned a problem that I myself have seen before, with only one region
 (or a couple) having any data. So even if you have 21 regions, only one mapper
 might be doing the heavy lifting.


 A combiner is hugely helpful in terms of reducing the data output of
 mappers. Writing a combiner is a best practice and you should almost always
 have one. Compression can be turned on by setting the following properties
 in your job config.

 <property>
   <name>mapreduce.map.output.compress</name>
   <value>true</value>
 </property>
 <property>
   <name>mapreduce.map.output.compress.codec</name>
   <value>org.apache.hadoop.io.compress.GzipCodec</value>
 </property>

 You can also try other compression codecs such as Lzo, Snappy, Bzip2, etc.
 depending on your use cases. Gzip is really slow but gets the best
 compression ratios. Snappy/Lzo are a lot faster but don't have as good of a
 compression ratio. If your computations are CPU bound, then you'd probably
 want to use Snappy/Lzo. If your computations are I/O bound, and your CPUs
 are idle, you can use Gzip. You'll have to experiment and find the best
 settings for you. There are a lot of other tweaks that you can try to get
 the best performance out of your cluster.


 One of the best things you can do is to install Ganglia (or some other
 similar tool) on your cluster and monitor usage of resources while your job
 is running. This will tell you if your job is I/O bound or CPU bound.


 Take a look at this paper by Intel about optimizing your Hadoop cluster
 and see if that fits your deployment.
 http://software.intel.com/sites/default/files/m/f/4/3/2/f/31124-Optimizing_Hadoop_2010_final.pdf
 


 If your cluster is already optimized and your job is not I/O bound, then
 there might be a problem with your algorithm and might warrant a redesign.
 


 Hope this helps!

 - Pradeep


 On Sun, Sep 22, 2013 at 8:14 PM, Rahul Bhattacharjee 
 rahul.rec@gmail.com wrote:

 One mapper is spawned per hbase table region. You can use the admin ui of
 hbase to find the number of regions per table. It might happen that all the
 data is sitting in a single region , so a single mapper is spawned and you
 are not getting enough parallel work getting done.

 If that is the case then you can recreate the tables with predefined
 splits to create more regions.

 Thanks,

 Rahul


 On Sun, Sep 22, 2013 at 4:38 AM, John Lilley john.lil...@redpoint.net
 wrote:

 Pavan,

 How large are the rows in HBase?  22 million rows is not very much but you
 mentioned “huge strings”.  Can you tell which part of the processing is the
 limiting factor (read from HBase, mapper output, reducers)?

 John

  

  

 *From:* Pavan Sudheendra [mailto:pavan0...@gmail.com]
 *Sent:* Saturday, September 21, 2013 2:17 AM
 *To:* user@hadoop.apache.org
 *Subject:* Re: How to best decide mapper output/reducer input for a huge
 string?

  

 No, I don't have a combiner in place. Is it necessary? How do I make my
 map output compressed? Yes, the Tables in HBase are compressed.

 Although, there's no real

Re: How to best decide mapper output/reducer input for a huge string?

2013-09-22 Thread Rahul Bhattacharjee
One mapper is spawned per hbase table region. You can use the admin ui of
hbase to find the number of regions per table. It might happen that all the
data is sitting in a single region , so a single mapper is spawned and you
are not getting enough parallel work getting done.

If that is the case then you can recreate the tables with predefined splits
to create more regions.

Thanks,
Rahul


On Sun, Sep 22, 2013 at 4:38 AM, John Lilley john.lil...@redpoint.net wrote:

  Pavan,

 How large are the rows in HBase?  22 million rows is not very much but you
 mentioned “huge strings”.  Can you tell which part of the processing is the
 limiting factor (read from HBase, mapper output, reducers)?

 John


 *From:* Pavan Sudheendra [mailto:pavan0...@gmail.com]
 *Sent:* Saturday, September 21, 2013 2:17 AM
 *To:* user@hadoop.apache.org
 *Subject:* Re: How to best decide mapper output/reducer input for a huge
 string?


 No, I don't have a combiner in place. Is it necessary? How do I make my
 map output compressed? Yes, the Tables in HBase are compressed.

 Although, there's no real bottleneck, the time it takes to process the
 entire table is huge. I have to constantly check if i can optimize it
 somehow.. 

 Oh okay.. I'll implement a Custom Writable.. Apart from that, do you see
 any thing wrong with my design? Does it require any kind of re-work? Thank
 you so much for helping..


 On Sat, Sep 21, 2013 at 1:06 PM, Pradeep Gollakota pradeep...@gmail.com
 wrote:

 One thing that comes to mind is that your keys are Strings which are
 highly inefficient. You might get a lot better performance if you write a
 custom writable for your Key object using the appropriate data types. For
 example, use a long (LongWritable) for timestamps. This should make
 (de)serialization a lot faster. If HouseHoldId is an integer, your speed of
 comparisons for sorting will also go up.


 Ensure that your map output's are being compressed. Are your tables in
 HBase compressed? Do you have a combiner?


 Have you been able to profile your code to see where the bottlenecks are?


 On Sat, Sep 21, 2013 at 12:04 AM, Pavan Sudheendra pavan0...@gmail.com
 wrote:

 Hi Pradeep,

 Yes.. Basically i'm only writing the key part as the map output.. The V of
 K,V is not of much use to me.. But i'm hoping to change that if it leads
 to faster execution.. I'm kind of a newbie so looking to make the
 map/reduce job run a lot faster.. 

 Also, yes. It gets sorted by the HouseHoldID which is what i needed.. But
 seems if i write a map output for each and every row of a 19 m row HBase
 table, its taking nearly a day to complete.. (21 mappers and 21 reducers)


 I have looked at both Pig/Hive to do the job but i'm supposed to do this
 via a MR job.. So, cannot use either of that.. Do you recommend me to try
 something if i have the data in that format?


 On Sat, Sep 21, 2013 at 12:26 PM, Pradeep Gollakota pradeep...@gmail.com
 wrote:

 I'm sorry but I don't understand your question. Is the output of the
 mapper you're describing the key portion? If it is the key, then your data
 should already be sorted by HouseHoldId since it occurs first in your key.
 


 The SortComparator will tell Hadoop how to sort your data. So you use this
 if you have a need for a non lexical sort order. The GroupingComparator
 will tell Hadoop how to group your data for the reducer. All KV-pairs from
 the same group will be given to the same Reducer.


 If your reduce computation needs all the KV-pairs for the same
 HouseHoldId, then you will need to write a GroupingComparator.
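
 A sketch of what that GroupingComparator could look like, assuming a composite key
 Writable along the lines of the hypothetical HouseHoldKey sketched earlier in this
 thread (an int houseHoldId as its first field):

 import org.apache.hadoop.io.WritableComparable;
 import org.apache.hadoop.io.WritableComparator;

 public class HouseHoldGroupingComparator extends WritableComparator {

     public HouseHoldGroupingComparator() {
         super(HouseHoldKey.class, true);   // true: instantiate keys for comparison
     }

     @Override
     @SuppressWarnings("rawtypes")
     public int compare(WritableComparable a, WritableComparable b) {
         // group on HouseHoldId only, ignoring the rest of the key
         return Integer.compare(((HouseHoldKey) a).getHouseHoldId(),
                                ((HouseHoldKey) b).getHouseHoldId());
     }
 }

 It would be registered in the driver with
 job.setGroupingComparatorClass(HouseHoldGroupingComparator.class).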


 Also, have you considered using a higher level abstraction on Hadoop such
 as Pig, Hive, Cascading, etc.? The sorting/grouping type of tasks are a LOT
 easier to write in these languages.


 Hope this helps!

 - Pradeep


 On Fri, Sep 20, 2013 at 11:32 PM, Pavan Sudheendra pavan0...@gmail.com
 wrote:

 I need to improve my MR jobs which uses HBase as source as well as sink..

 Basically, i'm reading data from 3 HBase Tables in the mapper, writing
 them out as one huge string for the reducer to do some computation and dump
 into a HBase Table.. 

 Table1 ~ 19 million rows.

 Table2 ~ 2 million rows.

 Table3 ~ 900,000 rows.

 The output of the mapper is something like this : 

 HouseHoldId contentID name duration genre type channelId personId 
 televisionID timestamp

 I'm interested in sorting it on the basis of the HouseHoldID value so i'm
 using this technique. I'm not interested in the V part of pair so i'm kind
 of ignoring it. My mapper class is defined as follows:

 public static class AnalyzeMapper extends TableMapper<Text, IntWritable> { }

 For my MR job to be completed, it takes 22 hours to complete which is not
 desirable at all. I'm supposed to optimize

Re: How to best decide mapper output/reducer input for a huge string?

2013-09-22 Thread Pradeep Gollakota
Pavan,

It's hard to tell whether there's anything wrong with your design or not
since you haven't given us specific enough details. The best thing you can
do is instrument your code and see what is taking a long time. Rahul
mentioned a problem that I myself have seen before, with only one region
(or a couple) having any data. So even if you have 21 regions, only one mapper
might be doing the heavy lifting.

A combiner is hugely helpful in terms of reducing the data output of
mappers. Writing a combiner is a best practice and you should almost always
have one. Compression can be turned on by setting the following properties
in your job config.
<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.map.output.compress.codec</name>
  <value>org.apache.hadoop.io.compress.GzipCodec</value>
</property>
You can also try other compression codecs such as Lzo, Snappy, Bzip2, etc.
depending on your use cases. Gzip is really slow but gets the best
compression ratios. Snappy/Lzo are a lot faster but don't have as good of a
compression ratio. If your computations are CPU bound, then you'd probably
want to use Snappy/Lzo. If your computations are I/O bound, and your CPUs
are idle, you can use Gzip. You'll have to experiment and find the best
settings for you. There are a lot of other tweaks that you can try to get
the best performance out of your cluster.

One of the best things you can do is to install Ganglia (or some other
similar tool) on your cluster and monitor usage of resources while your job
is running. This will tell you if your job is I/O bound or CPU bound.

Take a look at this paper by Intel about optimizing your Hadoop cluster and
see if that fits your deployment.
http://software.intel.com/sites/default/files/m/f/4/3/2/f/31124-Optimizing_Hadoop_2010_final.pdf

If your cluster is already optimized and your job is not I/O bound, then
there might be a problem with your algorithm and might warrant a redesign.

Hope this helps!
- Pradeep


On Sun, Sep 22, 2013 at 8:14 PM, Rahul Bhattacharjee 
rahul.rec@gmail.com wrote:

 One mapper is spawned per hbase table region. You can use the admin ui of
 hbase to find the number of regions per table. It might happen that all the
 data is sitting in a single region , so a single mapper is spawned and you
 are not getting enough parallel work getting done.

 If that is the case then you can recreate the tables with predefined
 splits to create more regions.

 Thanks,
 Rahul


 On Sun, Sep 22, 2013 at 4:38 AM, John Lilley john.lil...@redpoint.net wrote:

  Pavan,

 How large are the rows in HBase?  22 million rows is not very much but
 you mentioned “huge strings”.  Can you tell which part of the processing is
 the limiting factor (read from HBase, mapper output, reducers)?

 John


 *From:* Pavan Sudheendra [mailto:pavan0...@gmail.com]
 *Sent:* Saturday, September 21, 2013 2:17 AM
 *To:* user@hadoop.apache.org
 *Subject:* Re: How to best decide mapper output/reducer input for a huge
 string?


 No, I don't have a combiner in place. Is it necessary? How do I make my
 map output compressed? Yes, the Tables in HBase are compressed.

 Although, there's no real bottleneck, the time it takes to process the
 entire table is huge. I have to constantly check if i can optimize it
 somehow.. 

 Oh okay.. I'll implement a Custom Writable.. Apart from that, do you see
 any thing wrong with my design? Does it require any kind of re-work? Thank
 you so much for helping..


 On Sat, Sep 21, 2013 at 1:06 PM, Pradeep Gollakota pradeep...@gmail.com
 wrote:

 One thing that comes to mind is that your keys are Strings which are
 highly inefficient. You might get a lot better performance if you write a
 custom writable for your Key object using the appropriate data types. For
 example, use a long (LongWritable) for timestamps. This should make
 (de)serialization a lot faster. If HouseHoldId is an integer, your speed of
 comparisons for sorting will also go up.


 Ensure that your map output's are being compressed. Are your tables in
 HBase compressed? Do you have a combiner?


 Have you been able to profile your code to see where the bottlenecks are?
 


 On Sat, Sep 21, 2013 at 12:04 AM, Pavan Sudheendra pavan0...@gmail.com
 wrote:

 Hi Pradeep,

 Yes.. Basically i'm only writing the key part as the map output.. The V
 of K,V is not of much use to me.. But i'm hoping to change that if it
 leads to faster execution.. I'm kind of a newbie so looking to make the
 map/reduce job run a lot faster.. 

 Also, yes. It gets sorted by the HouseHoldID which is what i needed.. But
 seems if i write a map output for each and every row of a 19 m row HBase
 table, its taking nearly a day to complete.. (21 mappers and 21 reducers)
 


 I have looked at both Pig/Hive to do the job but i'm supposed to do this
 via a MR job.. So, cannot use either

How to best decide mapper output/reducer input for a huge string?

2013-09-21 Thread Pavan Sudheendra
I need to improve my MR jobs which uses HBase as source as well as sink..

Basically, i'm reading data from 3 HBase Tables in the mapper, writing them
out as one huge string for the reducer to do some computation and dump into
a HBase Table..

Table1 ~ 19 million rows.
Table2 ~ 2 million rows.
Table3 ~ 900,000 rows.

The output of the mapper is something like this :

HouseHoldId contentID name duration genre type channelId personId
televisionID timestamp

I'm interested in sorting it on the basis of the HouseHoldID value so i'm
using this technique. I'm not interested in the V part of pair so i'm kind
of ignoring it. My mapper class is defined as follows:

public static class AnalyzeMapper extends TableMapper<Text, IntWritable> { }

For my MR job to be completed, it takes 22 hours to complete which is not
desirable at all. I'm supposed to optimize this somehow to run a lot faster
somehow..

scan.setCaching(750);
scan.setCacheBlocks(false);
TableMapReduceUtil.initTableMapperJob(
        Table1,                     // input HBase table name
        scan,
        AnalyzeMapper.class,        // mapper
        Text.class,                 // mapper output key
        IntWritable.class,          // mapper output value
        job);

TableMapReduceUtil.initTableReducerJob(
        OutputTable,                // output table
        AnalyzeReducerTable.class,  // reducer class
        job);
job.setNumReduceTasks(RegionCount);

My HBase Table1 has 21 regions so 21 mappers are spawned. We are running a
8 node cloudera cluster.

Should i use a custom SortComparator or a Group Comparator?


-- 
Regards-
Pavan


Re: How to best decide mapper output/reducer input for a huge string?

2013-09-21 Thread Pradeep Gollakota
I'm sorry but I don't understand your question. Is the output of the mapper
you're describing the key portion? If it is the key, then your data should
already be sorted by HouseHoldId since it occurs first in your key.

The SortComparator will tell Hadoop how to sort your data. So you use this
if you have a need for a non lexical sort order. The GroupingComparator
will tell Hadoop how to group your data for the reducer. All KV-pairs from
the same group will be given to the same Reducer.

If your reduce computation needs all the KV-pairs for the same HouseHoldId,
then you will need to write a GroupingComparator.

Also, have you considered using a higher level abstraction on Hadoop such
as Pig, Hive, Cascading, etc.? The sorting/grouping type of tasks are a LOT
easier to write in these languages.

Hope this helps!
- Pradeep


On Fri, Sep 20, 2013 at 11:32 PM, Pavan Sudheendra pavan0...@gmail.com wrote:

 I need to improve my MR jobs which uses HBase as source as well as sink..

 Basically, i'm reading data from 3 HBase Tables in the mapper, writing
 them out as one huge string for the reducer to do some computation and dump
 into a HBase Table..

 Table1 ~ 19 million rows.
 Table2 ~ 2 million rows.
 Table3 ~ 900,000 rows.

 The output of the mapper is something like this :

 HouseHoldId contentID name duration genre type channelId personId 
 televisionID timestamp

 I'm interested in sorting it on the basis of the HouseHoldID value so i'm
 using this technique. I'm not interested in the V part of pair so i'm kind
 of ignoring it. My mapper class is defined as follows:

 public static class AnalyzeMapper extends TableMapper<Text, IntWritable> { }

 For my MR job to be completed, it takes 22 hours to complete which is not
 desirable at all. I'm supposed to optimize this somehow to run a lot faster
 somehow..

 scan.setCaching(750);
 scan.setCacheBlocks(false);
 TableMapReduceUtil.initTableMapperJob(
         Table1,                     // input HBase table name
         scan,
         AnalyzeMapper.class,        // mapper
         Text.class,                 // mapper output key
         IntWritable.class,          // mapper output value
         job);

 TableMapReduceUtil.initTableReducerJob(
         OutputTable,                // output table
         AnalyzeReducerTable.class,  // reducer class
         job);
 job.setNumReduceTasks(RegionCount);

 My HBase Table1 has 21 regions so 21 mappers are spawned. We are running a
 8 node cloudera cluster.

 Should i use a custom SortComparator or a Group Comparator?


 --
 Regards-
 Pavan



Re: How to best decide mapper output/reducer input for a huge string?

2013-09-21 Thread Pavan Sudheendra
Hi Pradeep,
Yes.. Basically i'm only writing the key part as the map output.. The V of
K,V is not of much use to me.. But i'm hoping to change that if it leads
to faster execution.. I'm kind of a newbie so looking to make the
map/reduce job run a lot faster..

Also, yes. It gets sorted by the HouseHoldID which is what i needed.. But
seems if i write a map output for each and every row of a 19 m row HBase
table, its taking nearly a day to complete.. (21 mappers and 21 reducers)

I have looked at both Pig/Hive to do the job but i'm supposed to do this
via a MR job.. So, cannot use either of that.. Do you recommend me to try
something if i have the data in that format?


On Sat, Sep 21, 2013 at 12:26 PM, Pradeep Gollakota pradeep...@gmail.com wrote:

 I'm sorry but I don't understand your question. Is the output of the
 mapper you're describing the key portion? If it is the key, then your data
 should already be sorted by HouseHoldId since it occurs first in your key.

 The SortComparator will tell Hadoop how to sort your data. So you use this
 if you have a need for a non lexical sort order. The GroupingComparator
 will tell Hadoop how to group your data for the reducer. All KV-pairs from
 the same group will be given to the same Reducer.

 If your reduce computation needs all the KV-pairs for the same
 HouseHoldId, then you will need to write a GroupingComparator.

 Also, have you considered using a higher level abstraction on Hadoop such
 as Pig, Hive, Cascading, etc.? The sorting/grouping type of tasks are a LOT
 easier to write in these languages.

 Hope this helps!
 - Pradeep


 On Fri, Sep 20, 2013 at 11:32 PM, Pavan Sudheendra pavan0...@gmail.com wrote:

 I need to improve my MR jobs which uses HBase as source as well as sink..

 Basically, i'm reading data from 3 HBase Tables in the mapper, writing
 them out as one huge string for the reducer to do some computation and dump
 into a HBase Table..

 Table1 ~ 19 million rows.
 Table2 ~ 2 million rows.
 Table3 ~ 900,000 rows.

 The output of the mapper is something like this :

 HouseHoldId contentID name duration genre type channelId personId 
 televisionID timestamp

 I'm interested in sorting it on the basis of the HouseHoldID value so i'm
 using this technique. I'm not interested in the V part of pair so i'm kind
 of ignoring it. My mapper class is defined as follows:

 public static class AnalyzeMapper extends TableMapper<Text, IntWritable> { }

 For my MR job to be completed, it takes 22 hours to complete which is not
 desirable at all. I'm supposed to optimize this somehow to run a lot faster
 somehow..

 scan.setCaching(750);
 scan.setCacheBlocks(false);
 TableMapReduceUtil.initTableMapperJob(
         Table1,                     // input HBase table name
         scan,
         AnalyzeMapper.class,        // mapper
         Text.class,                 // mapper output key
         IntWritable.class,          // mapper output value
         job);

 TableMapReduceUtil.initTableReducerJob(
         OutputTable,                // output table
         AnalyzeReducerTable.class,  // reducer class
         job);
 job.setNumReduceTasks(RegionCount);

 My HBase Table1 has 21 regions so 21 mappers are spawned. We are running
 a 8 node cloudera cluster.

 Should i use a custom SortComparator or a Group Comparator?


 --
 Regards-
 Pavan





-- 
Regards-
Pavan


Re: How to best decide mapper output/reducer input for a huge string?

2013-09-21 Thread Pradeep Gollakota
One thing that comes to mind is that your keys are Strings which are highly
inefficient. You might get a lot better performance if you write a custom
writable for your Key object using the appropriate data types. For example,
use a long (LongWritable) for timestamps. This should make
(de)serialization a lot faster. If HouseHoldId is an integer, your speed of
comparisons for sorting will also go up.

Ensure that your map output's are being compressed. Are your tables in
HBase compressed? Do you have a combiner?

Have you been able to profile your code to see where the bottlenecks are?


On Sat, Sep 21, 2013 at 12:04 AM, Pavan Sudheendra pavan0...@gmail.com wrote:

 Hi Pradeep,
 Yes.. Basically i'm only writing the key part as the map output.. The V of
 K,V is not of much use to me.. But i'm hoping to change that if it leads
 to faster execution.. I'm kind of a newbie so looking to make the
 map/reduce job run a lot faster..

 Also, yes. It gets sorted by the HouseHoldID which is what i needed.. But
 seems if i write a map output for each and every row of a 19 m row HBase
 table, its taking nearly a day to complete.. (21 mappers and 21 reducers)

 I have looked at both Pig/Hive to do the job but i'm supposed to do this
 via a MR job.. So, cannot use either of that.. Do you recommend me to try
 something if i have the data in that format?


 On Sat, Sep 21, 2013 at 12:26 PM, Pradeep Gollakota 
 pradeep...@gmail.comwrote:

 I'm sorry but I don't understand your question. Is the output of the
 mapper you're describing the key portion? If it is the key, then your data
 should already be sorted by HouseHoldId since it occurs first in your key.

 The SortComparator will tell Hadoop how to sort your data. So you use
 this if you have a need for a non lexical sort order. The
 GroupingComparator will tell Hadoop how to group your data for the reducer.
 All KV-pairs from the same group will be given to the same Reducer.

 If your reduce computation needs all the KV-pairs for the same
 HouseHoldId, then you will need to write a GroupingComparator.

 Also, have you considered using a higher level abstraction on Hadoop such
 as Pig, Hive, Cascading, etc.? The sorting/grouping type of tasks are a LOT
 easier to write in these languages.

 Hope this helps!
 - Pradeep


 On Fri, Sep 20, 2013 at 11:32 PM, Pavan Sudheendra 
 pavan0...@gmail.comwrote:

 I need to improve my MR jobs which uses HBase as source as well as
 sink..

 Basically, i'm reading data from 3 HBase Tables in the mapper, writing
 them out as one huge string for the reducer to do some computation and dump
 into a HBase Table..

 Table1 ~ 19 million rows.
 Table2 ~ 2 million rows.
 Table3 ~ 900,000 rows.

 The output of the mapper is something like this :

 HouseHoldId contentID name duration genre type channelId personId 
 televisionID timestamp

 I'm interested in sorting it on the basis of the HouseHoldID value so
 i'm using this technique. I'm not interested in the V part of pair so i'm
 kind of ignoring it. My mapper class is defined as follows:

 public static class AnalyzeMapper extends TableMapper<Text, IntWritable> { }

 For my MR job to be completed, it takes 22 hours to complete which is
 not desirable at all. I'm supposed to optimize this somehow to run a lot
 faster somehow..

 scan.setCaching(750);
 scan.setCacheBlocks(false);
 TableMapReduceUtil.initTableMapperJob(
         Table1,                     // input HBase table name
         scan,
         AnalyzeMapper.class,        // mapper
         Text.class,                 // mapper output key
         IntWritable.class,          // mapper output value
         job);

 TableMapReduceUtil.initTableReducerJob(
         OutputTable,                // output table
         AnalyzeReducerTable.class,  // reducer class
         job);
 job.setNumReduceTasks(RegionCount);

 My HBase Table1 has 21 regions so 21 mappers are spawned. We are running
 a 8 node cloudera cluster.

 Should i use a custom SortComparator or a Group Comparator?


 --
 Regards-
 Pavan





 --
 Regards-
 Pavan



Re: How to best decide mapper output/reducer input for a huge string?

2013-09-21 Thread Pavan Sudheendra
No, I don't have a combiner in place. Is it necessary? How do I make my map
output compressed? Yes, the Tables in HBase are compressed.

Although, there's no real bottleneck, the time it takes to process the
entire table is huge. I have to constantly check if i can optimize it
somehow..

Oh okay.. I'll implement a Custom Writable.. Apart from that, do you see
any thing wrong with my design? Does it require any kind of re-work? Thank
you so much for helping..


On Sat, Sep 21, 2013 at 1:06 PM, Pradeep Gollakota pradeep...@gmail.com wrote:

 One thing that comes to mind is that your keys are Strings which are
 highly inefficient. You might get a lot better performance if you write a
 custom writable for your Key object using the appropriate data types. For
 example, use a long (LongWritable) for timestamps. This should make
 (de)serialization a lot faster. If HouseHoldId is an integer, your speed of
 comparisons for sorting will also go up.

 Ensure that your map output's are being compressed. Are your tables in
 HBase compressed? Do you have a combiner?

 Have you been able to profile your code to see where the bottlenecks are?


 On Sat, Sep 21, 2013 at 12:04 AM, Pavan Sudheendra pavan0...@gmail.com wrote:

 Hi Pradeep,
 Yes.. Basically i'm only writing the key part as the map output.. The V
 of K,V is not of much use to me.. But i'm hoping to change that if it
 leads to faster execution.. I'm kind of a newbie so looking to make the
 map/reduce job run a lot faster..

 Also, yes. It gets sorted by the HouseHoldID which is what i needed.. But
 seems if i write a map output for each and every row of a 19 m row HBase
 table, its taking nearly a day to complete.. (21 mappers and 21 reducers)

 I have looked at both Pig/Hive to do the job but i'm supposed to do this
 via a MR job.. So, cannot use either of that.. Do you recommend me to try
 something if i have the data in that format?


 On Sat, Sep 21, 2013 at 12:26 PM, Pradeep Gollakota pradeep...@gmail.com
  wrote:

 I'm sorry but I don't understand your question. Is the output of the
 mapper you're describing the key portion? If it is the key, then your data
 should already be sorted by HouseHoldId since it occurs first in your key.

 The SortComparator will tell Hadoop how to sort your data. So you use
 this if you have a need for a non lexical sort order. The
 GroupingComparator will tell Hadoop how to group your data for the reducer.
 All KV-pairs from the same group will be given to the same Reducer.

 If your reduce computation needs all the KV-pairs for the same
 HouseHoldId, then you will need to write a GroupingComparator.

 Also, have you considered using a higher level abstraction on Hadoop
 such as Pig, Hive, Cascading, etc.? The sorting/grouping type of tasks are
 a LOT easier to write in these languages.

 Hope this helps!
 - Pradeep


 On Fri, Sep 20, 2013 at 11:32 PM, Pavan Sudheendra 
 pavan0...@gmail.comwrote:

 I need to improve my MR jobs which uses HBase as source as well as
 sink..

 Basically, i'm reading data from 3 HBase Tables in the mapper, writing
 them out as one huge string for the reducer to do some computation and dump
 into a HBase Table..

 Table1 ~ 19 million rows.
 Table2 ~ 2 million rows.
 Table3 ~ 900,000 rows.

 The output of the mapper is something like this :

 HouseHoldId contentID name duration genre type channelId personId 
 televisionID timestamp

 I'm interested in sorting it on the basis of the HouseHoldID value so
 i'm using this technique. I'm not interested in the V part of pair so i'm
 kind of ignoring it. My mapper class is defined as follows:

 public static class AnalyzeMapper extends TableMapper<Text, IntWritable> { }

 For my MR job to be completed, it takes 22 hours to complete which is
 not desirable at all. I'm supposed to optimize this somehow to run a lot
 faster somehow..

 scan.setCaching(750);
 scan.setCacheBlocks(false);
 TableMapReduceUtil.initTableMapperJob(
         Table1,                     // input HBase table name
         scan,
         AnalyzeMapper.class,        // mapper
         Text.class,                 // mapper output key
         IntWritable.class,          // mapper output value
         job);

 TableMapReduceUtil.initTableReducerJob(
         OutputTable,                // output table
         AnalyzeReducerTable.class,  // reducer class
         job);
 job.setNumReduceTasks(RegionCount);

 My HBase Table1 has 21 regions so 21 mappers are spawned. We are
 running a 8 node cloudera cluster.

 Should i use a custom SortComparator or a Group Comparator?


 --
 Regards-
 Pavan





 --
 Regards-
 Pavan





-- 
Regards-
Pavan


RE: How to best decide mapper output/reducer input for a huge string?

2013-09-21 Thread John Lilley
Pavan,
How large are the rows in HBase?  22 million rows is not very much but you 
mentioned huge strings.  Can you tell which part of the processing is the 
limiting factor (read from HBase, mapper output, reducers)?
John


From: Pavan Sudheendra [mailto:pavan0...@gmail.com]
Sent: Saturday, September 21, 2013 2:17 AM
To: user@hadoop.apache.org
Subject: Re: How to best decide mapper output/reducer input for a huge string?

No, I don't have a combiner in place. Is it necessary? How do I make my map 
output compressed? Yes, the Tables in HBase are compressed.
Although, there's no real bottleneck, the time it takes to process the entire 
table is huge. I have to constantly check if i can optimize it somehow..
Oh okay.. I'll implement a Custom Writable.. Apart from that, do you see any 
thing wrong with my design? Does it require any kind of re-work? Thank you so 
much for helping..

On Sat, Sep 21, 2013 at 1:06 PM, Pradeep Gollakota 
pradeep...@gmail.com wrote:
One thing that comes to mind is that your keys are Strings which are highly 
inefficient. You might get a lot better performance if you write a custom 
writable for your Key object using the appropriate data types. For example, use 
a long (LongWritable) for timestamps. This should make (de)serialization a lot 
faster. If HouseHoldId is an integer, your speed of comparisons for sorting 
will also go up.

Ensure that your map output's are being compressed. Are your tables in HBase 
compressed? Do you have a combiner?

Have you been able to profile your code to see where the bottlenecks are?

On Sat, Sep 21, 2013 at 12:04 AM, Pavan Sudheendra 
pavan0...@gmail.com wrote:
Hi Pradeep,
Yes.. Basically i'm only writing the key part as the map output.. The V of 
K,V is not of much use to me.. But i'm hoping to change that if it leads to 
faster execution.. I'm kind of a newbie so looking to make the map/reduce job 
run a lot faster..
Also, yes. It gets sorted by the HouseHoldID which is what i needed.. But seems 
if i write a map output for each and every row of a 19 m row HBase table, its 
taking nearly a day to complete.. (21 mappers and 21 reducers)

I have looked at both Pig/Hive to do the job but i'm supposed to do this via a 
MR job.. So, cannot use either of that.. Do you recommend me to try something 
if i have the data in that format?

On Sat, Sep 21, 2013 at 12:26 PM, Pradeep Gollakota 
pradeep...@gmail.com wrote:
I'm sorry but I don't understand your question. Is the output of the mapper 
you're describing the key portion? If it is the key, then your data should 
already be sorted by HouseHoldId since it occurs first in your key.

The SortComparator will tell Hadoop how to sort your data. So you use this if 
you have a need for a non lexical sort order. The GroupingComparator will tell 
Hadoop how to group your data for the reducer. All KV-pairs from the same group 
will be given to the same Reducer.

If your reduce computation needs all the KV-pairs for the same HouseHoldId, 
then you will need to write a GroupingComparator.

Also, have you considered using a higher level abstraction on Hadoop such as 
Pig, Hive, Cascading, etc.? The sorting/grouping type of tasks are a LOT easier 
to write in these languages.

Hope this helps!
- Pradeep

On Fri, Sep 20, 2013 at 11:32 PM, Pavan Sudheendra 
pavan0...@gmail.com wrote:

I need to improve my MR jobs which uses HBase as source as well as sink..

Basically, i'm reading data from 3 HBase Tables in the mapper, writing them out 
as one huge string for the reducer to do some computation and dump into a HBase 
Table..

Table1 ~ 19 million rows.

Table2 ~ 2 million rows.

Table3 ~ 900,000 rows.

The output of the mapper is something like this :

HouseHoldId contentID name duration genre type channelId personId televisionID 
timestamp

I'm interested in sorting it on the basis of the HouseHoldID value so i'm using 
this technique. I'm not interested in the V part of pair so i'm kind of 
ignoring it. My mapper class is defined as follows:

public static class AnalyzeMapper extends TableMapper<Text, IntWritable> { }

For my MR job to be completed, it takes 22 hours to complete which is not 
desirable at all. I'm supposed to optimize this somehow to run a lot faster 
somehow..

scan.setCaching(750);
scan.setCacheBlocks(false);
TableMapReduceUtil.initTableMapperJob(
        Table1,                 // input HBase table name
        scan,
        AnalyzeMapper.class,    // mapper
        Text.class,             // mapper output key
        IntWritable.class,      // mapper output value
        job);

TableMapReduceUtil.initTableReducerJob(
        OutputTable