[jira] [Commented] (HBASE-10932) Improve RowCounter to allow mapper number set/control

2014-04-25 Thread Jean-Daniel Cryans (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13981271#comment-13981271
 ] 

Jean-Daniel Cryans commented on HBASE-10932:


bq. But I didn't quite catch the point of job scheduler, in my understanding 
job scheduler is cluster-level and cannot be configured per-job, right? 

Well, by using a scheduler, you can constrain certain types of jobs so that 
they don't run as fast as they can. For example, with the fair scheduler you 
can configure a pool (let's call it the slow pool) to have only {{maxMaps}} 
running concurrently on the cluster. Then, when you run your {{RowCounter}} 
jobs and whatnot, you can tie them automatically to the slow pool. Hadoop 
cluster operators usually know how to use a scheduler, whereas having to rely 
on the person who runs the jobs to configure them correctly can lead to human 
errors like "oops, I forgot to pass the maps configuration to my row counter 
and now the website is down".

It also works well if you have two users who want to concurrently run a row 
counter; they'll both get in the slow pool and, with {{maxMaps}} set to two, 
only two mappers will run (alternating between the two jobs, unless you set 
different weights because one user is more important than the other, etc.). If 
you were to rely on individual users specifying the correct number of maps, and 
they both set their job to use two, then you'd have four mappers running. Back 
to square one.
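
The arithmetic in the two-user example can be sketched in plain Java. This is not Hadoop or scheduler code: the class and method names are hypothetical, and a pool is modeled as nothing more than a single cap on concurrently running maps, which is an assumption about how a {{maxMaps}} limit behaves.

```java
/** Hypothetical model of concurrent mappers; not part of Hadoop or HBase. */
final class MapperConcurrency {

    /** Each job enforces only its own limit, so the limits simply add up. */
    static int withPerJobLimits(int... perJobMaxMaps) {
        int total = 0;
        for (int maps : perJobMaxMaps) {
            total += maps;
        }
        return total;
    }

    /** Assumed pool behavior: one shared cap, however many jobs are in the pool. */
    static int withPoolCap(int poolMaxMaps, int... demandedMaps) {
        int demanded = 0;
        for (int maps : demandedMaps) {
            demanded += maps;
        }
        return Math.min(demanded, poolMaxMaps);
    }

    public static void main(String[] args) {
        // Two users each limit their own job to 2 maps.
        System.out.println(withPerJobLimits(2, 2));   // prints 4
        // Two jobs of 25 maps each share a pool capped at 2 concurrent maps.
        System.out.println(withPoolCap(2, 25, 25));   // prints 2
    }
}
```

The per-job limits add up with every extra job, while the pool cap stays the same no matter how many jobs land in the pool.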

Anyways, all of this to say that there's a more generic way of doing this, and 
it already exists. Can we close this jira, [~carp84]?

 Improve RowCounter to allow mapper number set/control
 -

 Key: HBASE-10932
 URL: https://issues.apache.org/jira/browse/HBASE-10932
 Project: HBase
  Issue Type: Improvement
  Components: mapreduce
Reporter: Yu Li
Assignee: Yu Li
Priority: Minor
 Attachments: HBASE-10932_v1.patch, HBASE-10932_v2.patch


 The typical use case of RowCounter is data integrity checking, for example 
 after exporting data from an RDBMS to HBase, or from one HBase cluster to 
 another, to make sure the row (record) counts match. Such checks usually do 
 not need a fast response time.
 Meanwhile, with the current implementation, RowCounter launches one mapper per 
 region, and each mapper sends one scan request. Assuming the table is 
 kind of big like having tens of regions, and the cpu core number of the whole 
 MR cluster is also enough, the parallel scan requests sent by mapper would be 
 a real burden for the HBase cluster.
 So in this JIRA, we propose that rowcounter support an additional option, 
 --maps, to specify the number of mappers, and that each mapper be able to scan 
 more than one region of the target table.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HBASE-10932) Improve RowCounter to allow mapper number set/control

2014-04-25 Thread Yu Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13981441#comment-13981441
 ] 

Yu Li commented on HBASE-10932:
---

Hi [~jdcryans],

If we follow this logic, do you mean the -m option of DistCp is also useless?

IMHO, the job scheduler configuration in JT/YARN is server-side configuration, 
while the -m option is client-side configuration, and both are necessary.

Back to the scheduler discussion: I believe a job scheduler can only limit the 
maximum resources one user can use, and it is up to the user to decide how to 
use the resources assigned to him. In the example you gave, what if the slow 
pool has 4 slots, only one user submits a rowcounter, and he prefers only 2 
maps running in parallel? I'm afraid asking the cluster operator to create 
another slow pool with only 2 slots is not a good solution.

In a common HBase ETL application, the user first runs distcp, then bulkload, 
then rowcounter to check data integrity, and he would prefer distcp to run as 
fast as possible but rowcounter to generate a low scan workload. In this case, 
would he need to submit the distcp job to the fast queue and the rowcounter job 
to the slow queue? He would also need access to both queues...

Anyway, this is a real requirement from a user in our production environment, 
and I'm just trying to contribute it to the community in case it can help other 
users. But if you still think it's useless, just go ahead and close it, you're 
the boss after all. :-)

Whatever decision is made, thanks for your time reviewing this JIRA and 
discussing it.



[jira] [Commented] (HBASE-10932) Improve RowCounter to allow mapper number set/control

2014-04-24 Thread Yu Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13980642#comment-13980642
 ] 

Yu Li commented on HBASE-10932:
---

Hi [~jdcryans],

Sorry I forgot about this issue also...

{quote}
I doubt that RowCounter is the only job that needs to be throttled, what about 
VerifyReplication? Or Export?
This can be simply handled by a correctly configured job scheduler, that's what 
they do.
{quote}
I see, so you're suggesting finding a more generic solution for such control. 
But I didn't quite catch the point of job scheduler, in my understanding job 
scheduler is cluster-level and cannot be configured per-job, right? If so, I'm 
not sure whether we can change the scheduling policy just for HBase, since 
lots of other kinds of jobs commonly run in the MR/YARN cluster and the HBase 
jobs are only a small portion.
Anyway, this is an interesting topic, and I will spend some more time thinking 
about the VerifyReplication/Export cases and the general solution.



[jira] [Commented] (HBASE-10932) Improve RowCounter to allow mapper number set/control

2014-04-18 Thread Jean-Daniel Cryans (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13974306#comment-13974306
 ] 

Jean-Daniel Cryans commented on HBASE-10932:


Hey [~carp84], I forgot about this issue, let me address your latest replies.

bq. I thought it was designed on purpose to make each mapper scan just one 
single region

That's more an implementation detail than a design, and we can further improve 
the implementation by giving more control to the power users.

bq. This is especially useful in a multi-tenant environment, when we need to 
check data integrity for one user after a data import without letting the scan 
burden slow down the response time of other users' requests.

Right, but again, resource management is a broader issue. I doubt that 
RowCounter is the only job that needs to be throttled, what about 
VerifyReplication? Or Export? Those jobs usually aren't latency sensitive and 
can run in the background. This can be simply handled by a correctly configured 
job scheduler, that's what they do.



[jira] [Commented] (HBASE-10932) Improve RowCounter to allow mapper number set/control

2014-04-09 Thread Yu Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13963852#comment-13963852
 ] 

Yu Li commented on HBASE-10932:
---

Hi [~jdcryans] and [~ndimiduk],

Are you suggesting using the RowCounter in the old o.a.hadoop.hbase.mapred 
package? It seems to me this class is deprecated and uses the old mapred APIs. 
What's more, when the rowcounter command is issued through the hbase/hadoop 
script, it launches the RowCounter in the o.a.hadoop.hbase.mapreduce package by 
default.

For the getSplits method in the new TableInputFormatBase, the method comments 
say it's designed so that the number of splits matches the number of regions, 
so I don't think this is a bug but something to improve for the *in-use* 
RowCounter:
{code}
  /**
   * Calculates the splits that will serve as input for the map tasks. The
   * number of splits matches the number of regions in a table.
   *
   * @param context  The current job context.
   * @return The list of input splits.
   * @throws IOException When creating the list of splits fails.
   * @see org.apache.hadoop.mapreduce.InputFormat#getSplits(
   *   org.apache.hadoop.mapreduce.JobContext)
   */
{code}
And this is the exact reason we introduced a new RowCounterInputFormat class 
that overrides the getSplits method rather than modifying the existing one.

As to the new parameter, yes, the user could pass -Dmapred.map.tasks, but I 
think it's better to add an explicit parameter so the user can see how it works 
from the usage message. IMHO, assuming HBase users have an MR background might 
not be a good idea.



[jira] [Commented] (HBASE-10932) Improve RowCounter to allow mapper number set/control

2014-04-09 Thread Jean-Daniel Cryans (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13964318#comment-13964318
 ] 

Jean-Daniel Cryans commented on HBASE-10932:


bq. Are you suggesting using the RowCounter in the old o.a.hadoop.hbase.mapred 
package?

No, I'm suggesting fixing the new TableInputFormatBase so that it can produce 
splits that span multiple regions. Why have a different InputFormat for 
RowCounter? What makes RowCounter so special that it's the only MR job that 
would benefit from this functionality? 

I was pointing at the old TableInputFormatBase to show that it used to do this, 
and that the new one doesn't do it (I'm guessing because MR doesn't pass 
mapred.map.tasks as a hint anymore).

bq. IMHO, assuming HBase users have an MR background might not be a good idea.

I understand the concern but I don't see the value here, since it's not like 
you're trying to use a more HBase-y concept to describe mappers; the 
configuration parameter is still called maps. Even if you call it something 
else, how do you then explain what it does without relying on MR concepts, and 
how do you decide how many mappers you need without prior knowledge of MR and 
of your own cluster setup?

I think this new configuration parameter is more suitable for advanced usage, 
since to set it correctly you need to know how your cluster is laid out and 
believe you can do better than the default behavior.

Going back to your original problem:

bq. Assuming the table is kind of big like having tens of regions, and the cpu 
core number of the whole MR cluster is also enough, the parallel scan requests 
sent by mapper would be a real burden for the HBase cluster.

In the MRv1 world you specify the number of mapper slots per machine, so using 
this --maps configuration may or may not lessen the burden on the cluster. For 
example, take 5 mapper slots per machine, 5 machines and 25 regions (so 
everything fits nicely). By default, you'll get 25 mappers running at the same 
time, 5 per machine. Let's say you use this new --maps configuration and set it 
to 20. Well, there's nothing preventing the JobTracker from filling up 4 
machines and leaving one quiet (maybe because it's already running something, 
etc).

YARN does a much better job at this since it takes into account CPUs and 
memory, so it might just solve your problem without requiring additional tuning.
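
The MRv1 slot example can be worked through with a small sketch. It models task placement only, under the assumption that the JobTracker is free to pack a job's mappers onto as few machines as the slots allow; real placement also depends on the scheduler and on data locality. The class and method names are hypothetical.

```java
/** Hypothetical model of MRv1 slot packing; not JobTracker code. */
final class SlotPlacement {

    /** Fewest machines that can host all mappers if slots are packed full. */
    static int minMachinesUsed(int mappers, int slotsPerMachine) {
        requirePositive(slotsPerMachine);
        return (mappers + slotsPerMachine - 1) / slotsPerMachine; // ceiling division
    }

    /** Most mappers of this job that can end up on a single machine. */
    static int maxMappersOnOneMachine(int mappers, int slotsPerMachine) {
        requirePositive(slotsPerMachine);
        return Math.min(mappers, slotsPerMachine);
    }

    private static void requirePositive(int slotsPerMachine) {
        if (slotsPerMachine <= 0) {
            throw new IllegalArgumentException("slotsPerMachine must be positive");
        }
    }

    public static void main(String[] args) {
        int slots = 5; // mapper slots per machine, as in the example above
        // Default: 25 mappers fill all 5 machines, 5 per machine.
        System.out.println(minMachinesUsed(25, slots));        // prints 5
        System.out.println(maxMappersOnOneMachine(25, slots)); // prints 5
        // With 20 mappers, 4 machines may be full and one left quiet.
        System.out.println(minMachinesUsed(20, slots));        // prints 4
        System.out.println(maxMappersOnOneMachine(20, slots)); // prints 5
    }
}
```

Going from 25 to 20 mappers does not lower the worst-case number of mappers on one machine; only a value below the per-machine slot count does.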



[jira] [Commented] (HBASE-10932) Improve RowCounter to allow mapper number set/control

2014-04-09 Thread haosdent (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13964370#comment-13964370
 ] 

haosdent commented on HBASE-10932:
--

{quote}
the configuration parameter is still called maps.
{quote}
scanner.num may be better.

{quote}
Let's say you use this new maps configuration and set it to 20.
{quote}
If I am a user, maybe I would set it to 2 or another lower value here.

Anyway, I think this is a useful issue. Because I have some important online 
businesses on my clusters, any unnecessary heavy IO could be unacceptable. 
[~jdcryans] focuses on code style while [~carp84] focuses on how to handle this 
scenario and make the number of mappers configurable. Maybe we need a consensus 
on which way to work around this issue. Just my opinion.



[jira] [Commented] (HBASE-10932) Improve RowCounter to allow mapper number set/control

2014-04-09 Thread Jean-Daniel Cryans (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13964389#comment-13964389
 ] 

Jean-Daniel Cryans commented on HBASE-10932:


bq.  Because I have some important online businesses on my clusters, any 
unnecessary heavy IO could be unacceptable.

Isn't that a resource management concern? Won't using a proper scheduler on 
MapReduce or YARN be way more effective than relying on HBase users setting a 
number of scans?

bq.  Jean-Daniel Cryans focuses on code style 

I'm sorry if I'm misunderstanding you, but I'm not following what you're trying 
to say here.



[jira] [Commented] (HBASE-10932) Improve RowCounter to allow mapper number set/control

2014-04-09 Thread Jean-Daniel Cryans (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13964433#comment-13964433
 ] 

Jean-Daniel Cryans commented on HBASE-10932:


Thinking about this more:

bq. If I am a user, maybe I would set it to 2 or another lower value here

It sounds like, in order to solve your use case without setting up a scheduler, 
you could simply use the count command in the shell, since the only thing 
below 2 scans is 1 scan :)



[jira] [Commented] (HBASE-10932) Improve RowCounter to allow mapper number set/control

2014-04-09 Thread Yu Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13964924#comment-13964924
 ] 

Yu Li commented on HBASE-10932:
---

Hi [~jdcryans]
{quote}
What makes RowCounter so special that it's the only MR job that would 
benefit from this functionality?
I was pointing at the old TableInputFormatBase to show that it used to do this, 
and that the new one doesn't do it 
{quote}
OK, I get your point now. And yes, we could remove the special InputFormat for 
RowCounter and _*fix*_ the new TableInputFormatBase. I created the special 
InputFormat for RowCounter only because, from the comments on the new 
TableInputFormatBase's getSplits method, I thought it was designed on purpose 
to make each mapper scan just one single region...

{quote}
I'm guessing because MR doesn't pass mapred.map.tasks as a hint anymore
{quote}
In my understanding, it still passes mapred.map.tasks as a hint, only the 
parameter is now carried in the JobContext, so getSplits no longer needs a 
dedicated int parameter.
As for the parameter to pass the mapred.map.tasks hint, I'm following the 
distcp command, which has a dedicated -m parameter:
{noformat}
usage: distcp OPTIONS [source_path...] target_path
OPTIONS
...
-m arg   Max number of concurrent maps to use for copy
{noformat}

{quote}
Well, there's nothing preventing the JobTracker from filling up 4 machines and 
leaving one quiet
{quote}
Oh, there's some misunderstanding here. When talking about a real burden for 
the HBase cluster, I didn't mean the CPU burden caused by the MR job but the IO 
burden caused by scan requests. If we have 25 mappers there would be 25 scan 
requests, while with 20 mappers there would only be 20 scan requests. This is 
especially useful in a multi-tenant environment, when we need to check data 
integrity for one user after a data import without letting the scan burden slow 
down the response time of other users' requests. Makes sense? :-)



[jira] [Commented] (HBASE-10932) Improve RowCounter to allow mapper number set/control

2014-04-08 Thread Yu Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13962998#comment-13962998
 ] 

Yu Li commented on HBASE-10932:
---

In the implementation, we check that the requested number of maps is smaller 
than the number of regions of the target table. If the requested number is 
larger than the number of regions, it falls back to the old behavior, that is, 
one mapper per region.

Will attach the patch soon.



[jira] [Commented] (HBASE-10932) Improve RowCounter to allow mapper number set/control

2014-04-08 Thread Jean-Daniel Cryans (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13963167#comment-13963167
 ] 

Jean-Daniel Cryans commented on HBASE-10932:


It's probably a bug that TableInputFormatBase doesn't do it, looking at the old 
one (in org.apache.hadoop.hbase.mapred) you can see that it does this:

{code}
   * Splits are created in number equal to the smallest between numSplits and
   * the number of {@link HRegion}s in the table. If the number of splits is
   * smaller than the number of {@link HRegion}s then splits are spanned across
   * multiple {@link HRegion}s and are grouped the most evenly possible. In the
   * case splits are uneven the bigger splits are placed first in the
   * {@link InputSplit} array.
{code}

And you don't need a new parameter.
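
The grouping rule in that Javadoc can be sketched as follows. This is only an illustration of the described behavior, not the actual org.apache.hadoop.hbase.mapred code: the class and method names are hypothetical, and regions are reduced to a count rather than real start/end keys.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper illustrating the split grouping rule; not HBase code. */
final class SplitGrouping {

    /**
     * Returns how many regions each split covers when regionCount regions are
     * grouped into min(numSplits, regionCount) splits, as evenly as possible,
     * with the bigger splits placed first.
     */
    static List<Integer> regionsPerSplit(int regionCount, int numSplits) {
        if (regionCount < 0 || numSplits < 1) {
            throw new IllegalArgumentException(
                "regionCount must be >= 0 and numSplits must be >= 1");
        }
        int splits = Math.min(numSplits, regionCount);
        List<Integer> sizes = new ArrayList<>(splits);
        if (splits == 0) {
            return sizes; // empty table: nothing to split
        }
        int base = regionCount / splits;      // every split gets at least this many
        int remainder = regionCount % splits; // the first 'remainder' splits get one more
        for (int i = 0; i < splits; i++) {
            sizes.add(i < remainder ? base + 1 : base);
        }
        return sizes;
    }

    public static void main(String[] args) {
        // 25 regions, hint of 20: five splits of two regions, then fifteen of one.
        System.out.println(regionsPerSplit(25, 20));
        // A hint above the region count falls back to one split per region.
        System.out.println(regionsPerSplit(25, 40).size()); // prints 25
    }
}
```

With this rule a hint larger than the region count degrades to one split per region, so no separate validation of the requested value is needed.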



[jira] [Commented] (HBASE-10932) Improve RowCounter to allow mapper number set/control

2014-04-08 Thread Nick Dimiduk (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-10932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13963230#comment-13963230
 ] 

Nick Dimiduk commented on HBASE-10932:
--

bq. It's probably a bug that TableInputFormatBase doesn't do it, looking at the 
old one (in org.apache.hadoop.hbase.mapred)

+1
