Re: Regarding Designing Hbase Table - for a banking scenario

2012-12-24 Thread Ramasubramanian Narayanan
Hi,
Thanks for your reply...
Could you please help in answering my 3rd question...

regards,
Rams

On Mon, Dec 24, 2012 at 1:23 PM, Mohammad Tariq donta...@gmail.com wrote:

 design a rowkey w.r.t. the row that we populate? For
  example, for specific rows I may have columns A+B+C constitute a rowkey;
  for some other records IN THE SAME TABLE, columns B+C+D can be used as a
  rowkey?



Re: Regarding Rowkey and Column Family

2012-12-24 Thread Jean-Marc Spaggiari
Hi Rams,

How are you going to access your data?

HBase will create one cell (which means rowkey + timestamp + ... + data) for
each value it stores.

Are you really going to sometimes access Address Line1 without
accessing Address Line2?

Are you really going to access the City without accessing the State?

If not, why not just put a JSON object with all this data in a single cell?

So at the end your table will look like:

*Table Name : Customer*

*Field Name Column Family*
Customer Information CF1
Address CF2


In Customer Information you bundle:
Customer Number  CF1
DOB  CF1
FNameCF1
MNameCF1
LNameCF1

And in Address you bundle:
Address Type CF2
Address Line1CF2
Address Line2CF2
Address Line3CF2
Address Line4CF2
StateCF2
City CF2
Country  CF2

But if you always access the address when you access the customer
information, then the best way might be to just put all those fields in
a single JSON object, and have just one CF and one C in your table...
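As an illustration of the single-cell idea above (an editor's sketch, not code from the thread; the field names are made up, and a real application would use a JSON library such as Jackson instead of hand-built strings):

```java
public class JsonCellSketch {
    // Build one JSON document holding all customer fields, to be stored as
    // the value of a single HBase cell instead of one cell per field.
    static String toJson(String customerNumber, String dob, String fName,
                         String mName, String lName) {
        return String.format(
            "{\"customerNumber\":\"%s\",\"dob\":\"%s\","
                + "\"fName\":\"%s\",\"mName\":\"%s\",\"lName\":\"%s\"}",
            customerNumber, dob, fName, mName, lName);
    }

    public static void main(String[] args) {
        String value = toJson("C0001", "1980-01-15", "Rama", "S", "Narayanan");
        // This whole string becomes the value of one cell, e.g.
        // put.add(Bytes.toBytes("cf1"), Bytes.toBytes("info"), Bytes.toBytes(value));
        System.out.println(value);
    }
}
```

The resulting string is written once, as the value of a single qualifier, instead of five separate cells each repeating the rowkey and timestamp.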

Regarding the key, if your customer number is sequential and you insert
based on this field, you will hotspot one server at a time... If the
number is random, then it's OK.
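For the sequential-key hotspotting point, one common workaround (an editor's sketch, not something the thread itself prescribes; the bucket count of 16 is an arbitrary assumption) is to salt the rowkey with a hash-derived prefix so consecutive customer numbers spread across regions:

```java
public class SaltedKeySketch {
    static final int BUCKETS = 16; // arbitrary; often chosen near the RS count

    // Prefix a sequential customer number with a deterministic hash-derived
    // salt so that consecutive inserts spread across regions instead of all
    // landing on the region server that owns the "current" key range.
    static String saltedKey(String customerNumber) {
        int bucket = Math.floorMod(customerNumber.hashCode(), BUCKETS);
        return String.format("%02d-%s", bucket, customerNumber);
    }

    public static void main(String[] args) {
        System.out.println(saltedKey("C000001"));
        System.out.println(saltedKey("C000002"));
    }
}
```

The trade-off: plain range scans by customer number are lost, and reads must fan out across all bucket prefixes.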

HTH.

JM

2012/12/24, Mohammad Tariq donta...@gmail.com:
 it is. But why do you want to do that? You will run into issues once your
 data starts growing. Each cell, along with the actual value, stores a few
 additional things, *row, column* and the *version*. As a result you will
 lose space if you do that.

 Best Regards,
 Tariq
 +91-9741563634
 https://mtariq.jux.com/


 On Mon, Dec 24, 2012 at 5:00 PM, Ramasubramanian Narayanan 
 ramasubramanian.naraya...@gmail.com wrote:

 Hi,

  Is it OK to have the same column in different column families?

 regards,
 Rams

 On Mon, Dec 24, 2012 at 4:06 PM, Mohammad Tariq donta...@gmail.com
 wrote:

   you are creating 2 different rows here. A CF means how columns are clubbed
   together as a single entity which is represented by that CF. But here you
   are creating 2 different rows having one CF each, CF1 and CF2 respectively.
   If you want to have 1 row with 2 CFs, you have to use the same rowkey for
   both the CFs.
 
 
 
  Best Regards,
  Tariq
  +91-9741563634
  https://mtariq.jux.com/
 
 
  On Mon, Dec 24, 2012 at 3:41 PM, Ramasubramanian Narayanan 
  ramasubramanian.naraya...@gmail.com wrote:
 
   Hi,
  
   *Table Name : Customer*
   *
   *
   *Field Name Column Family*
   Customer Number  CF1
   DOB  CF1
   FNameCF1
   MNameCF1
   LNameCF1
   Address Type CF2
   Address Line1CF2
   Address Line2CF2
   Address Line3CF2
   Address Line4CF2
   StateCF2
   City CF2
   Country  CF2
  
   Is it good to have a rowkey as follows for the same table?
  
   Rowkey Design:
   --
   For CF1 : Customer Number + MMDD (business date)
   For CF2 : Customer Number + Address Type
  
   Note :
   Address Type can be any of HOME/OFFICE/OTHERS
  
   regards,
   Rams
  
 




Re: Regarding Rowkey and Column Family

2012-12-24 Thread Ramasubramanian
Hi,

Thanks for your detailed explanation. 

A single customer can have multiple addresses. For example, the same 
customer can hold a home address, an office address, etc., hence I grouped them into 
different column families. 

1. Is my approach correct?

2. What can we have as a rowkey for both these column families?

3. I think the customer number is sequential, hence I am planning to include MMDD along 
with the customer number in the rowkey. Is that fine?

Regards,
Rams

On 24-Dec-2012, at 7:54 PM, Jean-Marc Spaggiari jean-m...@spaggiari.org wrote:

 Hi Rams,
 
 How are you going to access your data?
 
 HBase will create one cell (which means rowkey + timestamp + ... + data) for
 each value it stores.
 
 Are you really going to sometimes access Address Line1 without
 accessing Address Line2?
 
 Are you really going to access the City without accessing the State?
 
 If not, why not just put a JSON object with all this data in a single cell?
 


Re: Regarding Rowkey and Column Family

2012-12-24 Thread Jean-Marc Spaggiari
Hi Rams,

Even if a customer can have multiple addresses, you can still simply
put them all in the same field...

An ArrayList of addresses, converted to a JSON string, in a single
HBase cell will still do it.

You can have them in separate cells if you think you will access them
separately. You can also have different column qualifiers for each
type of address you can have.

Like you have CF1 for all your fields, C=Infos for the customer info,
C=PHY for the physical address, C=HOM for the home address, C=OFF for
the office address, and so on.

The idea is to reduce the CFs if not required, and really think about
the way you access your data.

If you access all the addresses at the same time, then simply put all
of them in the same cell, as an array of addresses converted to a JSON
string. So simple ;)
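The "array of addresses in one cell" idea can be sketched like this (an editor's illustration, not code from the thread; field names are made up and the JSON is hand-built for brevity, where a real implementation would use a JSON library):

```java
import java.util.ArrayList;
import java.util.List;

public class AddressArraySketch {
    // Serialize all of a customer's addresses into one JSON array string
    // destined for a single HBase cell. Each address is {type, line1, city}.
    static String toJsonArray(List<String[]> addresses) {
        StringBuilder sb = new StringBuilder("[");
        for (int i = 0; i < addresses.size(); i++) {
            String[] a = addresses.get(i);
            if (i > 0) sb.append(',');
            sb.append(String.format(
                "{\"type\":\"%s\",\"line1\":\"%s\",\"city\":\"%s\"}",
                a[0], a[1], a[2]));
        }
        return sb.append(']').toString();
    }

    public static void main(String[] args) {
        List<String[]> addrs = new ArrayList<>();
        addrs.add(new String[] {"HOM", "12 Home St", "Chennai"});
        addrs.add(new String[] {"OFF", "1 Office Rd", "Bangalore"});
        System.out.println(toJsonArray(addrs));
    }
}
```

One read then fetches every address at once; the alternative in the message above (a qualifier per address type, C=HOM, C=OFF, ...) suits the case where addresses are read separately.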

JM

2012/12/24, Ramasubramanian ramasubramanian.naraya...@gmail.com:
 Hi,

 Let me explain the scenario.

 For the address of the customer we have designed 3 tables (in a relational way):

 1. Address link table
 Will have key columns like
   Address type - physical or email /fax/phone/URL/ etc.,
   Address category- (home/work)
   Primary address indicator
   Bad address indicator
   Etc.,
 2. Physical address
 This will contain the actual physical address. A customer can have n
 number of addresses.
Fields :
- address type (physical)
- address category (home/work/etc.,)
 - address1
 - address 2
 .
 3. Electronic address
 It will contain email/fax/phone/URL etc., and its value
  Fields :
- address type (email /fax/phone/URL/ etc.,)
- address category (home/work/etc.,)
- value ( actual value based on address type. Like actual phone
 number)


 Now in the above scenario, while designing in HBase, I am going to eliminate
 the link table and have those fields in both physical and electronic address.

 So both tables have common fields like address type and address category.
 Hence I thought of having these two fields common for both sets of fields
 (in a single table).

 Regards,
 Rams

 On 24-Dec-2012, at 6:45 PM, Mohammad Tariq donta...@gmail.com wrote:

 it is. But why do you want to do that? You will run into issues once
 your data starts growing. Each cell, along with the actual value, stores
 a few additional things, *row, column* and the *version*. As a result
 you will lose space if you do that.





Re: Hbase Count Aggregate Function

2012-12-24 Thread Jean-Marc Spaggiari
Hi Dalia,

You already sent the same question yesterday ;) Just give some time to
people to look at it.

JM

2012/12/24, Dalia Sobhy dalia.mohso...@hotmail.com:

 Dear all,

 I have 50,000 rows with diagnosis qualifier = cardiac, and another 50,000
 rows with renal.

 When I type this in Hbase shell,

 import org.apache.hadoop.hbase.filter.CompareFilter
 import org.apache.hadoop.hbase.filter.SingleColumnValueFilter
 import org.apache.hadoop.hbase.filter.SubstringComparator
 import org.apache.hadoop.hbase.util.Bytes

 scan 'patient', { COLUMNS => 'info:diagnosis', FILTER =>
 SingleColumnValueFilter.new(Bytes.toBytes('info'),
  Bytes.toBytes('diagnosis'),
  CompareFilter::CompareOp.valueOf('EQUAL'),
  SubstringComparator.new('cardiac'))}

 Output = 50,000 rows

 import org.apache.hadoop.hbase.filter.CompareFilter
 import org.apache.hadoop.hbase.filter.SingleColumnValueFilter
 import org.apache.hadoop.hbase.filter.SubstringComparator
 import org.apache.hadoop.hbase.util.Bytes

 count 'patient', { COLUMNS => 'info:diagnosis', FILTER =>
 SingleColumnValueFilter.new(Bytes.toBytes('info'),
  Bytes.toBytes('diagnosis'),
  CompareFilter::CompareOp.valueOf('EQUAL'),
  SubstringComparator.new('cardiac'))}
 Output = 100,000 rows

 I also tried it using the HBase Java API with an AggregationClient instance,
 and I enabled the aggregation coprocessor for the table.
 rowCount = aggregationClient.rowCount(TABLE_NAME, null, scan)

 Also, when measuring the performance improvement from adding more nodes,
 the operation takes the same time.

 So any advice please?

 I have been going through all this mess for a couple of weeks.

 Thanks,


Re: Hbase Count Aggregate Function

2012-12-24 Thread ramkrishna vasudevan
So you find that scan with a filter and count with the same filter is
giving you different results?

Regards
Ram



RE: Hbase Count Aggregate Function

2012-12-24 Thread Dalia Sobhy

yeah, scan gives the correct number of rows, while count returns the total
number of rows.

Both are using the same filter. I even tried it using the Java API, using the
row count method.

rowCount = aggregationClient.rowCount(TABLE_NAME, null, scan);

I get the total number of rows, not the number of rows filtered.

So any idea ??

Thanks Ram :)

 Date: Mon, 24 Dec 2012 21:57:54 +0530
 Subject: Re: Hbase Count Aggregate Function
 From: ramkrishna.s.vasude...@gmail.com
 To: user@hbase.apache.org
 
 So you find that scan with a filter and count with the same filter is
 giving you different results?
 
 Regards
 Ram
 

Re: Fixing badly distributed table manually.

2012-12-24 Thread Ivan Balashov

Vincent Barat vbarat@... writes:

 
 Hi,
 
 Balancing regions between RSs is correctly handled by HBase: I mean
 that your RSs always manage the same number of regions (the balancer
 takes care of it).
 
 Unfortunately, balancing all the regions of one particular table
 between the RSs of your cluster is not always easy, since HBase (as
 of 0.90.3), when it comes to splitting a region, creates the new one
 always on the same RS. This means that if you start with a 1-region
 table, and then you insert lots of data into it, new regions
 will always be created on the same RS (if your insert is a M/R job,
 you saturate this RS). Eventually, the balancer will at some point
 decide to move one of these regions to another RS, limiting the
 issue, but it is not controllable.
 
 Here at Capptain, we solved this problem by developing a special
 Python script, based on the HBase shell, allowing us to entirely
 balance all the regions of all tables across all RSs. It ensures that
 regions of tables are uniformly deployed on all RSs of the cluster,
 with a minimum of region transitions.
 
 It is fast, and even if it can trigger a lot of region transitions,
 there is very little impact at runtime and it can be run safely.
 
 If you are interested, just let me know, I can share it.
 
 Regards,
 

Vincent,

I would very much like to see and possibly use the script that you 
mentioned. We've just run into the same issue (after the table 
was truncated it was re-created with only 1 region, and 
after data loading and manual splits we ended up having all 
regions within the same RS).

If you could share the script, it will be really appreciated, 
I believe not only by me.

Thanks,
Ivan 








Re: Hbase Count Aggregate Function

2012-12-24 Thread ramkrishna vasudevan
Okie, seeing the shell script and the code, I feel that when you use this
counter, the user's filter is not taken into account.
It adds a FirstKeyOnlyFilter and proceeds with the scan. :(

Regards
Ram




RE: Hbase Count Aggregate Function

2012-12-24 Thread Dalia Sobhy

So do you have a suggestion how to enable/work the filter?

 Date: Mon, 24 Dec 2012 22:22:49 +0530
 Subject: Re: Hbase Count Aggregate Function
 From: ramkrishna.s.vasude...@gmail.com
 To: user@hbase.apache.org
 
 Okie, seeing the shell script and the code I feel that while you use this
 counter, the user's filter is not taken into account.
 It adds a FirstKeyOnlyFilter and proceeds with the scan. :(.
 
 Regards
 Ram
 

Where to place logs

2012-12-24 Thread Varun Sharma
Hi,

I am wondering where people usually place HBase + Hadoop logs. I have 4
disks and 1 very tiny disk with barely 500 megs (that's the typical setup on
Amazon EC2). The 4 disks shall be used for HBase data. Since 500M is too
small, should I place the logs on one of the 4 disks? Could that potentially
steal IOPS from HBase? Does anyone have an idea how much of an overhead
logging really is?

Varun


Re: Fixing badly distributed table manually.

2012-12-24 Thread Mohit Anchlia
On Mon, Dec 24, 2012 at 8:27 AM, Ivan Balashov ibalas...@gmail.com wrote:


 Vincent Barat vbarat@... writes:

 
  Hi,
 
  Balancing regions between RS is correctly handled by HBase : I mean
  that your RSs always manage the same number of regions (the balancer
  takes care of it).
 
  Unfortunately, balancing all the regions of one particular table
  between the RS of your cluster is not always easy, since HBase (as
  for 0.90.3) when it comes to splitting a region, create the new one
  always on the same RS. This means that if you start with a 1 region
  only table, and then you insert lots of data into it, new regions
  will always be created to the same RS (if you insert is a M/R job,
  you saturate this RS). Eventually, the balancer at a time will
  decide to balance one of these regions to other RS, limiting the
  issue, but it is not controllable.
 
  Here at Capptain, we solved this problem by developing a special
  Python script, based on the HBase shell, allowing to entirely
  balance all the regions of all tables to all RS. It ensure that
  regions of tables are uniformly deployed on all RS of the cluster,
  with a minimum region transitions.
 


Is it possible to describe the logic at high level on what you did?
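Vincent's script is not shown in the thread, but the high-level logic he describes (deploy each table's regions uniformly across region servers, with a minimum of transitions) can be sketched as a plain computation; the data structures here are an editor's illustration, and applying the result would be a separate step (e.g. `move` commands through the shell or admin API):

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class BalanceSketch {
    // For one table: given the current region -> server assignment, compute
    // a minimal set of moves so every server ends up with floor(n/k) or
    // ceil(n/k) of that table's regions. Only overloaded servers give up
    // regions; everyone else keeps what they already host.
    static Map<String, String> rebalance(Map<String, String> regionToServer,
                                         List<String> servers) {
        Map<String, List<String>> byServer = new HashMap<>();
        for (String s : servers) byServer.put(s, new ArrayList<>());
        for (Map.Entry<String, String> e : regionToServer.entrySet())
            byServer.get(e.getValue()).add(e.getKey()); // assumes server is known

        int n = regionToServer.size(), k = servers.size();
        int floor = n / k, rem = n % k;
        Map<String, Integer> target = new HashMap<>();
        for (int i = 0; i < k; i++) // first `rem` servers take one extra region
            target.put(servers.get(i), floor + (i < rem ? 1 : 0));

        Deque<String> excess = new ArrayDeque<>();
        for (String s : servers) { // strip overloaded servers down to target
            List<String> regs = byServer.get(s);
            while (regs.size() > target.get(s))
                excess.push(regs.remove(regs.size() - 1));
        }
        Map<String, String> moves = new LinkedHashMap<>(); // region -> new RS
        for (String s : servers) { // hand the stripped regions to the others
            while (byServer.get(s).size() < target.get(s)) {
                String r = excess.pop();
                byServer.get(s).add(r);
                moves.put(r, s);
            }
        }
        return moves;
    }

    public static void main(String[] args) {
        Map<String, String> assign = new LinkedHashMap<>();
        assign.put("r1", "rs1");
        assign.put("r2", "rs1");
        assign.put("r3", "rs1"); // everything piled on one RS after splits
        System.out.println(rebalance(assign, Arrays.asList("rs1", "rs2", "rs3")));
    }
}
```

Because total excess above target equals total deficit below it, every stripped region finds a destination, and no region moves unless its server is over target.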

  It is fast, and even if it can trigger a lot of region transitions,
  there is very few impact at runtime and it can be run safely.
 
  If you are interested, just let me know, I can share it.
 
  Regards,
 

Re: Hbase Count Aggregate Function

2012-12-24 Thread ramkrishna vasudevan
Hi
You could have a custom filter implemented which is similar to
FirstKeyOnlyFilter.
Implement the filterKeyValue method such that it matches your keyvalue
(the specific qualifier that you are looking for).

Deploy it in your cluster.  It should work.
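The per-KeyValue decision such a custom filter would make can be sketched in plain Java (an editor's sketch: the HBase wiring — extending `FilterBase` and overriding `filterKeyValue`, with the jar deployed on every region server — is omitted so the snippet stays self-contained; the names are illustrative):

```java
public class DiagnosisMatchSketch {
    // The per-cell test a custom counting filter would apply: keep the cell
    // only when the wanted qualifier's value contains the wanted substring
    // (mirroring SingleColumnValueFilter with a SubstringComparator, which
    // compares case-insensitively). In a real filter this logic sits inside
    // filterKeyValue() of a FilterBase subclass.
    static boolean matches(String qualifier, String value,
                           String wantedQualifier, String wantedSubstring) {
        return qualifier.equals(wantedQualifier)
            && value.toLowerCase().contains(wantedSubstring.toLowerCase());
    }

    public static void main(String[] args) {
        System.out.println(matches("diagnosis", "acute cardiac arrest",
                                   "diagnosis", "cardiac")); // true
        System.out.println(matches("diagnosis", "renal failure",
                                   "diagnosis", "cardiac")); // false
    }
}
```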

Regards
Ram




RE: Hbase Count Aggregate Function

2012-12-24 Thread Dalia Sobhy

Do you mean I should implement a new rowCount method in the
AggregationClient class?

I don't quite understand; could you illustrate with a code sample, Ram?

Thanks,

 Date: Tue, 25 Dec 2012 00:21:14 +0530
 Subject: Re: Hbase Count Aggregate Function
 From: ramkrishna.s.vasude...@gmail.com
 To: user@hbase.apache.org
 
 Hi
 You could have custom filter implemented which is similar to
 FirstKeyOnlyfilter.
 Implement the filterKeyValue method such that it should match your keyvalue
 (the specific qualifier that you are looking for).
 
 Deploy it in your cluster.  It should work.
 
 Regards
 Ram
 
 On Mon, Dec 24, 2012 at 10:35 PM, Dalia Sobhy 
 dalia.mohso...@hotmail.comwrote:
 
 
  So do you have a suggestion how to enable/work the filter?
 
   Date: Mon, 24 Dec 2012 22:22:49 +0530
   Subject: Re: Hbase Count Aggregate Function
   From: ramkrishna.s.vasude...@gmail.com
   To: user@hbase.apache.org
  
   Okie, seeing the shell script and the code I feel that while you use this
   counter, the user's filter is not taken into account.
   It adds a FirstKeyOnlyFilter and proceeds with the scan. :(.
  
   Regards
   Ram
  
   On Mon, Dec 24, 2012 at 10:11 PM, Dalia Sobhy 
  dalia.mohso...@hotmail.comwrote:
  
   
yeah scan gives the correct number of rows, while count returns the
  total
number of rows.
   
Both are using the same filter, I even tried it using Java API, using
  row
count method.
   
rowCount = aggregationClient.rowCount(TABLE_NAME, null, scan);
   
I get the total number of rows not the number of rows filtered.
   
So any idea ??
   
Thanks Ram :)
   
 Date: Mon, 24 Dec 2012 21:57:54 +0530
 Subject: Re: Hbase Count Aggregate Function
 From: ramkrishna.s.vasude...@gmail.com
 To: user@hbase.apache.org

 So you find that scan with a filter and count with the same filter is
 giving you different results?

 Regards
 Ram

 On Mon, Dec 24, 2012 at 8:33 PM, Dalia Sobhy 
  dalia.mohso...@hotmail.com
wrote:

 
  Dear all,
 
  I have 50,000 row with diagnosis qualifier = cardiac, and another
50,000
  rows with renal.
 
  When I type this in Hbase shell,
 
  import org.apache.hadoop.hbase.filter.CompareFilter
  import org.apache.hadoop.hbase.filter.SingleColumnValueFilter
  import org.apache.hadoop.hbase.filter.SubstringComparator
  import org.apache.hadoop.hbase.util.Bytes
 
  scan 'patient', { COLUMNS = info:diagnosis, FILTER =
  SingleColumnValueFilter.new(Bytes.toBytes('info'),
   Bytes.toBytes('diagnosis'),
   CompareFilter::CompareOp.valueOf('EQUAL'),
   SubstringComparator.new('cardiac'))}
 
  Output = 50,000 row
 
  import org.apache.hadoop.hbase.filter.CompareFilter
  import org.apache.hadoop.hbase.filter.SingleColumnValueFilter
  import org.apache.hadoop.hbase.filter.SubstringComparator
  import org.apache.hadoop.hbase.util.Bytes
 
  count 'patient', { COLUMNS => 'info:diagnosis', FILTER =>
    SingleColumnValueFilter.new(Bytes.toBytes('info'),
      Bytes.toBytes('diagnosis'),
      CompareFilter::CompareOp.valueOf('EQUAL'),
      SubstringComparator.new('cardiac')) }
  Output = 100,000 row
 
  I get the same behavior with the HBase Java API, using an
  AggregationClient instance, and I have enabled the aggregation
  coprocessor on the table:
  rowCount = aggregationClient.rowCount(TABLE_NAME, null, scan)
 
  Also, when measuring performance after adding more nodes, the
  operation takes the same amount of time.
 
  So any advice please?
 
  I have been going through all this mess for a couple of weeks.
 
  Thanks,
   
   
 
 
  

RE: Hbase Count Aggregate Function

2012-12-24 Thread Dalia Sobhy

This is my function:

public long CountByDiagnosis(String diagnosis) throws IOException {
  customConf.setStrings("hbase.zookeeper.quorum", hbaseZookeeperQuorum);
  customConf.setLong("hbase.rpc.timeout", 60);
  customConf.setLong("hbase.client.scanner.caching", 1000);
  configuration = HBaseConfiguration.create(customConf);
  aggregationClient = new AggregationClient(configuration);

  scan.addFamily(CF);

  // Filter by a particular diagnosis
  SingleColumnValueFilter filter1 = new SingleColumnValueFilter(
      CF,
      Column,
      CompareOp.EQUAL,
      Bytes.toBytes(diagnosis));
  scan.setFilter(filter1);

  long rowCount = -1;
  // Count the number of patients with the given diagnosis
  try {
    rowCount = aggregationClient.rowCount(TABLE_NAME, null, scan);
  } catch (Throwable e) {
    e.printStackTrace();
  }
  return rowCount;
}
 


 Date: Tue, 25 Dec 2012 00:21:14 +0530
 Subject: Re: Hbase Count Aggregate Function
 From: ramkrishna.s.vasude...@gmail.com
 To: user@hbase.apache.org
 
 Hi
 You could implement a custom filter similar to FirstKeyOnlyFilter.
 Implement the filterKeyValue method so that it matches your KeyValue
 (the specific qualifier that you are looking for).
 
 Deploy it in your cluster.  It should work.
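 [Editorial note: the decision such a custom filter's filterKeyValue method
 would make can be sketched in plain Java. This deliberately omits the real
 HBase Filter API (ReturnCode, KeyValue, Filter base class) so the matching
 logic stands alone; the class and method names here are illustrative, not
 HBase's.]

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustrative sketch of the decision a custom filter's filterKeyValue
// would make: count a cell only when it belongs to the target
// family:qualifier AND its value contains the expected substring.
public class DiagnosisMatch {
    static final byte[] FAMILY = "info".getBytes(StandardCharsets.UTF_8);
    static final byte[] QUALIFIER = "diagnosis".getBytes(StandardCharsets.UTF_8);

    // Returns true when this cell should count toward the filtered total.
    public static boolean matches(byte[] family, byte[] qualifier,
                                  byte[] value, String expected) {
        if (!Arrays.equals(family, FAMILY)
                || !Arrays.equals(qualifier, QUALIFIER)) {
            return false; // some other column: skip it
        }
        String v = new String(value, StandardCharsets.UTF_8);
        return v.contains(expected); // SubstringComparator-style match
    }
}
```

 In a real Filter subclass, a true result here would presumably map to
 including the cell and a false one to skipping it; the point is only that
 the user's qualifier and value comparison must live inside filterKeyValue
 itself, rather than being dropped as the shell's count path does.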
 
 Regards
 Ram
 
 On Mon, Dec 24, 2012 at 10:35 PM, Dalia Sobhy 
 dalia.mohso...@hotmail.comwrote:
 
 
  So do you have a suggestion how to enable/work the filter?
 
   Date: Mon, 24 Dec 2012 22:22:49 +0530
   Subject: Re: Hbase Count Aggregate Function
   From: ramkrishna.s.vasude...@gmail.com
   To: user@hbase.apache.org
  
   Okie, seeing the shell script and the code I feel that while you use this
   counter, the user's filter is not taken into account.
   It adds a FirstKeyOnlyFilter and proceeds with the scan. :(.
  
   Regards
   Ram
  
   On Mon, Dec 24, 2012 at 10:11 PM, Dalia Sobhy 
  dalia.mohso...@hotmail.comwrote:
  
   
yeah scan gives the correct number of rows, while count returns the
  total
number of rows.
   
 Both are using the same filter. I even tried it using the Java API, using
   the row count method.
   
rowCount = aggregationClient.rowCount(TABLE_NAME, null, scan);
   
I get the total number of rows not the number of rows filtered.
   
So any idea ??
   
Thanks Ram :)
   
 Date: Mon, 24 Dec 2012 21:57:54 +0530
 Subject: Re: Hbase Count Aggregate Function
 From: ramkrishna.s.vasude...@gmail.com
 To: user@hbase.apache.org

 So you find that scan with a filter and count with the same filter is
 giving you different results?

 Regards
 Ram

 On Mon, Dec 24, 2012 at 8:33 PM, Dalia Sobhy 
  dalia.mohso...@hotmail.com
wrote:

 
  Dear all,
 
  I have 50,000 rows with diagnosis qualifier = cardiac, and another
  50,000 rows with renal.
 
  When I type this in Hbase shell,
 
  import org.apache.hadoop.hbase.filter.CompareFilter
  import org.apache.hadoop.hbase.filter.SingleColumnValueFilter
  import org.apache.hadoop.hbase.filter.SubstringComparator
  import org.apache.hadoop.hbase.util.Bytes
 
  scan 'patient', { COLUMNS => 'info:diagnosis', FILTER =>
    SingleColumnValueFilter.new(Bytes.toBytes('info'),
      Bytes.toBytes('diagnosis'),
      CompareFilter::CompareOp.valueOf('EQUAL'),
      SubstringComparator.new('cardiac')) }
 
  Output = 50,000 row
 
  import org.apache.hadoop.hbase.filter.CompareFilter
  import org.apache.hadoop.hbase.filter.SingleColumnValueFilter
  import org.apache.hadoop.hbase.filter.SubstringComparator
  import org.apache.hadoop.hbase.util.Bytes
 
  count 'patient', { COLUMNS => 'info:diagnosis', FILTER =>
    SingleColumnValueFilter.new(Bytes.toBytes('info'),
      Bytes.toBytes('diagnosis'),
      CompareFilter::CompareOp.valueOf('EQUAL'),
      SubstringComparator.new('cardiac')) }
  Output = 100,000 row
 
  I get the same behavior with the HBase Java API, using an
  AggregationClient instance, and I have enabled the aggregation
  coprocessor on the table:
  rowCount = aggregationClient.rowCount(TABLE_NAME, null, scan)
 
  Also, when measuring performance after adding more nodes, the
  operation takes the same amount of time.
 
  So any advice please?
 
  I have been going through all this mess for a couple of weeks.
 
  Thanks,
   
   
 
 
  

Re: Hbase Question

2012-12-24 Thread 周梦想
Hi Dalia,

I think you can make a small sample of the table and run the test on
that; then you'll see the difference between scan and count, because
you can count the rows by hand.

Best regards,
Andy
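
[Editorial note: Ram's earlier diagnosis — that the shell's count path swaps
in a FirstKeyOnlyFilter and drops the user's filter — also predicts what
Andy's small-sample test would show. A self-contained simulation of the two
code paths (plain Java, no HBase involved; names illustrative):]

```java
import java.util.List;

// Simulation of why a filtered scan and the shell's count disagree:
// the count fast path ignores the user's filter, so every row is
// tallied, while the scan applies the filter to each row's value.
public class CountVsScan {
    // A filtered scan keeps only rows whose diagnosis matches.
    public static long filteredScan(List<String> diagnoses, String wanted) {
        return diagnoses.stream().filter(d -> d.contains(wanted)).count();
    }

    // The count path, as observed on the list: the user's filter is
    // effectively ignored and every row contributes one (first) key.
    public static long shellCount(List<String> diagnoses, String wantedIgnored) {
        return diagnoses.size();
    }
}
```

On Dalia's data this reproduces the reported symptom: the scan path returns
50,000 (the matching rows) while the count path returns 100,000 (the table
total), regardless of the filter passed in.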

2012/12/24 Dalia Sobhy dalia.mohso...@hotmail.com


 Dear all,

 I have 50,000 rows with diagnosis qualifier = cardiac, and another 50,000
 rows with renal.

 When I type this in Hbase shell,

 import org.apache.hadoop.hbase.filter.CompareFilter
 import org.apache.hadoop.hbase.filter.SingleColumnValueFilter
 import org.apache.hadoop.hbase.filter.SubstringComparator
 import org.apache.hadoop.hbase.util.Bytes

 scan 'patient', { COLUMNS => 'info:diagnosis', FILTER =>
   SingleColumnValueFilter.new(Bytes.toBytes('info'),
     Bytes.toBytes('diagnosis'),
     CompareFilter::CompareOp.valueOf('EQUAL'),
     SubstringComparator.new('cardiac')) }

 Output = 50,000 row

 import org.apache.hadoop.hbase.filter.CompareFilter
 import org.apache.hadoop.hbase.filter.SingleColumnValueFilter
 import org.apache.hadoop.hbase.filter.SubstringComparator
 import org.apache.hadoop.hbase.util.Bytes

 count 'patient', { COLUMNS => 'info:diagnosis', FILTER =>
   SingleColumnValueFilter.new(Bytes.toBytes('info'),
     Bytes.toBytes('diagnosis'),
     CompareFilter::CompareOp.valueOf('EQUAL'),
     SubstringComparator.new('cardiac')) }
 Output = 100,000 row

 I get the same behavior with the HBase Java API, using an AggregationClient
 instance, and I have enabled the aggregation coprocessor on the table:
 rowCount = aggregationClient.rowCount(TABLE_NAME, null, scan)

 Also, when measuring performance after adding more nodes, the operation
 takes the same amount of time.

 So any advice please?

 I have been going through all this mess for a couple of weeks.

 Thanks,