Re: Regarding Designing Hbase Table - for a banking scenario
Hi,

Thanks for your reply... Could you please help in answering my 3rd question...

regards,
Rams

On Mon, Dec 24, 2012 at 1:23 PM, Mohammad Tariq donta...@gmail.com wrote:

Can we design a rowkey w.r.t. the row that we populate? For example, for specific rows I may have columns A+B+C constitute the rowkey, while for some other records IN THE SAME TABLE, columns B+C+D can be used as the rowkey?
Re: Regarding Rowkey and Column Family
Hi Rams,

How are you going to access your data? HBase will create one cell (which means rowkey + timestamp + ... + data) for each cell. Are you really going to sometimes access Address Line1 without accessing Address Line2? Are you really going to access the City without accessing the State? If not, why not just put a JSON object with all this data in a single cell?

So at the end your table will look like:

Table Name : Customer

Field Name             Column Family
Customer Information   CF1
Address                CF1

In Customer Information you bundle: Customer Number, DOB, FName, MName, LName.

And in Address you bundle: Address Type, Address Line1, Address Line2, Address Line3, Address Line4, State, City, Country.

But if you always access the address when you access the customer information, then the best way might be to just put all those fields in a single JSON object, and have just one CF and one C (column) in your table...

Regarding the key: if your customer number is sequential and you insert based on this field, you will hotspot one server at a time... If the number is random, then it's ok.

HTH.

JM

2012/12/24, Mohammad Tariq donta...@gmail.com:

it is. but why do you want to do that? you will run into issues once your data starts growing. each cell, along with the actual value, stores a few additional things: the row, the column and the version. as a result you will lose space if you do that.

Best Regards,
Tariq
+91-9741563634
https://mtariq.jux.com/

On Mon, Dec 24, 2012 at 5:00 PM, Ramasubramanian Narayanan ramasubramanian.naraya...@gmail.com wrote:

Hi,

Is it ok to have the same column in different column families?

regards,
Rams

On Mon, Dec 24, 2012 at 4:06 PM, Mohammad Tariq donta...@gmail.com wrote:

you are creating 2 different rows here. cf means how columns are clubbed together as a single entity which is represented by that cf. but here you are creating 2 different rows having one cf each, CF1 and CF2 respectively.
if you want to have 1 row with 2 cf, you have to use the same rowkey for both the cf.

Best Regards,
Tariq
+91-9741563634
https://mtariq.jux.com/

On Mon, Dec 24, 2012 at 3:41 PM, Ramasubramanian Narayanan ramasubramanian.naraya...@gmail.com wrote:

Hi,

Table Name : Customer

Field Name        Column Family
Customer Number   CF1
DOB               CF1
FName             CF1
MName             CF1
LName             CF1
Address Type      CF2
Address Line1     CF2
Address Line2     CF2
Address Line3     CF2
Address Line4     CF2
State             CF2
City              CF2
Country           CF2

Is it good to have the rowkey as follows for the same table?

Rowkey Design:
--------------
For CF1 : Customer Number + MMD (business date)
For CF2 : Customer Number + Address Type

Note : Address Type can be any of HOME/OFFICE/OTHERS

regards,
Rams
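A quick illustration of the sequential-rowkey concern raised in this thread: HBase keeps rows sorted by the byte order of their keys, so a rowkey that begins with a sequential customer number sends consecutive writes to the same region regardless of any date suffix. The sketch below is plain Python (not HBase client code), and both key formats are hypothetical:

```python
import hashlib

def naive_key(customer_number: int, mmdd: str) -> str:
    # e.g. "00001000_1224": the sequential prefix dominates sort order,
    # so consecutive customers land next to each other (same region).
    return f"{customer_number:08d}_{mmdd}"

def salted_key(customer_number: int, mmdd: str) -> str:
    # Hypothetical mitigation: a short hash prefix scatters consecutive
    # customer numbers across the keyspace (at the cost of range scans).
    salt = hashlib.md5(str(customer_number).encode()).hexdigest()[:2]
    return f"{salt}_{naive_key(customer_number, mmdd)}"

keys = [naive_key(n, "1224") for n in range(1000, 1005)]
print(sorted(keys) == keys)  # consecutive customers stay adjacent
```

Appending MMDD only differentiates rows for the *same* customer; it does not change which server receives a run of newly issued customer numbers.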
Re: Regarding Rowkey and Column Family
Hi,

Thanks for your detailed explanation. There will be multiple addresses for a single customer. For example, the same customer can hold a home address, an office address, etc., hence I grouped them into a different column family.

1. Is my approach correct?
2. What can we have as a rowkey for both these column families?
3. I think the customer number is sequential, hence I am planning to include MMDD along with the customer number in the rowkey. Is that fine?

Regards,
Rams

On 24-Dec-2012, at 7:54 PM, Jean-Marc Spaggiari jean-m...@spaggiari.org wrote: [...]
Re: Regarding Rowkey and Column Family
Hi Rams,

Even if a customer can have multiple addresses, you can still simply put them all in the same field... An ArrayList of addresses, converted into a JSON string, in a single HBase cell will still do it. You can have them in separate cells if you think you will access them separately. You can also have different column identifiers for each type of address: say you have CF1 for all your fields, C=Infos for the customer info, C=PHY for the physical address, C=HOM for the home address, C=OFF for the office address, and so on.

The idea is to reduce the CFs if not required, and really think about the way you access your data. If you access all the addresses at the same time, then simply put all of them in the same cell, as an array of addresses converted to a string with JSON. So simple ;)

JM

2012/12/24, Ramasubramanian ramasubramanian.naraya...@gmail.com:

Hi,

Let me explain the scenario. For the address of the customer we have designed 3 tables (in a relational way):

1. Address link table
Will have key columns like:
- Address type (physical or email/fax/phone/URL/etc.)
- Address category (home/work)
- Primary address indicator
- Bad address indicator
- Etc.

2. Physical address
This will contain the actual physical address. A customer can have n number of addresses.
Fields:
- address type (physical)
- address category (home/work/etc.)
- address1
- address2
...

3. Electronic address
It will contain email/fax/phone/URL etc., and its value.
Fields:
- address type (email/fax/phone/URL/etc.)
- address category (home/work/etc.)
- value (actual value based on address type, like the actual phone number)

Now in the above scenario, while designing in HBase, I am going to eliminate the link table and have those fields in both the physical and electronic address. So both tables have common fields like address type and address category. Hence I thought of having these two fields common for both sets of fields (in a single table).

Regards,
Rams

On 24-Dec-2012, at 6:45 PM, Mohammad Tariq donta...@gmail.com wrote: [...]
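JM's single-cell suggestion can be sketched like this (plain Python for illustration; the field names are made up, and in the Java client the resulting string would simply be the byte value stored by one Put):

```python
import json

# All of one customer's addresses, bundled into a single value.
addresses = [
    {"type": "HOME",   "line1": "12 Main St",   "city": "Chennai", "country": "IN"},
    {"type": "OFFICE", "line1": "80 Tech Park", "city": "Chennai", "country": "IN"},
]

cell_value = json.dumps(addresses)   # store this one string in a single cell
restored = json.loads(cell_value)    # one read brings back every address
print(len(restored), restored[1]["type"])  # -> 2 OFFICE
```

One cell per customer avoids repeating the rowkey, timestamp, and column name overhead that HBase attaches to every separate cell, which is exactly Tariq's "you will lose space" point.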
Re: Hbase Count Aggregate Function
Hi Dalia,

You already sent the same question yesterday ;) Just give some time to people to look at it.

JM

2012/12/24, Dalia Sobhy dalia.mohso...@hotmail.com:

Dear all,

I have 50,000 rows with diagnosis qualifier = cardiac, and another 50,000 rows with renal.

When I type this in the HBase shell:

import org.apache.hadoop.hbase.filter.CompareFilter
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter
import org.apache.hadoop.hbase.filter.SubstringComparator
import org.apache.hadoop.hbase.util.Bytes

scan 'patient', { COLUMNS => 'info:diagnosis', FILTER =>
  SingleColumnValueFilter.new(Bytes.toBytes('info'), Bytes.toBytes('diagnosis'),
    CompareFilter::CompareOp.valueOf('EQUAL'), SubstringComparator.new('cardiac')) }

Output = 50,000 rows

count 'patient', { COLUMNS => 'info:diagnosis', FILTER =>
  SingleColumnValueFilter.new(Bytes.toBytes('info'), Bytes.toBytes('diagnosis'),
    CompareFilter::CompareOp.valueOf('EQUAL'), SubstringComparator.new('cardiac')) }

Output = 100,000 rows

I even tried it using the HBase Java API with an AggregationClient instance, and I enabled coprocessor aggregation for the table:

rowCount = aggregationClient.rowCount(TABLE_NAME, null, scan);

Also, when measuring the performance improvement from adding more nodes, the operation takes the same time.

So any advice please? I have been going through all this mess for a couple of weeks.

Thanks,
Re: Hbase Count Aggregate Function
So you find that scan with a filter and count with the same filter is giving you different results?

Regards
Ram

On Mon, Dec 24, 2012 at 8:33 PM, Dalia Sobhy dalia.mohso...@hotmail.com wrote: [...]
RE: Hbase Count Aggregate Function
yeah, scan gives the correct number of rows, while count returns the total number of rows. Both are using the same filter. I even tried it using the Java API, using the row count method:

rowCount = aggregationClient.rowCount(TABLE_NAME, null, scan);

I get the total number of rows, not the number of rows filtered. So any idea??

Thanks Ram :)

Date: Mon, 24 Dec 2012 21:57:54 +0530
Subject: Re: Hbase Count Aggregate Function
From: ramkrishna.s.vasude...@gmail.com
To: user@hbase.apache.org

[...]
Re: Fixing badly distributed table manually.
Vincent Barat vbarat@... writes:

Hi,

Balancing regions between RSs is correctly handled by HBase: I mean that your RSs always manage the same number of regions (the balancer takes care of it). Unfortunately, balancing all the regions of one particular table between the RSs of your cluster is not always easy, since HBase (as of 0.90.3), when it comes to splitting a region, always creates the new one on the same RS. This means that if you start with a 1-region table and then insert lots of data into it, new regions will always be created on the same RS (if your insert is an M/R job, you saturate this RS). Eventually, the balancer will at some point decide to move some of these regions to other RSs, limiting the issue, but it is not controllable.

Here at Capptain, we solved this problem by developing a special Python script, based on the HBase shell, allowing us to entirely balance all the regions of all tables across all RSs. It ensures that the regions of a table are uniformly deployed on all RSs of the cluster, with a minimum of region transitions. It is fast, and even if it can trigger a lot of region transitions, there is very little impact at runtime and it can be run safely. If you are interested, just let me know, I can share it.

Regards,

Vincent, I would very much like to see and possibly use the script that you mentioned. We've just run into the same issue (after the table was truncated it was re-created with only 1 region, and after data loading and manual splits we ended up with all regions on the same RS). If you could share the script, it would be really appreciated, I believe not only by me.

Thanks,
Ivan
Re: Hbase Count Aggregate Function
Okie, seeing the shell script and the code, I feel that when you use this counter, the user's filter is not taken into account. It adds a FirstKeyOnlyFilter and proceeds with the scan. :(

Regards
Ram

On Mon, Dec 24, 2012 at 10:11 PM, Dalia Sobhy dalia.mohso...@hotmail.com wrote: [...]
RE: Hbase Count Aggregate Function
So do you have a suggestion on how to enable/make the filter work?

Date: Mon, 24 Dec 2012 22:22:49 +0530
Subject: Re: Hbase Count Aggregate Function
From: ramkrishna.s.vasude...@gmail.com
To: user@hbase.apache.org

[...]
Where to place logs
Hi,

I am wondering where people usually place HBase + Hadoop logs. I have 4 disks and 1 very tiny disk with barely 500 megs (that's the typical setup on Amazon EC2). The 4 disks shall be used for HBase data. Since 500M is too small, should I place the logs on one of the 4 disks? Could that potentially steal IOPS from HBase? Does anyone have an idea how much of an overhead logging really is?

Varun
Re: Fixing badly distributed table manually.
On Mon, Dec 24, 2012 at 8:27 AM, Ivan Balashov ibalas...@gmail.com wrote:

Vincent Barat vbarat@... writes:

[...] Here at Capptain, we solved this problem by developing a special Python script, based on the HBase shell, allowing to entirely balance all the regions of all tables to all RS. It ensure that regions of tables are uniformly deployed on all RS of the cluster, with a minimum region transitions.

Is it possible to describe the logic at high level on what you did?

[...]
Re: Hbase Count Aggregate Function
Hi,

You could have a custom filter implemented which is similar to FirstKeyOnlyFilter. Implement the filterKeyValue method such that it matches your keyvalue (the specific qualifier that you are looking for). Deploy it in your cluster. It should work.

Regards
Ram

On Mon, Dec 24, 2012 at 10:35 PM, Dalia Sobhy dalia.mohso...@hotmail.com wrote: [...]
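The semantics Ram is after can be modeled in a few lines. This is a toy Python sketch of the per-row predicate (a SingleColumnValueFilter with a SubstringComparator), not the actual fix; the real fix he describes is a custom Java Filter deployed to every region server:

```python
def matches(row, family, qualifier, substring):
    # Emulate SingleColumnValueFilter + SubstringComparator: keep the row
    # only if the named column exists and its value contains the substring.
    value = row.get((family, qualifier))
    return value is not None and substring in value

# Hypothetical sample rows, keyed by (family, qualifier).
rows = [
    {("info", "diagnosis"): "cardiac arrest"},
    {("info", "diagnosis"): "renal failure"},
    {("info", "diagnosis"): "cardiac arrhythmia"},
]

count = sum(matches(r, "info", "diagnosis", "cardiac") for r in rows)
print(count)  # -> 2
```

A correct filtered count must apply this check to every row, which is exactly what the shell's `count` (with its FirstKeyOnlyFilter) skips.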
RE: Hbase Count Aggregate Function
Do you mean I should implement a new rowCount method in the AggregationClient class? I cannot understand; could you illustrate with a code sample, Ram?

Thanks,

Date: Tue, 25 Dec 2012 00:21:14 +0530
Subject: Re: Hbase Count Aggregate Function
From: ramkrishna.s.vasude...@gmail.com
To: user@hbase.apache.org

[...]
RE: Hbase Count Aggregate Function
This is my function:

public long CountByDiagnosis(String diagnosis) throws IOException {
    customConf.setStrings("hbase.zookeeper.quorum", hbaseZookeeperQuorum);
    customConf.setLong("hbase.rpc.timeout", 60);
    customConf.setLong("hbase.client.scanner.caching", 1000);
    configuration = HBaseConfiguration.create(customConf);
    aggregationClient = new AggregationClient(configuration);
    scan.addFamily(CF);
    // Filter by a particular diagnosis
    SingleColumnValueFilter filter1 = new SingleColumnValueFilter(
        CF, Column, CompareOp.EQUAL, Bytes.toBytes(diagnosis));
    scan.setFilter(filter1);
    long rowCount = -1;
    // Count the number of patients matching the given diagnosis
    try {
        rowCount = aggregationClient.rowCount(TABLE_NAME, null, scan);
    } catch (Throwable e) {
        e.printStackTrace();
    }
    return rowCount;
}

Date: Tue, 25 Dec 2012 00:21:14 +0530
Subject: Re: Hbase Count Aggregate Function
From: ramkrishna.s.vasude...@gmail.com
To: user@hbase.apache.org

[...]
Re: Hbase Question
Hi Dalia,

I think you can make a small sample of the table to do the test; then you'll find the difference between scan and count, because you can count it by hand.

Best regards,
Andy

2012/12/24 Dalia Sobhy dalia.mohso...@hotmail.com: [...]