Re: Improvement - speed up HBase 2-3 times

2020-09-16 Thread onmstester onmstester
Hi,



By ops/sec, do you mean rows/sec or partitions/sec (in Cassandra terms)? If
partitions, how many rows are there per op or partition? What is your data
model, and what is the host spec?

Is your client remote or on the host?

On Wed, 16 Sep 2020 14:11:35 +0430 Sergey Semenoff wrote:


Hi *! 
 
I think everybody who works with real Big Data knows that performance is
very important.

Unfortunately, our beloved HBase is approximately 2 times slower than
Cassandra when reading huge amounts of data.
 
 
For example, here is a Cassandra performance test run from 2 hosts
(client side):

Host1 - Throughput (ops/sec): 231 021
Host2 - Throughput (ops/sec): 224 691

Total: ~450 000.

Under the same conditions, HBase shows only 210 000.
 
 
 
Maybe this is one of the reasons why Cassandra is more popular (see
https://db-engines.com/en/ranking/wide+column+store).

I've made an improvement that can make HBase 2-3 times faster (it
depends on many factors, and sometimes the gain is even larger).

With the improvement, HBase speeds up to 430 000 ops/sec.

See the picture in the attachment.
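
As a quick check of the figures quoted above, here is a minimal sketch of the arithmetic. The variable names are illustrative only and are not part of the benchmark or of the patch; the numbers are the ones given in this message.

```python
# Throughput figures quoted in the message (ops/sec).
cassandra_hosts = [231_021, 224_691]   # Host1 and Host2, client side
hbase_before = 210_000                 # HBase under the same conditions
hbase_after = 430_000                  # HBase with the improvement

# Combined Cassandra throughput across both client hosts.
cassandra_total = sum(cassandra_hosts)
print(cassandra_total)                               # 455712

# How far HBase was behind, and the speedup the improvement gives.
print(round(cassandra_total / hbase_before, 2))      # 2.17
print(round(hbase_after / hbase_before, 2))          # 2.05
```

For this particular test the speedup works out to roughly 2x, which brings HBase close to the combined Cassandra figure.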
 
 
 
If you are interested in getting this improvement into a release, you can
help attract developers' attention here:
https://issues.apache.org/jira/browse/HBASE-23887

Add a comment there with your opinion, and vote if you think it could be
useful for your work.

I believe a discussion about this approach can make HBase more useful and
popular.
 
 
 
Thanks for your attention!

Best regards,
 
Pustota

Re: Migrating from Apache Cassandra to Hbase

2018-09-11 Thread onmstester onmstester
Thank you Josh and Allan. Sorry for the rush; this question has been on my
mind for some months, but I thought I should first become familiar enough
with one side of the "vs". I've been struggling with Cassandra ever since and
almost forgot that there was a "vs" in my mind!

One main feature of Cassandra is that, given one key (the partition key), it
can retrieve thousands of rows with a few IOPS, because all rows belonging to
a partition are stored in almost the same place on disk. This is why having 8
partition keys requires storing each row in 8 places. Logically, I cannot
think of a faster mechanism to load this amount of data than keeping it in
the same place on disk. I wonder how an indexing mechanism (like HBase's)
could achieve the same performance as Cassandra when retrieving thousands of
rows related to a single partition key (architecture-wise), because it would
still have to load the rows through some foreign key (the index) with
multiple accesses (too many IOPS and much slower).

I'm going to read the HBase documentation (technical and user manuals),
launch a test cluster with > 10 nodes running my application logic on HBase,
and try to tune its performance (too many questions to ask in this forum), as
I did for Apache Cassandra. But I can't wait that long for an answer to these
questions.

On Wed, 12 Sep 2018 07:12:05 +0430 Allan Yang wrote:

You can use Phoenix + HBase and use indexes in Phoenix. But since you need 8
different kinds of queries, you may need to create 8 different indexes and
thus 8 index tables. Unlike in Cassandra, however, you do not have to store
all the column data redundantly in every table. You can instead use a
non-covered index, which is a simple mapping between the indexed column and
the rowkey. So there won't be 8x space.
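
The non-covered index idea described above can be illustrated with a minimal in-memory sketch. This is not the Phoenix or HBase API: `base_table`, `index_tables`, `put` and `query` are hypothetical stand-ins, and sorted Python lists play the role of tables ordered by key. The full row is stored once, and each index entry holds only (timebucket, column value, rowkey).

```python
from bisect import bisect_left, insort

# Hypothetical stand-ins for illustration; not the Phoenix/HBase API.
INDEXED_COLUMNS = ["col1", "col2"]   # would be col1..col8 in the case discussed

base_table = {}                                   # rowkey -> full row, stored once
index_tables = {c: [] for c in INDEXED_COLUMNS}   # column -> sorted index entries


def put(rowkey, row):
    """Store the full row once, plus one small index entry per indexed column."""
    base_table[rowkey] = row
    for col in INDEXED_COLUMNS:
        # The index entry carries only the key parts and the rowkey, not the row.
        insort(index_tables[col], (row["timebucket"], row[col], rowkey))


def query(col, timebucket, value):
    """Range-scan the index for (timebucket, value), then fetch rows by rowkey."""
    index = index_tables[col]
    start = bisect_left(index, (timebucket, value))
    rows = []
    for tb, v, rowkey in index[start:]:
        if (tb, v) != (timebucket, value):
            break                          # past the matching range
        rows.append(base_table[rowkey])    # extra lookup in the base table
    return rows


put("r1", {"timebucket": 1, "col1": 10, "col2": 20})
put("r2", {"timebucket": 1, "col1": 10, "col2": 21})
put("r3", {"timebucket": 2, "col1": 10, "col2": 20})
print([r["col2"] for r in query("col1", 1, 10)])   # [20, 21]
```

In this sketch the space saving comes at the price of one extra base-table lookup per matching row, which is the trade-off raised earlier in the thread.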
For the 2nd question: in HBase there is no node join/remove problem, since
the storage layer (HDFS) and the computing layer are completely separated.
You don't have to move data when an HBase node joins or leaves.

For the 3rd question, please refer to Josh Elser's previous reply; it is just
'marketing trash'. HBase is a high-performance, low-latency ONLINE storage
system that is already widely used in many real-time production systems.

Best Regards
Allan Yang

Josh Elser wrote on Tue, Sep 11, 2018 at 9:26 PM:

> Please be patient in getting a response to questions you post to this
> list as we're all volunteers.
>
> On 9/8/18 2:16 AM, onmstester onmstester wrote:
> > Hi,
> >
> > Currently I'm using Apache Cassandra as the backend for my RESTful
> > application. With a cluster of 30 nodes (each with 12 cores, 64 GB RAM
> > and 6 TB of disk, of which 50% is used), write and read throughput is
> > more than satisfactory for us. The input is a fixed set of long and int
> > columns that we need to query by every column, so with 8 columns there
> > should be 8 tables, based on the Cassandra query plan recommendation.
> > The Cassandra keyspace schema would be something like this:
> >
> > Table 1 (timebucket, col1, ..., col8, primary key(timebucket, col1))
> > to handle select * from input where timebucket = X and col1 = Y
> >
> > Table 8 (timebucket, col1, ..., col8, primary key(timebucket, col8))
> >
> > So for each input row there would be 8 inserts in Cassandra (not
> > considering RF), and with a TTL of 12 months the production cluster
> > would have to keep about 2 petabytes of data. With the recommended node
> > density for a Cassandra cluster (2 TB per node), I would need a cluster
> > of more than 1000 nodes (which I cannot afford). Long story short: I'm
> > looking for an alternative to Apache Cassandra for this application.
> > How would HBase solve these problems:
> >
> > 1. 8X data redundancy due to needed queries
>
> HBase provides one intrinsic "index" over the data in your table and
> that is the "rowkey". If you need to access the same data 8 different
> ways, you would need to come up with 8 indexes.
>
> FWIW, this is not what I commonly see. Usually there are 2 or 3 lookups
> that need to happen in the "fast path", not 8. Perhaps you need to take
> another look at your application needs?
>
> > 2. Nodes with large data density (30 TB of data on each node if No. 1
> > cannot be solved in HBase): how would HBase handle compaction and node
> > join/remove problems when there are only 5 * 6 TB 7200 RPM SATA disks
> > available on each node? How much empty space does HBase need for
> > temporary compaction files?
>
> HBase uses a distributed filesystem to ensure that data is available to
> be read by any RegionServer. Obviously, that filesystem needs to have
> sufficient capacity to write a new file which is approximately the sum
> of the file sizes being compacted.
>
> > 3. Also I read in some documents (inclu

Fwd: Migrating from Apache Cassandra to Hbase

2018-09-10 Thread onmstester onmstester
Any idea?

Forwarded message
From: onmstester onmstester
To: "user"
Date: Sat, 08 Sep 2018 10:46:25 +0430
Subject: Migrating from Apache Cassandra to Hbase

Hi,

Currently I'm using Apache Cassandra as the backend for my RESTful
application. With a cluster of 30 nodes (each with 12 cores, 64 GB RAM and
6 TB of disk, of which 50% is used), write and read throughput is more than
satisfactory for us. The input is a fixed set of long and int columns that
we need to query by every column, so with 8 columns there should be 8
tables, based on the Cassandra query plan recommendation. The Cassandra
keyspace schema would be something like this:

Table 1 (timebucket, col1, ..., col8, primary key(timebucket, col1))
to handle select * from input where timebucket = X and col1 = Y

Table 8 (timebucket, col1, ..., col8, primary key(timebucket, col8))

So for each input row there would be 8 inserts in Cassandra (not considering
RF), and with a TTL of 12 months the production cluster would have to keep
about 2 petabytes of data. With the recommended node density for a Cassandra
cluster (2 TB per node), I would need a cluster of more than 1000 nodes
(which I cannot afford). Long story short: I'm looking for an alternative to
Apache Cassandra for this application. How would HBase solve these problems:

1. 8X data redundancy due to needed queries

2. Nodes with large data density (30 TB of data on each node if No. 1 cannot
   be solved in HBase): how would HBase handle compaction and node
   join/remove problems when there are only 5 * 6 TB 7200 RPM SATA disks
   available on each node? How much empty space does HBase need for
   temporary compaction files?

3. Also, I read in some documents (including DataStax's) that HBase is more
   of an offline, data-lake backend that is better not used as a web
   application backend that needs response times under a few seconds.

Thanks in advance

Migrating from Apache Cassandra to Hbase

2018-09-08 Thread onmstester onmstester
Hi,

Currently I'm using Apache Cassandra as the backend for my RESTful
application. With a cluster of 30 nodes (each with 12 cores, 64 GB RAM and
6 TB of disk, of which 50% is used), write and read throughput is more than
satisfactory for us. The input is a fixed set of long and int columns that
we need to query by every column, so with 8 columns there should be 8
tables, based on the Cassandra query plan recommendation. The Cassandra
keyspace schema would be something like this:

Table 1 (timebucket, col1, ..., col8, primary key(timebucket, col1))
to handle select * from input where timebucket = X and col1 = Y

Table 8 (timebucket, col1, ..., col8, primary key(timebucket, col8))

So for each input row there would be 8 inserts in Cassandra (not considering
RF), and with a TTL of 12 months the production cluster would have to keep
about 2 petabytes of data. With the recommended node density for a Cassandra
cluster (2 TB per node), I would need a cluster of more than 1000 nodes
(which I cannot afford). Long story short: I'm looking for an alternative to
Apache Cassandra for this application. How would HBase solve these problems:

1. 8X data redundancy due to needed queries

2. Nodes with large data density (30 TB of data on each node if No. 1 cannot
   be solved in HBase): how would HBase handle compaction and node
   join/remove problems when there are only 5 * 6 TB 7200 RPM SATA disks
   available on each node? How much empty space does HBase need for
   temporary compaction files?

3. Also, I read in some documents (including DataStax's) that HBase is more
   of an offline, data-lake backend that is better not used as a web
   application backend that needs response times under a few seconds.

Thanks in advance
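
The sizing in the message above can be checked with a short calculation. This is only a rough sketch: it assumes decimal units (1 PB = 1000 TB), ignores the replication factor and any free space reserved for compaction, and the variable names are illustrative.

```python
import math

TB_PER_PB = 1000                     # assumption: decimal units

total_data_tb = 2 * TB_PER_PB        # ~2 PB kept over the 12-month TTL
copies = 8                           # one table per queried column
recommended_density_tb = 2           # recommended data per Cassandra node
dense_node_tb = 5 * 6                # 5 disks x 6 TB on each available node

# Nodes needed at the recommended Cassandra density.
print(math.ceil(total_data_tb / recommended_density_tb))   # 1000

# Data volume if each row were stored once instead of 8 times.
print(total_data_tb / copies)                               # 250.0

# Nodes needed if every node carried 30 TB of data.
print(math.ceil(total_data_tb / dense_node_tb))             # 67
```

Under these assumptions the 8x duplication accounts for all but about 250 TB of the 2 PB, which is why the first question in the list dominates the sizing.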