Re: Migrating from Apache Cassandra to Hbase

onmstester onmstester Tue, 11 Sep 2018 21:27:15 -0700
Thank you, Josh and Allan. Sorry for the rush; this question has been on my
mind for some months, but I thought I should first become familiar enough
with one side of the "vs". I have been struggling with Cassandra ever since
and almost forgot that there was a "vs" in my mind!

One main feature of Cassandra is that, given one key (the partition key), it
can retrieve thousands of rows with a few IOPS, because all rows belonging to
a partition are stored in almost the same place on disk. This is why having 8
partition keys means storing each row in 8 places.
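
The locality argument can be illustrated with a minimal Python sketch. It assumes a toy in-memory list kept sorted by (partition key, clustering key) as a stand-in for the on-disk layout. The names `rows`, `keys`, and `scan_partition` are hypothetical and belong to neither Cassandra nor HBase.

```python
from bisect import bisect_left

# Hypothetical toy store: rows kept sorted by (partition_key, clustering_key),
# standing in for data that is laid out contiguously on disk.
rows = sorted(
    ((bucket, col1), f"payload-{bucket}-{col1}")
    for bucket in (1, 2, 3)
    for col1 in range(1000)
)
keys = [key for key, _ in rows]


def scan_partition(bucket):
    """Return all rows of one partition: one seek, then a sequential read."""
    start = bisect_left(keys, (bucket,))      # "seek" to the partition start
    end = bisect_left(keys, (bucket + 1,))    # first key of the next partition
    return rows[start:end]                    # contiguous slice, no per-row lookup


partition = scan_partition(2)
```

Because the keys are sorted, every row that shares the leading key component sits in one contiguous range, so the cost is one positioned read plus a sequential scan rather than one lookup per row.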
Logically, I cannot think of a faster mechanism for loading this amount of
data than keeping it in the same place on disk. I wonder how an indexing
mechanism (like HBase's) could achieve the same performance as Cassandra when
retrieving thousands of rows for a single partition key, architecture-wise.
It would still have to load the rows through some foreign key (the index)
with multiple accesses, which means too many IOPS and much slower reads.

I am going to read the HBase documentation (technical and user manuals),
launch a test cluster of more than 10 nodes running my application logic on
HBase, and try to tune its performance (too many questions to ask in this
forum), just as I have done for Apache Cassandra. But I can't wait that long
to get an answer to these questions.

Sent using Zoho Mail

---- On Wed, 12 Sep 2018 07:12:05 +0430 Allan Yang <[email protected]> wrote ----

You can use Phoenix
+ HBase and use an index in Phoenix. But since you need 8 different kinds of
query, you may need to create 8 different indices and thus 8 index tables.
Unlike Cassandra, however, you do not have to store all the column data
redundantly in every table. Instead, you can use a non-covered index, which
is a simple mapping between the indexed column and the rowkey. So there
won't be 8x space.
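
As a rough illustration of the space argument, here is a back-of-the-envelope sketch in Python. Every size in it is an assumption chosen for illustration (8-byte columns, a 16-byte rowkey), and it ignores storage-engine overhead, compression, and replication, so the resulting ratio says nothing about a real cluster.

```python
# All sizes below are illustrative assumptions, not measurements.
COLUMN_BYTES = 8     # assume every column (long or int) takes 8 bytes
NUM_COLUMNS = 8      # col1 .. col8 from the original question
ROWKEY_BYTES = 16    # assumed size of the data-table rowkey


def full_copy_bytes(rows: int, copies: int = 8) -> int:
    """Every query table stores every column (the 8-table layout)."""
    return rows * copies * NUM_COLUMNS * COLUMN_BYTES


def non_covered_index_bytes(rows: int, indexes: int = 8) -> int:
    """One data table plus index tables holding only (indexed value -> rowkey)."""
    data = rows * (ROWKEY_BYTES + NUM_COLUMNS * COLUMN_BYTES)
    index = rows * indexes * (COLUMN_BYTES + ROWKEY_BYTES)
    return data + index


rows = 1_000_000
ratio = full_copy_bytes(rows) / non_covered_index_bytes(rows)
```

How much is saved depends heavily on row width: the wider the rows relative to the rowkey, the larger the gain from keeping only a mapping in each index. The trade-off is that a hit in a non-covered index needs a second lookup in the data table to fetch the remaining columns.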

For the 2nd question: in HBase there won't be a node join-remove problem,
since the storage layer (HDFS) and the computing layer are completely
separated. You don't have to move data when an HBase node joins or leaves.

For the 3rd question, please refer to Josh Elser's previous reply: it is
just 'marketing trash'. HBase is a high-performance, low-latency ONLINE
storage system that is already widely used in many real-time production
systems.

Best Regards
Allan Yang

Josh Elser <[email protected]>
wrote on Tue, 11 Sep 2018 at 9:26 PM:

> Please be patient in getting a response to questions you post to this
> list as we're all volunteers.
>
> On 9/8/18 2:16 AM, onmstester onmstester wrote:
> > Hi, currently I'm using Apache Cassandra as the backend for my RESTful
> > application. With a cluster of 30 nodes (each with 12 cores, 64 GB RAM
> > and 6 TB disk, 50% of which is used), write and read throughput is
> > more than satisfactory for us. The input is a fixed set of long and int
> > columns, and we need to query on every column, so with 8 columns there
> > should be 8 tables, following the Cassandra query-plan recommendation.
> > The Cassandra keyspace schema would be something like this:
> >
> > Table 1 (timebucket, col1, ..., col8, primary key(timebucket, col1))
> >   to handle: select * from input where timebucket = X and col1 = Y
> > ....
> > Table 8 (timebucket, col1, ..., col8, primary key(timebucket, col8))
> >
> > So for each input row there would be 8X inserts in Cassandra (not
> > considering RF), and with a TTL of 12 months the production cluster
> > would have to keep about 2 petabytes of data. With the recommended node
> > density for a Cassandra cluster (2 TB per node), I would need a cluster
> > of more than 1000 nodes (which I cannot afford). So, long story short:
> > I'm looking for an alternative to Apache Cassandra for this
> > application. How would HBase solve these problems:
> >
> > 1. 8X data redundancy
> > due to needed queries
>
> HBase provides one intrinsic "index" over the data in your table and
> that is the "rowkey". If you need to access the same data 8 different
> ways, you would need to come up with 8 indexes.
>
> FWIW, this is not what I commonly see. Usually there are 2 or 3 lookups
> that need to happen in the "fast path", not 8. Perhaps you need to take
> another look at your application needs?
>
> > 2. Nodes with large data density (30 TB of data on each node if No. 1
> > cannot be solved in HBase): how would HBase handle compaction and node
> > join-remove problems when only 5 * 6 TB 7200 RPM SATA disks are
> > available on each node? How much empty space does HBase need for the
> > temporary files of compaction?
>
> HBase uses a distributed filesystem to ensure that data is available to
> be read by any RegionServer. Obviously, that filesystem needs to have
> sufficient capacity to write a new file which is approximately the sum
> of the file sizes being compacted.
>
> > 3. Also I read in some
> > documents (including DataStax's) that HBase is more of an offline and
> > data-lake backend that is better not used as a web application backend
> > needing response times (QoS) under a few seconds. Thanks in advance.
> >
> > Sent using Zoho Mail
>
> Sounds like marketing trash to me. The entire premise around HBase's
> architecture is:
>
> * Low latency random writes/updates
> * Low latency random reads
> * High throughput writes via batch tools (e.g. bulk loading)
>
> IIRC, many early adopters of HBase were using it in the critical path
> for web applications.