Thank you Josh and Allan,

Sorry for the rush; this question has been on my mind for some months, but I thought I should first become familiar enough with one side of the "vs". I've been struggling with Cassandra ever since and almost forgot that there was a "vs" in my mind!

One main feature of Cassandra is that, given one key (the partition key), it can retrieve thousands of rows with a few IOPS, because all rows of a partition are stored in almost the same place on disk. This is why having 8 partition keys means storing each row in 8 places.
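A toy illustration of that locality argument, in Python. This is not Cassandra's actual storage engine; the in-memory sorted list and the names `rows`, `keys` and `read_partition` are assumptions made for the example. If rows are kept sorted by (partition key, clustering value), one partition is a single contiguous slice that two binary searches can locate.

```python
from bisect import bisect_left, bisect_right

# Toy stand-in for a sorted on-disk file: rows ordered by
# (partition_key, clustering_value). Purely illustrative.
rows = sorted(
    (pk, ck, f"payload-{pk}-{ck}")
    for pk in range(5)
    for ck in range(1000)
)
keys = [pk for pk, _, _ in rows]  # partition key of each row, in file order


def read_partition(pk):
    """Return all rows of one partition as a single contiguous slice."""
    start = bisect_left(keys, pk)   # first row of the partition
    end = bisect_right(keys, pk)    # one past its last row
    return rows[start:end]


partition = read_partition(3)
print(len(partition), partition[0][:2], partition[-1][:2])
```

Reading the slice is one sequential pass, whereas following a separate index entry to each row would cost one lookup per row.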
Logically, I cannot think of a faster mechanism for loading this amount of data than keeping it in the same place on disk. I wonder how an indexing mechanism (like HBase's) could give the same performance as Cassandra for retrieving thousands of rows related to a single partition key, architecture-wise. It would still have to load the rows through some foreign key (the index) with multiple accesses, which means many more IOPS and a much slower read.

I am going to read the HBase documentation (technical and user manuals), launch a test cluster of more than 10 nodes running my application logic on HBase, and try to tune its performance (too many questions to ask in this forum), just as I did for Apache Cassandra. But I can't wait that long for an answer to these questions.

Sent using Zoho Mail

---- On Wed, 12 Sep 2018 07:12:05 +0430 Allan Yang <allan...@apache.org> wrote ----

You can use Phoenix
+ HBase and use indexes in Phoenix. But since you need 8 different kinds of query, you may need to create 8 different indexes and thus 8 index tables. Unlike Cassandra, however, you do not have to store all the column data redundantly in every table. Instead, you can use a non-covered index, which is a simple mapping between the index column and the rowkey. So there won't be 8x space.
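A rough back-of-the-envelope sketch of that difference, in Python. The byte sizes are assumptions chosen for illustration (8 bytes per column value and a 16-byte rowkey), and per-cell storage overhead is ignored, so only the relative comparison is meaningful, not the absolute numbers.

```python
# All default sizes below are illustrative assumptions, not measurements.

def full_copy_bytes_per_row(num_columns=8, value_bytes=8):
    """One table per queried column, each holding every column value."""
    return num_columns * (num_columns * value_bytes)


def non_covered_bytes_per_row(num_columns=8, value_bytes=8, rowkey_bytes=16):
    """One base row plus one index entry (indexed value -> rowkey) per column."""
    base_row = rowkey_bytes + num_columns * value_bytes
    index_entries = num_columns * (value_bytes + rowkey_bytes)
    return base_row + index_entries


full = full_copy_bytes_per_row()
lean = non_covered_bytes_per_row()
print(f"full copies: {full} B/row, non-covered indexes: {lean} B/row")
```

Under these assumptions the index entries are not free: each one still stores the indexed value and a rowkey, so the saving grows with row width and shrinks as the rowkey gets larger.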

For the 2nd question: in HBase there is no node join-remove problem, since the storage layer (HDFS) and the computing layer are completely separated. You don't have to move data when an HBase node joins or leaves.

For the 3rd question, please refer to Josh Elser's previous reply: it is just 'marketing trash'. HBase is a high-performance, low-latency ONLINE storage system that is already widely used in many real-time production systems.

Best Regards
Allan Yang

Josh Elser <els...@apache.org> wrote on Tue, Sep 11, 2018 at 9:26 PM:

> Please be patient in getting a response to questions you post to this
> list as we're all volunteers.
>
> On 9/8/18 2:16 AM, onmstester onmstester wrote:
> > Hi, currently I'm using Apache Cassandra as
> > the backend for my RESTful application. With a cluster of 30 nodes (each
> > with 12 cores, 64 GB RAM and a 6 TB disk, of which 50% is used), write
> > and read throughput is more than satisfactory for us. The input is a
> > fixed set of long and int columns, and we need to query it by every
> > column, so with 8 columns there should be 8 tables, following the
> > Cassandra query plan recommendation. The Cassandra keyspace schema would
> > be something like this:
> >
> > Table 1 (timebucket, col1, ..., col8, primary key(timebucket, col1))
> > to handle select * from input where timebucket = X and col1 = Y
> > ....
> > Table 8 (timebucket, col1, ..., col8, primary key(timebucket, col8))
> >
> > So for each input row there would be 8x inserts in Cassandra (not
> > considering RF), and with a TTL of 12 months the production cluster
> > should keep about 2 petabytes of data. With the recommended node density
> > for a Cassandra cluster (2 TB per node), I need a cluster with more than
> > 1000 nodes (which I cannot afford). So, long story short: I'm looking
> > for an alternative to Apache Cassandra for this application. How would
> > HBase solve these problems:
> >
> > 1. 8X data redundancy
> > due to needed queries
>
> HBase provides one intrinsic "index" over the data in your table and
> that is the "rowkey". If you need to access the same data 8 different
> ways, you would need to come up with 8 indexes.
>
> FWIW, this is not what I commonly see. Usually there are 2 or 3 lookups
> that need to happen in the "fast path", not 8. Perhaps you need to take
> another look at your application needs?
>
> > 2. nodes with large data density (30 TB of data on each node if No. 1
> > cannot be solved in HBase): how would HBase handle compaction and node
> > join-remove problems when there are only 5 x 6 TB 7200 RPM SATA disks
> > available on each node? How much empty space does HBase need for the
> > temporary files of compaction?
>
> HBase uses a distributed filesystem to ensure that data is available to
> be read by any RegionServer. Obviously, that filesystem needs to have
> sufficient capacity to write a new file which is approximately the sum
> of the file sizes being compacted.
>
> > 3. Also I read in some
> > documents (including DataStax's) that HBase is more of an offline and
> > data-lake backend that is better not used as a web application backend
> > which needs a response-time QoS of less than a few seconds. Thanks in
> > advance
> > Sent using Zoho Mail
>
> Sounds like marketing trash to me. The entire premise around HBase's
> architecture is:
>
> * Low latency random writes/updates
> * Low latency random reads
> * High throughput writes via batch tools (e.g. Bulk loading)
>
> IIRC, many early adopters of HBase were using it in the critical-path
> for web applications.
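
The compaction headroom rule of thumb from the answer to question 2 (room for a new file roughly the sum of the files being compacted) can be sketched in Python. The file sizes are made-up inputs and the function name is hypothetical; the `replication` default of 3 matches the usual HDFS default but is an assumption about the cluster.

```python
def compaction_headroom_bytes(file_sizes, replication=3):
    """Upper-bound estimate of the extra raw space one compaction needs.

    The rewritten file is at most about the sum of its inputs, and the
    distributed filesystem stores `replication` copies of it. The input
    files stay on disk until the new file is complete.
    """
    return sum(file_sizes) * replication


GIB = 1024 ** 3
# Hypothetical store files selected for a single compaction.
selected = [10 * GIB, 4 * GIB, 2 * GIB]
print(compaction_headroom_bytes(selected) / GIB, "GiB of raw capacity")
```

The estimate scales with the files selected for a compaction, not with the total data on the node, so the free space to plan for depends on how large the biggest compaction can get.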