You can use Phoenix on top of HBase and create indexes in Phoenix. But
since you need 8 different kinds of queries, you may need to create 8
different indexes and thus 8 index tables. Unlike Cassandra, though,
you do not have to store all the column data redundantly in every
table: you can use uncovered indexes, which are a simple mapping from
the indexed column to the rowkey. So there won't be 8x the space.
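
For example, here is a minimal sketch of creating those uncovered
indexes through the Phoenix JDBC driver (the table name INPUT_EVENTS,
the column names and the ZooKeeper quorum are just placeholders for
your actual schema):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CreateUncoveredIndexes {
    public static void main(String[] args) throws Exception {
        // Placeholder ZooKeeper quorum in the Phoenix JDBC URL.
        try (Connection conn =
                 DriverManager.getConnection("jdbc:phoenix:zk1,zk2,zk3:2181");
             Statement stmt = conn.createStatement()) {
            // One uncovered index per query dimension. Each index table
            // holds only TIMEBUCKET, COLi and the data table's primary
            // key, i.e. a pointer back to the rowkey, not a full copy
            // of the row.
            for (int i = 1; i <= 8; i++) {
                stmt.execute("CREATE INDEX IF NOT EXISTS IDX_COL" + i
                        + " ON INPUT_EVENTS (TIMEBUCKET, COL" + i + ")");
            }
        }
    }
}

Keep in mind that for an uncovered index, Phoenix may need an index
hint before it will use it for a SELECT *, since answering the query
requires an extra lookup back into the data table.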

For the 2nd question: in HBase there is no node join/remove problem,
since the storage layer (HDFS) and the computing layer are completely
separated. You don't have to move data when an HBase node joins or
leaves the cluster.

For the 3rd question, please refer to Josh Elser's reply below: that
claim is just marketing trash. HBase is a high-performance, low-latency
ONLINE storage system which has already been massively used in many
real-time production systems.

Best Regards
Allan Yang


Josh Elser <[email protected]> wrote on Tue, Sep 11, 2018 at 9:26 PM:

> Please be patient in getting a response to questions you post to this
> list as we're all volunteers.
>
> On 9/8/18 2:16 AM, onmstester onmstester wrote:
> > Hi, currently I'm using Apache Cassandra as the backend for my
> > RESTful application. With a cluster of 30 nodes (each having 12
> > cores, 64 GB RAM and 6 TB disk, of which 50% is used), write and
> > read throughput is more than satisfactory for us. The input is a
> > fixed set of long and int columns which we need to query by every
> > column, so having 8 columns there should be 8 tables, per the
> > Cassandra query-plan recommendation. The Cassandra keyspace schema
> > would be something like this:
> >
> > Table 1 (timebucket, col1, ..., col8, primary key(timebucket, col1))
> > to handle: select * from input where timebucket = X and col1 = Y
> > ...
> > Table 8 (timebucket, col1, ..., col8, primary key(timebucket, col8))
> >
> > So for each input row, there would be 8x inserts in Cassandra (not
> > considering RF), and using a TTL of 12 months the production cluster
> > should keep about 2 petabytes of data. With the recommended node
> > density for a Cassandra cluster (2 TB per node), I need a cluster
> > with more than 1000 nodes, which I cannot afford. So long story
> > short: I'm looking for an alternative to Apache Cassandra for this
> > application. How would HBase solve these problems:
>
>
> > 1. 8X data redundancy due to needed queries
>
> HBase provides one intrinsic "index" over the data in your table and
> that is the "rowkey". If you need to access the same data 8 different
> ways, you would need to come up with 8 indexes.
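>
> To make that concrete, here is a rough sketch of one hand-rolled index
> with the HBase Java client (the table names, the 'd' column family and
> the key layout are all made up; in practice something like Phoenix or
> a coprocessor would keep the index in sync for you):
>
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.hbase.HBaseConfiguration;
> import org.apache.hadoop.hbase.TableName;
> import org.apache.hadoop.hbase.client.Connection;
> import org.apache.hadoop.hbase.client.ConnectionFactory;
> import org.apache.hadoop.hbase.client.Put;
> import org.apache.hadoop.hbase.client.Table;
> import org.apache.hadoop.hbase.util.Bytes;
>
> public class ManualIndexPut {
>     public static void main(String[] args) throws Exception {
>         Configuration conf = HBaseConfiguration.create();
>         try (Connection conn = ConnectionFactory.createConnection(conf);
>              Table data = conn.getTable(TableName.valueOf("events"));
>              Table index =
>                  conn.getTable(TableName.valueOf("events_idx_col1"))) {
>             byte[] rowkey = Bytes.toBytes("bucket42#row-0001");
>             long col1 = 12345L;
>             // Data table: rowkey -> the full row.
>             Put dataPut = new Put(rowkey);
>             dataPut.addColumn(Bytes.toBytes("d"), Bytes.toBytes("col1"),
>                     Bytes.toBytes(col1));
>             data.put(dataPut);
>             // Index table: (col1 value + data rowkey) -> data rowkey,
>             // i.e. a pointer, not a copy of all 8 columns.
>             Put idxPut = new Put(Bytes.add(Bytes.toBytes(col1), rowkey));
>             idxPut.addColumn(Bytes.toBytes("d"), Bytes.toBytes("r"),
>                     rowkey);
>             index.put(idxPut);
>         }
>     }
> }
>
> A lookup on col1 then becomes a prefix Scan over the index table
> followed by a Get against the data table.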
>
> FWIW, this is not what I commonly see. Usually there are 2 or 3 lookups
> that need to happen in the "fast path", not 8. Perhaps you need to take
> another look at your application needs?
>
> > 2. Nodes with large data density (30 TB of data on each node, if
> > No. 1 cannot be solved in HBase): how would HBase handle compaction
> > and node join/remove problems while there are only 5 x 6 TB 7200 RPM
> > SATA disks available on each node? How much empty space does HBase
> > need for the temporary files of compaction?
>
> HBase uses a distributed filesystem to ensure that data is available to
> be read by any RegionServer. Obviously, that filesystem needs to have
> sufficient capacity to write a new file which is approximately the sum
> of the file sizes being compacted.
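>
> As a made-up, back-of-the-envelope example of that transient space:
>
> import java.util.Arrays;
>
> public class CompactionHeadroom {
>     public static void main(String[] args) {
>         // Hypothetical HFile sizes (in GB) in one store that is about
>         // to be major-compacted.
>         long[] hfileGb = {6, 3, 2, 1};
>         long newFileGb = Arrays.stream(hfileGb).sum();
>         // The old and new files coexist until the old ones are
>         // deleted, so HDFS needs roughly this much free space (times
>         // the HDFS replication factor).
>         System.out.println("Transient headroom: ~" + newFileGb + " GB");
>     }
> }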
>
> > 3. Also I read in some documents (including DataStax's) that HBase
> > is more of an offline & data-lake backend that is better not used as
> > a web application backend which needs a response-time QoS of less
> > than a few seconds. Thanks in advance.
>
> Sounds like marketing trash to me. The entire premise around HBase's
> architecture is:
>
> * Low latency random writes/updates
> * Low latency random reads
> * High-throughput writes via batch tools (e.g. bulk loading)
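>
> As an illustration of the random-read path (same made-up table and
> column family as in the sketch above), serving a web request boils
> down to a single Get:
>
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.hbase.HBaseConfiguration;
> import org.apache.hadoop.hbase.TableName;
> import org.apache.hadoop.hbase.client.Connection;
> import org.apache.hadoop.hbase.client.ConnectionFactory;
> import org.apache.hadoop.hbase.client.Get;
> import org.apache.hadoop.hbase.client.Result;
> import org.apache.hadoop.hbase.client.Table;
> import org.apache.hadoop.hbase.util.Bytes;
>
> public class PointRead {
>     public static void main(String[] args) throws Exception {
>         Configuration conf = HBaseConfiguration.create();
>         try (Connection conn = ConnectionFactory.createConnection(conf);
>              Table data = conn.getTable(TableName.valueOf("events"))) {
>             // Single-row random read against the rowkey: typically a
>             // few milliseconds, which is what makes HBase usable in
>             // the critical path of a web application.
>             Result r = data.get(new Get(Bytes.toBytes("bucket42#row-0001")));
>             long col1 = Bytes.toLong(
>                     r.getValue(Bytes.toBytes("d"), Bytes.toBytes("col1")));
>             System.out.println("col1 = " + col1);
>         }
>     }
> }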
>
> IIRC, many early adopters of HBase were using it in the critical-path
> for web applications.
>
