My interest here is in thinking about how HBase Coprocessors can support their 
use case, and in making people aware that this team at ICT CAS is interested in 
contributing their work.

However, the ICT guys are not on the list, so I took this question to them. 
Below is the response:

>>>
In the micro benchmark, throughput is almost 47,000 records/s with a 1 KB row 
size across 3 nodes, so per-node throughput is 15.3 MB/s.
In the synthetic application benchmark, the row size is 118 bytes and cluster 
throughput is 300,000 records/s, so calculating the I/O rate from the record 
count gives 2.11 MB/s per node.

In our opinion, there are two reasons:

In our tests, we found that HBase's performance depends on the number of 
records. For example, in experiments on the 1 GB data set (micro benchmark), 
when we split the 1 GB into 0.1M records of 10 KB each, the get, put, and scan 
performance of HBase was much better than with the data set of 1M records of 
1 KB each.

On the other hand, the synthetic application benchmark uses multi-dimensional 
range queries to retrieve data. For example, in our paper the query looks like:
"select * from ServiceTime where (primaryKey > K1 and primaryKey < K2) and 
(time > k3 and time < k4) and (service = 'CPU Load')".
If we choose the CCIT indexed by "time" to scan the data, records whose 
"primaryKey" and "service" values don't satisfy the predicates are filtered 
out and not counted in this test. So we can't calculate the I/O rate from the 
record count alone. That's the primary reason.
<<<
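To make the arithmetic in the quoted response concrete, here is a small sketch. The record sizes and rates are taken from the mails, the 16-node count for the synthetic benchmark comes from Vladimir's reply further down, and the filtering illustration and all names are mine, not the CCIndex code:

```python
def per_node_mb_per_s(records_per_s, record_bytes, nodes):
    """Cluster-wide record throughput -> per-node MB/s (1 MB = 2**20 bytes)."""
    return records_per_s * record_bytes / nodes / 2**20

# Micro benchmark: 47,000 records/s, 1 KB rows, 3 nodes -> ~15.3 MB/s per node
micro = per_node_mb_per_s(47_000, 1024, 3)

# Synthetic benchmark: 300,000 records/s, 118-byte rows, 16 nodes -> ~2.11 MB/s
synthetic = per_node_mb_per_s(300_000, 118, 16)

# Why the synthetic number understates real I/O: scanning the time-indexed
# CCIT reads every record in the time range, but records failing the
# primaryKey/service predicates are filtered out and never counted in the
# 300K records/s figure.
records = [
    {"primaryKey": pk, "time": pk % 100,
     "service": "CPU Load" if pk % 2 else "Disk"}
    for pk in range(1000)
]
scanned = [r for r in records if 10 <= r["time"] < 50]      # index range scan
returned = [r for r in scanned
            if 100 < r["primaryKey"] < 900 and r["service"] == "CPU Load"]
# len(scanned) > len(returned): the I/O actually performed exceeds what a
# record-count-based rate suggests.
```

So the 2.11 MB/s figure counts only returned records, not bytes scanned, which is the discrepancy the response is explaining.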

Best regards,

    - Andy


--- On Thu, 12/9/10, Vladimir Rodionov <[email protected]> wrote:

> From: Vladimir Rodionov <[email protected]>
> Subject: RE: subtopic for coprocessor hackathon: CCIndex
> To: "[email protected]" <[email protected]>
> Date: Thursday, December 9, 2010, 11:14 AM
> 90M records, 118 bytes each ~ 10GB of
> data (w/o compression)
> 16 node cluster
> 300K records per sec = 35MB -> ~2MB per sec per node
> 
> Maybe there is something I missed here, but these numbers
> really do not impress. A good old brute-force scan M/R job on a
> 16 node grid should be much faster. 
> 
> Best regards,
> Vladimir Rodionov
> Principal Platform Engineer
> Carrier IQ, www.carrieriq.com
> e-mail: [email protected]
> 
> ________________________________________
> From: Andrew Purtell [[email protected]]
> Sent: Thursday, December 09, 2010 10:06 AM
> To: [email protected]
> Subject: subtopic for coprocessor hackathon: CCIndex
> 
> While in Beijing I met with a group at the Institute of
> Computing at the Chinese Academy of Sciences who are
> interested in contributing a secondary indexing scheme for
> HBase. It is my understanding this is the same group that
> contributed RCFile to Hive. See at the links below a slide
> deck and technical report describing what they have done,
> called CCIndex.
> 
> Slides: https://iridiant.s3.amazonaws.com/ccindex_v1.pdf
> Paper: https://iridiant.s3.amazonaws.com/CCIndex.pdf
> 
> We discussed initially posting their code -- based on
> 0.20.1 -- up on GitHub and this was agreed. This should be
> happening soon.
> 
> We also discussed a possible path for contribution of this
> work in maintainable/distributable form as a coprocessor
> based reimplementation, considering support in the framework
> for what CCindex needs at a low level (I/O concerns), and
> splitting out the rest into a coprocessor. I've heard other
> talk of implementing secondary indexing using a coprocessor
> foundation. I think CCIndex is one option on the table, a
> starting point for discussion.
> 
> Best regards,
> 
>     - Andy


