90M records, 118 bytes each ~ 10GB of data (w/o compression)
16 node cluster
300K records per sec = ~35MB per sec -> ~2MB per sec per node
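
A quick sanity check of the arithmetic above, as a minimal Python sketch. The function and variable names are illustrative only, and decimal units (1 MB = 10^6 bytes, 1 GB = 10^9 bytes) are assumed:

```python
def throughput_mb_per_sec(records_per_sec, record_bytes, nodes):
    """Return (cluster MB/s, per-node MB/s) for a given record rate."""
    total_mb = records_per_sec * record_bytes / 1e6  # decimal megabytes
    return total_mb, total_mb / nodes

# Figures quoted above: 90M records, 118 bytes each, 16 nodes, 300K records/sec.
dataset_gb = 90_000_000 * 118 / 1e9
total_mb, per_node_mb = throughput_mb_per_sec(300_000, 118, 16)

print(f"dataset:  {dataset_gb:.2f} GB uncompressed")
print(f"cluster:  {total_mb:.1f} MB/s")
print(f"per node: {per_node_mb:.2f} MB/s")
```

This only reproduces the rounding in the figures above; it says nothing about how the numbers were measured.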

Maybe there is something I missed here, but these numbers are really not impressive.
A good old brute-force scan M/R job on a 16-node grid should be much faster.

Best regards,
Vladimir Rodionov
Principal Platform Engineer
Carrier IQ, www.carrieriq.com
e-mail: [email protected]

________________________________________
From: Andrew Purtell [[email protected]]
Sent: Thursday, December 09, 2010 10:06 AM
To: [email protected]
Subject: subtopic for coprocessor hackathon: CCIndex

While in Beijing I met with a group at the Institute of Computing at the 
Chinese Academy of Sciences who are interested in contributing a secondary 
indexing scheme for HBase. It is my understanding this is the same group that 
contributed RCFile to Hive. See the links below for a slide deck and technical 
report describing what they have done, called CCIndex.

Slides: https://iridiant.s3.amazonaws.com/ccindex_v1.pdf
Paper: https://iridiant.s3.amazonaws.com/CCIndex.pdf

We discussed initially posting their code -- based on 0.20.1 -- up on GitHub 
and this was agreed. This should be happening soon.

We also discussed a possible path for contribution of this work in 
maintainable/distributable form as a coprocessor based reimplementation, 
considering support in the framework for what CCIndex needs at a low level (I/O 
concerns), and splitting out the rest into a coprocessor. I've heard other talk 
of implementing secondary indexing using a coprocessor foundation. I think 
CCIndex is one option on the table, a starting point for discussion.

Best regards,

    - Andy


