90M records, 118 bytes each ~ 10GB of data (w/o compression)
16 node cluster
300K records per sec = 35MB per sec -> ~2MB per sec per node
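
The arithmetic behind these figures can be checked with a short Python sketch. The function name and the decimal units (1 MB = 10^6 bytes, 1 GB = 10^9 bytes) are assumptions made for illustration; only the four input numbers come from the figures above. The scan time is a derived value that assumes the 300K records per sec rate holds across the whole 90M record set.

```python
def scan_throughput(records, bytes_per_record, records_per_sec, nodes):
    """Derive data volume and throughput from the quoted benchmark figures.

    Hypothetical helper for illustration; uses decimal units (assumption).
    """
    total_bytes = records * bytes_per_record
    cluster_bytes_per_sec = records_per_sec * bytes_per_record
    return {
        "total_gb": total_bytes / 1e9,
        "cluster_mb_per_sec": cluster_bytes_per_sec / 1e6,
        "per_node_mb_per_sec": cluster_bytes_per_sec / nodes / 1e6,
        # Assumes the quoted rate is sustained over the full data set.
        "full_scan_sec": records / records_per_sec,
    }


if __name__ == "__main__":
    stats = scan_throughput(
        records=90_000_000,
        bytes_per_record=118,
        records_per_sec=300_000,
        nodes=16,
    )
    for name, value in stats.items():
        print(f"{name}: {value:.2f}")
```

With these inputs the sketch gives about 10.6 GB in total, about 35.4 MB per sec for the cluster, and about 2.2 MB per sec per node, which matches the rounded numbers above.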

Maybe there is something I missed here, but these numbers really do not impress. A good old brute-force scan M/R job on a 16 node grid should be much faster.

Best regards,
Vladimir Rodionov
Principal Platform Engineer
Carrier IQ, www.carrieriq.com
e-mail: [email protected]

________________________________________
From: Andrew Purtell [[email protected]]
Sent: Thursday, December 09, 2010 10:06 AM
To: [email protected]
Subject: subtopic for coprocessor hackathon: CCIndex

While in Beijing I met with a group at the Institute of Computing at the Chinese Academy of Sciences who are interested in contributing a secondary indexing scheme for HBase. It is my understanding this is the same group that contributed RCFile to Hive. See the links below for a slide deck and technical report describing what they have done, called CCIndex.

Slides: https://iridiant.s3.amazonaws.com/ccindex_v1.pdf
Paper: https://iridiant.s3.amazonaws.com/CCIndex.pdf

We discussed initially posting their code -- based on 0.20.1 -- up on GitHub and this was agreed. This should be happening soon.

We also discussed a possible path for contribution of this work in maintainable/distributable form as a coprocessor based reimplementation, considering support in the framework for what CCIndex needs at a low level (I/O concerns), and splitting out the rest into a coprocessor.

I've heard other talk of implementing secondary indexing using a coprocessor foundation. I think CCIndex is one option on the table, a starting point for discussion.

Best regards,

- Andy
