> From the limitations you mention, 1) and 2) we can live with, but 3)
> could be why my quick tests are already giving incorrect record
> counts. That sounds like a show stopper straight away right?
>
> One option for us would be HBase for the primary store for random
> access, and periodic (e.g. 12 hourly) exports to HDFS for all the full
> scanning. Would you consider that sane?

Is your primary access pattern scans or random reads and writes? If it's primarily scans, you could consider just keeping flat files. If you need random reads and writes over a small amount of data, Sqoop it out to MySQL or such. If you need random reads and writes across all your data (I'm assuming big numbers), sure, have HBase as your authoritative store and scan over it. If you need to do lots of scans, and often, export the data every now and then (so you minimize its "staleness") and scan over the export.
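As a rough sketch of the periodic-export approach, HBase ships with a MapReduce Export job that writes a table out to HDFS as sequence files; a cron entry like the one below would refresh the snapshot every 12 hours. The table name `events` and the output path are placeholders, not anything from your setup:

```shell
# crontab entry (sketch): dump HBase table "events" to HDFS every 12 hours
# so full scans can run against the export instead of the live table.
0 */12 * * * hbase org.apache.hadoop.hbase.mapreduce.Export events /exports/events
```

In practice you'd probably want to write each export to a timestamped directory and clean up old ones, so a scan never reads a half-written export.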
Having said that, scan performance being an order of magnitude slower than flat files doesn't seem right. You may be able to tune the cluster and get considerably better performance out of it.
