Michael, I do not think it's a competitor to Solr, Solr/HBase, or Cloudera Search, but it can be a good addition to an HBase SQL front-end such as Phoenix.

On Wed, Aug 14, 2013 at 8:45 AM, Michael Segel <michael_se...@hotmail.com> wrote:

> Guys,
>
> Sorry to be a debbie downer here, but really this is not a good idea.
> Here's why:
>
> In terms of design, you have some serious scalability and performance
> issues when compared to alternatives.
>
> Let me try to give you a real life example. *
>
> CCCIS (CCC Information Services) is the middle man in the US between the
> auto repair shop and the insurance company. They have one competitor but
> they handle most of the accident claims in the US.
> So when you go to your authorized repair shop, they have this application
> called Pathways which takes down all of your information and the accident,
> the parts required to be replaced and sends it first to CCC which then
> sends it on to your insurance company. In short CCC collects a lot of
> information about the type of vehicles, the accidents, the cost of parts,
> labor to put your car back on the road. As the middle man they collect a
> lot of very useful information…
>
> So imagine you have a large data warehouse in HBase of all of the claims.
> Your primary key is going to be a composite of the insurer and the claim_id.
>
> But you're going to want to also index based on the make/model, type of
> accident, driver details, location… , VIN
>
> This will allow your actuaries to figure out the average cost of a front
> end collision, by make and model, by state/zip.
> Or by age bracket, who's a better driver?
>
> Imagine that the claim table will have a column for the claim in its
> entirety as an Avro doc (JSON) along with the important fields broken out
> separately. (For this example the schema isn't that important.)
>
> So you want to find the average cost of a front end collision of a VOLVO
> S80 for the past 3 model years.
>
> Now, you have an index based on manufacturer/model/year.
>
> Using your index scheme, you now have to query every RS for the row keys
> in the index.
> Then you have to take these results and then put them in a sort order in
> order to use the index.
>
> Note: This isn't too bad if you're doing a simple query against one index.
> You can do the work by RS and then join the results from all RS.
>
> However… what happens if you have two indexes and your result set is going
> to be the intersection of the indexes?
>
> Or you're going to do a join between two tables using the indexes to limit
> the result set?
>
> Now your design breaks down quickly.
>
> And then there's another problem.
> Your index may be relatively much smaller than your base table.
> In this example… the insurance claim is a huge record. I would say 2-3
> orders of magnitude larger than the row key. Since you split your index
> at the same rate you split your table… you will have a lot of regions for
> your index.
>
> Again, this may lead to other issues….
>
> Is it better than doing a full table scan? Sure.
>
> Are there better alternatives?
> Yes.
> Apply KISS. (Keep it simple)
>
> Still using an inverted table, let HBase manage it rather than trying to
> tie it to the underlying base table.
> While its not perfect, its lighter, and will perform better in the general
> use cases. (You could even use Async HBase to decouple the write to the
> base table and the update to the index.)
>
> Same model could be applied to a Lucene index as well.
>
> Just Saying….
>
> -Mike
>
> *FULL DISCLOSURE
> I am a consultant and CCC was a client of mine back in the late '90s. In
> one project I worked on ProEFT (now defunct) and an ODS, also now defunct.
> The example is a hypothetical of what I would do if I were CCC and wanted
> to use Big Data to help manage Auto claims. Any resemblance to any actual
> work being done by CCC in the Big Data space is pure coincidence. ;-)
>
> On Aug 13, 2013, at 1:31 PM, Andrew Purtell <apurt...@apache.org> wrote:
>
> > Thanks so much for the contribution!
> >
> > On Mon, Aug 12, 2013 at 11:19 PM, rajeshbabu chintaguntla <rajeshbabu.chintagun...@huawei.com> wrote:
> >
> >> Hi,
> >>
> >> We have been working on implementing secondary index in HBase, and had
> >> shared an overview of our design in the 2012 Hadoop Technical Conference
> >> at Beijing (http://bit.ly/hbtc12-hindex). We are pleased to open source it
> >> today.
> >>
> >> The project is available on github.
> >> https://github.com/Huawei-Hadoop/hindex
> >>
> >> It is 100% Java, compatible with Apache HBase 0.94.8, and is open sourced
> >> under Apache Software License v2.
> >>
> >> Following features are supported currently.
> >> - multiple indexes on table,
> >> - multi column index,
> >> - index based on part of a column value,
> >> - equals and range condition scans using index, and
> >> - bulk loading data to indexed table (Indexing done with bulk load)
> >>
> >> We now plan to raise HBase JIRA(s) to make it available in Apache release,
> >> and can hopefully continue our work on this in the community.
> >>
> >> Regards
> >> Rajeshbabu
> >
> > --
> > Best regards,
> >
> > - Andy
> >
> > Problems worthy of attack prove their worth by hitting back. - Piet Hein
> > (via Tom White)
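
For readers following the thread, the trade-off Michael describes in the quoted message can be sketched in a few lines of Python. This is a toy in-memory model, not HBase or hindex code: the Region class, the helper functions, and the sample claims are all hypothetical and exist only for illustration. It shows that an index stored alongside each region has to be consulted region by region and the results merged, while a separately managed inverted index is addressed directly by the indexed value.

```python
from collections import defaultdict


class Region:
    """Toy stand-in for one region of the base table plus its co-located index."""

    def __init__(self):
        self.rows = {}                       # row key -> {column: value}
        self.local_index = defaultdict(set)  # (column, value) -> row keys in this region

    def put(self, row_key, columns):
        self.rows[row_key] = columns
        for col, val in columns.items():
            self.local_index[(col, val)].add(row_key)


def local_index_lookup(regions, column, value):
    """Scatter-gather: ask every region, then merge and sort the row keys."""
    hits = set()
    contacted = 0
    for region in regions:
        contacted += 1
        hits |= region.local_index.get((column, value), set())
    return sorted(hits), contacted


def build_global_index(regions):
    """Inverted index kept apart from the base table, keyed by (column, value)."""
    index = defaultdict(set)
    for region in regions:
        for row_key, columns in region.rows.items():
            for col, val in columns.items():
                index[(col, val)].add(row_key)
    return index


def global_index_lookup(index, column, value):
    """Single keyed lookup; no fan-out over the base table's regions."""
    return sorted(index.get((column, value), set()))


def intersect(keys_a, keys_b):
    """Row keys that satisfy both index conditions."""
    return sorted(set(keys_a) & set(keys_b))


def demo_regions():
    """Three regions; row keys are insurer|claim_id, as in the quoted example."""
    r0, r1, r2 = Region(), Region(), Region()
    r0.put("acme|0001", {"model": "VOLVO S80", "accident": "front"})
    r0.put("acme|0002", {"model": "FORD F150", "accident": "rear"})
    r1.put("beta|0001", {"model": "VOLVO S80", "accident": "rear"})
    r2.put("zeta|0001", {"model": "VOLVO S80", "accident": "front"})
    return [r0, r1, r2]


if __name__ == "__main__":
    regions = demo_regions()
    volvo, contacted = local_index_lookup(regions, "model", "VOLVO S80")
    front, _ = local_index_lookup(regions, "accident", "front")
    print("regions contacted per local lookup:", contacted)
    print("front-end VOLVO S80 claims:", intersect(volvo, front))
```

The model says nothing about real cluster performance; it only makes the fan-out visible. In the toy, each local-index condition touches all three regions, whereas the global index answers the same condition from one keyed entry.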