The priorities 1), 2) and 3) are pretty well stated. - Victor
On 4/5/10 3:58 PM, "ext Andrew Purtell" <[email protected]> wrote:

Just some ideas, possibly half-baked:

> From: Amandeep Khurana
> Subject: Re: Using SPARQL against HBase
> To: [email protected]
>
> 1. We want to have a SPARQL query engine over it that can return
> results to queries in real time, comparable to other systems out
> there. And since we will have HBase as the storage layer, we want
> to scale well.

Generally, I wonder if HBase may be able to trade disk space for query
processing time for expected common queries.

So part of the story here could be using coprocessors (HBASE-2000) as a
mapping layer between the clients and the plain/simple BigTable store.
For example, an RDF and graph relation aware coprocessor could produce
and cache projections on the fly and use structure aware data placement
strategies for sharding -- so the table or tables exposed to the client
for enabling queries may be only a logical construct backed by one or
more real tables with far different structure, and there would be
intelligence for managing the construct running within the
regionservers. Projections could be built lazily (via interprocess
BSP?), triggered by a new query or an admin action. (And possibly the
results could be cached with TTLs, for automatic garbage collection
managing the total size of the store.)

This opens up a range of implementation options that the basic BigTable
architecture would not support. This is like installing a purpose-built
RDF store within an existing HBase+Hadoop deployment.
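To make the space-for-time trade concrete: one well known trick is to
materialize the cyclic rotations of each triple (in the spirit of
Hexastore's permuted indexes), so any triple pattern with a bound
prefix is answered by a single prefix scan. A rough sketch below,
written against the plain client API for clarity -- in the coprocessor
design this write fan-out would happen inside the regionservers, not
in the client. The table names, key encoding, and class are all made
up for illustration:

// Hypothetical sketch only -- nothing here is an existing API beyond
// the stock HBase client classes. Three rotations of (s, p, o) cover
// every combination of bound terms with a prefix scan.
import java.io.IOException;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HConstants;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class TriplePermutations {
  private static final byte[] F = Bytes.toBytes("t"); // one column family
  private final HTable spo, pos, osp; // one table per rotation

  public TriplePermutations(HBaseConfiguration conf) throws IOException {
    spo = new HTable(conf, "rdf_spo");
    pos = new HTable(conf, "rdf_pos");
    osp = new HTable(conf, "rdf_osp");
  }

  // Concatenate the rotated terms; a real design would dictionary-encode
  // terms to fixed-width ids first.
  private static byte[] key(String a, String b, String c) {
    return Bytes.toBytes(a + '\0' + b + '\0' + c);
  }

  // Write fan-out: three puts per triple. This is the disk we spend.
  public void addTriple(String s, String p, String o) throws IOException {
    spo.put(new Put(key(s, p, o))
        .add(F, HConstants.EMPTY_BYTE_ARRAY, HConstants.EMPTY_BYTE_ARRAY));
    pos.put(new Put(key(p, o, s))
        .add(F, HConstants.EMPTY_BYTE_ARRAY, HConstants.EMPTY_BYTE_ARRAY));
    osp.put(new Put(key(o, s, p))
        .add(F, HConstants.EMPTY_BYTE_ARRAY, HConstants.EMPTY_BYTE_ARRAY));
  }

  // The pattern (?s, p, o) becomes a prefix scan over the POS rotation.
  public void matchByPredicateObject(String p, String o) throws IOException {
    byte[] prefix = Bytes.toBytes(p + '\0' + o + '\0');
    ResultScanner rs = pos.getScanner(new Scan(prefix));
    try {
      for (Result r : rs) {
        if (!Bytes.startsWith(r.getRow(), prefix)) break; // past the prefix
        System.out.println(Bytes.toString(r.getRow()));
      }
    } finally {
      rs.close();
    }
  }
}

Three tables instead of one is exactly the disk-for-query-time trade,
and the coprocessor could decide per workload which rotations or
projections are worth keeping.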
> 2. We want to enable large scale processing as well,
> leveraging Hadoop (maybe? read about this on Cloudera's blog),
> and maybe something like Pregel.

Edward, didn't you do some work implementing graph operations using BSP
message passing within the Hadoop framework? What were your findings?

I think a coprocessor could implement a Pregel-like distributed graph
processing model internally to the region servers, using ZooKeeper
primitives for rendezvous.
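By "ZooKeeper primitives for rendezvous" I mean something like the
standard double barrier recipe: each worker enters the barrier when it
finishes its local superstep and leaves once all N workers have
arrived. A minimal sketch of the entry half below -- a real
implementation also needs the matching leave phase so workers can't
race into the next superstep. The znode layout, worker count, and
class name are assumptions for illustration only:

// Sketch of a superstep entry barrier along the lines of the standard
// ZooKeeper double-barrier recipe. Paths and counts are made up.
import java.util.List;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs.Ids;
import org.apache.zookeeper.ZooKeeper;

public class SuperstepBarrier implements Watcher {
  private final ZooKeeper zk;
  private final String root; // e.g. "/pregel/superstep-42"
  private final int size;    // number of participating workers
  private final Object mutex = new Object();

  public SuperstepBarrier(String quorum, String root, int size)
      throws Exception {
    this.zk = new ZooKeeper(quorum, 30000, this);
    this.root = root;
    this.size = size;
    if (zk.exists(root, false) == null) {
      try {
        zk.create(root, new byte[0], Ids.OPEN_ACL_UNSAFE,
            CreateMode.PERSISTENT);
      } catch (KeeperException.NodeExistsException e) {
        // another worker created it first; fine
      }
    }
  }

  public void process(WatchedEvent event) {
    synchronized (mutex) { mutex.notifyAll(); } // children changed; recheck
  }

  // Block until all workers have announced the end of their superstep.
  // Ephemeral nodes mean a crashed worker releases the barrier on
  // session expiry rather than wedging it forever.
  public void enter(String workerId) throws Exception {
    zk.create(root + "/" + workerId, new byte[0],
        Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
    while (true) {
      synchronized (mutex) {
        List<String> arrived = zk.getChildren(root, true); // set watch
        if (arrived.size() >= size) return;
        mutex.wait();
      }
    }
  }
}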
> These things are fluid and the first step would be to spec
> out features that we want to build in

In my opinion as a potential user of such a service, the design
priorities should be something like:

1) Scale.
2) Real time queries.
3) Support a reasonable subset of possible queries over the data.

Obviously both #1 and #2 are in tension with #3, so some expressiveness
could be sacrificed. #1 and #2 are in tension as well: it would not be
desirable to provide for all possible queries to be returned in real
time, given that the cost of doing so is an unsupportable space
explosion.

My rationale for the above is that a BigTable-hosted RDF store could be
less expressive than the alternatives, but that would be acceptable if
the reason for considering the solution is the 'Big' in BigTable. That
is not the only consideration, though: if it can also be fast for the
common cases even with moderately sized data, it is a good alternative,
and may already be installed as part of a larger strategy employing the
Hadoop stack.

We should consider a motivating use case, or a few of them. For me, I'd
like a canonical source of provenance. We have a patchwork of tracking
systems. I'd like to be able to link the provenance for all of our
workflows and data, inputs and outputs at each stage. It should support
fast queries for weighting inputs to predictive models.

It should support bulk queries also, so that as we assess or reassess
the reliability and trustworthiness of a source or service, we can
trace all data and all conclusions contributed by that entity, and all
that build upon them -- the whole cascade -- by following the linkage.
We would then be able to invalidate any conclusions based on data or
process we deem (at some arbitrary time) flawed or untrustworthy.

This "provenance store" would be a new metaindex over several workflows
and data islands. See:

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.99.7575&rep=rep1&type=pdf
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.91.3562&rep=rep1&type=pdf

Deletions would be rare, if ever.

- Andy