Just some ideas, possibly half-baked:

> From: Amandeep Khurana
> Subject: Re: Using SPARQL against HBase
> To: [email protected]
> 1. We want to have a SPARQL query engine over it that can return
> results to queries in real time, comparable to other systems out
> there. And since we will have HBase as the storage layer, we want
> to scale well.

Generally, I wonder whether HBase could trade disk space for query processing 
time on the expected common queries. 

So part of the story here could be using coprocessors (HBASE-2000) as a mapping 
layer between the clients and the plain/simple BigTable store. For example, an 
RDF- and graph-relation-aware coprocessor could produce and cache projections on 
the fly and use structure-aware data placement strategies for sharding -- so 
the table or tables exposed to the client for querying may be only a logical 
construct, backed by one or more real tables with a far different structure, 
with the intelligence for managing the construct running within the 
regionservers. Projections could be built lazily (via interprocess BSP?), 
triggered by a new query or an admin action. (And possibly the results could be 
cached with TTLs, so automatic garbage collection keeps the total size of the 
store in check.)
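
To make the mapping-layer idea a bit more concrete, here's a rough sketch 
(nothing more) of a RegionObserver-style coprocessor that answers reads on the 
logical table from a precomputed projection table when one exists, and 
otherwise falls through to the normal read path. The class, the 
'rdf_projections' table, and the row key convention are all made up for 
illustration, and the hook signature follows the RegionObserver API that 
eventually grew out of HBASE-2000 (roughly the 1.x shape), so details will 
vary by version:

  import java.io.IOException;
  import java.util.List;

  import org.apache.hadoop.hbase.Cell;
  import org.apache.hadoop.hbase.TableName;
  import org.apache.hadoop.hbase.client.Get;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.client.Table;
  import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
  import org.apache.hadoop.hbase.coprocessor.ObserverContext;
  import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;

  public class ProjectionMappingObserver extends BaseRegionObserver {

    // Hypothetical physical table holding lazily built, TTL-managed projections.
    private static final TableName PROJECTIONS = TableName.valueOf("rdf_projections");

    @Override
    public void preGetOp(ObserverContext<RegionCoprocessorEnvironment> ctx,
                         Get get, List<Cell> results) throws IOException {
      // Check for a precomputed projection keyed the same way as the logical row.
      Table projections = ctx.getEnvironment().getTable(PROJECTIONS);
      try {
        Result cached = projections.get(new Get(get.getRow()));
        if (!cached.isEmpty()) {
          results.addAll(cached.listCells());
          ctx.bypass();   // serve from the projection; skip the normal read path
        }
        // Otherwise fall through to the regular store; a background task could
        // be queued here to materialize the projection lazily for next time.
      } finally {
        projections.close();
      }
    }
  }

The point is just that the interception happens inside the regionserver, so 
clients still see a single logical table while the physical layout and caching 
stay free to change underneath.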

This opens up a range of implementation options that the basic BigTable 
architecture would not support. It would amount to installing a purpose-built 
RDF store within an existing HBase+Hadoop deployment.

> 2. We want to enable large scale processing as well,
> leveraging Hadoop (maybe? read about this on Cloudera's blog),
> and maybe something like Pregel.

Edward, didn't you do some work implementing graph operations using BSP message 
passing within the Hadoop framework? What were your findings? 

I think a coprocessor could implement a Pregel-like distributed graph 
processing model internal to the regionservers, using ZooKeeper primitives 
for rendezvous. 
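
For the rendezvous piece only (not the message passing), something like the 
standard ZooKeeper barrier recipe would do: each worker -- say one per 
participating regionserver -- checks in under a superstep znode and waits 
until everyone has arrived. The /bsp path, the fixed worker count, and the 
polling loop below are all assumptions for the sketch; a real implementation 
would use watches rather than polling:

  import java.util.List;

  import org.apache.zookeeper.CreateMode;
  import org.apache.zookeeper.KeeperException;
  import org.apache.zookeeper.ZooDefs;
  import org.apache.zookeeper.ZooKeeper;

  public class SuperstepBarrier {

    private final ZooKeeper zk;
    private final int expectedWorkers;   // e.g. number of participating regionservers

    public SuperstepBarrier(ZooKeeper zk, int expectedWorkers) {
      this.zk = zk;
      this.expectedWorkers = expectedWorkers;
    }

    // Block until every worker has checked in for the given superstep.
    public void await(long superstep, String workerId)
        throws KeeperException, InterruptedException {
      String barrier = "/bsp/superstep-" + superstep;
      ensureExists("/bsp");
      ensureExists(barrier);
      // Ephemeral node: if a regionserver dies, its vote disappears with its session.
      zk.create(barrier + "/" + workerId, new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
      while (true) {
        List<String> arrived = zk.getChildren(barrier, false);
        if (arrived.size() >= expectedWorkers) {
          return;               // everyone is present; start the next superstep
        }
        Thread.sleep(100);      // a real implementation would set a watch instead
      }
    }

    private void ensureExists(String path) throws KeeperException, InterruptedException {
      if (zk.exists(path, false) == null) {
        try {
          zk.create(path, new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        } catch (KeeperException.NodeExistsException e) {
          // another worker created it first; fine
        }
      }
    }
  }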

> These things are fluid and the first step would be to spec
> out features that we want to build in

In my opinion as a potential user of such a service, the design priorities 
should be something like:

1) Scale.

2) Real-time queries.

3) Support for a reasonable subset of possible queries over the data.

Obviously both #1 and #2 are in tension with #3, so some expressiveness could 
be sacrificed. 

#1 and #2 are in tension with each other as well. It would not be desirable to 
guarantee that all possible queries can be answered in real time if the cost of 
doing so is an unsupportable space explosion. 

My rationale for the above: a BigTable-hosted RDF store could be less 
expressive than the alternatives, and that would be acceptable if the reason 
for considering it in the first place is the 'Big' in BigTable. But that is not 
the only consideration. If it is also fast for the common cases, even on 
moderately sized data, it is a good alternative in its own right -- and one 
that may already be installed as part of a larger strategy employing the Hadoop 
stack. 

We should consider a motivating use case, or a few of them. 

For me, I'd like a canonical source of provenance. We have a patchwork of 
tracking systems. I'd like to be able to link the provenance for all of our 
workflows and data -- inputs and outputs at each stage. It should support fast 
queries for weighting inputs to predictive models. It should also support bulk 
queries, so that as we assess or reassess the reliability and trustworthiness 
of a source or service, we can trace all data and all conclusions contributed 
by that entity, and everything that builds upon them -- the whole cascade -- by 
following the linkage. We could then invalidate any conclusions based on data 
or processes we deem (at some arbitrary later time) flawed or untrustworthy. 
This "provenance store" would be a new metaindex over several workflows and 
data islands. 

  
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.99.7575&rep=rep1&type=pdf

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.91.3562&rep=rep1&type=pdf

Deletions would be rare, if needed at all. 
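
To pin down what "following the linkage" would mean as a query, here's a 
sketch of the bulk cascade query written against a local Jena model; the 
ex:derivedFrom predicate and the flawed-source URI are invented for the 
example, and in the real system the same SPARQL 1.1 property-path query would 
be pushed to the HBase-backed store instead:

  import org.apache.jena.query.Query;
  import org.apache.jena.query.QueryExecution;
  import org.apache.jena.query.QueryExecutionFactory;
  import org.apache.jena.query.QueryFactory;
  import org.apache.jena.query.ResultSet;
  import org.apache.jena.rdf.model.Model;
  import org.apache.jena.rdf.model.ModelFactory;

  public class ProvenanceCascade {
    public static void main(String[] args) {
      Model model = ModelFactory.createDefaultModel();
      model.read("provenance.ttl");   // hypothetical dump of the provenance graph

      // SPARQL 1.1 property path: everything derived, directly or transitively,
      // from the source we now consider untrustworthy.
      String sparql =
          "PREFIX ex: <http://example.org/prov#>                         \n" +
          "SELECT ?artifact WHERE {                                      \n" +
          "  ?artifact ex:derivedFrom+ <http://example.org/source/bad> . \n" +
          "}";

      Query query = QueryFactory.create(sparql);
      QueryExecution qe = QueryExecutionFactory.create(query, model);
      try {
        ResultSet results = qe.execSelect();
        while (results.hasNext()) {
          System.out.println(results.next().getResource("artifact"));
        }
      } finally {
        qe.close();
      }
    }
  }

The transitive '+' on the derivation predicate is what yields the whole cascade 
in one query; the fast point-query case (weighting a single model input) would 
just be the one-hop version of the same pattern.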

   - Andy




