I looked at Tutorial D - it's pretty interesting. Here are a few
top-of-my-head observations:

* Which RDBMS do you use? If you are free to choose a new RDBMS, you
can probably pick one that provides most of the computational
functionality (as SQL constructs/functions) out of the box - for
example Oracle, MS SQL Server, PostgreSQL etc. The reason is
performance: the more you can compute within the database, the less
data you need to fetch for processing.
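For instance, in-database aggregation can be arranged by composing the SQL on the Clojure side, so only the summary rows cross the wire. A minimal sketch - the function, table and column names below are invented for illustration:

```clojure
;; Minimal sketch: build a query that makes the database do the
;; aggregation. Table/column names are hypothetical.
(defn in-db-aggregate-sql
  [table agg-fn col group-col]
  (str "SELECT " group-col ", " agg-fn "(" col ")"
       " FROM " table
       " GROUP BY " group-col))

(in-db-aggregate-sql "cases" "MAX" "balance" "customer_id")
;; => "SELECT customer_id, MAX(balance) FROM cases GROUP BY customer_id"
```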

* The kinds of computations you need to solve look like a superset of
what SQL can provide, so I think you will have to re-state the
problem in terms of computations/iterations over SQL result-sets -
which is probably what you are currently doing in the imperative
language. If you can split every problem into (a) the computation
you need and (b) the SQL queries you need to fire, then you can
easily do it in Clojure itself without needing any DSL.
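As a rough sketch of that (a)-vs-(b) split: once the rows are back from SQL as maps, the computation half is plain Clojure over seqs. The data shape here is invented for illustration:

```clojure
;; Rows as they might come back from a SQL query (hypothetical shape).
(def rows [{:case 1 :v 10} {:case 1 :v 30} {:case 2 :v 5}])

;; Generic aggregation over result-set rows: group, project, fold.
(defn aggregate-by
  [key-fn agg-fn val-fn rows]
  (into {} (for [[k rs] (group-by key-fn rows)]
             [k (agg-fn (map val-fn rs))])))

(aggregate-by :case #(reduce max %) :v rows)
;; => {1 30, 2 5}
```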

* If you want a DSL for this, I suppose it should make maximum use of
the database's built-in query functions/constructs to maximize
performance. This also means the DSL implementation needs to be
database-aware. Secondly, it is possible to write Clojure functions
that emit the appropriate SQL clauses to compute certain pieces of
information. Looking at multiple use cases (covering various aspects -
fuzzy vs deterministic) will be helpful.
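By way of illustration, such clause-emitting functions could look like the following sketch; the names and the clause grammar are made up, not any real library's API:

```clojure
(require '[clojure.string :as string])

;; Each function emits a SQL fragment; composing them yields a query.
(defn agg [fn-name col] (str fn-name "(" col ")"))
(defn where [pred] (str "WHERE " pred))
(defn select [cols table & clauses]
  (apply str "SELECT " (string/join ", " cols)
         " FROM " table
         (map #(str " " %) clauses)))

(select [(agg "MIN" "amount") (agg "SUM" "amount")] "txns"
        (where "amount > 0"))
;; => "SELECT MIN(amount), SUM(amount) FROM txns WHERE amount > 0"
```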

Regards,
Shantanu

On Oct 3, 5:10 pm, Shantanu Kumar <kumar.shant...@gmail.com> wrote:
> On Oct 3, 1:16 pm, Ross Gayler <r.gay...@gmail.com> wrote:
>
> > Thanks Michael.
>
> > > This sounds very similar to NoSQL and Map/Reduce?
>
> > I'm not so sure about that (which may be mostly due to my ignorance of
> > NoSQL and Map/Reduce). The amount of data involved in my problem is
> > quite small and any infrastructure aimed at massive scaling may bring
> > a load of conceptual and implementation baggage that is unnecessary/
> > unhelpful.
>
> > Let me restate my problem:
>
> > I have a bunch of statistician colleagues with minimal programming
> > skills. (I am also a statistician, but with slightly better
> > programming skills.) As part of our analytical workflow we take data
> > sets and preprocess them by adding new variables that are typically
> > aggregate functions of other values. We source the data from a
> > database/file, add the new variables, and store the augmented data in
> > a database/file for subsequent, extensive and extended (a couple of
> > months) analysis with other tools (off the shelf statistical packages
> > such as SAS and R).  After the analyses are complete, some subset of
> > the preprocessing calculations need to be implemented in an
> > operational environment. This is currently done by completely re-
> > implementing them in yet another fairly basic imperative language.
>
> > The preprocessing in our analytical environment is usually written in
> > a combination of SQL and the SAS data manipulation language (think of
> > it as a very basic imperative language with macros but no user-defined
> > functions). The statisticians take a long time to get their
> > preprocessing right (they're not good at nested queries in SQL and
> > make all the usual errors iterating over arrays of values with
> > imperative code). So my primary goal is to find/build a query language
> > that minimises the cognitive impedance mismatch with the statisticians
> > and minimises their opportunity for error.
>
> > Another goal is that the same mechanism should be applicable in our
> > statistical analytical environment and the corporate deployment
> > environment(s). The most different operational environment is online
> > and realtime. The data describing one case gets thrown at some code
> > that (among other things) implements the preprocessing with some
> > embedded imperative code. So, linking in some Java byte code to do the
> > preprocessing on a single case sounds feasible, whereas replacing/
> > augmenting the current corporate infrastructure with NoSQL and a CPU
> > farm is more aggravation with corporate IT than I am paid for.
>
> > The final goal is that the preprocessing mechanism should be no slower
> > than the current methods in each of the deployment environments. The
> > hardest one is probably in our statistical analysis environment, but
> > there we do have the option of farming the work across multiple CPUs
> > if needed.
>
> > Let me describe the computational scale of the problem - it is really
> > quite small.
>
> > Data is organised as completely independent cases.  One case might
> > contain 500 primitive values for a total size of ~1kb. Preprocessing
> > might calculate another 500 values, each of those being an aggregate
> > function of some subset (say, 20 values) of the original 500 values.
> > Currently, all these new values are calculated independently of each
> > other, but there is a lot of overlap of intermediate results and,
> > therefore, potential for optimisation of the computational effort
> > required to calculate the entire set of results within a single case.
>
> > In our statistical analytical environment the preprocessing is carried
> > out in batch mode. A large dataset might contain 1M cases (~1GB of
> > data). We can churn through the preprocessing at ~300 cases/second on
> > a modest PC.  Higher throughput in our analytical environment would be
> > a bonus, but not essential.
>
> > So I see the problem as primarily about the conceptual design of the
> > query language, with some side constraints about implementation
> > compatibility across a range of deployment environments and adequate
> > throughput performance.
>
> > As I mentioned in an earlier post, I'll probably assemble a collection
> > of representative queries, express them in a variety of query
> > languages, and try to assess how compatible the different query
> > languages are with the way my colleagues want to think about the
> > problem.
>
> Seeing examples (perhaps quite a few of them) will certainly be
> useful. (Due to my non-stats background) I may not have understood
> your use-cases correctly, but are these helpful for you?
>
> http://github.com/MrHus/rql
>
> http://bitbucket.org/kumarshantanu/sqlrat/wiki/Clause
>
> The SQLRat clause API will be part of the 0.2 release (expected very
> soon).
>
> Regards,
> Shantanu
>
>
> > Ross
>
> > On Oct 3, 11:31 am, Michael Ossareh <ossa...@gmail.com> wrote:
>
> > > On Fri, Oct 1, 2010 at 17:55, Ross Gayler <r.gay...@gmail.com> wrote:
> > > > Hi,
>
> > > > This is probably an abuse of the Clojure forum, but it is a bit
> > > > Clojure-related and strikes me as the sort of thing that a bright,
> > > > eclectic bunch of Clojure users might know about. (Plus I'm not really
> > > > a software person, so I need all the help I can get.)
>
> > > > I am looking at the possibility of finding/building a declarative data
> > > > aggregation language operating on a small relational representation.
> > > > Each query identifies a set of rows satisfying some relational
> > > > predicate and calculates some aggregate function of a set of values
> > > > (e.g. min, max, sum). There might be ~20 input tables of up to ~1k
> > > > rows.  The data is immutable - it gets loaded and never changed. The
> > > > results of the queries get loaded as new rows in other tables and are
> > > > eventually used as input to other computations. There might be ~1k
> > > > queries. There is no requirement for transaction management or any
> > > > inherent concurrency (there is only one consumer of the results).
> > > > There is no requirement for persistent storage - the aggregation is
> > > > the only thing of interest. I would like the query language to map as
> > > > directly as possible to the task (SQL is powerful enough, but can get
> > > > very contorted and opaque for some of the queries). There is
> > > > considerable scope for optimisation of the calculations over the total
> > > > set of queries as partial results are common across many of the
> > > > queries.
>
> > > > I would like to be able to do this in Clojure (which I have not yet
> > > > used), partly for some very practical reasons to do with Java interop
> > > > and partly because Clojure looks very cool.
>
> > > > * Is there any existing Clojure functionality which looks like a good
> > > > fit to this problem?
>
> > > > I have looked at Clojure-Datalog. It looks like a pretty good fit
> > > > except that it lacks the aggregation operators. Apart from that the
> > > > deductive power is probably greater than I need (although that doesn't
> > > > necessarily cost me anything).  I know that there are other (non-
> > > > Clojure) Datalog implementations that have been extended with
> > > > aggregation operators (e.g. DLV
> > > > http://www.mat.unical.it/dlv-complex/dlv-complex).
>
> > > > Tutorial D (what SQL should have been
> > > > http://en.wikipedia.org/wiki/D_%28data_language_specification%29#Tuto...)
> > > > might be a good fit, although once again, there is probably a lot of
> > > > conceptual and implementation baggage (e.g. Rel
> > > > http://dbappbuilder.sourceforge.net/Rel.php)
> > > > that I don't need.
>
> > > > * Is there a Clojure implementation of something like Tutorial D?
>
> > > > If there is no implementation of anything that meets my requirements
> > > > then I would be willing to look at the possibility of creating a
> > > > Domain Specific language.  However, I am wary of launching straight
> > > > into that because of the probability that anything I dreamed up would
> > > > be an ad hoc kludge rather than a semantically complete and consistent
> > > > language. Optimised execution would be a whole other can of worms.
>
> > > > * Does anyone know of any DSLs/formalisms for declaratively specifying
> > > > relational data aggregations?
>
> > > > Thanks
>
> > > > Ross
>
> > > This sounds very similar to NoSQL and Map/Reduce?
> > > http://www.basho.com/Riak.html
>
> > > Where your predicate is a reduce fn?

-- 
You received this message because you are subscribed to the Google
Groups "Clojure" group.