Jena: Spark vs. Drools

Paul Houle Mon, 11 May 2015 09:38:15 -0700

I just want to share a few of my desiderata for working with RDF data.
Some of them pull in contradictory directions, and they touch on the
Graph/Model split and similar things.


One of them is streaming processing with tools like Spark,  where the real
point is raw speed,  and that comes down to getting as close to "zero copy"
as possible in terms of processing.

Sometimes I am looking at a stream of triples and I want to filter out
anywhere from 50% to 99.99% of them, and I am often doing some kind of
map or reduce that works on one triple at a time. The elephant in the
room is then parsing time and memory consumption, so a representation
that is insanely fast and highly mutable (like the Hadoop Writable) is
desirable.
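
To make the "highly mutable" idea concrete, here is a minimal sketch in Python rather than Java, using no Jena or Hadoop classes. The names MutableTriple and filter_stream are hypothetical, and the line splitting is a naive stand-in for a real N-Triples parser. A single holder object is overwritten for every input line, in the spirit of a Writable, so the filter allocates no per-triple object.

```python
class MutableTriple:
    """One reusable holder, overwritten for every input line (hypothetical)."""
    __slots__ = ("s", "p", "o")

    def __init__(self):
        self.s = self.p = self.o = ""

    def set_from_line(self, line):
        # Naive split: subject, predicate, then the rest as the object.
        # Assumes one well-formed triple per line, terminated by " ."
        s, p, rest = line.strip().split(" ", 2)
        self.s, self.p, self.o = s, p, rest.rstrip(" .")
        return self


def filter_stream(lines, keep):
    """Yield the SAME MutableTriple object for each line that passes `keep`."""
    t = MutableTriple()
    for line in lines:
        if not line.strip() or line.lstrip().startswith("#"):
            continue  # skip blank lines and comments
        if keep(t.set_from_line(line)):
            yield t


LINES = [
    '<http://ex.org/a> <http://ex.org/name> "Alice" .',
    '<http://ex.org/a> <http://ex.org/age> "42" .',
    '<http://ex.org/b> <http://ex.org/name> "Bob" .',
]
NAME = "<http://ex.org/name>"

# Read the fields out before advancing the stream: the holder is reused.
names = [t.o for t in filter_stream(LINES, lambda t: t.p == NAME)]
```

The price of the reuse is that a consumer must copy whatever it wants to keep before pulling the next triple, which is exactly the tension with an in-memory model.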

Now I want it to be optional in a pipeline to shove facts into an in-memory
model, because sometimes that is a great way to get things done. It
would be nice not to have to change my filtering code, and to have
confidence that what is happening under the hood is efficient, without a
lot of mindless copying.
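
As a rough illustration of keeping the filtering code unchanged, the sketch below writes the filter once as a generator and then feeds it either to a pure streaming consumer or to an in-memory store. It is Python, not Jena; InMemoryModel is a hypothetical stand-in for a real model class, and the prefixed names are made-up sample data.

```python
from collections import defaultdict


def keep_names(triples):
    """The filtering stage: a plain generator over (s, p, o) tuples."""
    for s, p, o in triples:
        if p == "ex:name":
            yield (s, p, o)


class InMemoryModel:
    """Hypothetical stand-in for an in-memory RDF model, indexed by subject."""

    def __init__(self):
        self.by_subject = defaultdict(set)

    def add_all(self, triples):
        for s, p, o in triples:
            self.by_subject[s].add((p, o))
        return self


SOURCE = [
    ("ex:a", "ex:name", "Alice"),
    ("ex:a", "ex:age", "42"),
    ("ex:b", "ex:name", "Bob"),
]

# Pure streaming: triples are counted and dropped, nothing is retained.
count = sum(1 for _ in keep_names(iter(SOURCE)))

# Same filter, with the optional in-memory stage bolted onto the end.
model = InMemoryModel().add_all(keep_names(iter(SOURCE)))
```

The filter never learns which consumer it is feeding, so the decision to materialize a model stays at the end of the pipeline.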

On the other hand, I am also doing things where immutable data structures
are the way to go, particularly when I use Jena classes with production
rules engines such as Drools. From my current viewpoint, RDFS and OWL are just
"logical theories" which sit on the shelf together with logical theories on
other topics such as invoices and postal addresses. In this model there is

(i) a small rule base,
(ii) a fair-sized "T-Box" like knowledge base (say 1-1M triples),  and
(iii) a small "A-Box" knowledge base which streams past the system, in
the sense that the system does a 'consultation' over it which may involve
a number of decisions, and then we toss the A-Box out.
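
The three parts above can be sketched as follows. This is Python with naive forward chaining to a fixed point, not Drools and not a Rete network, and the single rule shown (RDFS subclass inheritance) plus all the ex: names are illustrative assumptions. The T-Box is an immutable set shared by every consultation, and each A-Box is discarded once its consultation returns.

```python
SUBCLASS = "rdfs:subClassOf"
TYPE = "rdf:type"

# (ii) T-Box: loaded once, read-only, shared across consultations.
TBOX = frozenset({
    ("ex:Invoice", SUBCLASS, "ex:Document"),
    ("ex:Document", SUBCLASS, "ex:Record"),
})


# (i) Rule base: each rule maps (tbox, facts) to a set of derived triples.
def rule_type_inheritance(tbox, facts):
    # If x rdf:type C and C rdfs:subClassOf D, then x rdf:type D.
    return {
        (s, TYPE, d)
        for (s, p, c) in facts if p == TYPE
        for (c2, p2, d) in tbox if p2 == SUBCLASS and c2 == c
    }


RULES = (rule_type_inheritance,)


def consult(abox):
    """Run all rules to a fixed point over one A-Box and return the facts."""
    facts = frozenset(abox)
    while True:
        derived = set()
        for rule in RULES:
            derived |= rule(TBOX, facts)
        new = derived - facts
        if not new:
            return facts
        facts = facts | new  # builds a new frozenset; old ones are untouched


# (iii) A-Box: a small set of facts that exists for one consultation only.
result = consult({("ex:inv42", TYPE, "ex:Invoice")})
```

Because every set here is immutable, a consultation cannot corrupt the shared T-Box, and concurrent consultations need no locking.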

I like the feature set of Drools but may end up using something
Clojure-based for a rules engine, basically because the source
code of OPS5 in LISP is about 3k LOC and Drools core is orders of magnitude
bigger. When I look at the data modelling problems people run into with
"business rules engines", it is clear that RDF is the right answer for many
such conundrums.



-- 
Paul Houle

*Applying Schemas for Natural Language Processing, Distributed Systems,
Classification and Text Mining and Data Lakes*

(607) 539 6254    paul.houle on Skype   [email protected]
https://legalentityidentifier.info/lei/lookup
<http://legalentityidentifier.info/lei/lookup>
