I've processed cumulative terabytes of data with

https://github.com/paulhoule/infovore

which was developed pre-Elephas. One issue I have is that at this scale
the reader has to be 100% bombproof. Every triple I read is on a single
line, but it is guaranteed that there will be a bad triple in there
somewhere, and the system needs to reject it and move on to the next
triple. I looked at the code a while ago and found that Elephas does the
same thing I did, which was to create a new parser for each line, an
awful solution in terms of speed.
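
For illustration, here is a minimal sketch of that per-line pattern,
written against Jena's RIOT streaming API (the helper name is mine; a
genuinely fast version would reuse buffers instead of allocating per
line):

    import java.io.ByteArrayInputStream;
    import java.nio.charset.StandardCharsets;
    import org.apache.jena.graph.Triple;   // com.hp.hpl.jena.graph on Jena 2.x
    import org.apache.jena.riot.Lang;
    import org.apache.jena.riot.RDFDataMgr;
    import org.apache.jena.riot.system.StreamRDF;
    import org.apache.jena.riot.system.StreamRDFBase;

    // Parse one N-Triples line; a bad triple returns null instead of
    // killing the whole job.
    static Triple parseLineOrSkip(String line) {
        final Triple[] result = new Triple[1];
        StreamRDF sink = new StreamRDFBase() {
            @Override public void triple(Triple t) { result[0] = t; }
        };
        try {
            RDFDataMgr.parse(sink,
                new ByteArrayInputStream(line.getBytes(StandardCharsets.UTF_8)),
                Lang.NTRIPLES);
            return result[0];
        } catch (Exception e) {
            return null;   // reject the bad triple and move on to the next
        }
    }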

For me it is not so much a matter of speed as of cost, since I spin these
clusters up in AWS and try to provision the right number of machines so
the job finishes in a bit less than an hour.

So far I haven't done much Spark + RDF yet, but I can say that if I have
to deal with data sets I can't process easily on one machine, that is the
way I will go. What I do know is that Spark works with Hadoop Writables
and other I/O machinery from Hadoop, although I don't know if this is the
optimal solution. A lot of my attraction to Spark is that it scales from
the application domain of Java parallel streams up to "huge data", and
that is important to me.
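
As a sketch of what I mean (not a recommendation), Spark can read
N-Triples through Elephas's Hadoop input format, so each record arrives
as a TripleWritable; the path and the filter predicate here are
illustrative:

    import org.apache.hadoop.io.LongWritable;
    import org.apache.jena.hadoop.rdf.io.input.ntriples.NTriplesInputFormat;
    import org.apache.jena.hadoop.rdf.types.TripleWritable;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class TripleFilter {
        public static void main(String[] args) {
            JavaSparkContext sc =
                new JavaSparkContext(new SparkConf().setAppName("triple-filter"));

            JavaPairRDD<LongWritable, TripleWritable> triples =
                sc.newAPIHadoopFile("hdfs:///data/dump.nt",  // illustrative path
                                    NTriplesInputFormat.class,
                                    LongWritable.class,
                                    TripleWritable.class,
                                    sc.hadoopConfiguration());

            // Keep only the triples the job cares about; discard the rest.
            long kept = triples.filter(pair ->
                pair._2().get().getPredicate().hasURI(
                    "http://www.w3.org/2000/01/rdf-schema#subPropertyOf"))
                .count();
            System.out.println(kept);

            sc.stop();
        }
    }

One caveat: Hadoop record readers reuse Writable instances, so anything
that is cached or collected rather than consumed immediately has to be
copied first.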

Probably the issue of a "restartable parser for N-Triples" is separate
from that of a "parser that doesn't allocate anything it doesn't need to
allocate". As for restartable Turtle, I would look to

https://www.tbray.org/ongoing/When/201x/2015/02/26/JSON-Text-Sequences
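
The idea there, carried over to Turtle, is to delimit records with a
sentinel byte so a reader can resynchronize after a bad record. A toy
sketch, assuming the RS character (0x1E) as the delimiter, with
parseTurtle standing in for a real parser call:

    import java.io.Reader;
    import java.util.Scanner;

    // Records separated by ASCII RS (0x1E), in the spirit of RFC 7464.
    static void parseRecords(Reader in) {
        Scanner records = new Scanner(in).useDelimiter("\u001E");
        while (records.hasNext()) {
            String record = records.next();
            try {
                parseTurtle(record);   // hypothetical helper
            } catch (Exception e) {
                // skip the bad record; the next RS is a safe restart point
            }
        }
    }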

I have almost no interest in DL reasoners, except for cases where they
help do something I want, such as using "rdfs:subPropertyOf" to pull
terms under a common vocabulary. I think DL has held back the semantic
web more than anything else; i.e., people like Kendall Clark can do
things I wouldn't have thought possible in OWL, but the people I talk to
want to express government regulations and business policies and ask
questions like "Is Bank X adequately capitalized?" Certainly I need to do
things like convert Fahrenheit and Centigrade to Kelvin (extending the
rdfs:subPropertyOf concept), and that is trivial to do with production
rules but impossible with OWL/RDFS.
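
To make that concrete, here is the kind of rule I mean, written as plain
Java over a Jena model rather than in a rules language; the ex:
properties are made up for the example:

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.jena.rdf.model.*;   // com.hp.hpl.jena.rdf.model on Jena 2.x

    // Production-rule-style normalization: wherever ex:tempF (degrees
    // Fahrenheit) is asserted, derive ex:temp in kelvins.
    static void normalizeTemperatures(Model m) {
        Property tempF = m.createProperty("http://example.com/tempF");
        Property temp  = m.createProperty("http://example.com/temp");
        List<Statement> derived = new ArrayList<>();
        StmtIterator it = m.listStatements(null, tempF, (RDFNode) null);
        while (it.hasNext()) {
            Statement s = it.next();
            double kelvin = (s.getDouble() - 32.0) * 5.0 / 9.0 + 273.15;
            derived.add(m.createStatement(s.getSubject(), temp,
                                          m.createTypedLiteral(kelvin)));
        }
        m.add(derived);
    }

A RETE engine would express the same thing as a single rule and keep the
derived facts up to date as tempF facts come and go; RDFS alone has no
way to say it at all.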

Now the SWRL idea, where you extend production rules with RDFS/OWL, is a
good one, and I also think SPIN is a good idea, but for many of the data
transformation and classification tasks I do, RETE network execution is
close to ideal. Also, when it comes to things like mixed-initiative
interaction, complex event processing, complex decisions (think chess
playing, where you have to consider the effects of many moves, or route
optimization), and the asynchronous I/O morass that people are heading
into without helmets, I think that kind of system has a lot to offer.

So far as "business rules engines" go,  a definite theme I see is that the
near-term state of the art is a lot better than people think because there
are so many communities that aren't talking.  I have a recent book on KR
that stops with MYCIN and doesn't say your bank probably uses ILOG or that
a program written in ILOG made the final decisions for IBM Watson.

Now, Drools does suck, for the simple reason that Drools doesn't really
know Java, so it can't give you error messages that make sense, and that
is just compounded by the decision tables and DSL stuff. I think Drools
has made some of the same mistakes other BRMSes have, building a system
with so many different ways to do things that everybody from the execs to
the devs is driven crazy. (At least Drools had enough sense to use Git
for version control.) A modern system probably involves:

- a rules language,
- a brilliant DSL system that does a large amount of reasoning, and
- an IDE that lets you render text and text+annotation documents in
  different ways,

but underlying it all is the idea that complexity can be part of the
problem as much as part of the solution, and that what makes life hard
for devs makes it hard for execs and vice versa.

On Mon, May 11, 2015 at 4:16 PM, Bruno P. Kinoshita <[email protected]>
wrote:

> Hi Paul
> I worked with Jena in a Hadoop/Hive cluster, but without Spark. There was
> only one job that took too long to work on my dataset, but I suspect it was
> due to something in my custom code - which could be replaced in parts now
> by Elephas - or due to the lack of optimization in the storage format or
> job parameters.
> In my case, I was doing some NLP with OpenNLP and creating triples that
> would be loaded later in a Jena graph. Since I didn't need to work on the
> graph/model in the cluster, I never had a similar case as yours.
> A few questions:
> - Have you looked at Giraph and other graph solutions for Hadoop too?
> Maybe it provides some abstraction layer that could be used in conjunction
> with Jena graphs.
> - Did you have to use some special configuration for persisting your
> datasets to disk too? Did you find some good examples or literature online
> that you could share with devs that don't have much experience with Spark
> (like me :-) ?
> - Would it make sense to try to use existing reasoners like Hermit and
> Pellet, instead of using Drools?
> - Have you used Elephas too? Anything that would be useful to Spark and
> could be added maybe?
> - Are you writing/blogging about it?
> In this same project, one of the third party libraries used Drools for
> rules to extract content from PDF. While I found it really powerful, it was
> hard to debug and adjust the parameters, as it had some custom code to
> manipulate Excel spreadsheets and generate the rules.
> Thanks!
> Bruno
>
>       From: Paul Houle <[email protected]>
>  To: [email protected]
>  Sent: Tuesday, May 12, 2015 4:36 AM
>  Subject: Jena: Spark vs. Drools
>
> I just want to share a few of my desiderata for working with RDF data.
> There really are a few of these that are contradictory in nature.  These
> touch on the Graph/Model split and similar things.
>
> One of them is streaming processing with tools like Spark,  where the real
> point is raw speed,  and that comes down to getting as close to "zero copy"
> as possible in terms of processing.
>
> Sometimes I am looking at a stream of triples and I want to filter out
> anything from 50% to 90% to 99.99% of them, and I am often doing some kind
> of map or reduce that works a triple at a time, so the elephant in the
> room is parsing time and memory consumption; something insanely fast and
> highly mutable (like a Hadoop Writable) is desirable.
>
> Now I want it to be optional in a pipeline to shove facts into an in-memory
> model, because sometimes that is a great way to get things done, and it
> would be nice not to have to change my filtering code and still have
> confidence that what is happening under the hood is efficient, without a
> lot of mindless copying.
>
> On the other hand I am also doing things where immutable data structures
> are the way,  particularly I am using Jena classes with production rules
> engines such as Drools.  From my current viewpoint,  RDFS and OWL are just
> "logical theories" which are on the shelf together with logical theories on
> other topics such as invoices and postal addresses.  In this model there are
>
> (i) a small rule base,
> (ii) a fair-sized "T-Box" like knowledge base (say 1-1M triples),  and
> (iii) a small "A-Box" knowledge base which is streaming past the system in
> the sense that it is doing a 'consultation' which may involve a number of
> decisions,  then we toss the A-Box out.
>
> I like the feature set of Drools but may end up using something
> clojure-based for a rules engine,  basically for the reason that the source
> code of OPS5 in LISP is about 3k LOC and Drools core is orders of magnitude
> bigger.  When I look at data modelling problems people run into with
> "business rules engine" it is clear that RDF is the right answer for many
> such conundrums.
>
>
>
> --
> Paul Houle
>
> *Applying Schemas for Natural Language Processing, Distributed Systems,
> Classification and Text Mining and Data Lakes*
>
> (607) 539 6254    paul.houle on Skype  [email protected]
> https://legalentityidentifier.info/lei/lookup
>
>
>
>


-- 
Paul Houle

*Applying Schemas for Natural Language Processing, Distributed Systems,
Classification and Text Mining and Data Lakes*

(607) 539 6254    paul.houle on Skype   [email protected]
https://legalentityidentifier.info/lei/lookup
