Comments inline:

On 11/05/2015 23:23, "Paul Houle" <[email protected]> wrote:
> I've processed cumulative terabytes of data with
>
> https://github.com/paulhoule/infovore
>
> which was developed pre-Elephas. One issue I have is that at this scale
> the reader has to be 100% bombproof. Every triple I read is on a single
> line, but it is guaranteed that there will be a bad triple in there
> somewhere, and the system needs to reject it and move on to the next
> triple. I looked at the code a while ago and found that Elephas does the
> same thing I did, which was create a new parser for each line, which is
> an awful solution in terms of speed.

Well, that's one behaviour available. For some formats like NTriples we
provide the ability to process in a line-based, batch-based or whole-file
style. See the following for a list of which formats support which
processing styles:

http://jena.apache.org/documentation/hadoop/io.html#input
http://jena.apache.org/documentation/hadoop/io.html#input_1

Obviously with the batch/whole-file styles you are trading off error
tolerance for performance, as you suggest, because in those styles we only
create a single parser for the block or the file, and an error aborts
further processing of the batch because the parser is not recoverable. In
the future perhaps we could improve this so that in the case of an error we
read to the next newline and then restart the parser.

> For me it is not so much a matter of speed as it is cost, as I spin these
> clusters up in AWS and try to spin up the right number of machines so the
> job finishes in a bit less than an hour.
>
> So far I haven't done much Spark + RDF yet, but I can say if I have to
> deal with data sets that I can't process easily on one machine that will
> be the way I go. What I do know is that Spark works with Hadoop Writables
> and other I/O stuff from Hadoop, although I don't know if this is the
> optimal solution. A lot of my attraction to Spark is that it scales from
> the application domain of Java parallel streams up to "huge data" and
> that is important to me.

The Elephas Writables all use Thrift as the underlying serialisation, so
there is minimal serialisation/deserialisation cost involved and they
should work nicely with Spark and any other Hadoop ecosystem framework
that supports Writables.

For stream processing I would take a look at Apache Flink. It has very
similar aims to Spark but is designed as a streaming engine from the
ground up, so it provides true streaming unlike Spark's micro-batching
approach. Overall it is not as mature in some areas, but there are other
areas where Flink is miles ahead of Spark, especially in terms of memory
management (the memory management improvements that Databricks recently
announced they were planning to work on for Spark as Project Tungsten
already exist and are mature in Flink).

> Probably the issue of "restartable parser for N-Triples" is separate from
> the "parser that doesn't allocate anything it doesn't need to allocate".

Yes, note that the Jena parsers already maintain the minimal state
necessary. As Andy recently noted separately on another email thread, the
NTriples parser only holds state for the current line and the Turtle
parser only holds state from the start of a block of triples to the
terminating '.'

> So far as restartable Turtle, I would look to
>
> https://www.tbray.org/ongoing/When/201x/2015/02/26/JSON-Text-Sequences

Nice. We did some work internally at Cray where we designed a
parallel-friendly serialisation for RDF tuples that achieves high
compression while allowing for both parallel compression and
decompression. We use some similar tricks to separate blocks and records
within blocks where necessary. Maybe someday we will be able to publish
this as open source, I haven't bugged my manager about this lately...

Rob
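P.S. In case it is useful, here is a rough, untested sketch of how the
Elephas input formats and Writables might be wired into Spark for the kind
of Spark + RDF job you describe. The class names are as I recall them from
the Elephas IO module documented at the links above, and the sketch assumes
the newer org.apache.jena package names (older releases use the
com.hp.hpl.jena prefix for the core classes), so check both against your
Jena version:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.jena.graph.Triple;
import org.apache.jena.hadoop.rdf.io.input.ntriples.NTriplesInputFormat;
import org.apache.jena.hadoop.rdf.types.TripleWritable;
import org.apache.jena.vocabulary.RDF;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class ElephasSparkSketch {
    public static void main(String[] args) {
        JavaSparkContext sc =
            new JavaSparkContext(new SparkConf().setAppName("elephas-sketch"));

        // The Elephas input formats emit (position, TripleWritable) pairs;
        // the Writables use Thrift underneath so they are cheap to read.
        JavaRDD<Triple> triples = sc
            .newAPIHadoopFile(args[0], NTriplesInputFormat.class,
                              LongWritable.class, TripleWritable.class,
                              new Configuration())
            .map(pair -> pair._2().get());   // unwrap to a plain Jena Triple

        // A filter-heavy pipeline never needs a Model; everything stays at
        // the Graph/Triple level.
        long typeTriples = triples
            .filter(t -> RDF.type.asNode().equals(t.getPredicate()))
            .count();

        System.out.println("rdf:type triples: " + typeTriples);
        sc.stop();
    }
}

The one thing to be careful of is unwrapping the TripleWritable into a
plain Triple straight away (the map above), since Hadoop record readers are
in general free to reuse Writable instances.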
> I have almost no interest in DL reasoners, except for cases where they
> help do something I want, such as using "rdfs:subPropertyOf" to get terms
> under a common vocabulary. I think DL has held back the semantic web more
> than anything else; i.e. people like Kendall Clark can do things I
> wouldn't think can be done in OWL, but the people I talk to want to
> express government regulations and business policies, and ask questions
> like "Is Bank X adequately capitalized?" Certainly I need to do things
> like convert Fahrenheit and Centigrade to Kelvin (to extend the
> rdfs:subPropertyOf concept) and that is trivial to do with production
> rules but impossible with OWL/RDFS.
>
> Now the SWRL idea where you extend production rules with RDFS/OWL is a
> good idea, and I also think SPIN is a good idea, but for many of the data
> transformation and classification tasks I do, the RETE network execution
> is close to ideal. Also when it comes to things like mixed initiative
> interaction, complex event processing, complex decisions (think chess
> playing where you have to consider the effects of many moves, or route
> optimization) and the asynchronous I/O morass that people are heading
> into without helmets, I think that kind of system has a lot to offer.
>
> So far as "business rules engines" go, a definite theme I see is that the
> near-term state of the art is a lot better than people think, because
> there are so many communities that aren't talking. I have a recent book
> on KR that stops with MYCIN and doesn't say your bank probably uses ILOG,
> or that a program written in ILOG made the final decisions for IBM
> Watson.
>
> Now Drools does suck, for the simple reason that Drools doesn't really
> know Java, so it can't give you error messages that make sense, and that
> is just compounded by the decision tables and DSL stuff. I think Drools
> has made some of the same mistakes other BRMSs have, in terms of building
> a system where there are enough different ways to do things that
> everybody from the execs to the devs is driven crazy. (At least Drools
> did have enough sense to use Git for version control.) A modern system
> probably involves:
>
> a rules language
> a brilliant DSL system that does a large amount of reasoning
> an IDE that lets you render text and text+annotation documents in
> different ways
>
> but underlying it all the idea that complexity can be part of the problem
> as much as part of the solution and that what makes life hard for devs
> makes it hard for execs and vice versa.
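An aside on the rdfs:subPropertyOf point above: that particular mapping
does not need a DL reasoner at all, since Jena's general-purpose rule
engine (whose forward engine is RETE-based, incidentally) handles it with
a one-line rule. A minimal, untested sketch, with the example.org names
invented purely for illustration and the same package-name caveat as
above:

import java.util.List;
import org.apache.jena.rdf.model.*;
import org.apache.jena.reasoner.rulesys.GenericRuleReasoner;
import org.apache.jena.reasoner.rulesys.Rule;
import org.apache.jena.vocabulary.RDFS;

public class SubPropertySketch {
    public static void main(String[] args) {
        String ns = "http://example.org/";
        Model data = ModelFactory.createDefaultModel();
        Property homePhone = data.createProperty(ns, "homePhone");
        Property phone = data.createProperty(ns, "phone");
        Resource alice = data.createResource(ns + "alice");

        data.add(homePhone, RDFS.subPropertyOf, phone);
        data.add(alice, homePhone, "607 555 0100");

        // The standard subPropertyOf entailment as a single forward rule.
        List<Rule> rules = Rule.parseRules(
            "[sub: (?p rdfs:subPropertyOf ?q) (?s ?p ?o) -> (?s ?q ?o)]");
        InfModel inf =
            ModelFactory.createInfModel(new GenericRuleReasoner(rules), data);

        // Query against the common vocabulary term; the rule does the mapping.
        inf.listStatements(alice, phone, (RDFNode) null)
           .forEachRemaining(System.out::println);
    }
}

The unit conversion case would need the arithmetic builtins (sum, product
and friends) on top of this, but the shape is the same.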
> On Mon, May 11, 2015 at 4:16 PM, Bruno P. Kinoshita <[email protected]>
> wrote:
>
>> Hi Paul,
>>
>> I worked with Jena in a Hadoop/Hive cluster, but without Spark. There
>> was only one job that took too long to work on my dataset, but I suspect
>> it was due to something in my custom code - which could be replaced in
>> parts now by Elephas - or due to the lack of optimization in the storage
>> format or job parameters.
>>
>> In my case, I was doing some NLP with OpenNLP and creating triples that
>> would be loaded later into a Jena graph. Since I didn't need to work on
>> the graph/model in the cluster, I never had a case similar to yours.
>>
>> A few questions:
>> - Have you looked at Giraph and other graph solutions for Hadoop too?
>> Maybe it provides some abstraction layer that could be used in
>> conjunction with Jena graphs.
>> - Did you have to use some special configuration for persisting your
>> datasets to disk too? Did you find some good examples or literature
>> online that you could share with devs that don't have much experience
>> with Spark (like me :-) ?
>> - Would it make sense to try to use existing reasoners like HermiT and
>> Pellet, instead of using Drools?
>> - Have you used Elephas too? Anything that would be useful to Spark and
>> could maybe be added?
>> - Are you writing/blogging about it?
>>
>> In this same project, one of the third-party libraries used Drools for
>> rules to extract content from PDF. While I found it really powerful, it
>> was hard to debug and adjust the parameters, as it had some custom code
>> to manipulate Excel spreadsheets and generate the rules.
>>
>> Thanks!
>> Bruno
>>
>> From: Paul Houle <[email protected]>
>> To: [email protected]
>> Sent: Tuesday, May 12, 2015 4:36 AM
>> Subject: Jena: Spark vs. Drools
>>
>> I just want to share a few of my desiderata for working with RDF data.
>> There really are a few of these that are contradictory in nature. These
>> touch on the Graph/Model split and similar things.
>>
>> One of them is streaming processing with tools like Spark, where the
>> real point is raw speed, and that comes down to getting as close to
>> "zero copy" as possible in terms of processing.
>>
>> Sometimes I am looking at a stream of triples and I want to filter out
>> anything from 50% to 90% to 99.99% of them, and I am often doing some
>> kind of map or reduce that works a triple at a time, so the elephant in
>> the room is parsing time and memory consumption, and something that is
>> insanely fast (like the Hadoop Writables) and highly mutable is
>> desirable.
>>
>> Now I want it to be optional in a pipeline to shove facts into an
>> in-memory model, because sometimes that is a great way to get things
>> done, and it would be nice to not have to change my filtering code and
>> to have confidence that what is happening under the hood is efficient,
>> without a lot of mindless copying.
>>
>> On the other hand I am also doing things where immutable data structures
>> are the way, particularly I am using Jena classes with production rules
>> engines such as Drools. From my current viewpoint, RDFS and OWL are just
>> "logical theories" which are on the shelf together with logical theories
>> on other topics such as invoices and postal addresses. In this model
>> there is
>>
>> (i) a small rule base,
>> (ii) a fair-sized "T-Box"-like knowledge base (say 1-1M triples), and
>> (iii) a small "A-Box" knowledge base which is streaming past the system
>> in the sense that it is doing a 'consultation' which may involve a
>> number of decisions, then we toss the A-Box out.
>>
>> I like the feature set of Drools but may end up using something
>> Clojure-based for a rules engine, basically for the reason that the
>> source code of OPS5 in LISP is about 3k LOC and Drools core is orders of
>> magnitude bigger. When I look at the data modelling problems people run
>> into with "business rules engines", it is clear that RDF is the right
>> answer for many such conundrums.
>>
>> --
>> Paul Houle
>>
>> *Applying Schemas for Natural Language Processing, Distributed Systems,
>> Classification and Text Mining and Data Lakes*
>>
>> (607) 539 6254 paul.houle on Skype [email protected]
>> https://legalentityidentifier.info/lei/lookup
>> <http://legalentityidentifier.info/lei/lookup>
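One more aside, on the "filter out anything from 50% to 90% to 99.99% of
them" pattern in the original post: at the single-machine end of the scale
the same idea works with plain RIOT and no Model at all, by pushing the
parse through a StreamRDF sink. Again a rough, untested sketch, with the
predicate URI invented for illustration:

import org.apache.jena.graph.Node;
import org.apache.jena.graph.NodeFactory;
import org.apache.jena.graph.Triple;
import org.apache.jena.riot.Lang;
import org.apache.jena.riot.RDFDataMgr;
import org.apache.jena.riot.system.StreamRDFBase;

public class StreamingFilterSketch {
    public static void main(String[] args) {
        // Hypothetical predicate to keep; everything else is dropped as it
        // streams past, so memory use stays flat regardless of input size.
        final Node wanted =
            NodeFactory.createURI("http://example.org/capitalRatio");
        final long[] kept = {0};

        StreamRDFBase sink = new StreamRDFBase() {
            @Override
            public void triple(Triple t) {
                if (wanted.equals(t.getPredicate())) {
                    kept[0]++;   // or hand off to the next pipeline stage
                }
            }
        };

        // Parses triple by triple; no Model is ever built.
        RDFDataMgr.parse(sink, args[0], Lang.NTRIPLES);
        System.out.println("kept " + kept[0] + " triples");
    }
}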
> --
> Paul Houle
>
> *Applying Schemas for Natural Language Processing, Distributed Systems,
> Classification and Text Mining and Data Lakes*
>
> (607) 539 6254 paul.houle on Skype [email protected]
> https://legalentityidentifier.info/lei/lookup
> <http://legalentityidentifier.info/lei/lookup>
