Hello everyone,
I changed the model to triples in an RDD/Dataset, but there is a question. I now have the triples in a Spark Dataset, and I need to put them into a Model or something else, then output them to a file, TDB, or somewhere else. As Dan mentioned before, is there any binary syntax for RDF? Or does Jena support a distributed model for handling billions of triples? (Support for parsing the triples into an RDF file would be fine.) TDB's MRSW (multiple reader or single writer) concurrency is quite a problem for me.

Thank you very much!

Jason

________________________________
From: Andy Seaborne <a...@apache.org>
Sent: Thursday, June 6, 2019 6:35:41 PM
To: users@jena.apache.org
Subject: Re: Jena Model is serializable in Java?

On 06/06/2019 08:57, Scarlet Remilia wrote:
> Hello everyone,
>
> My use case is an R2RML implementation, which should support millions or
> billions of rows from an RDBMS and parse them into RDF in a distributed way.
> For now, we try to set up some small models in different Spark executors,
> parse them individually, and finally union them all.

That sounds more like a streaming usage: Jena's StreamRDF, then collect to a set. The Model or Graph doesn't sound like it does anything for your application - it sounds like you are just using it as a container of triples to move around.

> I think RDD[Triple] is a good idea, but I need to review the existing code
> to change the model into triples.
>
> An RDF syntax and write-then-read of the RDF is also a solution, but it is
> too loose: it is very hard to manage these files, especially as there are
> so many small models, as mentioned above.
>
> Thanks,
> Jason
>
> From: Lorenz B. <buehm...@informatik.uni-leipzig.de>
> Sent: Thursday, June 6, 2019 15:32
> To: users@jena.apache.org
> Subject: Re: Jena Model is serializable in Java?
>
> I don't see why one would want to share Model instances via Spark. I
> mean, it's possible by wrapping one inside an object which is
> serializable, or some other wrapper method:
>
>     object ModelWrapper extends Serializable {
>       lazy val model = ... // built lazily on each executor, e.g. ModelFactory.createDefaultModel()
>     }
>
>     rdd.map(s => ModelWrapper.model. ... )
>
> This attaches the model to static code that can't be changed at
> runtime, which is what Spark needs.
>
> Ideally, you'd use a broadcast variable, but those are really only meant
> to share smaller entities among the different Spark workers. For
> smaller models, like a schema, this would work and is supposed to be more
> efficient than having joins etc. (yes, there are also broadcast joins in
> Spark, but data would still be distributed during processing) - but it
> depends ...
>
> I don't know your use case nor why you need a Model, but what we did
> when using Jena on Spark was to use an RDD (or Dataset) of Triple objects,
> i.e. RDD[Triple]. RDD is the fundamental shared data structure of Spark,
> and this is the only way to scale to very large datasets.
> Parsing RDF triples from e.g. N-Triples directly into RDD[Triple] is
> pretty easy. For Dataset you have to define a custom encoder (the Kryo
> encoder works, though).
>
> But as already mentioned, your use case or application would be needed
> to give further advice if necessary.
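For concreteness, here is a minimal sketch of the RDD[Triple] route Lorenz describes, assuming Spark 2.x and Jena's RIOT on the classpath; the input path, app name, and object name are made up:

    import org.apache.jena.graph.Triple
    import org.apache.jena.riot.{Lang, RDFDataMgr}
    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.{Encoder, Encoders, SparkSession}

    import java.io.ByteArrayInputStream
    import java.nio.charset.StandardCharsets

    object NTriplesToRdd {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("ntriples-to-rdd").getOrCreate()

        // N-Triples is line-based, so each line can be parsed independently
        // and in parallel across partitions.
        val lines: RDD[String] = spark.sparkContext.textFile("data.nt")

        val triples: RDD[Triple] = lines
          .filter(l => l.trim.nonEmpty && !l.trim.startsWith("#"))
          .map { line =>
            val in = new ByteArrayInputStream(line.getBytes(StandardCharsets.UTF_8))
            // Parse the single line as N-Triples; the iterator yields one triple.
            RDFDataMgr.createIteratorTriples(in, Lang.NTRIPLES, null).next()
          }

        // A Dataset[Triple] needs a custom encoder; the generic Kryo encoder works.
        implicit val tripleEncoder: Encoder[Triple] = Encoders.kryo[Triple]
        val ds = spark.createDataset(triples)

        println(s"parsed ${ds.count()} triples")
        spark.stop()
      }
    }

(Using mapPartitions with one parser per partition would cut the per-line overhead; the per-line version is just easier to read.)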
>> Jason,
>>
>> I would argue that you should exchange a set of triples, so you can take
>> advantage of Spark's distributed nature. Your logic can materialize that
>> list into a Graph or Model when it needs to operate on it. Andy is right
>> about being careful about the size - you may want to build a specialized
>> set that throws if it grows too large, and you may want to experiment
>> with it.
>>
>> Andy,
>>
>> Does Jena RIOT (or a contrib module) provide a binary syntax for RDF that
>> is optimized for fast parsing? I'm recalling Michael Stonebraker's response
>> to the BigTable paper -
>> https://pdfs.semanticscholar.org/08d1/2e771d811bcd0d4bc81fa3993563efbaeadb.pdf
>> - and also gSOAP and other binary XML formats. To this paper, the Google
>> BigTable authors responded that they don't use loose serializations such
>> as those provided by HDFS, but instead use structured data.
>>
>> This is hugely important to Jason's question, because it is one of the
>> benefits of using Spark instead of bare HDFS: Spark will handle distributing
>> a huge dataset to multiple machines so that algorithm authors can operate
>> on a vector (of Jena models?) far too large to fit on one machine.
>>
>> On Wed, Jun 5, 2019 at 4:40 PM Andy Seaborne <a...@apache.org> wrote:
>>
>>> Hi Jason,
>>>
>>> Models aren't serializable, nor are Graphs (the more system-oriented
>>> view of RDF), though Triples, Quads and Nodes are serializable. You can
>>> send a list of triples.
>>>
>>> Or use an RDF syntax and write-then-read the RDF.
>>>
>>> But are the models small? RDF graphs aren't always small, so moving them
>>> around may be expensive.
>>>
>>>     Andy
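For what it's worth, both of Andy's routes are straightforward to sketch, and RIOT does ship a binary RDF encoding - RDF Thrift (Lang.RDFTHRIFT) - which speaks to the binary-syntax question above. A rough sketch of write-then-read through a streaming writer, with invented paths and names:

    import org.apache.jena.graph.Triple
    import org.apache.jena.rdf.model.{Model, ModelFactory}
    import org.apache.jena.riot.{Lang, RDFDataMgr}
    import org.apache.jena.riot.system.StreamRDFWriter

    import java.io.{FileInputStream, FileOutputStream}
    import scala.collection.JavaConverters._

    object TripleShipping {
      // Stream a batch of triples to disk in RDF Thrift, RIOT's binary
      // encoding, so the reader does not have to re-tokenize a text syntax.
      def writeBatch(triples: Iterator[Triple], path: String): Unit = {
        val out = new FileOutputStream(path)
        try {
          val writer = StreamRDFWriter.getWriterStream(out, Lang.RDFTHRIFT)
          writer.start()
          triples.foreach(writer.triple(_))
          writer.finish()
        } finally out.close()
      }

      // Read the batch back, materializing a Model only at the point where
      // Model-level operations are actually needed.
      def readBatch(path: String): Model = {
        val in = new FileInputStream(path)
        try {
          val model = ModelFactory.createDefaultModel()
          RDFDataMgr.createIteratorTriples(in, Lang.RDFTHRIFT, null)
            .asScala
            .foreach(t => model.getGraph.add(t))
          model
        } finally in.close()
      }
    }

Swapping Lang.RDFTHRIFT for Lang.NTRIPLES gives a human-readable file with the same streaming code.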
>>> On 05/06/2019 17:59, Scarlet Remilia wrote:
>>>> Hello everyone,
>>>> I have a problem with Jena and Spark.
>>>> I use a Jena Model to handle some RDF models in my Spark executor, but I
>>>> get an error:
>>>>
>>>> java.io.NotSerializableException: org.apache.jena.rdf.model.impl.ModelCom
>>>> Serialization stack:
>>>>     - object not serializable (class: org.apache.jena.rdf.model.impl.ModelCom)
>>>>     - field (class: org.nari.r2rml.entities.Template, name: model, type: interface org.apache.jena.rdf.model.Model)
>>>>     - object (class org.nari.r2rml.entities.Template, org.nari.r2rml.entities.Template@23dc70c1)
>>>>     - field (class: org.nari.r2rml.entities.PredicateObjectMap, name: objectTemplate, type: class org.nari.r2rml.entities.Template)
>>>>     - object (class org.nari.r2rml.entities.PredicateObjectMap, org.nari.r2rml.entities.PredicateObjectMap@2de96eba)
>>>>     - writeObject data (class: java.util.ArrayList)
>>>>     - object (class java.util.ArrayList, [org.nari.r2rml.entities.PredicateObjectMap@2de96eba])
>>>>     - field (class: org.nari.r2rml.entities.LogicalTableMapping, name: predicateObjectMaps, type: class java.util.ArrayList)
>>>>     - object (class org.nari.r2rml.entities.LogicalTableMapping, org.nari.r2rml.entities.LogicalTableMapping@8e00c02)
>>>>     - field (class: org.nari.r2rml.beans.Impl.EachPartitonFunction, name: logicalTableMapping, type: class org.nari.r2rml.entities.LogicalTableMapping)
>>>>     - object (class org.nari.r2rml.beans.Impl.EachPartitonFunction, org.nari.r2rml.beans.Impl.EachPartitonFunction@1e14b269)
>>>>     - field (class: org.apache.spark.sql.Dataset$$anonfun$foreachPartition$2, name: func$4, type: interface org.apache.spark.api.java.function.ForeachPartitionFunction)
>>>>     - object (class org.apache.spark.sql.Dataset$$anonfun$foreachPartition$2, <function1>)
>>>> at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
>>>> at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
>>>> at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
>>>> at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:400)
>>>> ... 33 more
>>>>
>>>> All these classes implement the Serializable interface.
>>>> So how can I serialize the Jena Model object?
>>>>
>>>> Thanks very much!
>>>>
>>>> Jason
>
> --
> Lorenz Bühmann
> AKSW group, University of Leipzig
> Group: http://aksw.org - semantic web research center