This code can only work if the Turtle data isn't distributed across Spark partitions, as Turtle isn't a splittable format the way N-Triples is. I'm wondering if you considered this in your application.
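
For contrast, a line-based syntax such as N-Triples can be parsed partition by partition, because every line is a complete triple. A minimal sketch with the Spark Java API (sc is a JavaSparkContext; the input path is a placeholder):

    import java.util.Iterator;
    import org.apache.commons.io.IOUtils;
    import org.apache.jena.graph.Triple;
    import org.apache.jena.riot.Lang;
    import org.apache.jena.riot.RDFDataMgr;
    import org.apache.spark.api.java.JavaRDD;

    // Spark's textFile keeps whole lines together, so each partition
    // of an N-Triples file can be parsed independently of the others.
    JavaRDD<Triple> triples = sc.textFile("hdfs:///data/dump.nt")
        .mapPartitions((Iterator<String> lines) -> {
            StringBuilder block = new StringBuilder();
            while (lines.hasNext())
                block.append(lines.next()).append('\n');
            return RDFDataMgr.createIteratorTriples(
                IOUtils.toInputStream(block.toString(), "UTF-8"),
                Lang.NTRIPLES, null);
        });

With Turtle, a partition boundary can fall inside a prefix block or a multi-line term, so each record has to carry a complete Turtle document - which is what the quoted code below assumes.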
> Serializing a Model worked for me using the Spark Dataset API. The Model
> was not associated with any database. Models were constructed from a
> Turtle string and then transformed to another serializable form.
>
> ...
> private final Encoder<Model> MODEL_ENCODER = Encoders.bean(Model.class);
> ...
> .mapPartitions((Iterator<String> it) -> {...}, Encoders.STRING())
> .map(ttl -> {
>     Model m = ModelFactory.createDefaultModel()
>         .read(IOUtils.toInputStream(ttl, "UTF-8"), null, "TTL");
>     m = getProperties(m);
>     m = populateLabels(m);
>     return m;
> }, MODEL_ENCODER)
> .flatMap(this::extractRelationalSentences, RELATION_ENCODER)
> ...
>
> Full code here [1]
>
> Siddhesh
>
> [1] https://github.com/SiddheshRane/sparkl/blob/0bd5b267ffaffdc2dc2d7e59a5b07f09706be8d2/src/main/java/siddhesh/sparkl/TagArticlesFromFile.java#L149
>
> On Sat, Jun 8, 2019 at 5:05 PM Andy Seaborne <a...@apache.org> wrote:
>> Hi Jason
>>
>> On 08/06/2019 11:49, Scarlet Remilia wrote:
>>> Hello everyone,
>>>
>>> I changed the model to triples in an RDD/Dataset, but there is a question.
>>>
>>> I have triples in a Spark Dataset now, and I need to put them into a
>>> Model or something else, then output them to a file, TDB, or somewhere
>>> else.
>>>
>>> As Dan mentioned before, is there any binary syntax for RDF?
>>
>> https://jena.apache.org/documentation/io/rdf-binary.html
>> org.apache.jena.riot.thrift.*
>>
>>> Or does Jena support a distributed model for handling billions of
>>> triples? (Supporting parsing triples into an RDF file is OK.) TDB’s
>>> MRSW is quite a problem for me.
>>
>> (It's MR+SW - multiple reader AND single writer)
>>
>> Are you wanting to load smallish units of triples from multiple sources?
>>
>> Maybe you want to have all the streams send their output to a queue (in
>> blocks, not triple by triple) and have TDB load from that queue:
>> multiple StreamRDFs feeding a single StreamRDF, which loads into TDB.
>>
>> There is the TDB2 parallel loader - that is, loading from a single
>> source using internal parallelism, not loading from parallel inputs
>> (it's 5 threads for triples, more for quads). It loads from a StreamRDF.
>>
>> NB - it can consume all the server's I/O bandwidth and enough CPU to
>> make the machine unusable for anything else. It is quite hardware
>> dependent.
>>
>> Andy
>>
>>> Thank you very much!
>>>
>>> Jason
>>>
>>> ________________________________
>>> From: Andy Seaborne <a...@apache.org>
>>> Sent: Thursday, June 6, 2019 6:35:41 PM
>>> To: users@jena.apache.org
>>> Subject: Re: Jena Model is serializable in Java?
>>>
>>> On 06/06/2019 08:57, Scarlet Remilia wrote:
>>>> Hello everyone,
>>>>
>>>> My use case is an R2RML implementation, which should support millions
>>>> or billions of rows from an RDBMS and parse them into RDF in a
>>>> distributed way. For now, we try to set up some small models in
>>>> different Spark executors to parse individually, and finally union
>>>> them all.
>>>
>>> That sounds more like a stream usage.
>>>
>>> Jena's StreamRDF, collecting to a set. Model and Graph don't sound like
>>> they do anything for your application - it sounds like you are just
>>> using them as containers of triples to move around.
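>>>
>>> A minimal sketch of that idea, assuming the input is a Turtle string
>>> ttl (the variable name is only illustrative):
>>>
>>>     import java.util.HashSet;
>>>     import java.util.Set;
>>>     import org.apache.jena.graph.Triple;
>>>     import org.apache.jena.riot.Lang;
>>>     import org.apache.jena.riot.RDFParser;
>>>     import org.apache.jena.riot.system.StreamRDF;
>>>     import org.apache.jena.riot.system.StreamRDFBase;
>>>
>>>     // Collect parser output straight into a set of triples -
>>>     // no Model or Graph in between.
>>>     Set<Triple> triples = new HashSet<>();
>>>     StreamRDF collector = new StreamRDFBase() {
>>>         @Override
>>>         public void triple(Triple t) { triples.add(t); }
>>>     };
>>>     RDFParser.create()
>>>         .fromString(ttl)
>>>         .lang(Lang.TURTLE)
>>>         .parse(collector);
>>>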
>>>> I think RDD[Triple] is a good idea, but I need to review the existing
>>>> code to change the model into triples.
>>>>
>>>> Using an RDF syntax and write-then-read is also a solution, but it is
>>>> too loose. It's very hard to manage these files, especially as there
>>>> are so many of the small models mentioned above.
>>>>
>>>> Thanks,
>>>> Jason
>>>>
>>>> From: Lorenz B.<mailto:buehm...@informatik.uni-leipzig.de>
>>>> Sent: Thursday, June 6, 2019 15:32
>>>> To: users@jena.apache.org<mailto:users@jena.apache.org>
>>>> Subject: Re: Jena Model is serializable in Java?
>>>>
>>>> I don't see why one would want to share Model instances via Spark. I
>>>> mean, it's possible by wrapping it inside an object which is
>>>> serializable, or some other wrapper method:
>>>>
>>>> object ModelWrapper extends Serializable {
>>>>   lazy val model = ...
>>>> }
>>>>
>>>> rdd.map(s => ModelWrapper.model. ... )
>>>>
>>>> This attaches the model to static code that can't be changed at
>>>> runtime, which is what Spark needs.
>>>>
>>>> Ideally, you'd use a broadcast variable, but those are indeed only
>>>> meant to share smaller entities among the different Spark workers.
>>>> For smaller models like a schema this would work, and it is supposed
>>>> to be more efficient than having joins etc. (yes, there are also
>>>> broadcast joins in Spark, but data would still be distributed during
>>>> processing) - but it depends ...
>>>>
>>>> I don't know your use case nor why you need a Model, but what we did
>>>> when using Jena on Spark was to use an RDD (or Dataset) of Triple
>>>> objects, i.e. RDD[Triple]. RDD is the fundamental shared data
>>>> structure of Spark, and this is the only way to scale to very large
>>>> datasets. Parsing RDF triples from e.g. N-Triples directly into
>>>> RDD[Triple] is pretty easy. For Dataset you have to define a custom
>>>> encoder (the Kryo encoder works, though).
>>>>
>>>> But as already mentioned, your use case or application would be
>>>> needed to give further advice if necessary.
>>>>
>>>>> Jason,
>>>>>
>>>>> I would argue that you should exchange a Set of triples, so you can
>>>>> take advantage of Spark's distributed nature. Your logic can
>>>>> materialize that set into a Graph or Model when it needs to operate
>>>>> on it. Andy is right about being careful about the size - you may
>>>>> want to build a specialized set that throws if it grows too large,
>>>>> and you may want to experiment with it.
>>>>>
>>>>> Andy,
>>>>>
>>>>> Does Jena RIOT (or a contrib module) provide a binary syntax for RDF
>>>>> that is optimized for fast parsing? I'm recalling Michael
>>>>> Stonebraker's response to the BigTable paper -
>>>>> https://pdfs.semanticscholar.org/08d1/2e771d811bcd0d4bc81fa3993563efbaeadb.pdf,
>>>>> and also gSOAP and other binary XML formats. To this paper, the
>>>>> Google BigTable authors responded that they don't use loose
>>>>> serializations such as provided by HDFS, but instead use structured
>>>>> data.
>>>>>
>>>>> This is hugely important to Jason's question because it is one of
>>>>> the benefits of using Spark instead of HDFS - Spark will handle
>>>>> distributing a huge dataset to multiple systems so that algorithm
>>>>> authors can operate on a vector (of Jena models?) far too large to
>>>>> fit in one machine.
>>>>>
>>>>> On Wed, Jun 5, 2019 at 4:40 PM Andy Seaborne <a...@apache.org> wrote:
>>>>>
>>>>>> Hi Jason,
>>>>>>
>>>>>> Models aren't serializable, nor are Graphs (the more system-oriented
>>>>>> view of RDF), though Triples, Quads and Nodes are serializable. You
>>>>>> can send a list of triples.
>>>>>>
>>>>>> Or use an RDF syntax and write-then-read the RDF.
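>>>>>>
>>>>>> A minimal sketch of the list-of-triples route (the class and method
>>>>>> names are only illustrative):
>>>>>>
>>>>>>     import java.util.List;
>>>>>>     import org.apache.jena.graph.Graph;
>>>>>>     import org.apache.jena.graph.Node;
>>>>>>     import org.apache.jena.graph.Triple;
>>>>>>     import org.apache.jena.rdf.model.Model;
>>>>>>     import org.apache.jena.rdf.model.ModelFactory;
>>>>>>     import org.apache.jena.sparql.graph.GraphFactory;
>>>>>>
>>>>>>     class TripleShipping {
>>>>>>         // Triples are serializable, so this list can be sent around.
>>>>>>         static List<Triple> toTriples(Model model) {
>>>>>>             return model.getGraph()
>>>>>>                         .find(Node.ANY, Node.ANY, Node.ANY).toList();
>>>>>>         }
>>>>>>
>>>>>>         // Rebuild a Model from the shipped triples at the other end.
>>>>>>         static Model fromTriples(List<Triple> triples) {
>>>>>>             Graph g = GraphFactory.createDefaultGraph();
>>>>>>             triples.forEach(g::add);
>>>>>>             return ModelFactory.createModelForGraph(g);
>>>>>>         }
>>>>>>     }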
>>>>>>
>>>>>> But are the models small? RDF graphs aren't always small, so moving
>>>>>> them around may be expensive.
>>>>>>
>>>>>> Andy
>>>>>>
>>>>>> On 05/06/2019 17:59, Scarlet Remilia wrote:
>>>>>>> Hello everyone,
>>>>>>> I have a problem with Jena and Spark.
>>>>>>> I use a Jena Model to handle some RDF models in my Spark executor,
>>>>>>> but I get an error:
>>>>>>>
>>>>>>> java.io.NotSerializableException: org.apache.jena.rdf.model.impl.ModelCom
>>>>>>> Serialization stack:
>>>>>>>     - object not serializable (class: org.apache.jena.rdf.model.impl.ModelCom)
>>>>>>>     - field (class: org.nari.r2rml.entities.Template, name: model, type: interface org.apache.jena.rdf.model.Model)
>>>>>>>     - object (class org.nari.r2rml.entities.Template, org.nari.r2rml.entities.Template@23dc70c1)
>>>>>>>     - field (class: org.nari.r2rml.entities.PredicateObjectMap, name: objectTemplate, type: class org.nari.r2rml.entities.Template)
>>>>>>>     - object (class org.nari.r2rml.entities.PredicateObjectMap, org.nari.r2rml.entities.PredicateObjectMap@2de96eba)
>>>>>>>     - writeObject data (class: java.util.ArrayList)
>>>>>>>     - object (class java.util.ArrayList, [org.nari.r2rml.entities.PredicateObjectMap@2de96eba])
>>>>>>>     - field (class: org.nari.r2rml.entities.LogicalTableMapping, name: predicateObjectMaps, type: class java.util.ArrayList)
>>>>>>>     - object (class org.nari.r2rml.entities.LogicalTableMapping, org.nari.r2rml.entities.LogicalTableMapping@8e00c02)
>>>>>>>     - field (class: org.nari.r2rml.beans.Impl.EachPartitonFunction, name: logicalTableMapping, type: class org.nari.r2rml.entities.LogicalTableMapping)
>>>>>>>     - object (class org.nari.r2rml.beans.Impl.EachPartitonFunction, org.nari.r2rml.beans.Impl.EachPartitonFunction@1e14b269)
>>>>>>>     - field (class: org.apache.spark.sql.Dataset$$anonfun$foreachPartition$2, name: func$4, type: interface org.apache.spark.api.java.function.ForeachPartitionFunction)
>>>>>>>     - object (class org.apache.spark.sql.Dataset$$anonfun$foreachPartition$2, <function1>)
>>>>>>>   at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
>>>>>>>   at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
>>>>>>>   at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
>>>>>>>   at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:400)
>>>>>>>   ... 33 more
>>>>>>>
>>>>>>> All these classes implement the Serializable interface.
>>>>>>> So how can I serialize a Jena Model Java object?
>>>>>>>
>>>>>>> Thanks very much!
>>>>>>>
>>>>>>> Jason
>>>>
>>>> --
>>>> Lorenz Bühmann
>>>> AKSW group, University of Leipzig
>>>> Group: http://aksw.org - semantic web research center
>
> --
> Your greatest regret is the email ID you chose in 8th grade

--
Lorenz Bühmann
AKSW group, University of Leipzig
Group: http://aksw.org - semantic web research center