This code can only work if the Turtle data isn't distributed across
partitions in Spark, because Turtle isn't a splittable format the way
N-Triples is. Did you consider this in your application?
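
To illustrate what I mean by "splittable", here is a minimal sketch in
plain Java (no Jena or Spark involved). The class and method names are made
up for this example, and the regexes are only a rough approximation, not a
real Turtle parser (they ignore SPARQL-style PREFIX and the empty prefix).
It checks whether a chunk of text uses prefix labels that it does not
declare itself, which is what happens when a Turtle file is cut at an
arbitrary line:

```java
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: SplitDemo and undeclaredPrefixes are hypothetical names,
// and the regexes approximate Turtle syntax rather than parse it.
final class SplitDemo {
    // "@prefix ex:" declarations; group 1 is the prefix label.
    private static final Pattern DECL = Pattern.compile("@prefix\\s+(\\w*):");
    // Prefixed names such as ex:alice; group 1 is the prefix label.
    private static final Pattern USE = Pattern.compile("\\b(\\w+):\\w+");

    /** Returns the prefix labels a chunk uses without declaring them itself. */
    static Set<String> undeclaredPrefixes(String chunk) {
        Set<String> declared = new HashSet<>();
        Matcher d = DECL.matcher(chunk);
        while (d.find()) {
            declared.add(d.group(1));
        }
        // Drop IRIs and string literals so their contents are not
        // mistaken for prefixed names.
        String stripped = chunk.replaceAll("<[^>]*>", " ")
                               .replaceAll("\"[^\"]*\"", " ");
        Set<String> missing = new TreeSet<>();
        Matcher u = USE.matcher(stripped);
        while (u.find()) {
            String label = u.group(1);
            // "_:" introduces a blank node label, not a prefix.
            if (!label.equals("_") && !declared.contains(label)) {
                missing.add(label);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        String turtleHead = "@prefix ex: <http://example.org/> .\n"
                          + "ex:alice ex:knows ex:bob .";
        String turtleTail = "ex:bob ex:knows ex:carol .";
        String ntriplesTail = "<http://example.org/bob> "
                            + "<http://example.org/knows> "
                            + "<http://example.org/carol> .";
        System.out.println(undeclaredPrefixes(turtleHead));   // prints []
        System.out.println(undeclaredPrefixes(turtleTail));   // prints [ex]
        System.out.println(undeclaredPrefixes(ntriplesTail)); // prints []
    }
}
```

The second chunk depends on a declaration that lives in another chunk, so
it cannot be parsed on its own, while every N-Triples line is a complete
statement by itself.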

> Serializing Model worked for me using the Spark Dataset API. The Model was
> not associated with any database.
> Models were constructed from a Turtle string and then transformed to
> another serializable form.
>
> ...
>     private final Encoder<Model> MODEL_ENCODER = Encoders.bean(Model.class);
> ...
>             .mapPartitions((Iterator<String> it) -> {...}, Encoders.STRING())
>             .map(ttl -> {
>                 Model m = ModelFactory.createDefaultModel()
>                         .read(IOUtils.toInputStream(ttl, "UTF-8"), null, "TTL");
>                 m = getProperties(m);
>                 m = populateLabels(m);
>                 return m;
>             }, MODEL_ENCODER)
>             .flatMap(this::extractRelationalSentences, RELATION_ENCODER)
> ...
>
> Full code here [1]
>
> Siddhesh
>
> [1] 
> https://github.com/SiddheshRane/sparkl/blob/0bd5b267ffaffdc2dc2d7e59a5b07f09706be8d2/src/main/java/siddhesh/sparkl/TagArticlesFromFile.java#L149
>
> On Sat, Jun 8, 2019 at 5:05 PM Andy Seaborne <a...@apache.org> wrote:
>> Hi Jason
>>
>> On 08/06/2019 11:49, Scarlet Remilia wrote:
>>> Hello everyone,
>>>
>>>
>>>
>>> I changed the model to triples in an RDD/Dataset, but I have a question.
>>>
>>> I now have triples in a Spark Dataset, and I need to put them into a Model 
>>> or something else, then output them into a file, TDB, or somewhere else.
>>>
>>> As Dan mentioned before, is there any binary syntax for RDF?
>> https://jena.apache.org/documentation/io/rdf-binary.html
>> org.apache.jena.riot.thrift.*
>>
>>> Or does Jena support a distributed model for handling billions of 
>>> triples? (Support for parsing triples into an RDF file would be OK.) TDB’s 
>>> MRSW is quite a problem for me.
>> (It's MR+SW - multiple reader AND single writer)
>>
>> Are you wanting to load smallish units of triples from multiple sources?
>>
>> Maybe you want to have all the streams send their output to a queue (in
>> blocks, not triple by triple) and have TDB load from that queue.
>> Multiple StreamRDF to a single StreamRDF, load the StreamTDB.
>>
>> There is the TDB2 parallel loader - that is, loading from a single
>> source using internal parallelism, not loading from parallel inputs.
>> (It's 5 threads for triples, more for quads). It loads from a StreamRDF.
>>
>> NB - it can consume all the server's I/O bandwidth and a lot of CPU,
>> making the machine unusable for anything else. It is quite hardware dependent.
>>
>>      Andy
>>
>>>
>>>
>>> Thank you very much!
>>>
>>> Jason
>>>
>>>
>>>
>>>
>>>
>>>
>>> ________________________________
>>> From: Andy Seaborne <a...@apache.org>
>>> Sent: Thursday, June 6, 2019 6:35:41 PM
>>> To: users@jena.apache.org
>>> Subject: Re: Jena Model is serializable in Java?
>>>
>>>
>>>
>>> On 06/06/2019 08:57, Scarlet Remilia wrote:
>>>> Hello everyone,
>>>>
>>>> My use case is an R2RML implementation, which should support millions or 
>>>> billions of rows from an RDBMS and parse them into RDF in a distributed way.
>>>> For now, we try to set up some small models in different Spark executors to 
>>>> parse individually, and finally union them all.
>>> That sounds more like a stream usage.
>>>
>>> Use Jena's StreamRDF and collect to a set (a model or graph doesn't sound
>>> like it does anything for your application - it sounds like you are just
>>> using them as a container of triples to move around).
>>>
>>>> I think RDD[Triple] is a good idea, but I need to review the existing code 
>>>> to change models into triples.
>>>>
>>>> Using an RDF syntax and write-then-read of the RDF is also a solution, but 
>>>> it is too loose. It’s very hard to manage these files, especially since 
>>>> there are so many of the small models mentioned above.
>>>>
>>>> Thanks,
>>>> Jason
>>>>
>>>>
>>>> From: Lorenz B.<mailto:buehm...@informatik.uni-leipzig.de>
>>>> Sent: Thursday, June 6, 2019 15:32
>>>> To: users@jena.apache.org<mailto:users@jena.apache.org>
>>>> Subject: Re: Jena Model is serializable in Java?
>>>>
>>>> I don't see why one would want to share Model instances via Spark. I
>>>> mean, it's possible via wrapping it inside an object which is
>>>> serializable or some other wrapper method:
>>>>
>>>> object ModelWrapper extends Serializable {
>>>> lazy val model = ...
>>>> }
>>>>
>>>> rdd.map(s => ModelWrapper.model. ... )
>>>>
>>>>
>>>> This attaches the model to static code that can't be changed at
>>>> runtime, which is what Spark needs.
>>>>
>>>> Ideally, you'd use a broadcast variable, but those are only used to
>>>> share smaller entities among the different Spark workers. For
>>>> smaller models like a schema this would work and is supposed to be more
>>>> efficient than having joins etc. (yes, there are also broadcast joins in
>>>> Spark, still data would be distributed during processing) - but it
>>>> depends ...
>>>>
>>>> I don't know your use-case nor why you need a Model, but what we did
>>>> when using Jena on Spark was to use RDD (or Dataset) of Triple objects,
>>>> i.e. RDD[Triple]. RDD is the fundamental shared datastructure of Spark
>>>> and this is the only way to scale when using very large datasets.
>>>> Parsing RDF triples from e.g. N-Triples directly into RDD[Triple] is
>>>> pretty easy. For Dataset you have to define a custom encoder (Kryo
>>>> encoder works though).
>>>>
>>>> But as already mentioned, your use-case or application would be needed
>>>> to give further advice if necessary.
>>>>
>>>>> Jason,
>>>>>
>>>>> I would argue that you should exchange a Set of triples, so you can take
>>>>> advantage of Spark's distributed nature.  Your logic can materialize that
>>>>> list into a Graph or Model when needed to operate on it.   Andy is right
>>>>> about being careful about the size - you may want to build a specialized
>>>>> set that throws if the set is too large, and you may want to experiment
>>>>> with it.
>>>>>
>>>>> Andy,
>>>>>
>>>>> Does Jena Riot (or contrib) provide a binary syntax for RDF that is
>>>>> optimal for fast parse?  I'm recalling Michael Stonebraker's response to the
>>>>> BigTable paper -
>>>>> https://pdfs.semanticscholar.org/08d1/2e771d811bcd0d4bc81fa3993563efbaeadb.pdf,
>>>>> and also gSOAP and other binary XML formats.  To this paper, the Google
>>>>> BigTable authors then responded that they don't use loose serializations
>>>>> such as provided by HDFS, but instead use structured data.
>>>>>
>>>>> This is hugely important to Jason's question because this is one of the
>>>>> benefits of using Spark instead of HDFS - Spark will handle distributing a
>>>>> huge dataset to multiple systems so that algorithm authors can operate on
>>>>> a vector (of Jena models?) far too large to fit in one machine.
>>>>>
>>>>> On Wed, Jun 5, 2019 at 4:40 PM Andy Seaborne <a...@apache.org> wrote:
>>>>>
>>>>>> Hi Jason,
>>>>>>
>>>>>> Models aren't serializable, nor are Graphs (the more system-oriented
>>>>>> view of RDF), though Triples, Quads and Nodes are serializable.  You can
>>>>>> send a list of triples.
>>>>>>
>>>>>> Or use an RDF syntax and write-then-read the RDF.
>>>>>>
>>>>>> But are the models small? RDF graphs aren't always small, so moving
>>>>>> them around may be expensive.
>>>>>>
>>>>>>        Andy
>>>>>>
>>>>>> On 05/06/2019 17:59, Scarlet Remilia wrote:
>>>>>>> Hello everyone,
>>>>>>> I have a problem with Jena and Spark.
>>>>>>> I use a Jena Model to handle some RDF models in my Spark executor, but I
>>>>>>> get an error:
>>>>>>> java.io.NotSerializableException: org.apache.jena.rdf.model.impl.ModelCom
>>>>>>> Serialization stack:
>>>>>>>     - object not serializable (class: org.apache.jena.rdf.model.impl.ModelCom)
>>>>>>>     - field (class: org.nari.r2rml.entities.Template, name: model, type: interface org.apache.jena.rdf.model.Model)
>>>>>>>     - object (class org.nari.r2rml.entities.Template, org.nari.r2rml.entities.Template@23dc70c1)
>>>>>>>     - field (class: org.nari.r2rml.entities.PredicateObjectMap, name: objectTemplate, type: class org.nari.r2rml.entities.Template)
>>>>>>>     - object (class org.nari.r2rml.entities.PredicateObjectMap, org.nari.r2rml.entities.PredicateObjectMap@2de96eba)
>>>>>>>     - writeObject data (class: java.util.ArrayList)
>>>>>>>     - object (class java.util.ArrayList, [org.nari.r2rml.entities.PredicateObjectMap@2de96eba])
>>>>>>>     - field (class: org.nari.r2rml.entities.LogicalTableMapping, name: predicateObjectMaps, type: class java.util.ArrayList)
>>>>>>>     - object (class org.nari.r2rml.entities.LogicalTableMapping, org.nari.r2rml.entities.LogicalTableMapping@8e00c02)
>>>>>>>     - field (class: org.nari.r2rml.beans.Impl.EachPartitonFunction, name: logicalTableMapping, type: class org.nari.r2rml.entities.LogicalTableMapping)
>>>>>>>     - object (class org.nari.r2rml.beans.Impl.EachPartitonFunction, org.nari.r2rml.beans.Impl.EachPartitonFunction@1e14b269)
>>>>>>>     - field (class: org.apache.spark.sql.Dataset$$anonfun$foreachPartition$2, name: func$4, type: interface org.apache.spark.api.java.function.ForeachPartitionFunction)
>>>>>>>     - object (class org.apache.spark.sql.Dataset$$anonfun$foreachPartition$2, <function1>)
>>>>>>>     at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
>>>>>>>     at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
>>>>>>>     at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
>>>>>>>     at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:400)
>>>>>>>     ... 33 more
>>>>>>>
>>>>>>> All these classes implement the Serializable interface.
>>>>>>> So how can I serialize a Jena Model Java object?
>>>>>>>
>>>>>>> Thanks very much!
>>>>>>>
>>>>>>>
>>>>>>> Jason
>>>>>>>
>>>> --
>>>> Lorenz Bühmann
>>>> AKSW group, University of Leipzig
>>>> Group: http://aksw.org - semantic web research center
>>>>
>>>>
>
>
> --
> Your greatest regret is the email ID you choose in 8th grade
>
>
-- 
Lorenz Bühmann
AKSW group, University of Leipzig
Group: http://aksw.org - semantic web research center
