On Wed, Jul 3, 2019 at 12:27 PM Lorenz B.
<buehm...@informatik.uni-leipzig.de> wrote:
>
> This code can only work if the Turtle data isn't distributed across
> partitions in SPARK as Turtle isn't a splittable format like N-Triples
> would be. I'm wondering if you did consider this in your application?

Yes, this aspect was considered.
To begin with, every partition contained a list of article URIs; then an
entire partition was mapped to a Model by fetching all triples whose
subject was one of those article URIs.
This guaranteed that each Model would reside on a single machine.
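As an illustration of that partition-to-model mapping, here is a JDK-only sketch (Spark and Jena are not on the classpath here; a Map of subject to triples stands in for a Jena Model, and all names are illustrative, not from the actual application):

```java
import java.util.*;

public class PartitionToModel {
    // Stand-in for a Jena Model: all triples for a subject (article URI)
    // are grouped together, so the whole "model" lives on one machine.
    static Map<String, List<String[]>> partitionToModel(Iterator<String[]> partition) {
        Map<String, List<String[]>> model = new HashMap<>();
        while (partition.hasNext()) {
            String[] triple = partition.next();
            // triple[0] is the subject (the article URI)
            model.computeIfAbsent(triple[0], k -> new ArrayList<>()).add(triple);
        }
        return model;
    }

    public static void main(String[] args) {
        List<String[]> triples = Arrays.asList(
                new String[]{"ex:article1", "ex:title", "\"A\""},
                new String[]{"ex:article1", "ex:author", "ex:alice"},
                new String[]{"ex:article2", "ex:title", "\"B\""});
        Map<String, List<String[]>> model = partitionToModel(triples.iterator());
        System.out.println(model.get("ex:article1").size()); // 2
        System.out.println(model.size());                    // 2
    }
}
```

In the real application this function would be the body of a `mapPartitions` call, with the returned structure replaced by `ModelFactory.createDefaultModel()`.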

> > Serializing Model worked for me using Spark Dataset API. Model was not
> > associated with any database.
> > Models were constructed from turtle string and then transformed to
> > another serializable form.
> >
> > ...
> >     private final Encoder<Model> MODEL_ENCODER = Encoders.bean(Model.class);
> > ...
> >             .mapPartitions((Iterator<String> it) -> {...}, Encoders.STRING())
> >             .map(ttl -> {
> >                 Model m = ModelFactory.createDefaultModel()
> >                         .read(IOUtils.toInputStream(ttl, "UTF-8"), null, "TTL");
> >                 m = getProperties(m);
> >                 m = populateLabels(m);
> >                 return m;
> >             }, MODEL_ENCODER)
> >             .flatMap(this::extractRelationalSentences, RELATION_ENCODER)
> > ...
> >
> > Full code here [1]
> >
> > Siddhesh
> >
> > [1] 
> > https://github.com/SiddheshRane/sparkl/blob/0bd5b267ffaffdc2dc2d7e59a5b07f09706be8d2/src/main/java/siddhesh/sparkl/TagArticlesFromFile.java#L149
> >
> > On Sat, Jun 8, 2019 at 5:05 PM Andy Seaborne <a...@apache.org> wrote:
> >> Hi Jason
> >>
> >> On 08/06/2019 11:49, Scarlet Remilia wrote:
> >>> Hello everyone,
> >>>
> >>>
> >>>
> >>> I changed the model to triples in an RDD/Dataset, but I have a question.
> >>>
> >>> I have triples in a Spark Dataset now, and I need to put them into a
> >>> Model or something else, then output them to a file, TDB, or somewhere
> >>> else.
> >>>
> >>> As Dan mentioned before, is there any binary syntax for RDF?
> >> https://jena.apache.org/documentation/io/rdf-binary.html
> >> org.apache.jena.riot.thrift.*
> >>
> >>> Or does Jena support a distributed model for handling billions of
> >>> triples? (Being able to parse triples into an RDF file is OK.) TDB's
> >>> MRSW is quite a problem for me.
> >> (It's MR+SW - multiple reader AND single writer)
> >>
> >> Are you wanting to load smallish units of triples from multiple sources?
> >>
> >> Maybe you want to have all the streams send their output to a queue (in
> >> blocks, not triple by triple) and have TDB load from that queue.
> >> Multiple StreamRDFs feed a single StreamRDF, which the TDB loader consumes.
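The queue approach described above can be sketched with plain JDK concurrency primitives (an editor's sketch: class and method names are illustrative, strings stand in for triples, and a list stands in for the single TDB writer):

```java
import java.util.*;
import java.util.concurrent.*;

public class BatchedQueueLoader {
    // Producers enqueue blocks of triples; one consumer (the single writer)
    // drains the queue. Block granularity avoids per-triple queue overhead.
    static List<String> loadFromProducers(List<List<String>> sources) throws Exception {
        BlockingQueue<List<String>> queue = new LinkedBlockingQueue<>();
        ExecutorService pool = Executors.newFixedThreadPool(sources.size());
        for (List<String> src : sources) {
            pool.execute(() -> queue.add(new ArrayList<>(src))); // one block per source
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS); // all blocks enqueued

        List<String> loaded = new ArrayList<>();     // stand-in for the TDB writer
        List<String> block;
        while ((block = queue.poll()) != null) {
            loaded.addAll(block);                    // single consumer: respects MR+SW
        }
        return loaded;
    }

    public static void main(String[] args) throws Exception {
        List<String> a = Arrays.asList("t1", "t2");
        List<String> b = Arrays.asList("t3");
        System.out.println(loadFromProducers(Arrays.asList(a, b)).size()); // 3
    }
}
```

In a real pipeline the consumer would run concurrently with the producers and push each block into a StreamRDF feeding the loader, rather than draining after the fact.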
> >>
> >> There is the TDB2 parallel loader - that is, loading from a single
> >> source using internal parallelism, not loading from parallel inputs.
> >> (It's 5 threads for triples, more for quads.) It loads from a StreamRDF.
> >>
> >> NB - it can consume all the server's I/O bandwidth and enough CPU to
> >> make the machine unusable for anything else. It is quite hardware
> >> dependent.
> >>
> >>      Andy
> >>
> >>>
> >>>
> >>> Thank you very much!
> >>>
> >>> Jason
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> ________________________________
> >>> From: Andy Seaborne <a...@apache.org>
> >>> Sent: Thursday, June 6, 2019 6:35:41 PM
> >>> To: users@jena.apache.org
> >>> Subject: Re: Jena Model is serializable in Java?
> >>>
> >>>
> >>>
> >>> On 06/06/2019 08:57, Scarlet Remilia wrote:
> >>>> Hello everyone,
> >>>>
> >>>> My use case is an R2RML implementation, which should support millions or
> >>>> billions of rows from an RDBMS and parse them into RDF in a distributed
> >>>> manner.
> >>>> For now, we try to set up some small models in different Spark executors
> >>>> to parse them individually, and finally union them all.
> >>> That sounds more like a stream usage.
> >>>
> >>> Jena's StreamRDF and collecting to a set (model or graph) don't sound
> >>> like they do anything for your application - it sounds like you are just
> >>> using them as containers of triples to move around.
> >>>
> >>>> I think RDD[Triple] is a good idea, but I need to review existing code
> >>>> to change the model into triples.
> >>>>
> >>>> Using an RDF syntax and writing-then-reading the RDF is also a
> >>>> solution, but it is too loose. It's very hard to manage these files,
> >>>> especially when there are many small models as mentioned above.
> >>>>
> >>>> Thanks,
> >>>> Jason
> >>>>
> >>>>
> >>>> From: Lorenz B.<mailto:buehm...@informatik.uni-leipzig.de>
> >>>> Sent: Thursday, June 6, 2019 15:32
> >>>> To: users@jena.apache.org<mailto:users@jena.apache.org>
> >>>> Subject: Re: Jena Model is serializable in Java?
> >>>>
> >>>> I don't see why one would want to share Model instances via Spark. I
> >>>> mean, it's possible via wrapping it inside an object which is
> >>>> serializable or some other wrapper method:
> >>>>
> >>>> object ModelWrapper extends Serializable {
> >>>>   lazy val model = ...
> >>>> }
> >>>>
> >>>> rdd.map(s => ModelWrapper.model. ... )
> >>>>
> >>>>
> >>>> This attaches the model to some static code that can't be changed at
> >>>> runtime, which is what Spark needs.
> >>>>
> >>>> Ideally, you'd use a broadcast variable, but indeed those are just
> >>>> used to share smaller entities among the different Spark workers. For
> >>>> smaller models like a schema this would work and is supposed to be more
> >>>> efficient than having joins etc. (yes, there are also broadcast joins in
> >>>> Spark, but data would still be distributed during processing) - it
> >>>> depends ...
> >>>>
> >>>> I don't know your use-case nor why you need a Model, but what we did
> >>>> when using Jena on Spark was to use RDD (or Dataset) of Triple objects,
> >>>> i.e. RDD[Triple]. RDD is the fundamental shared data structure of
> >>>> Spark, and this is the only way to scale when using very large datasets.
> >>>> Parsing RDF triples from e.g. N-Triples directly into RDD[Triple] is
> >>>> pretty easy. For Dataset you have to define a custom encoder (Kryo
> >>>> encoder works though).
> >>>>
> >>>> But as already mentioned, your use-case or application would be needed
> >>>> to give further advice if necessary.
> >>>>
> >>>>> Jason,
> >>>>>
> >>>>> I would argue that you should exchange a Set of triples, so you can take
> >>>>> advantage of Spark's distributed nature.  Your logic can materialize 
> >>>>> that
> >>>>> list into a Graph or Model when needed to operate on it.   Andy is right
> >>>>> about being careful about the size - you may want to build a specialized
> >>>>> set that throws if the set is too large, and you may want to experiment
> >>>>> with it.
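The "specialized set that throws if it gets too large" suggested above can be sketched in a few lines of plain Java (an editor's sketch: the class name, the limit, and the use of String elements instead of Jena Triples are all illustrative):

```java
import java.util.*;

// A set that refuses to grow past a fixed bound, so an unexpectedly
// large graph fails fast instead of exhausting executor memory.
public class BoundedTripleSet<T> extends HashSet<T> {
    private final int maxSize;

    public BoundedTripleSet(int maxSize) {
        this.maxSize = maxSize;
    }

    @Override
    public boolean add(T element) {
        // Only reject genuinely new elements; re-adding an existing one is a no-op.
        if (size() >= maxSize && !contains(element)) {
            throw new IllegalStateException("Set exceeds " + maxSize + " elements");
        }
        return super.add(element);
    }

    public static void main(String[] args) {
        BoundedTripleSet<String> s = new BoundedTripleSet<>(2);
        s.add("t1");
        s.add("t2");
        try {
            s.add("t3");
        } catch (IllegalStateException e) {
            System.out.println("rejected"); // prints rejected
        }
    }
}
```

Because `AbstractCollection.addAll` delegates to `add`, bulk inserts hit the same check.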
> >>>>>
> >>>>> Andy,
> >>>>>
> >>>>> Does Jena Riot (or contrib) provide a binary syntax for RDF that is 
> >>>>> optimal
> >>>>> for fast parse?  I'm recalling Michael Stonebraker's response to the
> >>>>> BigTable paper -
> >>>>> https://pdfs.semanticscholar.org/08d1/2e771d811bcd0d4bc81fa3993563efbaeadb.pdf,
> >>>>> and also gSOAP and other binary XML formats.  To this paper, the Google
> >>>>> BigTable authors then responded that they don't use loose serializations
> >>>>> such as provided by HDFS, but instead use structured data.
> >>>>>
> >>>>> This is hugely important to Jason's question because this is one of the
> >>>>> benefits of using Spark instead of HDFS - Spark will handle 
> >>>>> distributing a
> >>>>> huge dataset to multiple systems so that algorithm authors can operate 
> >>>>> on a
> >>>>> vector (of Jena models?) far too large to fit in one machine.
> >>>>>
> >>>>> On Wed, Jun 5, 2019 at 4:40 PM Andy Seaborne <a...@apache.org> wrote:
> >>>>>
> >>>>>> Hi Jason,
> >>>>>>
> >>>>>> Models aren't serializable, nor are Graphs (the more system-oriented
> >>>>>> view of RDF), though Triples, Quads and Nodes are serializable. You
> >>>>>> can send a list of triples.
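A minimal JDK-only sketch of the "send a list of triples" approach (String[] rows stand in for Jena Triple objects, which are likewise Serializable; the method name is illustrative):

```java
import java.io.*;
import java.util.*;

public class TripleTransfer {
    // Serialize a list of triples and read it back, as Spark does when
    // shipping task data between executors.
    static List<String[]> roundTrip(ArrayList<String[]> triples) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bos)) {
            out.writeObject(triples); // works: ArrayList and String[] are Serializable
        }
        try (ObjectInputStream in = new ObjectInputStream(
                new ByteArrayInputStream(bos.toByteArray()))) {
            @SuppressWarnings("unchecked")
            List<String[]> received = (List<String[]>) in.readObject();
            return received; // rebuild a Model from these on the receiving side
        }
    }

    public static void main(String[] args) throws Exception {
        ArrayList<String[]> triples = new ArrayList<>();
        triples.add(new String[]{"ex:s", "ex:p", "ex:o"});
        System.out.println(roundTrip(triples).get(0)[0]); // prints ex:s
    }
}
```

With real Jena types, the receiving side would add the deserialized Triples into a fresh Graph/Model rather than keeping the raw list.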
> >>>>>>
> >>>>>> Or use an RDF syntax and write-then-read the RDF.
> >>>>>>
> >>>>>> But are the models small? RDF graphs aren't always small, so moving
> >>>>>> them around may be expensive.
> >>>>>>
> >>>>>>        Andy
> >>>>>>
> >>>>>> On 05/06/2019 17:59, Scarlet Remilia wrote:
> >>>>>>> Hello everyone,
> >>>>>>> I get a problem about Jena and Spark.
> >>>>>>> I use Jena Model to handle some RDF models in my spark executor, but I
> >>>>>> get a error:
> >>>>>>> java.io.NotSerializableException:
> >>>>>> org.apache.jena.rdf.model.impl.ModelCom
> >>>>>>> Serialization stack:
> >>>>>>>            - object not serializable (class:
> >>>>>> org.apache.jena.rdf.model.impl.ModelCom)
> >>>>>>>            - field (class: org.nari.r2rml.entities.Template, name: 
> >>>>>>> model,
> >>>>>> type: interface org.apache.jena.rdf.model.Model)
> >>>>>>>            - object (class org.nari.r2rml.entities.Template,
> >>>>>> org.nari.r2rml.entities.Template@23dc70c1)
> >>>>>>>            - field (class: org.nari.r2rml.entities.PredicateObjectMap,
> >>>>>> name: objectTemplate, type: class org.nari.r2rml.entities.Template)
> >>>>>>>            - object (class org.nari.r2rml.entities.PredicateObjectMap,
> >>>>>> org.nari.r2rml.entities.PredicateObjectMap@2de96eba)
> >>>>>>>            - writeObject data (class: java.util.ArrayList)
> >>>>>>>            - object (class java.util.ArrayList,
> >>>>>> [org.nari.r2rml.entities.PredicateObjectMap@2de96eba])
> >>>>>>>            - field (class: 
> >>>>>>> org.nari.r2rml.entities.LogicalTableMapping,
> >>>>>> name: predicateObjectMaps, type: class java.util.ArrayList)
> >>>>>>>            - object (class 
> >>>>>>> org.nari.r2rml.entities.LogicalTableMapping,
> >>>>>> org.nari.r2rml.entities.LogicalTableMapping@8e00c02)
> >>>>>>>            - field (class: 
> >>>>>>> org.nari.r2rml.beans.Impl.EachPartitonFunction,
> >>>>>> name: logicalTableMapping, type: class
> >>>>>> org.nari.r2rml.entities.LogicalTableMapping)
> >>>>>>>            - object (class 
> >>>>>>> org.nari.r2rml.beans.Impl.EachPartitonFunction,
> >>>>>> org.nari.r2rml.beans.Impl.EachPartitonFunction@1e14b269)
> >>>>>>>            - field (class:
> >>>>>> org.apache.spark.sql.Dataset$$anonfun$foreachPartition$2, name: func$4,
> >>>>>> type: interface 
> >>>>>> org.apache.spark.api.java.function.ForeachPartitionFunction)
> >>>>>>>            - object (class
> >>>>>> org.apache.spark.sql.Dataset$$anonfun$foreachPartition$2, <function1>)
> >>>>>>>            at
> >>>>>> org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
> >>>>>>>            at
> >>>>>> org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
> >>>>>>>            at
> >>>>>> org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
> >>>>>>>            at
> >>>>>> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:400)
> >>>>>>>            ... 33 more
> >>>>>>>
> >>>>>>> All these classes implement the Serializable interface.
> >>>>>>> So how could I serialize Jena model java object?
> >>>>>>>
> >>>>>>> Thanks very much!
> >>>>>>>
> >>>>>>>
> >>>>>>> Jason
> >>>>>>>
> >>>> --
> >>>> Lorenz Bühmann
> >>>> AKSW group, University of Leipzig
> >>>> Group: http://aksw.org - semantic web research center
> >>>>
> >>>>
> >
> >
> > --
> > Your greatest regret is the email ID you choose in 8th grade
> >
> >
>


