On Wed, Jul 3, 2019 at 12:27 PM Lorenz B. <buehm...@informatik.uni-leipzig.de> wrote:
>
> This code can only work if the Turtle data isn't distributed across
> partitions in Spark, as Turtle isn't a splittable format the way
> N-Triples is. I'm wondering if you considered this in your application?
Yes, this aspect was considered. To begin, every partition contained a list
of article URIs; then an entire partition was mapped to a Model by fetching
all triples whose subject was the article URI. This guaranteed that each
Model stayed on one machine.

> > Serializing Model worked for me using the Spark Dataset API. The Model
> > was not associated with any database. Models were constructed from a
> > Turtle string and then transformed to another serializable form.
> >
> > ...
> > private final Encoder<Model> MODEL_ENCODER = Encoders.bean(Model.class);
> > ...
> > .mapPartitions((Iterator<String> it) -> {...}, Encoders.STRING())
> > .map(ttl -> {
> >     Model m = ModelFactory.createDefaultModel()
> >             .read(IOUtils.toInputStream(ttl, "UTF-8"), null, "TTL");
> >     m = getProperties(m);
> >     m = populateLabels(m);
> >     return m;
> > }, MODEL_ENCODER)
> > .flatMap(this::extractRelationalSentences, RELATION_ENCODER)
> > ...
> >
> > Full code here [1]
> >
> > Siddhesh
> >
> > [1] https://github.com/SiddheshRane/sparkl/blob/0bd5b267ffaffdc2dc2d7e59a5b07f09706be8d2/src/main/java/siddhesh/sparkl/TagArticlesFromFile.java#L149
> >
> > On Sat, Jun 8, 2019 at 5:05 PM Andy Seaborne <a...@apache.org> wrote:
> >> Hi Jason
> >>
> >> On 08/06/2019 11:49, Scarlet Remilia wrote:
> >>> Hello everyone,
> >>>
> >>> I changed the model to triples in an RDD/Dataset, but there is a question.
> >>>
> >>> I now have triples in a Spark Dataset, and I need to put them into a
> >>> Model or something else, then output them to a file, to TDB, or
> >>> somewhere else.
> >>>
> >>> As Dan mentioned before, is there any binary syntax for RDF?
> >>
> >> https://jena.apache.org/documentation/io/rdf-binary.html
> >> org.apache.jena.riot.thrift.*
> >>
> >>> Or does Jena support a distributed model for handling billions of
> >>> triples? (Support for parsing triples into an RDF file is OK.) TDB's
> >>> MRSW is quite a problem for me.
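Siddhesh's scheme above - keep every triple for a given subject in the same partition, so that a Model can be assembled locally - rests on a grouping invariant that can be sketched without Spark. A minimal stdlib-only sketch, in which `Collectors.groupingBy` stands in for Spark's partitioning and the `Triple` record is a hypothetical stand-in for Jena's `org.apache.jena.graph.Triple`:

```java
import java.util.*;
import java.util.stream.*;

public class GroupBySubject {
    // Hypothetical stand-in for org.apache.jena.graph.Triple.
    record Triple(String subject, String predicate, String object) {}

    // Group triples by subject, mimicking a partitioner that keeps
    // every triple for one subject on the same machine.
    static Map<String, List<Triple>> groupBySubject(List<Triple> triples) {
        return triples.stream().collect(Collectors.groupingBy(Triple::subject));
    }

    public static void main(String[] args) {
        List<Triple> triples = List.of(
            new Triple("ex:A", "rdfs:label", "\"A\""),
            new Triple("ex:B", "rdfs:label", "\"B\""),
            new Triple("ex:A", "ex:linksTo", "ex:B"));

        Map<String, List<Triple>> bySubject = groupBySubject(triples);
        // Both ex:A triples land in the same group, so a per-subject
        // Model could be built locally from that group alone.
        System.out.println(bySubject.get("ex:A").size()); // prints 2
    }
}
```

With this invariant in place, each group can be read into a Model on one executor without needing Turtle itself to be splittable.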
> >> (It's MR+SW - multiple reader AND single writer.)
> >>
> >> Are you wanting to load smallish units of triples from multiple sources?
> >>
> >> Maybe you want to have all the streams send their output to a queue (in
> >> blocks, not triple by triple) and have TDB load from that queue:
> >> multiple StreamRDFs feeding a single StreamRDF, which loads TDB.
> >>
> >> There is the TDB2 parallel loader - that is, loading from a single
> >> source using internal parallelism, not loading from parallel inputs.
> >> (It's 5 threads for triples, more for quads.) It loads from a StreamRDF.
> >>
> >> NB - it can consume all the server's I/O bandwidth and enough CPU to
> >> make the machine unusable for anything else. It is quite hardware
> >> dependent.
> >>
> >> Andy
> >>
> >>> Thank you very much!
> >>>
> >>> Jason
> >>>
> >>> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for Windows 10
> >>>
> >>> ________________________________
> >>> From: Andy Seaborne <a...@apache.org>
> >>> Sent: Thursday, June 6, 2019 6:35:41 PM
> >>> To: users@jena.apache.org
> >>> Subject: Re: Jena Model is serializable in Java?
> >>>
> >>> On 06/06/2019 08:57, Scarlet Remilia wrote:
> >>>> Hello everyone,
> >>>>
> >>>> My use case is an R2RML implementation, which should support millions
> >>>> or billions of rows from an RDBMS and parse them into RDF in a
> >>>> distributed fashion. For now, we try to set up some small models in
> >>>> different Spark executors, parse them individually, and finally union
> >>>> them all.
> >>>
> >>> That sounds more like a stream usage.
> >>>
> >>> Jena's StreamRDF, collecting to a set - Model and Graph don't sound
> >>> like they do anything for your application; it sounds like you are just
> >>> using them as containers of triples to move around.
> >>>
> >>>> I think RDD[Triple] is a good idea, but I need to review existing code
> >>>> to change the model into triples.
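Andy's queue suggestion above - several parsers sending blocks of triples to one queue, with a single consumer doing the load - can be sketched with a plain `BlockingQueue`. This is a stdlib-only sketch, not TDB code: strings stand in for triples, and the counting consumer stands in for the single `StreamRDF` that would feed the TDB loader.

```java
import java.util.*;
import java.util.concurrent.*;

public class QueueLoader {
    // Poison pill marking the end of one producer's stream.
    static final List<String> EOF = List.of();

    // Drain blocks of (stand-in) triples from the queue until every
    // producer has sent its EOF marker; returns total triples "loaded".
    static int drain(BlockingQueue<List<String>> queue, int producers)
            throws InterruptedException {
        int loaded = 0, done = 0;
        while (done < producers) {
            List<String> block = queue.take();
            if (block == EOF) done++;
            else loaded += block.size(); // a real loader would push the block to TDB here
        }
        return loaded;
    }

    public static void main(String[] args) throws Exception {
        BlockingQueue<List<String>> queue = new ArrayBlockingQueue<>(16);
        ExecutorService pool = Executors.newFixedThreadPool(2);
        for (int p = 0; p < 2; p++) {
            pool.submit(() -> {
                // Each producer sends its triples in blocks, not one by one.
                queue.put(List.of("<s1> <p> <o1> .", "<s2> <p> <o2> ."));
                queue.put(EOF);
                return null;
            });
        }
        System.out.println(drain(queue, 2)); // prints 4
        pool.shutdown();
    }
}
```

Sending blocks rather than individual triples keeps queue contention low, which matches Andy's "in blocks, not triple by triple" advice.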
> >>>>
> >>>> Using an RDF syntax and write-then-read of the RDF is also a solution,
> >>>> but it is too loose. It's very hard to manage these files, especially
> >>>> as there are too many small models, as mentioned above.
> >>>>
> >>>> Thanks,
> >>>> Jason
> >>>>
> >>>> From: Lorenz B.<mailto:buehm...@informatik.uni-leipzig.de>
> >>>> Sent: Thursday, June 6, 2019 15:32
> >>>> To: users@jena.apache.org<mailto:users@jena.apache.org>
> >>>> Subject: Re: Jena Model is serializable in Java?
> >>>>
> >>>> I don't see why one would want to share Model instances via Spark. I
> >>>> mean, it's possible by wrapping the Model inside an object which is
> >>>> serializable, or via some other wrapper method:
> >>>>
> >>>> object ModelWrapper extends Serializable {
> >>>>   lazy val model = ...
> >>>> }
> >>>>
> >>>> rdd.map(s => ModelWrapper.model. ... )
> >>>>
> >>>> This attaches the model to static code that can't be changed at
> >>>> runtime, which is what Spark needs.
> >>>>
> >>>> Ideally, you'd use a broadcast variable, but indeed those are only
> >>>> meant to share smaller entities among the different Spark workers. For
> >>>> a smaller model like a schema this would work and is supposed to be
> >>>> more efficient than having joins etc. (yes, there are also broadcast
> >>>> joins in Spark, but data would still be distributed during processing)
> >>>> - but it depends ...
> >>>>
> >>>> I don't know your use case nor why you need a Model, but what we did
> >>>> when using Jena on Spark was to use an RDD (or Dataset) of Triple
> >>>> objects, i.e. RDD[Triple]. RDD is the fundamental shared data
> >>>> structure of Spark, and this is the only way to scale to very large
> >>>> datasets. Parsing RDF triples from e.g. N-Triples directly into
> >>>> RDD[Triple] is pretty easy. For Dataset you have to define a custom
> >>>> encoder (the Kryo encoder works, though).
> >>>>
> >>>> But as already mentioned, your use case or application would be needed
> >>>> to give further advice, if necessary.
> >>>>
> >>>>> Jason,
> >>>>>
> >>>>> I would argue that you should exchange a Set of triples, so you can
> >>>>> take advantage of Spark's distributed nature. Your logic can
> >>>>> materialize that set into a Graph or Model when it needs to operate
> >>>>> on it. Andy is right about being careful about the size - you may
> >>>>> want to build a specialized set that throws if it grows too large,
> >>>>> and you may want to experiment with that.
> >>>>>
> >>>>> Andy,
> >>>>>
> >>>>> Does Jena RIOT (or a contrib module) provide a binary syntax for RDF
> >>>>> that is optimized for fast parsing? I'm recalling Michael
> >>>>> Stonebraker's response to the BigTable paper -
> >>>>> https://pdfs.semanticscholar.org/08d1/2e771d811bcd0d4bc81fa3993563efbaeadb.pdf
> >>>>> - and also gSOAP and other binary XML formats. To that paper, the
> >>>>> Google BigTable authors responded that they don't use loose
> >>>>> serializations such as those provided by HDFS, but instead use
> >>>>> structured data.
> >>>>>
> >>>>> This is hugely important to Jason's question because it is one of the
> >>>>> benefits of using Spark instead of HDFS - Spark will handle
> >>>>> distributing a huge dataset across multiple systems so that algorithm
> >>>>> authors can operate on a vector (of Jena models?) far too large to
> >>>>> fit on one machine.
> >>>>>
> >>>>> On Wed, Jun 5, 2019 at 4:40 PM Andy Seaborne <a...@apache.org> wrote:
> >>>>>
> >>>>>> Hi Jason,
> >>>>>>
> >>>>>> Models aren't serializable, nor are Graphs (the more system-oriented
> >>>>>> view of RDF), though Triples, Quads and Nodes are serializable. You
> >>>>>> can send a list of triples.
> >>>>>>
> >>>>>> Or use an RDF syntax and write-then-read the RDF.
> >>>>>>
> >>>>>> But are the models small? RDF graphs aren't always small, so moving
> >>>>>> them around may be expensive.
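Andy's distinction - Models and Graphs are not serializable, but Triples, Quads and Nodes are, so a list of triples can be shipped - can be illustrated with plain Java serialization. The `Triple` record below is a hypothetical stand-in for Jena's serializable `Triple` class; only the round-trip mechanics are the point.

```java
import java.io.*;
import java.util.*;

public class TripleRoundTrip {
    // Hypothetical stand-in for org.apache.jena.graph.Triple, which
    // (unlike Model or Graph) is serializable.
    record Triple(String s, String p, String o) implements Serializable {}

    // Write a list of triples to bytes and read it back, as Spark's Java
    // serializer would when shipping the list between executors.
    @SuppressWarnings("unchecked")
    static List<Triple> roundTrip(List<Triple> triples)
            throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(new ArrayList<>(triples)); // a list of triples, not a Model
        }
        try (ObjectInputStream in = new ObjectInputStream(
                new ByteArrayInputStream(bytes.toByteArray()))) {
            return (List<Triple>) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        List<Triple> sent = List.of(new Triple("ex:s", "ex:p", "ex:o"));
        List<Triple> received = roundTrip(sent);
        System.out.println(received.equals(sent)); // prints true
    }
}
```

The receiving side can then rebuild a Model from the list when it actually needs graph operations, which is exactly the "materialize when needed" pattern suggested above.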
> >>>>>>
> >>>>>> Andy
> >>>>>>
> >>>>>> On 05/06/2019 17:59, Scarlet Remilia wrote:
> >>>>>>> Hello everyone,
> >>>>>>>
> >>>>>>> I have a problem with Jena and Spark. I use a Jena Model to handle
> >>>>>>> some RDF models in my Spark executor, but I get an error:
> >>>>>>>
> >>>>>>> java.io.NotSerializableException: org.apache.jena.rdf.model.impl.ModelCom
> >>>>>>> Serialization stack:
> >>>>>>>     - object not serializable (class: org.apache.jena.rdf.model.impl.ModelCom)
> >>>>>>>     - field (class: org.nari.r2rml.entities.Template, name: model, type: interface org.apache.jena.rdf.model.Model)
> >>>>>>>     - object (class org.nari.r2rml.entities.Template, org.nari.r2rml.entities.Template@23dc70c1)
> >>>>>>>     - field (class: org.nari.r2rml.entities.PredicateObjectMap, name: objectTemplate, type: class org.nari.r2rml.entities.Template)
> >>>>>>>     - object (class org.nari.r2rml.entities.PredicateObjectMap, org.nari.r2rml.entities.PredicateObjectMap@2de96eba)
> >>>>>>>     - writeObject data (class: java.util.ArrayList)
> >>>>>>>     - object (class java.util.ArrayList, [org.nari.r2rml.entities.PredicateObjectMap@2de96eba])
> >>>>>>>     - field (class: org.nari.r2rml.entities.LogicalTableMapping, name: predicateObjectMaps, type: class java.util.ArrayList)
> >>>>>>>     - object (class org.nari.r2rml.entities.LogicalTableMapping, org.nari.r2rml.entities.LogicalTableMapping@8e00c02)
> >>>>>>>     - field (class: org.nari.r2rml.beans.Impl.EachPartitonFunction, name: logicalTableMapping, type: class org.nari.r2rml.entities.LogicalTableMapping)
> >>>>>>>     - object (class org.nari.r2rml.beans.Impl.EachPartitonFunction, org.nari.r2rml.beans.Impl.EachPartitonFunction@1e14b269)
> >>>>>>>     - field (class: org.apache.spark.sql.Dataset$$anonfun$foreachPartition$2, name: func$4, type: interface org.apache.spark.api.java.function.ForeachPartitionFunction)
> >>>>>>>     - object (class org.apache.spark.sql.Dataset$$anonfun$foreachPartition$2, <function1>)
> >>>>>>>     at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
> >>>>>>>     at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
> >>>>>>>     at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
> >>>>>>>     at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:400)
> >>>>>>>     ... 33 more
> >>>>>>>
> >>>>>>> All of these classes implement the Serializable interface, so how
> >>>>>>> can I serialize the Jena Model object?
> >>>>>>>
> >>>>>>> Thanks very much!
> >>>>>>>
> >>>>>>> Jason
> >>>>
> >>>> --
> >>>> Lorenz Bühmann
> >>>> AKSW group, University of Leipzig
> >>>> Group: http://aksw.org - semantic web research center

--
Your greatest regret is the email ID you choose in 8th grade
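One common fix for the NotSerializableException in Jason's stack trace is to keep the Model out of Java serialization entirely: mark the field transient and rebuild it lazily on whichever worker touches it. A stdlib-only sketch of that pattern - `Template` here is a stand-in for a class like `org.nari.r2rml.entities.Template`, and a plain non-serializable object stands in for the Jena Model:

```java
import java.io.*;

public class TemplateFix {
    // Stand-in for a class that holds a non-serializable Jena Model.
    static class Template implements Serializable {
        final String templateString;
        // transient: the "Model" is never written by Java serialization...
        transient Object model;

        Template(String templateString) { this.templateString = templateString; }

        // ...it is rebuilt lazily from the serializable state on the worker.
        Object model() {
            if (model == null) model = buildModel(templateString);
            return model;
        }

        // In real code this would be ModelFactory.createDefaultModel() plus
        // parsing; here it is a deliberately non-serializable placeholder.
        static Object buildModel(String source) {
            return new Object() {
                @Override public String toString() { return "Model(" + source + ")"; }
            };
        }
    }

    // Serialize a Template to bytes, as Spark's closure serializer would.
    static byte[] serialize(Template t) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(t);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        Template t = new Template("http://example.org/{id}");
        t.model();                     // populated on the "driver"
        byte[] shipped = serialize(t); // succeeds: the transient field is skipped
        System.out.println(shipped.length > 0); // prints true
    }
}
```

Without the transient keyword, serializing the populated Template would fail exactly as in the stack trace above; with it, only the cheap serializable state travels and each worker rebuilds its own Model.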