Hello everyone,


I have changed my code from models to triples in an RDD/Dataset, but I have a
follow-up question.

The triples are now in a Spark Dataset, and I need to put them into a Model or
some other container, then write them out to a file, to TDB, or somewhere else.

As Dan mentioned before, is there a binary syntax for RDF? Or does Jena
support a distributed model for handling billions of triples? (Support for
writing the triples out to an RDF file would be enough.) TDB’s MRSW
(multiple-reader, single-writer) policy is quite a problem for me.



Thank you very much!

Jason






________________________________
From: Andy Seaborne <a...@apache.org>
Sent: Thursday, June 6, 2019 6:35:41 PM
To: users@jena.apache.org
Subject: Re: Jena Model is serializable in Java?



On 06/06/2019 08:57, Scarlet Remilia wrote:
> Hello everyone,
>
> My use case is an R2RML implementation that should support millions or
> billions of rows from an RDBMS and convert them into RDF in a distributed way.
> For now, we set up small models in different Spark executors, process the
> rows in each one individually, and finally union them all.

That sounds more like a stream usage.

Use Jena's StreamRDF and collect to a set (a model or graph doesn't sound
like it does anything for your application; it sounds like you are just
using them as a container of triples to move around).
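
The streaming idea can be sketched without any model or graph: each partition
writes its triples straight out as N-Triples lines, and because every
N-Triples line stands on its own, the per-partition outputs can simply be
concatenated afterwards. The block below is a minimal plain-Java illustration
of that pattern, not Jena code. NTriplesPartitionWriter and SimpleTriple are
hypothetical names, the triple shape (two IRIs plus a plain string literal)
is an assumption, and the IRIs are assumed to be valid already. In a real job
Jena's own Triple and StreamRDF types would take their place.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.Arrays;
import java.util.Iterator;

/** Hypothetical helper: streams one partition's triples out as N-Triples lines. */
final class NTriplesPartitionWriter {

    /** Hypothetical stand-in for a triple: subject IRI, predicate IRI, plain literal object. */
    static final class SimpleTriple {
        final String subjectIri;
        final String predicateIri;
        final String literal;

        SimpleTriple(String subjectIri, String predicateIri, String literal) {
            this.subjectIri = subjectIri;
            this.predicateIri = predicateIri;
            this.literal = literal;
        }
    }

    /** Escapes backslash, double quote, LF and CR, as N-Triples requires inside a quoted literal. */
    static String escapeLiteral(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '\\': sb.append("\\\\"); break;
                case '"':  sb.append("\\\""); break;
                case '\n': sb.append("\\n");  break;
                case '\r': sb.append("\\r");  break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Formats one triple as a single N-Triples line (without the trailing newline). */
    static String toLine(SimpleTriple t) {
        return "<" + t.subjectIri + "> <" + t.predicateIri + "> \""
                + escapeLiteral(t.literal) + "\" .";
    }

    /**
     * Writes every triple from the iterator as one line and returns the count.
     * Only one triple is held at a time, so partition size is not bounded by memory.
     */
    static long writePartition(Iterator<SimpleTriple> triples, Writer out) {
        long count = 0;
        try {
            while (triples.hasNext()) {
                out.write(toLine(triples.next()));
                out.write('\n');
                count++;
            }
            out.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return count;
    }

    public static void main(String[] args) {
        // In Spark this iterator would be the one handed to a per-partition function.
        Iterator<SimpleTriple> partition = Arrays.asList(
                new SimpleTriple("http://example.org/row/1", "http://example.org/name", "Alice"),
                new SimpleTriple("http://example.org/row/2", "http://example.org/name", "Bob")
        ).iterator();
        StringWriter out = new StringWriter();
        long written = writePartition(partition, out);
        System.out.print(out);
        System.out.println("# triples written: " + written);
    }
}
```

Writing one output per partition means no two executors ever share a writer,
and nothing larger than a single triple has to be serialized between tasks.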

> I think RDD[Triple] is a good idea, but I need to review the existing code
> to change it from models to triples.
>
> Using an RDF syntax and writing then reading the RDF is also a solution, but
> it is too loose. It’s very hard to manage these files, especially since there
> are so many of the small models mentioned above.
>
> Thanks,
> Jason
>
> From: Lorenz B.<mailto:buehm...@informatik.uni-leipzig.de>
> Sent: Thursday, June 6, 2019 15:32
> To: users@jena.apache.org<mailto:users@jena.apache.org>
> Subject: Re: Jena Model is serializable in Java?
>
> I don't see why one would want to share Model instances via Spark. I
> mean, it's possible via wrapping it inside an object which is
> serializable or some other wrapper method:
>
> object ModelWrapper extends Serializable {
> lazy val model = ...
> }
>
> rdd.map(s => ModelWrapper.model. ... )
>
>
> This attaches the model to static code that can't be changed at
> runtime, and that's what Spark needs.
>
> Ideally, you'd use a broadcast variable, but those are only meant
> for sharing smaller entities among the different Spark workers. For
> smaller models like a schema this would work and is supposed to be more
> efficient than having joins etc. (yes, there are also broadcast joins in
> Spark, but the data would still be distributed during processing) - but it
> depends ...
>
> I don't know your use-case nor why you need a Model, but what we did
> when using Jena on Spark was to use RDD (or Dataset) of Triple objects,
> i.e. RDD[Triple]. RDD is the fundamental shared datastructure of Spark
> and this is the only way to scale when using very large datasets.
> Parsing RDF triples from e.g. N-Triples directly into RDD[Triple] is
> pretty easy. For Dataset you have to define a custom encoder (Kryo
> encoder works though).
>
> But as already mentioned, your use-case or application would be needed
> to give further advice if necessary.
>
>> Jason,
>>
>> I would argue that you should exchange a Set of triples, so you can take
>> advantage of Spark's distributed nature.  Your logic can materialize that
>> list into a Graph or Model when needed to operate on it.   Andy is right
>> about being careful about the size - you may want to build a specialized
>> set that throws if the set is too large, and you may want to experiment
>> with it.
>>
>> Andy,
>>
>> Does Jena Riot (or contrib) provide a binary syntax for RDF that is optimal
>> for fast parse?  I'm recalling Michael Stonebraker's response to the
>> BigTable paper -
>> https://pdfs.semanticscholar.org/08d1/2e771d811bcd0d4bc81fa3993563efbaeadb.pdf,
>> and also gSOAP and other binary XML formats.  To this paper, the Google
>> BigTable authors then responded that they don't use loose serializations
>> such as provided by HDFS, but instead use structured data.
>>
>> This is hugely important to Jason's question because this is one of the
>> benefits of using Spark instead of HDFS - Spark will handle distributing a
>> huge dataset to multiple systems so that algorithm authors can operate on a
>> vector (of Jena models?) far too large to fit in one machine.
>>
>> On Wed, Jun 5, 2019 at 4:40 PM Andy Seaborne <a...@apache.org> wrote:
>>
>>> Hi Jason,
>>>
>>> Models aren't serializable, nor are Graphs (the more system oriented
>>> view of RDF), though Triples, Quads and Nodes are serializable.  You can
>>> send a list of triples.
>>>
>>> Or use an RDF syntax and write-then-read the RDF.
>>>
>>> But are the models small? RDF graphs aren't always small so moving them
>>> around may be expensive.
>>>
>>>       Andy
>>>
>>> On 05/06/2019 17:59, Scarlet Remilia wrote:
>>>> Hello everyone,
>>>> I have a problem with Jena and Spark.
>>>> I use a Jena Model to handle some RDF models in my Spark executor, but I
>>>> get an error:
>>>> java.io.NotSerializableException: org.apache.jena.rdf.model.impl.ModelCom
>>>> Serialization stack:
>>>>           - object not serializable (class: org.apache.jena.rdf.model.impl.ModelCom)
>>>>           - field (class: org.nari.r2rml.entities.Template, name: model, type: interface org.apache.jena.rdf.model.Model)
>>>>           - object (class org.nari.r2rml.entities.Template, org.nari.r2rml.entities.Template@23dc70c1)
>>>>           - field (class: org.nari.r2rml.entities.PredicateObjectMap, name: objectTemplate, type: class org.nari.r2rml.entities.Template)
>>>>           - object (class org.nari.r2rml.entities.PredicateObjectMap, org.nari.r2rml.entities.PredicateObjectMap@2de96eba)
>>>>           - writeObject data (class: java.util.ArrayList)
>>>>           - object (class java.util.ArrayList, [org.nari.r2rml.entities.PredicateObjectMap@2de96eba])
>>>>           - field (class: org.nari.r2rml.entities.LogicalTableMapping, name: predicateObjectMaps, type: class java.util.ArrayList)
>>>>           - object (class org.nari.r2rml.entities.LogicalTableMapping, org.nari.r2rml.entities.LogicalTableMapping@8e00c02)
>>>>           - field (class: org.nari.r2rml.beans.Impl.EachPartitonFunction, name: logicalTableMapping, type: class org.nari.r2rml.entities.LogicalTableMapping)
>>>>           - object (class org.nari.r2rml.beans.Impl.EachPartitonFunction, org.nari.r2rml.beans.Impl.EachPartitonFunction@1e14b269)
>>>>           - field (class: org.apache.spark.sql.Dataset$$anonfun$foreachPartition$2, name: func$4, type: interface org.apache.spark.api.java.function.ForeachPartitionFunction)
>>>>           - object (class org.apache.spark.sql.Dataset$$anonfun$foreachPartition$2, <function1>)
>>>>           at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
>>>>           at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
>>>>           at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
>>>>           at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:400)
>>>>           ... 33 more
>>>>
>>>> All of these classes implement the Serializable interface.
>>>> So how can I serialize a Jena Model object?
>>>>
>>>> Thanks very much!
>>>>
>>>>
>>>> Jason
>>>>
>>>>
> --
> Lorenz Bühmann
> AKSW group, University of Leipzig
> Group: http://aksw.org - semantic web research center
>
>
