Yes, using shapes to describe the closure is one option. I wonder if anyone has a similar algorithm that takes a JSON-LD frame and generates SPARQL queries for the triples visited by the frame, i.e. one working directly on the frame JSON only? A rough sketch of what I mean is below.
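
Something along these lines - an untested sketch, assuming the frame has
already been expanded against its @context (so keys are full predicate
IRIs); @type matching, @default, arrays etc. are all ignored, and the
names are made up:

import jakarta.json.JsonObject;
import jakarta.json.JsonValue;

public class FrameToSparql {

    private int varCounter = 0;
    private final StringBuilder template = new StringBuilder();

    /** One CONSTRUCT query covering the triples the frame visits. */
    public String toConstruct(JsonObject expandedFrame, String startNodeIri) {
        String where = walk(expandedFrame, "<" + startNodeIri + ">", "  ");
        return "CONSTRUCT {\n" + template + "}\nWHERE {\n" + where + "}";
    }

    private String walk(JsonObject frameNode, String subject, String indent) {
        StringBuilder where = new StringBuilder();
        for (String key : frameNode.keySet()) {
            if (key.startsWith("@"))
                continue;                        // skip @id, @embed, ...
            String obj = "?v" + (varCounter++);
            template.append("  ").append(subject).append(" <").append(key)
                    .append("> ").append(obj).append(" .\n");
            // OPTIONAL so partially matching resources still frame
            where.append(indent).append("OPTIONAL { ").append(subject)
                 .append(" <").append(key).append("> ").append(obj).append(" .\n");
            JsonValue value = frameNode.get(key);
            if (value.getValueType() == JsonValue.ValueType.OBJECT)
                // nested frame object: keep its patterns in this OPTIONAL
                where.append(walk(value.asJsonObject(), obj, indent + "  "));
            where.append(indent).append("}\n");
        }
        return where.toString();
    }
}
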
Holger

> On 11 Jul 2024, at 7:18 AM, Nicholas Car <n...@kurrawong.net> wrote:
>
> Hi Holger and all,
>
> We do something similar to what I think you want done here with RDFrame,
> a prototype tool we use so our APIs can extract triples from a store
> according to a (SHACL or CQL) frame:
>
> https://rdframe.dev.kurrawong.ai/
>
> I suspect what we are doing is early days/simple stuff compared to what
> you need, but the principle of frame -> SPARQL seems relevant.
>
> Cheers, Nick
>
>> On Thursday, 11 July 2024 at 15:03, Holger Knublauch
>> <hol...@topquadrant.com> wrote:
>>
>> Hi Andy,
>>
>> thanks for your response. To clarify, the scenario is a TDB with, say,
>> 1 million triples, and a request to produce a JSON-LD document from the
>> "closure" around a given resource (in TopBraid's Source Code panel when
>> the user navigates to a resource, or through API calls). In other words:
>> the input is a Jena Graph, a start node and a JSON-LD frame document,
>> and the output should be a JSON-LD document describing the node and all
>> reachable triples covered by the frame.
>>
>> So it sounds like Titanium cannot really be used for this, as its
>> algorithms can only operate on their own in-memory copy of a graph, and
>> we cannot copy all 1 million triples into memory each time.
>>
>> Holger
>>
>>> On 10 Jul 2024, at 5:53 PM, Andy Seaborne <a...@apache.org> wrote:
>>>
>>> Hi Holger,
>>>
>>> How big is the database?
>>> What sort of framing are you aiming to do?
>>>
>>> Using framing to select some triples from a large database doesn't feel
>>> like the way to extract them, as you've discovered: framing can touch
>>> anywhere in the JSON document.
>>>
>>> This recent thread is relevant --
>>> https://lists.apache.org/thread/3mrcyf1ccry78rkxxb6vqsm4okfffzfl
>>>
>>> That JSON-LD file is 280 million triples. Its structure is
>>>
>>> [{"@context": <url> , ... }
>>> ,{"@context": <url> , ... }
>>> ,{"@context": <url> , ... }
>>> ...
>>> ,{"@context": <url> , ... }
>>> ]
>>>
>>> i.e. 9 million array entries.
>>>
>>> It looks to me like it was produced by text manipulation: taking each
>>> entity, writing a separate, self-contained JSON-LD object, then, by
>>> text, making one big array. That, or a tool designed specially to write
>>> large JSON-LD, e.g. the outer array.
>>>
>>> It is the same context URL every time, which would amount to a
>>> denial-of-service attack, except that Titanium reads the whole file as
>>> JSON first and runs out of space.
>>>
>>> The JSON-LD algorithms do assume the whole document is available, and
>>> Titanium is a faithful implementation of the spec. That makes such
>>> files hard to work with.
>>>
>>> In JSON the whole object needs to be seen: repeated member names (de
>>> facto, the last duplicate wins) and "@context" appearing at the end are
>>> both possible - cases that don't occur in XML. Streaming JSON or
>>> JSON-LD is going to have to relax that strictness somehow.
>>>
>>> JSON-LD is designed around the assumption of small/medium-sized data,
>>> and this affects writing as well.
>>>
>>> Jena could do with some RDFFormats + writers for JSON-LD at scale. One
>>> obvious candidate extends WriterStreamRDFBatched, where a batch is a
>>> subject and its immediate triples, and writes similarly to the file
>>> above except with one shared context and the node objects in a
>>> "@graph" array:
>>>
>>> https://www.w3.org/TR/json-ld11/#example-163-same-description-in-json-ld-context-shared-among-node-objects
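>>>
>>> Very roughly, something like this - an untested sketch, all names
>>> invented; repeated predicates, datatypes and language tags would need
>>> more care (a real writer would collect repeats into arrays):
>>>
>>> import java.io.PrintStream;
>>> import org.apache.jena.graph.Node;
>>> import org.apache.jena.graph.Triple;
>>> import org.apache.jena.riot.system.StreamRDFBase;
>>> import org.apache.jena.sparql.core.Quad;
>>>
>>> /** Sketch: one shared context, one node object per subject batch. */
>>> public class StreamingJsonLDWriter extends StreamRDFBase {
>>>     private final PrintStream out;
>>>     private final String contextUrl;
>>>     private Node current = null;     // subject of the open node object
>>>
>>>     public StreamingJsonLDWriter(PrintStream out, String contextUrl) {
>>>         this.out = out;
>>>         this.contextUrl = contextUrl;
>>>     }
>>>
>>>     @Override public void start() {
>>>         out.println("{ \"@context\": \"" + contextUrl + "\",");
>>>         out.println("  \"@graph\": [");
>>>     }
>>>
>>>     @Override public void triple(Triple t) {
>>>         // assumes triples arrive grouped by subject, as in
>>>         // WriterStreamRDFBatched, and that subjects are IRIs
>>>         if (!t.getSubject().equals(current)) {
>>>             if (current != null) out.println(" },");
>>>             current = t.getSubject();
>>>             out.print("    { \"@id\": \"" + current.getURI() + "\"");
>>>         }
>>>         out.print(", \"" + t.getPredicate().getURI() + "\": "
>>>                 + value(t.getObject()));
>>>     }
>>>
>>>     private String value(Node o) {   // minimal object rendering
>>>         if (o.isURI())
>>>             return "{ \"@id\": \"" + o.getURI() + "\" }";
>>>         if (o.isBlank())
>>>             return "{ \"@id\": \"_:" + o.getBlankNodeLabel() + "\" }";
>>>         return "\"" + o.getLiteralLexicalForm().replace("\"", "\\\"") + "\"";
>>>     }
>>>
>>>     @Override public void quad(Quad q) { triple(q.asTriple()); }
>>>
>>>     @Override public void finish() {
>>>         if (current != null) out.println(" }");
>>>         out.println("  ] }");
>>>     }
>>> }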
>>>
>>> That doesn't solve the reading side - a companion reader would be
>>> needed that stream-reads the JSON.
>>>
>>> Contributions welcome!
>>>
>>> Andy
>>>
>>> On 10/07/2024 12:36, Holger Knublauch wrote:
>>>
>>>> I am working on serializing partial RDF graphs to JSON-LD using the
>>>> Jena-Titanium bridge.
>>>>
>>>> Problem: for Titanium to "see" the triples, it needs a complete copy.
>>>> See JenaTitanium.convert, which copies all Jena triples into a
>>>> corresponding RdfDataset. This cannot scale if the graph is backed by
>>>> a database and we only want to export certain triples (especially for
>>>> framing). Titanium's RdfGraph does not provide an incremental function
>>>> similar to Graph.find(), but only returns a complete Java List of all
>>>> triples.
>>>>
>>>> Has anyone here run into the same problem, and what would be a
>>>> solution? I guess one option would be an incremental algorithm that
>>>> "walks" a @context and JSON-LD frame document to collect all required
>>>> Jena triples, producing a sub-graph that can then be sent to Titanium.
>>>> But the complexity of such an algorithm is similar to implementing my
>>>> own JSON-LD engine, which feels like overkill.
>>>>
>>>> Holger
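
P.S. For completeness, the fallback we are considering in the meantime:
extract a bounded subgraph first and only hand that to Titanium. Roughly
like this (untested; the "any predicate, any depth" path in step 1 is
just for illustration - in practice the walk would be bounded by shapes
or by the frame itself - and it assumes the JenaTitanium bridge and
Titanium's JsonLd API, so signatures may need adjusting per version):

import java.io.StringReader;
import com.apicatalog.jsonld.JsonLd;
import com.apicatalog.jsonld.document.JsonDocument;
import com.apicatalog.jsonld.document.RdfDocument;
import com.apicatalog.rdf.RdfDataset;
import jakarta.json.JsonArray;
import jakarta.json.JsonObject;
import org.apache.jena.query.Dataset;
import org.apache.jena.query.DatasetFactory;
import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.riot.system.JenaTitanium;

public class FrameAroundNode {

    public static JsonObject frame(Dataset tdb, String startNode,
                                   String frameJson) throws Exception {
        // 1. Copy only the reachable triples out of the database;
        //    (<urn:x:p>|!<urn:x:p>)* matches any predicate, any depth.
        String q = "CONSTRUCT { ?s ?p ?o } WHERE { <" + startNode + ">"
                + " (<urn:x:p>|!<urn:x:p>)* ?s . ?s ?p ?o }";
        Model closure;
        try (QueryExecution qe = QueryExecutionFactory.create(q, tdb)) {
            closure = qe.execConstruct();
        }
        // 2. The closure is small, so Titanium's in-memory copy is fine.
        RdfDataset rdf = JenaTitanium.convert(
                DatasetFactory.create(closure).asDatasetGraph());
        JsonArray expanded = JsonLd.fromRdf(RdfDocument.of(rdf)).get();
        // 3. Let Titanium do the framing on the expanded document.
        JsonDocument frameDoc = JsonDocument.of(new StringReader(frameJson));
        return JsonLd.frame(JsonDocument.of(expanded), frameDoc).get();
    }
}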