Re: clerezza.rdf.jena.tdb.storage filling up with ontonet files

Alessandro Adamou Thu, 12 Apr 2012 11:06:36 -0700

Hi Rupert,

I've been trying to implement your proposed solution for the Ontology ID lookahead with the MGraph wrapper.

I'm trying to keep it simple for now; later I will need to detect the [ontologyIRI, versionIRI] pair.

However, BufferedInputStream.mark(int) does not seem to set the readlimit for me. No matter what value I set (even -1), Parser.parse() always goes through the whole graph, and when I try to reset() it after finding the ontologyID I always get an IOException("Stream closed").

I tried values much greater and much smaller than the file size in bytes, and tried to move the triple early and late in the file, no dice.
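
A note on the mark()/reset() behaviour described above. The readlimit passed to mark(int) does not cap how much a consumer reads; it only bounds how far one may read and still be able to reset(). Also, "Stream closed" is the message BufferedInputStream.reset() gives once close() has been called on it, so one plausible cause (an assumption, not confirmed in this thread) is that the parser closes the stream it is handed. A minimal sketch of that effect and of a possible workaround follows. NonClosingInputStream and consumeAndClose are hypothetical names; consumeAndClose merely stands in for a parser that reads everything and then closes its input.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

class MarkResetSketch {

    /** Hypothetical wrapper: forwards everything to the wrapped stream except close(). */
    static class NonClosingInputStream extends FilterInputStream {
        NonClosingInputStream(InputStream in) {
            super(in);
        }

        @Override
        public void close() {
            // deliberately ignored, so the wrapped stream stays open
        }
    }

    /** Stand-in for a parser (assumption): reads to the end, then closes its input. */
    static int consumeAndClose(InputStream in) throws IOException {
        int count = 0;
        try {
            while (in.read() != -1) {
                count++;
            }
        } finally {
            in.close();
        }
        return count;
    }

    /** Returns true if reset() fails once the consumer has closed the buffered stream. */
    static boolean resetFailsAfterClose(byte[] data) throws IOException {
        BufferedInputStream bIn = new BufferedInputStream(new ByteArrayInputStream(data));
        bIn.mark(data.length + 1); // generous enough for the whole input
        consumeAndClose(bIn);      // the whole input is read regardless of the readlimit
        try {
            bIn.reset();
            return false;
        } catch (IOException e) {
            return true;           // the stream was closed, so the mark is gone
        }
    }

    /** Shields the buffered stream from close(), resets it, and counts the bytes read again. */
    static int rereadWithShield(byte[] data) throws IOException {
        BufferedInputStream bIn = new BufferedInputStream(new ByteArrayInputStream(data));
        bIn.mark(data.length + 1);
        consumeAndClose(new NonClosingInputStream(bIn));
        bIn.reset();               // works: bIn itself was never closed
        int count = 0;
        while (bIn.read() != -1) {
            count++;
        }
        return count;
    }
}
```

With such a shield the readlimit still has to cover everything read before reset(), so for large files the lookahead needs to stop early rather than rely on the mark alone.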

Perhaps I should just set a limit on the triples instead, but I wouldn't want to read through a 100 MiB file just to use the first 100 triples for guessing the ID. However, this could be inevitable, since most formats require reading the last chunk of a file in order to "close" the RDF code (such as a </rdf:RDF> tag or so), but perhaps a SAX parser could work anyway?
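
To illustrate the streaming idea for the RDF/XML case only, here is a minimal sketch that uses the JDK's StAX pull parser instead of SAX (a swapped-in technique, chosen because a pull parser can simply stop asking for events). It looks for the first owl:Ontology element and returns its rdf:about value, giving up after a fixed number of start tags. OntologyIdPeek, peekOntologyIri and maxElements are hypothetical names. It assumes the ontology is declared as a typed owl:Ontology element with rdf:about, and it ignores versionIRI and other serializations.

```java
import java.io.InputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class OntologyIdPeek {

    static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
    static final String OWL_NS = "http://www.w3.org/2002/07/owl#";

    /**
     * Returns the rdf:about value of the first owl:Ontology element,
     * or null if none is seen within maxElements start tags.
     */
    static String peekOntologyIri(InputStream in, int maxElements) throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(in);
        try {
            int seen = 0;
            while (seen < maxElements && reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    seen++;
                    if (OWL_NS.equals(reader.getNamespaceURI())
                            && "Ontology".equals(reader.getLocalName())) {
                        // stop here: the rest of the document is never requested
                        return reader.getAttributeValue(RDF_NS, "about");
                    }
                }
            }
            return null;
        } finally {
            reader.close(); // releases the reader; does not close the underlying stream
        }
    }
}
```

Because the reader only pulls events on demand, the closing </rdf:RDF> tag is never needed when the ontology element appears early. The underlying stream will still have been consumed past that point by the parser's own buffering, so a mark()/reset() or a second open is needed before the full import.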


Any clue?

Alessandro


On 3/16/12 11:50 AM, Rupert Westenthaler wrote:

Hi Alessandro

Something like this could work:

The idea is to
* provide an MGraph wrapper that skips all triples other than the ones needed to determine the OntologyID
* Use a BufferedInputStream and mark the beginning
* Parse to your MGraphWrapper until you can determine the OntologyID
* throw some exception to stop the parsing
* reset the stream
* process the OntologyID
* If you need to import the parsed ontology you can reuse the reset stream

Here is how the code might look.

class MyMGraph extends SimpleMGraph {

     String ontologyId;

     @Override
     protected boolean performAdd(Triple triple) {
           boolean added = false;
           // filter the interesting triples
           if(triple is interesting){ // pseudocode: replace with your own check
               added = super.performAdd(triple);
           }
           // check the currently available triples for the Ontology ID
           // (checkOntologyId() is left to implement; it should set ontologyId)
           checkOntologyId();

           if(ontologyId != null){
               throw new RuntimeException(); // stop importing
           }
           //TODO: add a limit to the triples you read
           return added;
     }

     public String getOntologyId(){
         return ontologyId;
     }

}


If you use a BufferedInputStream you could do the following:

BufferedInputStream bIn = new BufferedInputStream(in);
bIn.mark(Integer.MAX_VALUE); // set an appropriate limit
MyMGraph graph = new MyMGraph();
try {
     // parse from the buffered stream, so that the mark applies to what is read
     parser.parse(graph, bIn, rdfFormat);
} catch(RuntimeException e){
     // expected: thrown by MyMGraph to stop the parsing
}
if(graph.getOntologyId() != null){
     bIn.reset(); // reset the stream to the start
     // now do the logic you need to do
} else { // no OntologyID found
     // do some error handling
}


WDYT
Rupert



--
M.Sc. Alessandro Adamou

Alma Mater Studiorum - Università di Bologna
Department of Computer Science
Mura Anteo Zamboni 7, 40127 Bologna - Italy

Semantic Technology Laboratory (STLab)
Institute for Cognitive Science and Technology (ISTC)
National Research Council (CNR)
Via Nomentana 56, 00161 Rome - Italy


"I will give you everything, so long as you do not demand anything."
(Ettore Petrolini, 1917)

Not sent from my iSnobTechDevice
