Re: TDB: records not strictly increasing

2013-01-29 Thread Andy Seaborne



B/ A different, better approach is to build a special version of TDB. The 
changes needed are small but you need to build Jena.

These instructions apply to code in SVN as it is now, today.  Not the last 
release, not last week.  It's just easier to set up and explain from the current 
code base as a small recent change centralised the point you need to change and 
also introduced an easy to use testing feature.

1/ svn co the Jena code from trunk.


Done

2/ Build Jena
   mvn clean install


Done

It is easier to build and install than just package.

You must use the development releases of the other modules.
I don't think you need to set up maven to use the snapshot builds on Apache but 
if you do:

Set repository
http://jena.apache.org/download/maven.html

3/ mvn eclipse:eclipse to use Eclipse if you plan to use that to edit the code.

Didn't set up maven or use Eclipse.


4/ Set up to use this build for tdbdump, e.g. the apache-jena or fuseki.

For added ease - use the Fuseki server jar, which has everything in it

java -cp fuseki-server.jar tdb.tdbdump --version


java -cp jena-fuseki/target/jena-fuseki-0.2.6-SNAPSHOT-server.jar tdb.tdbdump --version

Jena:   VERSION: 2.10.0-SNAPSHOT
Jena:   BUILD_DATE: 2013-01-28T21:00:30+
ARQ:    VERSION: 2.10.0-SNAPSHOT
ARQ:    BUILD_DATE: 2013-01-28T21:00:30+
TDB:    VERSION: 0.10.0-SNAPSHOT
TDB:    BUILD_DATE: 2013-01-28T21:00:30+


Check timestamps/version numbers.

5/ Test: create a small text file of a few triples.

--- D.ttl
@prefix : <http://example/> .

:s1 :p 1 .
:s2 :p 2 .
:s3 :q 3 .
:s2 :q 4 .
:s1 :q 5 .

---

tdbdump --data D.ttl should dump the file with triples clustered by subject.

(no - you do not need to load a database - --data is a recent feature for 
testing)


java -cp jena-fuseki/target/jena-fuseki-0.2.6-SNAPSHOT-server.jar tdb.tdbdump --data D.ttl
<http://example/s1> <http://example/p> "1"^^<http://www.w3.org/2001/XMLSchema#integer> .
<http://example/s1> <http://example/q> "5"^^<http://www.w3.org/2001/XMLSchema#integer> .
<http://example/s2> <http://example/p> "2"^^<http://www.w3.org/2001/XMLSchema#integer> .
<http://example/s2> <http://example/q> "4"^^<http://www.w3.org/2001/XMLSchema#integer> .
<http://example/s3> <http://example/q> "3"^^<http://www.w3.org/2001/XMLSchema#integer> .


6/ Edit com.hp.hpl.jena.tdb.index.TupleTable, static method chooseScanAllIndex

Change:
-
    if ( tupleLen != 4 )
        return indexes[0] ;
==
    if ( tupleLen != 4 )
    {
        // Three triple indexes: scan the second one instead of the first.
        if ( indexes.length == 3 )
            return indexes[1] ;
        else
            return indexes[0] ;
    }
-

7/ Rebuild.

Yes - the tests for TDB should pass!

8/ check the new version

tdbdump --version

check the change

tdbdump --data D.ttl

and it should be n-triples clustered by property, different to earlier on.


java -cp jena-fuseki/target/jena-fuseki-0.2.6-SNAPSHOT-server.jar tdb.tdbdump --data D.ttl
<http://example/s1> <http://example/p> "1"^^<http://www.w3.org/2001/XMLSchema#integer> .
<http://example/s2> <http://example/p> "2"^^<http://www.w3.org/2001/XMLSchema#integer> .
<http://example/s3> <http://example/q> "3"^^<http://www.w3.org/2001/XMLSchema#integer> .
<http://example/s2> <http://example/q> "4"^^<http://www.w3.org/2001/XMLSchema#integer> .
<http://example/s1> <http://example/q> "5"^^<http://www.w3.org/2001/XMLSchema#integer> .

Is it what you expect?


Yes.
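
To make the two dump orders concrete, here is a minimal sketch in plain Java. It does not use Jena or TDB at all: the class name TripleOrderSketch and its Triple type are made up for illustration, and real TDB indexes sort fixed-width node ids rather than strings. Sorting the D.ttl triples subject-first and then property-first reproduces the two clusterings shown above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration only; not part of Jena or TDB.
final class TripleOrderSketch {

    /** Toy triple with string names; real TDB indexes hold node ids. */
    static final class Triple {
        final String s, p;
        final int o;
        Triple(String s, String p, int o) { this.s = s; this.p = p; this.o = o; }
        @Override public String toString() { return ":" + s + " :" + p + " " + o + " ."; }
    }

    /** Subject, then predicate, then object: triples cluster by subject. */
    static final Comparator<Triple> SPO =
        Comparator.comparing((Triple t) -> t.s)
                  .thenComparing((Triple t) -> t.p)
                  .thenComparingInt((Triple t) -> t.o);

    /** Predicate, then object, then subject: triples cluster by property. */
    static final Comparator<Triple> POS =
        Comparator.comparing((Triple t) -> t.p)
                  .thenComparingInt((Triple t) -> t.o)
                  .thenComparing((Triple t) -> t.s);

    /** The five triples of D.ttl, in file order. */
    static List<Triple> data() {
        return Arrays.asList(
            new Triple("s1", "p", 1), new Triple("s2", "p", 2),
            new Triple("s3", "q", 3), new Triple("s2", "q", 4),
            new Triple("s1", "q", 5));
    }

    /** Emulates a full scan of one index: every triple, in that index's order. */
    static String dump(Comparator<Triple> indexOrder) {
        List<Triple> sorted = new ArrayList<>(data());
        sorted.sort(indexOrder);
        StringBuilder sb = new StringBuilder();
        for (Triple t : sorted) sb.append(t).append('\n');
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print("Subject-first scan:\n" + dump(SPO));
        System.out.print("Property-first scan:\n" + dump(POS));
    }
}
```

Either scan returns all five triples; only the order differs, which is why a dump taken from any one intact index is enough to rebuild the database.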





9/ Dump your database.

Hope there is a good index.


It works and no errors were reported, however the size of the dump file is just 
84MB, which is considerably smaller than the actual tdb (~1GB).


Quite possible - especially if you have also been deleting stuff in the 
database as well as adding.





You can also try indexes[2] not indexes[1] to use the OSP index.
Each dumps the entire database, but in different triple orders.


I also tried this change of indexes, and it gave me the same error:

Exception in thread "main" com.hp.hpl.jena.tdb.base.StorageException: 
RecordRangeIterator: records not strictly increasing: 
021aa0a206cffe6b0005233d // 021a2c0a06b85f9f0005233d


The OSP index is also broken.
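
For readers wondering what the message means, here is a rough sketch of the kind of check involved. It is not TDB's RecordRangeIterator: the class and method names are made up, and it assumes record keys are compared as unsigned bytes, left to right. The two keys from the error above fail that test because the third byte drops from a0 to 2c.

```java
// Hypothetical illustration only; not TDB's actual implementation.
final class StrictOrderCheckSketch {

    /** Decodes a hex string such as "021a" into bytes. */
    static byte[] fromHex(String hex) {
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        return out;
    }

    /** Lexicographic comparison treating each byte as unsigned (0..255). */
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xff;
            int y = b[i] & 0xff;
            if (x != y) return Integer.compare(x, y);
        }
        return Integer.compare(a.length, b.length);
    }

    /**
     * Returns the position of the first key that is not strictly greater
     * than its predecessor, or -1 if the whole sequence is strictly increasing.
     */
    static int firstOrderViolation(String... hexKeys) {
        for (int i = 1; i < hexKeys.length; i++) {
            if (compareUnsigned(fromHex(hexKeys[i - 1]), fromHex(hexKeys[i])) >= 0) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        int at = firstOrderViolation("021aa0a206cffe6b0005233d",
                                     "021a2c0a06b85f9f0005233d");
        System.out.println("first violation at position " + at);
    }
}
```

A scan of a healthy sorted index never trips such a check, so hitting it on two different indexes suggests that both are damaged.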




10/ Clean up maven to get rid of the temporary build.

rm -r REPO/org/apache/jena/

11/ Rebuild the database with tdbloader/tdbloader2.


java -cp jena-fuseki/target/jena-fuseki-0.2.6-SNAPSHOT-server.jar tdb.tdbloader --loc=tdb tdb.dump

but the size of the tdb is smaller than the original tdb


The loader produces more compact indexes than if the data has been 
loaded incrementally.  This is even more the case for tdbloader2.


Also if you have been deleting and adding, for 0.8, then the database 
can grow.  This is addressed, but not totally fixed, in 0.9.X



(the load is slower than if dumped in SPO order)

I tested the change here on that test file - I don't have a large corrupt 
database to try it on.


Any ideas of how to get it fixed are more than welcome.


Personally, I would adopt a 2 stream approach.

Do approach above 

Re: Combined query over different dataset and interlinking btween dataset

2013-01-29 Thread Vishal Sinha





 From: Andy Seaborne a...@apache.org
To: Vishal Sinha vishal.sinha...@yahoo.com 
Cc: users@jena.apache.org users@jena.apache.org 
Sent: Monday, January 28, 2013 5:16 PM
Subject: Re: Combined query over different dataset and interlinking btween 
dataset
 
On 28/01/13 06:24, Vishal Sinha wrote:


 
 *From:* Andy Seaborne a...@apache.org
 *To:* users@jena.apache.org
 *Sent:* Monday, January 28, 2013 3:25 AM
 *Subject:* Re: Combined query over different dataset and interlinking
 btween dataset

 On 25/01/13 06:43, ankur padia wrote:
   hello vishal,
  
        Based on my experience, a list of statements would be helpful for
   question 1, and for question 2 some if condition would need to be
   specified. I think the Filter class would be helpful, but I haven't
   come across any tutorial on Filter, so it would be a matter of
   hit and try.
  
   Regards,
   Ankur Padia
  
  
   On Fri, Jan 25, 2013 at 11:00 AM, Vishal Sinha
 vishal.sinha...@yahoo.com wrote:
  
   Hi,
  
   I have created two datasets using Jena.
   Each dataset has two or three models.
  
   Let's say the triples in Dataset1 are:
   x1 y1 z1.
   x2 y2 z2.
   x3 y3 z3.
   x4 y4 z4.
   x5 y5 z5.
   x6 y6 z6.
  
   Let's say the triples in Dataset2 are:
   x11 y11 z11.
   x21 y22 z22.
   x33 y33 z33.
   x44 y44 z44.
   x55 y55 z55.
   x66 y66 z66.
  
   My questions:
   - How can I make a combined query over these two datasets, or let's say
   multiple datasets, using Jena?
   - How can I state that 'y22' in Dataset2 is actually the same as 'y5' in
   Dataset1? Where should I keep this information?

Andy wrote:
 Do the datasets have named graphs in common?  If not, then making a
 single dataset with all the data in is one possibility.


Vishal wrote:
 Both datasets have default graph models, not any named graphs.

Then you can put both in one dataset, each as a named graph.  You can 
query a specific graph with GRAPH or the combined graphs using 
unionDefaultGraph.

++ Thanks, it works now.


Otherwise, you can create a union graph and put each graph in it.  Less 
efficient but it depends if you have a lot of data or not (= TDB 
database and several million triples).
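
As a toy picture of the difference between querying one named graph and querying the union, here is a small sketch in plain Java. It does not use the Jena API: the class NamedGraphUnionSketch, the graph names and the string triples are all made up for illustration. In Jena itself the same idea is expressed with named graphs in a dataset plus GRAPH or unionDefaultGraph, as described above.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; not the Jena Dataset API.
final class NamedGraphUnionSketch {

    // graph name -> triples in that graph (triples kept as plain strings)
    private final Map<String, Set<String>> graphs = new LinkedHashMap<>();

    /** Adds one triple to the named graph, creating the graph if needed. */
    void add(String graphName, String triple) {
        graphs.computeIfAbsent(graphName, k -> new LinkedHashSet<>()).add(triple);
    }

    /** What a query restricted to one graph sees (like GRAPH <name> { ... }). */
    Set<String> graph(String graphName) {
        return graphs.getOrDefault(graphName, Collections.<String>emptySet());
    }

    /** What a query over the union of all named graphs sees; duplicates collapse. */
    Set<String> union() {
        Set<String> all = new LinkedHashSet<>();
        for (Set<String> g : graphs.values()) all.addAll(g);
        return all;
    }

    public static void main(String[] args) {
        NamedGraphUnionSketch ds = new NamedGraphUnionSketch();
        ds.add("urn:x-example:dataset1", "x5 y5 z5");
        ds.add("urn:x-example:dataset2", "x21 y22 z22");
        System.out.println("dataset1 only: " + ds.graph("urn:x-example:dataset1"));
        System.out.println("union:         " + ds.union());
    }
}
```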

    Andy


      Andy

  
   Thanks,
  
   Vishal
  




Re: listInstances OntClass problem

2013-01-29 Thread Panagiotis Papadakos
Thanks for the reply Dave.

It seems that Fig. 5 of jena.apache.org/documentation/ontology/index.html
explains what you describe in your email. I still believe,
though, that the javadocs should be clearer about this.

Thanks again

Regards
Papadakos Panagiotis


On Tue, Jan 29, 2013 at 3:24 PM, Dave Reynolds dave.e.reyno...@gmail.com wrote:

 On 24/01/13 13:08, Panagiotis Papadakos wrote:

 Ian and Dave, thank you both for your help.

 I didn't post the correct code and I am sorry for this.

 Regarding the ontology, I know it is not correct.
 Maybe changing Europe to European, Germany to German, etc. would be
 better.

 Now regarding the listInstances method, I still believe something is
 wrong either in the code, in the API or in my way of thinking.

 listInstances is supposed to return the instances, either direct or
 instances of its subclasses. Unfortunately if I use a simple RDF_MEM
 model with no inference, listInstances(false) for the Manufacturer class
 returns no result. Somehow I feel this is wrong.
 I was thinking that internally, since there is no inference, jena should
 visit each subclass, and the subclasses of them, etc. getting the direct
 instances of each one and returning all the instances of the class and
 its subclasses. Is this correct?


 No.

 The notion is that reasoning is the job of the reasoner and that the
 OntAPI provides convenient access to that, but doesn't duplicate it. There
 are a few special cases but in general if you want reasoning then configure
 a reasoner.


  Now regarding listInstances(true), I am supposing that it should return
 all direct instances of the class, even if these instances are also
 instances of a subclass (which for example can happen if I load the
 TestInference.rdf file).


 No. That's the point of direct: as it says in the javadoc, setting
 direct=true means excluding sub-classes of this class.

 If something is also an instance of a subclass of C then it is not a
 direct instance of C and should not be returned by listInstances(true).
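
The three notions discussed in this thread can be told apart with a toy model. The sketch below is plain Java and does not call the Jena ontology API; the class DirectInstancesSketch, its method names and the subclass name EuropeanManufacturer are invented for illustration. It only mirrors the behaviour described above: without a reasoner only asserted types count, a reasoner adds instances of subclasses, and direct excludes anything that is also an instance of a subclass.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration only; not the Jena OntClass implementation.
final class DirectInstancesSketch {
    private final Map<String, Set<String>> directSubClasses = new HashMap<>();
    private final Map<String, Set<String>> assertedTypes = new HashMap<>();

    void subClassOf(String sub, String sup) {
        directSubClasses.computeIfAbsent(sup, k -> new TreeSet<>()).add(sub);
    }

    void type(String individual, String cls) {
        assertedTypes.computeIfAbsent(individual, k -> new TreeSet<>()).add(cls);
    }

    /** All strict subclasses of cls, following subClassOf transitively. */
    Set<String> strictSubClasses(String cls) {
        Set<String> seen = new TreeSet<>();
        Deque<String> todo = new ArrayDeque<>();
        todo.push(cls);
        while (!todo.isEmpty()) {
            Set<String> subs = directSubClasses.getOrDefault(todo.pop(), Collections.<String>emptySet());
            for (String sub : subs) {
                if (!sub.equals(cls) && seen.add(sub)) todo.push(sub);
            }
        }
        return seen;
    }

    /** No reasoner: only individuals with an asserted type of exactly cls. */
    Set<String> assertedInstances(String cls) {
        Set<String> out = new TreeSet<>();
        for (Map.Entry<String, Set<String>> e : assertedTypes.entrySet()) {
            if (e.getValue().contains(cls)) out.add(e.getKey());
        }
        return out;
    }

    /** With subclass reasoning: instances of cls or of any of its subclasses. */
    Set<String> inferredInstances(String cls) {
        Set<String> out = assertedInstances(cls);
        for (String sub : strictSubClasses(cls)) out.addAll(assertedInstances(sub));
        return out;
    }

    /** direct: instances of cls that are not instances of any strict subclass. */
    Set<String> directInstances(String cls) {
        Set<String> out = inferredInstances(cls);
        for (String sub : strictSubClasses(cls)) out.removeAll(assertedInstances(sub));
        return out;
    }

    public static void main(String[] args) {
        DirectInstancesSketch m = new DirectInstancesSketch();
        m.subClassOf("EuropeanManufacturer", "Manufacturer");
        m.type("a", "EuropeanManufacturer");
        System.out.println("no reasoner:   " + m.assertedInstances("Manufacturer"));
        System.out.println("with reasoner: " + m.inferredInstances("Manufacturer"));
        System.out.println("direct:        " + m.directInstances("Manufacturer"));
    }
}
```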

 Dave




-- 
http://www.flickr.com/photos/papadako


OWL Deprecation in Schemagen-generated classes

2013-01-29 Thread Joshua TAYLOR
A colleague and I have been using Jena's schemagen to get lots of
generated constants from a vocabulary we've developed.  We're at the
point that we're marking some of the vocabulary deprecated.  It would
be convenient for our application code that uses the vocabulary if the
vocabulary constants that are deprecated also had a Java deprecation
annotation.  Our application would then generate compiler warnings
where it used deprecated vocabulary.  This raises two questions:

* We didn't find anything in the Jena schemagen doc describing this.
Are we correct that schemagen can't presently do this?
* This probably isn't too hard to implement;  we might go and do it if
we get some free time.  Is there any interest in this?  (I.e., if we
submitted it as a patch, would it be added to Jena, and would it be
useful to anyone?)
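
To show the kind of output we have in mind, here is a hand-written sketch. It is not schemagen output: the class name, the namespace and both terms are made up, and the constants are plain strings rather than Jena resources so that the example stands alone. The small reflection helper is only there to show that the annotation is visible at run time as well as to the compiler.

```java
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical, hand-written example; schemagen does not generate this.
final class ExampleVocabSketch {
    /** Made-up namespace for illustration. */
    static final String NS = "http://example/vocab#";

    /** A current term. */
    static final String NEW_TERM = NS + "newTerm";

    /** @deprecated marked deprecated in the vocabulary; use {@link #NEW_TERM}. */
    @Deprecated
    static final String OLD_TERM = NS + "oldTerm";

    /** Names of the constants in this class that carry @Deprecated, sorted. */
    static List<String> deprecatedConstants() {
        List<String> names = new ArrayList<>();
        for (Field f : ExampleVocabSketch.class.getDeclaredFields()) {
            if (f.isAnnotationPresent(Deprecated.class)) names.add(f.getName());
        }
        Collections.sort(names);
        return names;
    }

    public static void main(String[] args) {
        System.out.println("deprecated constants: " + deprecatedConstants());
    }
}
```

Application code in another class that refers to OLD_TERM would then get a deprecation warning from javac, which is the effect we are after.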

Thanks,  //JT

-- 
Joshua Taylor, http://www.cs.rpi.edu/~tayloj/


Re: OWL Deprecation in Schemagen-generated classes

2013-01-29 Thread Joshua TAYLOR
 On Tue, Jan 29, 2013 at 12:55 PM, Joshua TAYLOR joshuaaa...@gmail.com wrote:
 A colleague and I have been using Jena's schemagen to get lots of
 generated constants from a vocabulary we've developed.  We're at the
 point that we're marking some of the vocabulary deprecated.  It would
 be convenient for our application code that uses the vocabulary if the
 vocabulary constants that are deprecated also had a Java deprecation
 annotation.  Our application would then generate compiler warnings
 where it used deprecated vocabulary.  This raises two questions:

 * We didn't find anything in the Jena schemagen doc describing this.
 Are we correct that schemagen can't presently do this?
 * This probably isn't too hard to implement;  we might go and do it if
 we get some free time.  Is there any interest in this?  (I.e., if we
 submitted it as a patch, would it be added to Jena, and would it be
 useful to anyone?)

On Tue, Jan 29, 2013 at 1:03 PM, Stephen Allen sal...@apache.org wrote:
 Sounds very useful to me, I use schemagen a fair amount.  Looking
 forward to a patch.  The best way to submit it would be to create a
 new issue on our JIRA site [1], and submit it there as an attachment.

 -Stephen

 [1] https://issues.apache.org/jira/browse/JENA

Sounds like a plan.  It's not a particularly high priority thing for
us at the moment, so I don't have any particular ETA, but it's on the
long-term to-do if we get the time list.  :)

//JT

-- 
Joshua Taylor, http://www.cs.rpi.edu/~tayloj/


Re: Binding causes hang in Fuseki

2013-01-29 Thread Rob Walpole
Cool, thanks guys, will give this a try tomorrow :-)

Rob


On Tue, Jan 29, 2013 at 7:36 PM, Andy Seaborne a...@apache.org wrote:

 On 29/01/13 18:21, Alexander Dutton wrote:



 Hi Rob,

 On 29/01/13 18:11, Rob Walpole wrote:

 Am I doing something wrong here?


 The short answer is that the inner SELECT is evaluated first, leading to
 the results being calculated in the second case in a rather inefficient
 way.

 In the first inner SELECT ?deselected is bound, so it's quite quick to
 find all its ancestors.

 In the second, all possible ?deselected and ?ancestor pairs are returned
 by the inner query, which are then (effectively) filtered to remove all
 the pairs where ?deselected isn't whatever it was BINDed to.
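
The cost difference can be sketched with a toy example in plain Java. This is not how ARQ evaluates queries: the class SubqueryOrderSketch, its methods and the child-to-parent data are invented for illustration. With the node already bound, only one chain of parents is walked; with it unbound, every (node, ancestor) pair is produced first and most of them are thrown away afterwards.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not ARQ's evaluation strategy.
final class SubqueryOrderSketch {

    /** child -> parent links standing in for dri:parent triples (toy data). */
    static Map<String, String> parentLinks() {
        Map<String, String> m = new LinkedHashMap<>();
        m.put("a", "b");
        m.put("b", "c");
        m.put("c", "d");
        m.put("x", "y");
        return m;
    }

    /** Node already bound: walk up from that one node only. */
    static List<String> ancestorsOf(Map<String, String> parent, String node) {
        List<String> out = new ArrayList<>();
        String cur = parent.get(node);
        while (cur != null && !out.contains(cur)) { // contains() guards against cycles
            out.add(cur);
            cur = parent.get(cur);
        }
        return out;
    }

    /** Node unbound: every (node, ancestor) pair in the data. */
    static List<String[]> allPairs(Map<String, String> parent) {
        List<String[]> pairs = new ArrayList<>();
        for (String node : parent.keySet()) {
            for (String anc : ancestorsOf(parent, node)) {
                pairs.add(new String[] { node, anc });
            }
        }
        return pairs;
    }

    /** Applying the outer binding afterwards: keep only the pairs for one node. */
    static List<String> keepFor(List<String[]> pairs, String node) {
        List<String> out = new ArrayList<>();
        for (String[] pair : pairs) {
            if (pair[0].equals(node)) out.add(pair[1]);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> parent = parentLinks();
        List<String[]> pairs = allPairs(parent);
        System.out.println("bound first:   " + ancestorsOf(parent, "a"));
        System.out.println("all pairs:     " + pairs.size() + " intermediate rows");
        System.out.println("then filtered: " + keepFor(pairs, "a"));
    }
}
```

Both routes give the same ancestors for the chosen node; the second simply does far more work on a large hierarchy.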

 Here's more from the spec:
 http://www.w3.org/TR/sparql11-query/#subqueries

 I /think/ ARQ is able to perform some optimisations along these lines,
 but obviously not for your query.


 Spot on.

 If you remove the inner SELECT it should do better.



   { BIND(... AS ?readyStatus)
     BIND(... AS ?deselected)
     ?export rdfs:member ?member .
     ?export dri:username "rwalpole"^^xsd:string .
     ?export dri:exportStatus ?readyStatus
     OPTIONAL
       { ?deselected (dri:parent)+ ?ancestor
         FILTER EXISTS { ?export rdfs:member ?ancestor }
       }
   }

 but technically this is a different query so it'll depend on your data as
 to whether it is right.

 http://www.sparql.org/query-validator.html

 Andy



 Best regards,

 Alex

 PS. You don't need to do URI("http://..."); you can do a straight IRI
 literal: <http://...>

 - --
 Alexander Dutton
 Developer, Office of the CIO; data.ox.ac.uk, OxPoints
 IT Services, University of Oxford





-- 

Rob Walpole
Email robkwalp...@gmail.com
Tel. +44 (0)7969 869881
Skype: RobertWalpole
http://www.linkedin.com/in/robwalpole