A fault-tolerant and replicated data publishing solution (by Epimorphics)... and how to calculate the triples to add/remove?

2012-05-18 Thread Paolo Castagna
Hi,
I've just read this blog post from Andy:
http://www.epimorphics.com/web/wiki/epimorphics-builds-data-publish-platform-environment-agency

It describes a quite simple fault-tolerant and replicated data publishing 
solution using Apache Jena and Fuseki. Interesting.

It's a master/slave architecture. The master (called the 'controller server'
in Andy's post) receives all updates and calculates the triples to be added
and the triples to be removed, so that changes are 'idempotent' (i.e. they can
be reapplied multiple times, in the same order, with the same effect).

It would be interesting to know whether the 'controller server' exposes a full
SPARQL Update endpoint and/or the Graph Store HTTP Protocol and, if that is the
case, how the triples to be added/removed are calculated. (This is something I
have wanted to learn for a while, but I still have not found the time... a small
example would be wonderful! ;-))
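
For what it's worth, here is a minimal (untested) sketch of one way to compute
such a diff with the Jena API, assuming you can hold the 'before' and 'after'
states of a graph in memory as Models; Model.difference() gives you the set of
triples to remove and the set to add (the file names are made up):

  import com.hp.hpl.jena.rdf.model.Model;
  import com.hp.hpl.jena.rdf.model.ModelFactory;

  public class GraphDiff {
      public static void main(String[] args) {
          Model before = ModelFactory.createDefaultModel();
          Model after  = ModelFactory.createDefaultModel();
          before.read("file:before.ttl", "TURTLE");   // hypothetical input files
          after.read("file:after.ttl", "TURTLE");

          Model toRemove = before.difference(after);  // statements only in 'before'
          Model toAdd    = after.difference(before);  // statements only in 'after'

          // Applying 'toRemove' then 'toAdd' to a replica is idempotent:
          // replaying the same pair of change sets leaves the replica unchanged.
          toRemove.write(System.out, "N-TRIPLE");
          toAdd.write(System.out, "N-TRIPLE");
      }
  }

Whether this is anything like what the 'controller server' actually does, I
don't know.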

To conclude, I fully agree with the quite simple design; simple systems are
easier to operate. The approach described can work well in a lot of scenarios
where the rate of updates/writes isn't excessive and you have mostly reads
(which I still believe to be the case most of the time with RDF data, since
the data is often human generated/curated).
My hope is to see something similar in the 'open' so that other Apache Jena and
Fuseki users can benefit from a highly available and open source publishing
solution for RDF data (and they can focus their energies/efforts elsewhere: on
the quality of their data modelling, data, applications, user experience, etc.).

Paolo

PS:
Disclaimer: I don't work for Epimorphics; those are just my personal opinions
and, last but not least, I love simplicity.


Re: Ideas for an efficient TDB check?

2012-05-25 Thread Paolo Castagna
Hi André,
I know exactly how you feel, and I have had the same need at times.

How do you know whether your TDB indexes are all fine?

Add the word 'production' to that and everything becomes more 'fun'. :-)
Fortunately, we use replication and have the ability to replay updates going
back as far as we want/need. This makes things more 'relaxing'. But this is
not the answer you are looking for right now.

I do not have *the* answer for you, nor a tool, but in the past I've done
something similar to what you suggested: a sort of TDB index verifier/health
checker, see [1]. It's just a quick and dirty solution (not scalable... it
keeps stuff in memory, etc.), but perhaps it gives you some ideas.
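
Along the same lines as your idea (iterate over all statements and force the
node table lookups for S, P and O), here is a rough, untested sketch using only
the public TDB API; it won't catch every kind of corruption, but a broken index
or node table usually surfaces as an exception or a wrong count while scanning:

  import com.hp.hpl.jena.rdf.model.Model;
  import com.hp.hpl.jena.rdf.model.Statement;
  import com.hp.hpl.jena.rdf.model.StmtIterator;
  import com.hp.hpl.jena.tdb.TDBFactory;

  public class TDBScanCheck {
      public static void main(String[] args) {
          Model model = TDBFactory.createModel(args[0]); // TDB location to check
          long count = 0;
          StmtIterator it = model.listStatements();
          try {
              while (it.hasNext()) {
                  Statement s = it.next();
                  // Force the node table lookups for S, P and O.
                  s.getSubject().toString();
                  s.getPredicate().toString();
                  s.getObject().toString();
                  count++;
              }
              System.out.println("Scanned " + count + " statements, no errors.");
          } catch (Exception e) {
              System.out.println("Problem after " + count + " statements: " + e);
          } finally {
              it.close();
              model.close();
          }
      }
  }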

If a TDB health checking utility is useful and feasible, we should probably open
a JIRA issue for it and gather ideas on how to best implement this. It should
not be too much work.

You are still using TDB 0.8.10, but the on-disk format hasn't changed... so it's
reasonable to expect such functionality would work with your indexes as well.

My 2 cents,
Paolo

 [1]
https://github.com/castagna/tdbloader4/blob/f5363fa49d16a04a362898c1a5084ade620ee81b/src/test/java/dev/TDBVerifier.java


Dr. André Lanka wrote:
 Hello Jena-Users,
 
 we are using Jena+TDB in production and are looking for an efficient
 method to check the validity of the TDB files on disk.
 
 Our situation is as follows.
 
 With Jena 2.6.4 and TDB 0.8.10, each of our servers stores triples in up
 to 4000 different TDB stores on its local hard drive. On average
 each store holds 1 million triples (with high variance). To get our
 system working fluently, we need massively parallel write access to the
 different stores, so one huge named graph is not an alternative. Also we
 need to have all stores open and accessible.
 
 In order to get that large number of TDB stores opened in parallel, we
 customised the TDB code for our needs. For instance we introduced read
 caches shared between all stores (to avoid memory problems). Also we
 introduced basic capabilities to roll back transactions. (We took
 control over all data read from or written to ObjectFile and BlockMgr).
 
 So, in our situation we can't switch to the new TDB version overnight.
 
 Now, the problem is that we had some disk issues a few days ago and want
 to check which stores got broken (we know some of them are broken).
 
 Our initial idea is to iterate over all statements in the store and
 collect every S, P and O used in the store. The second step would be to check
 whether each such URI is correctly mapped to a node ID, and the other way round.
 
 Unfortunately we are not sure if this will cover every possible file
 problem. Also, we think there could be a more efficient way to check the
 internal data structures.
 
 
 So, any idea (both high and low level) is highly appreciated.
 
 
 Thanks in advance
 André
 



Re: Import Measures

2012-06-24 Thread Paolo Castagna
Stefan Scheffler wrote:
 Hey Paolo.
 Thanks for your reply.
 I used tdbloader2 with my own Tokenizer / ErrorHandler (which just
 catches / skips errors and writes them to a file).
 The command was ./tdbloader2 --loc=store srcpath/*
 
 Is there a possibility to do incremental loads with the script files, or
 do I have to write my own program?

Hi Stefan,
if you want to run an incremental load you should use tdbloader, not tdbloader2.
tdbloader supports incremental loads; tdbloader2 does not.
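
For instance, something like this (a sketch, untested; the location and file
name are made up) adds data to an existing TDB store instead of rebuilding it:

  import com.hp.hpl.jena.rdf.model.Model;
  import com.hp.hpl.jena.tdb.TDB;
  import com.hp.hpl.jena.tdb.TDBFactory;

  public class IncrementalLoad {
      public static void main(String[] args) {
          Model model = TDBFactory.createModel("store");  // existing TDB location
          model.read("file:more-data.nt", "N-TRIPLE");    // adds to what is already there
          TDB.sync(model);
          model.close();
      }
  }

(The tdbloader script does the same kind of thing from the command line, with
progress reporting and bulk-loading tricks.)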

If you are loading large datasets, make sure you have enough RAM (you can load
the data on a machine with a lot of RAM and then move the indexes elsewhere).

Paolo

 
 Regards,
 Stefan
 
 Am 24.06.2012 10:42, schrieb Paolo Castagna:
 Hi Stefan,
 as Rob said, loading data into an empty TDB store is different from loading
 data into an existing TDB store.

 I assume that for your second data load you used tdbloader not
 tdbloader2.

 tdbloader2 does not even support incremental data loads (i.e. it will
 overwrite
 your existing data). I suspect this is what is going on.

 Can you share the exact commands you used as well as links to the RDF
 data?
 (this way others can replicate your experiments).

 Regards,
 Paolo

 Stefan Scheffler wrote:
 Hello,
 At the moment I am doing some performance checks on TDB. The first thing I
 checked was the import with tdbloader2, and I got some weird results.
 Maybe someone can help me out. Here are my test setup and the results.

 The first test was to store 12 GB of triples into an empty store (I used
 the German DBpedia).

 Load time: 16 minutes
 average loading: ca. 81,000 triples / second
 index time: 40 minutes
 store size: 9.3 GB


 The second test was to store the same data into an already filled store.
 Before I started the import I created a store with 348,398,593 triples from
 DNB and HBZ (which are German libraries; store size: 33 GB).
 Then I started to load the German DBpedia into it.

 Load time: 3 hours and 4 minutes
 average loading: ca. 7,200 triples / second
 index time: 38 minutes
 store size: 19 GB!

 Why does the loading time increase so immensely? My expectation was
 that the index time would increase, but it does not. There were no other big
 processes running at the same time. And why does the store size shrink to 19 GB? I
 am totally confused about that point.

 With friendly regards
 Stefan

 
 


Re: just trying to read in an RDF file ...

2012-06-24 Thread Paolo Castagna
Andy Seaborne wrote:
 On 24/06/12 09:23, Paolo Castagna wrote:
 Jena has moved to the Apache Software Foundation and it is not a TLPc
 (i.e. top level project).
 
 s/not/now/ 

Sorry. Yes, now! ;-)

Paolo

 
 Jena is a TLP.
 
 Andy
 


Re: Want to run SPARQL Query with Hadoop Map Reduce Framework

2012-06-25 Thread Paolo Castagna
Hi Mizanur,
when you have big RDF datasets it might make sense to use MapReduce (but only
if you already have a Hadoop cluster at hand; is this your case?).
You say that your data is 'huge'; just for the sake of curiosity... how many
triples/quads is 'huge'? ;-)
Most of the use cases I've seen related to statistics on RDF datasets were 
trivial MapReduce jobs.

For a couple of examples on using MapReduce with RDF datasets have a look here:
https://github.com/castagna/jena-grande
https://github.com/castagna/tdbloader4

This, for example, is certainly not exactly what you need, but I am sure that
with a few small changes you can get what you want:
https://github.com/castagna/tdbloader4/blob/master/src/main/java/org/apache/jena/tdbloader4/StatsDriver.java

Last but not least, you'll need to dump your RDF data out onto HDFS.
I suggest you use N-Triples/N-Quads serialization formats.
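
Just as an illustration (untested, and with a naive line-based split that a
real job should replace with a proper N-Triples parser such as RIOT), a mapper
along these lines counts in which position each node appears, once the data is
on HDFS as N-Triples; pair it with a standard sum reducer:

  import java.io.IOException;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;

  public class NodePositionMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
      private static final IntWritable ONE = new IntWritable(1);
      private final Text outKey = new Text();

      @Override
      protected void map(LongWritable key, Text value, Context context)
              throws IOException, InterruptedException {
          // Naive split of an N-Triples line: <s> <p> <o> .
          // (breaks on literals containing spaces -- use a real parser in practice)
          String[] parts = value.toString().trim().split("\\s+", 4);
          if (parts.length < 4) return;             // skip blank or malformed lines
          outKey.set(parts[0] + "|S"); context.write(outKey, ONE);
          outKey.set(parts[1] + "|P"); context.write(outKey, ONE);
          outKey.set(parts[2] + "|O"); context.write(outKey, ONE);
      }
  }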

Running SPARQL queries on top of a Hadoop cluster is another (long and not
easy) story.
It might be possible to translate part of the SPARQL algebra into Pig
Latin scripts and use Pig.
In my opinion, however, it makes more sense to use MapReduce to filter/slice
massive datasets, load the result into a triple store and refine your data
analysis with SPARQL there.

My 2 cents,
Paolo

Md. Mizanur Rahoman wrote:
 Dear All,
 
 I want to collect some statistics over RDF data. My triple store is
 Virtuoso and I am using Jena for executing my queries. I want to get some
 statistics like:
 i) how many resources are in my dataset; ii) in which positions of the dataset
 (i.e., sub/prd/obj) each resource appears; etc. As my data is huge, I want to
 use Hadoop MapReduce to calculate such statistics.
 
 Can you please suggest.
 


Re: Want to run SPARQL Query with Hadoop Map Reduce Framework

2012-06-26 Thread Paolo Castagna
Md. Mizanur Rahoman wrote:
 Hi Paolo,
 
 Thanks for your reply.
 
 Right now I am only using DBpedia, GeoNames and NYTimes from the LOD cloud, and
 later on I want to extend my dataset.

Ok, so it's big, but not huge! ;-)
If you have enough RAM you can do everything on a single machine.

 By the way, yes, I can use SPARQL directly to collect the required
 statistics, but my assumption is that using Hadoop could give me some boost in
 collecting those stats.

Well, it all depends on whether you already have a Hadoop cluster you can use.
If not, a single machine with a lot of RAM might be easier/faster/better.
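
For example, something as simple as this (a sketch; the TDB location is just an
example) already gives you the number of distinct subjects, and similar queries
cover predicates and objects:

  import com.hp.hpl.jena.query.Dataset;
  import com.hp.hpl.jena.query.QueryExecution;
  import com.hp.hpl.jena.query.QueryExecutionFactory;
  import com.hp.hpl.jena.query.ResultSetFormatter;
  import com.hp.hpl.jena.tdb.TDBFactory;

  public class CountSubjects {
      public static void main(String[] args) {
          Dataset dataset = TDBFactory.createDataset("/data/tdb");  // example TDB location
          String q = "SELECT (COUNT(DISTINCT ?s) AS ?subjects) WHERE { ?s ?p ?o }";
          QueryExecution qe = QueryExecutionFactory.create(q, dataset);
          try {
              ResultSetFormatter.out(System.out, qe.execSelect());
          } finally {
              qe.close();
          }
      }
  }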

 I will get back to you after going through your links.

Sure, let me know how it goes.

Paolo

 
 -
 Sincerely
 Md Mizanur
 
 
 
 On Tue, Jun 26, 2012 at 12:50 AM, Paolo Castagna 
 castagna.li...@googlemail.com wrote:
 
 Hi Mizanur,
 when you have big RDF datasets, it might make sense to use MapReduce (but
 only if you already have an Hadoop cluster at hand. Is this your case?).
 You say that your data is 'huge', just for the sake of curiosity... how
 many triples/quads is 'huge'? ;-)
 Most of the use cases I've seen related to statistics on RDF datasets were
 trivial MapReduce jobs.

 For a couple of examples on using MapReduce with RDF datasets have a look
 here:
 https://github.com/castagna/jena-grande
 https://github.com/castagna/tdbloader4

 This, for example, is certainly not exactly what you need, but I am sure
 that with little changes you can get what you want:

 https://github.com/castagna/tdbloader4/blob/master/src/main/java/org/apache/jena/tdbloader4/StatsDriver.java

 Last but not least, you'll need to dump your RDF data out onto HDFS.
 I suggest you use N-Triples/N-Quads serialization formats.

 Running SPARQL queries on top of an Hadoop cluster is another (long and
 not easy) story.
 But, it might be possible to translate part of the SPARQL algebra into Pig
 Latin scripts and use Pig.
 In my opinion however, it makes more sense to use MapReduce to
 filter/slice massive datasets, load the result into a triple store and
 refine your data analysis using SPARQL there.

 My 2 cents,
 Paolo

 Md. Mizanur Rahoman wrote:
 Dear All,

 I want to collect some statistics over RDF data. My triple store is
 Virtuoso and I am using Jena for executing my query.  I want to get some
 statistics like
 i) how many resources in my dataset ii) resources belong to in which
 position of dataset (i.e., sub/prd/obj) etc. As my data is huge, I want
 to
 use Hadoop Map Reduce in calculating such statistics.

 Can you please suggest.

 
 
 


Re: LARQ prefix search results missing hits

2012-08-17 Thread Paolo Castagna
Hi Osma,
thanks for your help and feedback.

Does your problem go away without changing the code and using:
?lit pf:textMatch ( 'a*' 10 )

It's not a problem to add a couple of '0's...
However, I am thinking that this would just shift the problem, wouldn't it?

Paolo

On 15/08/12 10:31, Osma Suominen wrote:
 Hi Paolo!

 Thanks for your reply and sorry for the delay.

 I tested this again with today's svn snapshot and it's still a problem.

 However, after digging a bit further I found this in
 jena-larq/src/main/java/org/apache/jena/larq/LARQ.java:

 --clip--
 // The number of results returned by default
 public static final int NUM_RESULTS = 1000 ; // should
 we increase this? -- PC
 --clip--

 I changed NUM_RESULTS to 100000 (added two zeros), built and installed
 my modified LARQ with mvn install (NB this required tweaking arq.ver
 and tdb.ver in jena-larq/pom.xml to match the current svn versions),
 rebuilt Fuseki and now the problem is gone!

 I would suggest that this constant be increased to something larger
 than 1000. Based on the code comment, you seem to have had similar
 thoughts sometime in the past :)

 Thanks,
 Osma


 15.07.2012 11:21, Paolo Castagna kirjoitti:
 Hi Osma,
 first of all, thanks for sharing your experience and clearly describing
 your problem.
 Further comments inline.

 On 13/07/12 14:13, Osma Suominen wrote:
 Hello!

 I'm trying to use a Fuseki SPARQL endpoint together with LARQ to
 create a system for accessing SKOS thesauri. The user interface
 includes an autocompletion widget. The idea is to use the LARQ index
 to make fast prefix queries on the concept labels.

 However, I've noticed that in some situations I get less results from
 the index than what I'd expect. This seems to happen when the LARQ
 part of the query internally produces many hits, such as when doing a
 single character prefix query (e.g. ?lit pf:textMatch 'a*').

 I'm using Fuseki 0.2.4-SNAPSHOT taken from SVN trunk on 2012-07-10 and
 LARQ 1.0.0-incubating. I compiled Fuseki with LARQ by adding the LARQ
 dependency to pom.xml and running mvn package. Other than this issue,
 Fuseki and LARQ queries seem to work fine. I'm using Ubuntu Linux
 12.04 LTS amd64 with OpenJDK 1.6.0_24 installed from the standard
 Ubuntu packages.


 Steps to repeat:

 1. package Fuseki with LARQ, as described above

 2. start Fuseki with the attached configuration file, i.e.
 ./fuseki-server --config=larq-config.ttl

 3. I'm using the STW thesaurus as an easily accessible example data
 set (though the problem was originally found with other data sets):
 - download http://zbw.eu/stw/versions/latest/download/stw.rdf.zip
 - unzip so you have stw.rdf

 4. load the thesaurus file into the endpoint:
 ./s-put http://localhost:3030/ds/data default stw.rdf

 6. build the LARQ index, e.g. this way:
 - kill Fuseki
 - rm -r /tmp/lucene
 - start Fuseki again, so the index will be built

 7. Make SPARQL queries from the web interface at http://localhost:3030

 First try this SPARQL query:

 PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
 PREFIX pf: <http://jena.hpl.hp.com/ARQ/property#>
 SELECT DISTINCT * WHERE {
   ?lit pf:textMatch 'ar*' .
   ?conc skos:prefLabel ?lit .
   FILTER(REGEX(?lit, '^ar.*', 'i'))
 } ORDER BY ?lit

 I get 120 hits, including "Arab"@en.

 Now try the same query, but change the pf:textMatch argument to 'a*'.
 This way I get only 32 results, not including "Arab"@en, even though
 the shorter prefix query should match a superset of what was matched
 by the first query (the regex should still filter it down to the same
 result set).


 This issue is not just about single character prefix queries. With
 enough data sets loaded into the same index, this happens with longer
 prefix queries as well.

 I think that the problem might be related to Lucene's default
 limitation of a maximum of 1024 clauses in boolean queries (and thus
 prefix query matches), as described in the Lucene FAQ:
 http://wiki.apache.org/lucene-java/LuceneFAQ#Why_am_I_getting_a_TooManyClauses_exception.3F



 Yes, I think your hypothesis might be correct (I've not verified it
 yet).

 In case this is the problem, is there any way to tell LARQ to use a
 higher BooleanQuery.setMaxClauseCount() value so that this limit is
 not triggered? I find it a bit disturbing that hits are silently being
 lost. I couldn't see any special output on the Fuseki log.

 Not sure about this.

 Paolo


 Am I doing something wrong? If this is a genuine problem in LARQ, I
 can of course make a bug report.


 Thanks and best regards,
 Osma Suominen








Re: LARQ prefix search results missing hits

2012-08-26 Thread Paolo Castagna
Hi Osma

On 20/08/12 11:10, Osma Suominen wrote:
 Hi Paolo!
 
 Thanks for your quick reply.
 
 17.08.2012 20:16, Paolo Castagna wrote:
 Does your problem go away without changing the code and using:
 ?lit pf:textMatch ( 'a*' 10 )
 
 I tested this but it didn't help. If I use a parameter less than 1000
 then I get even fewer hits, but values above 1000 don't have any effect.

Right.

 I think the problem is this line in IndexLARQ.java:
 
 TopDocs topDocs = searcher.search(query, (Filter)null, LARQ.NUM_RESULTS ) ;
 
 As you can see the parameter for maximum number of hits is taken
 directly from the NUM_RESULTS constant. The value specified in the query
 has no effect on this level.

Correct.

 It's not a problem adding a couple of '0'...
 However, I am thinking that this would just shift the problem, isn't it?
 
 You're right, it would just shift the problem but a sufficiently large
 value could be used that never caused problems in practice. Maybe you
 could consider NUM_RESULTS = Integer.MAX_VALUE ? :)

A lot of search use cases are about driving a UI for people, and often
only the first few results are necessary.

Try to keep hitting 'next' on Google: how many results can you actually get?

;-)

Anyway, I increased the NUM_RESULTS constant.

 Or maybe LARQ should use another variant of Lucene's
 IndexSearcher.search(), one which takes a Collector object instead of
 the integer n parameter. E.g. this:
 http://lucene.apache.org/core/old_versioned_docs/versions/3_1_0/api/core/org/apache/lucene/search/IndexSearcher.html#search%28org.apache.lucene.search.Query,%20org.apache.lucene.search.Filter,%20org.apache.lucene.search.Collector%29

Yes. That would be the thing to use if we want to retrieve all the
results from Lucene.
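
For example (just a sketch, untested), a Collector that keeps every matching
doc id, which searcher.search(query, (Filter)null, collector) would then fill
without any fixed limit:

  import java.util.ArrayList;
  import java.util.List;
  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.search.Collector;
  import org.apache.lucene.search.Scorer;

  public class AllHitsCollector extends Collector {
      private final List<Integer> docIds = new ArrayList<Integer>();
      private int docBase = 0;

      @Override public void setScorer(Scorer scorer) { }            // scores not needed
      @Override public void collect(int doc) { docIds.add(docBase + doc); }
      @Override public void setNextReader(IndexReader reader, int base) { this.docBase = base; }
      @Override public boolean acceptsDocsOutOfOrder() { return true; }

      public List<Integer> getDocIds() { return docIds; }
  }

The downside, of course, is that with no limit a very common term can make the
collector (and the query) arbitrarily large.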

More thinking is necessary here...

In the meantime, you can find a LARQ SNAPSHOT here:
https://repository.apache.org/content/groups/snapshots/org/apache/jena/jena-larq/1.0.1-SNAPSHOT/

Paolo

 
 
 Thanks,
 Osma
 
 
 On 15/08/12 10:31, Osma Suominen wrote:
 Hi Paolo!

 Thanks for your reply and sorry for the delay.

 I tested this again with today's svn snapshot and it's still a problem.

 However, after digging a bit further I found this in
 jena-larq/src/main/java/org/apache/jena/larq/LARQ.java:

 --clip--
  // The number of results returned by default
  public static final int NUM_RESULTS = 1000 ; // should
 we increase this? -- PC
 --clip--

 I changed NUM_RESULTS to 100000 (added two zeros), built and installed
 my modified LARQ with mvn install (NB this required tweaking arq.ver
 and tdb.ver in jena-larq/pom.xml to match the current svn versions),
 rebuilt Fuseki and now the problem is gone!

 I would suggest that this constant be increased to something larger
 than 1000. Based on the code comment, you seem to have had similar
 thoughts sometime in the past :)

 Thanks,
 Osma


 15.07.2012 11:21, Paolo Castagna kirjoitti:
 Hi Osma,
 first of all, thanks for sharing your experience and clearly describing
 your problem.
 Further comments inline.

 On 13/07/12 14:13, Osma Suominen wrote:
 Hello!

 I'm trying to use a Fuseki SPARQL endpoint together with LARQ to
 create a system for accessing SKOS thesauri. The user interface
 includes an autocompletion widget. The idea is to use the LARQ index
 to make fast prefix queries on the concept labels.

 However, I've noticed that in some situations I get less results from
 the index than what I'd expect. This seems to happen when the LARQ
 part of the query internally produces many hits, such as when doing a
 single character prefix query (e.g. ?lit pf:textMatch 'a*').

 I'm using Fuseki 0.2.4-SNAPSHOT taken from SVN trunk on 2012-07-10 and
 LARQ 1.0.0-incubating. I compiled Fuseki with LARQ by adding the LARQ
 dependency to pom.xml and running mvn package. Other than this issue,
 Fuseki and LARQ queries seem to work fine. I'm using Ubuntu Linux
 12.04 LTS amd64 with OpenJDK 1.6.0_24 installed from the standard
 Ubuntu packages.


 Steps to repeat:

 1. package Fuseki with LARQ, as described above

 2. start Fuseki with the attached configuration file, i.e.
  ./fuseki-server --config=larq-config.ttl

 3. I'm using the STW thesaurus as an easily accessible example data
 set (though the problem was originally found with other data sets):
  - download http://zbw.eu/stw/versions/latest/download/stw.rdf.zip
  - unzip so you have stw.rdf

 4. load the thesaurus file into the endpoint:
  ./s-put http://localhost:3030/ds/data default stw.rdf

 6. build the LARQ index, e.g. this way:
  - kill Fuseki
  - rm -r /tmp/lucene
  - start Fuseki again, so the index will be built

 7. Make SPARQL queries from the web interface at http://localhost:3030

 First try this SPARQL query:

 PREFIX skos:http://www.w3.org/2004/02/skos/core#
 PREFIX pf:http://jena.hpl.hp.com/ARQ/property#
 SELECT DISTINCT * WHERE {
 ?lit pf:textMatch ar* .
 ?conc skos:prefLabel ?lit .
 FILTER(REGEX(?lit, '^ar.*', 'i

Re: LARQ prefix search results missing hits

2012-09-10 Thread Paolo Castagna
Hi Osma

On 28/08/12 14:22, Osma Suominen wrote:
 Hi Paolo!

 Thanks a lot for the fix! I have tested the latest snapshot and it now
 works as expected. At least until I add lots of new data and hit the new
 limit :)


 You're of course right about the search use case. I think the problem
 here is that the LARQ index can be used for two very different use cases:

 A. Traditional IR, in which the user cares about only the first few
 results. Lucene is obviously very good at this, though full advantage
 (especially for non-English languages) of it can only be achieved by
 using specific Analyzer implementations, which appears not to be
 supported in LARQ, at least not without writing some Java code.

 B. Speeding up queries on literals for e.g. autocomplete search. While
 this can be done without a text index using FILTER(REGEX()), the queries
 tend to be quite slow, as the filter is applied only afterwards. In this
 case it is important that the text index returns all possible hits, not
 just the first ones.

 I have no idea which is the more important use case for LARQ, but I'm
 currently only interested in B because of the requirements of the
 application I'm building (ONKI Light, a SKOS vocabulary browser for
 SPARQL endpoints).

Do you have any ideas/proposals for making LARQ good for both of these
use cases?

 Currently the benefits of LARQ (at least for the out-of-the-box
 configuration for Fuseki+LARQ) for both A and B are somewhat diminished
 by these limitations:

 1. The index is global and contains data from all named graphs mixed up.
 This means that when you have many named graphs with different data (as
 I do), and try to query only one graph, the LARQ query part will still
 return hits from all the other graphs, slowing down later parts of the
 query.

Yep.

I thought about this a while ago, but I haven't actually tried to implement
it. The changes to the index are trivial. The most difficult part is perhaps
on the property function side, but maybe that is easy as well.

I think this could be a good contribution, if you need it.
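
Roughly, on the indexing side the idea would be something like this (purely
hypothetical, not LARQ code; the field names are made up): store the named
graph URI next to the indexed text, so that a query-time TermQuery on the
graph field could restrict hits to a single named graph:

  import java.io.IOException;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;
  import org.apache.lucene.index.IndexWriter;

  public class GraphAwareIndexing {
      // Hypothetical sketch: index a literal's text together with the named graph
      // it comes from; the property function would then add a filter on "graph".
      public static void index(IndexWriter writer, String graphUri, String literalText)
              throws IOException {
          Document doc = new Document();
          doc.add(new Field("text", literalText, Field.Store.NO, Field.Index.ANALYZED));
          doc.add(new Field("graph", graphUri, Field.Store.YES, Field.Index.NOT_ANALYZED));
          writer.addDocument(doc);
      }
  }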

 2. Similarly, the index does not allow filtering by language on the
 query level. With multilingual data, you cannot make a query matching
 e.g. only English labels but will get hits from all the other languages
 as well.

Yep.

I have no proposal for this, but I understand the user need.

 3. The default implementation also doesn't store much context for the
 literal, meaning that you cannot restrict the search only to e.g.
 skos:prefLabel literal values in skos:Concept type resources. This will
 again increase the number of hits returned by the index internally.

I am not sure I follow this, or that I completely agree with you.

What you say is true, but LARQ provides a property function and you
can use it together with other triple patterns:

 {
   ?l pf:textMatch '...' .
   ?s skos:prefLabel ?l .
   ?s rdf:type skos:Concept .
 }

Now, we can argue about what a clever optimizer should/could do,
but from the point of view of the user this is quite good and
powerful, and it gets you what you want, doesn't it?

The syntax is very easy to remember and the property function
very easy to use.

The Lucene index can be kept quite simple and small.


 There may also be problems with prefix queries if you happen to hit the
 default BooleanQuery limit of 1024 clauses, but I haven't yet had this
 problem myself with LARQ. Another problem for use case B might be that
 the default Lucene StandardAnalyzer, which LARQ seems to use, filters
 common English stop words from the index and the query, which might
 interfere with the exact matching required for B.

 To be fair, other SPARQL text index implementations are not that good
 for prefix searches either. Virtuoso [1] requires at least 4 character
 prefixes to be specified (this can be changed by recompiling). AFAICT
 the 4store text index [2] doesn't support prefix queries at all, as the
 index structure requires whole words to be used (though possibly some
 creative use of subqueries and FILTER(REGEX()) could be used to still
 get some benefit of the index).

 Osma

 [1]
 http://docs.openlinksw.com/virtuoso/sparqlextensions.html#rdfsparqlrulefulltext

 [2] http://4store.org/trac/wiki/TextIndexing

 26.08.2012 22:49, Paolo Castagna wrote:
 Hi Osma

 On 20/08/12 11:10, Osma Suominen wrote:
 Hi Paolo!

 Thanks for your quick reply.

 17.08.2012 20:16, Paolo Castagna wrote:
 Does your problem go away without changing the code and using:
 ?lit pf:textMatch ( 'a*' 10 )

 I tested this but it didn't help. If I use a parameter less than 1000
 then I get even fewer hits, but values above 1000 don't have any effect.

 Right.

 I think the problem is this line in IndexLARQ.java:

 TopDocs topDocs = searcher.search(query, (Filter)null,
 LARQ.NUM_RESULTS ) ;

 As you can see the parameter for maximum number of hits is taken
 directly from the NUM_RESULTS constant. The value specified in the query
 has no effect on this level.

 Correct.

 It's

Re: LARQ prefix search results missing hits

2012-09-11 Thread Paolo Castagna
Apologies, this was a mistake.

Paolo

On 10 September 2012 23:07, Paolo Castagna castagna.li...@gmail.com wrote:
 Hi Osma

 On 28/08/12 14:22, Osma Suominen wrote:
 Hi Paolo!

 Thanks a lot for the fix! I have tested the latest snapshot and it now
 works as expected. At least until I add lots of new data and hit the new
 limit :)


 You're of course right about the search use case. I think the problem
 here is that the LARQ index can be used for two very different use cases:

 A. Traditional IR, in which the user cares about only the first few
 results. Lucene is obviously very good at this, though full advantage
 (especially for non-English languages) of it can only be achieved by
 using specific Analyzer implementations, which appears not to be
 supported in LARQ, at least not without writing some Java code.

 B. Speeding up queries on literals for e.g. autocomplete search. While
 this can be done without a text index using FILTER(REGEX()), the queries
 tend to be quite slow, as the filter is applied only afterwards. In this
 case it is important that the text index returns all possible hits, not
 just the first ones.

 I have no idea which is the more important use case for LARQ, but I'm
 currently only interested in B because of the requirements of the
 application I'm building (ONKI Light, a SKOS vocabulary browser for
 SPARQL endpoints).

 Do you have any idea/proposal to make LARQ be good for both these
 use cases?

 Currently the benefits of LARQ (at least for the out-of-the-box
 configuration for Fuseki+LARQ) for both A and B are somewhat diminished
 by these limitations:

 1. The index is global and contains data from all named graphs mixed up.
 This means that when you have many named graphs with different data (as
 I do), and try to query only one graph, the LARQ query part will still
 return hits from all the other graphs, slowing down later parts of the
 query.

 Yep.

 I though about this while ago, but I haven't actually tried to implement
 it. The changes to the index are trivial. The most
 difficult part perhaps is on the property function side, but
 maybe it's easy that as well.

 I think this could be a good contribution, if you need it.

 2. Similarly, the index does not allow filtering by language on the
 query level. With multilingual data, you cannot make a query matching
 e.g. only English labels but will get hits from all the other languages
 as well.

 Yep.

 I have no proposal for this, but I understand the user need.

 3. The default implementation also doesn't store much context for the
 literal, meaning that you cannot restrict the search only to e.g.
 skos:prefLabel literal values in skos:Concept type resources. This will
 again increase the number of hits returned by the index internally.

 I am not sure I follow this or I completely agree with you.

 What you say is true, but LARQ provides a property function and you
 can use it together with other triple patterns:

  {
?l pf:textMatch '...' .
?s skos:prefLabel ?l .
?s rdf:type skos:Concept .
  }

 Now, we can argue on what a clever optimizer should/could do,
 but from a point of view of the user, this is quite good and
 powerful and it gets you what you want. Isn't it?

 The syntax is very easy to remember and the property function
 very easy to use.

 The Lucene index can be kept quite simple and small.


 There may also be problems with prefix queries if you happen to hit the
 default BooleanQuery limit of 1024 clauses, but I haven't yet had this
 problem myself with LARQ. Another problem for use case B might be that
 the default Lucene StandardAnalyzer, which LARQ seems to use, filters
 common English stop words from the index and the query, which might
 interfer with the exact matching required for B.

 To be fair, other SPARQL text index implementations are not that good
 for prefix searches either. Virtuoso [1] requires at least 4 character
 prefixes to be specified (this can be changed by recompiling). AFAICT
 the 4store text index [2] doesn't support prefix queries at all, as the
 index structure requires whole words to be used (though possibly some
 creative use of subqueries and FILTER(REGEX()) could be used to still
 get some benefit of the index).

 Osma

 [1]
 http://docs.openlinksw.com/virtuoso/sparqlextensions.html#rdfsparqlrulefulltext

 [2] http://4store.org/trac/wiki/TextIndexing

 26.08.2012 22:49, Paolo Castagna wrote:
 Hi Osma

 On 20/08/12 11:10, Osma Suominen wrote:
 Hi Paolo!

 Thanks for your quick reply.

 17.08.2012 20:16, Paolo Castagna wrote:
 Does your problem go away without changing the code and using:
 ?lit pf:textMatch ( 'a*' 10 )

 I tested this but it didn't help. If I use a parameter less than 1000
 then I get even fewer hits, but values above 1000 don't have any effect.

 Right.

 I think the problem is this line in IndexLARQ.java:

 TopDocs topDocs = searcher.search(query, (Filter)null,
 LARQ.NUM_RESULTS ) ;

 As you can see the parameter for maximum number of hits

Re: SDB - community testing RC

2012-09-13 Thread Paolo Castagna
Ciao Francesco,
thanks for sharing. Just a couple of (late) comments.

On 6 September 2012 13:21, Francesco Panico fpan...@imolinfo.it wrote:
 My company (GruppoImola) has been working with Jena for two years. Our customers are
 banks and insurance companies, so it's important to store triples in a relational DB
 instead of on the file system.

If there is ever a Powered By Apache Jena page somewhere on the web,
you should consider being on that page. :-)

One question: I can imagine the culture of the customers you work with;
however, what would you say are their main motivations for using
RDBMS systems with Apache Jena?

 We focused on SDB. We have 5 customers with a semantic application in a
 production environment based on Jena, SDB and Semantic MediaWiki.

:-)

Grazie mille for your feedback.

Paolo


Re: Fuseki with LARQ - query anomaly

2012-11-16 Thread Paolo Castagna

On 24/10/12 12:11, Osma Suominen wrote:

Hi Elli!

It seems that at least part of your problem is having duplicates in the
LARQ index. Have you tried creating the Lucene index using the
larqbuilder command line tool, instead of removing the index and just
letting Fuseki rebuild it when it starts? See the end of my tutorial [1]
for a recipe.

As I understand it, unless you give larqbuilder the --allow-duplicates
option, it will try to avoid duplicates in the index. Though the index
building will take longer.


Exactly.

Duplicate removal slows down indexing. If you want to index a large
dataset you may want to disable it and go faster.


Maybe that option should be renamed. Any proposals?

Paolo



I've also noticed that it usually makes sense to place the pf:textMatch
pattern first in the query, otherwise it will be executed many times and
slow down the whole query, sometimes by a lot.

Hope this helps,
-Osma

[1] http://code.google.com/p/onki-light/wiki/InstallFusekiLARQ


On Tue, 23 Oct 2012, Elli Schwarz wrote:


Hello,


I am using Fuseki with Larq (thanks to Osma's recent instructions -
thanks Osma!)  where I recompiled Jena (after adding the Larq
dependency) to Jena revision 1399877 (this past Friday morning's
version of the trunk). I'm noticing the following anomaly when
querying the data:

First I insert the following triples:
prefix xsd: <http://www.w3.org/2001/XMLSchema#>
insert data {  graph <urn:test:foo> {
 <urn:test:s1> <urn:test:p1> "foo"^^xsd:string .
 <urn:test:s1> <urn:test:p2> "foo"^^xsd:string .
 <urn:test:s2> <urn:test:p3> "foo"^^xsd:string .
} }

Then I stop Fuseki, delete my index directory, and restart Fuseki. (As
an aside, I'd be very interested in a fix for this so I don't have to
restart Fuseki to rebuild the index - I'm watching JENA-164 and hoping
someone will be able to work on it soon!) Once Fuseki is back up, I
run the following query (I have default graph set as the union of
named graphs by default):
PREFIX pf: <http://jena.hpl.hp.com/ARQ/property#>
select * where {
 <urn:test:s1> ?p ?lit .
 ?lit pf:textMatch "foo" . }

and I get 2 results as I expect:


| p           | lit                                              |
| urn:test:p1 | "foo"^^<http://www.w3.org/2001/XMLSchema#string> |
| urn:test:p2 | "foo"^^<http://www.w3.org/2001/XMLSchema#string> |

However, when I flip the order of my query like this:

PREFIX pf: <http://jena.hpl.hp.com/ARQ/property#>
select * where {
 ?lit pf:textMatch "foo" .  <urn:test:s1> ?p ?lit . }

I get 6 results, instead of the two I expect:


| lit                                              | p           |
| "foo"^^<http://www.w3.org/2001/XMLSchema#string> | urn:test:p1 |
| "foo"^^<http://www.w3.org/2001/XMLSchema#string> | urn:test:p2 |
| "foo"^^<http://www.w3.org/2001/XMLSchema#string> | urn:test:p1 |
| "foo"^^<http://www.w3.org/2001/XMLSchema#string> | urn:test:p2 |
| "foo"^^<http://www.w3.org/2001/XMLSchema#string> | urn:test:p1 |
| "foo"^^<http://www.w3.org/2001/XMLSchema#string> | urn:test:p2 |
My guess as to what happens is that in the second query, the query
executor first executes the first line (the ?lit pf:textMatch "foo") and
this returns 3 results for "foo", since there are 3 literals for "foo".
Then, the next line of the query has three bindings for ?lit, so it
produces the 6 results above (2 for each "foo" literal, since there are
2 properties for urn:test:s1). I know that I can avoid this by using
a SELECT DISTINCT, but I still think the query shouldn't produce
different results based on switching the order. Additionally, if I put
this in a CONSTRUCT query, I can't use DISTINCT to eliminate the
duplicate results (unless I use a SELECT DISTINCT subquery, which I'd
rather avoid).

Another point I've noticed is that in my other (much more complex)
queries, against a much larger dataset (~1.5 million triples), if I
put the pf:textMatch line anywhere but in the very beginning of the
query, the query takes a VERY long time to execute. If I put it as the
first line in the query, the query runs quickly. My guess for this is
that the query is executed in order, and it takes much more work for
the query executer to run the other parts of my query which contain
many results, and then have to go back and essentially filter out
those results where the literal doesn't match the pf:textMatch. I can
always place the pf:textMatch line first, but then I'm back to the
problem mentioned above where I get back too many duplicate results.

Thank you very much for your help!
-Elli






Re: Fuseki with LARQ - query anomaly

2012-11-16 Thread Paolo Castagna

Hi Osma, hi Elli

On 02/11/12 10:34, Osma Suominen wrote:

Hi Elli!

[apparently your reply didn't come through the mailing list, but this
one should]

31.10.2012 23:11, Elli Schwarz kirjoitti:

Thank you for the tip. Yes, if I generate the index using the
larqbuilder command, I don't get the duplicates in the query, regardless
of the placement of the pf:testMatch line. (As an aside, why does the
default behavior of creating the index allow duplicates, but the default
of the larqbuilder command does not?)


Good to hear that eliminating duplicates works for you. I have no idea
why the defaults are as they are.


The LARQ index maps 'text' to RDF nodes; see IndexBuilderNode.java:

public void index(Node node, String indexStr)
{
    try {
        if ( avoidDuplicates() ) unindex(node, indexStr);
        Document doc = new Document() ;
        LARQ.store(doc, node) ;
        LARQ.index(doc, node, indexStr) ;
        getIndexWriter().addDocument(doc) ;
    } catch (IOException ex)
    { throw new ARQLuceneException("index", ex) ; }
}

avoidDuplicates() returns 'true' by default, because by default we want to
avoid duplicates and keep the Lucene index smaller.


if ( avoidDuplicates() ) unindex(node, indexStr); is 'ugly' and 
inefficient, but it is done to avoid having useless documents in the 
Lucene index, as you might have exactly the same RDF node/literal used 
in many triples.


I am open to suggestions to make this better or faster.


However, switching the order of where I place the pf:textMatch line
(while it may slow down the query), should not produce different
results, even if there are duplicates in the index. This would appear to
be a bug in how Larq applies the results of the index lookup to the
query.


Elli, could you provide an example with some data and your query?


I'm not sure whether getting or not getting duplicates in specific
situations can be considered a bug. But yes, the implementation of LARQ
seems to be rather simplistic. It might help if the raw index results
were filtered to weed out duplicates before applying them to the query.


How could we do this?


Then the choice whether to try to avoid duplicates during indexing would
only be an optimization issue.

BTW I'm not (so far) a LARQ developer, just a fellow user..


But you could help out with LARQ (if you are using it!).
Patches are always welcome, especially from fellow users! ;-)

By the way, many thanks for the documentation on how to use LARQ with 
Fuseki. Very useful (and it will save me time... I can just point people 
to your page from now on).


Paolo



-Osma



Hi Elli!

It seems that at least part of your problem is having duplicates in the
LARQ index. Have you tried creating the Lucene index using the
larqbuilder command line tool, instead of removing the index and just
letting Fuseki rebuild it when it starts? See the end of my tutorial [1]
for a recipe.

As I understand it, unless you give larqbuilder the --allow-duplicates
option, it will try to avoid duplicates in the index. Though the index
building will take longer.

I've also noticed that it usually makes sense to place the pf:textMatch
pattern first in the query, otherwise it will be executed many times and
slow down the whole query, sometimes by a lot.

Hope this helps,
-Osma

[1] http://code.google.com/p/onki-light/wiki/InstallFusekiLARQ


On Tue, 23 Oct 2012, Elli Schwarz wrote:

  Hello,
 
 
  I am using Fuseki with Larq (thanks to Osma's recent instructions -
thanks Osma!)  where I recompiled Jena (after adding the Larq
dependency) to Jena revision 1399877 (this past Friday morning's version
of the trunk). I'm noticing the following anomaly when querying the data:
 
  First I insert the following triples:
  prefix xsd: http://www.w3.org/2001/XMLSchema#
  insert data {  graph urn:test:foo {
  urn:test:s1 urn:test:p1 foo^^xsd:string .
  urn:test:s1 urn:test:p2 foo^^xsd:string .
  urn:test:s2 urn:test:p3 foo^^xsd:string .
  } }
 
  Then I stop Fuseki, delete my index directory, and restart Fuseki.
(As an aside, I'd be very interested in a fix for this so I don't have
to restart Fuseki to rebuild the index - I'm watching JENA-164 and
hoping someone will be able to work on it soon!) Once Fuseki is back up,
I run the following query (I have default graph set as the union of
named graphs by default):
  PREFIX pf: http://jena.hpl.hp.com/ARQ/property#
  select * where {
  urn:test:s1 ?p ?lit .
  ?lit pf:textMatch foo . }
 
  and I get 2 results as I expect:
 
  
  | p| lit  |
  
  | urn:test:p1 | foo^^http://www.w3.org/2001/XMLSchema#string |
  | urn:test:p2 | foo^^http://www.w3.org/2001/XMLSchema#string |
  
  However, when I flip the order 

Re: Accents-insensitive search with LARQ

2012-11-16 Thread Paolo Castagna

On 27/10/12 00:31, Ondřej Hoferek wrote:

Hi all,

I would like to use the full text search with LARQ for accent-insensitive
matching, i.e. the pattern { ?literal pf:textMatch "laska" } should also return
the literal "láska žije".

I know that in Lucene, there is a class ISOLatin1AccentFilter which can be
used while building/querying the index. However, I don't know how to use it
from within LARQ.

Is there any way to achieve my goal?


Hi Ondřej,
look at the LARQ sources and in particular at how IndexLARQ is used.

That class has a couple of constructors which take a Lucene Analyzer as a
parameter.

Please try them and see whether they help you.
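
For example, an accent-folding analyzer along these lines (just a sketch,
untested; it assumes a Lucene 3.x classpath, and you should use the Version
constant matching the Lucene release LARQ pulls in) could be passed to the
index-building code, so that 'láska' is indexed and searched as 'laska':

  import java.io.Reader;
  import org.apache.lucene.analysis.ASCIIFoldingFilter;
  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.util.Version;

  // Wrap the standard analysis chain and fold accented characters,
  // so that 'láska' is tokenized as 'laska' at index and at query time.
  public class AccentFoldingAnalyzer extends Analyzer {
      private final Analyzer inner = new StandardAnalyzer(Version.LUCENE_36);

      @Override
      public TokenStream tokenStream(String fieldName, Reader reader) {
          return new ASCIIFoldingFilter(inner.tokenStream(fieldName, reader));
      }
  }

The same analyzer must be used both when building the index and when querying
it, otherwise the folded and unfolded tokens will not match.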

Paolo



Best regards,
Ondrej





Re: LARQ index restrictions with Fuseki

2012-11-16 Thread Paolo Castagna

Hi Ondřej

On 26/10/12 21:50, Ondřej Hoferek wrote:

Hi all,

As far as I understood, LARQ index will be created for all the literals in
given dataset when used with Fuseki with configuration:

<#dataset1> rdf:type tdb:DatasetTDB ;
    tdb:location "/tmp/tdb" ;
    ja:textIndex "/tmp/lucene" .

Is it possible to restrict the index built within Fuseki to certain named
graphs/properties?


No, it is not possible to restrict the index to certain named graphs.

However, there are constructors in IndexBuilderString, for example,
which take a Property as a parameter to restrict which statements will
be indexed:


  public IndexBuilderString(Property property, String fileDir)

This is not exposed via Assembler/configuration. If you are willing to 
learn more about the Jena Assembler configuration mechanism, I am happy 
to work with you and help you on the LARQ side of the job.
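
In other words, building a restricted index with the API would look roughly
like this (a sketch, untested; the locations are made up and the package names
depend on the LARQ version you use):

  import com.hp.hpl.jena.query.Dataset;
  import com.hp.hpl.jena.rdf.model.Property;
  import com.hp.hpl.jena.rdf.model.ResourceFactory;
  import com.hp.hpl.jena.tdb.TDBFactory;
  import org.apache.jena.larq.IndexBuilderString;
  import org.apache.jena.larq.IndexLARQ;
  import org.apache.jena.larq.LARQ;

  public class BuildRestrictedIndex {
      public static void main(String[] args) {
          Dataset dataset = TDBFactory.createDataset("/tmp/tdb");      // TDB location
          Property prefLabel = ResourceFactory.createProperty(
              "http://www.w3.org/2004/02/skos/core#prefLabel");
          // Index only the literal values of skos:prefLabel statements.
          IndexBuilderString builder = new IndexBuilderString(prefLabel, "/tmp/lucene");
          builder.indexStatements(dataset.getDefaultModel().listStatements());
          builder.closeWriter();
          IndexLARQ index = builder.getIndex();
          LARQ.setDefaultIndex(index);   // make it available to pf:textMatch
      }
  }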


How would you like to specify the properties to index in your 
configuration file?


Paolo

 This might be handy if I would like to index only a
relatively small subset of all the data. With the LARQ API it is possible
to restrict the index built to certain properties only. Alternatively, is it
possible to build the LARQ index separately for a given dataset (TDB
dataset) using the API (or any utility) with such restrictions and let Fuseki
use it?

Best regards,
Ondrej





Re: Fuseki with LARQ - query anomaly

2012-11-16 Thread Paolo Castagna

On 16/11/12 22:20, Paolo Castagna wrote:

Elli, could you provide an example with some data and your query?


Apologies Elli, I now have found your example. ;-)

Paolo



Re: Fuseki with LARQ - query anomaly

2012-11-16 Thread Paolo Castagna

Hi Elli

On 23/10/12 16:47, Elli Schwarz wrote:



Hello,


I am using Fuseki with Larq (thanks to Osma's recent instructions - thanks 
Osma!)  where I recompiled Jena (after adding the Larq dependency) to Jena 
revision 1399877 (this past Friday morning's version of the trunk). I'm 
noticing the following anomaly when querying the data:

First I insert the following triples:
prefix xsd: <http://www.w3.org/2001/XMLSchema#>
insert data {  graph <urn:test:foo> {
  <urn:test:s1> <urn:test:p1> "foo"^^xsd:string .
  <urn:test:s1> <urn:test:p2> "foo"^^xsd:string .
  <urn:test:s2> <urn:test:p3> "foo"^^xsd:string .
} }

Then I stop Fuseki, delete my index directory, and restart Fuseki. (As an 
aside, I'd be very interested in a fix for this so I don't have to restart 
Fuseki to rebuild the index - I'm watching JENA-164 and hoping someone will be 
able to work on it soon!)


Re: JENA-164 ... yeah, I'd love to help you out, but IMHO it's a sort of
architectural issue in Jena. It should be easier for developers to
listen to events as triples are added/removed, so that you can attach
external indexes and keep them in sync.

There are multiple paths through which you can change RDF data: APIs,
SPARQL, etc. From a user's point of view, you would like to keep your
external index always in sync, no matter where the updates come from.
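
At the Model API level you can get part of the way with a listener, along
these lines (just a sketch, untested; and, as said above, updates that do not
go through this Model will simply not be seen, which is exactly the
architectural problem):

  import com.hp.hpl.jena.rdf.model.Model;
  import com.hp.hpl.jena.rdf.model.Statement;
  import com.hp.hpl.jena.rdf.model.listeners.StatementListener;

  public class IndexSyncListener extends StatementListener {
      @Override
      public void addedStatement(Statement s) {
          // e.g. add the literal to the external (Lucene) index here
          System.out.println("added:   " + s);
      }
      @Override
      public void removedStatement(Statement s) {
          // e.g. unindex the literal here
          System.out.println("removed: " + s);
      }

      public static void attachTo(Model model) {
          model.register(new IndexSyncListener());
      }
  }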


 Once Fuseki is back up, I run the following query (I have default 
graph set as the union of named graphs by default):

PREFIX pf: <http://jena.hpl.hp.com/ARQ/property#>
select * where {
  <urn:test:s1> ?p ?lit .
  ?lit pf:textMatch "foo" .
}

and I get 2 results as I expect:


| p           | lit                                              |
| urn:test:p1 | "foo"^^<http://www.w3.org/2001/XMLSchema#string> |
| urn:test:p2 | "foo"^^<http://www.w3.org/2001/XMLSchema#string> |

However, when I flip the order of my query like this:

PREFIX pf: <http://jena.hpl.hp.com/ARQ/property#>
select * where {
  ?lit pf:textMatch "foo" .
  <urn:test:s1> ?p ?lit .
}

I get 6 results, instead of the two I expect:


| lit                                              | p           |
| "foo"^^<http://www.w3.org/2001/XMLSchema#string> | urn:test:p1 |
| "foo"^^<http://www.w3.org/2001/XMLSchema#string> | urn:test:p2 |
| "foo"^^<http://www.w3.org/2001/XMLSchema#string> | urn:test:p1 |
| "foo"^^<http://www.w3.org/2001/XMLSchema#string> | urn:test:p2 |
| "foo"^^<http://www.w3.org/2001/XMLSchema#string> | urn:test:p1 |
| "foo"^^<http://www.w3.org/2001/XMLSchema#string> | urn:test:p2 |
My guess as to what happens is that in the second query, the query executor first
executes the first line (the ?lit pf:textMatch "foo") and this returns 3 results for
"foo", since there are 3 literals for "foo". Then, the next line of the query has
three bindings for ?lit, so it produces the 6 results above (2 for each "foo" literal,
since there are 2 properties for urn:test:s1). I know that I can avoid this by using a
SELECT DISTINCT, but I still think the query shouldn't produce different results based
on switching the order. Additionally, if I put this in a CONSTRUCT query, I can't use
DISTINCT to eliminate the duplicate results (unless I use a SELECT DISTINCT subquery,
which I'd rather avoid).


I am not sure; at the moment I have no clear idea how this problem
could be fixed.


Paolo



Another point I've noticed is that in my other (much more complex) queries, 
against a much larger dataset (~1.5 million triples), if I put the pf:textMatch 
line anywhere but in the very beginning of the query, the query takes a VERY 
long time to execute. If I put it as the first line in the query, the query 
runs quickly. My guess for this is that the query is executed in order, and it 
takes much more work for the query executor to run the other parts of my query 
which contain many results, and then have to go back and essentially filter out 
those results where the literal doesn't match the pf:textMatch. I can always 
place the pf:textMatch line first, but then I'm back to the problem mentioned 
above where I get back too many duplicate results.

Thank you very much for your help!
-Elli