A fault-tolerant and replicated data publishing solution (by Epimorphics)... and how to calculate the triples to add/remove?
Hi,

I've just read this blog post from Andy: http://www.epimorphics.com/web/wiki/epimorphics-builds-data-publish-platform-environment-agency

It describes a quite simple fault-tolerant and replicated data publishing solution using Apache Jena and Fuseki. Interesting.

It's a master/slave architecture. The master (which Andy calls the 'controller server' in his post) receives all updates and calculates the triples to be added and the triples to be removed, so that changes are 'idempotent' (i.e. they can be reapplied multiple times, in the same order, with the same effect).

It would be interesting to know whether the 'controller server' exposes a full SPARQL Update endpoint and/or the Graph Store HTTP Protocol and, if so, how the triples to be added/removed are calculated. (This is something I have wanted to learn for a while, but I have not found the time yet... a small example would be wonderful! ;-))

To conclude, I fully agree with the quite simple design: simple systems are easier to operate. The approach described can work well in a lot of scenarios where the rate of updates/writes isn't excessive and you have mostly reads (which I still believe to be the case most of the time with RDF data, since the data is often human generated/curated).

My hope is to see something similar in the 'open', so that other Apache Jena and Fuseki users can benefit from a highly available, open source publishing solution for RDF data (and focus their energies/efforts elsewhere: on the quality of their data modeling, data, applications, user experience, etc.).

Paolo

PS: Disclaimer: I don't work for Epimorphics, these are just my personal opinions and, last but not least, I love simplicity.
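On the question of how the triples to add/remove could be calculated: the blog post does not say, so what follows is only a minimal sketch, assuming the old and the new state of a graph are both available as sets of ground triples (no blank nodes). The names compute_delta and apply_delta are hypothetical and are not part of Jena or Fuseki.

```python
def compute_delta(old, new):
    """Return (to_add, to_remove), the changes that turn `old` into `new`."""
    old, new = set(old), set(new)
    return new - old, old - new


def apply_delta(graph, to_add, to_remove):
    """Apply a delta. Removing an absent triple or adding a present one is a no-op."""
    return (set(graph) - set(to_remove)) | set(to_add)


old = {("ex:s", "ex:p", "1"), ("ex:s", "ex:q", "2")}
new = {("ex:s", "ex:p", "1"), ("ex:s", "ex:q", "3")}

to_add, to_remove = compute_delta(old, new)
once = apply_delta(old, to_add, to_remove)
twice = apply_delta(once, to_add, to_remove)  # replaying the same delta
```

Because the delta is expressed as concrete triples rather than as a pattern-based update, replaying it leaves the graph unchanged, which is the 'idempotent' property described above.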
Re: Ideas for an efficient TDB check?
Hi André,

I know exactly how you feel and I have had exactly the same need at times. How do you know if your TDB indexes are all fine? Add the word 'production' to it and everything becomes more 'fun'. :-)

Fortunately, we use replication and have the ability to replay updates going back as far as we want/need. This makes things more 'relaxing'. But this is not the answer you are searching for right now.

I do not have *the* answer for you, nor a tool, but in the past I've done something similar to what you suggested, a sort of TDB index verifier/health checker. Here [1], it's just a quick and dirty solution (not scalable... it keeps stuff in memory, etc.). But, perhaps, it gives you some ideas.

If a TDB health checking utility is useful and feasible, we should probably open a JIRA issue for it and gather ideas on how best to implement it. It should not be too much work.

You are still using TDB 0.8.10, but the on-disk format hasn't changed... so it's reasonable to expect such functionality would work with your indexes as well.

My 2 cents,
Paolo

[1] https://github.com/castagna/tdbloader4/blob/f5363fa49d16a04a362898c1a5084ade620ee81b/src/test/java/dev/TDBVerifier.java

Dr. André Lanka wrote:
> Hello Jena-Users,
>
> we are using Jena+TDB in production and are looking for an efficient method to check the validity of the TDB files on disk.
>
> Our situation is as follows. With Jena 2.6.4 and TDB 0.8.10, each of our servers stores triples in up to 4000 different TDB stores on its local hard drive. On average each store holds 1 million triples (with high variance). To keep our system running smoothly, we need massively parallel write access to the different stores, so one huge named graph is not an alternative. We also need to have all stores open and accessible.
>
> In order to have that many TDB stores open in parallel, we customised the TDB code for our needs. For instance, we introduced read caches shared between all stores (to avoid memory problems). We also introduced basic capabilities to roll back transactions. (We took control over all data read from or written to ObjectFile and BlockMgr.) So, in our situation we can't switch to the new TDB version overnight.
>
> Now, the problem is that we had some disk issues a few days ago and want to check which stores are broken (we know some of them are). Our initial idea is to iterate over all statements in the store and collect every S, P and O used in it. The second step would be to check that each such URI is correctly mapped to a nodeID, and the other way round.
>
> Unfortunately we are not sure whether this will cover every possible file problem. Also, we think there could be a more efficient way to check the internal data structures.
>
> So, any idea (both high and low level) is highly appreciated.
>
> Thanks in advance
> André
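The two-step check described above (collect every S, P and O, then verify the node/nodeID mapping in both directions) can be sketched on a toy in-memory model. This is an illustration under stated assumptions, not the TDB file format and not the linked TDBVerifier: the store is modelled as three lists of id tuples plus two dictionaries, and check_store is a hypothetical name.

```python
def check_store(indexes, node_to_id, id_to_node):
    """Return a list of human-readable problems found in a toy store model."""
    problems = []

    # 1. All indexes must hold the same triples once reordered to (S, P, O).
    spo = set(indexes["SPO"])
    pos = {(s, p, o) for (p, o, s) in indexes["POS"]}
    osp = {(s, p, o) for (o, s, p) in indexes["OSP"]}
    if not (spo == pos == osp):
        problems.append("indexes disagree")

    # 2. Every node id used in a triple must round-trip through both mappings.
    for node_id in sorted({i for triple in spo | pos | osp for i in triple}):
        node = id_to_node.get(node_id)
        if node is None:
            problems.append(f"id {node_id} has no node")
        elif node_to_id.get(node) != node_id:
            problems.append(f"node {node!r} does not map back to id {node_id}")
    return problems


id_to_node = {1: "ex:s", 2: "ex:p", 3: "ex:o"}
node_to_id = {node: node_id for node_id, node in id_to_node.items()}
good = {"SPO": [(1, 2, 3)], "POS": [(2, 3, 1)], "OSP": [(3, 1, 2)]}

ok_report = check_store(good, node_to_id, id_to_node)
lost_index = check_store(dict(good, OSP=[]), node_to_id, id_to_node)
lost_node = check_store(good, node_to_id, {1: "ex:s", 2: "ex:p"})
```

Like the quick and dirty verifier mentioned in the reply, this keeps everything in memory, so it only shows the shape of the check; it says nothing about corruption below the level of triples and node ids.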
Re: Import Messures
Stefan Scheffler wrote:
> Hey Paolo.
> Thanks for your reply. I used tdbloader2 with my own Tokenizer / Errorhandler (which just catches / skips errors and writes them into a file). The command was:
>
>     ./tdbloader2 --loc=store srcpath/*
>
> Is it possible to do incremental loads with the script files, or do I have to write my own program?

Hi Stefan,
if you want to run an incremental load you should use tdbloader, not tdbloader2. tdbloader supports incremental loads, tdbloader2 does not.

If you are loading large datasets make sure you have enough RAM (you can load data on a machine with a lot of RAM and move the indexes elsewhere).

Paolo

> Regards,
> Stefan
>
> On 24.06.2012 10:42, Paolo Castagna wrote:
>> Hi Stefan,
>> as Rob said, loading data into an empty TDB store is different from loading data into an existing TDB store.
>>
>> I assume that for your second data load you used tdbloader, not tdbloader2. tdbloader2 does not even support incremental data loads (i.e. it will overwrite your existing data). I suspect this is what is going on.
>>
>> Can you share the exact commands you used as well as links to the RDF data? (This way others can replicate your experiments.)
>>
>> Regards,
>> Paolo
>>
>> Stefan Scheffler wrote:
>>> Hello,
>>> At the moment I am doing some performance checks on TDB. The first thing I checked was the import with tdbloader2, and I got some weird results. Maybe someone can help me out. Here are my test base and the results.
>>>
>>> The first test was to store 12 GB of triples into an empty store (I used the German DBpedia).
>>>
>>> Load time: 16 minutes
>>> Average loading: ca. 81,000 triples/second
>>> Index time: 40 minutes
>>> Store size: 9.3 GB
>>>
>>> The second test was to store the same data into an already filled store. Before starting the import I created a store with 348,398,593 triples from DNB and HBZ (which are German libraries; store size: 33 GB). Then I started to load the German DBpedia into it.
>>>
>>> Load time: 3 hours and 4 minutes
>>> Average loading: ca. 7,200 triples/second
>>> Index time: 38 minutes
>>> Store size: 19 GB!
>>>
>>> Why does the loading time increase so immensely? My expectation was that the index time would increase, but it does not. There were no other big processes running at the same time. And why does the store size shrink to 19 GB? I am totally confused about that point.
>>>
>>> With friendly regards
>>> Stefan
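The reported figures can be cross-checked with a little arithmetic. This sketch assumes that 'load time' and 'average loading' describe the same phase, so that their product approximates the number of triples loaded; the function and variable names are illustrative only.

```python
def triples_loaded(minutes, triples_per_second):
    """Approximate triple count from a duration and an average rate."""
    return minutes * 60 * triples_per_second


first = triples_loaded(16, 81_000)           # load into the empty store
second = triples_loaded(3 * 60 + 4, 7_200)   # load into the pre-filled store
slowdown = 81_000 / 7_200                    # ratio of the two average rates
```

Both runs come out at roughly 78 to 79 million triples, which is consistent with the same DBpedia dump being loaded twice, while the average rate drops by a factor of about 11. That fits a store size of 19 GB after the second run only if the earlier 33 GB of data is no longer there, as the reply about tdbloader2 overwriting existing data suggests.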
Re: just trying to read in an RDF file ...
Andy Seaborne wrote:
> On 24/06/12 09:23, Paolo Castagna wrote:
>> Jena has moved to the Apache Software Foundation and it is not a TLP (i.e. top level project).
>
> s/not/now/

Sorry. Yes, now! ;-)

Paolo

> Jena is a TLP.
>
> Andy
Re: Want to run SPARQL Query with Hadoop Map Reduce Framework
Hi Mizanur,

when you have big RDF datasets, it might make sense to use MapReduce (but only if you already have a Hadoop cluster at hand. Is this your case?). You say that your data is 'huge'; just for the sake of curiosity... how many triples/quads is 'huge'? ;-)

Most of the use cases I've seen related to statistics on RDF datasets were trivial MapReduce jobs. For a couple of examples of using MapReduce with RDF datasets have a look here:

https://github.com/castagna/jena-grande
https://github.com/castagna/tdbloader4

This, for example, is certainly not exactly what you need, but I am sure that with a few changes you can get what you want:

https://github.com/castagna/tdbloader4/blob/master/src/main/java/org/apache/jena/tdbloader4/StatsDriver.java

Last but not least, you'll need to dump your RDF data out onto HDFS. I suggest you use the N-Triples/N-Quads serialization formats.

Running SPARQL queries on top of a Hadoop cluster is another (long and not easy) story. But it might be possible to translate part of the SPARQL algebra into Pig Latin scripts and use Pig.

In my opinion, however, it makes more sense to use MapReduce to filter/slice massive datasets, load the result into a triple store and refine your data analysis using SPARQL there.

My 2 cents,
Paolo

Md. Mizanur Rahoman wrote:
> Dear All,
>
> I want to collect some statistics over RDF data. My triple store is Virtuoso and I am using Jena for executing my queries. I want to get statistics such as i) how many resources there are in my dataset, ii) in which position (i.e., sub/prd/obj) each resource occurs, etc. As my data is huge, I want to use Hadoop MapReduce to calculate such statistics.
>
> Can you please suggest an approach?
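To illustrate why such statistics are 'trivial' MapReduce jobs, here is a minimal single-process sketch of the map and reduce steps over N-Triples lines. It is not the linked StatsDriver; map_line and reduce_counts are hypothetical names, only IRIs (terms starting with '<') are counted as resources, and the line splitting is deliberately simplistic rather than a full N-Triples parser.

```python
from collections import Counter

POSITIONS = ("subject", "predicate", "object")


def map_line(line):
    """Map step: emit (position, term) pairs for the IRIs of one triple."""
    parts = line.strip().rstrip(".").strip().split(None, 2)
    if len(parts) != 3:
        return []  # skip blank or malformed lines
    return [(pos, term) for pos, term in zip(POSITIONS, parts)
            if term.startswith("<")]


def reduce_counts(pairs):
    """Reduce step: per term, count its occurrences in each position."""
    stats = {}
    for pos, term in pairs:
        stats.setdefault(term, Counter())[pos] += 1
    return stats


lines = [
    '<ex:a> <ex:knows> <ex:b> .',
    '<ex:b> <ex:knows> <ex:a> .',
    '<ex:a> <ex:name> "Alice" .',
]
pairs = [pair for line in lines for pair in map_line(line)]
stats = reduce_counts(pairs)
```

Here len(stats) gives the number of distinct resources and each Counter says in which positions a resource occurs. On a cluster the same two functions would become the mapper and the reducer, with the term as the key.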
Re: Want to run SPARQL Query with Hadoop Map Reduce Framework
Md. Mizanur Rahoman wrote:
> Hi Paolo,
>
> Thanks for your reply. Right now I am only using DBPedia, Geoname and NYTimes from the LOD cloud. Later on I want to extend my dataset.

Ok, so it's big, but not huge! ;-) If you have enough RAM you can do everything on a single machine.

> By the way, yes, I can use SPARQL directly to collect my required statistics, but my assumption is that using Hadoop could give me some boost in collecting those stats.

Well, it all depends on whether you already have a Hadoop cluster you can use. If not, a single machine with a lot of RAM might be easier/faster/better.

> I will get back to you after going through your links.

Sure, let me know how it goes.

Paolo

> - Sincerely
> Md Mizanur
Re: LARQ prefix search results missing hits
Hi Osma,

thanks for your help and feedback.

Does your problem go away without changing the code and using:

    ?lit pf:textMatch ( 'a*' 10 )

It's not a problem adding a couple of '0'... However, I am thinking that this would just shift the problem, wouldn't it?

Paolo

On 15/08/12 10:31, Osma Suominen wrote:
> Hi Paolo!
>
> Thanks for your reply and sorry for the delay.
>
> I tested this again with today's svn snapshot and it's still a problem.
>
> However, after digging a bit further I found this in jena-larq/src/main/java/org/apache/jena/larq/LARQ.java:
>
> --clip--
>     // The number of results returned by default
>     public static final int NUM_RESULTS = 1000 ; // should we increase this? -- PC
> --clip--
>
> I changed NUM_RESULTS to 100000 (added two zeros), built and installed my modified LARQ with mvn install (NB this required tweaking arq.ver and tdb.ver in jena-larq/pom.xml to match the current svn versions), rebuilt Fuseki and now the problem is gone!
>
> I would suggest that this constant be increased to something larger than 1000. Based on the code comment, you seem to have had similar thoughts sometime in the past :)
>
> Thanks,
> Osma
>
> 15.07.2012 11:21, Paolo Castagna wrote:
>> Hi Osma,
>> first of all, thanks for sharing your experience and clearly describing your problem. Further comments inline.
>>
>> On 13/07/12 14:13, Osma Suominen wrote:
>>> Hello!
>>>
>>> I'm trying to use a Fuseki SPARQL endpoint together with LARQ to create a system for accessing SKOS thesauri. The user interface includes an autocompletion widget. The idea is to use the LARQ index to make fast prefix queries on the concept labels.
>>>
>>> However, I've noticed that in some situations I get fewer results from the index than I'd expect. This seems to happen when the LARQ part of the query internally produces many hits, such as when doing a single character prefix query (e.g. ?lit pf:textMatch 'a*').
>>>
>>> I'm using Fuseki 0.2.4-SNAPSHOT taken from SVN trunk on 2012-07-10 and LARQ 1.0.0-incubating. I compiled Fuseki with LARQ by adding the LARQ dependency to pom.xml and running mvn package. Other than this issue, Fuseki and LARQ queries seem to work fine. I'm using Ubuntu Linux 12.04 LTS amd64 with OpenJDK 1.6.0_24 installed from the standard Ubuntu packages.
>>>
>>> Steps to repeat:
>>>
>>> 1. package Fuseki with LARQ, as described above
>>> 2. start Fuseki with the attached configuration file, i.e. ./fuseki-server --config=larq-config.ttl
>>> 3. I'm using the STW thesaurus as an easily accessible example data set (though the problem was originally found with other data sets):
>>>    - download http://zbw.eu/stw/versions/latest/download/stw.rdf.zip
>>>    - unzip so you have stw.rdf
>>> 4. load the thesaurus file into the endpoint: ./s-put http://localhost:3030/ds/data default stw.rdf
>>> 6. build the LARQ index, e.g. this way:
>>>    - kill Fuseki
>>>    - rm -r /tmp/lucene
>>>    - start Fuseki again, so the index will be built
>>> 7. Make SPARQL queries from the web interface at http://localhost:3030
>>>
>>> First try this SPARQL query:
>>>
>>>     PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
>>>     PREFIX pf: <http://jena.hpl.hp.com/ARQ/property#>
>>>     SELECT DISTINCT * WHERE {
>>>       ?lit pf:textMatch "ar*" .
>>>       ?conc skos:prefLabel ?lit .
>>>       FILTER(REGEX(?lit, '^ar.*', 'i'))
>>>     } ORDER BY ?lit
>>>
>>> I get 120 hits, including "Arab"@en.
>>>
>>> Now try the same query, but change the pf:textMatch argument to "a*". This way I get only 32 results, not including "Arab"@en, even though the shorter prefix query should match a superset of what was matched by the first query (the regex should still filter it down to the same result set).
>>>
>>> This issue is not just about single character prefix queries. With enough data sets loaded into the same index, this happens with longer prefix queries as well.
>>>
>>> I think that the problem might be related to Lucene's default limitation of a maximum of 1024 clauses in boolean queries (and thus prefix query matches), as described in the Lucene FAQ: http://wiki.apache.org/lucene-java/LuceneFAQ#Why_am_I_getting_a_TooManyClauses_exception.3F
>>
>> Yes, I think your hypothesis might be correct (I've not verified it yet).
>>
>>> In case this is the problem, is there any way to tell LARQ to use a higher BooleanQuery.setMaxClauseCount() value so that this limit is not triggered? I find it a bit disturbing that hits are silently being lost. I couldn't see any special output in the Fuseki log.
>>
>> Not sure about this.
>>
>> Paolo
>>
>>> Am I doing something wrong? If this is a genuine problem in LARQ, I can of course make a bug report.
>>>
>>> Thanks and best regards,
>>> Osma Suominen
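The effect reported in this thread, where a shorter prefix returns fewer final results than a longer one, can be reproduced with a toy model of a capped search followed by a filter. This is only a sketch: text_match and autocomplete are hypothetical stand-ins, and the toy index returns hits in insertion order, whereas Lucene returns its top-scoring hits.

```python
def text_match(index, prefix, limit):
    """Toy prefix search that returns at most `limit` hits."""
    hits = [lit for lit in index if lit.lower().startswith(prefix)]
    return hits[:limit]


def autocomplete(index, index_prefix, wanted_prefix, limit):
    """Search the index first, then filter, as the SPARQL query above does."""
    return [lit for lit in text_match(index, index_prefix, limit)
            if lit.lower().startswith(wanted_prefix)]


# 2000 labels starting with 'a' but not 'ar', followed by two 'ar' labels.
index = [f"a{n:04d}" for n in range(2000)] + ["Arab", "Arbitrage"]

narrow = autocomplete(index, "ar", "ar", limit=1000)
broad = autocomplete(index, "a", "ar", limit=1000)
```

With the cap at 1000, the broad 'a' search is cut off before it reaches the 'ar' labels, so the later filter has nothing left to keep, and the hits are lost silently.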
Re: LARQ prefix search results missing hits
Hi Osma

On 20/08/12 11:10, Osma Suominen wrote:
> Hi Paolo!
>
> Thanks for your quick reply.
>
> 17.08.2012 20:16, Paolo Castagna wrote:
>> Does your problem go away without changing the code and using:
>>
>>     ?lit pf:textMatch ( 'a*' 10 )
>
> I tested this but it didn't help. If I use a parameter less than 1000 then I get even fewer hits, but values above 1000 don't have any effect.

Right.

> I think the problem is this line in IndexLARQ.java:
>
>     TopDocs topDocs = searcher.search(query, (Filter)null, LARQ.NUM_RESULTS ) ;
>
> As you can see, the parameter for the maximum number of hits is taken directly from the NUM_RESULTS constant. The value specified in the query has no effect at this level.

Correct.

>> It's not a problem adding a couple of '0'... However, I am thinking that this would just shift the problem, wouldn't it?
>
> You're right, it would just shift the problem, but a sufficiently large value could be used that never caused problems in practice. Maybe you could consider NUM_RESULTS = Integer.MAX_VALUE ? :)

A lot of search use cases are about driving a UI for people, and often only the first few results are necessary. Try to keep hitting 'next' on Google: how many results can you get? ;-)

Anyway, I increased the NUM_RESULTS constant.

> Or maybe LARQ should use another variant of Lucene's IndexSearcher.search(), one which takes a Collector object instead of the integer n parameter. E.g. this: http://lucene.apache.org/core/old_versioned_docs/versions/3_1_0/api/core/org/apache/lucene/search/IndexSearcher.html#search%28org.apache.lucene.search.Query,%20org.apache.lucene.search.Filter,%20org.apache.lucene.search.Collector%29

Yes. That would be the thing to use if we want to retrieve all the results from Lucene. More thinking is necessary here...

In the meantime, you can find a LARQ SNAPSHOT here: https://repository.apache.org/content/groups/snapshots/org/apache/jena/jena-larq/1.0.1-SNAPSHOT/

Paolo

> Thanks,
> Osma
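The behaviour described in this exchange (limits below 1000 reduce the hits, limits above 1000 change nothing) and the Collector-style alternative can be contrasted in a small model. This is a sketch of the observed behaviour only, not LARQ or Lucene code; capped_search and collect_all are hypothetical names.

```python
NUM_RESULTS = 1000  # hard cap, mirroring the constant discussed in the thread


def capped_search(hits, requested):
    """Top-N search: the per-query limit only applies after the hard cap."""
    return hits[:NUM_RESULTS][:requested]


def collect_all(hits):
    """Collector-style search: every hit is visited, there is no fixed N."""
    collected = []
    for hit in hits:
        collected.append(hit)
    return collected


hits = list(range(2500))  # pretend the index matched 2500 documents

small = capped_search(hits, 10)
large = capped_search(hits, 100000)
everything = collect_all(hits)
```

Raising the constant moves the cap but keeps the same shape, which is the 'shifting the problem' concern; collecting every hit removes the cap at the price of unbounded result sizes.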
Re: LARQ prefix search results missing hits
Hi Osma

On 28/08/12 14:22, Osma Suominen wrote:
> Hi Paolo!
>
> Thanks a lot for the fix! I have tested the latest snapshot and it now works as expected. At least until I add lots of new data and hit the new limit :)
>
> You're of course right about the search use case. I think the problem here is that the LARQ index can be used for two very different use cases:
>
> A. Traditional IR, in which the user cares about only the first few results. Lucene is obviously very good at this, though full advantage of it (especially for non-English languages) can only be achieved by using specific Analyzer implementations, which appears not to be supported in LARQ, at least not without writing some Java code.
>
> B. Speeding up queries on literals, e.g. for autocomplete search. While this can be done without a text index using FILTER(REGEX()), the queries tend to be quite slow, as the filter is applied only afterwards. In this case it is important that the text index returns all possible hits, not just the first ones.
>
> I have no idea which is the more important use case for LARQ, but I'm currently only interested in B because of the requirements of the application I'm building (ONKI Light, a SKOS vocabulary browser for SPARQL endpoints).

Do you have any idea/proposal to make LARQ good for both these use cases?

> Currently the benefits of LARQ (at least for the out-of-the-box configuration for Fuseki+LARQ) for both A and B are somewhat diminished by these limitations:
>
> 1. The index is global and contains data from all named graphs mixed up. This means that when you have many named graphs with different data (as I do), and try to query only one graph, the LARQ query part will still return hits from all the other graphs, slowing down later parts of the query.

Yep. I thought about this a while ago, but I haven't actually tried to implement it. The changes to the index are trivial. The most difficult part is perhaps on the property function side, but maybe that is easy as well. I think this could be a good contribution, if you need it.

> 2. Similarly, the index does not allow filtering by language at the query level. With multilingual data, you cannot make a query matching e.g. only English labels but will get hits from all the other languages as well.

Yep. I have no proposal for this, but I understand the user need.

> 3. The default implementation also doesn't store much context for the literal, meaning that you cannot restrict the search to e.g. skos:prefLabel literal values in skos:Concept type resources. This will again increase the number of hits returned by the index internally.

I am not sure I follow this, or that I completely agree with you. What you say is true, but LARQ provides a property function and you can use it together with other triple patterns:

    {
      ?l pf:textMatch '...' .
      ?s skos:prefLabel ?l .
      ?s rdf:type skos:Concept .
    }

Now, we can argue about what a clever optimizer should/could do, but from the point of view of the user, this is quite good and powerful and it gets you what you want. Doesn't it? The syntax is very easy to remember and the property function very easy to use. The Lucene index can be kept quite simple and small.

> There may also be problems with prefix queries if you happen to hit the default BooleanQuery limit of 1024 clauses, but I haven't yet had this problem myself with LARQ.
>
> Another problem for use case B might be that the default Lucene StandardAnalyzer, which LARQ seems to use, filters common English stop words from the index and the query, which might interfere with the exact matching required for B.
>
> To be fair, other SPARQL text index implementations are not that good for prefix searches either. Virtuoso [1] requires at least 4 character prefixes to be specified (this can be changed by recompiling). AFAICT the 4store text index [2] doesn't support prefix queries at all, as the index structure requires whole words to be used (though possibly some creative use of subqueries and FILTER(REGEX()) could be used to still get some benefit from the index).
>
> Osma
>
> [1] http://docs.openlinksw.com/virtuoso/sparqlextensions.html#rdfsparqlrulefulltext
> [2] http://4store.org/trac/wiki/TextIndexing
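Limitations 1 and 2 above both come down to the index knowing nothing about a literal beyond its text. As a purely hypothetical sketch of the idea (LARQ, as discussed here, does not store these fields, and Entry and prefix_search are invented names), an index entry could carry the language and the named graph so that the search itself can be restricted:

```python
from collections import namedtuple

# One index entry per literal, with the extra context kept alongside the text.
Entry = namedtuple("Entry", "text lang graph")


def prefix_search(entries, prefix, lang=None, graph=None):
    """Prefix match, optionally restricted by language and by named graph."""
    prefix = prefix.lower()
    return [e.text for e in entries
            if e.text.lower().startswith(prefix)
            and (lang is None or e.lang == lang)
            and (graph is None or e.graph == graph)]


entries = [
    Entry("Arab", "en", "urn:g:stw"),
    Entry("Araber", "de", "urn:g:stw"),
    Entry("Architecture", "en", "urn:g:other"),
]

unrestricted = prefix_search(entries, "ar")
restricted = prefix_search(entries, "ar", lang="en", graph="urn:g:stw")
```

Restricting inside the index keeps unwanted hits from reaching the rest of the query at all, which matters most for use case B, where the number of internal hits is the bottleneck.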
Re: LARQ prefix search results missing hits
Apologies, this was a mistake.

Paolo

On 10 September 2012 23:07, Paolo Castagna castagna.li...@gmail.com wrote:
> Hi Osma
> [...]
Re: SDB - community testing RC
Ciao Francesco,

thanks for sharing. Just a couple of (late) comments.

On 6 September 2012 13:21, Francesco Panico fpan...@imolinfo.it wrote:
> My company (GruppoImola) has been working with Jena for two years. Our customers are banks and insurance companies, so it's important to store triples in a relational DB instead of the file system.

If there is ever a 'Powered by Apache Jena' page somewhere on the web, you should consider being on it. :-)

One question: I can imagine the culture of the customers you work with. However, what would you say are their main motivations for using RDBMS systems with Apache Jena?

> We focused on SDB. We have 5 customers with a semantic application in a production environment based on Jena, SDB and Semantic MediaWiki.

:-) Grazie mille for your feedback.

Paolo
Re: Fueski with Larq - query anomaly
On 24/10/12 12:11, Osma Suominen wrote:

Hi Elli! It seems that at least part of your problem is having duplicates in the LARQ index. Have you tried creating the Lucene index using the larqbuilder command line tool, instead of removing the index and just letting Fuseki rebuild it when it starts? See the end of my tutorial [1] for a recipe. As I understand it, unless you give larqbuilder the --allow-duplicates option, it will try to avoid duplicates in the index. Though the index building will take longer.

Exactly. Duplicate removal slows down indexing. If you want to index a large dataset, you want to disable it and go faster. Maybe that option should be renamed. Proposal?

Paolo

I've also noticed that it usually makes sense to place the pf:textMatch pattern first in the query, otherwise it will be executed many times and slow down the whole query, sometimes by a lot.

Hope this helps,
-Osma

[1] http://code.google.com/p/onki-light/wiki/InstallFusekiLARQ

On Tue, 23 Oct 2012, Elli Schwarz wrote:

Hello, I am using Fuseki with Larq (thanks to Osma's recent instructions - thanks Osma!) where I recompiled Jena (after adding the Larq dependency) to Jena revision 1399877 (this past Friday morning's version of the trunk). I'm noticing the following anomaly when querying the data: First I insert the following triples:

    prefix xsd: <http://www.w3.org/2001/XMLSchema#>
    insert data {
      graph <urn:test:foo> {
        <urn:test:s1> <urn:test:p1> "foo"^^xsd:string .
        <urn:test:s1> <urn:test:p2> "foo"^^xsd:string .
        <urn:test:s2> <urn:test:p3> "foo"^^xsd:string .
      }
    }

Then I stop Fuseki, delete my index directory, and restart Fuseki. (As an aside, I'd be very interested in a fix for this so I don't have to restart Fuseki to rebuild the index - I'm watching JENA-164 and hoping someone will be able to work on it soon!)

Once Fuseki is back up, I run the following query (I have default graph set as the union of named graphs by default):

    PREFIX pf: <http://jena.hpl.hp.com/ARQ/property#>
    select * where {
      <urn:test:s1> ?p ?lit .
      ?lit pf:textMatch "foo" .
    }

and I get 2 results as I expect:

    | p             | lit                                              |
    | <urn:test:p1> | "foo"^^<http://www.w3.org/2001/XMLSchema#string> |
    | <urn:test:p2> | "foo"^^<http://www.w3.org/2001/XMLSchema#string> |

However, when I flip the order of my query like this:

    PREFIX pf: <http://jena.hpl.hp.com/ARQ/property#>
    select * where {
      ?lit pf:textMatch "foo" .
      <urn:test:s1> ?p ?lit .
    }

I get 6 results, instead of the two I expect:

    | lit                                              | p             |
    | "foo"^^<http://www.w3.org/2001/XMLSchema#string> | <urn:test:p1> |
    | "foo"^^<http://www.w3.org/2001/XMLSchema#string> | <urn:test:p2> |
    | "foo"^^<http://www.w3.org/2001/XMLSchema#string> | <urn:test:p1> |
    | "foo"^^<http://www.w3.org/2001/XMLSchema#string> | <urn:test:p2> |
    | "foo"^^<http://www.w3.org/2001/XMLSchema#string> | <urn:test:p1> |
    | "foo"^^<http://www.w3.org/2001/XMLSchema#string> | <urn:test:p2> |

My guess as to what happens is that in the second query, first the query executor executes the first line (the ?lit pf:textMatch "foo") and this returns 3 results for foo, since there are 3 literals for foo. Then, the next line of the query has three bindings to ?lit, so it produces the 6 results above (2 for each foo literal since there are 2 properties for urn:test:s1). I know that I can avoid this by using a SELECT DISTINCT, but I still think the query shouldn't produce different results based on switching the order. Additionally, if I put this in a CONSTRUCT query, I can't use DISTINCT to eliminate the duplicate results (unless I use a SELECT DISTINCT subquery which I'd rather avoid).

Another point I've noticed is that in my other (much more complex) queries, against a much larger dataset (~1.5 million triples), if I put the pf:textMatch line anywhere but in the very beginning of the query, the query takes a VERY long time to execute. If I put it as the first line in the query, the query runs quickly. My guess for this is that the query is executed in order, and it takes much more work for the query executor to run the other parts of my query which contain many results, and then have to go back and essentially filter out those results where the literal doesn't match the pf:textMatch. I can always place the pf:textMatch line first, but then I'm back to the problem mentioned above where I get back too many duplicate results.

Thank you very much for your help!
-Elli
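To make Elli's guess concrete, here is a minimal sketch in plain Java. It is not ARQ or LARQ code: the class and method names are made up for illustration, and it simply assumes that the text index returns one hit per indexed occurrence of the literal and that each hit is then joined against the matching triples.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

// Hypothetical simulation; none of these names come from Jena or LARQ.
class DuplicateHitsDemo {

    // Nested-loop join: every index hit is paired with every triple
    // whose object equals that hit.
    static List<String> join(Iterable<String> hits, List<String[]> triples) {
        List<String> rows = new ArrayList<>();
        for (String lit : hits) {
            for (String[] t : triples) {       // t = {subject, predicate, object}
                if (t[2].equals(lit)) {
                    rows.add(t[1] + " " + lit);
                }
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        // The two triples with subject urn:test:s1 from the example.
        List<String[]> s1Triples = List.of(
            new String[] {"urn:test:s1", "urn:test:p1", "foo"},
            new String[] {"urn:test:s1", "urn:test:p2", "foo"});

        // Assumption: the index holds one entry per triple, so "foo" comes back three times.
        List<String> rawHits = List.of("foo", "foo", "foo");

        // Joining the raw hits multiplies the rows: 3 hits x 2 triples.
        System.out.println(join(rawHits, s1Triples).size());                      // 6
        // Deduplicating the hits first (order preserved) gives the expected rows.
        System.out.println(join(new LinkedHashSet<>(rawHits), s1Triples).size()); // 2
    }
}
```

Under that assumption, the row count depends only on how many times the index reports the same literal, which would explain why an index built without duplicates gives 2 rows for either pattern order.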
Re: Fuseki with Larq - query anomaly
Hi Osma, hi Elli

On 02/11/12 10:34, Osma Suominen wrote:

Hi Elli! [apparently your reply didn't come through the mailing list, but this one should]

31.10.2012 23:11, Elli Schwarz wrote:

Thank you for the tip. Yes, if I generate the index using the larqbuilder command, I don't get the duplicates in the query, regardless of the placement of the pf:textMatch line. (As an aside, why does the default behavior of creating the index allow duplicates, but the default of the larqbuilder command does not?)

Good to hear that eliminating duplicates works for you. I have no idea why the defaults are as they are.

LARQ index 'text' --> RDF nodes, see IndexBuilderNode.java:

    public void index(Node node, String indexStr) {
        try {
            if ( avoidDuplicates() ) unindex(node, indexStr);
            Document doc = new Document() ;
            LARQ.store(doc, node) ;
            LARQ.index(doc, node, indexStr) ;
            getIndexWriter().addDocument(doc) ;
        } catch (IOException ex) {
            throw new ARQLuceneException("index", ex) ;
        }
    }

avoidDuplicates() returns 'true' by default: by default we want to avoid duplicates and keep the Lucene index smaller. The line

    if ( avoidDuplicates() ) unindex(node, indexStr);

is 'ugly' and inefficient, but it is there to avoid useless documents in the Lucene index, as you might have exactly the same RDF node/literal used in many triples. I am open to suggestions to make this better or faster.

However, switching the order of where I place the pf:textMatch line (while it may slow down the query), should not produce different results, even if there are duplicates in the index. This would appear to be a bug in how Larq applies the results of the index lookup to the query.

Elli, could you provide an example with some data and your query?

I'm not sure whether getting or not getting duplicates in specific situations can be considered a bug. But yes, the implementation of LARQ seems to be rather simplistic.
It might help if the raw index results were filtered to weed out duplicates before applying them to the query.

How could we do this?

Then the choice whether to try to avoid duplicates during indexing would only be an optimization issue. BTW I'm not (so far) a LARQ developer, just a fellow user..

But you could help out with LARQ (if you are using it!). Patches are always welcome, especially from fellow users! ;-) By the way, many thanks for the documentation on how to use LARQ with Fuseki. Very useful (and it will save me time... I can just point people to your page from now on).

Paolo

-Osma
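The trade-off behind the "unindex, then add" step in the index(...) method quoted earlier in this message can be sketched with a small in-memory stand-in. This is only an illustration under stated assumptions: TinyIndex is a hypothetical class, not LARQ or Lucene code, and a list of strings stands in for Lucene documents.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory stand-in for a Lucene index; not LARQ code.
class TinyIndex {
    private final List<String> docs = new ArrayList<>();
    private final boolean avoidDuplicates;

    TinyIndex(boolean avoidDuplicates) {
        this.avoidDuplicates = avoidDuplicates;
    }

    // Same shape as the quoted method: optionally remove an equal entry first, then add.
    void index(String node, String indexStr) {
        String doc = node + "|" + indexStr;
        if (avoidDuplicates) {
            docs.removeIf(doc::equals);   // the "unindex" step: extra work on every call
        }
        docs.add(doc);
    }

    int size() {
        return docs.size();
    }

    public static void main(String[] args) {
        TinyIndex withDuplicates = new TinyIndex(false);
        TinyIndex withoutDuplicates = new TinyIndex(true);
        for (int i = 0; i < 3; i++) {     // the same literal seen in three triples
            withDuplicates.index("foo^^xsd:string", "foo");
            withoutDuplicates.index("foo^^xsd:string", "foo");
        }
        System.out.println(withDuplicates.size());     // 3
        System.out.println(withoutDuplicates.size());  // 1
    }
}
```

The smaller index costs one removal attempt per indexed literal, which is the slowdown discussed in this thread for large datasets.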
Re: Accents-insensitive search with LARQ
On 27/10/12 00:31, Ondřej Hoferek wrote:

Hi all, I would like to use the full text search with LARQ for accent-insensitive matching. I.e. the pattern {?literal pf:textMatch "laska"} should also return the literal "láska žije". I know that in Lucene, there is a class ISOLatin1AccentFilter which can be used while building/querying the index. However, I don't know how to use it from within LARQ. Is there any way to achieve my goal?

Hi Ondřej, look at the LARQ sources and in particular at how IndexLARQ is used. That class has a couple of constructors which take a Lucene analyzer as a parameter. Please see if those help you.

Paolo

Best regards, Ondrej
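For readers unsure what accent folding involves, here is a small sketch using only the JDK. It does not use LARQ or Lucene, and AccentFolding.fold is a hypothetical helper. In a real setup the same folding would have to be applied both when building the index and when querying it, for example inside a custom analyzer passed to one of the constructors mentioned above.

```java
import java.text.Normalizer;

// Hypothetical helper, not part of LARQ or Lucene.
class AccentFolding {

    // Decompose accented characters (NFD), then drop the combining marks.
    static String fold(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // Unicode escapes for the example literal: a with acute, z with caron.
        System.out.println(fold("l\u00e1ska \u017eije"));  // laska zije
    }
}
```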
Re: LARQ index restrictions with Fuseki
Hi Ondřej

On 26/10/12 21:50, Ondřej Hoferek wrote:

Hi all, As far as I understood, a LARQ index will be created for all the literals in a given dataset when used with Fuseki with this configuration:

    <#dataset1> rdf:type tdb:DatasetTDB ;
        tdb:location "/tmp/tdb" ;
        ja:textIndex "/tmp/lucene" .

Is it possible to restrict the index built within Fuseki to certain named graphs/properties?

No, it is not possible to restrict the index to certain named graphs. However, there are constructors in IndexBuilderString, for example, which take a Property as a parameter to restrict which statements will be indexed:

    public IndexBuilderString(Property property, String fileDir)

This is not exposed via the Assembler/configuration. If you are willing to learn more about the Jena Assembler configuration mechanism, I am happy to work with you and help you on the LARQ side of the job. How would you like to specify the properties to index in your configuration file?

Paolo

This might be handy if I would like to index only a relatively small subset of all the data. With the LARQ API it is possible to restrict the index built to certain properties only. Alternatively, is it possible to build the LARQ index separately for a given dataset (TDB dataset) using the API (or any utility) with such restrictions and let Fuseki use it?

Best regards, Ondrej
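The effect of restricting indexing to certain properties can be sketched as a simple filter over triples. This is an illustration only: it does not use the Jena or LARQ APIs, the class name and the urn:test:label / urn:test:comment properties are made up, and triples are represented as plain string arrays.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; does not use the Jena or LARQ APIs.
class PropertyFilteredIndexing {

    // Return the literals of those triples whose predicate is in the allowed set.
    static List<String> literalsToIndex(List<String[]> triples, Set<String> allowedProperties) {
        List<String> literals = new ArrayList<>();
        for (String[] t : triples) {           // t = {subject, predicate, literal}
            if (allowedProperties.contains(t[1])) {
                literals.add(t[2]);
            }
        }
        return literals;
    }

    public static void main(String[] args) {
        List<String[]> triples = List.of(
            new String[] {"urn:test:s1", "urn:test:label", "first label"},
            new String[] {"urn:test:s1", "urn:test:comment", "a long comment"},
            new String[] {"urn:test:s2", "urn:test:label", "second label"});

        // Only literals attached to urn:test:label would reach the text index.
        System.out.println(literalsToIndex(triples, Set.of("urn:test:label")));
        // [first label, second label]
    }
}
```

A configuration-level version would need some way to list the allowed properties in the assembler file, which is the open question Paolo raises above.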
Re: Fuseki with Larq - query anomaly
On 16/11/12 22:20, Paolo Castagna wrote:

Elli, could you provide an example with some data and your query?

Apologies Elli, I have now found your example. ;-)

Paolo
Re: Fuseki with Larq - query anomaly
Hi Elli

On 23/10/12 16:47, Elli Schwarz wrote:

Hello, I am using Fuseki with Larq (thanks to Osma's recent instructions - thanks Osma!) where I recompiled Jena (after adding the Larq dependency) to Jena revision 1399877 (this past Friday morning's version of the trunk). I'm noticing the following anomaly when querying the data: First I insert the following triples:

    prefix xsd: <http://www.w3.org/2001/XMLSchema#>
    insert data {
      graph <urn:test:foo> {
        <urn:test:s1> <urn:test:p1> "foo"^^xsd:string .
        <urn:test:s1> <urn:test:p2> "foo"^^xsd:string .
        <urn:test:s2> <urn:test:p3> "foo"^^xsd:string .
      }
    }

Then I stop Fuseki, delete my index directory, and restart Fuseki. (As an aside, I'd be very interested in a fix for this so I don't have to restart Fuseki to rebuild the index - I'm watching JENA-164 and hoping someone will be able to work on it soon!)

Re: JENA-164 ... yeah, I'd love to help you out, but it's a sort of architectural issue of Jena IMHO. It should be easier for developers to listen to events as triples are added/removed so that you can attach external indexes and keep them in sync. There are multiple paths which you can use to change RDF data: APIs, SPARQL, etc. From a user's point of view, you would like to keep your external index always in sync, no matter where the updates come from.

Once Fuseki is back up, I run the following query (I have default graph set as the union of named graphs by default):

    PREFIX pf: <http://jena.hpl.hp.com/ARQ/property#>
    select * where {
      <urn:test:s1> ?p ?lit .
      ?lit pf:textMatch "foo" .
    }

and I get 2 results as I expect:

    | p             | lit                                              |
    | <urn:test:p1> | "foo"^^<http://www.w3.org/2001/XMLSchema#string> |
    | <urn:test:p2> | "foo"^^<http://www.w3.org/2001/XMLSchema#string> |

However, when I flip the order of my query like this:

    PREFIX pf: <http://jena.hpl.hp.com/ARQ/property#>
    select * where {
      ?lit pf:textMatch "foo" .
      <urn:test:s1> ?p ?lit .
    }

I get 6 results, instead of the two I expect:

    | lit                                              | p             |
    | "foo"^^<http://www.w3.org/2001/XMLSchema#string> | <urn:test:p1> |
    | "foo"^^<http://www.w3.org/2001/XMLSchema#string> | <urn:test:p2> |
    | "foo"^^<http://www.w3.org/2001/XMLSchema#string> | <urn:test:p1> |
    | "foo"^^<http://www.w3.org/2001/XMLSchema#string> | <urn:test:p2> |
    | "foo"^^<http://www.w3.org/2001/XMLSchema#string> | <urn:test:p1> |
    | "foo"^^<http://www.w3.org/2001/XMLSchema#string> | <urn:test:p2> |

My guess as to what happens is that in the second query, first the query executor executes the first line (the ?lit pf:textMatch "foo") and this returns 3 results for foo, since there are 3 literals for foo. Then, the next line of the query has three bindings to ?lit, so it produces the 6 results above (2 for each foo literal since there are 2 properties for urn:test:s1). I know that I can avoid this by using a SELECT DISTINCT, but I still think the query shouldn't produce different results based on switching the order. Additionally, if I put this in a CONSTRUCT query, I can't use DISTINCT to eliminate the duplicate results (unless I use a SELECT DISTINCT subquery which I'd rather avoid).

I am not sure; at the moment I have no clear idea of how this problem could be fixed.

Paolo

Another point I've noticed is that in my other (much more complex) queries, against a much larger dataset (~1.5 million triples), if I put the pf:textMatch line anywhere but in the very beginning of the query, the query takes a VERY long time to execute. If I put it as the first line in the query, the query runs quickly. My guess for this is that the query is executed in order, and it takes much more work for the query executor to run the other parts of my query which contain many results, and then have to go back and essentially filter out those results where the literal doesn't match the pf:textMatch. I can always place the pf:textMatch line first, but then I'm back to the problem mentioned above where I get back too many duplicate results.

Thank you very much for your help!
-Elli
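On the JENA-164 point in this message, the idea of listening to add/remove events so that an external index stays in sync can be sketched as follows. This shows the general pattern only, not Jena's event API: NotifyingStore and ChangeListener are hypothetical names, triples are plain strings, and a set stands in for the external index.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the idea; these types are not Jena's.
class NotifyingStore {

    interface ChangeListener {
        void added(String triple);
        void removed(String triple);
    }

    private final Set<String> triples = new HashSet<>();
    private final List<ChangeListener> listeners = new ArrayList<>();

    void register(ChangeListener listener) {
        listeners.add(listener);
    }

    // Every update path funnels through add/remove, so listeners see all effective changes.
    void add(String triple) {
        if (triples.add(triple)) {
            listeners.forEach(l -> l.added(triple));
        }
    }

    void remove(String triple) {
        if (triples.remove(triple)) {
            listeners.forEach(l -> l.removed(triple));
        }
    }

    public static void main(String[] args) {
        NotifyingStore store = new NotifyingStore();
        Set<String> externalIndex = new HashSet<>();   // stand-in for a text index
        store.register(new ChangeListener() {
            public void added(String t) { externalIndex.add(t); }
            public void removed(String t) { externalIndex.remove(t); }
        });

        store.add("s1 p1 foo");
        store.add("s1 p2 foo");
        store.remove("s1 p1 foo");
        System.out.println(externalIndex);  // [s1 p2 foo]
    }
}
```

The design point is that the index no longer needs a restart to be rebuilt, provided every way of changing the data (API calls, SPARQL Update, and so on) really does go through the notifying layer.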