Yes, that is a good point about the newer string functions covering some typical cases of REGEX usage. I actually have a blog post written about this very issue that is currently sitting in a colleague's review queue before it hits my employer's blog.
Picking up on another topic that has come up in this thread, you may want to take a look at the framework that we put out at Cray - http://sourceforge.net/projects/sparql-query-bm/ - as Andy pointed out, the original talk on this topic is at http://www.slideshare.net/RobVesse/practical-sparql-benchmarking, but the tool has evolved substantially since then. Take a look at the wiki - http://sourceforge.net/p/sparql-query-bm/wiki/Introduction/ - which describes the current state of the tool.

Note that we haven't made an official 2.x release yet since it relies on various tweaks and fixes around HTTP authentication that are only present in the current ARQ trunk (and SNAPSHOTs), but 2.x is essentially ready for regular use, and we're certainly using the new features in anger internally at Cray.

Versus the original 1.x version that I presented ~2 years ago, the 2.x version has substantial improvements:

- Support for testing a wide range of operations, including performing parameterised queries a la the BSBM harness
- An extensible API that allows introducing completely custom operations to be tested (http://sparql-query-bm.sourceforge.net/javadoc/dev/core/net/sf/sparql/benchmarking/operations/Operation.html)
- Multiple test runners (http://sparql-query-bm.sourceforge.net/javadoc/dev/core/net/sf/sparql/benchmarking/runners/Runner.html)
- Support for in-memory testing (http://sourceforge.net/p/sparql-query-bm/wiki/In-Memory/), which allows you to cut out the network overhead of communicating with remote systems (or even local HTTP systems)

Certainly this last one has been tested internally using TDB and seems to work nicely.

Rob

On 21/04/2014 08:36, "Saud Aljaloud" <[email protected]> wrote:

>On 21 Apr 2014, at 14:01, Andy Seaborne <[email protected]> wrote:
>> I'd be interested in hearing in what ways the problem is different from
>> SQL.
>>
>> Also - in SQL, there is LIKE. Would it be a good idea for SPARQL to
>> have a separate "LIKE"
>
>
>SPARQL 1.1 is good in addressing this.
There are now three new functions:
>STRSTARTS, STRENDS and CONTAINS. These are all special cases of LIKE.
>
>
>> (=> can a system do a lot better with that than analysing a regex?).
>
>In theory, I think, yes. Instead of compiling a regex at all, you'd perform
>simple string matching, or go even faster by building dedicated indices.
>
>
>Many thanks for the details,
>Saud
>
>On 21 Apr 2014, at 14:01, Andy Seaborne <[email protected]> wrote:
>
>> On 21/04/14 12:23, Saud Al-Jaloud wrote:
>>>> Just use current releases.
>>>
>>> We are using current releases; we are not looking to tune systems
>>> but rather for the right configs, as this is some sort of an extension.
>>> Otherwise, some might argue that we were unfair or missed features for
>>> some stores over others. For example, buffer size or the way of
>>> building the full text index, etc. To some extent, we are trying to follow
>>> the same rules as BSBM, see
>>>
>>> http://wifo5-03.informatik.uni-mannheim.de/bizer/berlinsparqlbenchmark/spec/BenchmarkRules/
>>>
>>>
>>>> No relationship to http://www.ldbc.eu/?
>>> Nope, this is part of my PhD, which is mainly about optimising regex
>>> within SPARQL, but I also look at full text search here.
>>
>> I'd be interested in hearing in what ways the problem is different from
>> SQL.
>>
>> Also - in SQL, there is LIKE. Would it be a good idea for SPARQL to
>> have a separate "LIKE" (=> can a system do a lot better with that than
>> analysing a regex?).
>>
>>>
>>>> As it's fulltext, you have to use the custom features of each
>>>> system so comparing like-with-like is going to be hard.
>>> Indeed, there are a couple of challenges; that's why I see it as an
>>> interesting area.
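As an aside on the STRSTARTS/STRENDS/CONTAINS point above, here is a minimal Python sketch (Python stands in for the query engine purely as an illustration) of the equivalence being discussed: each of these SPARQL 1.1 builtins can be answered with a plain string operation, while the equivalent REGEX call needs a pattern engine.

```python
import re

# The SPARQL 1.1 string functions map onto plain string operations,
# no regex compilation required. Each comment shows the REGEX form
# the function replaces.

def strstarts(s, prefix):
    return s.startswith(prefix)      # REGEX(?s, "^" + prefix)

def strends(s, suffix):
    return s.endswith(suffix)        # REGEX(?s, suffix + "$")

def contains(s, sub):
    return sub in s                  # REGEX(?s, sub)

label = "Berlin SPARQL Benchmark"

# The plain-string and regex evaluations agree:
assert strstarts(label, "Berlin") == bool(re.search("^" + re.escape("Berlin"), label))
assert strends(label, "Benchmark") == bool(re.search(re.escape("Benchmark") + "$", label))
assert contains(label, "SPARQL") == bool(re.search(re.escape("SPARQL"), label))
```

The anchored forms (`^...` / `...$`) are also exactly the shapes a store can serve from a sorted or suffix index rather than by scanning, which is the "dedicated indices" point above.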
>>>
>>>> ((That's really what stopped it getting standardized in SPARQL 1.1
>>>> - it's a large piece of work (see xpath-full-text) and so it was
>>>> this or most of the other features.))
>>> I've also read your post,
>>>
>>> http://mail-archives.apache.org/mod_mbox/jena-users/201306.mbox/%[email protected]%3E
>>>
>>> Thanks for sharing such info.
>>> But don't you think that even a common syntax could make a huge
>>> difference, regardless of how stores internally implement it?
>>
>> It's an argument that came up when the SPARQL 1.1 WG was deciding what
>> to do. I happen to agree that a common syntax would have been good, but
>> others felt that if the text search language wasn't standardised as
>> well, it was not a good use of the fixed amount of time we had. There is
>> also an argument that what is really needed is a general extension
>> mechanism (text, spatial, statistical analytics, ...) and again defining
>> that is non-trivial.
>>
>> Standards involve compromises, as does working to a timescale with
>> volunteers (not that we stuck to the timescale!).
>>
>>>> 1 million? Isn't that rather small? The whole thing will fit in
>>>> memory. 1M BSBM fits completely in 1G RAM, and TDB isn't very space
>>>> efficient as it trades space for direct memory mapping.
>>> That was just a test for the purpose of these emails, to make sure we
>>> are doing things right. The real test will target 200M, maybe more.
>>>
>>>> 3G should be enough. 20G will slow it down (by a small amount given
>>>> your hardware) as much of TDB's memory usage is outside the Java
>>>> heap.
>>> I'll take this for Jena.
>>>
>>>
>>>> 20G will slow it down
>>> Generally, I thought the max won't affect the speed as long as it
>>> isn't reached; this will reduce the number of GCs being performed,
>>> won't it?
>>
>> Not in the case of TDB, because the indexes are cached as memory-mapped
>> files outside the heap, so if you have a larger heap you have less index
>> cache space.
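Concretely, the heap advice above translates to a one-line edit of the fuseki-server launch script. This is only a sketch of the setting under discussion; the 3G figure is Andy's suggestion in this thread, not a hard rule, and should be sized against your own dataset.

```shell
# In the fuseki-server script: keep the Java heap small so the remaining
# RAM is left for TDB's memory-mapped index files (which live outside
# the heap). -Xmx20G would shrink that off-heap cache space.
JVM_ARGS=${JVM_ARGS:--Xmx3G}
```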
>>
>> And the GC pauses get longer even if they are less frequent. A full GC
>> happens sometime - see lots of big data blogs about the pain felt when
>> the GC goes off into the weeds for seconds at a time.
>>
>>>> (Why not use SSDs? They are common these days. Does wonders
>>>> for loading speed!)
>>> I've seen some stores recommending them. Unfortunately, I've got no
>>> control over this for now. Just out of curiosity, within Jena, do you
>>> think that the existing index structure, i.e. the B+Tree, needs any
>>> changes to get the best out of SSDs?
>>
>> Not as far as I know. They work much better on SSDs already. They use a
>> large-ish block size (8K - the trees are 150 to 200 way B+Trees). The
>> TDB B+Trees are quite specialised - they only work with fixed size keys
>> and fixed size values, making node search fast.
>>
>> In TDB, the places to look for optimizations are all the trends in
>> modern DBs: e.g. design for in-memory use - the disk is just a backup
>> and a way to move state across OS restarts. On today's servers, many
>> uses fit in RAM, or a very high percentage of the hot data is RAM-sized,
>> so designing for that would be good.
>>
>> Multi-core execution. Multi-machine execution (Project Lizard is going
>> that way).
>>
>> A big change would be to make the indexes use an MVCC design so that
>> update-in-place does not happen and transactions are single-write, not
>> 2 writes (write to log, then write to the main DB sometime later), as in
>> CouchDB or, recently, Apache Mavibot.
>>
>> The NodeTable is the better place to look for optimizations.
>>
>> Andy
>>>
>>>
>>> Cheers, Saud
>>>
>>> On 21 Apr 2014, at 10:38, Andy Seaborne <[email protected]> wrote:
>>>
>>>> On 20/04/14 17:13, Saud Aljaloud wrote:
>>>>> Thanks Paul and Andy,
>>>>>
>>>>>> Why are you suggesting non-public?
>>>>> The idea is that because we are benchmarking a number of triple
>>>>> stores, and our choice is to ask each of them privately about the
>>>>> best configurations for their own store, we want to reduce the
>>>>> amount of core information of our work being publicly available,
>>>>> i.e. the queries or statistics about other stores, until we publish
>>>>> it all at once later. That being said, we can discuss the general
>>>>> setup of Jena here.
>>>>
>>>> Unclear what anyone would do with such information ahead of
>>>> publication unless it's to copy you and publish earlier. Just use
>>>> current releases.
>>>>
>>>> No relationship to http://www.ldbc.eu/?
>>>>
>>>> As it's fulltext, you have to use the custom features of each
>>>> system so comparing like-with-like is going to be hard.
>>>>
>>>> Each custom extension is going to have assumptions about usage - for
>>>> Jena, you can use Solr and have other applications going to the
>>>> same index; it's not a Jena-specific structure anymore (LARQ was).
>>>> The text search languages have different capabilities.
>>>>
>>>> ((That's really what stopped it getting standardized in SPARQL 1.1
>>>> - it's a large piece of work (see xpath-full-text) and so it was
>>>> this or most of the other features.))
>>>>
>>>>>> The benchmark driver in their SVN repository is somewhat ahead
>>>>>> of the last formal release
>>>>> Thanks for pointing this out.
>>>>
>>>> I run a modified version that runs TDB locally to the benchmark
>>>> driver, to benchmark just TDB and not the protocol component.
>>>>
>>>>>> There isn't much on matching parts of strings in BSBM.
>>>>> BSBM has a number of literal predicates of good/sufficient
>>>>> length; see the predicates within the assembler below, or see
>>>>>
>>>>> http://wifo5-03.informatik.uni-mannheim.de/bizer/berlinsparqlbenchmark/spec/Dataset/index.html
>>>>>
>>>>> Here are the steps we do to perform a test on 1 million triples:
>>>>
>>>> 1 million?
Isn't that rather small? The whole thing will fit in
>>>> memory. 1M BSBM fits completely in 1G RAM, and TDB isn't very space
>>>> efficient as it trades space for direct memory mapping.
>>>>
>>>>>
>>>>> =======================
>>>>> Jena Configurations:
>>>>> 1- edit fuseki-server: JVM_ARGS=${JVM_ARGS:--Xmx20G}
>>>>
>>>> 3G should be enough. 20G will slow it down (by a small amount given
>>>> your hardware) as much of TDB's memory usage is outside the Java
>>>> heap.
>>>>
>>>> (Why not use SSDs? They are common these days. Does wonders
>>>> for loading speed!)
>>>>
>>>>>
>>>>> =======================
>>>>> Jena Configurations:
>>>>> 1- edit fuseki-server: JVM_ARGS=${JVM_ARGS:--Xmx20G}
>>>>>
>>>>> 2- create an Assembler for Jena Text with Lucene
>>>>> "BSBM-fulltext-1.ttl":
>>>>>
>>>>> ## Example of a TDB dataset and text index published using Fuseki
>>>>> @prefix :          <http://localhost/jena_example/#> .
>>>>> @prefix fuseki:    <http://jena.apache.org/fuseki#> .
>>>>> @prefix rdf:       <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
>>>>> @prefix rdfs:      <http://www.w3.org/2000/01/rdf-schema#> .
>>>>> @prefix tdb:       <http://jena.hpl.hp.com/2008/tdb#> .
>>>>> @prefix ja:        <http://jena.hpl.hp.com/2005/11/Assembler#> .
>>>>> @prefix text:      <http://jena.apache.org/text#> .
>>>>> @prefix rev:       <http://purl.org/stuff/rev#> .
>>>>> @prefix bsbm:      <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/vocabulary/> .
>>>>> @prefix bsbm-inst: <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/instances/> .
>>>>> @prefix dc:        <http://purl.org/dc/elements/1.1/> .
>>>>> @prefix foaf:      <http://xmlns.com/foaf/0.1/> .
>>>>>
>>>>> [] rdf:type fuseki:Server ;
>>>>>    # Timeout - server-wide default: milliseconds.
>>>>>    # Format 1: "1000" -- 1 second timeout
>>>>>    # Format 2: "10000,60000" -- 10s timeout to first result, then 60s timeout for the rest of the query.
# See java doc for ARQ.queryTimeout
>>>>>    ja:context [ ja:cxtName "arq:queryTimeout" ; ja:cxtValue "10000" ] ;
>>>>>    # ja:loadClass "your.code.Class" ;
>>>>>
>>>>>    fuseki:services ( <#service_text_tdb> ) .
>>>>>
>>>>> # TDB
>>>>> [] ja:loadClass "com.hp.hpl.jena.tdb.TDB" .
>>>>> tdb:DatasetTDB rdfs:subClassOf ja:RDFDataset .
>>>>> tdb:GraphTDB rdfs:subClassOf ja:Model .
>>>>>
>>>>> # Text
>>>>> [] ja:loadClass "org.apache.jena.query.text.TextQuery" .
>>>>> text:TextDataset rdfs:subClassOf ja:RDFDataset .
>>>>> #text:TextIndexSolr rdfs:subClassOf text:TextIndex .
>>>>> text:TextIndexLucene rdfs:subClassOf text:TextIndex .
>>>>>
>>>>> ## ---------------------------------------------------------------
>>>>>
>>>>> <#service_text_tdb> rdf:type fuseki:Service ;
>>>>>     rdfs:label "TDB/text service" ;
>>>>>     fuseki:name "BSBM1M" ;
>>>>>     fuseki:serviceQuery "query" ;
>>>>>     fuseki:serviceQuery "sparql" ;
>>>>>     fuseki:serviceUpdate "update" ;
>>>>>     fuseki:serviceUpload "upload" ;
>>>>>     fuseki:serviceReadGraphStore "get" ;
>>>>>     fuseki:serviceReadWriteGraphStore "data" ;
>>>>>     fuseki:dataset :text_dataset ;
>>>>>     .
>>>>>
>>>>> :text_dataset rdf:type text:TextDataset ;
>>>>>     text:dataset <#dataset> ;
>>>>>     ##text:index <#indexSolr> ;
>>>>>     text:index <#indexLucene> ;
>>>>>     .
>>>>>
>>>>> <#dataset> rdf:type tdb:DatasetTDB ;
>>>>>     tdb:location "/home/path/apache-jena-2.11.1/data" ;
>>>>>     #tdb:unionDefaultGraph true ;
>>>>>     .
>>>>>
>>>>> <#indexSolr> a text:TextIndexSolr ;
>>>>>     #text:server <http://localhost:8983/solr/COLLECTION> ;
>>>>>     text:server <embedded:SolrARQ> ;
>>>>>     text:entityMap <#entMap> ;
>>>>>     .
>>>>>
>>>>> <#indexLucene> a text:TextIndexLucene ;
>>>>>     text:directory <file:/home/path/apache-jena-2.11.1/lucene> ;
>>>>>     ##text:directory "mem" ;
>>>>>     text:entityMap <#entMap> ;
>>>>>     .
>>>>>
>>>>> <#entMap> a text:EntityMap ;
>>>>>     text:entityField "uri" ;
>>>>>     text:defaultField "text" ; ## Should be defined in the text:map.
text:map (
>>>>>          # rdfs:label
>>>>>          [ text:field "text" ; text:predicate rdfs:label ]
>>>>>          [ text:field "text" ; text:predicate rdfs:comment ]
>>>>>          [ text:field "text" ; text:predicate foaf:name ]
>>>>>          [ text:field "text" ; text:predicate dc:title ]
>>>>>          [ text:field "text" ; text:predicate rev:text ]
>>>>>          ) .
>>>>>
>>>>>
>>>>> =======================
>>>>> Jena Test procedure with statistics for BSBM1M (one million
>>>>> triples), using a machine with specs [1], [2]:
>>>>>
>>>>> 1- Load data: ./tdbloader2 --loc ../data/ ~/bsbmtools-0.2/dataset_1M.ttl
>>>>>    15:25:24 -- 35 seconds. Size: 137M.
>>>>>
>>>>> 2- Build the jena-text index: java -cp fuseki-server.jar jena.textindexer --desc=BSBM-fulltext-1.ttl
>>>>>    INFO 31123 properties indexed (3112 per second overall)
>>>>>    INFO 72657 properties indexed (5589 per second)
>>>>>    Size: 17M.
>>>>>
>>>>> 3- Flush OS memory and swap.
>>>>>
>>>>> 4- Run server: ./fuseki-server --config=BSBM-fulltext-1.ttl
>>>>>
>>>>> 5- Run test using the BSBM driver: ./testdriver -ucf usecases/literalSearch/fulltext/jena.txt -w 1000 -o Jena_1Client_BSBM1M.xml http://localhost:3030/BSBM1M/sparql
>>>>>
>>>>> =======================
>>>>>
>>>>> Any comments would be appreciated.
>>>>>
>>>>> Many thanks, Saud
>>>>>
>>>>> [1] 2x AMD Opteron 4280 Processor (2.8GHz, 8C, 8M L2/8M L3 Cache,
>>>>>     95W), DDR3-1600 MHz 128GB Memory for 2 CPUs (8x16GB Quad Rank
>>>>>     LV RDIMMs, 1066MHz), 2x 300GB SAS 6Gbps 3.5-in 15K RPM Hard
>>>>>     Drives (Hot-plug), SAS 6/iR Controller for Hot Plug HDD
>>>>>     Chassis, No Optical Drive, Redundant Power Supply (2 PSU)
>>>>>     500W, 2M Rack Power Cord C13/C14 12A, iDRAC6 Enterprise,
>>>>>     Sliding Ready Rack Rails, C11 Hot-Swap - R0 for SAS 6iR,
>>>>>     Min. 2 Max. 4 SAS/SATA Hot Plug Drives
>>>>> [2] Red Hat Enterprise Linux Server release 6.4 (Santiago), java
>>>>>     version "1.7.0_51"
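For anyone wanting to reproduce a run like the five-step procedure quoted above, the commands can be collected into a single script. This is an outline only: the commands come from the thread itself, but all paths (JENA_HOME etc.), the background-start of Fuseki, and the cache-flush step are assumptions that will need adjusting for your environment.

```shell
#!/usr/bin/env bash
# Sketch of the benchmark procedure described above. The paths are the
# ones quoted in the thread plus assumed install locations.
set -e

JENA_HOME=~/apache-jena-2.11.1     # assumption
FUSEKI_HOME=~/jena-fuseki          # assumption
BSBM_HOME=~/bsbmtools-0.2

# 1. Bulk-load the BSBM dataset into TDB
"$JENA_HOME"/bin/tdbloader2 --loc "$JENA_HOME"/data "$BSBM_HOME"/dataset_1M.ttl

# 2. Build the jena-text (Lucene) index from the assembler description
java -cp "$FUSEKI_HOME"/fuseki-server.jar jena.textindexer --desc=BSBM-fulltext-1.ttl

# 3. Flush the OS page cache and swap so runs start cold (Linux, needs root)
sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
sudo swapoff -a && sudo swapon -a

# 4. Start Fuseki with the same assembler config, in the background
"$FUSEKI_HOME"/fuseki-server --config=BSBM-fulltext-1.ttl &
sleep 10   # crude wait for the server to come up

# 5. Run the BSBM test driver against the text-enabled endpoint
cd "$BSBM_HOME"
./testdriver -ucf usecases/literalSearch/fulltext/jena.txt -w 1000 \
    -o Jena_1Client_BSBM1M.xml http://localhost:3030/BSBM1M/sparql
```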
