Re: [Wikidata] Wikidata query performance paper

2016-08-08 Thread Markus Kroetzsch

On 07.08.2016 22:58, Stas Malyshev wrote:

Hi!


the area for a long time). I guess the more difficult question, then, is
which RDF/SPARQL implementation to choose (since any such implementation
should cover at least points 1, 2 and 4 in a similar way), which in turn
reduces down to the distinguishing questions of performance, licensing,
distribution, maturity, tech support, development community, and
non-standard features (keyword search), etc.


We indeed had a giant spreadsheet in which a dozen potential
solutions (some of them were eliminated very early, but some put up a
robust fight :) were evaluated on about 50 criteria. Of course, some of
them were hard to formalize, and some numbers were a bit arbitrary, but
that's what we did and Blazegraph came out with the best score.



If you want to go into Wikidata history, here is the "giant spreadsheet" 
Stas was referring to:


https://docs.google.com/spreadsheets/d/1MXikljoSUVP77w7JKf9EXN40OB-ZkMqT8Y5b2NYVKbU/edit?usp=sharing

Some criteria there are obviously rather vague and subjective, but even 
when disregarding the scoring, it shows which systems have been looked at.


Markus




Re: [Wikidata] Wikidata query performance paper

2016-08-07 Thread Aidan Hogan

Hey Scott,

I'm not sure I can help with the details of the specific example
you mention, but the general area you are in -- dealing with
answering questions posed in natural language -- is called "Question
Answering".


When dealing with data in an RDF format (as per Wikidata), there's quite 
a lot of research done in the context of "Question Answering over Linked 
Data" (QALD).


The methods are not 100% accurate, but given data in a structured format
(like RDF), with good labels, and assuming relatively simple objective
questions (like "what age is the current Italian president?") that can
be answered over the data, I believe these techniques can get quite good
results. One can check out the QALD evaluation series for more details
on how well they perform [1].
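
To make that concrete, such a system would aim to map the natural
language question to a structured query. A minimal sketch over Wikidata
for the example above, using real identifiers (Q38 = Italy, P35 = head
of state, P569 = date of birth), with deliberately simplified age
arithmetic:

PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

SELECT ?president ?age WHERE {
  wd:Q38 wdt:P35 ?president .             # Italy -> head of state
  ?president wdt:P569 ?dob .              # date of birth
  BIND(YEAR(NOW()) - YEAR(?dob) AS ?age)  # approximate age in years
}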


I'm not really in that area myself, but perhaps the keywords might be 
useful if you want to read more.


Probably this will not be so helpful though if your focus is on using 
Wikidata to answer one specific question. :)


Cheers,
Aidan

[1] http://qald.sebastianwalter.org/


Re: [Wikidata] Wikidata query performance paper

2016-08-07 Thread Scott MacLeod
Thanks, Aidan, Stas and Wikidatans,

Thanks for the feedback.

While I'm not yet a SQL/SPARQL programmer, I wonder if one could make each
word in the question concrete, a Q-identifier, and, with rank-able outcomes,
create Wikidata Q-items/identifiers with attributes, possibly for each MIT
OCW course in 7 languages, each Yale OYC course, and each WUaS subject
page. It's the ranking of responses that would lessen the significance of
the question's inherent ill-definition/subjectivity, I think (?) - and
there might be other SQL/SPARQL-related approaches to this problem too.

Then hypothetically one could compare, for example, the list of MIT OCW

Earth, Atmosphere and Planetary Science courses (e.g.
http://ocw.mit.edu/courses/earth-atmospheric-and-planetary-sciences/ and in
Spanish -
http://ocw.mit.edu/courses/translated-courses/spanish/#earth-atmospheric-and-planetary-sciences
and WUaS's Earth wiki subject -
http://worlduniversity.wikia.com/wiki/Earth,_Atmospheric,_and_Planetary_Sciences
),

Statistics (e.g. http://ocw.mit.edu/courses/mathematics/ and in Spanish
http://ocw.mit.edu/courses/translated-courses/spanish/#mathematics and
WUaS's Statistics' wiki page
http://worlduniversity.wikia.com/wiki/Statistics),

Space/Astronautics courses (
http://ocw.mit.edu/courses/aeronautics-and-astronautics/ and WUaS's Space
wiki subject - http://worlduniversity.wikia.com/wiki/Space) with perhaps
wiki-added WUaS

Journalism wiki subject page (e.g.
http://ocw.mit.edu/courses/comparative-media-studies-writing/ and
Journalism http://worlduniversity.wikia.com/wiki/Journalism and various
forms of writing at WUaS http://worlduniversity.wikia.com/wiki/writing)

... with Q items and newspaper articles, and ask a variety of related
questions of the results?

It would be some sort of correlation of the relative rankings of these
outputs in response to the queries, which could yield results somehow
paralleling Google Search results, for example. (Possible collaboration
with Google Search could eventually extend to collaboration in voice on
Android smartphones, and in Google group video Hangouts for ASL and other
forms of sign language, for example.)

I haven't been able to find any Mandarin Chinese MIT OCW Statistics, Earth,
Space, or Journalism courses to speak of yet -
http://ocw.mit.edu/courses/translated-courses/traditional-chinese
(accessible here http://ocw.mit.edu/courses/translated-courses/) - although
these MIT OCW Writing courses in Mandarin Chinese -
http://ocw.mit.edu/courses/translated-courses/traditional-chinese/#comparative-media-studies-writing
- could possibly work for some of these hypothetical Wikidata query
performance questions I'm seeking to explore, in this "if one builds it"
approach.

For example, and hypothetically, if there were 3 relatively recent and new
MIT OCW Earth courses, 2 new MIT OCW Statistics courses, and 10 journalism
articles from the best newspapers and academic journals in English on
Earth/Space (http://ocw.mit.edu/courses/aeronautics-and-astronautics/),
and 4 in Chinese, and 5 in Spanish, perhaps one could get helpful and
useful outputs (that could eventually be asked for in voice/natural
language processing) by ranking relative importance partly according to
the newness of the course, and getting objective relative outcomes as a
group. The importance of a specific set of journals to a specific
discipline/subject could be another source of importance ranking, for
example - to highlight the operative item in this question, and to add
some further relative rankings as useful SQL coding possibilities.

Wikidata would generate or get a lot of valuable new fact-oriented and
knowledge-oriented Q items/identifiers/attributes (for CC MIT OCW's 2300
courses in English, and the other courses in 6 other languages, and CC Yale
OYC, as well as CC WUaS subjects, and with planning for major universities
with these and a growing number of wiki subjects in all languages).

I have no idea yet how to write the SQL/SPARQL for this, but rankable Q*
identifiers, new Q* identifiers and Google would be places I'd begin if I
did. What do you think?

Cheers, Scott



Re: [Wikidata] Wikidata query performance paper

2016-08-07 Thread Aidan Hogan

Hey Scott,

On 07-08-2016 16:15, Info WorldUniversity wrote:

Hi Aidan, Markus, Daniel and Wikidatans,

As an emergence out of this conversation on Wikidata query performance,
and re cc World University and School/Wikidata, as a theoretical
challenge, how would you suggest coding WUaS/Wikidata initially to be
able to answer this question - "What are most impt stats issues in
earth/space sci that journalists should understand?" -
https://twitter.com/ReginaNuzzo/status/761179359101259776 - in many
Wikipedia languages including however in American Sign Language (and
other sign languages), as well as eventually in voice. (Regina Nuzzo is
an associate Professor at Gallaudet University for the hearing
impaired/deafness, and has a Ph.D. in statistics from Stanford; Regina
was born with hearing loss herself).


I fear we are nowhere near answering these sorts of questions (by we, I 
mean the computer science community, not just Wikidata). The main 
problem is that the question is inherently ill-defined/subjective: there 
is no correct answer here.


We would need to think about refining the question to something that is
well-defined/objective, which even as a human is difficult. Perhaps we
could consider a question such as: "what statistical methods (from a
fixed list) have been used in scientific papers referenced by news
articles that have been published in the past seven years by media
companies that have their headquarters in the US?". Of course even then,
there are still some minor subjective aspects, and Wikidata would not
have the coverage to answer such a question.
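
For illustration, even that refined question can at least be written
down against Wikidata's vocabulary. A sketch using real property IDs
(P123 = publisher, P159 = headquarters location, P17 = country, P577 =
publication date, P2860 = cites work, P2283 = uses), where the fixed
list of methods would go into the commented-out VALUES block; as said,
the coverage to actually answer it is missing:

PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT DISTINCT ?method WHERE {
  # VALUES ?method { ... }              # the fixed list of method items
  ?news wdt:P123 ?publisher ;           # publisher (a media company)
        wdt:P577 ?date ;                # publication date
        wdt:P2860 ?paper .              # cites work: a scientific paper
  ?publisher wdt:P159/wdt:P17 wd:Q30 .  # headquarters' country: US
  ?paper wdt:P2283 ?method .            # uses (e.g. a statistical method)
  FILTER(?date >= "2009-01-01T00:00:00Z"^^xsd:dateTime)  # ~past 7 years
}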


The short answer is that machines are nowhere near answering these sorts 
of questions, no more than we are anywhere near taking a raw stream of 
binary data from an .mp4 video file and turning it into visual output. 
If we want to use machines to do useful things, we need to meet machines 
half-way. Part of that is formulating our questions in a way that 
machines can hope to process.



I'm excited for when we can ask WUaS (or Wikipedia) this question, (or
so many others) in voice combining, for example, CC WUaS Statistics,
Earth, Space & Journalism wiki subject pages (with all their CC MIT OCW
and Yale OYC) - http://worlduniversity.wikia.com/wiki/Subjects - in all
of Wikipedia's 358 languages, again eventually in voice and in ASL/other
sign languages
(https://twitter.com/WorldUnivAndSch/status/761593842202050560 - see,
too - https://www.wikidata.org/wiki/Wikidata:Project_chat#Schools).

Thanks for your paper, Aidan, as well. Would designing for deafness
inform how you would approach "Querying Wikidata: Comparing SPARQL,
Relational and Graph Databases" in any new ways?


In the context of Wikidata, the question of language is mostly a 
question of interface (which is itself non-trivial). But to answer the 
question in whatever language or mode, the question first has to be 
answered in some (machine-friendly) language. This is the direction in 
which Wikidata goes: answers are first Q* identifiers, for which labels 
in different languages can be generated and used to render the answer in 
a given mode.


Likewise our work is on the level of generating those Q* identifiers, 
which can later be turned into tables, maps, sentences, bubbles, etc. I 
think the interface question is an important one, but a different one to 
that which we tackle.
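
A minimal sketch of that two-step pattern on the query service - the
answer is computed as Q-identifiers, and (text) labels in whichever
languages the interface needs are attached afterwards:

PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?president ?label WHERE {
  wd:Q38 wdt:P35 ?president .     # step 1: the answer, a Q-identifier
  ?president rdfs:label ?label .  # step 2: labels for the interface
  FILTER(LANG(?label) IN ("en", "es", "de"))
}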


Cheers,
Aidan



Re: [Wikidata] Wikidata query performance paper

2016-08-07 Thread Stas Malyshev
Hi!

> the area for a long time). I guess the more difficult question, then, is
> which RDF/SPARQL implementation to choose (since any such implementation
> should cover at least points 1, 2 and 4 in a similar way), which in turn
> reduces down to the distinguishing questions of performance, licensing,
> distribution, maturity, tech support, development community, and
> non-standard features (keyword search), etc.

We indeed had a giant spreadsheet in which a dozen potential
solutions (some of them were eliminated very early, but some put up a
robust fight :) were evaluated on about 50 criteria. Of course, some of
them were hard to formalize, and some numbers were a bit arbitrary, but
that's what we did and Blazegraph came out with the best score.
-- 
Stas Malyshev
smalys...@wikimedia.org



Re: [Wikidata] Wikidata query performance paper

2016-08-07 Thread Aidan Hogan

Hey Daniel,

On 07-08-2016 7:03, Daniel Kinzler wrote:

Hi Aidan!

Thank you for this very interesting research!

Query performance was of course one of the key factors for selecting the
technology to use for the query service. However, it was only one among
several. The Wikidata use case is different from most common scenarios in
some ways, for instance:

* We cannot optimize for specific queries, since users are free to submit any
query they like.
* The data representation needs to be intuitive enough for (technically
inclined) casual users to grasp and write queries.
* The data doesn't hold still; it needs to be updated continuously, multiple
times per second.
* Our data types are more complex than usual - for instance, we support
multiple calendar models for dates, and not only values but also different
accuracies, up to billions of years; we use "quantities" with unit and
uncertainty instead of plain numbers, etc.

My point is that, if we had a static data set and a handful of known queries to
optimize for, we could have set up a relational or graph database that would be
far more performant than what we have now. The big advantage of Blazegraph is
its flexibility, not raw performance.


Understood. :) Taking everything into account as mentioned above, and 
based on our own experiences with various experiments in the context of 
Wikidata and other works, I think the choice to use RDF/SPARQL was the 
right one (though I would be biased on this issue since I've worked in 
the area for a long time). I guess the more difficult question, then, is 
which RDF/SPARQL implementation to choose (since any such implementation 
should cover at least points 1, 2 and 4 in a similar way), which in turn 
reduces down to the distinguishing questions of performance, licensing, 
distribution, maturity, tech support, development community, and 
non-standard features (keyword search), etc.


On raw query performance, based personally on what I have seen, I 
think Virtuoso probably has the lead at the moment, in that it has 
consistently outperformed other SPARQL engines, not only in our Wikidata 
experiments, but in other benchmarks by other authors. However, taking 
all the other points into account, particularly in terms of licensing, 
Blazegraph does seem to have been a sound choice. And the current query 
service does seem to be a sound base to work forward from.



It might be interesting for you to know that we initially started to implement
the query service against a graph database, Titan - which was discontinued
while we were still getting up to speed. Luckily this happened early on; it
would have been quite painful to switch after we had gone live.


This is indeed good to know! (We considered other graph database 
engines, but we did not think Gremlin was a good fit with what Wikidata 
was trying to achieve, in the sense of being too "imperative": though one 
can indeed do something like BGPs with the language, it's not 
particularly easy, nor intuitive.)


Cheers,
Aidan




Re: [Wikidata] Wikidata query performance paper

2016-08-06 Thread Aidan Hogan



On 06-08-2016 18:48, Stas Malyshev wrote:

Hi!


There's a brief summary in the paper of the models used. In terms of all
the "gory" details of how everything was generated, (hopefully) all of
the relevant details supporting the paper should be available here:

http://users.dcc.uchile.cl/~dhernand/wquery/


Yes, the gory part is what I'm after :) Thank you, I'll read through it
in the next couple of days and come back with any questions/comments I
might have.


Okay! :)


We just generalised sitelinks and references as a special type of
qualifier (actually I don't think the paper mentions sitelinks but we
mention this in the context of references).


Sitelinks can not be qualifiers, since they belong to the entity, not to
the statement. They can, I imagine, be considered a special case of
properties (we do not do it, but in theory it is not impossible to
represent them this way if one wanted to).


Ah yes, I think in that context our results should be considered as 
being issued over a "core" of Wikidata in the sense that we did not 
directly consider somevalue, novalue, ranks, etc. (I'm not certain in 
the case of sitelinks; I do not remember discussing those). This is 
indeed all doable in RDF without too much bother (I think) but would be 
much more involved for the relational database or for Neo4J.
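
For concreteness, in the RDF mapping the query service itself uses,
those features hang off a statement node, and sitelinks hang off the
entity via schema.org terms, so they are all reachable with plain triple
patterns. A sketch, using the prefixes predeclared at the Wikidata Query
Service (wd:, p:, ps:, pq:, pr:, prov:, schema:) and real identifiers
(Q42 = Douglas Adams, P69 = educated at, P582 = end time, P854 =
reference URL):

SELECT ?statement ?endTime ?refUrl ?article WHERE {
  wd:Q42 p:P69 ?statement .                 # statement node
  ?statement ps:P69 wd:Q691283 .            # main value: St John's College
  OPTIONAL { ?statement pq:P582 ?endTime }  # qualifier: end time
  OPTIONAL { ?statement prov:wasDerivedFrom/pr:P854 ?refUrl }  # reference
  OPTIONAL { ?article schema:about wd:Q42 ;
                      schema:inLanguage "en" }  # sitelink: enwiki article
}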



I am not sure how exactly one would make references a special case of
qualifier, as a qualifier has one (maybe complex) value, while each
reference can have multiple properties and values, but I'll read through
the details and the code before I talk more about it; it's possible that
I find my answers there.


I guess that depends on what you mean by "automatic" or "manual". :)

Automatic scripts were manually coded to convert from the JSON dump to
each representation. The code is linked above.


Here I meant queries, not data.


Ah, so the query generation process is also described in the 
documentation above. The core idea was to first create "subgraphs" of 
data with the patterns we wanted to generate queries for, and then using 
a certain random process, turn some constants into variables, and then 
select some variables to project. In summary, the queries were 
automatically generated from the data to ensure non-empty results.
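
A toy illustration of that process, with real Wikidata identifiers
(Q42 = Douglas Adams, P69 = educated at, P106 = occupation):

# The data contains this subgraph:
#   wd:Q42 wdt:P69 wd:Q691283 .   # educated at: St John's College
#   wd:Q42 wdt:P106 wd:Q36180 .   # occupation: writer
# Turning the shared subject constant into a variable and projecting it
# yields a query guaranteed to have at least one answer (Q42):
SELECT ?x WHERE {
  ?x wdt:P69 wd:Q691283 .
  ?x wdt:P106 wd:Q36180 .
}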



I'm not sure I follow on this part, in particular on the part of
"semantic completeness" and why this is hard to achieve in the context
of relational databases. (I get the gist but don't understand enough to
respond directly ... but perhaps below I can answer indirectly?)


Check out https://www.mediawiki.org/wiki/Wikibase/Indexing/RDF_Dump_Format

This is the range of data we need to represent and allow people to
query. We found it hard to do this using the relational model. It's
probably possible in theory, but producing efficient queries for it
looked very challenging, unless we were essentially to duplicate the
effort that is implemented in any graph database and only use the db
itself for the most basic storage needs. That's pretty much what the
Titan + Cassandra combo did, which we initially used until Titan's devs
were acquired by DataStax and the resulting uncertainty prompted us to
look into different solutions. I imagine in theory it's also possible to
create a Something+PostgreSQL combo doing the same, but PostgreSQL alone
does not look like enough.


Yes, this is something we did look into in some detail in the sense that 
we had a rather complex relational structure encoding all the features 
mentioned (storing everything from the JSON dumps, essentially), but the 
structure was so complex [1], we decided to simplify and consider the 
final models described in the paper ... especially given the prospect of 
trying to do something similar in Neo4J afterwards. :)



In any case, dealing with things like property paths seems to be rather
hard on an SQL-based platform, and is practically a must for Wikidata
querying.


Yep, agreed.
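
A typical example of why: queries over Wikidata's class hierarchy need
arbitrary-length paths, which SPARQL expresses directly but which plain
SQL needs recursive queries or iteration for. A sketch (Q523 = star):

PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

# All items that are an instance of star or of any transitive subclass
# of star; the * path has no fixed length.
SELECT ?item WHERE {
  ?item wdt:P31/wdt:P279* wd:Q523 .
}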


* Encoding object values with different datatypes (booleans, dates,
etc.) was a pain. One option was to have separate tables/columns for
each datatype, which would complicate queries and also leave the
question of how to add calendars, precisions, etc. Another option was to
use JSON strings to encode the values (the version of Postgres we used
just considered these as strings, but I think the new version has some
JSONB(?) support that could help get around this).


That is also an issue. We have a number of specialty data types (e.g.
dates extending billions of years into the future/past, coordinates
including different globes, etc.) which may present a challenge unless
the platform offers an easy way to encode custom types and deal with
them. RDF has a rather flexible model (basically string + type IRI) here,
and Blazegraph too; not sure how accommodating the SQL databases would be.

Also, relational DBs mostly prefer a very predictable data type model -
i.e. the same column always contains the same type. This is obviously not
true for any generic representation, and may not be true 

Re: [Wikidata] Wikidata query performance paper

2016-08-06 Thread Aidan Hogan

On 06-08-2016 17:56, Stas Malyshev wrote:

Hi!


On a side note, the results we presented for BlazeGraph could improve
dramatically if one could isolate queries that timed out. Once one query
in a sequence timed-out (we used server-side timeouts), we observed that
a run of queries would then timeout, possibly a locking problem or


Could you please give a bit more details about this failure scenario? Is
it that several queries are run in parallel and one query, timing out,
hurts the performance of others? Does it happen even after the long query
times out? Or was it a sequential run, and after one query timed out, the
next query had worse performance than the same query when run not
preceded by the timing-out query, i.e. the timeout query had a persistent
effect beyond its initial run?


The latter was the case, yes. We ran the queries in a given batch 
sequentially (waiting for one to finish before the next was run) and 
when one timed out, the next would almost surely time-out and the engine 
would not recover.


We tried a few things on this, like waiting an extra 60 seconds before 
running the next query, and also changing memory configurations to avoid 
GC issues. I believe Daniel was also in contact with the devs. 
Ultimately we figured we probably couldn't resolve the issue without 
touching the source code, which would obviously not be fair.



BTW, what was the timeout setting in your experiments? I see in the
article that it says "timeouts are counted as 60 seconds" - does it mean
that Blazegraph had internal timeout setting set to 60 seconds, or that
the setting was different, but when processing results, the actual run
time was replaced by 60 seconds?


Yup, the settings are here:

http://users.dcc.uchile.cl/~dhernand/wquery/#configure-blazegraph

My understanding is that with those settings, we set an internal timeout 
on BlazeGraph of 60 seconds.



Also, did you use analytic mode for the queries?
https://wiki.blazegraph.com/wiki/index.php/QueryEvaluation#Analytic_Query_Evaluation
https://wiki.blazegraph.com/wiki/index.php/AnalyticQuery

This is the mode that is turned on automatically for the Wikidata Query
Service, and it uses AFAIK different memory management which may
influence how the cases you had problems with are handled.


This I am not aware of. I would have to ask Daniel to be sure (I know he 
spent quite a lot of time playing around with different settings in the 
case of BlazeGraph).



I would appreciate as much detail as you could give on this, as this may
also be useful on current query engine work. Also, if you're interested
in the work done on WDQS, our experiences and the reasons for certain
decisions and setups we did, I'd be glad to answer any questions.


I guess to start with you should have a look at the documentation here:

http://users.dcc.uchile.cl/~dhernand/wquery/

If there's some details missing from that, or if you have any further 
questions, I can put you in contact with Daniel who did all the scripts, 
ran the experiments, was in discussion with the devs, etc. in the 
context of BlazeGraph. (I don't think he's on this list.)


I could also ask him perhaps to try to create a minimal-ish test case that 
reproduces the problem.



resource leak. Also Daniel mentioned that from discussion with the devs,
they claim that the current implementation works best on SSD hard
drives; our experiments were on a standard SATA.


Yes, we run it on SSDs; judging from our tests on test servers running
on virtualized SATA machines, the difference is indeed dramatic (orders
of magnitude and more for some queries). Then again, this is highly
unscientific anecdotal evidence; we didn't make anything resembling
formal benchmarks, since the test hardware is clearly inferior to the
production one and is meant to be so. But the point is that an SSD is
likely a must for Blazegraph to work well on this data set. It might also
improve results for other engines, so I'm not sure how it influences the
comparison between the engines.


Yes, I think this was the message we got from the mailing lists when we 
were trying to troubleshoot these issues: it would be better to use an 
SSD. But we did not have one, and of course we didn't want to tailor our 
hardware to suit one particular engine.


Unfortunately I think all such empirical experiments are in some sense 
anecdotal; even ours. We cannot deduce, for example, what would happen, 
relatively speaking, on a machine with an SSD, or more cores, or with 
multiple instances. But still, one can learn a lot from good anecdotes.


Cheers,
Aidan



Re: [Wikidata] Wikidata query performance paper

2016-08-06 Thread Stas Malyshev
Hi!

> On a side note, the results we presented for BlazeGraph could improve
> dramatically if one could isolate queries that timed out. Once one query
> in a sequence timed-out (we used server-side timeouts), we observed that
> a run of queries would then timeout, possibly a locking problem or

Could you please give a bit more details about this failure scenario? Is
it that several queries are run in parallel and one query, timing out,
hurts the performance of others? Does it happen even after the long query
times out? Or was it a sequential run, and after one query timed out, the
next query had worse performance than the same query when run not
preceded by the timing-out query, i.e. the timeout query had a persistent
effect beyond its initial run?

BTW, what was the timeout setting in your experiments? I see in the
article that it says "timeouts are counted as 60 seconds" - does it mean
that Blazegraph had internal timeout setting set to 60 seconds, or that
the setting was different, but when processing results, the actual run
time was replaced by 60 seconds?

Also, did you use analytic mode for the queries?
https://wiki.blazegraph.com/wiki/index.php/QueryEvaluation#Analytic_Query_Evaluation
https://wiki.blazegraph.com/wiki/index.php/AnalyticQuery

This is the mode that is turned on automatically for the Wikidata Query
Service, and it uses AFAIK different memory management which may
influence how the cases you had problems with are handled.

I would appreciate as much detail as you could give on this, as this may
also be useful on current query engine work. Also, if you're interested
in the work done on WDQS, our experiences and the reasons for certain
decisions and setups we did, I'd be glad to answer any questions.

> resource leak. Also Daniel mentioned that from discussion with the devs,
> they claim that the current implementation works best on SSD hard
> drives; our experiments were on a standard SATA.

Yes, we run it on SSDs; judging from our tests on test servers running
on virtualized SATA machines, the difference is indeed dramatic (orders
of magnitude and more for some queries). Then again, this is highly
unscientific anecdotal evidence; we didn't make anything resembling
formal benchmarks, since the test hardware is clearly inferior to the
production one and is meant to be so. But the point is that an SSD is
likely a must for Blazegraph to work well on this data set. It might also
improve results for other engines, so I'm not sure how it influences the
comparison between the engines.

-- 
Stas Malyshev
smalys...@wikimedia.org



Re: [Wikidata] Wikidata query performance paper

2016-08-06 Thread Stas Malyshev
Hi!

> The paper was recently accepted for presentation at the International
> Semantic Web Conference (ISWC) 2016. A pre-print is available here:
> 
> http://aidanhogan.com/docs/wikidata-sparql-relational-graph.pdf

Thank you for the link!
It would be interesting to see the actual data representations used for
RDF (e.g. examples of the data or a more detailed description). I notice
that they differ substantially from what we use in the Wikidata Query
Service implementation, used with Blazegraph, and also some of the
performance features we have implemented are probably not part of your
implementation. In any case, it would be interesting to know the details
of which RDF representations were used.

I also note that only statements and qualifiers are mentioned in most of
the text, with very little mention of sitelinks and references. Were they
part of the model too?

Due to the different RDF semantics, it would be also interesting to get
more details about how the example queries were translated to the RDF
representation(s) used in the article. Was it an automatic process or
they were translated manually? Is it possible to see them?

When working on the Query Service implementation, we considered a number
of possible representations, with regard to both performance and semantic
completeness. One of the conclusions was that achieving adequate
semantic completeness and performance on a relational database, while
allowing people to (relatively) easily write complex queries, is not
possible, due to relational engines not being a good match for the
hierarchical graph-like structures in Wikidata.

It would be interesting to look at the Postgres implementation of the
data model and queries to see whether your conclusions were different in
this case.

-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Wikidata query performance paper

2016-08-06 Thread Markus Kroetzsch

Hi Aidan,

Thanks, very interesting, though I have not read the details yet.

I wonder if you have compared the actual query results you got from the 
different stores. As far as I know, Neo4J actually uses a very 
idiosyncratic query semantics that is neither compatible with SPARQL 
(not even on the BGP level) nor with SQL (even for SELECT-PROJECT-JOIN 
queries). So it is difficult to compare it to engines that use SQL or 
SPARQL (or any other standard query language, for that matter). In this 
sense, it may not be meaningful to benchmark it against such systems.
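
To sketch the BGP-level difference: under SPARQL semantics, the two
triple patterns below may bind the same edge, so the query returns
answers even for an item with a single P40 (child) statement; under
Cypher's default matching, as I understand it, a relationship cannot be
reused within one MATCH, so the analogous pattern would return nothing
in that case (Q1339 = Johann Sebastian Bach):

PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

# ?c and ?d may bind the same child: one edge can match both patterns.
SELECT ?c ?d WHERE {
  wd:Q1339 wdt:P40 ?c .
  wd:Q1339 wdt:P40 ?d .
}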


Regarding Virtuoso, the reason for not picking it for Wikidata was the 
lack of load-balancing support in the open source version, not the 
performance of a single instance.


Best regards,

Markus


On 06.08.2016 18:19, Aidan Hogan wrote:

Hey all,

Recently we wrote a paper discussing the query performance for Wikidata,
comparing different possible representations of the knowledge-base in
Postgres (a relational database), Neo4J (a graph database), Virtuoso (a
SPARQL database) and BlazeGraph (the SPARQL database currently in use)
for a set of equivalent benchmark queries.

The paper was recently accepted for presentation at the International
Semantic Web Conference (ISWC) 2016. A pre-print is available here:

http://aidanhogan.com/docs/wikidata-sparql-relational-graph.pdf

Of course there are some caveats with these results in the sense that
perhaps other engines would perform better on different hardware, or
different styles of queries: for this reason we tried to use the most
general types of queries possible and tried to test different
representations in different engines (we did not vary the hardware).
Also in the discussion of results, we tried to give a more general
explanation of the trends, highlighting some strengths/weaknesses for
each engine independently of the particular queries/data.

I think it's worth a glance for anyone who is interested in the
technology/techniques needed to query Wikidata.

Cheers,
Aidan


P.S., the paper above is a follow-up to a previous work with Markus
Krötzsch that focussed purely on RDF/SPARQL:

http://aidanhogan.com/docs/reification-wikidata-rdf-sparql.pdf

(I'm not sure if it was previously mentioned on the list.)

P.P.S., as someone who's somewhat of an outsider but who's been watching
on for a few years now, I'd like to congratulate the community for
making Wikidata what it is today. It's awesome work. Keep going. :)
