Re: [Wikidata] Wikidata query performance paper

Aidan Hogan Sun, 07 Aug 2016 14:03:07 -0700

Hey Scott,

On 07-08-2016 16:15, Info WorldUniversity wrote:

Hi Aidan, Markus, Daniel and Wikidatans,


As an emergence out of this conversation on Wikidata query performance,
and re cc World University and School/Wikidata, as a theoretical
challenge, how would you suggest coding WUaS/Wikidata initially to be
able to answer this question - "What are most impt stats issues in
earth/space sci that journalists should understand?" -
https://twitter.com/ReginaNuzzo/status/761179359101259776 - in many
Wikipedia languages including however in American Sign Language (and
other sign languages), as well as eventually in voice. (Regina Nuzzo is
an associate Professor at Gallaudet University for the hearing
impaired/deafness, and has a Ph.D. in statistics from Stanford; Regina
was born with hearing loss herself).

I fear we are nowhere near answering these sorts of questions (by we, I
mean the computer science community, not just Wikidata). The main
problem is that the question is inherently ill-defined/subjective: there
is no correct answer here.

We would need to think about refining the question to something that is
well-defined/objective, which is difficult even for a human. Perhaps we
could consider a question such as: "what statistical methods (from a
fixed list) have been used in scientific papers referenced by news
articles that have been published in the past seven years by media
companies that have their headquarters in the US?". Of course even then,
there are still some minor subjective aspects, and Wikidata would not
have the coverage to answer such a question.

The short answer is that machines are nowhere near answering these sorts
of questions, no more than we are anywhere near taking a raw stream of
binary data from an .mp4 video file and turning it into visual output.
If we want to use machines to do useful things, we need to meet machines
half-way. Part of that is formulating our questions in a way that
machines can hope to process.
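
As a rough illustration of what "meeting machines half-way" could look
like, the refined question above can be written as a filter-and-count
over structured records. This is only a sketch: the record layout
(Citation), the fixed method list (METHODS) and the sample data are all
hypothetical, and nothing here reflects what Wikidata actually stores.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date

# Hypothetical fixed list of statistical methods (an assumption).
METHODS = {"t-test", "linear regression", "bootstrap"}


@dataclass(frozen=True)
class Citation:
    """One news article citing one paper (hypothetical record layout)."""
    outlet_hq_country: str      # country of the media company's headquarters
    article_date: date          # publication date of the news article
    paper_methods: frozenset    # methods used in the cited paper


def years_before(day, years):
    """Return the date `years` years before `day`."""
    try:
        return day.replace(year=day.year - years)
    except ValueError:  # 29 February mapped onto a non-leap year
        return day.replace(year=day.year - years, day=28)


def methods_used(citations, today, years=7, country="US"):
    """Count listed methods in papers cited by recent articles from `country`."""
    cutoff = years_before(today, years)
    counts = Counter()
    for c in citations:
        if c.outlet_hq_country == country and c.article_date >= cutoff:
            # Only methods on the fixed list are counted.
            counts.update(c.paper_methods & METHODS)
    return counts


# Made-up sample records, for illustration only.
SAMPLE = [
    Citation("US", date(2015, 3, 1), frozenset({"t-test", "bootstrap"})),
    Citation("US", date(2008, 1, 1), frozenset({"t-test"})),             # too old
    Citation("DE", date(2016, 1, 1), frozenset({"linear regression"})),  # not US
    Citation("US", date(2016, 5, 1), frozenset({"t-test", "p-hacking"})),
]

result = methods_used(SAMPLE, today=date(2016, 8, 7))
```

Every condition in the question (fixed list, seven years, US
headquarters) becomes an explicit test on the data, which is exactly
what the original, subjective question does not allow.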

I'm excited for when we can ask WUaS (or Wikipedia) this question, (or
so many others) in voice combining, for example, CC WUaS Statistics,
Earth, Space & Journalism wiki subject pages (with all their CC MIT OCW
and Yale OYC) - http://worlduniversity.wikia.com/wiki/Subjects - in all
of Wikipedia's 358 languages, again eventually in voice and in ASL/other
sign languages
(https://twitter.com/WorldUnivAndSch/status/761593842202050560 - see,
too - https://www.wikidata.org/wiki/Wikidata:Project_chat#Schools).

Thanks for your paper, Aidan, as well. Would designing for deafness
inform how you would approach "Querying Wikidata: Comparing SPARQL,
Relational and Graph Databases" in any new ways?

In the context of Wikidata, the question of language is mostly a
question of interface (which is itself non-trivial). But to answer the
question in whatever language or mode, the question first has to be
answered in some (machine-friendly) language. This is the direction in
which Wikidata goes: answers are first Q* identifiers, for which labels
in different languages can be generated and used to generate a mode.

Likewise our work is on the level of generating those Q* identifiers,
which can be later turned into tables, maps, sentences, bubbles, etc. I
think the interface question is an important one, but a different one to
that which we tackle.
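
A minimal sketch of that two-step idea, assuming a tiny hand-written
label table in place of Wikidata itself: the answer is computed as a
list of Q* identifiers, and labels for the requested language are looked
up only at presentation time. The names LABELS and render, the fallback
rule and the unlabelled identifier are all illustrative assumptions.

```python
# Tiny hand-written label table (an assumption for this sketch);
# a real interface would fetch labels from Wikidata instead.
LABELS = {
    "Q2": {"en": "Earth", "es": "Tierra", "de": "Erde"},
    "Q42": {"en": "Douglas Adams"},
}


def render(qids, lang, fallback="en"):
    """Turn language-independent Q* identifiers into labels for `lang`.

    Falls back to `fallback`, then to the bare identifier, when no
    label is available.
    """
    rendered = []
    for qid in qids:
        labels = LABELS.get(qid, {})
        rendered.append(labels.get(lang) or labels.get(fallback) or qid)
    return rendered


# The "answer" is language-independent; Q999999999 stands in for an
# identifier that has no entry in the sample table.
answer = ["Q2", "Q42", "Q999999999"]
spanish = render(answer, "es")
```

The same answer list can be passed to render with any language code, or
handed to a completely different presentation layer, without re-running
the query.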


Cheers,
Aidan

On Sat, Aug 6, 2016 at 12:29 PM, Markus Kroetzsch
<markus.kroetz...@tu-dresden.de>
wrote:

    Hi Aidan,

    Thanks, very interesting, though I have not read the details yet.

    I wonder if you have compared the actual query results you got from
    the different stores. As far as I know, Neo4J actually uses a very
    idiosyncratic query semantics that is neither compatible with SPARQL
    (not even on the BGP level) nor with SQL (even for
    SELECT-PROJECT-JOIN queries). So it is difficult to compare it to
    engines that use SQL or SPARQL (or any other standard query
    language, for that matter). In this sense, it may not be meaningful
    to benchmark it against such systems.

    Regarding Virtuoso, the reason for not picking it for Wikidata was
    the lack of load-balancing support in the open source version, not
    the performance of a single instance.

    Best regards,

    Markus



    On 06.08.2016 18:19, Aidan Hogan wrote:

        Hey all,

        Recently we wrote a paper discussing the query performance for
        Wikidata,
        comparing different possible representations of the
        knowledge-base in
        Postgres (a relational database), Neo4J (a graph database),
        Virtuoso (a
        SPARQL database) and BlazeGraph (the SPARQL database currently
        in use)
        for a set of equivalent benchmark queries.

        The paper was recently accepted for presentation at the
        International
        Semantic Web Conference (ISWC) 2016. A pre-print is available here:

        http://aidanhogan.com/docs/wikidata-sparql-relational-graph.pdf

        Of course there are some caveats with these results in the sense
        that
        perhaps other engines would perform better on different hardware, or
        different styles of queries: for this reason we tried to use the
        most
        general types of queries possible and tried to test different
        representations in different engines (we did not vary the hardware).
        Also in the discussion of results, we tried to give a more general
        explanation of the trends, highlighting some
        strengths/weaknesses for
        each engine independently of the particular queries/data.

        I think it's worth a glance for anyone who is interested in the
        technology/techniques needed to query Wikidata.

        Cheers,
        Aidan


        P.S., the paper above is a follow-up to a previous work with Markus
        Krötzsch that focussed purely on RDF/SPARQL:

        http://aidanhogan.com/docs/reification-wikidata-rdf-sparql.pdf

        (I'm not sure if it was previously mentioned on the list.)

        P.P.S., as someone who's somewhat of an outsider but who's been
        watching
        on for a few years now, I'd like to congratulate the community for
        making Wikidata what it is today. It's awesome work. Keep going. :)

        _______________________________________________
        Wikidata mailing list
        Wikidata@lists.wikimedia.org <mailto:Wikidata@lists.wikimedia.org>
        https://lists.wikimedia.org/mailman/listinfo/wikidata
        <https://lists.wikimedia.org/mailman/listinfo/wikidata>







--

- Scott MacLeod - Founder & President

- http://worlduniversityandschool.org

- 415 480 4577

- PO Box 442, (86 Ridgecrest Road), Canyon, CA 94516

- World University and School - like Wikipedia with best STEM-centric
OpenCourseWare - incorporated as a nonprofit university and school in
California, and is a U.S. 501 (c) (3) tax-exempt educational organization.




