Hi Micru,

On 07/08/18 17:26, David Cuenca Tudela wrote:
Hi Markus,

Thanks for making the logs available. Personally, I would be interested in knowing how often a certain item pops up in queries. That would make it easier to gauge the popularity of certain items.

Do you think it's something that could be accomplished?

This would be quite easy to do. Each query is one line in the files, and we have expanded all URLs, so every entity URL closes with ">", which is URL-encoded as "%3E". You can therefore simply run zgrep -c over the files to count the queries that mention the item. Make sure to include the closing "%3E" in the pattern, so that a search for Q123 does not also match Q1234. One such grep over any of the three larger files takes less than a minute.
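
For anyone who prefers to stay in Python, the same count can be sketched as below. This is only a minimal sketch, not part of the released tooling: the function name is made up for illustration, and it assumes, as described above, that the files are gzip-compressed with one query per line and that an item appears as its ID directly followed by "%3E".

```python
import gzip


def count_queries_mentioning(path, item_id):
    """Count the lines (queries) of a gzipped log file that mention item_id.

    Rough equivalent of `zgrep -c 'Q123%3E' file`. The closing "%3E" is
    appended so that Q123 does not also match Q1234.
    """
    needle = (item_id + "%3E").encode("ascii")
    count = 0
    with gzip.open(path, "rb") as log:
        for line in log:  # one query per line
            if needle in line:
                count += 1
    return count


# Example call (the file name is hypothetical):
# count_queries_mentioning("some_log_file.tsv.gz", "Q123")
```

Like zgrep -c, this counts matching lines, so a query that mentions the item several times is still counted once.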

If you want a sorted list of the "most popular" items, this is a bit more work: it would require at least a small Python script, or a less obvious combination of sed (to extract all entity URLs) and sort.
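
A Python script for the sorted list could look roughly like the sketch below. The names are hypothetical, and the regular expression is an assumption about the encoding: it expects an item ID to follow a slash (either literal "/" or URL-encoded "%2F") and to be closed by "%3E". Please check this against a few lines of the actual files before relying on the numbers.

```python
import gzip
import re
from collections import Counter

# Assumption: item URLs end in ".../Q<number>" followed by the encoded ">".
ITEM = re.compile(rb"(?:%2F|/)(Q[1-9][0-9]*)%3E")


def most_popular_items(paths, top=20):
    """Return the `top` items as (item_id, number_of_queries) pairs."""
    counts = Counter()
    for path in paths:
        with gzip.open(path, "rb") as log:
            for line in log:  # one query per line
                # set(): count each item at most once per query
                counts.update(set(ITEM.findall(line)))
    return [(item.decode("ascii"), n) for item, n in counts.most_common(top)]
```

Counting each item once per query keeps the result comparable to the zgrep -c numbers; drop the set() if you would rather count every occurrence.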

Best,

Markus


Regards,
Micru



On Tue, 7 Aug 2018, 17:01 Markus Kroetzsch, <markus.kroetz...@tu-dresden.de <mailto:markus.kroetz...@tu-dresden.de>> wrote:

    Dear all,

    I am happy to announce that, as part of an ongoing research
    collaboration between TU Dresden researchers and Wikimedia [1], we can
    now release pre-processed logs from the Wikidata SPARQL Query Service
    [2]. You can find details and download links on the following page:

    https://iccl.inf.tu-dresden.de/web/Wikidata_SPARQL_Logs/en

    The data so far comprises over 200 million queries answered in
    June-August 2017. There is also an accompanying publication that
    describes the workings of and practical experiences with the SPARQL
    query service [3].

    The logs have been pre-processed to remove information that could
    potentially be used for identifying individual users (e.g., comments
    were removed, geo-coordinates coarsened, and query strings reformatted
    completely -- see above page for details). Nevertheless, one can still
    learn many interesting things from the logs, e.g., which properties and
    entities are used in queries, which SPARQL features are most prominent,
    or which languages are requested.

    We also have preserved some amount of user agent information, but
    without overly detailed software versions and only in cases where the
    agents occurred many times across several weeks. This can at least be
    used to recognise the (significant) number of queries generated, e.g.,
    by Magnus' tools, or to do a rough analysis of which software platforms
    are mostly used to send queries from. We used #TOOL comments found in
    queries to refine user agent information in some cases.

    We also made an effort to identify those queries that come from browser
    agents *and* also behave like one would expect from a browser (not all
    "browsers" did). We called such queries "organic" and provide this
    classification with the logs (there is also a filtered dump of only
    organic queries, which is much smaller and therefore nicer to process,
    also for testing). See the paper for details on our methodology.

    Finally, the data contains the time of each request, so one can
    reconstruct query loads over time.

    Feedback is very welcome, both in terms of comments on the data (is it
    useful to you? would you like to see more? do you have concerns?) and in
    terms of insights that you can get from it (we did some analyses but one
    can surely do more).

    Cheers,

    Markus

    [1]
    https://meta.wikimedia.org/wiki/Research:Understanding_Wikidata_Queries
    [2] https://query.wikidata.org/ (or rather the web service that powers
    this UI and many other applications).
    [3] Stanislav Malyshev, Markus Krötzsch, Larry González, Julius Gonsior,
    Adrian Bielefeldt: Getting the Most out of Wikidata: Semantic Technology
    Usage in Wikipedia’s Knowledge Graph. In Proceedings of the 17th
    International Semantic Web Conference (ISWC-18), Springer 2018.
    https://iccl.inf.tu-dresden.de/web/Wikidata_SPARQL_Logs/en

-- Prof. Dr. Markus Kroetzsch
    Knowledge-Based Systems Group
    Center for Advancing Electronics Dresden (cfaed)
    Faculty of Computer Science
    TU Dresden
    +49 351 463 38486
    https://kbs.inf.tu-dresden.de/

    _______________________________________________
    Wikidata mailing list
    Wikidata@lists.wikimedia.org <mailto:Wikidata@lists.wikimedia.org>
    https://lists.wikimedia.org/mailman/listinfo/wikidata




