Re: [Wikidata] Kickstartet: Adding 2.2 million German organisations to Wikidata

Sebastian Hellmann Mon, 16 Oct 2017 16:38:19 -0700

Ok, I put some effort into https://www.wikidata.org/wiki/Wikidata:Requests_for_permissions/Bot/Handelsregister to move the discussion there.


All the best,


Sebastian


On 16.10.2017 18:06, Yaroslav Blanter wrote:

Dear All,

it is great that we are having this discussion, but may I please suggest having it on the RfP page on Wikidata? People have already asked similar questions there and, in my experience, on-wiki discussion will likely lead to a refined request that accommodates all suggestions.


Cheers
Yaroslav

On Mon, Oct 16, 2017 at 5:53 PM, Sebastian Hellmann <hellm...@informatik.uni-leipzig.de> wrote:


    ah, ok, sorry, I was assuming that Blazegraph would transitively
    resolve this automatically.

    Ok, so let's divide the problem:

    # Task 1:

    Connect all existing organisations with the data from the
    handelsregister. (No new identifiers added, we can start right now)

    Add a constraint that all German organisations should be connected
    to a court, i.e. the registering organisation as well as the id
    assigned by the court.

    @all: any properties I can reuse for this?

    I will focus on this as it seems quite easy. We can first filter
    orgs by other criteria, i.e. country as a blocking key and then
    string match the rest.
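The blocking-then-matching idea described above could look roughly like the sketch below. It is only an illustration under assumptions: the record layout, the `match_by_block` helper, the 0.9 threshold and the placeholder QIDs are all made up here, and `difflib` stands in for whatever string matcher would actually be used.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def normalise(name: str) -> str:
    """Lower-case a name and collapse whitespace before comparing."""
    return " ".join(name.lower().split())


def match_by_block(registry, wikidata, threshold=0.9):
    """Compare names only inside the same block (here: the same country)."""
    blocks = defaultdict(list)
    for item in wikidata:
        blocks[item["country"]].append(item)

    matches = []
    for record in registry:
        # Only candidates sharing the blocking key are string-matched.
        for item in blocks.get(record["country"], []):
            score = SequenceMatcher(
                None, normalise(record["name"]), normalise(item["label"])
            ).ratio()
            if score >= threshold:
                matches.append((record["id"], item["qid"], round(score, 2)))
    return matches


# Made-up sample data; the QIDs are placeholders, not real Wikidata items.
registry = [
    {"id": "Leipzig HRB 32853", "name": "A&A Dienstleistungsgesellschaft mbH", "country": "DE"},
]
wikidata = [
    {"qid": "Q_EXAMPLE_1", "label": "A&A  Dienstleistungsgesellschaft mbH", "country": "DE"},
    {"qid": "Q_EXAMPLE_2", "label": "A&A Dienstleistungsgesellschaft mbH", "country": "AT"},
]
matches = match_by_block(registry, wikidata)
```

Blocking keeps the number of pairwise comparisons manageable: the second sample item is never compared because its country differs, even though its label is identical.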

    # Task 2:

    Add all missing identifiers for the remaining orgs in
    Handelsregister. Task 2 can be rediscussed and decided once
    Task 1 is sufficiently finished.


    # regarding maintenance:
    I find Wikidata as such very hard to maintain as all data is
    copied from somewhere else eventually, but Wikipedia has the same
    problem. In the case of the German Business register, maintenance
    is especially easy as the orgs are stable and uniquely
    identifiable. Even the fact that a company gets shut down should
    still be in Wikidata, so you have historical information. I mean,
    you also keep the Roman Empire, the Hanse and even finished
    projects in Wikidata. So even if an org ceases to exist, the entry
    in Wikidata should stay.

    # regarding Opencorporates
    I have a critical opinion of Opencorporates. It appears to be
    open, but you actually cannot get the data. If somebody has a
    data dump, please forward it to me. Thanks.
    On top of that, I consider Opencorporates a danger to open data. It
    appears to push open availability of data, but then it is limited
    to open licenses. Usefulness is limited as there are no free dumps
    and no possibility to duplicate it effectively. Wikipedia and
    Wikidata provide dumps and an API for exactly this reason.
    Every time somebody wants to create an open organisation dataset
    with no barriers, the existence of Opencorporates is blocking this.

    Cheers,
    Sebastian


    On 16.10.2017 15:34, Antonin Delpeuch (lists) wrote:

    And… my own count was wrong too, because I forgot to add DISTINCT in my
    query (if there are multiple paths from the class to "organization
    (Q43229)", items will appear multiple times).

    So, I get 1 168 084 now.
    http://tinyurl.com/yaeqlsnl

    It's easy to get these things wrong!
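As a toy illustration of the miscount described above, the sketch below models the explanation given in this message: one result row per item and per path to the root class. The ontology, the item names and the `count_paths` helper are hypothetical, and whether a real SPARQL engine returns one row per path depends on its property-path semantics.

```python
def count_paths(graph, start, goal):
    """Count the subclass paths from start to goal in an acyclic graph."""
    if start == goal:
        return 1
    return sum(count_paths(graph, parent, goal) for parent in graph.get(start, []))


# Hypothetical toy ontology: "bank" reaches "organization" along two paths.
subclass_of = {
    "bank": ["company", "financial institution"],
    "company": ["organization"],
    "financial institution": ["organization"],
}
instance_of = {"item A": "bank", "item B": "company"}

# One result row per (item, path), mimicking a query without DISTINCT.
rows = [
    item
    for item, cls in instance_of.items()
    for _ in range(count_paths(subclass_of, cls, "organization"))
]

# Collapsing the rows to a set corresponds to adding DISTINCT.
distinct_items = set(rows)
```

Here `len(rows)` overcounts because "item A" is listed once per path, while `len(distinct_items)` counts each item once.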

    Antonin

    On 16/10/2017 14:16, Antonin Delpeuch (lists) wrote:

    Thanks Ettore for spotting that!

    Wikidata types (P31) only make sense when you consider the "subclass of"
    (P279) property that we use to build the ontology (except in a few cases
    where the community has decided not to use any subclass for a particular
    type).

    So, to retrieve all items of a certain type in SPARQL, you need to use
    something like this:

    ?item wdt:P31/wdt:P279* ?type

    You can also have other variants to accept non-truthy statements.
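A small sketch of how such a query could be assembled, including DISTINCT and one possible non-truthy variant. The `count_instances_query` helper is hypothetical; the `wd:`, `wdt:`, `p:` and `ps:` prefixes are the ones predefined on the Wikidata Query Service, and `p:P31/ps:P31` is only one way to reach statements of any rank.

```python
def count_instances_query(root_qid: str, truthy: bool = True) -> str:
    """Build a SPARQL query counting distinct items under a root class."""
    # wdt: follows only truthy statements; p:/ps: goes through the
    # statement nodes and therefore also sees non-truthy statements.
    instance_path = "wdt:P31" if truthy else "p:P31/ps:P31"
    return (
        "SELECT (COUNT(DISTINCT ?item) AS ?count) WHERE {\n"
        f"  ?item {instance_path}/wdt:P279* wd:{root_qid} .\n"
        "}"
    )


query = count_instances_query("Q43229")
```

Printing `query` gives text that can be pasted into the query service; passing `truthy=False` swaps in the statement-based path.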

    Just with this truthy version, I currently get 1 208 227 items. But note
    that there are still a lot of items where P31 is not provided, or
    subclasses which have not been connected to "organization (Q43229)"…

    So in general, it's very hard to have any "guarantees that there are no
    duplicates", just because you don't have any guarantees that the
    information currently in Wikidata is complete or correct.

    I would recommend trying to import something a bit smaller to get
    acquainted with how Wikidata works and what the matching process looks
    like in practice. And beyond a one-off import, as Ettore said it is
    important to think how the data will be maintained in the future…

    Antonin

    On 16/10/2017 13:46, Ettore RIZZA wrote:

         - Wikidata has 40k organisations:

         https://query.wikidata.org/#SELECT %3Fitem %3FitemLabel %0AWHERE
         %0A{%0A %3Fitem wdt%3AP31 wd%3AQ43229.%0A SERVICE wikibase%3Alabel {
         bd%3AserviceParam wikibase%3Alanguage "[AUTO_LANGUAGE]%2Cen". }%0A}


    Hi,

    I think Wikidata contains many more organizations than that. If we
    choose the "instance of Business enterprise", we get 135570 results. And
    I imagine there are many other categories that bring together commercial
    companies.


    
    https://query.wikidata.org/#SELECT%20%3Fitem%20%3FitemLabel%20WHERE%20%7B%0A%20%20%3Fitem%20wdt%3AP31%20wd%3AQ4830453.%0A%20%20SERVICE%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20wikibase%3Alanguage%20%22%5BAUTO_LANGUAGE%5D%2Cen%22.%20%7D%0A%7D

    On the substance, the project to add all companies of a country would
    make Wikidata a kind of totally free clone of Open Corporates
    <https://opencorporates.com/>. I would of course be delighted to see
    that, but is it not a challenge to maintain such a database? Companies
    are like humans: they appear and disappear every day.

    2017-10-16 13:41 GMT+02:00 Sebastian Hellmann
    <hellm...@informatik.uni-leipzig.de>:

         Hi all,

         the technical challenges are not so difficult.

         - 2.2 million is the exact number of German organisations, i.e.
         associations and companies. They are also unique.

         - Wikidata has 40k organisations:

         https://query.wikidata.org/#SELECT %3Fitem %3FitemLabel %0AWHERE
         %0A{%0A %3Fitem wdt%3AP31 wd%3AQ43229.%0A SERVICE wikibase%3Alabel {
         bd%3AserviceParam wikibase%3Alanguage "[AUTO_LANGUAGE]%2Cen". }%0A}

         so there would be a maximum of 40k duplicates. These are easy to find
         and deduplicate.

         - The crawl can be done easily, a colleague has done so before.


         The issues here are:

         - Do you want to upload the data in Wikidata? It would be a really big
         extension. Can I go ahead?

         - If the data were available externally as structured data under
         open license, I would probably not suggest loading it into wikidata,
         as the data can be retrieved from the official source directly,
         however, here this data will not be published in a decent format.

         I thought that the way data is copied from copyrighted sources, i.e.
         only facts, is ok for Wikidata. This is done in a lot of places, I
         guess. The same goes for Wikipedia, where news articles and copyrighted
         books are referenced. So Wikimedia or the Wikimedia community are experts
         on this.

         All the best,

         Sebastian


         On 16.10.2017 10:18, Neubert, Joachim wrote:

         Hi Sebastian,

         This is huge! It will cover almost all currently existing German
         companies. Many of these will have similar names, so preparing for
         disambiguation is a concern.

         A good way for such an approach would be proposing a property for
         an external identifier, loading the data into Mix-n-match,
         creating links for companies already in Wikidata, and adding the
         rest (or perhaps only parts of them - I’m not sure if having all
         of them in Wikidata makes sense, but that’s another discussion),
         preferably with location and/or sector of trade in the description
         field.

         I’ve tried to figure out what could be used as the key for an external
         identifier property. However, it looks like the registry does not
         offer any (persistent) URL to its entries. So for looking up a
         company, apparently there are two options:

         - conducting an extended search for the exact string “A&A
         Dienstleistungsgesellschaft mbH”

         - copying the register number “32853”, selecting the court
         (Leipzig) from the corresponding dropdown list, and searching for that
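Since the registry offers no persistent URL, one conceivable key is the combination of court, register type and register number, as in "Leipzig HRB 32853" from the example entry. The sketch below only illustrates that idea: the `RegisterKey` type, the two helpers and the assumed "court, short letter code, number" layout are hypothetical, not an agreed identifier format.

```python
import re
from typing import NamedTuple


class RegisterKey(NamedTuple):
    court: str
    register_type: str  # e.g. "HRB" in the example entry
    number: str


def parse_register_key(text: str) -> RegisterKey:
    """Parse a string such as 'Leipzig HRB 32853' into its three parts."""
    # Assumed layout: court name, a 2-4 letter register code, digits.
    match = re.fullmatch(r"\s*(.+?)\s+([A-Za-z]{2,4})\s+(\d+)\s*", text)
    if match is None:
        raise ValueError(f"not a recognised register key: {text!r}")
    court, register_type, number = match.groups()
    return RegisterKey(court, register_type.upper(), number)


def format_register_key(key: RegisterKey) -> str:
    """Render the key back into a single identifier string."""
    return f"{key.court} {key.register_type} {key.number}"


key = parse_register_key("Leipzig HRB 32853")
```

A composite key like this would still have to be entered into the search form by hand, so it does not remove the lookup problem described above.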

         Neither way is very intuitive, even if we can provide a link to
         the search form. This would make a weak connection to the source
         of information. Much more important, it makes disambiguation in
         Mix-n-match difficult. This applies to the preparation of your
         initial load (you would not want to create duplicates). But much
         more so for everybody else who wants to match his or her data
         later on. Being forced to search for entries manually in a
         cumbersome way for disambiguation of a new, possibly large and
         rich dataset is, in my eyes, not something we want to impose on
         future contributors. And often, the free information they find in
         the registry (formal name, register number, legal form, address)
         will not easily match with the information they have (common name,
         location, perhaps founding date, and most important sector of
         trade), so disambiguation may still be difficult.

         Have you checked which parts of the accessible information as
         below can be crawled and added legally to external databases such
         as Wikidata?

         Cheers, Joachim

         --
         Joachim Neubert

         ZBW – German National Library of Economics
         Leibniz Information Centre for Economics
         Neuer Jungfernstieg 21
         20354 Hamburg
         Phone +49-42834-462

         *From:* Wikidata [mailto:wikidata-boun...@lists.wikimedia.org]
         *On Behalf Of* Sebastian Hellmann
         *Sent:* Sunday, 15 October 2017 09:45
         *To:* wikidata@lists.wikimedia.org
         *Subject:* [Wikidata] Kickstartet: Adding 2.2 million German
         organisations to Wikidata

         Hi all,

         the German business registry contains roughly 2.2 million
         organisations. Some information is paid, but other information is
         public, i.e. the info you get by searching at the URL below and
         clicking on UT (see example below):

         https://www.handelsregister.de/rp_web/mask.do?Typ=e

         I would like to add this to Wikidata, either by crawling or by
         raising money to use crowdsourcing platforms like CrowdFlower or
         Amazon Mechanical Turk.

         It should meet notability criterion 2:
         https://www.wikidata.org/wiki/Wikidata:Notability

             2. It refers to an instance of a *clearly identifiable
             conceptual or material entity*. The entity must be notable, in
             the sense that it *can be described using serious and publicly
             available references*. If there is no item about you yet, you
             are probably not notable.


         The reference is the official German business registry, which is
         serious and public. Orgs are also by definition clearly
         identifiable legal entities.

         How can I get clearance to proceed on this?

         All the best,
         Sebastian

         Entity data

         Saxony District court *Leipzig HRB 32853* – A&A
         Dienstleistungsgesellschaft mbH

         Legal status:             Gesellschaft mit beschränkter Haftung
         Capital:                  25.000,00 EUR
         Date of entry:            29/08/2016
                                   (When entering date of entry, wrong data
                                   input can occur due to system failures!)
         Date of removal:          -
         Balance sheet available:  -
         Address (subject to correction):
                                   A&A Dienstleistungsgesellschaft mbH
                                   Prager Straße 38-40
                                   04317 Leipzig

         --
         All the best,
         Sebastian Hellmann

         Director of Knowledge Integration and Linked Data Technologies
         (KILT) Competence Center
         at the Institute for Applied Informatics (InfAI) at Leipzig University
         Executive Director of the DBpedia Association
         Projects: http://dbpedia.org, http://nlp2rdf.org,
         http://linguistics.okfn.org, https://www.w3.org/community/ld4lt
         Homepage: http://aksw.org/SebastianHellmann
         Research Group: http://aksw.org



         _______________________________________________
         Wikidata mailing list
         Wikidata@lists.wikimedia.org
         https://lists.wikimedia.org/mailman/listinfo/wikidata








