Re: [Wikidata] Kickstartet: Adding 2.2 million German organisations to Wikidata

hellmann Mon, 16 Oct 2017 05:40:53 -0700

The best way to avoid creating duplicates is then to go through all existing 
organizations in Wikidata, manually add the court and register number to those 
that are German, and exclude these from the import.

This guarantees that there will be no duplicates.
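
A minimal sketch of that exclusion step in Python, assuming every register 
entry can be reduced to a court+type+number key (the same shape as 
/Leipzig_HRB_32853 from my earlier mail) and that the keys of the German 
organisations already in Wikidata have been collected by hand as described. 
The names RegisterEntry and exclude_existing and the second sample entry are 
made up for illustration; this is not an existing Wikidata or Handelsregister 
API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegisterEntry:
    court: str      # registering district court, e.g. "Leipzig"
    reg_type: str   # register type, e.g. "HRB"
    number: str     # register number, e.g. "32853"
    name: str = ""  # formal name, not part of the key

    def key(self) -> str:
        # Assumed key format: court_type_number, spaces replaced by underscores.
        parts = (self.court, self.reg_type, self.number)
        return "_".join(p.strip().replace(" ", "_") for p in parts)


def exclude_existing(registry_entries, existing_keys):
    """Return the entries whose key is not yet known, in their original order."""
    seen = set(existing_keys)
    fresh = []
    for entry in registry_entries:
        k = entry.key()
        if k not in seen:
            seen.add(k)  # also drops repeats inside the crawl itself
            fresh.append(entry)
    return fresh


# Keys added by hand to the German organisations already in Wikidata.
already_in_wikidata = {"Leipzig_HRB_32853"}

crawl = [
    RegisterEntry("Leipzig", "HRB", "32853", "A&A Dienstleistungsgesellschaft mbH"),
    RegisterEntry("Leipzig", "HRB", "99999", "Hypothetical GmbH"),  # invented entry
]

to_import = exclude_existing(crawl, already_in_wikidata)
print([e.key() for e in to_import])  # ['Leipzig_HRB_99999']
```

The key deliberately ignores the name, since many companies have similar 
names while the court+type+number combination is kept unique by the courts.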

So the technical side is feasible.
Barriers are political and legal.

Sebastian 

On 16 October 2017 14:24:51 CEST, Sebastian Hellmann 
<hellm...@informatik.uni-leipzig.de> wrote:
>Ah yes, forgot to mention:
>
>there is no URI or unique identifier given by the Handelsregister 
>system. However, the courts take care that the registrations are unique, 
>so it is implicit. Handelsregister could easily create stable URIs out 
>of the court+type+number like /Leipzig_HRB_32853
>
>For Wikidata this is not a problem to handle. So no technical issues 
>from this side either.
>
>All the best,
>
>Sebastian
>
>
>On 16.10.2017 13:41, Sebastian Hellmann wrote:
>>
>> Hi all,
>>
>> the technical challenges are not so difficult.
>>
>> - 2.2 million is the exact number of German organisations, i.e. 
>> associations and companies. They are also unique.
>>
>> - Wikidata has 40k organisations:
>>
>> https://query.wikidata.org/#SELECT %3Fitem %3FitemLabel %0AWHERE 
>> %0A{%0A %3Fitem wdt%3AP31 wd%3AQ43229.%0A SERVICE wikibase%3Alabel { 
>> bd%3AserviceParam wikibase%3Alanguage "[AUTO_LANGUAGE]%2Cen". }%0A}
>>
>> so there would be a maximum of 40k duplicates. These are easy to find 
>> and deduplicate.
>>
>> - The crawl can be done easily, a colleague has done so before.
>>
>>
>> The issues here are:
>>
>> - Do you want the data uploaded to Wikidata? It would be a really big 
>> extension. Can I go ahead?
>>
>> - If the data were available externally as structured data under an open 
>> license, I would probably not suggest loading it into Wikidata, as the 
>> data can be retrieved from the official source directly. Here, however, 
>> this data will not be published in a decent format.
>>
>> I thought that the way data is copied from copyrighted sources, i.e. 
>> only facts, is OK for Wikidata. This is done in a lot of places, I guess. 
>> The same goes for Wikipedia, i.e. news articles and copyrighted books are 
>> referenced. So Wikimedia or the Wikimedia community are the experts on this.
>>
>> All the best,
>>
>> Sebastian
>>
>>
>> On 16.10.2017 10:18, Neubert, Joachim wrote:
>>>
>>> Hi Sebastian,
>>>
>>> This is huge! It will cover almost all currently existing German 
>>> companies. Many of these will have similar names, so preparing for 
>>> disambiguation is a concern.
>>>
>>> A good way for such an approach would be proposing a property for an 
>>> external identifier, loading the data into Mix-n-match, creating 
>>> links for companies already in Wikidata, and adding the rest (or 
>>> perhaps only parts of them - I’m not sure if having all of them in 
>>> Wikidata makes sense, but that’s another discussion), preferably with 
>>> location and/or sector of trade in the description field.
>>>
>>> I’ve tried to figure out what could be used as the key for an external 
>>> identifier property. However, it looks like the registry does not 
>>> offer any (persistent) URL to its entries. So for looking up a 
>>> company, apparently there are two options:
>>>
>>> - conducting an extended search for the exact string “A&A 
>>> Dienstleistungsgesellschaft mbH”
>>>
>>> - copying the register number “32853”, selecting the court 
>>> (Leipzig) from the corresponding dropdown list, and searching for that
>>>
>>> Both ways are not very intuitive, even if we can provide a link to 
>>> the search form. This would make a weak connection to the source of 
>>> information. Much more important, it makes disambiguation in 
>>> Mix-n-match difficult. This applies to the preparation of your 
>>> initial load (you would not want to create duplicates), but much more 
>>> so to everybody else who wants to match his or her data later on. 
>>> Being forced to search for entries manually in a cumbersome way for 
>>> disambiguation of a new, possibly large and rich dataset is, in my 
>>> eyes, not something we want to impose on future contributors. And 
>>> often, the free information they find in the registry (formal name, 
>>> register number, legal form, address) will not easily match the 
>>> information they have (common name, location, perhaps founding date, 
>>> and most important sector of trade), so disambiguation may still be 
>>> difficult.
>>>
>>> Have you checked which parts of the accessible information as below 
>>> can be crawled and added legally to external databases such as Wikidata?
>>>
>>> Cheers, Joachim
>>>
>>> --
>>>
>>> Joachim Neubert
>>>
>>> ZBW – German National Library of Economics
>>>
>>> Leibniz Information Centre for Economics
>>>
>>> Neuer Jungfernstieg 21
>>> 20354 Hamburg
>>>
>>> Phone +49-42834-462
>>>
>>> *From:* Wikidata [mailto:wikidata-boun...@lists.wikimedia.org] *On 
>>> Behalf Of* Sebastian Hellmann
>>> *Sent:* Sunday, 15 October 2017 09:45
>>> *To:* wikidata@lists.wikimedia.org <mailto:wikidata@lists.wikimedia.org>
>>> *Subject:* [Wikidata] Kickstartet: Adding 2.2 million German 
>>> organisations to Wikidata
>>>
>>> Hi all,
>>>
>>> the German business registry contains roughly 2.2 million 
>>> organisations. Some information is paid, but other information is 
>>> public, i.e. the info you get by searching at the link below and 
>>> clicking on UT (see example below):
>>>
>>> https://www.handelsregister.de/rp_web/mask.do?Typ=e
>>>
>>> I would like to add this to Wikidata, either by crawling or by 
>>> raising money to use crowdsourcing services like CrowdFlower or Amazon 
>>> Mechanical Turk.
>>>
>>> It should meet notability criteria 2: 
>>> https://www.wikidata.org/wiki/Wikidata:Notability
>>>
>>>     2. It refers to an instance of a *clearly identifiable conceptual
>>>     or material entity*. The entity must be notable, in the sense
>>>     that it *can be described using serious and publicly available
>>>     references*. If there is no item about you yet, you are probably
>>>     not notable.
>>>
>>>
>>> The reference is the official German business registry, which is 
>>> serious and public. Orgs are also by definition clearly identifiable 
>>> legal entities.
>>>
>>> How can I get clearance to proceed on this?
>>>
>>> All the best,
>>> Sebastian
>>>
>>>
>>> Entity data
>>>
>>> Saxony District court *Leipzig HRB 32853* – A&A 
>>> Dienstleistungsgesellschaft mbH
>>>
>>> Legal status: Gesellschaft mit beschränkter Haftung
>>> Capital: 25.000,00 EUR
>>> Date of entry: 29/08/2016
>>> (When entering date of entry, wrong data input can occur due to 
>>> system failures!)
>>> Date of removal: -
>>> Balance sheet available: -
>>> Address (subject to correction): A&A Dienstleistungsgesellschaft mbH, 
>>> Prager Straße 38-40, 04317 Leipzig
>>>
>>> -- 
>>> All the best,
>>> Sebastian Hellmann
>>>
>>> Director of Knowledge Integration and Linked Data Technologies (KILT) 
>>> Competence Center
>>> at the Institute for Applied Informatics (InfAI) at Leipzig University
>>> Executive Director of the DBpedia Association
>>> Projects: http://dbpedia.org, http://nlp2rdf.org, 
>>> http://linguistics.okfn.org, https://www.w3.org/community/ld4lt 
>>> <http://www.w3.org/community/ld4lt>
>>> Homepage: http://aksw.org/SebastianHellmann
>>> Research Group: http://aksw.org
>>>
>>>
>>>
>>> _______________________________________________
>>> Wikidata mailing list
>>> Wikidata@lists.wikimedia.org
>>> https://lists.wikimedia.org/mailman/listinfo/wikidata
>>
>


