Re: [Wikidata] Kickstartet: Adding 2.2 million German organisations to Wikidata

Ettore RIZZA Mon, 16 Oct 2017 07:09:23 -0700

@Antonin: Thanks for this counting method, it seems very effective (I
already knew that there were 3.6 M humans (Q5) in Wikidata).


https://query.wikidata.org/#%23compter%20le%20nombre%20d%27%C3%A9l%C3%A9ments%20appartenant%20%C3%A0%20la%20cat%C3%A9gorie%0A%23organisation%20ou%20%C3%A0%20ses%20enfants%0ASELECT%20DISTINCT%20%28COUNT%28DISTINCT%20%3Fitem%29%20AS%20%3Fcount%29%20WHERE%20%7B%20%3Fitem%20%28wdt%3AP31%2Fwdt%3AP279%2a%29%20wd%3AQ5.%20%7D
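
For anyone who wants to reuse the counting query with another class, here is
a minimal Python sketch of how the query text and its query.wikidata.org link
could be built. The helper names count_query and query_service_url are made
up for illustration; the sketch only assembles strings and does not send
anything to the endpoint.

```python
from urllib.parse import quote


def count_query(class_qid: str) -> str:
    """Build a SPARQL query counting items of a class or of any of its subclasses.

    wdt:P31/wdt:P279* follows one "instance of" step and then zero or more
    "subclass of" steps. COUNT(DISTINCT ...) keeps an item from being
    counted more than once when it matches in several ways.
    """
    return (
        "SELECT (COUNT(DISTINCT ?item) AS ?count) WHERE { "
        f"?item wdt:P31/wdt:P279* wd:{class_qid} . "
        "}"
    )


def query_service_url(sparql: str) -> str:
    """Percent-encode the query and put it after the '#' of the query service UI."""
    return "https://query.wikidata.org/#" + quote(sparql, safe="")


# Example: link for counting organisations (Q43229) including subclasses.
print(query_service_url(count_query("Q43229")))
```

Passing "Q5" instead of "Q43229" should give a query equivalent to the one
behind the link above.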

2017-10-16 15:34 GMT+02:00 Antonin Delpeuch (lists) <
li...@antonin.delpeuch.eu>:

> And… my own count was wrong too, because I forgot to add DISTINCT in my
> query (if there are multiple paths from the class to "organization
> (Q43229)", items will appear multiple times).
>
> So, I get 1 168 084 now.
> http://tinyurl.com/yaeqlsnl
>
> It's easy to get these things wrong!
>
> Antonin
>
> On 16/10/2017 14:16, Antonin Delpeuch (lists) wrote:
> > Thanks Ettore for spotting that!
> >
> > Wikidata types (P31) only make sense when you consider the "subclass of"
> > (P279) property that we use to build the ontology (except in a few cases
> > where the community has decided not to use any subclass for a particular
> > type).
> >
> > So, to retrieve all items of a certain type in SPARQL, you need to use
> > something like this:
> >
> > ?item wdt:P31/wdt:P279* ?type
> >
> > You can also have other variants to accept non-truthy statements.
> >
> > Just with this truthy version, I currently get 1 208 227 items. But note
> > that there are still a lot of items where P31 is not provided, or
> > subclasses which have not been connected to "organization (Q43229)"…
> >
> > So in general, it's very hard to have any "guarantees that there are no
> > duplicates", just because you don't have any guarantees that the
> > information currently in Wikidata is complete or correct.
> >
> > I would recommend trying to import something a bit smaller to get
> > acquainted with how Wikidata works and what the matching process looks
> > like in practice. And beyond a one-off import, as Ettore said it is
> > important to think how the data will be maintained in the future…
> >
> > Antonin
> >
> > On 16/10/2017 13:46, Ettore RIZZA wrote:
> >>     - Wikidata has 40k organisations:
> >>
> >>     https://query.wikidata.org/#SELECT %3Fitem %3FitemLabel %0AWHERE
> >>     %0A{%0A %3Fitem wdt%3AP31 wd%3AQ43229.%0A SERVICE wikibase%3Alabel {
> >>     bd%3AserviceParam wikibase%3Alanguage "[AUTO_LANGUAGE]%2Cen". }%0A}
> >>
> >>
> >> Hi,
> >>
> >> I think Wikidata contains many more organizations than that. If we
> >> choose "instance of business enterprise (Q4830453)", we get 135570
> >> results. And I imagine there are many other classes that group
> >> commercial companies.
> >>
> >>
> >> https://query.wikidata.org/#SELECT%20%3Fitem%20%3FitemLabel%20WHERE%20%7B%0A%20%20%3Fitem%20wdt%3AP31%20wd%3AQ4830453.%0A%20%20SERVICE%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20wikibase%3Alanguage%20%22%5BAUTO_LANGUAGE%5D%2Cen%22.%20%7D%0A%7D
> >>
> >> On the substance, the project to add all companies of a country would
> >> make Wikidata a kind of totally free clone of OpenCorporates
> >> <https://opencorporates.com/>. I would of course be delighted to see
> >> that, but is it not a challenge to maintain such a database? Companies
> >> are like humans: they appear and disappear every day.
> >>
> >>
> >>
> >> 2017-10-16 13:41 GMT+02:00 Sebastian Hellmann
> >> <hellm...@informatik.uni-leipzig.de
> >> <mailto:hellm...@informatik.uni-leipzig.de>>:
> >>
> >>     Hi all,
> >>
> >>     the technical challenges are not so difficult.
> >>
> >>     - 2.2 million is the exact number of German organisations, i.e.
> >>     associations and companies. They are also unique.
> >>
> >>     - Wikidata has 40k organisations:
> >>
> >>     https://query.wikidata.org/#SELECT %3Fitem %3FitemLabel %0AWHERE
> >>     %0A{%0A %3Fitem wdt%3AP31 wd%3AQ43229.%0A SERVICE wikibase%3Alabel {
> >>     bd%3AserviceParam wikibase%3Alanguage "[AUTO_LANGUAGE]%2Cen". }%0A}
> >>
> >>     so there would be a maximum of 40k duplicates. These are easy to
> >>     find and deduplicate.
> >>
> >>     - The crawl can be done easily, a colleague has done so before.
> >>
> >>
> >>     The issues here are:
> >>
> >>     - Do you want the data uploaded to Wikidata? It would be a really
> >>     big extension. Can I go ahead?
> >>
> >>     - If the data were available externally as structured data under
> >>     an open license, I would probably not suggest loading it into
> >>     Wikidata, as the data could be retrieved from the official source
> >>     directly. Here, however, this data will not be published in a
> >>     decent format.
> >>
> >>     I thought that the way data is copied from copyrighted sources,
> >>     i.e. only facts, is ok for Wikidata. This is done in a lot of
> >>     places, I guess. The same goes for Wikipedia, i.e. news articles
> >>     and copyrighted books are referenced. So Wikimedia or the Wikimedia
> >>     community are experts on this.
> >>
> >>     All the best,
> >>
> >>     Sebastian
> >>
> >>
> >>     On 16.10.2017 10:18, Neubert, Joachim wrote:
> >>>
> >>>     Hi Sebastian,
> >>>
> >>>     This is huge! It will cover almost all currently existing German
> >>>     companies. Many of these will have similar names, so preparing for
> >>>     disambiguation is a concern.
> >>>
> >>>     A good approach would be to propose a property for an external
> >>>     identifier, load the data into Mix-n-match, create links for
> >>>     companies already in Wikidata, and add the rest (or perhaps only
> >>>     parts of them - I’m not sure if having all of them in Wikidata
> >>>     makes sense, but that’s another discussion), preferably with
> >>>     location and/or sector of trade in the description field.
> >>>
> >>>     I’ve tried to figure out what could be used as key for an external
> >>>     identifier property. However, it looks like the registry does not
> >>>     offer any (persistent) URL to its entries. So for looking up a
> >>>     company, apparently there are two options:
> >>>
> >>>     - conducting an extended search for the exact string “A&A
> >>>       Dienstleistungsgesellschaft mbH”
> >>>
> >>>     - copying the register number “32853”, selecting the court
> >>>       (Leipzig) from the corresponding dropdown list, and searching
> >>>       for that
> >>>
> >>>     Neither way is very intuitive, even if we can provide a link to
> >>>     the search form. This would make a weak connection to the source
> >>>     of information. Much more importantly, it makes disambiguation in
> >>>     Mix-n-match difficult. This applies to the preparation of your
> >>>     initial load (you would not want to create duplicates), but much
> >>>     more so to everybody else who wants to match his or her data
> >>>     later on. Being forced to search for entries manually in a
> >>>     cumbersome way for disambiguation of a new, possibly large and
> >>>     rich dataset is, in my eyes, not something we want to impose on
> >>>     future contributors. And often, the free information they find in
> >>>     the registry (formal name, register number, legal form, address)
> >>>     will not easily match the information they have (common name,
> >>>     location, perhaps founding date, and most importantly sector of
> >>>     trade), so disambiguation may still be difficult.
> >>>
> >>>     Have you checked which parts of the accessible information shown
> >>>     below can be crawled and added legally to external databases such
> >>>     as Wikidata?
> >>>
> >>>     Cheers, Joachim
> >>>
> >>>     --
> >>>     Joachim Neubert
> >>>
> >>>     ZBW – German National Library of Economics
> >>>     Leibniz Information Centre for Economics
> >>>     Neuer Jungfernstieg 21
> >>>     20354 Hamburg
> >>>     Phone +49-42834-462
> >>>
> >>>     *From:* Wikidata [mailto:wikidata-boun...@lists.wikimedia.org]
> >>>     *On behalf of* Sebastian Hellmann
> >>>     *Sent:* Sunday, 15 October 2017 09:45
> >>>     *To:* wikidata@lists.wikimedia.org
> >>>     *Subject:* [Wikidata] Kickstartet: Adding 2.2 million German
> >>>     organisations to Wikidata
> >>>
> >>>     Hi all,
> >>>
> >>>     the German business registry contains roughly 2.2 million
> >>>     organisations. Some information is paid, but other information is
> >>>     public, i.e. the info you get by searching at the address below
> >>>     and clicking on UT (see example below):
> >>>
> >>>     https://www.handelsregister.de/rp_web/mask.do?Typ=e
> >>>
> >>>     I would like to add this to Wikidata, either by crawling or by
> >>>     raising money to use crowdsourcing services like CrowdFlower or
> >>>     Amazon Mechanical Turk.
> >>>
> >>>     It should meet notability criterion 2:
> >>>     https://www.wikidata.org/wiki/Wikidata:Notability
> >>>
> >>>         2. It refers to an instance of a *clearly identifiable
> >>>         conceptual or material entity*. The entity must be notable, in
> >>>         the sense that it *can be described using serious and publicly
> >>>         available references*. If there is no item about you yet, you
> >>>         are probably not notable.
> >>>
> >>>     The reference is the official German business registry, which is
> >>>     serious and public. Orgs are also by definition clearly
> >>>     identifiable legal entities.
> >>>
> >>>     How can I get clearance to proceed on this?
> >>>
> >>>     All the best,
> >>>     Sebastian
> >>>
> >>>     Entity data
> >>>
> >>>     Saxony District court *Leipzig HRB 32853* – A&A
> >>>     Dienstleistungsgesellschaft mbH
> >>>
> >>>     Legal status: Gesellschaft mit beschränkter Haftung
> >>>     Capital: 25.000,00 EUR
> >>>     Date of entry: 29/08/2016
> >>>     (When entering date of entry, wrong data input can occur due to
> >>>     system failures!)
> >>>     Date of removal: -
> >>>     Balance sheet available: -
> >>>     Address (subject to correction):
> >>>     A&A Dienstleistungsgesellschaft mbH
> >>>     Prager Straße 38-40
> >>>     04317 Leipzig
> >>>
> >>>
> >>>     --
> >>>     All the best,
> >>>     Sebastian Hellmann
> >>>
> >>>     Director of Knowledge Integration and Linked Data Technologies
> >>>     (KILT) Competence Center
> >>>     at the Institute for Applied Informatics (InfAI) at Leipzig University
> >>>     Executive Director of the DBpedia Association
> >>>     Projects: http://dbpedia.org, http://nlp2rdf.org,
> >>>     http://linguistics.okfn.org, https://www.w3.org/community/ld4lt
> >>>     Homepage: http://aksw.org/SebastianHellmann
> >>>     Research Group: http://aksw.org
> >>>
> >>>
> >>>
> >>>     _______________________________________________
> >>>     Wikidata mailing list
> >>>     Wikidata@lists.wikimedia.org <mailto:Wikidata@lists.wikimedia.org>
> >>>     https://lists.wikimedia.org/mailman/listinfo/wikidata
> >>>     <https://lists.wikimedia.org/mailman/listinfo/wikidata>
> >>