@Antonin: Thanks for this counting method, it seems very effective (I already knew that there were 3.6 M humans (Q5) in Wikidata).
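For anyone who wants to reuse the method with another class, here is a minimal sketch in Python. It only builds the counting query and the matching query.wikidata.org link, it does not run anything against the endpoint. The helper names (count_query, wdqs_link, query_from_link) are my own and purely illustrative; the property path is the one Antonin gave, and the DISTINCT inside COUNT is the fix he mentions for classes reachable through several paths.

```python
from urllib.parse import quote, unquote

# Query editor of the Wikidata Query Service; the SPARQL text goes after '#'.
WDQS_UI = "https://query.wikidata.org/#"


def count_query(class_qid: str) -> str:
    """SPARQL counting distinct instances of a class or of any of its subclasses."""
    return (
        "SELECT (COUNT(DISTINCT ?item) AS ?count) WHERE {\n"
        f"  ?item wdt:P31/wdt:P279* wd:{class_qid} .\n"
        "}"
    )


def wdqs_link(sparql: str) -> str:
    """Percent-encode a query into a shareable query.wikidata.org link."""
    return WDQS_UI + quote(sparql, safe="")


def query_from_link(link: str) -> str:
    """Recover the SPARQL text from such a link (handy for reading links in a thread)."""
    return unquote(link.split("#", 1)[1])


if __name__ == "__main__":
    # Q5 = human, Q43229 = organization (the two classes discussed in this thread)
    for qid in ("Q5", "Q43229"):
        query = count_query(qid)
        print(query)
        print(wdqs_link(query))
```

Passing the long link in this message to query_from_link should give back the readable query text, including its comment lines.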
https://query.wikidata.org/#%23compter%20le%20nombre%20d%27%C3%A9l%C3%A9ments%20appartenant%20%C3%A0%20la%20cat%C3%A9gorie%0A%23organisation%20ou%20%C3%A0%20ses%20enfants%0ASELECT%20DISTINCT%20%28COUNT%28DISTINCT%20%3Fitem%29%20AS%20%3Fcount%29%20WHERE%20%7B%20%3Fitem%20%28wdt%3AP31%2Fwdt%3AP279%2a%29%20wd%3AQ5.%20%7D

2017-10-16 15:34 GMT+02:00 Antonin Delpeuch (lists) <li...@antonin.delpeuch.eu>:

> And… my own count was wrong too, because I forgot to add DISTINCT in my
> query (if there are multiple paths from the class to "organization
> (Q43229)", items will appear multiple times).
>
> So, I get 1 168 084 now.
> http://tinyurl.com/yaeqlsnl
>
> It's easy to get these things wrong!
>
> Antonin
>
> On 16/10/2017 14:16, Antonin Delpeuch (lists) wrote:
> > Thanks Ettore for spotting that!
> >
> > Wikidata types (P31) only make sense when you consider the "subclass of"
> > (P279) property that we use to build the ontology (except in a few cases
> > where the community has decided not to use any subclass for a particular
> > type).
> >
> > So, to retrieve all items of a certain type in SPARQL, you need to use
> > something like this:
> >
> > ?item wdt:P31/wdt:P279* ?type
> >
> > You can also have other variants to accept non-truthy statements.
> >
> > Just with this truthy version, I currently get 1 208 227 items. But note
> > that there are still a lot of items where P31 is not provided, or
> > subclasses which have not been connected to "organization (Q43229)"…
> >
> > So in general, it's very hard to have any "guarantees that there are no
> > duplicates", just because you don't have any guarantees that the
> > information currently in Wikidata is complete or correct.
> >
> > I would recommend trying to import something a bit smaller to get
> > acquainted with how Wikidata works and what the matching process looks
> > like in practice. And beyond a one-off import, as Ettore said, it is
> > important to think about how the data will be maintained in the future…
> >
> > Antonin
> >
> > On 16/10/2017 13:46, Ettore RIZZA wrote:
> >> - Wikidata has 40k organisations:
> >>
> >> https://query.wikidata.org/#SELECT %3Fitem %3FitemLabel %0AWHERE %0A{%0A %3Fitem wdt%3AP31 wd%3AQ43229.%0A SERVICE wikibase%3Alabel { bd%3AserviceParam wikibase%3Alanguage "[AUTO_LANGUAGE]%2Cen". }%0A}
> >>
> >>
> >> Hi,
> >>
> >> I think Wikidata contains many more organizations than that. If we
> >> choose "instance of business enterprise", we get 135570 results. And
> >> I imagine there are many other categories that bring together commercial
> >> companies.
> >>
> >> https://query.wikidata.org/#SELECT%20%3Fitem%20%3FitemLabel%20WHERE%20%7B%0A%20%20%3Fitem%20wdt%3AP31%20wd%3AQ4830453.%0A%20%20SERVICE%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20wikibase%3Alanguage%20%22%5BAUTO_LANGUAGE%5D%2Cen%22.%20%7D%0A%7D
> >>
> >> On the substance, the project to add all companies of a country would
> >> make Wikidata a kind of totally free clone of Open Corporates
> >> <https://opencorporates.com/>. I would of course be delighted to see
> >> that, but is it not a challenge to maintain such a database? Companies
> >> are like humans: they appear and disappear every day.
> >>
> >>
> >> 2017-10-16 13:41 GMT+02:00 Sebastian Hellmann
> >> <hellm...@informatik.uni-leipzig.de>:
> >>
> >>     Hi all,
> >>
> >>     the technical challenges are not so difficult.
> >>
> >>     - 2.2 million is the exact number of German organisations, i.e.
> >>     associations and companies. They are also unique.
> >>
> >>     - Wikidata has 40k organisations:
> >>
> >>     https://query.wikidata.org/#SELECT %3Fitem %3FitemLabel %0AWHERE %0A{%0A %3Fitem wdt%3AP31 wd%3AQ43229.%0A SERVICE wikibase%3Alabel { bd%3AserviceParam wikibase%3Alanguage "[AUTO_LANGUAGE]%2Cen". }%0A}
> >>
> >>     so there would be a maximum of 40k duplicates. These are easy to
> >>     find and deduplicate.
> >>
> >>     - The crawl can be done easily, a colleague has done so before.
> >>
> >>
> >>     The issues here are:
> >>
> >>     - Do you want to upload the data in Wikidata? It would be a really
> >>     big extension. Can I go ahead?
> >>
> >>     - If the data were available externally as structured data under an
> >>     open license, I would probably not suggest loading it into Wikidata,
> >>     as the data could be retrieved from the official source directly.
> >>     However, here this data will not be published in a decent format.
> >>
> >>     I thought that the way data is copied from copyrighted sources, i.e.
> >>     only facts, is OK for Wikidata. This is done in a lot of places, I
> >>     guess. Same for Wikipedia, i.e. news articles and copyrighted books
> >>     are referenced. So Wikimedia or the Wikimedia community are experts
> >>     on this.
> >>
> >>     All the best,
> >>
> >>     Sebastian
> >>
> >>
> >>     On 16.10.2017 10:18, Neubert, Joachim wrote:
> >>>
> >>> Hi Sebastian,
> >>>
> >>> This is huge! It will cover almost all currently existing German
> >>> companies. Many of these will have similar names, so preparing for
> >>> disambiguation is a concern.
> >>>
> >>> A good way for such an approach would be proposing a property for
> >>> an external identifier, loading the data into Mix-n-match,
> >>> creating links for companies already in Wikidata, and adding the
> >>> rest (or perhaps only parts of them - I’m not sure if having all
> >>> of them in Wikidata makes sense, but that’s another discussion),
> >>> preferably with location and/or sector of trade in the description
> >>> field.
> >>>
> >>> I’ve tried to figure out what could be used as key for an external
> >>> identifier property. However, it looks like the registry does not
> >>> offer any (persistent) URL to its entries. So for looking up a
> >>> company, apparently there are two options:
> >>>
> >>> - conducting an extended search for the exact string “A&A
> >>>   Dienstleistungsgesellschaft mbH“
> >>>
> >>> - copying the register number “32853”, selecting the court
> >>>   (Leipzig) from the corresponding dropdown list, and searching
> >>>   for that
> >>>
> >>> Neither way is very intuitive, even if we can provide a link to
> >>> the search form. This would make a weak connection to the source
> >>> of information. Much more important, it makes disambiguation in
> >>> Mix-n-match difficult. This applies to the preparation of your
> >>> initial load (you would not want to create duplicates). But much
> >>> more so for everybody else who wants to match his or her data
> >>> later on. Being forced to search for entries manually in a
> >>> cumbersome way for disambiguation of a new, possibly large and
> >>> rich dataset is, in my eyes, not something we want to impose on
> >>> future contributors. And often, the free information they find in
> >>> the registry (formal name, register number, legal form, address)
> >>> will not easily match with the information they have (common name,
> >>> location, perhaps founding date, and most important sector of
> >>> trade), so disambiguation may still be difficult.
> >>>
> >>> Have you checked which parts of the accessible information as
> >>> below can be crawled and added legally to external databases such
> >>> as Wikidata?
> >>>
> >>> Cheers, Joachim
> >>>
> >>> --
> >>> Joachim Neubert
> >>>
> >>> ZBW – German National Library of Economics
> >>> Leibniz Information Centre for Economics
> >>> Neuer Jungfernstieg 21
> >>> 20354 Hamburg
> >>> Phone +49-42834-462
> >>>
> >>>
> >>> From: Wikidata [mailto:wikidata-boun...@lists.wikimedia.org]
> >>> On Behalf Of Sebastian Hellmann
> >>> Sent: Sunday, 15 October 2017 09:45
> >>> To: wikidata@lists.wikimedia.org
> >>> Subject: [Wikidata] Kickstartet: Adding 2.2 million German
> >>> organisations to Wikidata
> >>>
> >>> Hi all,
> >>>
> >>> the German business registry contains roughly 2.2 million
> >>> organisations. Some information is paid, but other information is
> >>> public, i.e. the info you are searching for at and clicking on UT
> >>> (see example below):
> >>>
> >>> https://www.handelsregister.de/rp_web/mask.do?Typ=e
> >>>
> >>> I would like to add this to Wikidata, either by crawling or by
> >>> raising money to use crowdsourcing concepts like CrowdFlower or
> >>> Amazon Mechanical Turk.
> >>>
> >>> It should meet notability criterion 2:
> >>> https://www.wikidata.org/wiki/Wikidata:Notability
> >>>
> >>>     2. It refers to an instance of a *clearly identifiable
> >>>     conceptual or material entity*. The entity must be notable, in
> >>>     the sense that it *can be described using serious and publicly
> >>>     available references*. If there is no item about you yet, you
> >>>     are probably not notable.
> >>>
> >>> The reference is the official German business registry, which is
> >>> serious and public. Orgs are also by definition clearly
> >>> identifiable legal entities.
> >>>
> >>> How can I get clearance to proceed on this?
> >>>
> >>> All the best,
> >>> Sebastian
> >>>
> >>>
> >>> Entity data
> >>>
> >>> Saxony District court *Leipzig HRB 32853* – A&A
> >>> Dienstleistungsgesellschaft mbH
> >>>
> >>> Legal status:             Gesellschaft mit beschränkter Haftung
> >>> Capital:                  25.000,00 EUR
> >>> Date of entry:            29/08/2016
> >>>                           (When entering date of entry, wrong data
> >>>                           input can occur due to system failures!)
> >>> Date of removal:          -
> >>> Balance sheet available:  -
> >>> Address (subject to correction):
> >>>                           A&A Dienstleistungsgesellschaft mbH
> >>>                           Prager Straße 38-40
> >>>                           04317 Leipzig
> >>>
> >>> --
> >>> All the best,
> >>> Sebastian Hellmann
> >>>
> >>> Director of Knowledge Integration and Linked Data Technologies
> >>> (KILT) Competence Center
> >>> at the Institute for Applied Informatics (InfAI) at Leipzig University
> >>> Executive Director of the DBpedia Association
> >>> Projects: http://dbpedia.org, http://nlp2rdf.org,
> >>> http://linguistics.okfn.org, https://www.w3.org/community/ld4lt
> >>> Homepage: http://aksw.org/SebastianHellmann
> >>> Research Group: http://aksw.org
> >>>
> >>> _______________________________________________
> >>> Wikidata mailing list
> >>> Wikidata@lists.wikimedia.org
> >>> https://lists.wikimedia.org/mailman/listinfo/wikidata
_______________________________________________ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata