Re: [Wikidata] Kickstartet: Adding 2.2 million German organisations to Wikidata

2017-10-27 Thread Luigi Assom
As a general question, *is it discouraged or encouraged to mirror corporate
data on Wikidata as a public repository?*

*If it is discouraged, could you provide a bullet list of the reasons?*

*How does the decision process work?*

Referring to:
> https://www.wikidata.org/wiki/Wikidata:Introduction
> Data is entered and maintained *by Wikidata editors*, who decide on the
> rules of content creation and management.

Referring to:
> *A secondary database.* Wikidata records not just statements, but also
> their sources, and connections to other databases.

*According to Wikidata editors, is it possible to index web sources and
collate their data on Wikidata?* How do they deal with bulk data or
individual pieces of data provided by users?

Indexing the web does not require agreements, since that is how the web
crawlers of search engines work.

Here, a crowd of people can coordinate themselves to create a *consistent*
database.

I believe consistency is key to serving *anyone in the world.*

> Anyone can use Wikidata for any number of different ways by using its
> application programming interface.


I think applications that have "value" in the sense of corporate datasets
can be built on data that includes business profiles and ownership links to
participated companies, subsidiaries, and the stakeholders who participate
in the business.

Imagine a minimised version of Bloomberg or Bureau van Dijk, free to serve
*anyone in the world.*

I think I could contribute in three ways:

   - collecting data
   - designing a test application to facilitate crowd-sourced addition of
   data
   - providing a simplified guide to using Wikidata properties for a
   specific case (a kind of infographic, though I would need very clear
   guidance on the entities and properties for corporations).




___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Kickstartet: Adding 2.2 million German organisations to Wikidata

2017-10-27 Thread Jakob Voß

Laura Morales wrote:

> OK, just asked. Their reply was that they "reserves the right under
> paragraph 3.3 of ODbL to release the database under different terms",
> which is to say their data is NOT free because they want to control
> how and where the data is used. Are we starting to see "free vs open"
> all over again, this time with data instead of software?

This means we could re-publish the data openly once we actually get it,
but they make it hard to get their data :-(

I'd still try to be open about OpenCorporates and keep on asking them.
If they don't switch to more open data sharing, they will likely be
replaced, that's for sure. So work independently from OpenCorporates but
keep compatible unless they actively refuse to work with Wikidata in any
way.


Cheers,
Jakob

--
Jakob Voß 
Verbundzentrale des GBV (VZG) / Common Library Network
Platz der Goettinger Sieben 1, 37073 Göttingen, Germany
+49 (0)551 39-10242, http://www.gbv.de/



Re: [Wikidata] Kickstartet: Adding 2.2 million German organisations to Wikidata

2017-10-26 Thread Luigi Assom
I think Laura raised a very good point here.
One broader question:

is the Wikidata team thinking about moving their dataset onto a *blockchain*?
That, I believe, would incentivise people to participate, maintain, and
even craft useful things with clear licenses (eventually profiting based on
the utility of a thing).


Again, the possibility to compute / process things that are of public
utility / governance depends on the accessibility of data,
or on the very large computation / legal / negotiation power of a few
stakeholders.

Making data accessible to multiple stakeholders among public audiences,
or at least mirroring the metadata, would allow a sane "competition" /
collaboration in governance, with concrete applications that save $$$,
like preventing collusion between companies.

I also tried to connect with OpenCorporates, both research and the CEO.
No answer so far.







Re: [Wikidata] Kickstartet: Adding 2.2 million German organisations to Wikidata

2017-10-26 Thread Laura Morales
OK, just asked. Their reply was that they "reserves the right under paragraph 
3.3 of ODbL to release the database under different terms", which is to say 
their data is NOT free because they want to control how and where the data is 
used.
Are we starting to see "free vs open" all over again, this time with data 
instead of software?
 
 



Re: [Wikidata] Kickstartet: Adding 2.2 million German organisations to Wikidata

2017-10-25 Thread Thad Guidry
Laura,

Talk to OpenCorporates and ask those questions yourself.
Get involved ! :)

-Thad
+ThadGuidry 




Re: [Wikidata] Kickstartet: Adding 2.2 million German organisations to Wikidata

2017-10-25 Thread Laura Morales
Is there any RDF dump available of OpenCorporates data? Or even any dump at 
all? Their licensing terms are ambiguous... They say it's released under ODbL, 
but if I want to use the data I have to ask permission and they will decide if 
I can use it for free or if I have to pay a fee :/
 
 



Re: [Wikidata] Kickstartet: Adding 2.2 million German organisations to Wikidata

2017-10-25 Thread Jakob Voß

Hi Luigi,

I favour cooperation with OpenCorporates instead of independently adding
lots of company records to Wikidata. Sure, there are parallel strategies,
but any effort should also include OpenCorporates to some degree.


OpenCorporates is licensed under ODbL (just added this referenced 
statement to Q7095760) and we have property P1320 to link Wikidata and 
OpenCorporates. A first step would be to align


https://opencorporates.com/registers
https://en.wikipedia.org/wiki/List_of_company_registers

Right now we have 18 instances of company register (Q1394657) and its 
subclasses explicitly classified as such in Wikidata.


These items should be linked with the registers listed at 
OpenCorporates, e.g.


UK Companies House (Q257303)
= https://opencorporates.com/registers/270

I've also noticed that OpenCorporates has a field for "Identifiers"
where Wikidata QIDs may be included to have two-way links between the
two datasets.
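
To make the register alignment above a bit more concrete, here is a minimal
sketch in Python of how the mapping could be tracked. Only the UK Companies
House pair comes from this mail; the class name, the helper function and the
placeholder entry are assumptions made up for illustration, not an existing
tool or API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class RegisterLink:
    """A Wikidata company-register item and its OpenCorporates counterpart."""
    qid: str                                  # Wikidata item ID
    label: str
    opencorporates_url: Optional[str] = None  # None until the register is aligned


def unaligned(links: List[RegisterLink]) -> List[RegisterLink]:
    """Return the registers that still lack an OpenCorporates link."""
    return [link for link in links if link.opencorporates_url is None]


links = [
    # Pair taken from the mail above.
    RegisterLink("Q257303", "UK Companies House",
                 "https://opencorporates.com/registers/270"),
    # Hypothetical placeholder entry, not a real Wikidata item.
    RegisterLink("Q-placeholder", "Some register not yet aligned"),
]

todo = unaligned(links)
```

The remaining entries in `todo` would be the registers that still need a
manual check against https://opencorporates.com/registers.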


Anyway, better contact https://opencorporates.com/info/contributing at 
least to let them know about your plans.


Cheers,
Jakob

--
Jakob Voß 
Verbundzentrale des GBV (VZG) / Common Library Network
Platz der Goettinger Sieben 1, 37073 Göttingen, Germany
+49 (0)551 39-10242, http://www.gbv.de/



Re: [Wikidata] Kickstartet: Adding 2.2 million German organisations to Wikidata

2017-10-19 Thread Thad Guidry
No connections to Opencorporates, sorry.

The good news is that the data sources in Opencorporates (the Registers)
are accessible to you...sometimes in dump format.

https://opencorporates.com/registers

Hope that helps you further in your research and needs.  I am not saying
it's easy :)

-Thad


Re: [Wikidata] Kickstartet: Adding 2.2 million German organisations to Wikidata

2017-10-19 Thread Luigi Assom
Hi Thad,

It is a really great project, I quote some of the points of Sebastian:

> # regarding Opencorporates
> I have a critical opinion with Opencorporates. It appears to be open, but
> you actually can not get the data. If somebody has a data dump, please
> forward to me. Thanks.
> More on top, I consider Opencorporates a danger to open data. It appears
> to push open availability of data, but then it is limited to open licenses.
> Usefulness is limited as there are no free dumps and no possibility to
> duplicate it effectively. Wikipedia and Wikidata provide dumps and an API
> for exactly this reason. Every time somebody wants to create an open
> organisation dataset with no barriers, the existence of Opencorporates is
> blocking this.


I think that being able to analyse the data in bulk is important.

Some data in OpenCorporates is incomplete, such as founders, capital raised,
and investors, even though some information is contributed by users.
Currently most data is about the US and NZ; I'd like to see the EU better
represented.

I would like to be able to visualise a network of companies and
their participations,
and to build bipartite graphs between persons and companies.
I will try to reach them about cooperation on such a project.
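
As a rough illustration of the bipartite graph idea, here is a small Python
sketch using only the standard library. All names and pairs are invented
sample data, and the function names are my own assumptions; a real version
would read participations from a register or from Wikidata.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical (person, company) participation pairs; not real data.
participations = [
    ("Alice", "Acme GmbH"),
    ("Alice", "Beta AG"),
    ("Bob", "Beta AG"),
    ("Bob", "Gamma KG"),
]


def build_bipartite(pairs):
    """Index the person-company edges from both sides of the graph."""
    person_to_companies = defaultdict(set)
    company_to_persons = defaultdict(set)
    for person, company in pairs:
        person_to_companies[person].add(company)
        company_to_persons[company].add(person)
    return person_to_companies, company_to_persons


def project_companies(person_to_companies):
    """Link two companies whenever at least one person participates in both."""
    edges = defaultdict(set)
    for person, companies in person_to_companies.items():
        for a, b in combinations(sorted(companies), 2):
            edges[(a, b)].add(person)
    return dict(edges)


p2c, c2p = build_bipartite(participations)
company_network = project_companies(p2c)
```

The projected `company_network` is the kind of company-to-company view that
could then be handed to a visualisation tool.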

Do you have connections with them?






Re: [Wikidata] Kickstartet: Adding 2.2 million German organisations to Wikidata

2017-10-19 Thread Thad Guidry
Hi Luigi,

Have you looked at https://opencorporates.com  ?

Thad
+ThadGuidry 


Re: [Wikidata] Kickstartet: Adding 2.2 million German organisations to Wikidata

2017-10-19 Thread Luigi Assom
Hi,

I would like to join a thread I found in the archive:
https://lists.wikimedia.org/pipermail/wikidata//2017-October/011259.html

I have worked in contextual research to facilitate knowledge transfer.

One of the domains I would like to address is the visualisation of economic
networks.

I am aiming for an impact on the governance of innovation and on
transparency about control in economic networks, and I also want to allow
SMEs and private citizens to build their own analytics and prevent cases of
collusion.

Information about business profiles is currently a premium service provided
by specialised private corporations. Much of the information about
companies is public, but there is a lack of open data policy.

I would like to fill the gap and contribute to feeding Wikidata as a
repository, either in bulk or as a collective action. As a design thinker I
could contribute to designing processes for entering data, such as
applications that facilitate the process.

*Is there any guidance or clearance about such initiatives?*

I am happy to read of similar interest from Germany, Belgium and Italy; I
would like to connect.

I read that feeding Wikidata with corporate information would significantly
increase its size. Still, I think the benefit of allowing inquiries in the
interest of public governance is that the governance of economic data
becomes distributed.

Aside from public services like:
https://www.gov.uk/government/organisations/companies-house

I would like to allow data-visualisation researchers (such as myself) to
uncover for the public results like:
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0025995

which rely on private partnerships to access corporate databases, so the
findings cannot be queried by the public.

*Is there a specific Wikidata policy to comply with when feeding in data
from website scrapers?*

As a starter, the URIs of sites with a good reputation could act as
*identifiers*.
I believe that scraping would be legitimate because the "facts" behind the
properties (below) are public, and the organisations that collate the data
provide services (as professional communities, or services augmented with
private data) that would not be in competition with building a repository.

In a way, I see Wikidata as a possibility for indexing data that can be
useful to search engines and discovery engines; indexing data is an
activity that such services run daily. I believe that enabling public
transparency would enhance open-data services.


Below some properties of interest.




Properties I would be interested in are:
- TEAM (founders)
- DESCRIPTION (corporate description over products and services)
- INVESTORS (corp. and private equity)
- EMPLOYEES / INCUBATORS / ADVISORS (personal information available as
public information over the web)
- PARTICIPATED COMPANIES
- DATE of acquisition of, or participation in, companies
- CAPITAL (if available, or in ranges)
- VAT NUMBER (or registry number)
- ADDRESS
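
To show how the properties listed above might fit together as one record,
here is a minimal Python sketch. The class and field names are my own
assumptions for illustration and are not Wikidata property names; mapping
each field to an actual Wikidata property would still need to be agreed with
the community.

```python
from dataclasses import dataclass, field, fields
from typing import List, Optional, Tuple


@dataclass
class CompanyProfile:
    """Hypothetical record mirroring the wish list of properties above."""
    name: str
    registry_number: Optional[str] = None   # VAT number or registry number
    address: Optional[str] = None
    description: Optional[str] = None       # products and services
    capital: Optional[str] = None           # exact figure or a range, kept as text
    founders: List[str] = field(default_factory=list)
    investors: List[str] = field(default_factory=list)
    people: List[str] = field(default_factory=list)  # employees, incubators, advisors
    participations: List[Tuple[str, str]] = field(default_factory=list)  # (company, date)


def missing_fields(profile: CompanyProfile) -> List[str]:
    """List the fields that are still empty, e.g. to guide crowd-sourced input."""
    return [f.name for f in fields(profile)
            if getattr(profile, f.name) in (None, [])]


# Invented sample record; not a real company.
profile = CompanyProfile(name="Example GmbH", address="Musterstr. 1")
gaps = missing_fields(profile)
```

A data-entry application could use `gaps` to ask contributors only for what
is still missing.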

Any other ideas for fetching the business profiles of companies?
They should be publicly available in some form, since each corporation
reports to the company registry, and there are already private companies
offering analytics on business profiles.



Luigi


Re: [Wikidata] Kickstartet: Adding 2.2 million German organisations to Wikidata

2017-10-16 Thread Sebastian Hellmann
Ok, I put some effort into 
https://www.wikidata.org/wiki/Wikidata:Requests_for_permissions/Bot/Handelsregister 
to move the discussion there.


All the best,

Sebastian


On 16.10.2017 18:06, Yaroslav Blanter wrote:

Dear All,

it is great that we are having this discussion, but may I please
suggest having it on the RfP page on Wikidata? People have already asked
similar questions there, and, in my experience, on-wiki discussion
will likely lead to a refined request which will accommodate all suggestions.


Cheers
Yaroslav

On Mon, Oct 16, 2017 at 5:53 PM, Sebastian Hellmann wrote:


ah, ok, sorry, I was assuming that Blazegraph would transitively
resolve this automatically.

Ok, so let's divide the problem:

# Task 1:

Connect all existing organisations with the data from the
Handelsregister. (No new identifiers added, we can start right now)

Add a constraint that all German organisations should be connected
to a court, i.e. the registering organisation as well as the id
assigned by the court.

@all: any properties I can reuse for this?

I will focus on this as it seems quite easy. We can first filter
orgs by other criteria, i.e. country as a blocking key and then
string match the rest.

# Task 2:

Add all missing identifiers for the remaining orgs in
Handelsregister. Task 2 can be rediscussed and decided once Task 1 is
sufficiently complete.


# regarding maintenance:
I find Wikidata as such very hard to maintain as all data is
copied from somewhere else eventually, but Wikipedia has the same
problem. In the case of the German Business register, maintenance
is especially easy as the orgs are stable and uniquely
identifiable. Even the fact that a company gets shut down should
still be in Wikidata, so you have historical information. I mean,
you also keep the Roman Empire, the Hanse and even finished
projects in Wikidata. So even if an org ceases to exist, the entry
in Wikidata should stay.

# regarding Opencorporates
I have a critical opinion of Opencorporates. It appears to be
open, but you actually cannot get the data. If somebody has a
data dump, please forward it to me. Thanks.
On top of that, I consider Opencorporates a danger to open data. It
appears to push open availability of data, but then it is limited
to open licenses. Usefulness is limited as there are no free dumps
and no possibility to duplicate it effectively. Wikipedia and
Wikidata provide dumps and an API for exactly this reason.
Every time somebody wants to create an open organisation dataset
with no barriers, the existence of Opencorporates is blocking this.

Cheers,
Sebastian


On 16.10.2017 15:34, Antonin Delpeuch (lists) wrote:

And… my own count was wrong too, because I forgot to add DISTINCT in my
query (if there are multiple paths from the class to "organization
(Q43229)", items will appear multiple times).

So, I get 1 168 084 now.
http://tinyurl.com/yaeqlsnl

It's easy to get these things wrong!
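
A toy illustration in Python of why the DISTINCT matters. It assumes that an
item is returned once per route by which it reaches the target class; here
the routes are two different "instance of" values. The class and item names
are invented, and this only mimics the counting effect, not how the query
service actually evaluates the query.

```python
# Invented toy data: item1 has two types, both below "Organization".
instance_of = {"item1": ["Bank", "Employer"], "item2": ["Bank"]}
subclass_of = {"Bank": ["Organization"], "Employer": ["Organization"]}


def reaches(cls, target, seen=None):
    """True if `target` is `cls` itself or one of its (transitive) superclasses."""
    seen = set() if seen is None else seen
    if cls == target:
        return True
    if cls in seen:          # guard against cycles in the class graph
        return False
    seen.add(cls)
    return any(reaches(parent, target, seen)
               for parent in subclass_of.get(cls, []))


def matching_rows(target):
    """One row per (item, type) pair that leads to the target class."""
    return [item
            for item, classes in instance_of.items()
            for cls in classes
            if reaches(cls, target)]


rows = matching_rows("Organization")
count_without_distinct = len(rows)      # item1 is counted twice
count_with_distinct = len(set(rows))    # each item is counted once
```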

Antonin

On 16/10/2017 14:16, Antonin Delpeuch (lists) wrote:

Thanks Ettore for spotting that!

Wikidata types (P31) only make sense when you consider the "subclass of"
(P279) property that we use to build the ontology (except in a few cases
where the community has decided not to use any subclass for a particular
type).

So, to retrieve all items of a certain type in SPARQL, you need to use
something like this:

?item wdt:P31/wdt:P279* ?type

You can also have other variants to accept non-truthy statements.

Just with this truthy version, I currently get 1 208 227 items. But note
that there are still a lot of items where P31 is not provided, or
subclasses which have not been connected to "organization (Q43229)"…
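
For anyone who wants to repeat such a count for another class, here is a
small Python sketch that only builds the query text around the pattern
above. The function name is made up, and it assumes the query endpoint
already knows the wd: and wdt: prefixes; it does not send anything over the
network.

```python
def count_instances_query(class_qid: str, distinct: bool = True) -> str:
    """Build a SPARQL query counting items typed as class_qid or any subclass."""
    if not (class_qid.startswith("Q") and class_qid[1:].isdigit()):
        raise ValueError("expected an item ID such as 'Q43229'")
    counted = "DISTINCT ?item" if distinct else "?item"
    return (
        "SELECT (COUNT(" + counted + ") AS ?count) WHERE {\n"
        "  ?item wdt:P31/wdt:P279* wd:" + class_qid + " .\n"
        "}"
    )


# Query text for "organization (Q43229)", counting each item once.
query = count_instances_query("Q43229")
```

The resulting string can be pasted into the query service; as noted above,
the number it returns still misses items without P31 or with unconnected
subclasses.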

So in general, it's very hard to have any "guarantees that there are no
duplicates", just because you don't have any guarantees that the
information currently in Wikidata is complete or correct.

I would recommend trying to import something a bit smaller to get
acquainted with how Wikidata works and what the matching process looks
like in practice. And beyond a one-off import, as Ettore said it is
important to think how the data will be maintained in the future…

Antonin


Re: [Wikidata] Kickstartet: Adding 2.2 million German organisations to Wikidata

2017-10-16 Thread Sebastian Hellmann

Hi Yaroslav,

in addition to this list, I added it here:

https://www.wikidata.org/wiki/Wikidata:Requests_for_permissions/Bot/Handelsregister

and here:

https://www.wikidata.org/wiki/Wikidata:Project_chat#Handelsregister

but I received more and longer answers on this list.

All the best,

Sebastian



Re: [Wikidata] Kickstartet: Adding 2.2 million German organisations to Wikidata

2017-10-16 Thread Yaroslav Blanter
Dear All,

it is great that we are having this discussion, but may I please suggest
having it on the RfP page on Wikidata? People have already asked similar
questions there, and, in my experience, on-wiki discussion will likely lead
to a refined request which will accommodate all suggestions.

Cheers
Yaroslav

> On 16/10/2017 13:46, Ettore RIZZA wrote:
>
> - Wikidata has 40k organisations:
>
> https://query.wikidata.org/#SELECT
>   
> %3Fitem %3FitemLabel %0AWHERE
> %0A{%0A %3Fitem wdt%3AP31 wd%3AQ43229.%0A SERVICE wikibase%3Alabel {
> bd%3AserviceParam wikibase%3Alanguage "[AUTO_LANGUAGE]%2Cen". }%0A}
>
>
> Hi,
>
> I think Wikidata contains many more organizations than that. If we
> choose the "instance of Business enterprise", we get 135570 results. And
> I imagine there are many other categories that bring together commercial
> companies.
>
> https://query.wikidata.org/#SELECT%20%3Fitem%20%3FitemLabel%20WHERE%20%7B%0A%20%20%3Fitem%20wdt%3AP31%20wd%3AQ4830453.%0A%20%20SERVICE%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20wikibase%3Alanguage%20%22%5BAUTO_LANGUAGE%5D%2Cen%22.%20%7D%0A%7D
>
> On the substanc

Re: [Wikidata] Kickstartet: Adding 2.2 million German organisations to Wikidata

2017-10-16 Thread Sebastian Hellmann
Ah, OK, sorry, I was assuming that Blazegraph would resolve this
transitively on its own.


Ok, so let's divide the problem:

# Task 1:

Connect all existing organisations with the data from the
Handelsregister. (No new identifiers are added, so we can start right now.)


Add a constraint that all German organisations should be connected to a 
court, i.e. the registering organisation as well as the id assigned by 
the court.


@all: any properties I can reuse for this?

I will focus on this as it seems quite easy. We can first filter orgs by
other criteria, e.g. using country as a blocking key, and then string-match
the rest.
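As a rough illustration of the blocking idea, here is a minimal Python sketch. It is only a sketch under assumptions: the record fields (`qid`, `register_id`, `name`, `country`), the helper names and the list of legal-form tokens are all made up for this example and are not the actual Handelsregister or Wikidata schema. Exact matching on a normalized name is also just the simplest possible string match.

```python
import re
from collections import defaultdict

# Legal-form tokens dropped before comparing names (illustrative, not exhaustive).
LEGAL_FORMS = {"gmbh", "mbh", "ag", "kg", "ohg", "ug", "ev"}


def normalize(name: str) -> str:
    """Lowercase, strip punctuation and drop common legal-form tokens."""
    tokens = re.sub(r"[^\w\s]", "", name.lower()).split()
    return " ".join(t for t in tokens if t not in LEGAL_FORMS)


def match_by_block(wikidata_orgs, register_orgs):
    """Pair records that share the blocking key (country) and a normalized name."""
    blocks = defaultdict(dict)
    for org in register_orgs:
        blocks[org["country"]][normalize(org["name"])] = org

    matches = []
    for org in wikidata_orgs:
        # Only compare against register records in the same block.
        candidate = blocks.get(org["country"], {}).get(normalize(org["name"]))
        if candidate is not None:
            matches.append((org["qid"], candidate["register_id"]))
    return matches
```

The blocking key keeps the comparison cheap, since each Wikidata item is only compared with register entries from the same country. Items that find no candidate would still need a fuzzier comparison or manual review.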


# Task 2:

Add all missing identifiers for the remaining orgs in the Handelsregister.
Task 2 can be rediscussed and decided once Task 1 is sufficiently complete.



# regarding maintenance:
I find Wikidata as such very hard to maintain as all data is copied from 
somewhere else eventually, but Wikipedia has the same problem. In the 
case of the German Business register, maintenance is especially easy as 
the orgs are stable and uniquely identifiable. Even the fact that a 
company gets shut down should still be in Wikidata, so you have 
historical information. I mean, you also keep the Roman Empire, the 
Hanse and even finished projects in Wikidata. So even if an org ceases 
to exist, the entry in Wikidata should stay.


# regarding Opencorporates
I have a critical opinion of Opencorporates. It appears to be open,
but you actually cannot get the data. If somebody has a data dump,
please forward it to me. Thanks.
On top of that, I consider Opencorporates a danger to open data. It appears
to push open availability of data, but the openness is limited to open
licenses. Its usefulness is limited as there are no free dumps and no
possibility to duplicate it effectively. Wikipedia and Wikidata provide
dumps and an API for exactly this reason. Every time somebody wants to
create an open organisation dataset with no barriers, the existence of
Opencorporates is blocking this.


Cheers,
Sebastian



Re: [Wikidata] Kickstartet: Adding 2.2 million German organisations to Wikidata

2017-10-16 Thread Ettore RIZZA
While I'm on the subject, I would like to draw attention to the Neckar
project, which aims precisely to classify Wikidata entities into people,
places and organizations. Frequently updated JSON dumps are available.

2017-10-16 16:08 GMT+02:00 Ettore RIZZA:

> @Antonin : Thanks for this counting method, it seems very effective (I
> already knew that there were 3.6 M of humans (Q5) in Wikidata).
>
> https://query.wikidata.org/#%23compter%20le%20nombre%20d%
> 27%C3%A9l%C3%A9ments%20appartenant%20%C3%A0%20la%20cat%C3%A9gorie%0A%
> 23organisation%20ou%20%C3%A0%20ses%20enfants%0ASELECT%
> 20DISTINCT%20%28COUNT%28DISTINCT%20%3Fitem%29%20AS%
> 20%3Fcount%29%20WHERE%20%7B%20%3Fitem%20%28wdt%3AP31%
> 2Fwdt%3AP279%2a%29%20wd%3AQ5.%20%7D

Re: [Wikidata] Kickstartet: Adding 2.2 million German organisations to Wikidata

2017-10-16 Thread Ettore RIZZA
@Antonin: Thanks for this counting method; it seems very effective (I
already knew that there were 3.6 million humans (Q5) in Wikidata).

https://query.wikidata.org/#%23compter%20le%20nombre%20d%27%C3%A9l%C3%A9ments%20appartenant%20%C3%A0%20la%20cat%C3%A9gorie%0A%23organisation%20ou%20%C3%A0%20ses%20enfants%0ASELECT%20DISTINCT%20%28COUNT%28DISTINCT%20%3Fitem%29%20AS%20%3Fcount%29%20WHERE%20%7B%20%3Fitem%20%28wdt%3AP31%2Fwdt%3AP279%2a%29%20wd%3AQ5.%20%7D


Re: [Wikidata] Kickstartet: Adding 2.2 million German organisations to Wikidata

2017-10-16 Thread Antonin Delpeuch (lists)
And… my own count was wrong too, because I forgot to add DISTINCT in my
query (if there are multiple paths from the class to "organization
(Q43229)", items will appear multiple times).

So, I get 1 168 084 now.
http://tinyurl.com/yaeqlsnl
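
To make the effect of the missing DISTINCT concrete, here is a small Python simulation on a toy graph. It is a sketch under assumptions: the item and class names are invented, and it models one way duplicates can arise, namely an item with several P31 values that each lead to the target class, so the item matches once per value.

```python
# Toy graph; every identifier below is made up for illustration.
INSTANCE_OF = {            # item -> classes (plays the role of P31)
    "item_a": ["company", "charity"],
    "item_b": ["company"],
}
SUBCLASS_OF = {            # class -> parent classes (plays the role of P279)
    "company": ["organization"],
    "charity": ["organization"],
    "organization": [],
}


def reaches(cls, target):
    """True if target is reachable from cls via zero or more subclass steps."""
    seen, stack = set(), [cls]
    while stack:
        current = stack.pop()
        if current == target:
            return True
        if current not in seen:
            seen.add(current)
            stack.extend(SUBCLASS_OF.get(current, []))
    return False


def matching_rows(target):
    """One row per (item, direct class) pair whose class leads to target."""
    return [item
            for item, classes in INSTANCE_OF.items()
            for cls in classes
            if reaches(cls, target)]


rows = matching_rows("organization")
plain_count = len(rows)          # analogous to COUNT(?item)
distinct_count = len(set(rows))  # analogous to COUNT(DISTINCT ?item)
```

Here `item_a` is counted twice without deduplication, which is the same kind of inflation as the gap between the two totals reported in this thread.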

It's easy to get these things wrong!

Antonin


Re: [Wikidata] Kickstartet: Adding 2.2 million German organisations to Wikidata

2017-10-16 Thread Antonin Delpeuch (lists)
Thanks Ettore for spotting that!

Wikidata types (P31) only make sense when you consider the "subclass of"
(P279) property that we use to build the ontology (except in a few cases
where the community has decided not to use any subclass for a particular
type).

So, to retrieve all items of a certain type in SPARQL, you need to use
something like this:

?item wdt:P31/wdt:P279* ?type

You can also have other variants to accept non-truthy statements.

Just with this truthy version, I currently get 1 208 227 items. But note
that there are still a lot of items where P31 is not provided, or
subclasses which have not been connected to "organization (Q43229)"…
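
For reference, a complete counting query built around that pattern might look like the sketch below. The helper names `count_query` and `query_service_link` are hypothetical. The sketch assumes the `wdt:` and `wd:` prefixes are predefined by the query service, and that a percent-encoded query after `#` opens in the editor, as the links in this thread do.

```python
from urllib.parse import quote


def count_query(class_qid: str) -> str:
    """SPARQL that counts distinct items of class_qid or any of its subclasses."""
    return (
        "SELECT (COUNT(DISTINCT ?item) AS ?count) WHERE {\n"
        f"  ?item wdt:P31/wdt:P279* wd:{class_qid} .\n"
        "}"
    )


def query_service_link(sparql: str) -> str:
    """Link that opens the given query in the Wikidata Query Service editor."""
    return "https://query.wikidata.org/#" + quote(sparql, safe="")


if __name__ == "__main__":
    # Q43229 is "organization", as discussed in this thread.
    print(query_service_link(count_query("Q43229")))
```

The count returned by the service changes as Wikidata is edited, so no expected number is given here.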

So in general, it's very hard to have any "guarantees that there are no
duplicates", just because you don't have any guarantees that the
information currently in Wikidata is complete or correct.

I would recommend trying to import something a bit smaller to get
acquainted with how Wikidata works and what the matching process looks
like in practice. And beyond a one-off import, as Ettore said, it is
important to think about how the data will be maintained in the future…

Antonin


Re: [Wikidata] Kickstartet: Adding 2.2 million German organisations to Wikidata

2017-10-16 Thread Ettore RIZZA
>
> - Wikidata has 40k organisations:
>
> https://query.wikidata.org/#SELECT %3Fitem %3FitemLabel %0AWHERE %0A{%0A
> %3Fitem wdt%3AP31 wd%3AQ43229.%0A SERVICE wikibase%3Alabel {
> bd%3AserviceParam wikibase%3Alanguage "[AUTO_LANGUAGE]%2Cen". }%0A}


Hi,

I think Wikidata contains many more organizations than that. If we choose
"instance of business enterprise" (Q4830453), we get 135570 results. And I
imagine there are many other classes that bring together commercial companies.


https://query.wikidata.org/#SELECT%20%3Fitem%20%3FitemLabel%20WHERE%20%7B%0A%20%20%3Fitem%20wdt%3AP31%20wd%3AQ4830453.%0A%20%20SERVICE%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20wikibase%3Alanguage%20%22%5BAUTO_LANGUAGE%5D%2Cen%22.%20%7D%0A%7D
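
The link above is hard to read because the query is percent-encoded. A small Python sketch to decode it, where only the variable names are assumptions and the fragment is copied from the link:

```python
from urllib.parse import unquote

# Everything after the "#" in the query-service link above.
fragment = (
    "SELECT%20%3Fitem%20%3FitemLabel%20WHERE%20%7B%0A"
    "%20%20%3Fitem%20wdt%3AP31%20wd%3AQ4830453.%0A"
    "%20%20SERVICE%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20"
    "wikibase%3Alanguage%20%22%5BAUTO_LANGUAGE%5D%2Cen%22.%20%7D%0A%7D"
)

query = unquote(fragment)
print(query)
```

Decoded, the query selects items with a direct `wdt:P31 wd:Q4830453` statement, so items typed only with a subclass of that class are not included in the result.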

On the substance, the project to add all companies of a country would make
Wikidata a kind of totally free clone of Open Corporates. I would of course
be delighted to see that, but is it not a challenge to maintain such a
database? Companies are like humans: they appear and disappear every day.



2017-10-16 13:41 GMT+02:00 Sebastian Hellmann <
hellm...@informatik.uni-leipzig.de>:

> Hi all,
>
> the technical challenges are not so difficult.
>
> - 2.2 million are the exact number of German organisations, i.e.
> associations and companies. They are also unique.
>
> - Wikidata has 40k organisations:
>
> https://query.wikidata.org/#SELECT %3Fitem %3FitemLabel %0AWHERE %0A{%0A
> %3Fitem wdt%3AP31 wd%3AQ43229.%0A SERVICE wikibase%3Alabel {
> bd%3AserviceParam wikibase%3Alanguage "[AUTO_LANGUAGE]%2Cen". }%0A}
>
> so there would be a maximum of 40k duplicates. These are easy to find and
> deduplicate.
>
> - The crawl can be done easily, a colleague has done so before.
>
>
> The issues here are:
>
> - Do you want to upload the data in Wikidata? It would be a really big
> extension. Can I go ahead?
>
> - If the data were available externally as structured data under an open
> license, I would probably not suggest loading it into Wikidata, as the data
> can be retrieved from the official source directly. Here, however, this data
> will not be published in a decent format.
>
> I thought that the way data is copied from copyrighted sources, i.e. only
> facts, is OK for Wikidata. This is done in a lot of places, I guess. The same
> goes for Wikipedia, where news articles and copyrighted books are referenced.
> So Wikimedia or the Wikimedia community are experts on this.
>
> All the best,
>
> Sebastian
>
> On 16.10.2017 10:18, Neubert, Joachim wrote:
>
> Hi Sebastian,
>
>
>
> This is huge! It will cover almost all currently existing German
> companies. Many of these will have similar names, so preparing for
> disambiguation is a concern.
>
>
>
> A good way for such an approach would be proposing a property for an
> external identifier, loading the data into Mix-n-match, creating links for
> companies already in Wikidata, and adding the rest (or perhaps only parts
> of them - I’m not sure if having all of them in Wikidata makes sense, but
> that’s another discussion), preferably with location and/or sector of trade
> in the description field.
>
>
>
> I’ve tried to figure out what could be used as a key for an external
> identifier property. However, it looks like the registry does not offer any
> (persistent) URL to its entries. So for looking up a company, apparently
> there are two options:
>
>
>
> -  conducting an extended search for the exact string “A&A
> Dienstleistungsgesellschaft mbH“
>
> -  copying the register number “32853” plus selecting the court
> (Leipzig) from the according dropdown list and search that
>
>
>
> Both ways are not very intuitive, even if we can provide a link to the
> search form. This would make a weak connection to the source of
> information. Much more important, it makes disambiguation in Mix-n-match
> difficult. This applies for the preparation of your initial load (you would
> not want to create duplicates). But much more so for everybody else who
> wants to match his or her data later on. Being forced to search for entries
> manually in a cumbersome way for disambiguation of a new, possibly large
> and rich dataset is, in my eyes, not something we want to impose on future
> contributors. And often, the free information they find in the registry
> (formal name, register number, legal form, address) will not easily match
> with the information they have (common name, location, perhaps founding
> date, and most important sector of trade), so disambiguation may still be
> difficult.
>
>
>
> Have you checked which parts of the accessible information as below can be
> crawled and added legally to external databases such as Wikidata?
>
>
>
> Cheers, Joachim
>
>
>
> --
>
> Joachim Neubert
>
>
>
> ZBW – German National Library of Economics
>
> Leibniz Information Centre for Economics
>
> Neuer Jungfernstieg 21
> 20354 Hamburg
>
> Phone +49-42834-462
>
>
>
>
>
>
>
> *From:* Wikidata [mailto:wikidata-boun...@lists.wikimedia.org]
> *On behalf of* Sebastian
> Hel

Re: [Wikidata] Kickstartet: Adding 2.2 million German organisations to Wikidata

2017-10-16 Thread hellmann
The best way to avoid creating duplicates, then, is to look at all existing
organizations in Wikidata, add the court and court number manually where they
are German, and then exclude these from the import.

This guarantees that there will be no duplicates.

So the technical side is feasible.
Barriers are political and legal.
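
The exclusion step described above could be sketched in Python as follows. This is a sketch under assumptions: the key format follows the court+type+number form (/Leipzig_HRB_32853) from the message quoted below, and the record fields `court`, `type` and `number` as well as both function names are hypothetical.

```python
def register_key(court: str, register_type: str, number: str) -> str:
    """Compose a key such as 'Leipzig_HRB_32853' from court, type and number."""
    return f"{court.strip()}_{register_type.strip().upper()}_{number.strip()}"


def records_to_import(register_records, keys_already_in_wikidata):
    """Keep only register records whose key is not yet attached to a Wikidata item."""
    known = set(keys_already_in_wikidata)
    return [
        record
        for record in register_records
        if register_key(record["court"], record["type"], record["number"]) not in known
    ]
```

This only avoids duplicates for items that already carry the court and number. Existing items without them would still have to be matched first, as discussed earlier in the thread.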

Sebastian 

On 16 October 2017 14:24:51 CEST, Sebastian Hellmann wrote:
>Ah yes, forgot to mention:
>
>there is no URI or unique identifier given by the Handelsregister
>system. However, the courts take care that the registrations are
>unique, so it is implicit. Handelsregister could easily create stable
>URIs out of the court+type+number, like /Leipzig_HRB_32853
>
>For Wikidata this is not a problem to handle. So no technical issues 
>from this side either.
>
>All the best,
>
>Sebastian
>
>
>On 16.10.2017 13:41, Sebastian Hellmann wrote:
>>
>> Hi all,
>>
>> the technical challenges are not so difficult.
>>
>> - 2.2 million are the exact number of German organisations, i.e. 
>> associations and companies. They are also unique.
>>
>> - Wikidata has 40k organisations:
>>
>> https://query.wikidata.org/#SELECT %3Fitem %3FitemLabel %0AWHERE 
>> %0A{%0A %3Fitem wdt%3AP31 wd%3AQ43229.%0A SERVICE wikibase%3Alabel { 
>> bd%3AserviceParam wikibase%3Alanguage "[AUTO_LANGUAGE]%2Cen". }%0A}
>>
>> so there would be a maximum of 40k duplicates. These are easy to find
>> and deduplicate.
>>
>> - The crawl can be done easily, a colleague has done so before.
>>
>>
>> The issues here are:
>>
>> - Do you want to upload the data in Wikidata? It would be a really big
>> extension. Can I go ahead?
>>
>> - If the data were available externally as structured data under open
>
>> license, I would probably not suggest loading it into wikidata, as
>the 
>> data can be retrieved from the official source directly, however,
>here 
>> this data will not be published in a decent format.
>>
>> I thought that the way data is copied from coyrighted sources, i.e. 
>> only facts is ok for wikidata. This done in a lot of places, I guess.
>
>> Same for Wikipedia, i.e. News articles and copyrighted books are 
>> referenced. So Wikimedia or the Wikimedia community are experts on
>this.
>>
>> All the best,
>>
>> Sebastian
>>
>>
>> On 16.10.2017 10:18, Neubert, Joachim wrote:
>>>
>>> Hi Sebastian,
>>>
>>> This is huge! It will cover almost all currently existing German 
>>> companies. Many of these will have similar names, so preparing for 
>>> disambiguation is a concern.
>>>
>>> A good approach would be to propose a property for an external
>>> identifier, load the data into Mix-n-match, create links for
>>> companies already in Wikidata, and add the rest (or perhaps only
>>> parts of them - I’m not sure if having all of them in Wikidata
>>> makes sense, but that’s another discussion), preferably with
>>> location and/or sector of trade in the description field.
>>>
>>> I’ve tried to figure out what could be used as the key for an
>>> external identifier property. However, it looks like the registry
>>> does not offer any (persistent) URL to its entries. So for looking
>>> up a company, apparently there are two options:
>>>
>>> - conducting an extended search for the exact string “A&A
>>> Dienstleistungsgesellschaft mbH”
>>>
>>> - copying the register number “32853”, selecting the court
>>> (Leipzig) from the corresponding dropdown list, and searching for that
>>>
>>> Neither way is very intuitive, even if we can provide a link to
>>> the search form. This would make for a weak connection to the source
>>> of information. More importantly, it makes disambiguation in
>>> Mix-n-match difficult. This applies to the preparation of your
>>> initial load (you would not want to create duplicates), but much more
>>> so to everybody else who wants to match his or her data later on.
>>> Being forced to search for entries manually in a cumbersome way for
>>> disambiguation of a new, possibly large and rich dataset is, in my
>>> eyes, not something we want to impose on future contributors. And
>>> often, the free information they find in the registry (formal name,
>>> register number, legal form, address) will not easily match the
>>> information they have (common name, location, perhaps founding date,
>>> and, most importantly, sector of trade), so disambiguation may still
>>> be difficult.
>>>
>>> Have you checked which parts of the accessible information as below
>>> can be crawled and added legally to external databases such as
>>> Wikidata?
>>>
>>> Cheers, Joachim
>>>
>>> --
>>>
>>> Joachim Neubert
>>>
>>> ZBW – German National Library of Economics
>>>
>>> Leibniz Information Centre for Economics
>>>
>>> Neuer Jungfernstieg 21
>>> 20354 Hamburg
>>>
>>> Phone +49-42834-462
>>>
>>> *From:* Wikidata [mailto:wikidata-boun...@lists.wikimedia.org] *On
>>> behalf of* Sebastian Hellmann
>>> *Sent:* Sunday, 15 October 2017 09:45
>>> *To:* wikidata@lists.wikimedia.org
>

Re: [Wikidata] Kickstartet: Adding 2.2 million German organisations to Wikidata

2017-10-16 Thread Sebastian Hellmann

Ah yes, forgot to mention:

The Handelsregister system does not provide a URI or unique identifier.
However, the courts ensure that registrations are unique, so an
identifier is implicit. The Handelsregister could easily create stable
URIs from court + type + number, e.g. /Leipzig_HRB_32853


For Wikidata this is not a problem to handle. So no technical issues 
from this side either.
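
A minimal sketch of how such an identifier could be derived, assuming the three parts are available as plain strings. The function name `register_id` and the whitespace handling are assumptions for illustration, not anything the Handelsregister defines.

```python
import re


def register_id(court: str, register_type: str, number: str) -> str:
    """Join court, register type and number into one stable key.

    Whitespace inside the court name is replaced by underscores so that
    the result can be used as a URI path segment.
    """
    court_part = re.sub(r"\s+", "_", court.strip())
    return f"{court_part}_{register_type.strip().upper()}_{number.strip()}"


print(register_id("Leipzig", "HRB", "32853"))
print(register_id("Frankfurt am Main", "hrb", "12345"))  # made-up number
```

The first call reproduces the path segment from the example above (without the leading slash). Whether court names are spelled consistently enough across the registry to serve as key components would need checking against the actual data.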


All the best,

Sebastian


On 16.10.2017 13:41, Sebastian Hellmann wrote:


Hi all,

the technical challenges are not so difficult.

- 2.2 million is the exact number of German organisations, i.e.
associations and companies. They are also unique.


- Wikidata has 40k organisations:

https://query.wikidata.org/#SELECT %3Fitem %3FitemLabel %0AWHERE 
%0A{%0A %3Fitem wdt%3AP31 wd%3AQ43229.%0A SERVICE wikibase%3Alabel { 
bd%3AserviceParam wikibase%3Alanguage "[AUTO_LANGUAGE]%2Cen". }%0A}


so there would be a maximum of 40k duplicates. These are easy to find
and deduplicate.


- The crawl can be done easily; a colleague has done so before.


The issues here are:

- Do you want to upload the data to Wikidata? It would be a really big
extension. Can I go ahead?


- If the data were available externally as structured data under an open
license, I would probably not suggest loading it into Wikidata, as the
data could be retrieved from the official source directly. Here, however,
the data will not be published in a decent format.


I thought that the way data is copied from copyrighted sources, i.e.
only facts, is OK for Wikidata. This is done in a lot of places, I guess.
The same goes for Wikipedia, i.e. news articles and copyrighted books are
referenced. So Wikimedia or the Wikimedia community are experts on this.


All the best,

Sebastian


On 16.10.2017 10:18, Neubert, Joachim wrote:


Hi Sebastian,

This is huge! It will cover almost all currently existing German 
companies. Many of these will have similar names, so preparing for 
disambiguation is a concern.


A good approach would be to propose a property for an external
identifier, load the data into Mix-n-match, create links for
companies already in Wikidata, and add the rest (or perhaps only
parts of them - I’m not sure if having all of them in Wikidata
makes sense, but that’s another discussion), preferably with
location and/or sector of trade in the description field.


I’ve tried to figure out what could be used as the key for an
external identifier property. However, it looks like the registry
does not offer any (persistent) URL to its entries. So for looking
up a company, apparently there are two options:


- conducting an extended search for the exact string “A&A
Dienstleistungsgesellschaft mbH”


- copying the register number “32853”, selecting the court
(Leipzig) from the corresponding dropdown list, and searching for that


Neither way is very intuitive, even if we can provide a link to
the search form. This would make for a weak connection to the source
of information. More importantly, it makes disambiguation in
Mix-n-match difficult. This applies to the preparation of your
initial load (you would not want to create duplicates), but much more
so to everybody else who wants to match his or her data later on.
Being forced to search for entries manually in a cumbersome way for
disambiguation of a new, possibly large and rich dataset is, in my
eyes, not something we want to impose on future contributors. And
often, the free information they find in the registry (formal name,
register number, legal form, address) will not easily match the
information they have (common name, location, perhaps founding date,
and, most importantly, sector of trade), so disambiguation may still
be difficult.


Have you checked which parts of the accessible information as below 
can be crawled and added legally to external databases such as Wikidata?


Cheers, Joachim

--

Joachim Neubert

ZBW – German National Library of Economics

Leibniz Information Centre for Economics

Neuer Jungfernstieg 21
20354 Hamburg

Phone +49-42834-462

*From:* Wikidata [mailto:wikidata-boun...@lists.wikimedia.org] *On
behalf of* Sebastian Hellmann

*Sent:* Sunday, 15 October 2017 09:45
*To:* wikidata@lists.wikimedia.org
*Subject:* [Wikidata] Kickstartet: Adding 2.2 million German
organisations to Wikidata


Hi all,

the German business registry contains roughly 2.2 million
organisations. Some information is paid, but other information is
public, i.e. the info you get by searching at the link below and
clicking on UT (see example below):


https://www.handelsregister.de/rp_web/mask.do?Typ=e

I would like to add this to Wikidata, either by crawling or by
raising money to use crowdsourcing platforms like CrowdFlower or
Amazon Mechanical Turk.


It should meet notability criterion 2:
https://www.wikidata.org/wiki/Wikidata:Notability


2. It refers to an instance of a *clearly identifiable conceptual
or material entity*. The entity must be notable, in the sense
that it *c

Re: [Wikidata] Kickstartet: Adding 2.2 million German organisations to Wikidata

2017-10-16 Thread Sebastian Hellmann

Hi all,

the technical challenges are not so difficult.

- 2.2 million is the exact number of German organisations, i.e.
associations and companies. They are also unique.


- Wikidata has 40k organisations:

https://query.wikidata.org/#SELECT %3Fitem %3FitemLabel %0AWHERE %0A{%0A 
%3Fitem wdt%3AP31 wd%3AQ43229.%0A SERVICE wikibase%3Alabel { 
bd%3AserviceParam wikibase%3Alanguage "[AUTO_LANGUAGE]%2Cen". }%0A}


so there would be a maximum of 40k duplicates. These are easy to find and
deduplicate.
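
For reference, the query link above is percent-encoded. A short sketch that decodes it with the standard library shows the SPARQL it carries; the encoded string is copied from the link, and the variable names are only illustrative.

```python
from urllib.parse import unquote

# Fragment of the query.wikidata.org link quoted above (text after '#').
encoded = (
    "SELECT %3Fitem %3FitemLabel %0AWHERE %0A{%0A "
    "%3Fitem wdt%3AP31 wd%3AQ43229.%0A "
    "SERVICE wikibase%3Alabel { bd%3AserviceParam wikibase%3Alanguage "
    '"[AUTO_LANGUAGE]%2Cen". }%0A}'
)

query = unquote(encoded)
print(query)
```

The decoded pattern `?item wdt:P31 wd:Q43229.` selects items stated directly as instances of organization (Q43229). Items typed only with a subclass would need a property path such as `wdt:P31/wdt:P279*`, so the 40k figure may undercount the organisations already present.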


- The crawl can be done easily; a colleague has done so before.


The issues here are:

- Do you want to upload the data to Wikidata? It would be a really big
extension. Can I go ahead?


- If the data were available externally as structured data under an open
license, I would probably not suggest loading it into Wikidata, as the
data could be retrieved from the official source directly. Here, however,
the data will not be published in a decent format.


I thought that the way data is copied from copyrighted sources, i.e. only
facts, is OK for Wikidata. This is done in a lot of places, I guess. The
same goes for Wikipedia, i.e. news articles and copyrighted books are
referenced. So Wikimedia or the Wikimedia community are experts on this.


All the best,

Sebastian


On 16.10.2017 10:18, Neubert, Joachim wrote:


Hi Sebastian,

This is huge! It will cover almost all currently existing German 
companies. Many of these will have similar names, so preparing for 
disambiguation is a concern.


A good approach would be to propose a property for an external
identifier, load the data into Mix-n-match, create links for
companies already in Wikidata, and add the rest (or perhaps only
parts of them - I’m not sure if having all of them in Wikidata
makes sense, but that’s another discussion), preferably with
location and/or sector of trade in the description field.


I’ve tried to figure out what could be used as the key for an
external identifier property. However, it looks like the registry
does not offer any (persistent) URL to its entries. So for looking
up a company, apparently there are two options:


- conducting an extended search for the exact string “A&A
Dienstleistungsgesellschaft mbH”


- copying the register number “32853”, selecting the court
(Leipzig) from the corresponding dropdown list, and searching for that


Neither way is very intuitive, even if we can provide a link to the
search form. This would make for a weak connection to the source of
information. More importantly, it makes disambiguation in
Mix-n-match difficult. This applies to the preparation of your
initial load (you would not want to create duplicates), but much more
so to everybody else who wants to match his or her data later on.
Being forced to search for entries manually in a cumbersome way for
disambiguation of a new, possibly large and rich dataset is, in my
eyes, not something we want to impose on future contributors. And
often, the free information they find in the registry (formal name,
register number, legal form, address) will not easily match the
information they have (common name, location, perhaps founding date,
and, most importantly, sector of trade), so disambiguation may still
be difficult.


Have you checked which parts of the accessible information as below 
can be crawled and added legally to external databases such as Wikidata?


Cheers, Joachim

--

Joachim Neubert

ZBW – German National Library of Economics

Leibniz Information Centre for Economics

Neuer Jungfernstieg 21
20354 Hamburg

Phone +49-42834-462

*From:* Wikidata [mailto:wikidata-boun...@lists.wikimedia.org] *On
behalf of* Sebastian Hellmann

*Sent:* Sunday, 15 October 2017 09:45
*To:* wikidata@lists.wikimedia.org
*Subject:* [Wikidata] Kickstartet: Adding 2.2 million German
organisations to Wikidata


Hi all,

the German business registry contains roughly 2.2 million
organisations. Some information is paid, but other information is public,
i.e. the info you get by searching at the link below and clicking on UT
(see example below):


https://www.handelsregister.de/rp_web/mask.do?Typ=e

I would like to add this to Wikidata, either by crawling or by raising
money to use crowdsourcing platforms like CrowdFlower or Amazon
Mechanical Turk.


It should meet notability criterion 2:
https://www.wikidata.org/wiki/Wikidata:Notability


2. It refers to an instance of a *clearly identifiable conceptual
or material entity*. The entity must be notable, in the sense that
it *can be described using serious and publicly available
references*. If there is no item about you yet, you are probably
not notable.


The reference is the official German business registry, which is 
serious and public. Orgs are also per definition clearly identifiable 
legal entities.


How can I get clearance to proceed on this?

All the best,
Sebastian


  Entity data

Saxony District court *Leipzig HRB 32853 * – A&A 
Dienstleistungsgesellschaft 

Re: [Wikidata] Kickstartet: Adding 2.2 million German organisations to Wikidata

2017-10-16 Thread Federico Morando
Dear All,

although in Italy these data are normally not available (not even the basic
data) from the chambers of commerce, there are some open data from which we
could extract several identifiers - of course these are biased toward the
suppliers of Public Administrations, because contracting with PA is the
trigger for being listed in these Open Data.

In the context of a broader effort to upload this kind of data to Wikidata,
such as the one that seems to be emerging from this thread, the firm I manage
may be willing to contribute about half a million pairs of labels and VAT
IDs. It's a relatively thin dataset, in the sense that you just have the
name of the firm and the VAT ID, and possibly a link to a portal we're
building in which you can gather additional information about the activity
of the firm with the Italian public administration. But, as I was
mentioning, Italian firm data are quite rare (they are not even available
on OpenCorporates.com).

By the way, https://www.wikidata.org/wiki/Property:P3608 (EU VAT number)
already exists and may provide a sufficient identifier in most cases, since
in most cases the country ISO code (e.g. IT for Italy) + the national VAT
ID generates the EU VAT number (the actual algorithm may be a bit more
complex, but it's documented). (That said, there are also national
identifiers which may be worth creating as properties, such as the
registration number at the national chambers of commerce, etc.)
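
A minimal sketch of the simple rule mentioned above (country code + national VAT ID), under the assumption that plain concatenation is enough. The function name `eu_vat_number` and the sample ID are hypothetical, and no checksum or country-specific format is validated.

```python
def eu_vat_number(country_code: str, national_vat_id: str) -> str:
    """Prefix a national VAT ID with its two-letter country code.

    Only the simple concatenation rule is covered; separators in the
    national ID are dropped, and nothing else is checked.
    """
    prefix = country_code.strip().upper()
    if len(prefix) != 2 or not prefix.isalpha():
        raise ValueError(f"expected a two-letter country code, got {country_code!r}")
    national_part = "".join(ch for ch in national_vat_id if ch.isalnum()).upper()
    return prefix + national_part


# Placeholder ID, not a real company's VAT number.
print(eu_vat_number("it", "012 3456 7890"))
```

One known exception to the plain ISO rule is Greece, whose VAT numbers use the prefix EL rather than the ISO code GR, which is one reason the documented algorithm is a bit more complex than this sketch.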

About the value of these data on Wikidata, starting from our use case, I
think that having permanent URIs for all firms on Wikidata would provide,
for instance, great value for several anti-corruption projects around the
world. (This could also provide a place to trace some international links
among companies, which are not always readily available today.) That said,
I perfectly understand the concerns of Andra in terms of scalability and
maintenance, and this is one of the reasons I did not think of donating
these data to Wikidata so far.

I'll try to follow these discussions, but please - Sebastian or others -
feel free to ping me if the project goes on and you want to include these
Italian data.

Best,

Federico



On Mon, Oct 16, 2017 at 10:25 AM, Andra Waagmeester 
wrote:

> A dataset of equal size on Belgian enterprises is available. With the
> same objective of enriching Wikidata with enterprise data, I recently
> proposed the following property:
> https://www.wikidata.org/wiki/Wikidata:Property_proposal/NACE_code
>
> However, after some talks with others in the Wikidata community, I have
> recently had second thoughts on whether a full dump of this type of
> dataset is a valuable enrichment of Wikidata. Adding 2 million
> items with additional statements per item would be quite an enlargement of
> Wikidata. If we bot-added all businesses of both Belgium and Germany, we
> would have 4 million new items, which would currently account for 10% of
> all of Wikidata. I am not sure what this would mean in terms of
> scalability and whether it would cause any scalability issues.
>
> Maybe a use-case-driven approach would be more appropriate here. We could
> think of a bot that would draw on the trade registers of the different
> countries whenever a specific use case calls for the inclusion of trade
> data.
>
> Just my 2 cents
>
> Cheers,
>
> Andra
>
> On Mon, Oct 16, 2017 at 9:48 AM, Sebastian Hellmann <
> hellm...@informatik.uni-leipzig.de> wrote:
>
>> Thanks, done.
>>
>> https://www.wikidata.org/wiki/Wikidata:Project_chat#Handelsregister
>>
>> On 15.10.2017 22:10, Yaroslav Blanter wrote:
>>
>> Hi Sebastian,
>>
>> I would say the best way is to file a request for the permissions for the
>> bot
>>
>> https://www.wikidata.org/wiki/Wikidata:Requests_for_permissions/Bot
>>
>> and possibly leave a message on the Project Chat
>>
>> https://www.wikidata.org/wiki/Wikidata:Project_chat
>>
>> Cheers
>> Yaroslav
>>
>> On Sun, Oct 15, 2017 at 9:44 AM, Sebastian Hellmann <
>> hellm...@informatik.uni-leipzig.de> wrote:
>>
>>> Hi all,
>>>
>>> the German business registry contains roughly 2.2 million organisations.
>>> Some information is paid, but other information is public, i.e. the info
>>> you get by searching at the link below and clicking on UT (see example
>>> below):
>>>
>>> https://www.handelsregister.de/rp_web/mask.do?Typ=e
>>>
>>>
>>> I would like to add this to Wikidata, either by crawling or by raising
>>> money to use crowdsourcing platforms like CrowdFlower or Amazon
>>> Mechanical Turk.
>>>
>>>
>>> It should meet notability criterion 2:
>>> https://www.wikidata.org/wiki/Wikidata:Notability
>>>
>>> 2. It refers to an instance of a *clearly identifiable conceptual or
>>> material entity*. The entity must be notable, in the sense that it *can
>>> be described using serious and publicly available references*. If there
>>> is no item about you yet, you are probably not notable.
>>>
>>>
>>> The reference is the official German business registry, which is serious
>>> and public. Orgs are also per definition clearly identifi

Re: [Wikidata] Kickstartet: Adding 2.2 million German organisations to Wikidata

2017-10-16 Thread Andra Waagmeester
A dataset of equal size on Belgian enterprises is available. With the
same objective of enriching Wikidata with enterprise data, I recently proposed
the following property:
https://www.wikidata.org/wiki/Wikidata:Property_proposal/NACE_code

However, after some talks with others in the Wikidata community, I have
recently had second thoughts on whether a full dump of this type of
dataset is a valuable enrichment of Wikidata. Adding 2 million items with
additional statements per item would be quite an enlargement of Wikidata. If
we bot-added all businesses of both Belgium and Germany, we would have 4
million new items, which would currently account for 10% of all of
Wikidata. I am not sure what this would mean in terms of scalability and
whether it would cause any scalability issues.
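
The arithmetic behind the 10% figure can be spelled out in a few lines. This sketch only uses the two numbers given above; the derived totals are what those numbers imply, not measured values.

```python
# Figures taken from the paragraph above.
NEW_ITEMS = 4_000_000        # Belgian plus German businesses
STATED_SHARE_PERCENT = 10    # "10% of all of Wikidata"

# If 4 million items equal 10% of Wikidata, the implied current size is:
implied_current_items = NEW_ITEMS * 100 // STATED_SHARE_PERCENT

# Share that the imported items would have once they are added:
share_after_import = NEW_ITEMS / (implied_current_items + NEW_ITEMS)

print(f"implied current size: {implied_current_items:,} items")
print(f"share after import:   {share_after_import:.1%}")
```

So the statement implies roughly 40 million existing items, and after the import the new business items would make up about one eleventh of the total.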

Maybe a use-case-driven approach would be more appropriate here. We could
think of a bot that would draw on the trade registers of the different
countries whenever a specific use case calls for the inclusion of trade
data.

Just my 2 cents

Cheers,

Andra

On Mon, Oct 16, 2017 at 9:48 AM, Sebastian Hellmann <
hellm...@informatik.uni-leipzig.de> wrote:

> Thanks, done.
>
> https://www.wikidata.org/wiki/Wikidata:Project_chat#Handelsregister
>
> On 15.10.2017 22:10, Yaroslav Blanter wrote:
>
> Hi Sebastian,
>
> I would say the best way is to file a request for the permissions for the
> bot
>
> https://www.wikidata.org/wiki/Wikidata:Requests_for_permissions/Bot
>
> and possibly leave a message on the Project Chat
>
> https://www.wikidata.org/wiki/Wikidata:Project_chat
>
> Cheers
> Yaroslav
>
> On Sun, Oct 15, 2017 at 9:44 AM, Sebastian Hellmann <
> hellm...@informatik.uni-leipzig.de> wrote:
>
>> Hi all,
>>
>> the German business registry contains roughly 2.2 million organisations.
>> Some information is paid, but other information is public, i.e. the info
>> you get by searching at the link below and clicking on UT (see example
>> below):
>>
>> https://www.handelsregister.de/rp_web/mask.do?Typ=e
>>
>>
>> I would like to add this to Wikidata, either by crawling or by raising
>> money to use crowdsourcing platforms like CrowdFlower or Amazon
>> Mechanical Turk.
>>
>>
>> It should meet notability criterion 2:
>> https://www.wikidata.org/wiki/Wikidata:Notability
>>
>> 2. It refers to an instance of a *clearly identifiable conceptual or
>> material entity*. The entity must be notable, in the sense that it *can
>> be described using serious and publicly available references*. If there
>> is no item about you yet, you are probably not notable.
>>
>>
>> The reference is the official German business registry, which is serious
>> and public. Orgs are also per definition clearly identifiable legal
>> entities.
>>
>> How can I get clearance to proceed on this?
>>
>> All the best,
>> Sebastian
>>
>>
>>
>> Entity data
>> Saxony District court *Leipzig HRB 32853 *– A&A
>> Dienstleistungsgesellschaft mbH
>> Legal status: Gesellschaft mit beschränkter Haftung
>> Capital: 25.000,00 EUR
>> Date of entry: 29/08/2016
>> (When entering date of entry, wrong data input can occur due to system
>> failures!)
>> Date of removal: -
>> Balance sheet available: -
>> Address (subject to correction): A&A Dienstleistungsgesellschaft mbH
>> Prager Straße 38-40
>> 04317 Leipzig
>>
>>
>> --
>> All the best,
>> Sebastian Hellmann
>>
>> Director of Knowledge Integration and Linked Data Technologies (KILT)
>> Competence Center
>> at the Institute for Applied Informatics (InfAI) at Leipzig University
>> Executive Director of the DBpedia Association
>> Projects: http://dbpedia.org, http://nlp2rdf.org,
>> http://linguistics.okfn.org, https://www.w3.org/community/ld4lt
>> 
>> Homepage: http://aksw.org/SebastianHellmann
>> Research Group: http://aksw.org
>>
>> ___
>> Wikidata mailing list
>> Wikidata@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/wikidata
>>
>>
>
>
>
>
> --
> All the best,
> Sebastian Hellmann
>
> Director of Knowledge Integration and Linked Data Technologies (KILT)
> Competence Center
> at the Institute for Applied Informatics (InfAI) at Leipzig University
> Executive Director of the DBpedia Association
> Projects: http://dbpedia.org, http://nlp2rdf.org,
> http://linguistics.okfn.org, https://www.w3.org/community/ld4lt
> 
> Homepage: http://aksw.org/SebastianHellmann
> Research Group: http://aksw.org
>
>
>


Re: [Wikidata] Kickstartet: Adding 2.2 million German organisations to Wikidata

2017-10-16 Thread Neubert, Joachim
Hi Sebastian,

This is huge! It will cover almost all currently existing German companies. 
Many of these will have similar names, so preparing for disambiguation is a 
concern.

A good approach would be to propose a property for an external identifier,
load the data into Mix-n-match, create links for companies already in
Wikidata, and add the rest (or perhaps only parts of them - I’m not sure if
having all of them in Wikidata makes sense, but that’s another discussion),
preferably with location and/or sector of trade in the description field.

I’ve tried to figure out what could be used as the key for an external
identifier property. However, it looks like the registry does not offer any
(persistent) URL to its entries. So for looking up a company, apparently there
are two options:


-  conducting an extended search for the exact string “A&A
Dienstleistungsgesellschaft mbH”

-  copying the register number “32853”, selecting the court (Leipzig) from
the corresponding dropdown list, and searching for that

Neither way is very intuitive, even if we can provide a link to the search
form. This would make for a weak connection to the source of information. More
importantly, it makes disambiguation in Mix-n-match difficult. This applies to
the preparation of your initial load (you would not want to create
duplicates), but much more so to everybody else who wants to match his or her
data later on. Being forced to search for entries manually in a cumbersome way
for disambiguation of a new, possibly large and rich dataset is, in my eyes,
not something we want to impose on future contributors. And often, the free
information they find in the registry (formal name, register number, legal
form, address) will not easily match the information they have (common name,
location, perhaps founding date, and, most importantly, sector of trade), so
disambiguation may still be difficult.
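
To illustrate the mismatch between formal and common names, here is a rough sketch of a normalisation step that could propose match candidates. The token list and the function name `match_key` are assumptions made for illustration; this is not how Mix-n-match works internally.

```python
import re

# Hypothetical, incomplete set of legal-form tokens to ignore when comparing names.
LEGAL_FORM_TOKENS = {"gmbh", "mbh", "ag", "kg", "ohg", "ug", "co"}


def match_key(name: str) -> str:
    """Lower-case a company name and drop legal-form tokens.

    Two names with the same key are only *candidates* for a match; the
    key alone cannot tell similarly named companies apart.
    """
    tokens = re.findall(r"[0-9a-zäöüß&]+", name.lower())
    kept = [t for t in tokens if t not in LEGAL_FORM_TOKENS and t != "&"]
    return " ".join(kept)


formal = "A&A Dienstleistungsgesellschaft mbH"   # formal name from the registry
common = "A&A Dienstleistungsgesellschaft"       # name a contributor might hold
print(match_key(formal) == match_key(common))
```

Because many German companies have similar names, a shared key would still need to be confirmed by court and register number, or at least by location, before two records are linked.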

Have you checked which parts of the accessible information as below can be 
crawled and added legally to external databases such as Wikidata?

Cheers, Joachim

--
Joachim Neubert

ZBW – German National Library of Economics
Leibniz Information Centre for Economics
Neuer Jungfernstieg 21
20354 Hamburg
Phone +49-42834-462



From: Wikidata [mailto:wikidata-boun...@lists.wikimedia.org] On behalf of
Sebastian Hellmann
Sent: Sunday, 15 October 2017 09:45
To: wikidata@lists.wikimedia.org
Subject: [Wikidata] Kickstartet: Adding 2.2 million German organisations to
Wikidata


Hi all,

the German business registry contains roughly 2.2 million organisations. Some
information is paid, but other information is public, i.e. the info you get by
searching at the link below and clicking on UT (see example below):

https://www.handelsregister.de/rp_web/mask.do?Typ=e



I would like to add this to Wikidata, either by crawling or by raising money to
use crowdsourcing platforms like CrowdFlower or Amazon Mechanical Turk.



It should meet notability criterion 2:
https://www.wikidata.org/wiki/Wikidata:Notability

2. It refers to an instance of a clearly identifiable conceptual or material 
entity. The entity must be notable, in the sense that it can be described using 
serious and publicly available references. If there is no item about you yet, 
you are probably not notable.

The reference is the official German business registry, which is serious and 
public. Orgs are also per definition clearly identifiable legal entities.

How can I get clearance to proceed on this?

All the best,
Sebastian





Entity data

Saxony District court Leipzig HRB 32853 – A&A Dienstleistungsgesellschaft mbH

Legal status: Gesellschaft mit beschränkter Haftung
Capital: 25.000,00 EUR
Date of entry: 29/08/2016
(When entering date of entry, wrong data input can occur due to system
failures!)
Date of removal: -
Balance sheet available: -
Address (subject to correction):
A&A Dienstleistungsgesellschaft mbH
Prager Straße 38-40
04317 Leipzig
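
As a sketch of how a record like the one above could be turned into structured values before any upload: the regular expression, the list of register types, and the helper names are assumptions based only on this single example, not on a specification of the registry's output.

```python
import re
from datetime import date, datetime
from decimal import Decimal

# Assumed header layout: "<state> District court <court> <type> <number> – <name>".
# The register types listed here are an assumption and may be incomplete.
HEADER = re.compile(
    r"District court (?P<court>.+?) (?P<register_type>HRA|HRB|GnR|PR|VR) "
    r"(?P<number>\d+) – (?P<name>.+)"
)


def parse_header(line: str) -> dict:
    """Split the header line into court, register type, number and name."""
    match = HEADER.search(line)
    if match is None:
        raise ValueError(f"unrecognised header: {line!r}")
    return match.groupdict()


def parse_capital(text: str) -> tuple:
    """Convert a German-formatted amount such as '25.000,00 EUR'."""
    amount_text, currency = text.split()
    amount = Decimal(amount_text.replace(".", "").replace(",", "."))
    return amount, currency


def parse_entry_date(text: str) -> date:
    """Parse a DD/MM/YYYY date as shown in the record."""
    return datetime.strptime(text, "%d/%m/%Y").date()


header = parse_header(
    "Saxony District court Leipzig HRB 32853 – A&A Dienstleistungsgesellschaft mbH"
)
capital = parse_capital("25.000,00 EUR")
entered = parse_entry_date("29/08/2016")
print(header, capital, entered.isoformat())
```

Given the registry's own warning about wrong dates of entry, parsed dates would probably need a plausibility check before being written anywhere.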



--
All the best,
Sebastian Hellmann

Director of Knowledge Integration and Linked Data Technologies (KILT) 
Competence Center
at the Institute for Applied Informatics (InfAI) at Leipzig University
Executive Director of the DBpedia Association
Projects: http://dbpedia.org, http://nlp2rdf.org, http://linguistics.okfn.org, 
https://www.w3.org/community/ld4lt
Homepage: http://aksw.org/SebastianHellmann
Research Group: http://aksw.org


Re: [Wikidata] Kickstartet: Adding 2.2 million German organisations to Wikidata

2017-10-16 Thread Sebastian Hellmann

Thanks, done.

https://www.wikidata.org/wiki/Wikidata:Project_chat#Handelsregister


On 15.10.2017 22:10, Yaroslav Blanter wrote:

Hi Sebastian,

I would say the best way is to file a request for the permissions for 
the bot


https://www.wikidata.org/wiki/Wikidata:Requests_for_permissions/Bot

and possibly leave a message on the Project Chat

https://www.wikidata.org/wiki/Wikidata:Project_chat

Cheers
Yaroslav

On Sun, Oct 15, 2017 at 9:44 AM, Sebastian Hellmann wrote:


Hi all,

the German business registry contains roughly 2.2 million
organisations. Some information is paid, but other information is
public, i.e. the info you get by searching at the link below and
clicking on UT (see example below):

https://www.handelsregister.de/rp_web/mask.do?Typ=e



I would like to add this to Wikidata, either by crawling or by
raising money to use crowdsourcing platforms like CrowdFlower or
Amazon Mechanical Turk.


It should meet notability criterion 2:
https://www.wikidata.org/wiki/Wikidata:Notability



2. It refers to an instance of a *clearly identifiable conceptual
or material entity*. The entity must be notable, in the sense
that it *can be described using serious and publicly available
references*. If there is no item about you yet, you are probably
not notable.



The reference is the official German business registry, which is
serious and public. Orgs are also per definition clearly
identifiable legal entities.

How can I get clearance to proceed on this?

All the best,
Sebastian



  Entity data


Saxony District court *Leipzig HRB 32853 *– A&A
Dienstleistungsgesellschaft mbH
Legal status: Gesellschaft mit beschränkter Haftung
Capital: 25.000,00 EUR
Date of entry: 29/08/2016
(When entering date of entry, wrong data input can occur due to
system failures!)
Date of removal: -
Balance sheet available: -
Address (subject to correction): A&A Dienstleistungsgesellschaft mbH
Prager Straße 38-40
04317 Leipzig


-- 
All the best,

Sebastian Hellmann

Director of Knowledge Integration and Linked Data Technologies
(KILT) Competence Center
at the Institute for Applied Informatics (InfAI) at Leipzig University
Executive Director of the DBpedia Association
Projects: http://dbpedia.org, http://nlp2rdf.org,
http://linguistics.okfn.org, https://www.w3.org/community/ld4lt

Homepage: http://aksw.org/SebastianHellmann

Research Group: http://aksw.org








--
All the best,
Sebastian Hellmann

Director of Knowledge Integration and Linked Data Technologies (KILT) 
Competence Center

at the Institute for Applied Informatics (InfAI) at Leipzig University
Executive Director of the DBpedia Association
Projects: http://dbpedia.org, http://nlp2rdf.org, 
http://linguistics.okfn.org, https://www.w3.org/community/ld4lt 


Homepage: http://aksw.org/SebastianHellmann
Research Group: http://aksw.org


Re: [Wikidata] Kickstartet: Adding 2.2 million German organisations to Wikidata

2017-10-15 Thread Federico Leva (Nemo)
This is an area where I would very much like to see some important 
properties created and populated, to the benefit e.g. of various 
infoboxes on Wikipedias which contain data in need of frequent updates 
(especially income, revenue, market capitalization, number of employees, 
links to most recent financial statements and other corporate information).

https://www.wikidata.org/w/index.php?title=Wikidata:Property_proposal/Organization&oldid=307430401
https://www.wikidata.org/wiki/Wikidata:List_of_properties/Organization

Even data for companies listed on stock exchanges is terribly outdated
most of the time.


Nemo



Re: [Wikidata] Kickstartet: Adding 2.2 million German organisations to Wikidata

2017-10-15 Thread Yaroslav Blanter
Hi Sebastian,

I would say the best way is to file a request for bot permissions

https://www.wikidata.org/wiki/Wikidata:Requests_for_permissions/Bot

and possibly leave a message on the Project Chat

https://www.wikidata.org/wiki/Wikidata:Project_chat

Cheers
Yaroslav

On Sun, Oct 15, 2017 at 9:44 AM, Sebastian Hellmann <
hellm...@informatik.uni-leipzig.de> wrote:

> Hi all,
>
> the German business registry contains roughly 2.2 million organisations.
> Some of the information is paid, but the rest is public, i.e. the
> information you get by searching at the URL below and clicking on UT
> (see the example below):
>
> https://www.handelsregister.de/rp_web/mask.do?Typ=e
>
>
> I would like to add this to Wikidata, either by crawling or by raising
> money to use crowdsourcing platforms such as CrowdFlower or Amazon
> Mechanical Turk.
>
>
> It should meet notability criterion 2:
> https://www.wikidata.org/wiki/Wikidata:Notability
>
> 2. It refers to an instance of a *clearly identifiable conceptual or
> material entity*. The entity must be notable, in the sense that it *can
> be described using serious and publicly available references*. If there
> is no item about you yet, you are probably not notable.
>
>
> The reference is the official German business registry, which is serious
> and public. Organisations are also, by definition, clearly identifiable
> legal entities.
>
> How can I get clearance to proceed on this?
>
> All the best,
> Sebastian
>
>
>
> Entity data
> Saxony District court *Leipzig HRB 32853* – A&A Dienstleistungsgesellschaft mbH
> Legal status: Gesellschaft mit beschränkter Haftung
> Capital: 25.000,00 EUR
> Date of entry: 29/08/2016
> (When entering date of entry, wrong data input can occur due to system
> failures!)
> Date of removal: -
> Balance sheet available: -
> Address (subject to correction): A&A Dienstleistungsgesellschaft mbH
> Prager Straße 38-40
> 04317 Leipzig
>
>
>
>