Re: [Wikidata] weekly summary #282

2017-10-16 Thread Thad Guidry
Lydia,

Could I ask an itty-bitty favor?
For the Newest Properties listing: in the future, can you sort that into
two listings, IDs and Other?

That will make it much easier for folks (like me) who have to maintain
mappings against other schemas, vocabularies, and ontologies.

Thad
+ThadGuidry 
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Mix n Match question

2017-10-16 Thread David Lowe
It uploaded properly, but I've got a couple of questions:
When I confirm a match, I get an "Invalid snak property" error.
Also, what is to happen with these matches? I have a property for my PIC
IDs (P2750), but there seems to be no way to update the Wikidata entry with
my IDs. I fear I've done something wrong!


*David Lowe | The New York Public Library**Specialist II, Photography
Collection*

*Photographers' Identities Catalog *

On Mon, Oct 16, 2017 at 6:26 PM, David Lowe  wrote:

> Many thanks, Magnus! I look forward to working/playing with this.
>
> Best,
> David
>
>
> *David Lowe | The New York Public Library**Specialist II, Photography
> Collection*
>
> *Photographers' Identities Catalog *
>
> On Mon, Oct 16, 2017 at 5:26 PM, Magnus Manske <
> magnusman...@googlemail.com> wrote:
>
>> Hi David,
>>
>> the upload page at
>> https://tools.wmflabs.org/mix-n-match/import.php
>> won't take your matches, but they can be imported from Wikidata with a
>> click.
>>
>> If the upload is too big for the page, mail me the file and I'll do it
>> the old-fashioned way ;-)
>>
>> Cheers,
>> Magnus
>>
>> On Mon, Oct 16, 2017 at 10:19 PM David Lowe  wrote:
>>
>>> Magnus, or anyone else who may be able to advise:
>>>
>>> I'd like to add the Photographers' Identities Catalog (PIC) entries to
>>> Mix n Match. I have about 128,000 entries for photographers in PIC, of
>>> which I already have matched ~14,000 to Wikidata entries. My PIC IDs are
>>> already in the corresponding Wikidata entries. I assume I should remove
>>> these from the file before I upload it to Mix n Match, but wanted to check
>>> first.
>>>
>>> Thanks in advance,
>>> David
>>>
>>>
>>> *David Lowe | The New York Public Library**Specialist II, Photography
>>> Collection*
>>>
>>> *Photographers' Identities Catalog *
>>> ___
>>> Wikidata mailing list
>>> Wikidata@lists.wikimedia.org
>>> https://lists.wikimedia.org/mailman/listinfo/wikidata
>>>
>>
>> ___
>> Wikidata mailing list
>> Wikidata@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/wikidata
>>
>>
>
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Kickstarted: Adding 2.2 million German organisations to Wikidata

2017-10-16 Thread Sebastian Hellmann
Ok, I put some effort into 
https://www.wikidata.org/wiki/Wikidata:Requests_for_permissions/Bot/Handelsregister 
to move the discussion there.


All the best,

Sebastian


On 16.10.2017 18:06, Yaroslav Blanter wrote:

Dear All,

it is great that we are having this discussion, but may I please 
suggest having it on the RfP page on Wikidata? People have already asked 
similar questions there, and, in my experience, on-wiki discussion 
will likely lead to a refined request which accommodates all suggestions.


Cheers
Yaroslav

On Mon, Oct 16, 2017 at 5:53 PM, Sebastian Hellmann 
> wrote:


ah, ok, sorry, I was assuming that Blazegraph would transitively
resolve this automatically.

Ok, so let's divide the problem:

# Task 1:

Connect all existing organisations with the data from the
handelsregister. (No new identifiers added, we can start right now)

Add a constraint that all German organisations should be connected
to a court, i.e. the registering organisation as well as the id
assigned by the court.

@all: any properties I can reuse for this?

I will focus on this as it seems quite easy. We can first filter
orgs by other criteria, i.e. country as a blocking key and then
string match the rest.
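The blocking-and-matching step described above can be sketched in Python; a minimal illustration with made-up records (the sample names, the tuple layout, and the 0.5 similarity threshold are all assumptions for the sketch, not anything from the register itself):

```python
from difflib import SequenceMatcher

# Made-up sample records: (name, country) with country as the blocking key.
wikidata_orgs = [
    ("Siemens AG", "DE"),
    ("Deutsche Bahn AG", "DE"),
    ("Acme Corp", "US"),
]
register_orgs = [
    ("Siemens Aktiengesellschaft", "DE"),
    ("Deutsche Bahn AG", "DE"),
]

def match_orgs(candidates, targets, threshold=0.5):
    """Block on country first, then fuzzy-match names with difflib."""
    matches = []
    for name, country in candidates:
        # Blocking key: only compare names within the same country.
        pool = [t for t in targets if t[1] == country]
        best, best_score = None, 0.0
        for t_name, _ in pool:
            score = SequenceMatcher(None, name.lower(), t_name.lower()).ratio()
            if score > best_score:
                best, best_score = t_name, score
        if best is not None and best_score >= threshold:
            matches.append((name, best, round(best_score, 2)))
    return matches

pairs = match_orgs(register_orgs, wikidata_orgs)
```

In practice one would normalize legal forms ("AG", "GmbH") before comparing and use a dedicated string-distance library, but the block-then-compare structure stays the same.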

# Task 2:

Add all missing identifiers for the remaining orgs in the
Handelsregister. Task 2 can be rediscussed and decided once Task 1 is
sufficiently finished.


# regarding maintenance:
I find Wikidata as such very hard to maintain as all data is
copied from somewhere else eventually, but Wikipedia has the same
problem. In the case of the German Business register, maintenance
is especially easy as the orgs are stable and uniquely
identifiable. Even the fact that a company gets shut down should
still be in Wikidata, so you have historical information. I mean,
you also keep the Roman Empire, the Hanse and even finished
projects in Wikidata. So even if an org ceases to exist, the entry
in Wikidata should stay.

# regarding Opencorporates
I have a critical opinion of Opencorporates. It appears to be
open, but you actually cannot get the data. If somebody has a
data dump, please forward it to me. Thanks.
On top of that, I consider Opencorporates a danger to open data.
It appears to push for open availability of data, but then it is
limited to open licenses. Its usefulness is limited as there are
no free dumps and no possibility to duplicate it effectively.
Wikipedia and Wikidata provide dumps and an API for exactly this
reason. Every time somebody wants to create an open organisation
dataset with no barriers, the existence of Opencorporates blocks it.

Cheers,
Sebastian


On 16.10.2017 15:34, Antonin Delpeuch (lists) wrote:

And… my own count was wrong too, because I forgot to add DISTINCT in my
query (if there are multiple paths from the class to "organization
(Q43229)", items will appear multiple times).

So, I get 1 168 084 now.
http://tinyurl.com/yaeqlsnl

It's easy to get these things wrong!

Antonin

On 16/10/2017 14:16, Antonin Delpeuch (lists) wrote:

Thanks Ettore for spotting that!

Wikidata types (P31) only make sense when you consider the "subclass of"
(P279) property that we use to build the ontology (except in a few cases
where the community has decided not to use any subclass for a particular
type).

So, to retrieve all items of a certain type in SPARQL, you need to use
something like this:

?item wdt:P31/wdt:P279* ?type

You can also have other variants to accept non-truthy statements.

Just with this truthy version, I currently get 1 208 227 items. But note
that there are still a lot of items where P31 is not provided, or
subclasses which have not been connected to "organization (Q43229)"…

So in general, it's very hard to have any "guarantees that there are no
duplicates", just because you don't have any guarantees that the
information currently in Wikidata is complete or correct.

I would recommend trying to import something a bit smaller to get
acquainted with how Wikidata works and what the matching process looks
like in practice. And beyond a one-off import, as Ettore said it is
important to think how the data will be maintained in the future…

Antonin
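Antonin's wdt:P31/wdt:P279* pattern can also be generated and shared programmatically; a small Python sketch (the helper names are illustrative, not part of any Wikidata tooling):

```python
import urllib.parse

def count_instances_query(class_qid: str) -> str:
    """SPARQL counting distinct items whose P31 value is the given class
    or any transitive P279 subclass of it (truthy statements only)."""
    return (
        "SELECT (COUNT(DISTINCT ?item) AS ?count) WHERE { "
        f"?item wdt:P31/wdt:P279* wd:{class_qid} . "
        "}"
    )

def wdqs_link(query: str) -> str:
    """Shareable Wikidata Query Service link: the query goes into the
    URL fragment, percent-encoded."""
    return "https://query.wikidata.org/#" + urllib.parse.quote(query)

q = count_instances_query("Q43229")  # Q43229 = organization
link = wdqs_link(q)
```

Note the DISTINCT inside the COUNT: as pointed out in the thread, without it items reachable through multiple P279 paths are counted more than once.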

On 16/10/2017 13:46, Ettore RIZZA wrote:

 - Wikidata has 40k organisations:

 https://query.wikidata.org/ with the query:

 SELECT ?item ?itemLabel WHERE {
   ?item wdt:P31 wd:Q43229.
   SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
 }


Hi,

I think Wikidata contains many more organizations than that. If we
choose the "instance of Business enterprise", we get 135570 results.

Re: [Wikidata] Mix n Match question

2017-10-16 Thread David Lowe
Many thanks, Magnus! I look forward to working/playing with this.

Best,
David


*David Lowe | The New York Public Library**Specialist II, Photography
Collection*

*Photographers' Identities Catalog *

On Mon, Oct 16, 2017 at 5:26 PM, Magnus Manske 
wrote:

> Hi David,
>
> the upload page at
> https://tools.wmflabs.org/mix-n-match/import.php
> won't take your matches, but they can be imported from Wikidata with a
> click.
>
> If the upload is too big for the page, mail me the file and I'll do it the
> old-fashioned way ;-)
>
> Cheers,
> Magnus
>
> On Mon, Oct 16, 2017 at 10:19 PM David Lowe  wrote:
>
>> Magnus, or anyone else who may be able to advise:
>>
>> I'd like to add the Photographers' Identities Catalog (PIC) entries to
>> Mix n Match. I have about 128,000 entries for photographers in PIC, of
>> which I already have matched ~14,000 to Wikidata entries. My PIC IDs are
>> already in the corresponding Wikidata entries. I assume I should remove
>> these from the file before I upload it to Mix n Match, but wanted to check
>> first.
>>
>> Thanks in advance,
>> David
>>
>>
>> *David Lowe | The New York Public Library**Specialist II, Photography
>> Collection*
>>
>> *Photographers' Identities Catalog *
>> ___
>> Wikidata mailing list
>> Wikidata@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/wikidata
>>
>
> ___
> Wikidata mailing list
> Wikidata@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata
>
>
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


[Wikidata] weekly summary #282

2017-10-16 Thread Lydia Pintscher
Hey everyone :)

Here is your summary of what happened around Wikidata over the past week.

Discussions

   - New request for comments: Association Football Matches


Events / Press / Blogs


   - Upcoming: *WikidataCon*, Berlin, 28-29 October (sold out)
   - If you organize a local event for the Wikidata birthday, feel free to
   add it on the events page and the birthday page
   - A blog post describing Textes d'affiches, a tool using Wikidata to
   show movies and the works they are adapted from, developed during a
   hackathon at the French national library (by Shonagon, in French)
   - Slides of the presentation "Wikidata Query Service: The State of the
   Engine" by Stas Malyshev
   - Slides of the presentation "Wikidata and structured data initiatives
   at Wikimedia. Current trends and priorities" by Dario Taraborelli
   - Wikidata as a linking hub for knowledge organization systems?
   Integrating an authority mapping into Wikidata and learning lessons for KOS
   mappings by Joachim Neubert 

Other Noteworthy Stuff

   - If you are interested in Structured Data on Commons, and helping out
   as a Wikidata contributor, please consider joining the new Structured
   Commons community focus group!
   - Participate in the Global Legislative Openness Week with mySociety
   and organize a Wikidata workshop in your country, from 20 to 30
   November 2017
   - Inventaire now has a News section where you can follow ongoing
   discussions and developments of the project
   - wikidata-cli added bot support and new commands (wd search, wd
   aliases, wd add-alias, wd remove-alias, wd set-alias)
   - wikidata-edit added bot edits support and new functions (alias.add,
   alias.remove, alias.set)
   - HarvestTemplates has a new collaboration platform
   - You can now use reCh to patrol items from your PagePile

Did you know?

   - Newest properties: Women's Sports Foundation ID, Oklahoma Sports
   Hall of Fame ID, North Carolina Sports Hall of Fame ID, New Mexico
   Sports Hall of Fame ID, National Trust Collections ID, Infopatrimônio
   ID, KBO pitcher ID, KBO hitter ID,

Re: [Wikidata] Mix n Match question

2017-10-16 Thread Magnus Manske
Hi David,

the upload page at
https://tools.wmflabs.org/mix-n-match/import.php
won't take your matches, but they can be imported from Wikidata with a
click.

If the upload is too big for the page, mail me the file and I'll do it the
old-fashioned way ;-)

Cheers,
Magnus

On Mon, Oct 16, 2017 at 10:19 PM David Lowe  wrote:

> Magnus, or anyone else who may be able to advise:
>
> I'd like to add the Photographers' Identities Catalog (PIC) entries to Mix
> n Match. I have about 128,000 entries for photographers in PIC, of which I
> already have matched ~14,000 to Wikidata entries. My PIC IDs are already in
> the corresponding Wikidata entries. I assume I should remove these from the
> file before I upload it to Mix n Match, but wanted to check first.
>
> Thanks in advance,
> David
>
>
> *David Lowe | The New York Public Library**Specialist II, Photography
> Collection*
>
> *Photographers' Identities Catalog *
> ___
> Wikidata mailing list
> Wikidata@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata
>
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


[Wikidata] Mix n Match question

2017-10-16 Thread David Lowe
Magnus, or anyone else who may be able to advise:

I'd like to add the Photographers' Identities Catalog (PIC) entries to Mix
n Match. I have about 128,000 entries for photographers in PIC, of which I
already have matched ~14,000 to Wikidata entries. My PIC IDs are already in
the corresponding Wikidata entries. I assume I should remove these from the
file before I upload it to Mix n Match, but wanted to check first.

Thanks in advance,
David


*David Lowe | The New York Public Library**Specialist II, Photography
Collection*

*Photographers' Identities Catalog *
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] JADE needs your feedback

2017-10-16 Thread Keegan Peterzell
Hi all,

Feedback is needed for specifics:

* Judgments, Endorsements, and Preferences -
https://www.mediawiki.org/wiki/Topic:Tzw0uv2bucrdprm4
* Thematic and quantitative analysis of judgments -
https://www.mediawiki.org/wiki/Topic:Tzw5fix7hbs4ui8j
* Free text comments and suppression -
https://www.mediawiki.org/wiki/Topic:Tzw4ebq17wbdog74
* Should we integrate JADE with Flow? -
https://www.mediawiki.org/wiki/Topic:Tzw3qg8qiqow10d8

If you think you might have something to say, please take a look and
contribute.

I think this should be the last general email with Wikidata-l cc'd. Thanks!

-- 
Keegan Peterzell
Technical Collaboration Specialist
Wikimedia Foundation
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Kickstarted: Adding 2.2 million German organisations to Wikidata

2017-10-16 Thread Sebastian Hellmann

Hi Yaroslav,

in addition to this list, I added it here:

https://www.wikidata.org/wiki/Wikidata:Requests_for_permissions/Bot/Handelsregister

and here:

https://www.wikidata.org/wiki/Wikidata:Project_chat#Handelsregister

but I received more and longer answers on this list.

All the best,

Sebastian


On 16.10.2017 18:06, Yaroslav Blanter wrote:

Dear All,

it is great that we are having this discussion, but may I please 
suggest having it on the RfP page on Wikidata? People have already asked 
similar questions there, and, in my experience, on-wiki discussion 
will likely lead to a refined request which accommodates all suggestions.


Cheers
Yaroslav

On Mon, Oct 16, 2017 at 5:53 PM, Sebastian Hellmann 
> wrote:


ah, ok, sorry, I was assuming that Blazegraph would transitively
resolve this automatically.

Ok, so let's divide the problem:

# Task 1:

Connect all existing organisations with the data from the
handelsregister. (No new identifiers added, we can start right now)

Add a constraint that all German organisations should be connected
to a court, i.e. the registering organisation as well as the id
assigned by the court.

@all: any properties I can reuse for this?

I will focus on this as it seems quite easy. We can first filter
orgs by other criteria, i.e. country as a blocking key and then
string match the rest.

# Task 2:

Add all missing identifiers for the remaining orgs in the
Handelsregister. Task 2 can be rediscussed and decided once Task 1 is
sufficiently finished.


# regarding maintenance:
I find Wikidata as such very hard to maintain as all data is
copied from somewhere else eventually, but Wikipedia has the same
problem. In the case of the German Business register, maintenance
is especially easy as the orgs are stable and uniquely
identifiable. Even the fact that a company gets shut down should
still be in Wikidata, so you have historical information. I mean,
you also keep the Roman Empire, the Hanse and even finished
projects in Wikidata. So even if an org ceases to exist, the entry
in Wikidata should stay.

# regarding Opencorporates
I have a critical opinion of Opencorporates. It appears to be
open, but you actually cannot get the data. If somebody has a
data dump, please forward it to me. Thanks.
On top of that, I consider Opencorporates a danger to open data.
It appears to push for open availability of data, but then it is
limited to open licenses. Its usefulness is limited as there are
no free dumps and no possibility to duplicate it effectively.
Wikipedia and Wikidata provide dumps and an API for exactly this
reason. Every time somebody wants to create an open organisation
dataset with no barriers, the existence of Opencorporates blocks it.

Cheers,
Sebastian


On 16.10.2017 15:34, Antonin Delpeuch (lists) wrote:

And… my own count was wrong too, because I forgot to add DISTINCT in my
query (if there are multiple paths from the class to "organization
(Q43229)", items will appear multiple times).

So, I get 1 168 084 now.
http://tinyurl.com/yaeqlsnl

It's easy to get these things wrong!

Antonin

On 16/10/2017 14:16, Antonin Delpeuch (lists) wrote:

Thanks Ettore for spotting that!

Wikidata types (P31) only make sense when you consider the "subclass of"
(P279) property that we use to build the ontology (except in a few cases
where the community has decided not to use any subclass for a particular
type).

So, to retrieve all items of a certain type in SPARQL, you need to use
something like this:

?item wdt:P31/wdt:P279* ?type

You can also have other variants to accept non-truthy statements.

Just with this truthy version, I currently get 1 208 227 items. But note
that there are still a lot of items where P31 is not provided, or
subclasses which have not been connected to "organization (Q43229)"…

So in general, it's very hard to have any "guarantees that there are no
duplicates", just because you don't have any guarantees that the
information currently in Wikidata is complete or correct.

I would recommend trying to import something a bit smaller to get
acquainted with how Wikidata works and what the matching process looks
like in practice. And beyond a one-off import, as Ettore said it is
important to think how the data will be maintained in the future…

Antonin

On 16/10/2017 13:46, Ettore RIZZA wrote:

 - Wikidata has 40k organisations:

 https://query.wikidata.org/ with the query:

 SELECT ?item ?itemLabel WHERE {
   ?item wdt:P31 wd:Q43229.
   SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
 }

Re: [Wikidata] Kickstarted: Adding 2.2 million German organisations to Wikidata

2017-10-16 Thread Yaroslav Blanter
Dear All,

it is great that we are having this discussion, but may I please suggest
having it on the RfP page on Wikidata? People have already asked similar
questions there, and, in my experience, on-wiki discussion will likely
lead to a refined request which accommodates all suggestions.

Cheers
Yaroslav

On Mon, Oct 16, 2017 at 5:53 PM, Sebastian Hellmann <
hellm...@informatik.uni-leipzig.de> wrote:

> ah, ok, sorry, I was assuming that Blazegraph would transitively resolve
> this automatically.
>
> Ok, so let's divide the problem:
>
> # Task 1:
>
> Connect all existing organisations with the data from the handelsregister.
> (No new identifiers added, we can start right now)
>
> Add a constraint that all German organisations should be connected to a
> court, i.e. the registering organisation as well as the id assigned by the
> court.
>
> @all: any properties I can reuse for this?
>
> I will focus on this as it seems quite easy. We can first filter orgs by
> other criteria, i.e. country as a blocking key and then string match the
> rest.
>
> # Task 2:
>
> Add all missing identifiers for the remaining orgs in the Handelsregister.
> Task 2 can be rediscussed and decided once Task 1 is sufficiently finished.
>
> # regarding maintenance:
> I find Wikidata as such very hard to maintain as all data is copied from
> somewhere else eventually, but Wikipedia has the same problem. In the case
> of the German Business register, maintenance is especially easy as the orgs
> are stable and uniquely identifiable. Even the fact that a company gets
> shut down should still be in Wikidata, so you have historical information.
> I mean, you also keep the Roman Empire, the Hanse and even finished
> projects in Wikidata. So even if an org ceases to exist, the entry in
> Wikidata should stay.
>
> # regarding Opencorporates
> I have a critical opinion of Opencorporates. It appears to be open, but
> you actually cannot get the data. If somebody has a data dump, please
> forward it to me. Thanks.
> On top of that, I consider Opencorporates a danger to open data. It
> appears to push for open availability of data, but then it is limited to
> open licenses. Its usefulness is limited as there are no free dumps and
> no possibility to duplicate it effectively. Wikipedia and Wikidata
> provide dumps and an API for exactly this reason. Every time somebody
> wants to create an open organisation dataset with no barriers, the
> existence of Opencorporates blocks it.
>
> Cheers,
> Sebastian
>
>
> On 16.10.2017 15:34, Antonin Delpeuch (lists) wrote:
>
> And… my own count was wrong too, because I forgot to add DISTINCT in my
> query (if there are multiple paths from the class to "organization
> (Q43229)", items will appear multiple times).
>
> So, I get 1 168 084 now. http://tinyurl.com/yaeqlsnl
>
> It's easy to get these things wrong!
>
> Antonin
>
> On 16/10/2017 14:16, Antonin Delpeuch (lists) wrote:
>
> Thanks Ettore for spotting that!
>
> Wikidata types (P31) only make sense when you consider the "subclass of"
> (P279) property that we use to build the ontology (except in a few cases
> where the community has decided not to use any subclass for a particular
> type).
>
> So, to retrieve all items of a certain type in SPARQL, you need to use
> something like this:
>
> ?item wdt:P31/wdt:P279* ?type
>
> You can also have other variants to accept non-truthy statements.
>
> Just with this truthy version, I currently get 1 208 227 items. But note
> that there are still a lot of items where P31 is not provided, or
> subclasses which have not been connected to "organization (Q43229)"…
>
> So in general, it's very hard to have any "guarantees that there are no
> duplicates", just because you don't have any guarantees that the
> information currently in Wikidata is complete or correct.
>
> I would recommend trying to import something a bit smaller to get
> acquainted with how Wikidata works and what the matching process looks
> like in practice. And beyond a one-off import, as Ettore said it is
> important to think how the data will be maintained in the future…
>
> Antonin
>
> On 16/10/2017 13:46, Ettore RIZZA wrote:
>
> - Wikidata has 40k organisations:
>
> https://query.wikidata.org/ with the query:
>
> SELECT ?item ?itemLabel WHERE {
>   ?item wdt:P31 wd:Q43229.
>   SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
> }
>
>
> Hi,
>
> I think Wikidata contains many more organizations than that. If we
> choose the "instance of Business enterprise", we get 135570 results. And
> I imagine there are many other categories that bring together commercial
> companies.
>
> https://query.wikidata.org/#SELECT%20%3Fitem%20%3FitemLabel%20WHERE%20%7B%0A%20%20%3Fitem%20wdt%3AP31%20wd%3AQ4830453.%0A%20%20SERVICE%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20wikibase%3Alanguage%20%22%5BAUTO_LANGUAGE%5D%2Cen%22.%20%7D%0A%7D
>
> On the substanc

Re: [Wikidata] Kickstarted: Adding 2.2 million German organisations to Wikidata

2017-10-16 Thread Sebastian Hellmann
ah, ok, sorry, I was assuming that Blazegraph would transitively resolve 
this automatically.


Ok, so let's divide the problem:

# Task 1:

Connect all existing organisations with the data from the 
handelsregister. (No new identifiers added, we can start right now)


Add a constraint that all German organisations should be connected to a 
court, i.e. the registering organisation as well as the id assigned by 
the court.


@all: any properties I can reuse for this?

I will focus on this as it seems quite easy. We can first filter orgs by 
other criteria, i.e. country as a blocking key and then string match the 
rest.


# Task 2:

Add all missing identifiers for the remaining orgs in the Handelsregister.
Task 2 can be rediscussed and decided once Task 1 is sufficiently finished.



# regarding maintenance:
I find Wikidata as such very hard to maintain as all data is copied from 
somewhere else eventually, but Wikipedia has the same problem. In the 
case of the German Business register, maintenance is especially easy as 
the orgs are stable and uniquely identifiable. Even the fact that a 
company gets shut down should still be in Wikidata, so you have 
historical information. I mean, you also keep the Roman Empire, the 
Hanse and even finished projects in Wikidata. So even if an org ceases 
to exist, the entry in Wikidata should stay.


# regarding Opencorporates
I have a critical opinion of Opencorporates. It appears to be open,
but you actually cannot get the data. If somebody has a data dump,
please forward it to me. Thanks.
On top of that, I consider Opencorporates a danger to open data. It
appears to push for open availability of data, but then it is limited
to open licenses. Its usefulness is limited as there are no free dumps
and no possibility to duplicate it effectively. Wikipedia and Wikidata
provide dumps and an API for exactly this reason. Every time somebody
wants to create an open organisation dataset with no barriers, the
existence of Opencorporates blocks it.


Cheers,
Sebastian


On 16.10.2017 15:34, Antonin Delpeuch (lists) wrote:

And… my own count was wrong too, because I forgot to add DISTINCT in my
query (if there are multiple paths from the class to "organization
(Q43229)", items will appear multiple times).

So, I get 1 168 084 now.
http://tinyurl.com/yaeqlsnl

It's easy to get these things wrong!

Antonin

On 16/10/2017 14:16, Antonin Delpeuch (lists) wrote:

Thanks Ettore for spotting that!

Wikidata types (P31) only make sense when you consider the "subclass of"
(P279) property that we use to build the ontology (except in a few cases
where the community has decided not to use any subclass for a particular
type).

So, to retrieve all items of a certain type in SPARQL, you need to use
something like this:

?item wdt:P31/wdt:P279* ?type

You can also have other variants to accept non-truthy statements.

Just with this truthy version, I currently get 1 208 227 items. But note
that there are still a lot of items where P31 is not provided, or
subclasses which have not been connected to "organization (Q43229)"…

So in general, it's very hard to have any "guarantees that there are no
duplicates", just because you don't have any guarantees that the
information currently in Wikidata is complete or correct.

I would recommend trying to import something a bit smaller to get
acquainted with how Wikidata works and what the matching process looks
like in practice. And beyond a one-off import, as Ettore said it is
important to think how the data will be maintained in the future…

Antonin

On 16/10/2017 13:46, Ettore RIZZA wrote:

 - Wikidata has 40k organisations:

 https://query.wikidata.org/ with the query:

 SELECT ?item ?itemLabel WHERE {
   ?item wdt:P31 wd:Q43229.
   SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
 }


Hi,

I think Wikidata contains many more organizations than that. If we
choose the "instance of Business enterprise", we get 135570 results. And
I imagine there are many other categories that bring together commercial
companies.


https://query.wikidata.org/#SELECT%20%3Fitem%20%3FitemLabel%20WHERE%20%7B%0A%20%20%3Fitem%20wdt%3AP31%20wd%3AQ4830453.%0A%20%20SERVICE%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20wikibase%3Alanguage%20%22%5BAUTO_LANGUAGE%5D%2Cen%22.%20%7D%0A%7D

On the substance, the project to add all companies of a country would
make Wikidata a kind of totally free clone of Open Corporates. I would
of course be delighted to see that, but is it not a challenge to
maintain such a database? Companies are like humans: they appear and
disappear every day.

  


2017-10-16 13:41 GMT+02:00 Sebastian Hellmann
mailto:hellm...@informatik.uni-leipzig.de>>:

 Hi all,

 the technical challenges are not so difficult.

 - 2.2 million are the exact number of German organisations, i.e.
 associations and companies. They are also unique

Re: [Wikidata] Kickstarted: Adding 2.2 million German organisations to Wikidata

2017-10-16 Thread Ettore RIZZA
While I'm on the subject, I would like to draw attention to the Neckar
project, which aims precisely to classify Wikidata entities into people,
places and organizations. Frequently updated JSON dumps are available.

2017-10-16 16:08 GMT+02:00 Ettore RIZZA :

> @Antonin: Thanks for this counting method, it seems very effective (I
> already knew that there were 3.6M humans (Q5) in Wikidata).
>
> https://query.wikidata.org/ with the query:
>
> # count the number of items belonging to the category
> # organisation or its children
> SELECT DISTINCT (COUNT(DISTINCT ?item) AS ?count)
> WHERE { ?item (wdt:P31/wdt:P279*) wd:Q5. }
>
> 2017-10-16 15:34 GMT+02:00 Antonin Delpeuch (lists) <
> li...@antonin.delpeuch.eu>:
>
>> And… my own count was wrong too, because I forgot to add DISTINCT in my
>> query (if there are multiple paths from the class to "organization
>> (Q43229)", items will appear multiple times).
>>
>> So, I get 1 168 084 now.
>> http://tinyurl.com/yaeqlsnl
>>
>> It's easy to get these things wrong!
>>
>> Antonin
>>
>> On 16/10/2017 14:16, Antonin Delpeuch (lists) wrote:
>> > Thanks Ettore for spotting that!
>> >
>> > Wikidata types (P31) only make sense when you consider the "subclass of"
>> > (P279) property that we use to build the ontology (except in a few cases
>> > where the community has decided not to use any subclass for a particular
>> > type).
>> >
>> > So, to retrieve all items of a certain type in SPARQL, you need to use
>> > something like this:
>> >
>> > ?item wdt:P31/wdt:P279* ?type
>> >
>> > You can also have other variants to accept non-truthy statements.
>> >
>> > Just with this truthy version, I currently get 1 208 227 items. But note
>> > that there are still a lot of items where P31 is not provided, or
>> > subclasses which have not been connected to "organization (Q43229)"…
>> >
>> > So in general, it's very hard to have any "guarantees that there are no
>> > duplicates", just because you don't have any guarantees that the
>> > information currently in Wikidata is complete or correct.
>> >
>> > I would recommend trying to import something a bit smaller to get
>> > acquainted with how Wikidata works and what the matching process looks
>> > like in practice. And beyond a one-off import, as Ettore said it is
>> > important to think how the data will be maintained in the future…
>> >
>> > Antonin
>> >
>> > On 16/10/2017 13:46, Ettore RIZZA wrote:
>> >> - Wikidata has 40k organisations:
>> >>
>> >> https://query.wikidata.org/ with the query:
>> >>
>> >> SELECT ?item ?itemLabel WHERE {
>> >>   ?item wdt:P31 wd:Q43229.
>> >>   SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
>> >> }
>> >>
>> >>
>> >> Hi,
>> >>
>> >> I think Wikidata contains many more organizations than that. If we
>> >> choose the "instance of Business enterprise", we get 135570 results.
>> And
>> >> I imagine there are many other categories that bring together
>> commercial
>> >> companies.
>> >>
>> >>
>> >> https://query.wikidata.org/#SELECT%20%3Fitem%20%3FitemLabel%
>> 20WHERE%20%7B%0A%20%20%3Fitem%20wdt%3AP31%20wd%3AQ4830453.%
>> 0A%20%20SERVICE%20wikibase%3Alabel%20%7B%20bd%3AservicePa
>> ram%20wikibase%3Alanguage%20%22%5BAUTO_LANGUAGE%5D%2Cen%22.%20%7D%0A%7D
>> >>
>> >> On the substance, the project to add all companies of a country would
>> >> make Wikidata a kind of totally free clone of Open Corporates
>> >> . I would of course be delighted to see
>> >> that, but is it not a challenge to maintain such a database? Companies
>> >> are like humans, it appears and disappears every day.
>> >>
>> >>
>> >>
>> >> 2017-10-16 13:41 GMT+02:00 Sebastian Hellmann
>> >> > >> >:
>> >>
>> >> Hi all,
>> >>
>> >> the technical challenges are not so difficult.
>> >>
>> >> - 2.2 million are the exact number of German organisations, i.e.
>> >> associations and companies. They are also unique.
>> >>
>> >> - Wikidata has 40k organisations:
>> >>
>> >> https://query.wikidata.org/#SELECT
>> >>  %3Fitem %3FitemLabel %0AWHERE
>> >> %0A{%0A %3Fitem wdt%3AP31 wd%3AQ43229.%0A SERVICE wikibase%3Alabel
>> {
>> >> bd%3AserviceParam wikibase%3Alanguage "[AUTO_LANGUAGE]%2Cen". }%0A}
>> >>
>> >> so there would be a maximum of 40k duplicates These are easy to
>> find
>> >> and deduplicate
>> >>
>> >> - The crawl can be done easily, a colleague has done so before.
>> >>
>> >>
>> >> The issues here are:
>> >>
>> >> - Do you want to upload the data in Wikidata? It would be a real
>> big
>> >> extension. Can I go ahead
>> >>
>> >> - If the data were available externally as structured data under
>> >> open license, I would probably no

Re: [Wikidata] Kickstartet: Adding 2.2 million German organisations to Wikidata

2017-10-16 Thread Ettore RIZZA
@Antonin: Thanks for this counting method, it seems very effective (I
already knew that there were 3.6M humans (Q5) in Wikidata).

https://query.wikidata.org/#%23compter%20le%20nombre%20d%27%C3%A9l%C3%A9ments%20appartenant%20%C3%A0%20la%20cat%C3%A9gorie%0A%23organisation%20ou%20%C3%A0%20ses%20enfants%0ASELECT%20DISTINCT%20%28COUNT%28DISTINCT%20%3Fitem%29%20AS%20%3Fcount%29%20WHERE%20%7B%20%3Fitem%20%28wdt%3AP31%2Fwdt%3AP279%2a%29%20wd%3AQ5.%20%7D
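
Share links like the one above percent-encode the whole query, which makes
them hard to read in plain-text mail. A small sketch (Python standard
library only) that recovers the readable SPARQL from such a link:

```python
from urllib.parse import unquote, urlsplit

def decode_wdqs_link(url: str) -> str:
    """Return the readable SPARQL query hidden in a query.wikidata.org share link."""
    # The query lives in the URL fragment, percent-encoded
    # (%20 = space, %3F = "?", %0A = newline, %2F = "/").
    fragment = urlsplit(url).fragment
    return unquote(fragment)

link = ("https://query.wikidata.org/#SELECT%20DISTINCT%20%28COUNT%28DISTINCT%20%3Fitem%29"
        "%20AS%20%3Fcount%29%20WHERE%20%7B%20%3Fitem%20%28wdt%3AP31%2Fwdt%3AP279%2a%29%20wd%3AQ5.%20%7D")
print(decode_wdqs_link(link))
# SELECT DISTINCT (COUNT(DISTINCT ?item) AS ?count) WHERE { ?item (wdt:P31/wdt:P279*) wd:Q5. }
```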

2017-10-16 15:34 GMT+02:00 Antonin Delpeuch (lists) <
li...@antonin.delpeuch.eu>:

> And… my own count was wrong too, because I forgot to add DISTINCT in my
> query (if there are multiple paths from the class to "organization
> (Q43229)", items will appear multiple times).
>
> So, I get 1 168 084 now.
> http://tinyurl.com/yaeqlsnl
>
> It's easy to get these things wrong!
>
> Antonin
>
> On 16/10/2017 14:16, Antonin Delpeuch (lists) wrote:
> > Thanks Ettore for spotting that!
> >
> > Wikidata types (P31) only make sense when you consider the "subclass of"
> > (P279) property that we use to build the ontology (except in a few cases
> > where the community has decided not to use any subclass for a particular
> > type).
> >
> > So, to retrieve all items of a certain type in SPARQL, you need to use
> > something like this:
> >
> > ?item wdt:P31/wdt:P279* ?type
> >
> > You can also have other variants to accept non-truthy statements.
> >
> > Just with this truthy version, I currently get 1 208 227 items. But note
> > that there are still a lot of items where P31 is not provided, or
> > subclasses which have not been connected to "organization (Q43229)"…
> >
> > So in general, it's very hard to have any "guarantees that there are no
> > duplicates", just because you don't have any guarantees that the
> > information currently in Wikidata is complete or correct.
> >
> > I would recommend trying to import something a bit smaller to get
> > acquainted with how Wikidata works and what the matching process looks
> > like in practice. And beyond a one-off import, as Ettore said it is
> > important to think how the data will be maintained in the future…
> >
> > Antonin
> >
> > On 16/10/2017 13:46, Ettore RIZZA wrote:
> >> - Wikidata has 40k organisations:
> >>
> >> https://query.wikidata.org/#SELECT
> >>  %3Fitem %3FitemLabel %0AWHERE
> >> %0A{%0A %3Fitem wdt%3AP31 wd%3AQ43229.%0A SERVICE wikibase%3Alabel {
> >> bd%3AserviceParam wikibase%3Alanguage "[AUTO_LANGUAGE]%2Cen". }%0A}
> >>
> >>
> >> Hi,
> >>
> >> I think Wikidata contains many more organizations than that. If we
> >> choose the "instance of Business enterprise", we get 135570 results. And
> >> I imagine there are many other categories that bring together commercial
> >> companies.
> >>
> >>
> >> https://query.wikidata.org/#SELECT%20%3Fitem%20%
> 3FitemLabel%20WHERE%20%7B%0A%20%20%3Fitem%20wdt%3AP31%20wd%
> 3AQ4830453.%0A%20%20SERVICE%20wikibase%3Alabel%20%7B%20bd%
> 3AserviceParam%20wikibase%3Alanguage%20%22%5BAUTO_
> LANGUAGE%5D%2Cen%22.%20%7D%0A%7D
> >>
> >> On the substance, the project to add all companies of a country would
> >> make Wikidata a kind of totally free clone of Open Corporates
> >> . I would of course be delighted to see
> >> that, but is it not a challenge to maintain such a database? Companies
> >> are like humans, it appears and disappears every day.
> >>
> >>
> >>
> >> 2017-10-16 13:41 GMT+02:00 Sebastian Hellmann
> >>  >> >:
> >>
> >> Hi all,
> >>
> >> the technical challenges are not so difficult.
> >>
> >> - 2.2 million are the exact number of German organisations, i.e.
> >> associations and companies. They are also unique.
> >>
> >> - Wikidata has 40k organisations:
> >>
> >> https://query.wikidata.org/#SELECT
> >>  %3Fitem %3FitemLabel %0AWHERE
> >> %0A{%0A %3Fitem wdt%3AP31 wd%3AQ43229.%0A SERVICE wikibase%3Alabel {
> >> bd%3AserviceParam wikibase%3Alanguage "[AUTO_LANGUAGE]%2Cen". }%0A}
> >>
> >> so there would be a maximum of 40k duplicates These are easy to find
> >> and deduplicate
> >>
> >> - The crawl can be done easily, a colleague has done so before.
> >>
> >>
> >> The issues here are:
> >>
> >> - Do you want to upload the data in Wikidata? It would be a real big
> >> extension. Can I go ahead
> >>
> >> - If the data were available externally as structured data under
> >> open license, I would probably not suggest loading it into wikidata,
> >> as the data can be retrieved from the official source directly,
> >> however, here this data will not be published in a decent format.
> >>
> >> I thought that the way data is copied from coyrighted sources, i.e.
> >> only facts is ok for wikidata. This done in a lot of places, I
> >> guess. Same for Wikipedia, i.e. News articles and copyrighted books
> >> are referenced. So Wikimedia

Re: [Wikidata] Kickstartet: Adding 2.2 million German organisations to Wikidata

2017-10-16 Thread Antonin Delpeuch (lists)
And… my own count was wrong too, because I forgot to add DISTINCT in my
query (if there are multiple paths from the class to "organization
(Q43229)", items will appear multiple times).

So, I get 1 168 084 now.
http://tinyurl.com/yaeqlsnl

It's easy to get these things wrong!

Antonin

On 16/10/2017 14:16, Antonin Delpeuch (lists) wrote:
> Thanks Ettore for spotting that!
> 
> Wikidata types (P31) only make sense when you consider the "subclass of"
> (P279) property that we use to build the ontology (except in a few cases
> where the community has decided not to use any subclass for a particular
> type).
> 
> So, to retrieve all items of a certain type in SPARQL, you need to use
> something like this:
> 
> ?item wdt:P31/wdt:P279* ?type
> 
> You can also have other variants to accept non-truthy statements.
> 
> Just with this truthy version, I currently get 1 208 227 items. But note
> that there are still a lot of items where P31 is not provided, or
> subclasses which have not been connected to "organization (Q43229)"…
> 
> So in general, it's very hard to have any "guarantees that there are no
> duplicates", just because you don't have any guarantees that the
> information currently in Wikidata is complete or correct.
> 
> I would recommend trying to import something a bit smaller to get
> acquainted with how Wikidata works and what the matching process looks
> like in practice. And beyond a one-off import, as Ettore said it is
> important to think how the data will be maintained in the future…
> 
> Antonin
> 
> On 16/10/2017 13:46, Ettore RIZZA wrote:
>> - Wikidata has 40k organisations: 
>>
>> https://query.wikidata.org/#SELECT
>>  %3Fitem %3FitemLabel %0AWHERE
>> %0A{%0A %3Fitem wdt%3AP31 wd%3AQ43229.%0A SERVICE wikibase%3Alabel {
>> bd%3AserviceParam wikibase%3Alanguage "[AUTO_LANGUAGE]%2Cen". }%0A}
>>
>>
>> Hi, 
>>
>> I think Wikidata contains many more organizations than that. If we
>> choose the "instance of Business enterprise", we get 135570 results. And
>> I imagine there are many other categories that bring together commercial
>> companies.
>>
>>
>> https://query.wikidata.org/#SELECT%20%3Fitem%20%3FitemLabel%20WHERE%20%7B%0A%20%20%3Fitem%20wdt%3AP31%20wd%3AQ4830453.%0A%20%20SERVICE%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20wikibase%3Alanguage%20%22%5BAUTO_LANGUAGE%5D%2Cen%22.%20%7D%0A%7D
>>
>> On the substance, the project to add all companies of a country would
>> make Wikidata a kind of totally free clone of Open Corporates
>> . I would of course be delighted to see
>> that, but is it not a challenge to maintain such a database? Companies
>> are like humans, it appears and disappears every day.
>>
>>  
>>
>> 2017-10-16 13:41 GMT+02:00 Sebastian Hellmann
>> > >:
>>
>> Hi all,
>>
>> the technical challenges are not so difficult.
>>
>> - 2.2 million are the exact number of German organisations, i.e.
>> associations and companies. They are also unique.
>>
>> - Wikidata has 40k organisations:
>>
>> https://query.wikidata.org/#SELECT
>>  %3Fitem %3FitemLabel %0AWHERE
>> %0A{%0A %3Fitem wdt%3AP31 wd%3AQ43229.%0A SERVICE wikibase%3Alabel {
>> bd%3AserviceParam wikibase%3Alanguage "[AUTO_LANGUAGE]%2Cen". }%0A}
>>
>> so there would be a maximum of 40k duplicates These are easy to find
>> and deduplicate
>>
>> - The crawl can be done easily, a colleague has done so before.  
>>
>>
>> The issues here are:
>>
>> - Do you want to upload the data in Wikidata? It would be a real big
>> extension. Can I go ahead
>>
>> - If the data were available externally as structured data under
>> open license, I would probably not suggest loading it into wikidata,
>> as the data can be retrieved from the official source directly,
>> however, here this data will not be published in a decent format.
>>
>> I thought that the way data is copied from coyrighted sources, i.e.
>> only facts is ok for wikidata. This done in a lot of places, I
>> guess. Same for Wikipedia, i.e. News articles and copyrighted books
>> are referenced. So Wikimedia or the Wikimedia community are experts
>> on this.
>>
>> All the best,
>>
>> Sebastian
>>
>>
>> On 16.10.2017 10:18, Neubert, Joachim wrote:
>>>
>>> Hi Sebastian,
>>>
>>> __ __
>>>
>>> This is huge! It will cover almost all currently existing German
>>> companies. Many of these will have similar names, so preparing for
>>> disambiguation is a concern.
>>>
>>> __ __
>>>
>>> A good way for such an approach would be proposing a property for
>>> an external identifier, loading the data into Mix-n-match,
>>> creating links for companies already in Wikidata, and adding the
>>> rest (or perhaps only parts of them - I’m not sure if having all
>>> of them in Wikidata makes sen

Re: [Wikidata] Kickstartet: Adding 2.2 million German organisations to Wikidata

2017-10-16 Thread Antonin Delpeuch (lists)
Thanks Ettore for spotting that!

Wikidata types (P31) only make sense when you consider the "subclass of"
(P279) property that we use to build the ontology (except in a few cases
where the community has decided not to use any subclass for a particular
type).

So, to retrieve all items of a certain type in SPARQL, you need to use
something like this:

?item wdt:P31/wdt:P279* ?type

You can also have other variants to accept non-truthy statements.
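
Putting the two corrections from this thread together (the P31/P279* path
plus the DISTINCT that the follow-up adds), the full count query looks like
the string below; the snippet only builds the query text and a share link
in the style of the ones pasted in this thread, without hitting the
endpoint:

```python
from urllib.parse import quote

# Count query combining both corrections discussed in this thread:
# - wdt:P31/wdt:P279* follows "instance of" through the "subclass of" hierarchy
# - DISTINCT avoids counting an item once per path to organization (Q43229)
query = """SELECT (COUNT(DISTINCT ?item) AS ?count) WHERE {
  ?item wdt:P31/wdt:P279* wd:Q43229 .
}"""

# Share link in the same style as the ones quoted above
share_link = "https://query.wikidata.org/#" + quote(query)
print(share_link)
```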

Just with this truthy version, I currently get 1 208 227 items. But note
that there are still a lot of items where P31 is not provided, or
subclasses which have not been connected to "organization (Q43229)"…

So in general, it's very hard to have any "guarantees that there are no
duplicates", just because you don't have any guarantees that the
information currently in Wikidata is complete or correct.

I would recommend trying to import something a bit smaller to get
acquainted with how Wikidata works and what the matching process looks
like in practice. And beyond a one-off import, as Ettore said it is
important to think how the data will be maintained in the future…

Antonin

On 16/10/2017 13:46, Ettore RIZZA wrote:
> - Wikidata has 40k organisations: 
> 
> https://query.wikidata.org/#SELECT
>  %3Fitem %3FitemLabel %0AWHERE
> %0A{%0A %3Fitem wdt%3AP31 wd%3AQ43229.%0A SERVICE wikibase%3Alabel {
> bd%3AserviceParam wikibase%3Alanguage "[AUTO_LANGUAGE]%2Cen". }%0A}
> 
> 
> Hi, 
> 
> I think Wikidata contains many more organizations than that. If we
> choose the "instance of Business enterprise", we get 135570 results. And
> I imagine there are many other categories that bring together commercial
> companies.
> 
> 
> https://query.wikidata.org/#SELECT%20%3Fitem%20%3FitemLabel%20WHERE%20%7B%0A%20%20%3Fitem%20wdt%3AP31%20wd%3AQ4830453.%0A%20%20SERVICE%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20wikibase%3Alanguage%20%22%5BAUTO_LANGUAGE%5D%2Cen%22.%20%7D%0A%7D
> 
> On the substance, the project to add all companies of a country would
> make Wikidata a kind of totally free clone of Open Corporates
> . I would of course be delighted to see
> that, but is it not a challenge to maintain such a database? Companies
> are like humans, it appears and disappears every day.
> 
>  
> 
> 2017-10-16 13:41 GMT+02:00 Sebastian Hellmann
>  >:
> 
> Hi all,
> 
> the technical challenges are not so difficult.
> 
> - 2.2 million are the exact number of German organisations, i.e.
> associations and companies. They are also unique.
> 
> - Wikidata has 40k organisations:
> 
> https://query.wikidata.org/#SELECT
>  %3Fitem %3FitemLabel %0AWHERE
> %0A{%0A %3Fitem wdt%3AP31 wd%3AQ43229.%0A SERVICE wikibase%3Alabel {
> bd%3AserviceParam wikibase%3Alanguage "[AUTO_LANGUAGE]%2Cen". }%0A}
> 
> so there would be a maximum of 40k duplicates These are easy to find
> and deduplicate
> 
> - The crawl can be done easily, a colleague has done so before.  
> 
> 
> The issues here are:
> 
> - Do you want to upload the data in Wikidata? It would be a real big
> extension. Can I go ahead
> 
> - If the data were available externally as structured data under
> open license, I would probably not suggest loading it into wikidata,
> as the data can be retrieved from the official source directly,
> however, here this data will not be published in a decent format.
> 
> I thought that the way data is copied from coyrighted sources, i.e.
> only facts is ok for wikidata. This done in a lot of places, I
> guess. Same for Wikipedia, i.e. News articles and copyrighted books
> are referenced. So Wikimedia or the Wikimedia community are experts
> on this.
> 
> All the best,
> 
> Sebastian
> 
> 
> On 16.10.2017 10:18, Neubert, Joachim wrote:
>>
>> Hi Sebastian,
>>
>> __ __
>>
>> This is huge! It will cover almost all currently existing German
>> companies. Many of these will have similar names, so preparing for
>> disambiguation is a concern.
>>
>> __ __
>>
>> A good way for such an approach would be proposing a property for
>> an external identifier, loading the data into Mix-n-match,
>> creating links for companies already in Wikidata, and adding the
>> rest (or perhaps only parts of them - I’m not sure if having all
>> of them in Wikidata makes sense, but that’s another discussion),
>> preferably with location and/or sector of trade in the description
>> field.
>>
>> __ __
>>
>> I’ve tried to figure out what could be used as key for a external
>> identifier property. However, it looks like the registry does not
>> offer any (persistent) URL to its entries. So for looking up a
>> company, apparently there are two options:
>>
>> __ __
>>
>> -  conductin

Re: [Wikidata] Kickstartet: Adding 2.2 million German organisations to Wikidata

2017-10-16 Thread Ettore RIZZA
>
> - Wikidata has 40k organisations:

https://query.wikidata.org/#SELECT %3Fitem %3FitemLabel %0AWHERE %0A{%0A
> %3Fitem wdt%3AP31 wd%3AQ43229.%0A SERVICE wikibase%3Alabel {
> bd%3AserviceParam wikibase%3Alanguage "[AUTO_LANGUAGE]%2Cen". }%0A}


Hi,

I think Wikidata contains many more organizations than that. If we choose
"instance of business enterprise (Q4830453)" instead, we get 135,570
results. And I imagine there are many other categories that bring together
commercial companies.


https://query.wikidata.org/#SELECT%20%3Fitem%20%3FitemLabel%20WHERE%20%7B%0A%20%20%3Fitem%20wdt%3AP31%20wd%3AQ4830453.%0A%20%20SERVICE%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20wikibase%3Alanguage%20%22%5BAUTO_LANGUAGE%5D%2Cen%22.%20%7D%0A%7D

On the substance, the project to add all companies of a country would make
Wikidata a kind of totally free clone of Open Corporates
. I would of course be delighted to see that,
but is it not a challenge to maintain such a database? Companies are like
humans: they appear and disappear every day.



2017-10-16 13:41 GMT+02:00 Sebastian Hellmann <
hellm...@informatik.uni-leipzig.de>:

> Hi all,
>
> the technical challenges are not so difficult.
>
> - 2.2 million are the exact number of German organisations, i.e.
> associations and companies. They are also unique.
>
> - Wikidata has 40k organisations:
>
> https://query.wikidata.org/#SELECT %3Fitem %3FitemLabel %0AWHERE %0A{%0A
> %3Fitem wdt%3AP31 wd%3AQ43229.%0A SERVICE wikibase%3Alabel {
> bd%3AserviceParam wikibase%3Alanguage "[AUTO_LANGUAGE]%2Cen". }%0A}
>
> so there would be a maximum of 40k duplicates These are easy to find and
> deduplicate
>
> - The crawl can be done easily, a colleague has done so before.
>
>
> The issues here are:
>
> - Do you want to upload the data in Wikidata? It would be a real big
> extension. Can I go ahead
>
> - If the data were available externally as structured data under open
> license, I would probably not suggest loading it into wikidata, as the data
> can be retrieved from the official source directly, however, here this data
> will not be published in a decent format.
>
> I thought that the way data is copied from coyrighted sources, i.e. only
> facts is ok for wikidata. This done in a lot of places, I guess. Same for
> Wikipedia, i.e. News articles and copyrighted books are referenced. So
> Wikimedia or the Wikimedia community are experts on this.
>
> All the best,
>
> Sebastian
>
> On 16.10.2017 10:18, Neubert, Joachim wrote:
>
> Hi Sebastian,
>
>
>
> This is huge! It will cover almost all currently existing German
> companies. Many of these will have similar names, so preparing for
> disambiguation is a concern.
>
>
>
> A good way for such an approach would be proposing a property for an
> external identifier, loading the data into Mix-n-match, creating links for
> companies already in Wikidata, and adding the rest (or perhaps only parts
> of them - I’m not sure if having all of them in Wikidata makes sense, but
> that’s another discussion), preferably with location and/or sector of trade
> in the description field.
>
>
>
> I’ve tried to figure out what could be used as key for a external
> identifier property. However, it looks like the registry does not offer any
> (persistent) URL to its entries. So for looking up a company, apparently
> there are two options:
>
>
>
> -  conducting an extended search for the exact string “A&A
> Dienstleistungsgesellschaft mbH“
>
> -  copying the register number “32853” plus selecting the court
> (Leipzig) from the according dropdown list and search that
>
>
>
> Both ways are not very intuitive, even if we can provide a link to the
> search form. This would make a weak connection to the source of
> information. Much more important, it makes disambiguation in Mix-n-match
> difficult. This applies for the preparation of your initial load (you would
> not want to create duplicates). But much more so for everybody else who
> wants to match his or her data later on. Being forced to search for entries
> manually in a cumbersome way for disambiguation of a new, possibly large
> and rich dataset is, in my eyes, not something we want to impose on future
> contributors. And often, the free information they find in the registry
> (formal name, register number, legal form, address) will not easily match
> with the information they have (common name, location, perhaps founding
> date, and most important sector of trade), so disambiguation may still be
> difficult.
>
>
>
> Have you checked which parts of the accessible information as below can be
> crawled and added legally to external databases such as Wikidata?
>
>
>
> Cheers, Joachim
>
>
>
> --
>
> Joachim Neubert
>
>
>
> ZBW – German National Library of Economics
>
> Leibniz Information Centre for Economics
>
> Neuer Jungfernstieg 21
> 20354 Hamburg
>
> Phone +49-42834-462
>
>
>
>
>
>
>
> *Von:* Wikidata [mailto:wikidata-boun...@lists.wikimedia.org
> ] *Im Auftrag von *Sebastian
> Hel

Re: [Wikidata] Kickstartet: Adding 2.2 million German organisations to Wikidata

2017-10-16 Thread hellmann
The best way, then, to avoid creating duplicates is to look at all existing 
organizations in Wikidata, add the court and register number manually if they 
are German, and then exclude these from the import.

This guarantees that there will be no duplicates.

So the technical side is feasible.
Barriers are political and legal.

Sebastian 
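
A minimal sketch of that exclusion step (every name and record below is
made up for illustration; a real run would pull the keys from Wikidata and
the crawl): build the set of court/register-number pairs already present in
Wikidata, then drop matching rows from the import batch.

```python
# Hypothetical registration keys already recorded in Wikidata.
existing_in_wikidata = {
    ("Leipzig", "HRB", "32853"),   # (court, register type, register number)
}

# Hypothetical crawled records from the Handelsregister.
crawled = [
    {"name": "A&A Dienstleistungsgesellschaft mbH",
     "court": "Leipzig", "type": "HRB", "number": "32853"},
    {"name": "Example Verein e.V.",
     "court": "Hamburg", "type": "VR", "number": "12345"},
]

def key(record: dict) -> tuple:
    return (record["court"], record["type"], record["number"])

# Keep only organisations whose registration key is not yet in Wikidata.
to_import = [r for r in crawled if key(r) not in existing_in_wikidata]
print([r["name"] for r in to_import])
# ['Example Verein e.V.']
```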

Am 16. Oktober 2017 14:24:51 MESZ schrieb Sebastian Hellmann 
:
>Ah yes, forgot to mention:
>
>there is no URI or unique identifier given by the Handelsregister 
>system. However, the courts take care that the registrations are
>unique, 
>so it is implicit. Handelsregister could easily create stable URIs out 
>of the court+type+number like /Leipzig_HRB_32853
>
>For Wikidata this is not a problem to handle. So no technical issues 
>from this side either.
>
>All the best,
>
>Sebastian
>
>
>On 16.10.2017 13:41, Sebastian Hellmann wrote:
>>
>> Hi all,
>>
>> the technical challenges are not so difficult.
>>
>> - 2.2 million are the exact number of German organisations, i.e. 
>> associations and companies. They are also unique.
>>
>> - Wikidata has 40k organisations:
>>
>> https://query.wikidata.org/#SELECT %3Fitem %3FitemLabel %0AWHERE 
>> %0A{%0A %3Fitem wdt%3AP31 wd%3AQ43229.%0A SERVICE wikibase%3Alabel { 
>> bd%3AserviceParam wikibase%3Alanguage "[AUTO_LANGUAGE]%2Cen". }%0A}
>>
>> so there would be a maximum of 40k duplicates These are easy to find 
>> and deduplicate
>>
>> - The crawl can be done easily, a colleague has done so before.
>>
>>
>> The issues here are:
>>
>> - Do you want to upload the data in Wikidata? It would be a real big 
>> extension. Can I go ahead
>>
>> - If the data were available externally as structured data under open
>
>> license, I would probably not suggest loading it into wikidata, as
>the 
>> data can be retrieved from the official source directly, however,
>here 
>> this data will not be published in a decent format.
>>
>> I thought that the way data is copied from coyrighted sources, i.e. 
>> only facts is ok for wikidata. This done in a lot of places, I guess.
>
>> Same for Wikipedia, i.e. News articles and copyrighted books are 
>> referenced. So Wikimedia or the Wikimedia community are experts on
>this.
>>
>> All the best,
>>
>> Sebastian
>>
>>
>> On 16.10.2017 10:18, Neubert, Joachim wrote:
>>>
>>> Hi Sebastian,
>>>
>>> This is huge! It will cover almost all currently existing German 
>>> companies. Many of these will have similar names, so preparing for 
>>> disambiguation is a concern.
>>>
>>> A good way for such an approach would be proposing a property for an
>
>>> external identifier, loading the data into Mix-n-match, creating 
>>> links for companies already in Wikidata, and adding the rest (or 
>>> perhaps only parts of them - I’m not sure if having all of them in 
>>> Wikidata makes sense, but that’s another discussion), preferably
>with 
>>> location and/or sector of trade in the description field.
>>>
>>> I’ve tried to figure out what could be used as key for a external 
>>> identifier property. However, it looks like the registry does not 
>>> offer any (persistent) URL to its entries. So for looking up a 
>>> company, apparently there are two options:
>>>
>>> -conducting an extended search for the exact string “A&A 
>>> Dienstleistungsgesellschaft mbH“
>>>
>>> -copying the register number “32853” plus selecting the court 
>>> (Leipzig) from the according dropdown list and search that
>>>
>>> Both ways are not very intuitive, even if we can provide a link to 
>>> the search form. This would make a weak connection to the source of 
>>> information. Much more important, it makes disambiguation in 
>>> Mix-n-match difficult. This applies for the preparation of your 
>>> initial load (you would not want to create duplicates). But much
>more 
>>> so for everybody else who wants to match his or her data later on. 
>>> Being forced to search for entries manually in a cumbersome way for 
>>> disambiguation of a new, possibly large and rich dataset is, in my 
>>> eyes, not something we want to impose on future contributors. And 
>>> often, the free information they find in the registry (formal name, 
>>> register number, legal form, address) will not easily match with the
>
>>> information they have (common name, location, perhaps founding date,
>
>>> and most important sector of trade), so disambiguation may still be 
>>> difficult.
>>>
>>> Have you checked which parts of the accessible information as below 
>>> can be crawled and added legally to external databases such as
>Wikidata?
>>>
>>> Cheers, Joachim
>>>
>>> --
>>>
>>> Joachim Neubert
>>>
>>> ZBW – German National Library of Economics
>>>
>>> Leibniz Information Centre for Economics
>>>
>>> Neuer Jungfernstieg 21
>>> 20354 Hamburg
>>>
>>> Phone +49-42834-462
>>>
>>> *Von:*Wikidata [mailto:wikidata-boun...@lists.wikimedia.org] *Im 
>>> Auftrag von *Sebastian Hellmann
>>> *Gesendet:* Sonntag, 15. Oktober 2017 09:45
>>> *An:* wikidata@lists.wikimedia.org
>

[Wikidata] Google Code-in: Get your tasks for young contributors prepared!

2017-10-16 Thread Andre Klapper
Google Code-in is an annual contest for 13-17 year old students. It
will take place from Nov28 to Jan17 and is not only about coding tasks.

While we wait to hear whether Wikimedia will get accepted:
* You have small, self-contained bugs you'd like to see fixed?
* Your documentation needs specific improvements?
* Your user interface has small design issues?
* Your Outreachy/Summer of Code project welcomes small tweaks?
* You'd enjoy helping someone port your template to Lua?
* Your gadget code uses some deprecated API calls?
* You have tasks in mind that welcome some research?

Also note that "Beginner tasks" (e.g. "Set up Vagrant" etc) and
"generic" tasks are very welcome (e.g. "Choose & fix 2 PHP7 issues
from the list in https://phabricator.wikimedia.org/T120336 ").
Because we will need hundreds of tasks. :)

And we also have more than 400 unassigned open 'easy' tasks listed:
https://phabricator.wikimedia.org/maniphest/query/HCyOonSbFn.z/#R
Would you be willing to mentor some of those in your area?

Please take a moment to find / update [Phabricator etc.] tasks in your
project(s) which would take an experienced contributor 2-3 hours. Check

   https://www.mediawiki.org/wiki/Google_Code-in/Mentors

and please ask if you have any questions!

For some achievements from last round, see
https://blog.wikimedia.org/2017/02/03/google-code-in/

Thanks!,
andre
-- 
Andre Klapper | Wikimedia Bugwrangler
http://blogs.gnome.org/aklapper/

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Kickstartet: Adding 2.2 million German organisations to Wikidata

2017-10-16 Thread Sebastian Hellmann

Ah yes, forgot to mention:

there is no URI or unique identifier given by the Handelsregister 
system. However, the courts take care that the registrations are unique, 
so it is implicit. Handelsregister could easily create stable URIs out 
of the court+type+number like /Leipzig_HRB_32853


For Wikidata this is not a problem to handle. So no technical issues 
from this side either.
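
Such an implicit identifier could be assembled like this (a sketch only;
the underscore format simply mirrors the /Leipzig_HRB_32853 example above,
and the hyphen handling for multi-word court names is my own assumption):

```python
def register_id(court: str, reg_type: str, number) -> str:
    """Build a stable external-identifier string from a German register entry."""
    # Mirrors the /Leipzig_HRB_32853 shape suggested above; spaces in court
    # names (e.g. "Frankfurt am Main") are collapsed to hyphens so the key
    # stays a single token.
    return "_".join([court.replace(" ", "-"), reg_type.upper(), str(number)])

print(register_id("Leipzig", "HRB", "32853"))  # Leipzig_HRB_32853
```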


All the best,

Sebastian


On 16.10.2017 13:41, Sebastian Hellmann wrote:


Hi all,

the technical challenges are not so difficult.

- 2.2 million is the exact number of German organisations, i.e. 
associations and companies. They are also unique.


- Wikidata has 40k organisations:

https://query.wikidata.org/#SELECT %3Fitem %3FitemLabel %0AWHERE 
%0A{%0A %3Fitem wdt%3AP31 wd%3AQ43229.%0A SERVICE wikibase%3Alabel { 
bd%3AserviceParam wikibase%3Alanguage "[AUTO_LANGUAGE]%2Cen". }%0A}


so there would be a maximum of 40k duplicates. These are easy to find 
and deduplicate.


- The crawl can be done easily, a colleague has done so before.


The issues here are:

- Do you want to upload the data in Wikidata? It would be a real big 
extension. Can I go ahead?


- If the data were available externally as structured data under an open 
license, I would probably not suggest loading it into Wikidata, as the 
data can be retrieved from the official source directly; however, here 
this data will not be published in a decent format.


I thought that the way data is copied from copyrighted sources, i.e. 
only facts, is OK for Wikidata. This is done in a lot of places, I guess. 
Same for Wikipedia, i.e. news articles and copyrighted books are 
referenced. So Wikimedia or the Wikimedia community are experts on this.


All the best,

Sebastian


On 16.10.2017 10:18, Neubert, Joachim wrote:


Hi Sebastian,

This is huge! It will cover almost all currently existing German 
companies. Many of these will have similar names, so preparing for 
disambiguation is a concern.


A good way for such an approach would be proposing a property for an 
external identifier, loading the data into Mix-n-match, creating 
links for companies already in Wikidata, and adding the rest (or 
perhaps only parts of them - I’m not sure if having all of them in 
Wikidata makes sense, but that’s another discussion), preferably with 
location and/or sector of trade in the description field.


I’ve tried to figure out what could be used as a key for an external 
identifier property. However, it looks like the registry does not 
offer any (persistent) URL to its entries. So for looking up a 
company, apparently there are two options:


- conducting an extended search for the exact string “A&A 
Dienstleistungsgesellschaft mbH“


- copying the register number “32853” plus selecting the court 
(Leipzig) from the corresponding dropdown list, and searching for that


Both ways are not very intuitive, even if we can provide a link to 
the search form. This would make a weak connection to the source of 
information. Much more important, it makes disambiguation in 
Mix-n-match difficult. This applies for the preparation of your 
initial load (you would not want to create duplicates). But much more 
so for everybody else who wants to match his or her data later on. 
Being forced to search for entries manually in a cumbersome way for 
disambiguation of a new, possibly large and rich dataset is, in my 
eyes, not something we want to impose on future contributors. And 
often, the free information they find in the registry (formal name, 
register number, legal form, address) will not easily match with the 
information they have (common name, location, perhaps founding date, 
and most important sector of trade), so disambiguation may still be 
difficult.


Have you checked which parts of the accessible information as below 
can be crawled and added legally to external databases such as Wikidata?


Cheers, Joachim

--

Joachim Neubert

ZBW – German National Library of Economics

Leibniz Information Centre for Economics

Neuer Jungfernstieg 21
20354 Hamburg

Phone +49-42834-462

*Von:*Wikidata [mailto:wikidata-boun...@lists.wikimedia.org] *Im 
Auftrag von *Sebastian Hellmann

*Gesendet:* Sonntag, 15. Oktober 2017 09:45
*An:* wikidata@lists.wikimedia.org 
*Betreff:* [Wikidata] Kickstartet: Adding 2.2 million German 
organisations to Wikidata


Hi all,

the German business registry contains roughly 2.2 million 
organisations. Some information is paid, but some is public, i.e. the 
info you find when searching and clicking on UT (see example below):


https://www.handelsregister.de/rp_web/mask.do?Typ=e

I would like to add this to Wikidata, either by crawling or by 
raising money to use crowdsourcing platforms like CrowdFlower or 
Amazon Mechanical Turk.


It should meet notability criteria 2: 
https://www.wikidata.org/wiki/Wikidata:Notability


2. It refers to an instance of a *clearly identifiable conceptual
or material entity*. The entity must be notable, in the sense
that it *c

Re: [Wikidata] Kickstartet: Adding 2.2 million German organisations to Wikidata

2017-10-16 Thread Sebastian Hellmann

Hi all,

the technical challenges are not so difficult.

- 2.2 million is the exact number of German organisations, i.e. 
associations and companies, and the entries are unique.


- Wikidata has 40k organisations:

https://query.wikidata.org/ — with the query:

    SELECT ?item ?itemLabel
    WHERE {
      ?item wdt:P31 wd:Q43229 .
      SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
    }


so there would be a maximum of 40k duplicates. These are easy to find 
and deduplicate.
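The deduplication step could be sketched like this (my own sketch, not from the thread; the function name, result shape, and exact-label matching approach are assumptions, and a real run should also compare locations and page through results):

```python
import json
import urllib.parse
import urllib.request

# All items that are an instance of (P31) organization (Q43229),
# with their English labels -- the same query class discussed above.
QUERY = """
SELECT ?item ?itemLabel WHERE {
  ?item wdt:P31 wd:Q43229 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
"""

def fetch_existing_orgs(endpoint: str = "https://query.wikidata.org/sparql") -> dict:
    """Return a {lowercased label: QID} map for a first-pass exact-name match."""
    url = endpoint + "?" + urllib.parse.urlencode({"format": "json", "query": QUERY})
    req = urllib.request.Request(url, headers={"User-Agent": "org-dedup-sketch/0.1"})
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    return {
        b["itemLabel"]["value"].lower(): b["item"]["value"].rsplit("/", 1)[-1]
        for b in data["results"]["bindings"]
    }
```

A bot run would check each registry name against this map before creating a new item, and only flag exact matches for manual review.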


- The crawl can be done easily, a colleague has done so before.


The issues here are:

- Do you want the data uploaded to Wikidata? It would be a really big 
extension. Can I go ahead?


- If the data were available externally as structured data under an 
open license, I would probably not suggest loading it into Wikidata, 
as the data could be retrieved from the official source directly. 
Here, however, the data will not be published in a decent format.


I thought that copying only facts from copyrighted sources is OK for 
Wikidata. This is done in a lot of places, I guess. The same goes for 
Wikipedia, where news articles and copyrighted books are referenced. 
So Wikimedia, or the Wikimedia community, are the experts on this.


All the best,

Sebastian


On 16.10.2017 10:18, Neubert, Joachim wrote:


Hi Sebastian,

This is huge! It will cover almost all currently existing German 
companies. Many of these will have similar names, so preparing for 
disambiguation is a concern.


A good way for such an approach would be proposing a property for an 
external identifier, loading the data into Mix-n-match, creating links 
for companies already in Wikidata, and adding the rest (or perhaps 
only parts of them - I’m not sure if having all of them in Wikidata 
makes sense, but that’s another discussion), preferably with location 
and/or sector of trade in the description field.


I’ve tried to figure out what could be used as a key for an external
identifier property. However, it looks like the registry does not 
offer any (persistent) URL to its entries. So for looking up a 
company, apparently there are two options:


-conducting an extended search for the exact string “A&A 
Dienstleistungsgesellschaft mbH“


-copying the register number “32853”, selecting the court (Leipzig) 
from the corresponding dropdown list, and searching for that


Neither way is very intuitive, even if we can provide a link to the 
search form. This would make for a weak connection to the source of 
information. More importantly, it makes disambiguation in Mix-n-match 
difficult. This applies to the preparation of your initial load (you 
would not want to create duplicates), but much more so to everybody 
else who wants to match his or her data later on. Being forced to 
search for entries manually, in a cumbersome way, to disambiguate 
against a new, possibly large and rich dataset is, in my eyes, not 
something we want to impose on future contributors. And often, the 
free information they find in the registry (formal name, register 
number, legal form, address) will not easily match the information 
they have (common name, location, perhaps founding date, and, most 
importantly, sector of trade), so disambiguation may still be 
difficult.


Have you checked which parts of the accessible information shown below 
can legally be crawled and added to external databases such as Wikidata?


Cheers, Joachim

--

Joachim Neubert

ZBW – German National Library of Economics

Leibniz Information Centre for Economics

Neuer Jungfernstieg 21
20354 Hamburg

Phone +49-42834-462

*Von:*Wikidata [mailto:wikidata-boun...@lists.wikimedia.org] *Im 
Auftrag von *Sebastian Hellmann

*Gesendet:* Sonntag, 15. Oktober 2017 09:45
*An:* wikidata@lists.wikimedia.org 
*Betreff:* [Wikidata] Kickstartet: Adding 2.2 million German 
organisations to Wikidata


Hi all,

the German business registry contains roughly 2.2 million 
organisations. Some information is paid, but some is public, i.e. the 
info you find when searching and clicking on UT (see example below):


https://www.handelsregister.de/rp_web/mask.do?Typ=e

I would like to add this to Wikidata, either by crawling or by raising 
money to use crowdsourcing platforms like CrowdFlower or Amazon Mechanical Turk.


It should meet notability criteria 2: 
https://www.wikidata.org/wiki/Wikidata:Notability


2. It refers to an instance of a *clearly identifiable conceptual
or material entity*. The entity must be notable, in the sense that
it *can be described using serious and publicly available
references*. If there is no item about you yet, you are probably
not notable.


The reference is the official German business registry, which is 
serious and public. Orgs are also by definition clearly identifiable 
legal entities.


How can I get clearance to proceed on this?

All the best,
Sebastian


  Entity data

Saxony District court *Leipzig HRB 32853 * – A&A 
Dienstleistungsgesellschaft 

Re: [Wikidata] Kickstartet: Adding 2.2 million German organisations to Wikidata

2017-10-16 Thread Federico Morando
Dear All,

although in Italy these data are normally not available (not even the basic
data) from the chambers of commerce, there are some open data sources from
which we could extract several identifiers. Of course, these are biased toward
the suppliers of Public Administrations, because contracting with the PA is the
trigger for being listed in these open data.

In the context of a broader effort to upload this kind of data to Wikidata,
such as the one that seems to be emerging from this thread, the firm I manage
may be willing to contribute about half a million pairs of labels and VAT
IDs. It's a relatively thin dataset, in the sense that you just have the name
of the firm, the VAT ID, and possibly a link to a portal we're building where
you can gather additional information about the firm's activity with the
Italian public administration. But, as I was mentioning, Italian firm data
are quite rare (they are not even available on OpenCorporates.com).

By the way, https://www.wikidata.org/wiki/Property:P3608 (EU VAT number)
already exists and may provide a sufficient identifier, since in most cases
the country ISO code (e.g. IT for Italy) + the national VAT ID does generate
the EU VAT number (the actual algorithm may be a bit more complex, but it's
documented). (That said, there are also national identifiers that may be
worth creating, such as the registration number at national chambers of
commerce, etc.)
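The simple common case described above (ISO country code prefixed to the national VAT ID) can be sketched as follows; the helper name and the sample ID are mine, and as noted the real algorithm has country-specific exceptions:

```python
def eu_vat_number(country_iso: str, national_vat_id: str) -> str:
    """Naive construction of an EU VAT number (a P3608 value) from its parts.

    Only covers the plain prefix case; some member states alter or re-check
    the national ID, so treat this as an approximation.
    """
    return country_iso.upper() + national_vat_id.replace(" ", "")

# A hypothetical Italian national VAT ID 12345678901 would yield IT12345678901.
print(eu_vat_number("IT", "12345678901"))
```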

About the value of these data on Wikidata: starting from our use case, I
think that having permanent URIs for all firms on Wikidata would, for
instance, provide great value for several anti-corruption projects around the
world. (This could also provide a place to trace some international links
among companies, which are not always readily available today.) That said, I
perfectly understand Andra's concerns in terms of scalability and
maintenance, and this is one of the reasons I have not donated these data to
Wikidata so far.

I'll try to follow these discussions, but please - Sebastian or others -
feel free to ping me if the project goes on and you want to include these
Italian data.

Best,

Federico



On Mon, Oct 16, 2017 at 10:25 AM, Andra Waagmeester 
wrote:

> There is an equally large dataset on Belgian enterprises available. With the
> same objective of enriching Wikidata with enterprise data, I recently proposed
> the following property:
> https://www.wikidata.org/wiki/Wikidata:Property_proposal/NACE_code
>
> However, after some talks with others in the Wikidata community, I recently
> have had some second thoughts on whether or not a full dump of this type of
> dataset is a valuable enrichment of Wikidata. Adding 2 million items with an
> additional statement per item would be quite an enlargement of Wikidata. If a
> bot added all businesses of both Belgium and Germany, we would have 4 million
> new items, which currently would account for 10% of all of Wikidata. I am not
> sure what this would mean in terms of scalability and whether it would cause
> any scalability issues.
>
> Maybe a use-case-driven approach would be more appropriate here. We could
> think of a bot that draws on the trade registers of the different countries
> when a specific use case vouches for the inclusion of trade data.
>
> Just my 2cts
>
> Cheers,
>
> Andra
>
> On Mon, Oct 16, 2017 at 9:48 AM, Sebastian Hellmann <
> hellm...@informatik.uni-leipzig.de> wrote:
>
>> Thanks, done.
>>
>> https://www.wikidata.org/wiki/Wikidata:Project_chat#Handelsregister
>>
>> On 15.10.2017 22:10, Yaroslav Blanter wrote:
>>
>> Hi Sebastian,
>>
>> I would say the best way is to file a request for the permissions for the
>> bot
>>
>> https://www.wikidata.org/wiki/Wikidata:Requests_for_permissions/Bot
>>
>> and possibly leave a message on the Project Chat
>>
>> https://www.wikidata.org/wiki/Wikidata:Project_chat
>>
>> Cheers
>> Yaroslav
>>
>> On Sun, Oct 15, 2017 at 9:44 AM, Sebastian Hellmann <
>> hellm...@informatik.uni-leipzig.de> wrote:
>>
>>> Hi all,
>>>
>>> the German business registry contains roughly 2.2 million organisations.
>> Some information is paid, but some is public, i.e. the info you find when
>> searching and clicking on UT (see example below):
>>>
>>> https://www.handelsregister.de/rp_web/mask.do?Typ=e
>>>
>>>
>>> I would like to add this to Wikidata, either by crawling or by raising
>> money to use crowdsourcing platforms like CrowdFlower or Amazon Mechanical Turk.
>>>
>>>
>>> It should meet notability criteria 2: https://www.wikidata.org/wiki/
>>> Wikidata:Notability
>>>
>>> 2. It refers to an instance of a *clearly identifiable conceptual or
>>> material entity*. The entity must be notable, in the sense that it *can
>>> be described using serious and publicly available references*. If there
>>> is no item about you yet, you are probably not notable.
>>>
>>>
>>> The reference is the official German business registry, which is serious
>>> and public. Orgs are also per definition clearly identifi

Re: [Wikidata] Kickstartet: Adding 2.2 million German organisations to Wikidata

2017-10-16 Thread Andra Waagmeester
There is an equally large dataset on Belgian enterprises available. With the
same objective of enriching Wikidata with enterprise data, I recently proposed
the following property:
https://www.wikidata.org/wiki/Wikidata:Property_proposal/NACE_code

However, after some talks with others in the Wikidata community, I recently
have had some second thoughts on whether or not a full dump of this type of
dataset is a valuable enrichment of Wikidata. Adding 2 million items with an
additional statement per item would be quite an enlargement of Wikidata. If a
bot added all businesses of both Belgium and Germany, we would have 4 million
new items, which currently would account for 10% of all of Wikidata. I am not
sure what this would mean in terms of scalability and whether it would cause
any scalability issues.

Maybe a use-case-driven approach would be more appropriate here. We could
think of a bot that draws on the trade registers of the different countries
when a specific use case vouches for the inclusion of trade data.

Just my 2cts

Cheers,

Andra

On Mon, Oct 16, 2017 at 9:48 AM, Sebastian Hellmann <
hellm...@informatik.uni-leipzig.de> wrote:

> Thanks, done.
>
> https://www.wikidata.org/wiki/Wikidata:Project_chat#Handelsregister
>
> On 15.10.2017 22:10, Yaroslav Blanter wrote:
>
> Hi Sebastian,
>
> I would say the best way is to file a request for the permissions for the
> bot
>
> https://www.wikidata.org/wiki/Wikidata:Requests_for_permissions/Bot
>
> and possibly leave a message on the Project Chat
>
> https://www.wikidata.org/wiki/Wikidata:Project_chat
>
> Cheers
> Yaroslav
>
> On Sun, Oct 15, 2017 at 9:44 AM, Sebastian Hellmann <
> hellm...@informatik.uni-leipzig.de> wrote:
>
>> Hi all,
>>
>> the German business registry contains roughly 2.2 million organisations.
>> Some information is paid, but some is public, i.e. the info you find when
>> searching and clicking on UT (see example below):
>>
>> https://www.handelsregister.de/rp_web/mask.do?Typ=e
>>
>>
>> I would like to add this to Wikidata, either by crawling or by raising
>> money to use crowdsourcing platforms like CrowdFlower or Amazon Mechanical Turk.
>>
>>
>> It should meet notability criteria 2: https://www.wikidata.org/wiki/
>> Wikidata:Notability
>>
>> 2. It refers to an instance of a *clearly identifiable conceptual or
>> material entity*. The entity must be notable, in the sense that it *can
>> be described using serious and publicly available references*. If there
>> is no item about you yet, you are probably not notable.
>>
>>
>> The reference is the official German business registry, which is serious
>> and public. Orgs are also by definition clearly identifiable legal
>> entities.
>>
>> How can I get clearance to proceed on this?
>>
>> All the best,
>> Sebastian
>>
>>
>>
>> Entity data
>> Saxony District court *Leipzig HRB 32853 *– A&A
>> Dienstleistungsgesellschaft mbH
>> Legal status: Gesellschaft mit beschränkter Haftung
>> Capital: 25.000,00 EUR
>> Date of entry: 29/08/2016
>> (When entering date of entry, wrong data input can occur due to system
>> failures!)
>> Date of removal: -
>> Balance sheet available: -
>> Address (subject to correction): A&A Dienstleistungsgesellschaft mbH
>> Prager Straße 38-40
>> 04317 Leipzig
>>
>>
>> --
>> All the best,
>> Sebastian Hellmann
>>
>> Director of Knowledge Integration and Linked Data Technologies (KILT)
>> Competence Center
>> at the Institute for Applied Informatics (InfAI) at Leipzig University
>> Executive Director of the DBpedia Association
>> Projects: http://dbpedia.org, http://nlp2rdf.org,
>> http://linguistics.okfn.org, https://www.w3.org/community/ld4lt
>> 
>> Homepage: http://aksw.org/SebastianHellmann
>> Research Group: http://aksw.org
>>
>> ___
>> Wikidata mailing list
>> Wikidata@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/wikidata
>>
>>
>
>
>
>
> --
> All the best,
> Sebastian Hellmann
>
> Director of Knowledge Integration and Linked Data Technologies (KILT)
> Competence Center
> at the Institute for Applied Informatics (InfAI) at Leipzig University
> Executive Director of the DBpedia Association
> Projects: http://dbpedia.org, http://nlp2rdf.org,
> http://linguistics.okfn.org, https://www.w3.org/community/ld4lt
> 
> Homepage: http://aksw.org/SebastianHellmann
> Research Group: http://aksw.org
>
>
>
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Kickstartet: Adding 2.2 million German organisations to Wikidata

2017-10-16 Thread Neubert, Joachim
Hi Sebastian,

This is huge! It will cover almost all currently existing German companies. 
Many of these will have similar names, so preparing for disambiguation is a 
concern.

A good way for such an approach would be proposing a property for an external 
identifier, loading the data into Mix-n-match, creating links for companies 
already in Wikidata, and adding the rest (or perhaps only parts of them - I’m 
not sure if having all of them in Wikidata makes sense, but that’s another 
discussion), preferably with location and/or sector of trade in the description 
field.

I’ve tried to figure out what could be used as a key for an external identifier 
property. However, it looks like the registry does not offer any (persistent) 
URL to its entries. So for looking up a company, apparently there are two 
options:


-  conducting an extended search for the exact string “A&A 
Dienstleistungsgesellschaft mbH“

-  copying the register number “32853”, selecting the court 
(Leipzig) from the corresponding dropdown list, and searching for that
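Since the registry itself exposes no single identifier, one option (my own suggestion, not something the registry defines) would be to serialize the court, register type, and number into a single external-ID value:

```python
def register_key(court: str, register_type: str, number: str) -> str:
    """Compose a candidate external-ID value from the three registry fields.

    The "Court RegisterType Number" layout is a hypothetical serialization;
    the property proposal would have to settle on the exact format.
    """
    return f"{court} {register_type} {number}"

# The example entry from this thread would become "Leipzig HRB 32853".
```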

Neither way is very intuitive, even if we can provide a link to the search 
form. This would make for a weak connection to the source of information. More 
importantly, it makes disambiguation in Mix-n-match difficult. This applies to 
the preparation of your initial load (you would not want to create duplicates), 
but much more so to everybody else who wants to match his or her data later 
on. Being forced to search for entries manually, in a cumbersome way, to 
disambiguate against a new, possibly large and rich dataset is, in my eyes, not 
something we want to impose on future contributors. And often, the free 
information they find in the registry (formal name, register number, legal 
form, address) will not easily match the information they have (common 
name, location, perhaps founding date, and, most importantly, sector of trade), 
so disambiguation may still be difficult.

Have you checked which parts of the accessible information shown below can 
legally be crawled and added to external databases such as Wikidata?

Cheers, Joachim

--
Joachim Neubert

ZBW – German National Library of Economics
Leibniz Information Centre for Economics
Neuer Jungfernstieg 21
20354 Hamburg
Phone +49-42834-462



Von: Wikidata [mailto:wikidata-boun...@lists.wikimedia.org] Im Auftrag von 
Sebastian Hellmann
Gesendet: Sonntag, 15. Oktober 2017 09:45
An: wikidata@lists.wikimedia.org
Betreff: [Wikidata] Kickstartet: Adding 2.2 million German organisations to 
Wikidata


Hi all,

the German business registry contains roughly 2.2 million organisations. Some 
information is paid, but some is public, i.e. the info you find when searching 
and clicking on UT (see example below):

https://www.handelsregister.de/rp_web/mask.do?Typ=e



I would like to add this to Wikidata, either by crawling or by raising money to 
use crowdsourcing platforms like CrowdFlower or Amazon Mechanical Turk.



It should meet notability criteria 2: 
https://www.wikidata.org/wiki/Wikidata:Notability

2. It refers to an instance of a clearly identifiable conceptual or material 
entity. The entity must be notable, in the sense that it can be described using 
serious and publicly available references. If there is no item about you yet, 
you are probably not notable.

The reference is the official German business registry, which is serious and 
public. Orgs are also by definition clearly identifiable legal entities.

How can I get clearance to proceed on this?

All the best,
Sebastian





Entity data

Saxony District court Leipzig HRB 32853 – A&A Dienstleistungsgesellschaft mbH

Legal status:

Gesellschaft mit beschränkter Haftung

Capital:

25.000,00 EUR

Date of entry:

29/08/2016
(When entering date of entry, wrong data input can occur due to system 
failures!)

Date of removal:

-

Balance sheet available:

-

Address (subject to correction):

A&A Dienstleistungsgesellschaft mbH
Prager Straße 38-40
04317 Leipzig



--
All the best,
Sebastian Hellmann

Director of Knowledge Integration and Linked Data Technologies (KILT) 
Competence Center
at the Institute for Applied Informatics (InfAI) at Leipzig University
Executive Director of the DBpedia Association
Projects: http://dbpedia.org, http://nlp2rdf.org, http://linguistics.okfn.org, 
https://www.w3.org/community/ld4lt
Homepage: http://aksw.org/SebastianHellmann
Research Group: http://aksw.org
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Kickstartet: Adding 2.2 million German organisations to Wikidata

2017-10-16 Thread Sebastian Hellmann

Thanks, done.

https://www.wikidata.org/wiki/Wikidata:Project_chat#Handelsregister


On 15.10.2017 22:10, Yaroslav Blanter wrote:

Hi Sebastian,

I would say the best way is to file a request for the permissions for 
the bot


https://www.wikidata.org/wiki/Wikidata:Requests_for_permissions/Bot

and possibly leave a message on the Project Chat

https://www.wikidata.org/wiki/Wikidata:Project_chat

Cheers
Yaroslav

On Sun, Oct 15, 2017 at 9:44 AM, Sebastian Hellmann 
> wrote:


Hi all,

the German business registry contains roughly 2.2 million
organisations. Some information is paid, but some is public, i.e.
the info you find when searching and clicking on UT (see example
below):

https://www.handelsregister.de/rp_web/mask.do?Typ=e



I would like to add this to Wikidata, either by crawling or by
raising money to use crowdsourcing platforms like CrowdFlower or
Amazon Mechanical Turk.


It should meet notability criteria 2:
https://www.wikidata.org/wiki/Wikidata:Notability



2. It refers to an instance of a *clearly identifiable conceptual
or material entity*. The entity must be notable, in the sense
that it *can be described using serious and publicly available
references*. If there is no item about you yet, you are probably
not notable.



The reference is the official German business registry, which is
serious and public. Orgs are also by definition clearly
identifiable legal entities.

How can I get clearance to proceed on this?

All the best,
Sebastian



  Entity data


Saxony District court *Leipzig HRB 32853 *– A&A
Dienstleistungsgesellschaft mbH
Legal status:   Gesellschaft mit beschränkter Haftung
Capital:25.000,00 EUR
Date of entry:  29/08/2016
(When entering date of entry, wrong data input can occur due to
system failures!)
Date of removal:-
Balance sheet available:-
Address (subject to correction):A&A Dienstleistungsgesellschaft mbH
Prager Straße 38-40
04317 Leipzig


-- 
All the best,

Sebastian Hellmann

Director of Knowledge Integration and Linked Data Technologies
(KILT) Competence Center
at the Institute for Applied Informatics (InfAI) at Leipzig University
Executive Director of the DBpedia Association
Projects: http://dbpedia.org, http://nlp2rdf.org,
http://linguistics.okfn.org, https://www.w3.org/community/ld4lt

Homepage: http://aksw.org/SebastianHellmann

Research Group: http://aksw.org








--
All the best,
Sebastian Hellmann

Director of Knowledge Integration and Linked Data Technologies (KILT) 
Competence Center

at the Institute for Applied Informatics (InfAI) at Leipzig University
Executive Director of the DBpedia Association
Projects: http://dbpedia.org, http://nlp2rdf.org, 
http://linguistics.okfn.org, https://www.w3.org/community/ld4lt 


Homepage: http://aksw.org/SebastianHellmann
Research Group: http://aksw.org
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata