Re: [Wikidata] Tool for consuming left-over data from import

2017-08-08 Thread Thad Guidry
Gerard,

Sure, working with linked data is great.  But sometimes data is not linked
at all and has no identifiers...

That's where the work Antonin is doing with OpenRefine helps: reconciling
even when there are no identifiers other than a name.  Many datasets only
have Strings as Things.  In fact, I'd say it's quite useful not only to
*add additional statements about existing Things* we already have, but
*also to add more Things* in the world that have yet to be included in a
database like Wikidata, where no identifiers have been created yet for
that Thing.
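
To make that concrete, a bare-bones name-only lookup against Wikidata might
look like the Python sketch below (my illustration only, using the public
wbsearchentities API; Antonin's reconciliation service does considerably more
than this kind of raw search, e.g. scoring and type constraints):

# Minimal sketch: look up candidate Wikidata items for a plain string,
# using the public wbsearchentities API.
import requests

def candidates(name, language="en", limit=5):
    """Return (QID, label, description) tuples for a free-text name."""
    resp = requests.get(
        "https://www.wikidata.org/w/api.php",
        params={
            "action": "wbsearchentities",
            "search": name,
            "language": language,
            "format": "json",
            "limit": limit,
        },
        headers={"User-Agent": "reconciliation-sketch/0.1"},
    )
    resp.raise_for_status()
    return [
        (hit["id"], hit.get("label", ""), hit.get("description", ""))
        for hit in resp.json().get("search", [])
    ]

for qid, label, desc in candidates("Douglas Adams"):
    print(qid, label, "-", desc)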

And I'm just as upset as you are about the goldmine of data still locked up
in Freebase.  But don't worry, baby steps, and eventually that data will
make its way into Wikidata.  Getting the Primary Sources tool up to par is
a big step towards that, but certainly not the end of the line.

-Thad
+ThadGuidry 

On Tue, Aug 8, 2017 at 6:57 AM Gerard Meijssen 
wrote:

> Hoi,
> Given that Wikidata has identifiers for many external sources, the
> challenge of reconciliation is often less of a challenge for crowds, and
> less of a challenge than it is often made out to be. A few examples: OCLC
> maintains two distinct identifiers, VIAF and ISNI, and both are actively
> maintained. When we include VIAF numbers in Wikidata, there will be
> instances where the identifiers become redirects. The same is true for
> ISNI. When we have the latest VIAF numbers, the ISNI numbers are highly
> likely to be correct (better than 95%, the minimum requirement for
> imports at ISNI).
>
> When we share our identifiers regularly, we will learn about redirects and
> gain the direct links. We shared our identifiers and VIAF identifiers with
> the Open Library. They now include them, and in return we received a file
> that helped us deduplicate our Open Library identifiers and replace the
> redirects. What is infuriating is that there are Open Library identifiers
> hidden in the Freebase data. They cannot be exported, so we cannot send
> them to OL for processing and import them into Wikidata. We do a subpar
> job as a consequence.
>
> Another project where we will gain information from multiple sources is
> the Biodiversity Heritage Library. We may gain links through their
> collaboration with the Internet Archive and OCLC. This will reduce the
> chances of introducing duplicates at our end because of shared
> identifiers. It will also reduce the number of people we have to process
> before they are included in Wikidata. It will allow OCLC, BHL and IA to
> learn of identifiers as we have them, allowing for subsequent improvement
> in quality in the future for all of us.
>
> So in my opinion we should aggressively share identifiers, collaborate,
> seek out the redirects and replace them, and become more and more a focal
> point for links between resources.
> Thanks,
>  GerardM
>
> On 8 August 2017 at 11:13, Marco Fossati  wrote:
>
>> Hi Antonin,
>>
>> On 8/7/17 20:36, Antonin Delpeuch (lists) wrote:
>>
>>> Does anybody know an alternative to CrowdFlower that can be used for
>>> free with volunteer workers?
>>>
>> There you go: https://crowdcrafting.org/
>> Hope this helps you keep up with your great work on openrefine.
>>
>> I believe entity reconciliation is one of the most challenging tasks that
>> keep third-party data providers away from imports to Wikidata.
>> Cheers,
>>
>> Marco
>>
>>
>> ___
>> Wikidata mailing list
>> Wikidata@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/wikidata
>>
>
> ___
> Wikidata mailing list
> Wikidata@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata
>
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Tool for consuming left-over data from import

2017-08-08 Thread Gerard Meijssen
Hoi,
Given that Wikidata has identifiers for many external sources, the challenge
of reconciliation is often less of a challenge for crowds, and less of a
challenge than it is often made out to be. A few examples: OCLC maintains two
distinct identifiers, VIAF and ISNI, and both are actively maintained.
When we include VIAF numbers in Wikidata, there will be instances where the
identifiers become redirects. The same is true for ISNI. When we have the
latest VIAF numbers, the ISNI numbers are highly likely to be correct
(better than 95%, the minimum requirement for imports at ISNI).

When we share our identifiers regularly, we will learn about redirects and
gain the direct links. We shared our identifiers and VIAF identifiers with
the Open Library. They now include them, and in return we received a file
that helped us deduplicate our Open Library identifiers and replace the
redirects. What is infuriating is that there are Open Library identifiers
hidden in the Freebase data. They cannot be exported, so we cannot send them
to OL for processing and import them into Wikidata. We do a subpar job as a
consequence.
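
As an illustration of that "learn about redirects" step (a sketch, not an
existing tool): list items that carry a VIAF ID (P214) through the query
service and ask viaf.org whether each ID now redirects to another cluster.
This assumes viaf.org answers a merged cluster with an HTTP redirect to the
surviving one.

import requests

SPARQL = "https://query.wikidata.org/sparql"
QUERY = "SELECT ?item ?viaf WHERE { ?item wdt:P214 ?viaf } LIMIT 100"

rows = requests.get(
    SPARQL,
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "viaf-redirect-sketch/0.1"},
).json()["results"]["bindings"]

for row in rows:
    viaf = row["viaf"]["value"]
    item = row["item"]["value"]
    # A merged VIAF cluster is assumed to redirect to the canonical one;
    # a GET works too if HEAD is not honoured.
    resp = requests.head(f"https://viaf.org/viaf/{viaf}", allow_redirects=True)
    final = resp.url.rstrip("/").rsplit("/", 1)[-1]
    if final != viaf:
        # The stored identifier is a redirect; the item should point at `final`.
        print(f"{item}: VIAF {viaf} -> {final}")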

Another project where we will gain information from multiple sources is
the Biodiversity Heritage Library. We may gain links through their
collaboration with the Internet Archive and OCLC. This will reduce the
chances of introducing duplicates at our end because of shared
identifiers. It will also reduce the number of people we have to process
before they are included in Wikidata. It will allow OCLC, BHL and IA to
learn of identifiers as we have them, allowing for subsequent improvement
in quality in the future for all of us.

So in my opinion we should aggressively share identifiers, collaborate,
seek out the redirects and replace them, and become more and more a focal
point for links between resources.
Thanks,
 GerardM

On 8 August 2017 at 11:13, Marco Fossati  wrote:

> Hi Antonin,
>
> On 8/7/17 20:36, Antonin Delpeuch (lists) wrote:
>
>> Does anybody know an alternative to CrowdFlower that can be used for
>> free with volunteer workers?
>>
> There you go: https://crowdcrafting.org/
> Hope this helps you keep up with your great work on openrefine.
>
> I believe entity reconciliation is one of the most challenging tasks that
> keep third-party data providers away from imports to Wikidata.
> Cheers,
>
> Marco
>
>
> ___
> Wikidata mailing list
> Wikidata@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata
>
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Tool for consuming left-over data from import

2017-08-08 Thread Antonin Delpeuch (lists)
On 08/08/2017 10:13, Marco Fossati wrote:
> On 8/7/17 20:36, Antonin Delpeuch (lists) wrote:
>> Does anybody know an alternative to CrowdFlower that can be used for
>> free with volunteer workers?
> There you go: https://crowdcrafting.org/
> Hope this helps you keep up with your great work on openrefine.
> 
> I believe entity reconciliation is one of the most challenging tasks
> that keep third-party data providers away from imports to Wikidata.

Thanks a lot! This looks great indeed - and the backend (pybossa) is
open source and very generic, which is awesome! That means we could quite
easily run it on WMF Cloud.
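
For instance, loading reconciliation questions into a PyBossa instance as
tasks could look roughly like the sketch below (written against PyBossa's
JSON REST API as I understand it; the project id and the task field layout
are made up, so check the docs of the instance you target):

import requests

ENDPOINT = "https://crowdcrafting.org"   # or a self-hosted PyBossa install
API_KEY = "YOUR-API-KEY"                 # from your PyBossa account page
PROJECT_ID = 1234                        # hypothetical project id

def add_task(cell_value, wikidata_candidates):
    """Create one crowdsourcing task: 'which candidate matches this string?'"""
    task = {
        "project_id": PROJECT_ID,
        "info": {
            "value": cell_value,               # the unreconciled cell text
            "candidates": wikidata_candidates, # e.g. [("Q42", "Douglas Adams")]
        },
    }
    resp = requests.post(
        f"{ENDPOINT}/api/task",
        params={"api_key": API_KEY},
        json=task,
    )
    resp.raise_for_status()
    return resp.json()["id"]

print(add_task("some unmatched string", [("Q42", "Douglas Adams")]))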

I've created an issue here:
https://github.com/sparkica/LODRefine/issues/25

Cheers,
Antonin

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Tool for consuming left-over data from import

2017-08-08 Thread Marco Fossati

Hi Antonin,

On 8/7/17 20:36, Antonin Delpeuch (lists) wrote:

Does anybody know an alternative to CrowdFlower that can be used for
free with volunteer workers?

There you go: https://crowdcrafting.org/
Hope this helps you keep up with your great work on openrefine.

I believe entity reconciliation is one of the most challenging tasks 
that keep third-party data providers away from imports to Wikidata.

Cheers,

Marco

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Tool for consuming left-over data from import

2017-08-07 Thread Antonin Delpeuch (lists)
Hi!

That reminds me of the crowdsourcing extension that LODrefine has - it
lets you crowdsource the manual part of the reconciliation process. But
it uses CrowdFlower for that (which is quite pricey). It'd be great if
Wikidata Game could evolve into a decent Wikimedia-focused alternative
to this sort of service - but that would be a lot of work.

Does anybody know an alternative to CrowdFlower that can be used for
free with volunteer workers?

Antonin

On 04/08/2017 15:57, André Costa wrote:
> Hi all!
> 
> As part of the Connected Open Heritage project Wikimedia Sverige have
> been migrating Wiki Loves Monuments datasets from Wikipedias to Wikidata.
> 
> In the course of doing this we keep a note of the data which we fail to
> migrate. For each of these left-over bits we know which item and which
> property it belongs to as well as the source field and language from the
> Wikipedia list.  An example would e.g. be a "type of building" field
> where we could not match the text to an item on Wikidata but know that
> the target property is P31.
> 
> We have created dumps of these (such as
> https://tools.wmflabs.org/coh/_total_se-ship_new.json, don't worry this
> one is tiny) but are now looking for an easy way for users to consume them.
> 
> Does anyone know of a tool which could do this today? The Wikidata game
> only allows (AFAIK) for yes/no/skip whereas you would here want
> something like /invalid/skip. And if not are there any
> tools which with a bit of forking could be made to do it?
> 
> We have only published a few dumps but there are more to come. I would
> also imagine that this, or a similar, format could be useful for other
> imports/template harvests where some fields are more easily handled by
> humans.
> 
> Any thoughts and suggestions are welcome.
> Cheers,
> André
> André Costa |Senior Developer, Wikimedia Sverige
> | andre.co...@wikimedia.se 
> | +46 (0)733-964574
> 
> Stöd fri kunskap, bli medlem i Wikimedia Sverige.
> Läs mer på blimedlem.wikimedia.se 
> 
> 
> 
> ___
> Wikidata mailing list
> Wikidata@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata
> 


___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Tool for consuming left-over data from import

2017-08-07 Thread Nick Wilson (Quiddity)
On Fri, Aug 4, 2017 at 10:57 AM, André Costa  wrote:
> Hi all!
>
> As part of the Connected Open Heritage project Wikimedia Sverige have been
> migrating Wiki Loves Monuments datasets from Wikipedias to Wikidata.
>
> In the course of doing this we keep a note of the data which we fail to
> migrate. For each of these left-over bits we know which item and which
> property it belongs to as well as the source field and language from the
> Wikipedia list.  An example would e.g. be a "type of building" field where
> we could not match the text to an item on Wikidata but know that the target
> property is P31.
>
> We have created dumps of these (such as
> https://tools.wmflabs.org/coh/_total_se-ship_new.json, don't worry this one
> is tiny) but are now looking for an easy way for users to consume them.
>
> Does anyone know of a tool which could do this today? The Wikidata game only
> allows (AFAIK) for yes/no/skip whereas you would here want something like
> /invalid/skip. And if not are there any tools which with a bit
> of forking could be made to do it?
>


(IANADeveloper, but) I believe the Wikidata Game might handle this.
E.g. the "Date" game has fields for dates:
http://storage8.static.itmages.com/i/17/0807/h_1502126752_6195952_63d5e0e3da.png
http://storage5.static.itmages.com/i/17/0807/h_1502126720_7252323_c4174b3da6.png
https://tools.wmflabs.org/wikidata-game/#mode=no_date
I forget if any of the Distributed Games have similar functionality
(and no time to check now).
Hope that helps!


> We have only published a few dumps but there are more to come. I would also
> imagine that this, or a similar, format could be useful for other
> imports/template harvests where some fields are more easily handled by
> humans.
>
> Any thoughts and suggestions are welcome.
> Cheers,
> André
> André Costa | Senior Developer, Wikimedia Sverige | andre.co...@wikimedia.se
> | +46 (0)733-964574
>
> Stöd fri kunskap, bli medlem i Wikimedia Sverige.
> Läs mer på blimedlem.wikimedia.se
>
>
> ___
> Wikidata mailing list
> Wikidata@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata
>



-- 
Nick Wilson (Quiddity)
Community Liaison, Wikimedia Foundation

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Tool for consuming left-over data from import

2017-08-04 Thread fn

Dear André,


Great work you have done.

I am wondering whether you are aware of the issues around the Danish 
dataset and the clean-up apparently required.


As far as I can determine, the German Wikipedia has had a number of 
articles on Danish dolmens, and they are also available on Wikidata. As 
far as I can see, these items have not been linked with the new Swedish 
additions.


For instance, "Dolmen von Tornby" https://www.wikidata.org/wiki/Q1269335 
has no Danish ID but is probably one of these items: 
https://www.wikidata.org/wiki/Q30240926 and 
https://www.wikidata.org/wiki/Q30240928 or 
https://www.wikidata.org/wiki/Q30114892 or 
https://www.wikidata.org/wiki/Q30114893 which the Alicia bot has added.


There are quite a lot of Danish dolmens on the German Wikipedia 
https://de.wikipedia.org/wiki/Kategorie:Gro%C3%9Fsteingrab_in_D%C3%A4nemark


I am sorry to present you with yet another problem. Perhaps the items 
can be matched by the geo-coordinate.
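
A rough sketch of that coordinate-based matching, using the query service's
wikibase:around search (the coordinates below are placeholders, not the
location of any particular dolmen):

import requests

SPARQL = "https://query.wikidata.org/sparql"

def items_near(lat, lon, radius_km=1):
    # Find items with coordinates (P625) within radius_km of the given point.
    query = f"""
    SELECT ?item ?itemLabel ?location WHERE {{
      SERVICE wikibase:around {{
        ?item wdt:P625 ?location .
        bd:serviceParam wikibase:center "Point({lon} {lat})"^^geo:wktLiteral .
        bd:serviceParam wikibase:radius "{radius_km}" .
      }}
      SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en,de,da" . }}
    }}
    """
    data = requests.get(
        SPARQL,
        params={"query": query, "format": "json"},
        headers={"User-Agent": "geo-match-sketch/0.1"},
    ).json()
    return [
        (b["item"]["value"], b["itemLabel"]["value"])
        for b in data["results"]["bindings"]
    ]

# e.g. candidates around a (hypothetical) spot in northern Jutland:
for uri, label in items_near(57.53, 9.94):
    print(uri, label)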



best regards
Finn




On 08/04/2017 04:57 PM, André Costa wrote:

Hi all!

As part of the Connected Open Heritage project Wikimedia Sverige have 
been migrating Wiki Loves Monuments datasets from Wikipedias to Wikidata.


In the course of doing this we keep a note of the data which we fail to 
migrate. For each of these left-over bits we know which item and which 
property it belongs to as well as the source field and language from the 
Wikipedia list.  An example would e.g. be a "type of building" field 
where we could not match the text to an item on Wikidata but know that 
the target property is P31.


We have created dumps of these (such as 
https://tools.wmflabs.org/coh/_total_se-ship_new.json, don't worry this 
one is tiny) but are now looking for an easy way for users to consume them.


Does anyone know of a tool which could do this today? The Wikidata game 
only allows (AFAIK) for yes/no/skip whereas you would here want 
something like /invalid/skip. And if not are there any 
tools which with a bit of forking could be made to do it?


We have only published a few dumps but there are more to come. I would 
also imagine that this, or a similar, format could be useful for other 
imports/template harvests where some fields are more easily handled by 
humans.


Any thoughts and suggestions are welcome.
Cheers,
André
André Costa |Senior Developer, Wikimedia Sverige 
|andre.co...@wikimedia.se  
|+46 (0)733-964574


Stöd fri kunskap, bli medlem i Wikimedia Sverige.
Läs mer på blimedlem.wikimedia.se 



___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata



___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


[Wikidata] Tool for consuming left-over data from import

2017-08-04 Thread André Costa
Hi all!

As part of the Connected Open Heritage project, Wikimedia Sverige have been
migrating Wiki Loves Monuments datasets from Wikipedias to Wikidata.

In the course of doing this we keep a note of the data which we fail to
migrate. For each of these left-over bits we know which item and which
property it belongs to, as well as the source field and language from the
Wikipedia list.  An example would be a "type of building" field where
we could not match the text to an item on Wikidata but know that the target
property is P31.

We have created dumps of these (such as
https://tools.wmflabs.org/coh/_total_se-ship_new.json, don't worry this one
is tiny) but are now looking for an easy way for users to consume them.

Does anyone know of a tool which could do this today? The Wikidata game
only allows (AFAIK) for yes/no/skip, whereas here you would want something
like /invalid/skip. And if not, are there any tools which, with
a bit of forking, could be made to do it?
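
To illustrate, a minimal consumption loop might look something like the
sketch below (not an existing tool, and the field names item/property/value
are only a guess at the dump layout): it asks a human for a QID, "invalid"
or "skip", and writes the matches out as QuickStatements rows.

import json

def consume(dump_path, out_path="matched.qs"):
    with open(dump_path) as f:
        leftovers = json.load(f)
    with open(out_path, "w") as out:
        for entry in leftovers:
            item, prop, value = entry["item"], entry["property"], entry["value"]
            answer = input(f"{item} {prop} <- '{value}'  [QID/i(nvalid)/s(kip)] ")
            if not answer or answer.lower().startswith("s"):
                continue
            if answer.lower().startswith("i"):
                # Flagged invalid; could be recorded separately if needed.
                continue
            # Tab-separated QuickStatements line: subject, property, target.
            out.write(f"{item}\t{prop}\t{answer.strip()}\n")

# consume("_total_se-ship_new.json")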

We have only published a few dumps but there are more to come. I would also
imagine that this, or a similar, format could be useful for other
imports/template harvests where some fields are more easily handled by
humans.

Any thoughts and suggestions are welcome.
Cheers,
André
André Costa | Senior Developer, Wikimedia Sverige | andre.co...@wikimedia.se
| +46 (0)733-964574

Stöd fri kunskap, bli medlem i Wikimedia Sverige.
Läs mer på blimedlem.wikimedia.se
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata