Re: [Wikidata] Duplicates in Wikidata

2015-12-27 Thread Gerard Meijssen
Hoi,
Realising that a lot of errors exist is easy. What makes the difference
here is that many of the problematic items are presented to us on a
platter. It makes sense to THEM to do this and it helps US improve our
quality.

By working through this list we become more valuable to OCLC because we
show that we act responsibly on these kinds of notifications.
THAT is why I am happy.
Thanks,
  GerardM

On 28 December 2015 at 00:19, John Erling Blad  wrote:

> There are also a lot of errors/duplicates in WorldCat.
>
> On Sun, Dec 27, 2015 at 12:43 PM, Gerard Meijssen <
> gerard.meijs...@gmail.com> wrote:
>
>> Hoi,
>> Probably :)
>> Thanks,
>>  Gerard
>>
>> On 27 December 2015 at 12:31, Federico Leva (Nemo) 
>> wrote:
>>
>>> Is this something for a Wikidata game? :)
>>>
>>> Nemo
>>>
>>>
>>> ___
>>> Wikidata mailing list
>>> Wikidata@lists.wikimedia.org
>>> https://lists.wikimedia.org/mailman/listinfo/wikidata
>>>


Re: [Wikidata] Duplicates in Wikidata

2015-12-27 Thread John Erling Blad
There are also a lot of errors/duplicates in WorldCat.

On Sun, Dec 27, 2015 at 12:43 PM, Gerard Meijssen  wrote:

> Hoi,
> Probably :)
> Thanks,
>  Gerard
>
> On 27 December 2015 at 12:31, Federico Leva (Nemo) 
> wrote:
>
>> Is this something for a Wikidata game? :)
>>
>> Nemo
>>


Re: [Wikidata] [ANNOUNCEMENT] StrepHit IEG project kick-off seminar

2015-12-27 Thread Marco Fossati
Hi Dario,

Date: Wed, 23 Dec 2015 08:04:33 -0800
> From: Dario Taraborelli 
> To: "Discussion list for the Wikidata project."
> 
> Subject: Re: [Wikidata] [ANNOUNCEMENT] StrepHit IEG project kick-off
> seminar
>
> Hi Marco,
>
> will the seminar be streamed or recorded?
>
I have to check with FBK's staff, but it should be straightforward.
I will take care of sharing the link with everyone.

Cheers,

Marco

>
> Dario
>
> On Wed, Dec 23, 2015 at 8:03 AM, Marco Fossati 
> wrote:
>
> > [Begging pardon if you read this multiple times]
> >
> > Hi everyone,
> >
> > I would like to announce with great pleasure the StrepHit IEG project
> > kick-off seminar.
> > Of course, you are all invited to attend.
> >
> > The event will be held on a special day: Wikipedia's birthday!
> >
> > Below you can find the details.
> >
> > Schedule: 15 January 2016, 11:00 am, Luigi Stringa Conference Room
> > Location: Fondazione Bruno Kessler, Via Sommarive 18, Povo, Trento, Italy
> > - http://www.openstreetmap.org/way/28933739
> >
> > Abstract: We kick off StrepHit, a project funded by the Wikimedia
> > Foundation through the Individual Engagement Grants program.
> > StrepHit is a Natural Language Processing pipeline that understands human
> > language, extracts facts from text and produces Wikidata statements with
> > reference URLs.
> > It will enhance the data quality of Wikidata by suggesting references to
> > validate statements, and will help Wikidata become the gold-standard hub of
> > the Open Data landscape.
> >
> > Link:
> > https://meta.wikimedia.org/wiki/Grants:IEG/StrepHit:_Wikidata_Statements_Validation_via_References
> >
> > Speaker's bio: Marco Fossati is a researcher with a double background in
> > Natural Languages and Information Technologies. He works at the Data and
> > Knowledge Management (DKM) research unit at Fondazione Bruno Kessler,
> > Trento, Italy. He is a member of the DBpedia Association board of trustees,
> > and the founder and representative of its Italian chapter. He has
> > interdisciplinary skills both in linguistics and in programming. His
> > research focuses on bridging the gap between Natural Language Processing
> > techniques and Large Scale Structured Knowledge Bases in order to drive the
> > Web of Data towards its full potential.
> >
> > See you in Trento and long live Wikipedia!
> > Cheers,
> >
> > Marco
> >
>
>
> --
>
>
> *Dario Taraborelli  *Head of Research, Wikimedia Foundation
> wikimediafoundation.org • nitens.org • @readermeter
> 


Re: [Wikidata] Duplicates in Wikidata

2015-12-27 Thread Gerard Meijssen
Hoi,
Probably :)
Thanks,
 Gerard

On 27 December 2015 at 12:31, Federico Leva (Nemo) 
wrote:

> Is this something for a Wikidata game? :)
>
> Nemo
>


Re: [Wikidata] Duplicates in Wikidata

2015-12-27 Thread Federico Leva (Nemo)

Is this something for a Wikidata game? :)

Nemo



Re: [Wikidata] Duplicates in Wikidata

2015-12-27 Thread Gerard Meijssen
Hoi,
Lists like these are gold. Yes, we are interested in such lists and yes,
they indicate issues that we can solve.

I did the first one. Moritz Büsgen is now only one person. It is suggested
that bots might merge these. Possibly, but it is one of those issues where
the community may have an opinion.

It would be cool to have a workflow where items that have been resolved
disappear from the list. As it stands, I fixed the first one but other
people have no way of knowing that it was fixed.

Thanks,
  GerardM

On 23 December 2015 at 23:05, Proffitt,Merrilee  wrote:

> Hello colleagues,
>
>
>
> During the most recent VIAF harvest we encountered a number of duplicate
> records in Wikidata. Forwarding on in case this is of interest (there is an
> attached file – not sure if that will go through on this list or not).
>
>
>
> Some discussion from OCLC colleagues is included below.
>
>
>
> Merrilee Proffitt, Senior Program Officer
> OCLC Research
>
>
>
> *From:* Toves,Jenny
> *Sent:* Tuesday, December 22, 2015 6:02 AM
> *To:* Proffitt,Merrilee
> *Subject:* FW: 201551 vs 201552
>
>
>
> Good morning Merrilee,
>
>
>
> You probably know that we harvest wikidata monthly for ingest into VIAF.
> This month we found 315 pairs of records that appear to be duplicates. That
> was a jump from previous months. I am not sure who would be interested in
> this but Thom & I thought you might be. The attached report has 630 lines
> showing what viaf saw as duplicates. So this pair of lines:
>
>
>
> WKP|Q21518392   =998$aCharles du Bois Larbalestier$2WKP|Q21341290$3duplicate
>
> WKP|Q21341290   =998$aCharles du Bois Larbalestier$2WKP|Q21518392$3duplicate
>
>
>
> Shows that those two wikidata numbers are linked to one another by viaf.
>
>
>
> I don’t think we expect you to do anything with this unless you find it
> interesting. I suspect there are bots to clean this stuff up but maybe not.
>
>
>
> --Jenny.
>
>
>
> *From:* Hickey,Thom
> *Sent:* Monday, December 21, 2015 9:47 PM
> *To:* Toves,Jenny
> *Subject:* RE: 201551 vs 201552
>
>
>
> She probably would be interested.
>
>
>
> --Th
>
>
>
>
> *From: *Toves,Jenny 
> *Sent: *Monday, December 21, 2015 9:35 PM
> *To: *Hickey,Thom 
> *Subject: *RE: 201551 vs 201552
>
>
>
> Exact same name + dates. Do you have a list of them? Do you think Merrilee or
> anyone would be interested?
>
>
>
> *From:* Hickey,Thom
> *Sent:* Monday, December 21, 2015 8:04 PM
> *To:* Toves,Jenny
> *Subject:* FW: 201551 vs 201552
>
>
>
> Noticed WKP duplicates went way up
>
> --Th
>
>
>
>
> *From: *Jenny Toves 
> *Sent: *Monday, December 21, 2015 5:12 PM
> *To: *Hickey,Thom ; Toves,Jenny 
> *Subject: *201551 vs 201552
>
>
>
>
>
> REPORT for records
>
> Changed 13.51%: geographic 3369217.0 -> 3824513.0
>
> Change in % of 8:NLR at_least_one_match 16% -> 24%
>
> Changed 19.83%:NLR all_matches 181437.0 -> 217423.0
>
> Change in % of 88:  NLR with_bibs 0% -> 88%
>
> Changed 17.99%:WKP geographic 2529990.0 -> 2985194.0
>
> Changed -19.95%:  WKP corporate 369224.0 -> 295579.0
>
>
>
> REPORT for matches
>
> Changed 12.70%:  exact corporate name 1021239.0 -> 1150899.0
>
> Changed 14.29%:XR  viafid 7.0 -> 8.0
>
> Changed -10.42%:  XR  expression title to sibling 48.0 -> 43.0
>
> Changed -16.16%:  PTBNP  forced 229.0 -> 192.0
>
> Changed -37.50%:  NSZL  forced 8.0 -> 5.0
>
> Changed 38.46%:NLP  suggested 13.0 -> 18.0
>
> No longer zero: NLR  standard number 0.0 -> 21479.0
>
> No longer zero: NLR  exact title 0.0 -> 5166.0
>
> No longer zero: NLR  partial date and partial title 0.0 -> 618.0
>
> No longer zero: NLR  name as subject 0.0 -> 62.0
>
> No longer zero: NLR  partial title and publisher 0.0 -> 88.0
>
> No longer zero: NLR  title 0.0 -> 5093.0
>
> Changed -47.66%:  NLR  forced single date 37125.0 -> 19430.0
>
> Changed 14.29%:NLR  viafid 14.0 -> 16.0
>
> No longer zero: NLR  partial date and publisher 0.0 -> 15894.0
>
> No longer zero: NLR  joint author 0.0 -> 5228.0
>
> Changed -14.49%:  LC  suggested 7594.0 -> 6494.0
>
> Changed 33.33%:CYT  viafid 12.0 -> 16.0
>
> Changed -21.08%:  NLA  forced 223.0 -> 176.0
>
> Changed 233.33%: LNL  forced 3.0 -> 10.0
>
> Changed 12.50%:NLB  viafid 8.0 -> 9.0
>
> Changed 16.67%:NLB  ngram corporate name 6.0 -> 7.0
>
> Changed 25.71%:VLACC  forced 35.0 -> 44.0
>
> Changed 19.13%:DNB  exact corporate name 315872.0 -> 376304.0
>
> Changed 14.29%:DNB  expression title to sibling 7.0 -> 8.0
>
> Changed 16.67%:BNF  expression title to sibling 6.0 -> 7.0
>
> Changed 15.91%:ICCU  forced 44.0 -> 51.0
>
> Changed 25.54%:NTA  forced 9699.0 -> 12176.0
>
> Changed 28.62%:WKP  exact corporate name 224787.0 -> 289112.0
>
> Changed 23.73%:WKP  longer corporate name 76057.0 -> 94106.0
>
> Changed 584.78%: WKP  duplicate reco

[Wikidata] Duplicates in Wikidata

2015-12-27 Thread Proffitt,Merrilee
Hello colleagues,

During the most recent VIAF harvest we encountered a number of duplicate 
records in Wikidata. Forwarding on in case this is of interest (there is an 
attached file – not sure if that will go through on this list or not).

Some discussion from OCLC colleagues is included below.

Merrilee Proffitt, Senior Program Officer
OCLC Research

From: Toves,Jenny
Sent: Tuesday, December 22, 2015 6:02 AM
To: Proffitt,Merrilee
Subject: FW: 201551 vs 201552

Good morning Merrilee,

You probably know that we harvest wikidata monthly for ingest into VIAF. This 
month we found 315 pairs of records that appear to be duplicates. That was a 
jump from previous months. I am not sure who would be interested in this but 
Thom & I thought you might be. The attached report has 630 lines showing what 
viaf saw as duplicates. So this pair of lines:

WKP|Q21518392   =998$aCharles du Bois Larbalestier$2WKP|Q21341290$3duplicate
WKP|Q21341290   =998$aCharles du Bois Larbalestier$2WKP|Q21518392$3duplicate

Shows that those two wikidata numbers are linked to one another by viaf.
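
For anyone who wants to process the attached report, here is a minimal sketch of how the mirrored lines could be collapsed into unique pairs. The line layout is inferred only from the two sample lines above, and the names `LINE_RE`, `parse_duplicate_line` and `unique_pairs` are made up for this illustration; they are not part of any VIAF or Wikidata tooling.

```python
import re

# Assumed layout, inferred only from the two sample lines in the message:
#   WKP|<QID> <spaces> =998$a<name>$2WKP|<QID>$3duplicate
LINE_RE = re.compile(
    r"^WKP\|(?P<source>Q\d+)\s+=998\$a(?P<name>.*?)"
    r"\$2WKP\|(?P<target>Q\d+)\$3duplicate\s*$"
)


def parse_duplicate_line(line):
    """Return (source_qid, target_qid, name), or None if the line does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    return m.group("source"), m.group("target"), m.group("name")


def unique_pairs(lines):
    """Collapse the two mirrored lines per duplicate into one ordered pair."""
    pairs = {}
    for line in lines:
        parsed = parse_duplicate_line(line)
        if parsed is None:
            continue  # skip anything that does not look like a duplicate line
        source, target, name = parsed
        # Order the pair by numeric Q-id so both mirrored lines share one key.
        key = tuple(sorted((source, target), key=lambda q: int(q[1:])))
        pairs.setdefault(key, name)
    return pairs


sample = [
    "WKP|Q21518392   =998$aCharles du Bois Larbalestier$2WKP|Q21341290$3duplicate",
    "WKP|Q21341290   =998$aCharles du Bois Larbalestier$2WKP|Q21518392$3duplicate",
]

for (low, high), name in unique_pairs(sample).items():
    print(low, high, name)
```

Because every duplicate is reported once from each side, a 630-line report reduces to 315 pairs this way, which matches the count given in the message.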

I don’t think we expect you to do anything with this unless you find it 
interesting. I suspect there are bots to clean this stuff up but maybe not.

--Jenny.

From: Hickey,Thom
Sent: Monday, December 21, 2015 9:47 PM
To: Toves,Jenny
Subject: RE: 201551 vs 201552

She probably would be interested.

--Th


From: Toves,Jenny
Sent: Monday, December 21, 2015 9:35 PM
To: Hickey,Thom
Subject: RE: 201551 vs 201552

Exact same name + dates. Do you have a list of them? Do you think Merrilee or
anyone would be interested?

From: Hickey,Thom
Sent: Monday, December 21, 2015 8:04 PM
To: Toves,Jenny
Subject: FW: 201551 vs 201552

Noticed WKP duplicates went way up
--Th


From: Jenny Toves
Sent: Monday, December 21, 2015 5:12 PM
To: Hickey,Thom; Toves,Jenny
Subject: 201551 vs 201552


REPORT for records
Changed 13.51%: geographic 3369217.0 -> 3824513.0
Change in % of 8: NLR at_least_one_match 16% -> 24%
Changed 19.83%: NLR all_matches 181437.0 -> 217423.0
Change in % of 88: NLR with_bibs 0% -> 88%
Changed 17.99%: WKP geographic 2529990.0 -> 2985194.0
Changed -19.95%: WKP corporate 369224.0 -> 295579.0

REPORT for matches
Changed 12.70%: exact corporate name 1021239.0 -> 1150899.0
Changed 14.29%: XR  viafid 7.0 -> 8.0
Changed -10.42%: XR  expression title to sibling 48.0 -> 43.0
Changed -16.16%: PTBNP  forced 229.0 -> 192.0
Changed -37.50%: NSZL  forced 8.0 -> 5.0
Changed 38.46%: NLP  suggested 13.0 -> 18.0
No longer zero: NLR  standard number 0.0 -> 21479.0
No longer zero: NLR  exact title 0.0 -> 5166.0
No longer zero: NLR  partial date and partial title 0.0 -> 618.0
No longer zero: NLR  name as subject 0.0 -> 62.0
No longer zero: NLR  partial title and publisher 0.0 -> 88.0
No longer zero: NLR  title 0.0 -> 5093.0
Changed -47.66%: NLR  forced single date 37125.0 -> 19430.0
Changed 14.29%: NLR  viafid 14.0 -> 16.0
No longer zero: NLR  partial date and publisher 0.0 -> 15894.0
No longer zero: NLR  joint author 0.0 -> 5228.0
Changed -14.49%: LC  suggested 7594.0 -> 6494.0
Changed 33.33%: CYT  viafid 12.0 -> 16.0
Changed -21.08%: NLA  forced 223.0 -> 176.0
Changed 233.33%: LNL  forced 3.0 -> 10.0
Changed 12.50%: NLB  viafid 8.0 -> 9.0
Changed 16.67%: NLB  ngram corporate name 6.0 -> 7.0
Changed 25.71%: VLACC  forced 35.0 -> 44.0
Changed 19.13%: DNB  exact corporate name 315872.0 -> 376304.0
Changed 14.29%: DNB  expression title to sibling 7.0 -> 8.0
Changed 16.67%: BNF  expression title to sibling 6.0 -> 7.0
Changed 15.91%: ICCU  forced 44.0 -> 51.0
Changed 25.54%: NTA  forced 9699.0 -> 12176.0
Changed 28.62%: WKP  exact corporate name 224787.0 -> 289112.0
Changed 23.73%: WKP  longer corporate name 76057.0 -> 94106.0
Changed 584.78%: WKP  duplicate record 92.0 -> 630.0
Changed -18.92%: EGAXA  forced 37.0 -> 30.0
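
The "Changed" figures above appear to be plain relative changes, (new - old) / old. The sketch below recomputes a few of them as a sanity check. The line layout is an assumption read off the report itself, and `CHANGED_RE`, `percent_change` and `check_line` are names invented for this example rather than anything from the VIAF tooling.

```python
import re

# Assumed layout: "Changed <pct>%: <label> <old> -> <new>"
# (spacing after the colon varies in the report, so it is matched loosely).
CHANGED_RE = re.compile(
    r"^Changed\s+(?P<pct>-?\d+(?:\.\d+)?)%:\s*(?P<label>.*?)\s+"
    r"(?P<old>\d+(?:\.\d+)?)\s*->\s*(?P<new>\d+(?:\.\d+)?)\s*$"
)


def percent_change(old, new):
    """Relative change from old to new, expressed in percent."""
    return (new - old) / old * 100.0


def check_line(line):
    """Return (label, reported_pct, computed_pct), or None for other line types."""
    m = CHANGED_RE.match(line.strip())
    if m is None:
        return None
    old, new = float(m.group("old")), float(m.group("new"))
    return m.group("label"), float(m.group("pct")), percent_change(old, new)


lines = [
    "Changed 584.78%: WKP  duplicate record 92.0 -> 630.0",
    "Changed -37.50%:  NSZL  forced 8.0 -> 5.0",
    "Changed 13.51%: geographic 3369217.0 -> 3824513.0",
]

for line in lines:
    label, reported, computed = check_line(line)
    print(f"{label}: reported {reported:.2f}%, recomputed {computed:.2f}%")
```

Under that reading, the jump in WKP duplicate records from 92 to 630 is the 584.78% in the report, and 630 is also the number of lines in the attached duplicates file.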

REPORT for tags
Changed 11.56%: NSZL  work links (993) 225.0 -> 251.0
No longer zero: NLR  wrote about (955) 0.0 -> 106.0
No longer zero: NLR  bibs (999) 0.0 -> 108202.0
No longer zero: NLR  was a subject (960) 0.0 -> 16423.0
No longer zero: NLR  relator code (941) 0.0 -> 103950.0
No longer zero: NLR  language of work (940) 0.0 -> 108193.0
No longer zero: NLR  issn (902) 0.0 -> 34.0
No longer zero: NLR  bib title (910) 0.0 -> 107895.0
No longer zero: NLR  joint corporate author (951) 0.0 -> 24235.0
Changed 146.67%: NLR  compared (996) 27448.0 -> 67705.0
No longer zero: NLR  rectype + biblvl (944) 0.0 -> 108194.0
No longer zero: NLR  country of publication (922) 0.0 -> 108169.0
No longer zero: N