[Wikidata] Re: Wikidata Query Service scaling update Aug 2021

2021-08-19 Thread Marco Fossati
Dropping my two cents here: I'm wondering about the Wikidata Linked Data 
Fragments (LDF) service [1] usage.


LDF [2] is nice because it shifts the computation burden to the client, 
at the cost of less expressive SPARQL queries, IIRC.
I think it would be a good idea to forward simple queries to that 
service, instead of WDQS.
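
To give an idea of what that shift looks like in practice, here is a
minimal sketch of a client-side lookup against the LDF endpoint [1]. It
assumes the endpoint follows the standard Triple Pattern Fragments
interface (subject/predicate/object query parameters and content
negotiation); parameter names and the response format should be
double-checked against the live service.

import requests

LDF_ENDPOINT = 'https://query.wikidata.org/bigdata/ldf'

def fetch_fragment(subject=None, predicate=None, obj=None):
    """Fetch one page of triples matching a single triple pattern."""
    params = {}
    if subject:
        params['subject'] = subject
    if predicate:
        params['predicate'] = predicate
    if obj:
        params['object'] = obj
    response = requests.get(
        LDF_ENDPOINT, params=params,
        headers={'Accept': 'application/n-triples'},
    )
    response.raise_for_status()
    return response.text

# Example pattern: instances of (P31) human (Q5). Joining several such
# patterns is the client's job, which is exactly the computation burden
# that gets shifted away from the server.
print(fetch_fragment(
    predicate='http://www.wikidata.org/prop/direct/P31',
    obj='http://www.wikidata.org/entity/Q5',
))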


Cheers,

Marco

[1] https://query.wikidata.org/bigdata/ldf
[2] https://linkeddatafragments.org/

On 8/19/21 12:48 AM, Imre Samu wrote:
 > (i) identify and delete lower priority data (e.g. labels, 
descriptions, aliases, non-normalized values, etc);


Ouch.
For me:
- as a native Hungarian, the labels, descriptions, and aliases are
extremely important;
- as a data user, I use "labels" and "aliases" in my concordance tools
(mapping Wikidata IDs to external IDs).


So please clarify the practical meaning of *"delete"*.

Thanks in advance,
   Imre



Mike Pham <mp...@wikimedia.org> wrote on Wed, 18 Aug 2021 at 23:08:


Wikidata community members,


Thank you for all of your work helping Wikidata grow and improve
over the years. In the spirit of better communication, we would like
to take this opportunity to share some of the current challenges
Wikidata Query Service (WDQS) is facing, and some strategies we have
for dealing with them.


WDQS currently risks failing to provide acceptable service quality
due to the following reasons:

 1. Blazegraph scaling

    a. Graph size. WDQS uses Blazegraph as our graph backend. While
       Blazegraph can theoretically support 50 billion edges, in
       reality Wikidata is the largest graph we know of running on
       Blazegraph (~13 billion triples), and there is a risk that we
       will reach a size limit of what it can realistically support.
       Once Blazegraph is maxed out, WDQS can no longer be updated.
       This will also break Wikidata tools that rely on WDQS.

    b. Software support. Blazegraph is end-of-life software, which is
       no longer actively maintained, making it an unsustainable
       backend to continue moving forward with long term.


Blazegraph maxing out in size poses the greatest risk of catastrophic
failure, as it would effectively prevent WDQS from being updated
further, so it would inevitably fall out of date. Our long-term
strategy to address this is to move to a new graph backend that best
meets our WDQS needs and is actively maintained, and to begin the
migration off of Blazegraph as soon as a viable alternative is
identified.


In the interim period, we are exploring disaster mitigation options
for reducing Wikidata’s graph size in the case that we hit this
upper graph size limit: (i) identify and delete lower priority data
(e.g. labels, descriptions, aliases, non-normalized values, etc);
(ii) separate out certain subgraphs (such as Lexemes and/or
scholarly articles). This would be a last resort scenario to keep
Wikidata and WDQS running with reduced functionality while we are
able to deploy a more long-term solution.



 2. Update and access scaling

    a. Throughput. WDQS is currently trying to provide fast updates
       and fast, unlimited queries for all users. As the number of
       SPARQL queries grows over time alongside graph updates, WDQS is
       struggling to sufficiently keep up in each dimension of service
       quality without compromising anywhere. For users, this often
       leads to timed-out queries.

    b. Equitable service. We are currently unable to adjust system
       behavior per user/agent. As such, it is not possible to provide
       equitable service to users: for example, a heavy user could
       swamp WDQS enough to hinder usability by community users.


In addition to being a querying service for Wikidata, WDQS is also
part of the edit pipeline of Wikidata (every edit on Wikidata is
pushed to WDQS to update the data there). While deploying the new
Flink-based Streaming Updater will help with increasing throughput of
Wikidata updates, there is a substantial risk that WDQS will be unable
to keep up with the combination of increased querying and updating, re

[Wikidata] Re: URLs statistics for Discogs and MusicBrainz

2021-08-02 Thread Marco Fossati

Dear Seth,

The short answer is yes.
For more details, you can have a look at the discussion in the Wikidata 
chat: 
https://www.wikidata.org/wiki/Wikidata:Project_chat#URLs_statistics_for_Discogs_(Q504063)_and_MusicBrainz_(Q14005)


Cheers,

Marco

On 8/1/21 3:24 AM, Seth Deegan wrote:
We already have properties for most of these links? I'm not sure what 
you're asking as I have little knowledge of the context of the situation...


lectrician1,

On Thu, Jul 29, 2021 at 8:56 AM Marco Fossati <foss...@spaziodati.eu> wrote:


[Cross-posting from the Wikidata chat]

Hi everyone,

Following some feedback by Azertus (thanks!), I collected statistics on
the most frequent Web domains that occur in Discogs [1] and MusicBrainz
[2]. It looks like some of them may be candidates for identifier
property creation, while others stem from a failed match against known
properties, mainly due to inconsistencies in URL match pattern (P8966),
format as a regular expression (P1793), and formatter URL (P1630)
values.

You can have a look at them here [3].

It would be great to gather thoughts on the next steps.
Two main questions:
1. should we go for a property proposal for each of the candidates?
2. what's the best way to fix URL match pattern (P8966), format as a
regular expression (P1793), and formatter URL (P1630) values, so that
the next time we can convert URLs to proper identifiers?

Cheers,

Marco

[1] https://www.discogs.com/
[2] https://musicbrainz.org/
[3] https://meta.wikimedia.org/wiki/Grants:Project/Hjfocs/soweego_2/Timeline#July_2021
___
Wikidata mailing list -- wikidata@lists.wikimedia.org
To unsubscribe send an email to wikidata-le...@lists.wikimedia.org


___
Wikidata mailing list -- wikidata@lists.wikimedia.org
To unsubscribe send an email to wikidata-le...@lists.wikimedia.org


___
Wikidata mailing list -- wikidata@lists.wikimedia.org
To unsubscribe send an email to wikidata-le...@lists.wikimedia.org


[Wikidata] URLs statistics for Discogs and MusicBrainz

2021-07-29 Thread Marco Fossati

[Cross-posting from the Wikidata chat]

Hi everyone,

Following some feedback by Azertus (thanks!), I collected statistics on 
the most frequent Web domains that occur in Discogs [1] and MusicBrainz 
[2]. It looks like some of them may be candidates for identifier 
property creation, while others stem from a failed match against known 
properties, mainly due to inconsistencies in URL match pattern (P8966), 
format as a regular expression (P1793), and formatter URL (P1630) values.


You can have a look at them here [3].

It would be great to gather thoughts on the next steps.
Two main questions:
1. should we go for a property proposal for each of the candidates?
2. what's the best way to fix URL match pattern (P8966), format as a 
regular expression (P1793), and formatter URL (P1630) values, so that 
the next time we can convert URLs to proper identifiers?
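
To make question 2 concrete, here is a minimal sketch of the
URL-to-identifier conversion step, assuming the constraint values have
already been retrieved; the regex and formatter URL below are
illustrative placeholders, not the actual property values.

import re

# Illustrative only: a URL match pattern (P8966) regex and formatter URL
# (P1630) for a hypothetical MusicBrainz-style identifier property.
# Real values must be fetched from the property pages or via WDQS.
URL_MATCH_PATTERN = r'https?://musicbrainz\.org/artist/([0-9a-f-]{36})'
FORMATTER_URL = 'https://musicbrainz.org/artist/$1'

def url_to_identifier(url):
    """Return the bare identifier if the URL matches the pattern, else None."""
    match = re.fullmatch(URL_MATCH_PATTERN, url.strip())
    return match.group(1) if match else None

def identifier_to_url(identifier):
    """Rebuild the canonical URL from an identifier via the formatter URL."""
    return FORMATTER_URL.replace('$1', identifier)

example = 'https://musicbrainz.org/artist/01809552-4f87-4b94-8d8d-bb9e9b3ef3ed'
identifier = url_to_identifier(example)
if identifier:
    print(identifier, '->', identifier_to_url(identifier))

When the three constraint values are mutually inconsistent, this kind
of match silently fails, which is exactly how the leftover URLs in [3]
were produced.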


Cheers,

Marco

[1] https://www.discogs.com/
[2] https://musicbrainz.org/
[3] 
https://meta.wikimedia.org/wiki/Grants:Project/Hjfocs/soweego_2/Timeline#July_2021

___
Wikidata mailing list -- wikidata@lists.wikimedia.org
To unsubscribe send an email to wikidata-le...@lists.wikimedia.org


[Wikidata] Item validation criteria

2021-07-06 Thread Marco Fossati

[Please pardon me if you have already read this on the Wikidata chat]

Hello folks,


TL;DR: what do you think of the 3 validation criteria below?


I'm excited to let you know that the soweego 2 project has just started [1]!

To cut a long story short, soweego links Wikidata to large third-party 
catalogs.


The next step will be all about synchronization of Wikidata to a given 
target catalog through a set of validation criteria. Let me paste below 
some key parts of the project proposal.


1) existence: whether a target identifier found in a given Wikidata item 
is still available in the target catalog;
2) links: to what extent all URLs available in a Wikidata item overlap 
with those in the corresponding target catalog entry;
3) metadata: to what extent relevant statements available in a Wikidata 
item overlap with those in the corresponding target catalog entry.


These criteria would respectively trigger a set of actions. As a toy 
example:


1) Elvis Presley (Q303) has a MusicBrainz identifier 01809552, which 
does not exist in MusicBrainz anymore.

   Action = mark the identifier statement with a deprecated rank;
2) Elvis Presley (Q303) has 7 URLs, MusicBrainz 01809552 has 8 URLs, and 
3 overlap.
   Action = add 5 URLs from MusicBrainz to Elvis Presley (Q303) and 
submit 4 URLs from Wikidata to the MusicBrainz community;
3) Wikidata states that Elvis Presley (Q303) was born on January 8, 1935 
in Tupelo, while MusicBrainz states that 01809552 was born in 1934 in 
Memphis.
   Action = add 2 referenced statements with MusicBrainz values to 
Elvis Presley (Q303) and notify 2 Wikidata values to the MusicBrainz 
community.


In case of either full or no overlap in criteria 2 and 3, the Wikidata 
identifier statement should be marked with a preferred or a deprecated 
rank respectively.
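
To make the criteria more concrete, here is a minimal sketch of the
decision logic for criterion 2 and the ranking rule above, assuming the
URL sets have already been fetched; the example URLs are placeholders
and this is not the bot's actual implementation.

def validate_links(wikidata_urls, target_urls):
    """Criterion 2: compare URL sets and decide on actions."""
    overlap = wikidata_urls & target_urls
    to_add = target_urls - wikidata_urls     # URLs to add to Wikidata
    to_submit = wikidata_urls - target_urls  # URLs to submit to the catalog
    if not overlap:
        rank = 'deprecated'  # no overlap: the identifier looks wrong
    elif not to_add and not to_submit:
        rank = 'preferred'   # full overlap: the identifier looks right
    else:
        rank = None          # partial overlap: leave the rank untouched
    return rank, to_add, to_submit

# Toy example above: 7 Wikidata URLs, 8 MusicBrainz URLs, 3 shared.
shared = {f'https://example.org/shared/{i}' for i in range(3)}
wd = shared | {f'https://example.org/wd-only/{i}' for i in range(4)}
mb = shared | {f'https://example.org/mb-only/{i}' for i in range(5)}
rank, add, submit = validate_links(wd, mb)
print(rank, len(add), len(submit))  # None 5 4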


Please note that the soweego bot already has an approved task for 
criterion 2 [2], together with a set of test edits [3]. In addition, we 
performed (then reverted) a set of test edits for criterion 1 [4].


I'm glad to hear any thoughts about the validation criteria, keeping in 
mind that the more generic the better.


Stay tuned for more rock'n'roll!
With love,

Marco

[1] https://meta.wikimedia.org/wiki/Grants:Project/Hjfocs/soweego_2
[2] 
https://www.wikidata.org/wiki/Wikidata:Requests_for_permissions/Bot/Soweego_bot_2
[3] 
https://www.wikidata.org/w/index.php?title=Special:Contributions&target=Soweego_bot&contribs=user&start=2018-11-05&end=2018-11-05&limit=250
[4] 
https://www.wikidata.org/w/index.php?title=Special:Contributions&target=Soweego_bot&contribs=user&start=2018-11-07&end=2018-11-13&limit=100

___
Wikidata mailing list -- wikidata@lists.wikimedia.org
To unsubscribe send an email to wikidata-le...@lists.wikimedia.org


[Wikidata] soweego 2 proposal

2020-02-19 Thread Marco Fossati

Hi everyone,

---
TL;DR: soweego 2 is on its way.
   Here's the Project Grant proposal:

https://meta.wikimedia.org/wiki/Grants:Project/Hjfocs/soweego_2
---

Does the name *soweego* ring a bell?
It's an artificial intelligence that links Wikidata to large catalogs [1].
It's a close friend of Mix'n'match [2], which mainly caters for small 
catalogs.


The next big step is to check Wikidata content against third-party 
trusted sources.
In a nutshell, we want to enable feedback loops between Wikidatans and 
catalog maintainers.
The ultimate goal is to foster mutual benefits in the open knowledge 
landscape.


I'd be really grateful if you could have a look at the proposal page [3].

Can't wait for your feedback.
Best,

Marco

[1] https://soweego.readthedocs.io/
[2] https://tools.wmflabs.org/mix-n-match/
[3] https://meta.wikimedia.org/wiki/Grants:Project/Hjfocs/soweego_2

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Concise/Notable Wikidata Dump

2019-12-18 Thread Marco Fossati

Hi everyone,

Benno (in CC) has recently announced this tool:
https://tools.wmflabs.org/wdumps/

I haven't checked it out yet, but it sounds related to Aidan's inquiry.
Hope this helps.

Cheers,

Marco

On 12/18/19 8:01 AM, Edgard Marx wrote:

+1

On Tue, Dec 17, 2019, 19:14 Aidan Hogan wrote:


Hey all,

As someone who likes to use Wikidata in their research, and likes to
give students projects relating to Wikidata, I am finding it more and
more difficult to (recommend to) work with recent versions of Wikidata
due to the increasing dump sizes, where even the truthy version now
costs considerable time and machine resources to process and handle. In
some cases we just grin and bear the costs, while in other cases we
apply an ad hoc sampling to be able to play around with the data and
try
things quickly.

More generally, I think the growing data volumes might inadvertently
scare people off taking the dumps and using them in their research.

One idea we had recently to reduce the data size for a student project
while keeping the most notable parts of Wikidata was to only keep
claims
that involve an item linked to Wikipedia; in other words, if the
statement involves a Q item (in the "subject" or "object") not
linked to
Wikipedia, the statement is removed.

I wonder whether it would be possible for Wikidata to provide such a dump to
download (e.g., in RDF) for people who prefer to work with a more
concise sub-graph that still maintains the most "notable" parts? While
of course one could compute this from the full-dump locally, making
such
a version available as a dump directly would save clients some
resources, potentially encourage more research using/on Wikidata, and
having such a version "rubber-stamped" by Wikidata would also help to
justify the use of such a dataset for research purposes.
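
For concreteness, here is a rough sketch of how one could compute such
a concise dump locally, assuming a precomputed file that lists the
QIDs with at least one Wikipedia sitelink (e.g. extracted from the
full dump's schema:about triples or from WDQS); the file names are
placeholders.

import gzip
import re

WIKI_QIDS_FILE = 'wikipedia_linked_qids.txt'  # one QID per line, e.g. Q42
TRUTHY_DUMP = 'latest-truthy.nt.gz'           # N-Triples, gzipped
FILTERED_DUMP = 'concise-truthy.nt.gz'

QID_PATTERN = re.compile(r'<http://www\.wikidata\.org/entity/(Q\d+)>')

with open(WIKI_QIDS_FILE) as f:
    notable = {line.strip() for line in f if line.strip()}

kept = dropped = 0
with gzip.open(TRUTHY_DUMP, 'rt') as src, gzip.open(FILTERED_DUMP, 'wt') as dst:
    for line in src:
        qids = QID_PATTERN.findall(line)
        # Keep the triple only if every Q item it mentions (subject
        # and/or object) is linked to Wikipedia; triples whose objects
        # are literals or non-item IRIs pass through unchanged.
        if all(q in notable for q in qids):
            dst.write(line)
            kept += 1
        else:
            dropped += 1

print(f'kept {kept}, dropped {dropped} triples')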

... just an idea I thought I would float out there. Perhaps there is
another (better) way to define a concise dump.

Best,
Aidan

___
Wikidata mailing list
Wikidata@lists.wikimedia.org 
https://lists.wikimedia.org/mailman/listinfo/wikidata


___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata



___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Comparison of Wikidata, DBpedia, and Freebase (draft and invitation)

2019-10-01 Thread Marco Fossati

Hi Denny,

Thanks for publishing your Colab notebook!
I went through it and would like to share my first thoughts here. We can 
then move further discussion somewhere else.


1. in general, how can we compare datasets with totally different time 
stamps? Wikidata is alive, Freebase is dead, and the latest DBpedia dump 
is old;
2. given that all datasets contain Wikipedia links, perhaps we could use 
them as a bridge for the comparison, instead of Wikidata mappings. I'm 
assuming that Freebase and DBpedia entities with Wikidata mappings are 
subsets of the whole datasets (but this should be verified);
3. we could use record linkage techniques to connect Wikidata entities 
with Freebase and DBpedia ones, then assess the agreement in terms of 
statements per entity. There has been some experimental work (different 
use case and goal) in the soweego project:

https://soweego.readthedocs.io/en/latest/validator.html


On 10/1/19 1:13 AM, Denny Vrandečić wrote:
Marco, I totally agree with what you said - the project has stalled, and 
there is plenty of opportunity to harvest more data from Freebase and 
bring it to Wikidata, and this should be reignited.

Yeah, that would be great.
There is known work to do, but it's hard to sustain such a big project 
without allocated resources:

https://phabricator.wikimedia.org/maniphest/query/CPiqkafGs5G./#R

BTW, there is also version 2 of the Wikidata primary sources tool that 
needs love, although I'm now skeptical that it will be an effective way 
to achieve the Freebase harvesting.
We should probably rethink the whole thing, and restart small with very 
simple use cases, pretty much like the Harvest templates tool you mentioned:

https://tools.wmflabs.org/pltools/harvesttemplates/

Cheers,

Marco

P.S.: I *might* have found the freshest relevant DBpedia datasets:
https://databus.dbpedia.org/dbpedia/mappings/mappingbased-objects
I said *might* because it was really painful to find a download button 
and to guess among multiple versions of the same dataset:

https://downloads.dbpedia.org/repo/lts/mappings/mappingbased-objects/2019.09.01/mappingbased-objects_lang=en.ttl.bz2
@Sebastian may know if it's the right one :-)

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Google's stake in Wikidata and Wikipedia

2019-09-27 Thread Marco Fossati

Hey Sebastian,

On 9/20/19 10:22 AM, Sebastian Hellmann wrote:

Not much of Freebase did end up in Wikidata.


Dropping here some pointers to shed light on the migration of Freebase 
to Wikidata, since I was partially involved in the process:

1. WikiProject [1];
2. the paper behind [2];
3. datasets to be migrated [3].

I can confirm that the migration has stalled: as of today, *528
thousand* Freebase statements have been curated by the community, out
of *10 million*. By 'curated', I mean approved or rejected.
These numbers come from two queries against the primary sources tool 
database.


The stall is due to several causes: in my opinion, the most important 
one was the bad quality of sources [4,5] coming from the Knowledge Vault 
project [6].


Cheers,

Marco

[1] https://www.wikidata.org/wiki/Wikidata:WikiProject_Freebase
[2] 
http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/44818.pdf
[3] 
https://www.wikidata.org/wiki/Wikidata:Primary_sources_tool/Version_1#Data
[4] 
https://www.wikidata.org/wiki/Wikidata_talk:Primary_sources_tool/Archive/2017#Quality_of_sources
[5] 
https://www.wikidata.org/wiki/Wikidata:Requests_for_comment/Semi-automatic_Addition_of_References_to_Wikidata_Statements#A_whitelist_for_sources

[6] https://www.cs.ubc.ca/~murphyk/Papers/kv-kdd14.pdf

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Personal news: a new role

2019-09-23 Thread Marco Fossati

That's just awesome, Denny.
Unity is strength.
Wishing you all the best.

Marco


On 19/09/19 18:56, Denny Vrandečić wrote:

Hello all,

Over the last few years, more and more research teams all around the 
world have started to use Wikidata. Wikidata is becoming a fundamental 
resource [1]. That is also true for research at Google. One advantage of 
using Wikidata as a research resource is that it is available to 
everyone. Results can be reproduced and validated externally. Yay!


I had used my 20% time to support such teams. The requests became more 
frequent, and now I am moving to a new role in Google Research, akin to 
a Wikimedian in Residence [2]: my role is to promote understanding of 
the Wikimedia projects within Google, work with Googlers to share more 
resources with the Wikimedia communities, and to facilitate the 
improvement of Wikimedia content by the Wikimedia communities, all with 
a strong focus on Wikidata.


One deeply satisfying thing for me is that the goals of my new role and 
the goals of the communities are so well aligned: it is really about 
improving the coverage and quality of the content, and about pushing the 
projects closer towards letting everyone share in the sum of all knowledge.


Expect to see more from me again - there are already a number of fun 
ideas in the pipeline, and I am looking forward to see them get out of 
the gates! I am looking forward to hearing your ideas and suggestions, 
and to continue contributing to the Wikimedia goals.


Cheers,
Denny

P.S.: Which also means, incidentally, that my 20% time is opening for 
new shenanigans [3].


[1] https://www.semanticscholar.org/search?q=wikidata&sort=relevance
[2] https://meta.wikimedia.org/wiki/Wikimedian_in_residence
[3] https://wikipedia20.pubpub.org/pub/vyf7ksah


___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata



___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] soweego 1.1 is under way

2019-08-28 Thread Marco Fossati

Thanks for the feedback, Thad!
Would you mind opening a ticket on GitHub with your suggestions?
https://github.com/Wikidata/soweego/issues

In the meanwhile, the original soweego proposal gives more insight about 
the motivation and the outcomes, starting from the summary:

https://meta.wikimedia.org/wiki/Grants:Project/Hjfocs/soweego

Cheers,

Marco

On 28/08/19 18:39, Thad Guidry wrote:

Hmm

I'd love to see you record a 3 min video and add it as a link on your
README.md; it might help people quickly understand the
scope/applicability/benefits.


Your WHAT statement is not enough for fully understanding the WHY and 
BENEFIT provided.
I'd suggest adding a continued sentence, "...large-scale third-party
catalogs, so that you can ..."
"/soweego/ is a pipeline that connects Wikidata
<https://wikidata.org/> to large-scale third-party catalogs."


Thad
https://www.linkedin.com/in/thadguidry/


On Wed, Aug 28, 2019 at 11:22 AM Marco Fossati <foss...@spaziodati.eu> wrote:


Hi everyone,

Wearing the soweego project lead hat, I'm pleased to announce that the
Wikimedia Foundation has approved the *soweego 1.1* proposal:
https://meta.wikimedia.org/wiki/Grants:Project/Rapid/Hjfocs/soweego_1.1

The main goal is to put together different machine learning algorithms
and get the highest-quality links between Wikidata and large external
catalogs.

Stay tuned for more rock'n'roll at:
https://github.com/Wikidata/soweego

And while you're there, why don't you give a star?
Cheers,

Marco

___
Wikidata mailing list
Wikidata@lists.wikimedia.org <mailto:Wikidata@lists.wikimedia.org>
https://lists.wikimedia.org/mailman/listinfo/wikidata


___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata



___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


[Wikidata] soweego 1.1 is under way

2019-08-28 Thread Marco Fossati

Hi everyone,

Wearing the soweego project lead hat, I'm pleased to announce that the 
Wikimedia Foundation has approved the *soweego 1.1* proposal:

https://meta.wikimedia.org/wiki/Grants:Project/Rapid/Hjfocs/soweego_1.1

The main goal is to put together different machine learning algorithms 
and get the highest-quality links between Wikidata and large external 
catalogs.


Stay tuned for more rock'n'roll at:
https://github.com/Wikidata/soweego

And while you're there, why don't you give a star?
Cheers,

Marco

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


[Wikidata] soweego 1.0 release

2019-07-30 Thread Marco Fossati

Hi everyone,


TL;DR: soweego version 1 is out!
https://soweego.readthedocs.io/
Like it? Star it!


The soweego team is delighted to announce the release of *version 1* [1]!
If you like it, why don't you click on the Star button?

*soweego* links Wikidata to large catalogs through machine learning.
It partners with Mix'n'match [2], which mainly deals with small catalogs.

The soweego bot [3] is currently uploading *255 k confident* links to 
Wikidata: see it in action [4]!
*126 k medium-confident* links are instead getting into Mix'n'match for 
curation: see the current catalogs [5-13].


The soweego team has also worked hard to address the following community 
requests:
1. sync Wikidata to external catalogs & check them to spot 
inconsistencies in Wikidata;

2. import new catalogs with reasonable effort.

Thinking of the best way to contribute? Try to *import a new catalog* [14].

Best,

Marco

[1] https://soweego.readthedocs.io/
[2] https://tools.wmflabs.org/mix-n-match/
[3] https://www.wikidata.org/wiki/User:Soweego_bot
[4] https://xtools.wmflabs.org/ec/wikidata.org/Soweego%20bot
[5] https://tools.wmflabs.org/mix-n-match/#/catalog/2694
[6] https://tools.wmflabs.org/mix-n-match/#/catalog/2695
[7] https://tools.wmflabs.org/mix-n-match/#/catalog/2696
[8] https://tools.wmflabs.org/mix-n-match/#/catalog/2709
[9] https://tools.wmflabs.org/mix-n-match/#/catalog/2710
[10] https://tools.wmflabs.org/mix-n-match/#/catalog/2711
[11] https://tools.wmflabs.org/mix-n-match/#/catalog/2478
[12] https://tools.wmflabs.org/mix-n-match/#/catalog/2712
[13] https://tools.wmflabs.org/mix-n-match/#/catalog/2713
[14] https://soweego.readthedocs.io/en/latest/new_catalog.html

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


[Wikidata] soweego: link Wikidata to large catalogs

2019-07-10 Thread Marco Fossati

Dear all,

---
TL;DR: soweego version 1 will be released soon. In the meanwhile, why 
don't you consider endorsing the next steps?

https://meta.wikimedia.org/wiki/Grants:Project/Rapid/Hjfocs/soweego_1.1
---

This is a pre-release notification for early feedback.

Does the name *soweego* ring a bell?
It is a machine learning-based pipeline that links Wikidata to large 
catalogs [1].
It is a close friend of Mix'n'match [2], which mainly caters for small 
catalogs.


The first version is almost done, and will start uploading results soon.
Confident links are going to feed Wikidata via a bot [3], while others 
will get into Mix'n'match for curation.


The next short-term steps are detailed in a rapid grant proposal [4], 
and I would be really grateful if you could consider an endorsement there.


The soweego team has also tried its best to address the following 
community requests:
1. plan a sync mechanism between Wikidata and large catalogs / implement 
checks against external catalogs to find mismatches in Wikidata;

2. enable users to add links to new catalogs in a reasonable time.

So, here is the most valuable contribution you can give to the project 
right now: understand how to *import a new catalog* [5].


Can't wait for your reactions.
Cheers,

Marco

[1] https://soweego.readthedocs.io/
[2] https://tools.wmflabs.org/mix-n-match/
[3] see past contributions: 
https://www.wikidata.org/w/index.php?title=Special:Contributions/Soweego_bot&offset=20190401194034&target=Soweego+bot

[4] https://meta.wikimedia.org/wiki/Grants:Project/Rapid/Hjfocs/soweego_1.1
[5] https://soweego.readthedocs.io/en/latest/new_catalog.html

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Mapping Wikidata to other ontologies

2018-09-24 Thread Marco Fossati

Hi Maarten,

On 9/22/18 13:28, Maarten Dammers wrote:
The equivalent property and equivalent class are used, but not that 
much. Did anyone already try a structured approach with reporting? I'm 
considering parsing popular ontology descriptions and producing reports 
of what is linked to what so it's easy to make missing links, but I 
don't want to do double work here.

FYI, I operated the bot that added DBpedia ontology mappings:
https://www.wikidata.org/wiki/User:DBpedia-mapper-bot

I've not been updating the mappings for quite some time, so it would be 
useful to refresh them.
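
As a starting point for such a report, here is a minimal sketch that
pulls the existing mappings from the Wikidata Query Service, assuming
equivalent class (P1709) and equivalent property (P1628) and filtering
on the DBpedia ontology namespace; adapt the filter for other
ontologies.

import requests

WDQS = 'https://query.wikidata.org/sparql'

# Assumption: P1709 = equivalent class, P1628 = equivalent property.
QUERY = """
SELECT ?wd ?wdLabel ?external WHERE {
  { ?wd wdt:P1709 ?external } UNION { ?wd wdt:P1628 ?external }
  FILTER(STRSTARTS(STR(?external), "http://dbpedia.org/ontology/"))
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
"""

response = requests.get(
    WDQS, params={'query': QUERY, 'format': 'json'},
    headers={'User-Agent': 'ontology-mapping-report (example)'},
)
response.raise_for_status()
for row in response.json()['results']['bindings']:
    print(row['wd']['value'], row['wdLabel']['value'],
          '->', row['external']['value'])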

Feel free to ping me if you want to check out the bot implementation.

Cheers,

Marco

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] [Wikidata-tech] New release of the Wikidata Toolkit + survey

2017-12-19 Thread Marco Fossati

Hi guys,

Just a couple of clarifications from the primary sources tool 
development side.


On 12/18/17 20:12, Antonin Delpeuch (lists) wrote:

- for the RDF export feature, maybe we could repurpose it to export
statements to the Primary Sources Tool. I think the new version of the
PST is expected to ingest statements in RDF,

Yes, RDF will be used in the new back end.
Providers should upload their datasets in RDF.
QuickStatements [1] will still be supported, by means of a converter 
[2]: if providers want to stick with it, they just need to run the 
converter before submitting their datasets.

See also [3] for reference.

in a format that might
differ somewhat from what is available in the WDQS.
The format is the same as WDQS [4], although it is a subset of the data 
model, for the sake of simplicity.


As a side note, the Wikidata Toolkit seems to have a different RDF format 
(see first paragraph in [5]): I think it would be great if it could 
support the WDQS one.


Cheers,

Marco

[1] https://www.wikidata.org/wiki/Help:QuickStatements
[2] https://github.com/marfox/qs2rdf
[3] https://www.wikidata.org/wiki/Wikidata:Primary_sources_tool
[4] https://www.wikidata.org/wiki/Wikidata:Primary_sources_tool#Data_format
[5] https://www.mediawiki.org/wiki/Wikibase/Indexing/RDF_Dump_Format

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Wikidata prefix search is now Elastic

2017-10-27 Thread Marco Fossati

Sounds good, thank you Daniel and Stas.
Best,

Marco

On 10/26/17 19:20, Stas Malyshev wrote:

Hi!


Thanks a lot Stas for this present.
Could you please share any pointers on how to integrate it into other
tools?


It's the same API as before, wbsearchentities. If you need additional
profiles - i.e., different scoring/filtering, talk to me and/or file
phab task and we can look into it.
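
For reference, a minimal sketch of calling wbsearchentities from an
external tool (standard MediaWiki Action API parameters; the search
term is only an example):

import requests

API = 'https://www.wikidata.org/w/api.php'

def search_entities(term, language='en', entity_type='item', limit=10):
    """Prefix-search Wikidata entities via the wbsearchentities module."""
    params = {
        'action': 'wbsearchentities',
        'search': term,
        'language': language,
        'type': entity_type,
        'limit': limit,
        'format': 'json',
    }
    response = requests.get(API, params=params)
    response.raise_for_status()
    return response.json().get('search', [])

for hit in search_entities('douglas adam'):
    print(hit['id'], hit.get('label'), '-', hit.get('description'))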



___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Wikidata prefix search is now Elastic

2017-10-26 Thread Marco Fossati

Thanks a lot Stas for this present.
Could you please share any pointers on how to integrate it into other tools?

Cheers,

Marco

On 10/25/17 22:22, Stas Malyshev wrote:

Wikidata and Search Platform teams are happy to announce that Wikidata
prefix search is now using the new and improved ElasticSearch backend.


___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


[Wikidata] Soweego: wide-coverage linking of external identifiers. Call for support

2017-09-24 Thread Marco Fossati
Hi everyone,

Remember the Wikidata primary sources tool?
https://www.wikidata.org/wiki/Wikidata:Primary_sources_tool

While the StrepHit team is building its next version, I'd like to
invite you to have a look at a new project proposal.
The main goal is to add a high volume of identifiers to Wikidata,
ensuring live maintenance of links.

Do you think that Wikidata should become the central linking hub of
open knowledge?

If so, I'd be really grateful if you could endorse the *soweego* project:
https://meta.wikimedia.org/wiki/Grants:Project/Hjfocs/soweego

Of course, any comment is more than welcome on the discussion page.

Looking forward to your valuable feedback.
Best,

Marco

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


[Wikidata] Which external identifiers are worth covering?

2017-09-07 Thread Marco Fossati

Hi everyone,

As a data quality addict, I've been investigating the coverage of 
external identifiers linked to Wikidata items about people.


Given the numbers on SQID [1] and some SPARQL queries [2, 3], it seems
that even the second most used ID (VIAF) only covers circa *25%* of
people items.

Then, there is a long tail of IDs that are barely used at all.

So here is my question:
*which external identifiers deserve an effort to achieve exhaustive 
coverage?*


Looking forward to your valuable feedback.
Cheers,

Marco

[1] https://tools.wmflabs.org/sqid/#/browse?type=properties "Select 
datatype" set to "ExternalId", "Used for class" set to "human Q5"

[2] total people: http://tinyurl.com/ybvcm5uw
[3] people with a VIAF link: http://tinyurl.com/ya6dnpr7
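
For reference, the counts in [2] and [3] boil down to queries along
these lines (a sketch, assuming instance of (P31) human (Q5) for
people items and VIAF ID (P214); heavy COUNTs may time out on the
public endpoint):

import requests

WDQS = 'https://query.wikidata.org/sparql'

TOTAL_PEOPLE = 'SELECT (COUNT(?person) AS ?count) WHERE { ?person wdt:P31 wd:Q5 }'
PEOPLE_WITH_VIAF = (
    'SELECT (COUNT(?person) AS ?count) '
    'WHERE { ?person wdt:P31 wd:Q5 ; wdt:P214 ?viaf }'
)

def run_count(query):
    response = requests.get(WDQS, params={'query': query, 'format': 'json'})
    response.raise_for_status()
    return int(response.json()['results']['bindings'][0]['count']['value'])

total = run_count(TOTAL_PEOPLE)
with_viaf = run_count(PEOPLE_WITH_VIAF)
print(f'VIAF coverage: {with_viaf / total:.1%} of {total} people items')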

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Wordnet synset ID

2017-08-21 Thread Marco Fossati

Hi everyone,

I asked a related question during the presentation at Wikimania: my 
understanding is that Ontolex/Lemon [1, 2] was used to model the 
Wikidata lexicographical items.

WordNet was discarded, as it has an older model.
Best,

Marco

[1] https://www.w3.org/2016/05/ontolex/
[2] http://lemon-model.net/

On 8/15/17 23:39, Denny Vrandečić wrote:
 > That's a great question, I have no idea what the answer will turn out 
to be.

 >
 > Is there any current link between Wiktionary and WordNet? Or WordNet and
 > Wikipedia?


___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Tool for consuming left-over data from import

2017-08-08 Thread Marco Fossati

Hi Antonin,

On 8/7/17 20:36, Antonin Delpeuch (lists) wrote:

Does anybody know an alternative to CrowdFlower that can be used for
free with volunteer workers?

There you go: https://crowdcrafting.org/
Hope this helps you keep up with your great work on OpenRefine.

I believe entity reconciliation is one of the most challenging tasks 
that keep third-party data providers away from imports to Wikidata.

Cheers,

Marco

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


[Wikidata] Primary sources tool uplift proposal

2017-07-04 Thread Marco Fossati

Hi everyone,

The StrepHit team [1] has submitted an official uplift proposal for the 
primary sources tool [2].


This is part of a Wikimedia project grant [3], which has 2 big goals:
1. to improve the reference coverage of Wikidata statements;
2. to standardize the data release workflow for third-party providers.

We have worked hard integrating past discussions, extensively 
investigating MediaWiki and Wikidata code bases, interacting with 
specific people from the community.
Now we are pretty much satisfied with our proposal, and we hope you are 
too. Feel free to react on the project page!


Special thanks go to the StrepHit folks Tommaso (User:Kiailandi) and 
Francesco (User:Afnecors) and everyone who supported us in this delicate 
phase.


Best,

Marco

[1] 
https://meta.wikimedia.org/wiki/Grants:IEG/StrepHit:_Wikidata_Statements_Validation_via_References/Profile

[2] https://www.wikidata.org/wiki/Wikidata:Primary_sources_tool
[3] 
https://meta.wikimedia.org/wiki/Grants:IEG/StrepHit:_Wikidata_Statements_Validation_via_References/Renewal#Scope


___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] SQID now supports PrimarySources

2017-01-09 Thread Marco Fossati
Thanks for the report, Markus.
I will file it in the primary sources tool issues.
Best,

Marco


On 28 Dec 2016, 3:33 PM, "Markus Kroetzsch" wrote:

Hi Marco,


On 27.12.2016 14:27, Marco Fossati wrote:

> Hi Markus and thanks for this major SQID update.
>
> On 12/16/16 13:32, Markus Kroetzsch wrote:
>
>> == Known issues ==
>>
>> * Some statements cannot be rejected in Primary Sources. This problem
>> affects both SQID and the Wikidata gadget in the same way. It seems to
>> be a bug in the PS web service, which we hope will be fixed at some
>> point.
>>
> Just to track down this, could you please point me to the corresponding
> phabricator issue?
>

I don't know if/where this is tracked so far. You can see the problem
already on Q1 if you try to reject the strange "point in time" statements
Primary Sources suggests there. The attempt to reject fails in both the PS
gadget and in SQID, as far as I recall because PS says that the rejected
statement's ID does not exist (in spite of the statement being suggested
with this ID). That's all I know.

Best,

Markus



Please note that known bugs previously filed on GitHub have been
> recently migrated to phabricator:
> https://github.com/Wikidata/primarysources/issues?q=is%3Aissue+is%3Aclosed
> Thanks,
>
> Marco
>
> ___
> Wikidata mailing list
> Wikidata@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata
>

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Cleaning up Primary Sources (Freebase import)

2016-12-27 Thread Marco Fossati

I was about to mention the StrepHit renewal proposal.
Thanks Lydia for doing that faster than me! :-)
Best,

Marco

On 12/20/16 19:56, Lydia Pintscher wrote:

On Tue, Dec 20, 2016 at 7:40 PM, Gerard Meijssen
 wrote:

Hoi,
Please consider, it has been said all too often that Primary Sources is the
tool that should be used. Given that it has a bad UI and is not maintained;
what benefits does it hold?

Why do we throw away all the good work when we do not value it?


https://meta.wikimedia.org/wiki/Grants:IEG/StrepHit:_Wikidata_Statements_Validation_via_References/Renewal
will hopefully be approved in the next days to give Marco the time to
give the primary sources tool some much needed love.


Cheers
Lydia



___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] SQID now supports PrimarySources

2016-12-27 Thread Marco Fossati

Hi Markus and thanks for this major SQID update.

On 12/16/16 13:32, Markus Kroetzsch wrote:

== Known issues ==

* Some statements cannot be rejected in Primary Sources. This problem
affects both SQID and the Wikidata gadget in the same way. It seems to
be a bug in the PS web service, which we hope will be fixed at some point.
Just to track this down, could you please point me to the corresponding 
phabricator issue?
Please note that known bugs previously filed on GitHub have been 
recently migrated to phabricator:

https://github.com/Wikidata/primarysources/issues?q=is%3Aissue+is%3Aclosed
Thanks,

Marco

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Primary Sources Tool - Someone Goofed Up !

2016-12-02 Thread Marco Fossati

Hi Thad,

The examples you mentioned are facts automatically extracted from 
natural language texts.
It looks like those facts are not incorrect on their own: for instance, 
the first 3 people listed in https://www.wikidata.org/wiki/Q11629 seem 
to be painters indeed.


In my opinion, what is wrong here is the choice of the Wikidata property 
'creator' (P170), which should be definitely replaced by some more 
appropriate property.


Thanks for reporting that.
Best,

Marco

On 11/30/16 17:41, Thad Guidry wrote:

and another one: https://www.wikidata.org/wiki/Q11634

I think I am seeing a pattern here:  ME :)

On Wed, Nov 30, 2016 at 10:37 AM Thad Guidry <thadgui...@gmail.com> wrote:

Found another one:  https://www.wikidata.org/wiki/Q11633


On Wed, Nov 30, 2016 at 10:32 AM Thad Guidry <thadgui...@gmail.com> wrote:

Can someone look into how or why all of those 'creators' were
added with the Primary Sources Tool incorrectly on the topic
'painting' ?

https://www.wikidata.org/wiki/Q11629

WARNING:  The page actually will freeze for 1-2 min while the
'creator' statement tries to load into the browser !

-Thad



___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata



___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Data import hub, data preperation instructions and import workflow for muggles

2016-11-22 Thread Marco Fossati

Hi Navino,

Currently, there is an (undocumented and untested) API endpoint 
accepting POST requests with QuickStatements datasets:

https://github.com/Wikidata/primarysources/tree/master/backend#import-statements
If you want to try it, feel free to privately ping me for an API token.

As a side note, the primary sources tool is undergoing a Wikimedia 
Foundation grant renewal request to give it a radical uplift:

https://meta.wikimedia.org/wiki/Grants:IEG/StrepHit:_Wikidata_Statements_Validation_via_References/Renewal

Best,

Marco

On 11/22/16 14:00, Navino Evans wrote:

Thanks Marco!

Do you know if there's a system in place yet for adding new data to the
Primary Sources Tool?  I thought it was still only covering Freebase
data at the moment, but it should be in the import guide for sure if it
can be used for new data sets already.

Cheers,

Navino




On 22 November 2016 at 09:43, Marco Fossati <foss...@spaziodati.eu> wrote:

Hi John, Navino,

the primary sources tool uses the QuickStatements syntax for
large-scale non-curated dataset imports, see:

https://www.wikidata.org/wiki/Wikidata:Data_donation#3._Work_with_the_Wikidata_community_to_import_the_data

<https://www.wikidata.org/wiki/Wikidata:Data_donation#3._Work_with_the_Wikidata_community_to_import_the_data>
https://www.wikidata.org/wiki/Wikidata:Primary_sources_tool
<https://www.wikidata.org/wiki/Wikidata:Primary_sources_tool>

Best,

Marco

On 11/21/16 21:39, Navino Evans wrote:

I've just added some more to the page in the previously 'coming soon'
Self import section
<https://www.wikidata.org/wiki/User:John_Cummings/wikidataimport_guide#Option_2:_Self_import>,
as it seemed like this is actually the place where QuickStatements and
mix'n'match should come in.
I've tried to keep the details out, and just give a guide to choosing
which tool/approach to use for a particular situation. The mechanics of
using the tools etc should all be on other pages I presume.

Cheers,

Nav


On 21 November 2016 at 16:24, john cummings <mrjohncummi...@gmail.com> wrote:

Hi Magnus

I've avoided mentioning those for now as I know you are working on new
tools, also I'm not very good at using them so wouldn't write good
instructions :) I hope that once this is somewhere 'proper' that others
with more knowledge can add this information in.

My main idea with this is to break up the steps so people can
collaborate on importing datasets and also learn skills along the
workflow over time rather than having to learn everything in one go.

Thanks

John

On 21 November 2016 at 17:11, Magnus Manske <magnusman...@googlemail.com> wrote:

There are other options to consider:
* Curated import/sync via mix'n'match
* Batch-based import via QuickStatements (also see rewrite plans at
https://www.wikidata.org/wiki/User:Magnus_Manske/quick_statements2)

On Mon, Nov 21, 2016 at 3:11 PM john cummings <mrjohncummi...@gmail.com> wrote:

Dear all

Myself and Navino Evans have been working on a bare bone as possible
workflow and instructions for making importing data into Wikidata
available to muggles like me. We have written instructions up to the
point where people would make a request on the 'bot requests' page to
import the data into Wikidata.

Please take a look and share your thoughts

https://www.wikidata.org/wiki/User:John_Cummings/Dataimporthub

Re: [Wikidata] Data import hub, data preperation instructions and import workflow for muggles

2016-11-22 Thread Marco Fossati

Hi John, Navino,

the primary sources tool uses the QuickStatements syntax for large-scale 
non-curated dataset imports, see:

https://www.wikidata.org/wiki/Wikidata:Data_donation#3._Work_with_the_Wikidata_community_to_import_the_data
https://www.wikidata.org/wiki/Wikidata:Primary_sources_tool

Best,

Marco

On 11/21/16 21:39, Navino Evans wrote:

I've just added some more to the page in the previously 'coming soon'
Self import section,
as it seemed like this is actually the place where QuickStatements and
mix'n'match should come in.
I've tried to keep the details out, and just give a guide to choosing
which tool/approach to use for a particular situation. The mechanics of
using the tools etc should all be on other pages I presume.

Cheers,

Nav


On 21 November 2016 at 16:24, john cummings <mrjohncummi...@gmail.com> wrote:

Hi Magnus

I've avoided mentioning those for now as I know you are working on
new tools, also I'm not very good at using them so wouldn't write
good instructions :) I hope that once this is somewhere 'proper'
that others with more knowledge can add this information in.

My main idea with this is to break up the steps so people can
collaborate on importing datasets and also learn skills along the
workflow over time rather than having to learn everything in one go.

Thanks

John

On 21 November 2016 at 17:11, Magnus Manske <magnusman...@googlemail.com> wrote:

There are other options to consider:
* Curated import/sync via mix'n'match
* Batch-based import via QuickStatements (also see rewrite plans at
https://www.wikidata.org/wiki/User:Magnus_Manske/quick_statements2)

On Mon, Nov 21, 2016 at 3:11 PM john cummings <mrjohncummi...@gmail.com> wrote:

Dear all


Myself and Navino Evans have been working on a bare bone as
possible workflow and instructions for making importing data
into Wikidata available to muggles like me. We have written
instructions up to the point where people would make a
request on the 'bot requests' page to import the data into
Wikidata.


Please take a look and share your thoughts


https://www.wikidata.org/wiki/User:John_Cummings/Dataimporthub



https://www.wikidata.org/wiki/User:John_Cummings/wikidataimport_guide




Thanks very much


John

___
Wikidata mailing list
Wikidata@lists.wikimedia.org

https://lists.wikimedia.org/mailman/listinfo/wikidata



___
Wikidata mailing list
Wikidata@lists.wikimedia.org 
https://lists.wikimedia.org/mailman/listinfo/wikidata




___
Wikidata mailing list
Wikidata@lists.wikimedia.org 
https://lists.wikimedia.org/mailman/listinfo/wikidata





--

nav...@histropedia.com

@NavinoEvans

-

   www.histropedia.com

Twitter | Facebook | Google+




___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata



___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


[Wikidata] Fwd: Re: [wikicite-discuss] Entity tagging and fact extraction (from a scholarly publisher perspective)

2016-11-11 Thread Marco Fossati
-- Forwarded message --
From: "Marco Fossati" 
Date: 11 Nov 2016, 1:23 PM
Subject: Fwd: Re: [wikicite-discuss] Entity tagging and fact extraction
(from a scholarly publisher perspective)
To: "Marco Fossati" 
Cc:

-- Forwarded message --
From: "Marco Fossati" 
Date: 11 Nov 2016, 1:18 PM
Subject: Re: [wikicite-discuss] Entity tagging and fact extraction (from a
scholarly publisher perspective)
To: "Andrew Smeall" 
Cc: "Dario Taraborelli" , "Benjamin Good" <
ben.mcgee.g...@gmail.com>, "Discussion list for the Wikidata project." <
wikidata@lists.wikimedia.org>, "wikicite-discuss" <
wikicite-disc...@wikimedia.org>, "Daniel Mietchen" <
daniel.mietc...@googlemail.com>

Hi everyone,

Just a couple of thoughts, which are in line with Dario's first message:
1. the primary sources tool lets third party providers release *full
datasets* in a rather quick way. It is conceived to (a) ease the ingestion
of *non-curated* data and to (b) make the community directly decide which
statements should be included, instead of eventually complex a priori
discussions.
Important: the datasets should comply with the Wikidata vocabulary/ontology.

2. I see the mix'n'match tool as a way to *link* datasets with Wikidata via
ID mappings, thus only requiring statements that say "Wikidata entity X
links to the third party dataset entity Y".
This is pretty much what the linked data community has been doing so far.
No need to comply with the Wikidata vocabulary/ontology.

Best,

Marco

On 11 Nov 2016, 10:27 AM, "Andrew Smeall" wrote:

> Regarding the topics/vocabularies issue:
>
> A challenge we're working on is finding a set of controlled vocabularies
> for all the subject areas we cover.
>
> We do use MeSH for those subjects, but this only applies to about 40% of
> our papers. In Engineering, for example, we've had more trouble finding an
> open taxonomy with the same level of depth as MeSH. For most internal
> applications, we need 100% coverage of all subjects.
>
> Machine learning for concept tagging is trendy now, partly because it
> doesn't require a preset vocabulary, but we are somewhat against this
> approach because we want to control the mapping of terms and a taxonomic
> hierarchy can be useful. The current ML tools I've seen can match to a
> controlled vocabulary, but then they need the publisher to supply the terms.
>
> The temptation to build a new vocabulary is strong, because it's the
> fastest way to get to something that is non-proprietary and universal. We
> can merge existing open vocabularies like MeSH and PLOS to get most of the
> way there, but we then need to extend that with concepts from our corpus.
>
> Thanks Daniel and Benjamin for your responses. Any other feedback would be
> great, and I'm always happy to delve into issues from the publisher
> perspective if that can be helpful.
>
> On Fri, Nov 11, 2016 at 4:54 PM, Dario Taraborelli <
> dtarabore...@wikimedia.org> wrote:
>
>> Benjamin – agreed, I too see Wikidata as mainly a place to hold all the
>> mappings. Once we support federated queries in WDQS, the benefit of ID
>> mapping (over extensive data ingestion) will become even more apparent.
>>
>> Hope Andrew and other interested parties can pick up this thread.
>>
>> On Wed, Nov 2, 2016 at 12:11 PM, Benjamin Good 
>> wrote:
>>
>>> Dario,
>>>
>>> One message you can send is that they can and should use existing
>>> controlled vocabularies and ontologies to construct the metadata they want
>>> to share.  For example, MeSH descriptors would be a good way for them to
>>> organize the 'primary topic' assertions for their articles and would make
>>> it easy to find the corresponding items in Wikidata when uploading.  Our
>>> group will be continuing to expand coverage of identifiers and concepts
>>> from vocabularies like that in Wikidata - and any help there from
>>> publishers would be appreciated!
>>>
>>> My view here is that Wikidata can be a bridge to the terminologies and
>>> datasets that live outside it - not really a replacement for them.  So, if
>>> they have good practices about using shared vocabularies already, it should
>>> (eventually) be relatively easy to move relevant assertions into the
>>> WIkidata graph while maintaining interoperability and integration with
>>> external software systems.
>>>
>>> -Ben
>>>
>>> On Wed, Nov 2, 2016 at 8:31 AM, 'Daniel Mietchen' via wikicite-discuss <
>>> wikicite-disc...@wikimedia.org> wro

Re: [Wikidata] Starting a WikiCite newsletter

2016-09-30 Thread Marco Fossati

Thanks a lot for your effort, Dario!
I will add my contribution about the primary sources tool.
Cheers,

Marco

On 9/30/16 00:49, Dario Taraborelli wrote:

Hey all,

it's been a while since we posted a big update on everything that's been
happening around WikiCite  on
this list. A few people suggested a regular (monthly or quarterly)
newsletter would be a good way to keep everyone posted on the latest
developments, so I am starting one.

Below are the main highlights for September. I'm cross-posting this to
the Wikidata mailing list and hosting the full-text of the WikiCite
newsletter  on Meta.

Please add anything I may have missed on the wiki. Copying the full
content below, future updates will only include a link to save bandwidth
and ugly HTML email.

Best,
Dario


September 2016


  Milestones


All English Wikipedia references citing PMCIDs

Thanks to Daniel Mietchen, all references used in English Wikipedia
with a PubMed Central identifier (P932), based on a dataset produced by
Aaron Halfaker using the mwcites library, have been created in
Wikidata. As of today, there are over 110,000 items using this
property.


Half a million citations using P2860

James Hare has been working
on importing open access review papers published in the last 5 years as
well as their citation graph. These review papers are not only critical
to Wikimedia projects, as sources of citations for Wikipedia articles
and statements, but they also open license their contents, which will
allow semi-automated statement extraction via text and data mining
strategies. As part of this project, the property cites (P2860) created
during WikiCite 2016 has been used in over half a million statements representing
citations from one paper to another. While this is a tiny fraction of
the entire citation graph, it's a great way of making data available to
Wikimedia volunteers for crosslinking statements, sources and the works
they cite.


New properties

The Crossref funder ID property (P3153) can now be used to represent
data on the funders of particular works (when available), which will
allow novel analyses to be performed on sources for Wikidata statements
as a function of particular funders.

The uses property property (P3176), which Finn Årup Nielsen
conveniently dubbed the "selfie property", can now be used to identify
works that mention specific Wikidata properties.


  Events


WikiCite 2017

Our original plans to host the 2017 edition of WikiCite in San
Francisco in the week of January 16, 2017 (after the Wikimedia Developer
Summit) failed due to a major, Salesforce-style conference happening in
that week, which will bring tens of thousands of delegates to the city.
The WMF Travel team blacklisted that week for hosting events or meetings
in SF, since hotel rates will go through the roof. We're now looking at
alternative locations and dates in FY-Q4 (April-June 2017) in Europe,
most likely Berlin (like this year), or piggybacking on the 2017
Wikimedia Hackathon in Vienna (May
19-21, 2017), which will give us access to a large number of volunteers
as well as WMF and WMDE developers.


  Documentation


WikiCite 2016 Report

A draft report from WikiCite 2016 is available on Meta. It will be
closed in the coming days with the additional
information required by the funders of the event.


Book metadata modeling proposal

Chiara Storti and Andrea Zanni posted a proposal with examples to
address in a pragmatic way the complex issues surrounding
metadata modeling for books. If you're interested in the topic, please
chime in.


  Outreach

File:2016 VIVO Keynote - Dario Taraborelli.webm

Re: [Wikidata] Upload mathematical formulas in Wikidata

2016-09-21 Thread Marco Fossati

Hi Kaushal,

On 9/21/16 17:53, kaushal dudhat wrote:

We extracted a lot of them from Wikipedia. There are 17838 formulas now.
It would be great to get them uploaded into primary source tool.
The list of formulas in primary source tool syntax is attached here.

I'm happy to give you a hand.

Could you first repackage your dataset without the pickle format?
Statements seem to lack references.
Which Wikipedia language edition did you use to extract the data?
Perhaps you could just add an "imported from English Wikipedia" (P143
Q328) reference to each statement.
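
To illustrate, here is a minimal sketch that appends such a reference
to every line, assuming the dataset is in tab-separated QuickStatements
v1 syntax; the file names and the sample item/property in the comment
are placeholders, not your actual data.

# Append an "imported from" (S143) -> English Wikipedia (Q328)
# reference to every QuickStatements line that does not have one yet.
def add_reference(line, source='S143\tQ328'):
    line = line.rstrip('\n')
    if not line or source in line:
        return line
    return f'{line}\t{source}'

with open('formulas.qs') as src, open('formulas_with_refs.qs', 'w') as dst:
    for row in src:
        dst.write(add_reference(row) + '\n')

# A referenced statement line would then look like (placeholder
# item/property):
#   Q12345<TAB>P99999<TAB>"E = mc^2"<TAB>S143<TAB>Q328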


Looking forward to your next release.
Cheers,

Marco

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Let's move forward with support for Wiktionary

2016-09-14 Thread Marco Fossati

Hi everyone,

FYI, there is an ongoing Wikimedia IEG project (main grantee in CC), 
which seems to be following a related direction:

https://meta.wikimedia.org/wiki/Grants:IEG/A_graphical_and_interactive_etymology_dictionary_based_on_Wiktionary

Its first phase will translate Wiktionary into machine-readable data.
I think it is worth to consider reusing its outcome if possible, since 
it may fit into the Wikidata data model:

https://meta.wikimedia.org/wiki/Grants_talk:IEG/A_graphical_and_interactive_etymology_dictionary_based_on_Wiktionary#Translation_to_the_Wikidata_Data_Model

Cheers,

Marco

On 9/14/16 10:51, Léa Lacroix wrote:

Hello,

Thanks a lot for your questions and feedbacks. Here are some answers, I
hope these will be useful.

/- How wikidata and wiktionary databases will be synchronized?/
New entity types will be created in Wikidata database, with new ids (ex.
L for lexemes). A Wiktionary will have the possibility to include data
from Wikidata in their pages (the complete entity or only some chosen
statements, as the community decides)

/- Will editing wiktionary change?/
Yes, changes will happen, but we're working on making editing Wiktionary
easier. As soon as we can provide some mockups, we will share them
with you to collect feedback.

/- Will bots be allowed/able to edit Wiktionary pages after the support
of Wikidata in Wiktionary?/
Yes, of course. We want the data to be machine-readable and editable,
and with the new structure, bots will be able to edit data stored in
Wikidata as well as Wiktionary pages.

/- Can an edit in Wiktionary A break Wiktionary B?/
Data about lexemes will be stored on Wikidata, and Wiktionaries will
choose whether they want to use this data, which part of it and how.
Yes, if a piece of information stored in Wikidata and displayed on
several Wiktionaries is modified, via Wikidata or a Wiktionary
interface, this will affect all the pages where the information is
included.
Because Wikidata is a multilingual project, we already have to deal with
the language issue, and we hope that with the increase in the number of
editors coming from Wikidata and the Wiktionaries, it will become easier
to find people with at least one common language to communicate between
the different projects.

/- What else can Wikidata provide to Wiktionary?/
Machine-readable data will allow users to create new tools, useful for
editors, based on the communities' needs. By helping the different
communities (Wiktionaries and Wikidata) work together on the same
project, we expect growth in the number of people editing the
lexicographical data, providing more review and better data quality.
Finally, once centralized and structured, the data will be easily
reusable by third parties, other websites or applications... and give
better visibility to the volunteers' work.

On 13 September 2016 at 22:53, Amirouche Boubekki
<amirou...@hypermove.net> wrote:

Héllo,

I am very happy of this news.

I am a wiki newbie interested in using Wikidata to do text analysis.
I try to follow the discussion here and on the French Wiktionary.

I take this as an opportunity to try to sum up some concerns that have
been raised on the French Wiktionary [0]:

- How will the Wikidata and Wiktionary databases be synchronized?

- Will editing Wiktionary change? The concern is that this will make
editing Wiktionary more difficult for people.

- Also, what about bots? Will bots be allowed/able to edit
Wiktionary pages after the support of Wikidata in Wiktionary?

- Another concern is that edits done in one Wiktionary may have an
impact on another Wiktionary. People will have trouble reconciling
their opinions given that they don't speak the same language. Can an
edit in Wiktionary A break Wiktionary B?

I understand that Wikidata requires new code to support the
organisation of new relations between the data. I understand that
with Wikidata it will be easy to create interwiki links and
thesaurus-like pages, but what else can Wikidata provide to
Wiktionary?

[0] https://fr.wiktionary.org/wiki/Projet:Coop%C3%A9ration/Wikidata


Thanks,

i⋅am⋅amz3


On 2016-09-13 15:17, Lydia Pintscher wrote:

Hey everyone :)

Wiktionary is our third-largest sister project, both in terms of
active editors and readers. It is a unique resource, with the goal of
providing a dictionary for every language, in every language. Since the
beginning of Wikidata, but increasingly over the past months, I have
been getting more and more requests for supporting Wiktionary and
lexicographical data in Wikidata. Having this data available openly
and freely licensed would be a major step forward in automated
translation, text analysis, text generation and much more. It will
enable and ease research. And most im

Re: [Wikidata] [wikicite-discuss] Re: (semi-)automatic statement references fro Wikidata from DBpedia

2016-09-08 Thread Marco Fossati

Thanks Ben for reading my mind, I was about to provide the same pointer. :-)
Let's try to keep the discussion on the primary sources tool in one 
place as much as possible.


Cheers,

Marco

On 9/1/16 18:23, Benjamin Good wrote:

Dimitris,

This seems like a good way to seed a large scale data and reference import
process.  The trouble here is that Wikidata already has large amounts of
such potentially useful data (e.g. most of Freebase, the results of the
StrepHit NLP system, etc.) but the processes for moving it in have thus
far gone slowly.  In fact the author of the StrepHit system for mining
facts/references for Wikidata is shifting his focus entirely to
improving that part of the pipeline (known currently as the 'primary
sources' tool) as it is the bottleneck.  It would be great to see you
get involved there:
https://www.wikidata.org/wiki/Wikidata:Requests_for_comment/Semi-automatic_Addition_of_References_to_Wikidata_Statements

Once we have a good technical and social pattern for verifying predicted
claims and references at scale, we can get to the business of loading
that system up with good input.

my two cents..
-ben


On Thu, Sep 1, 2016 at 7:53 AM, Dimitris Kontokostas <jimk...@gmail.com> wrote:

Hmm, it is hard to interpret no feedback at all here; it could be
a) the data is not usable for Wikidata
b) this is not an interesting idea for Wikidata (now) or
c) this is not a good place to ask

Based on the very high activity on this list I could only guess (b),
even though this idea came from the Wikidata community 1+ year
ago. This is probably not relevant now.
https://lists.wikimedia.org/pipermail/wikidata/2015-June/006366.html


For reference, this is the prototype extractor that generated the
cited facts which can be run on newer dumps

https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/mappings/CitedFactsExtractor.scala



Best,
Dimitris

On Tue, Aug 30, 2016 at 9:16 PM, Dario Taraborelli
<dtarabore...@wikimedia.org> wrote:

cc'ing wikicite-discuss, this is going to be of relevance to
many people there too.

On Mon, Aug 29, 2016 at 11:09 PM, Dimitris Kontokostas
<jimk...@gmail.com> wrote:

You can have a look here.

http://downloads.dbpedia.org/temporary/citations/enwiki-20160305-citedFacts.tql.bz2


It is a quad file that contains DBpedia facts; I replaced
the context with the citation when the citation is on the
exact same line as the extracted fact, e.g.

<fact subject> <fact predicate> "An American in Paris"@en <citation URL> .

It is based on a complete English dump from ~April and
contains roughly 1M cited facts. This is more like a proof of
concept, and there are many ways to improve it and make it more
usable for Wikidata.
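
For anyone who wants to poke at the dump, here is a minimal Python
sketch of how such a quad file could be read with rdflib (assumptions:
a locally decompressed copy of the file, and that the .tql
serialization is accepted by the N-Quads parser):

from rdflib import ConjunctiveGraph

DUMP = "enwiki-20160305-citedFacts.tql"  # assumed local, decompressed copy

g = ConjunctiveGraph()
with open(DUMP, "rb") as f:
    # Loads everything into memory; fine for a sample, not for the full dump.
    g.parse(f, format="nquads")

for s, p, o, ctx in g.quads((None, None, None, None)):
    # The fourth position (the named graph) carries the citation URL found
    # on the same wikitext line as the extracted fact.
    print(s, p, o, "cited by", ctx.identifier)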

let me know what you think


On Mon, Aug 29, 2016 at 1:38 AM, Brill Lyle
<wp.brilll...@gmail.com> wrote:

Yes? I think so. Except I would like to see fuller
citations extracted / sampled from / to? I don't have
the technical skill to understand the extraction
completely, but yes, I think there is very rich data in
Wikipedia that is very extractable.

Could this approach be a good candidate for reference
suggestions in Wikidata?
(This particular one is already a reference, but the
anthem and GDP in the attachment are not, for example.)



- Erika

*Erika Herzog*
Wikipedia *User:BrillLyle*

On Sat, Aug 27, 2016 at 9:37 AM, Dimitris Kontokostas
<kontokos...@informatik.uni-leipzig.de> wrote:

Hi,

I have had this idea for some time now but never got to
test it or write it down.
DBpedia extracts detailed context information in
Quads (where possible) on where each triple came
  

Re: [Wikidata] StrepHit IEG renewal: call for support

2016-07-26 Thread Marco Fossati

Hi Gerard,

Thanks for your feedback.

On 7/24/16 18:07, Gerard Meijssen wrote:

Hoi,
Having read the proposal, I am more than happy to endorse this
follow-up. My previous annoyance with StrepHit was that it relied on the
Primary Sources tool. I am happy to note that it is now recognised how
deficient that tool is in the usability department.

The main thing will be to come up with strategies to involve people.
I completely agree: during the past 6 months I have spent considerable 
effort trying to engage new users.

The known usability issues seem to block the process.

If you have some specific thoughts on this, please feel free to add them 
in the renewal talk page:

https://meta.wikimedia.org/w/index.php?title=Grants_talk:IEG/StrepHit:_Wikidata_Statements_Validation_via_References/Renewal&action=edit&section=new

Cheers,

Marco


So far the Primary Sources tool has failed miserably in this department.
Now that it is no longer a Google project, we may finally consider
options without the miserable thought, held by some, of Google making
its mark. Our license is CC-0, so Google is welcome anyway to everything
we achieve.

My pet hate of the primary sources tool is that often there are
suggestions that already exist. This can be explained by the fact that a
lot of values were added without noticing the values already available.
It would however be good if these were removed without suggesting that
they were declined. I do this all the time and am sorry, because it
negatively impacts the Freebase statistics.

Anyway, the notion of the StrepHit functionality with improved methods
to add its results to Wikidata makes sense. Maybe it is even time to
reconsider many of the notions of Primary Sources.
Thanks,
 GerardM

On 24 July 2016 at 15:33, Marco Fossati <foss...@spaziodati.eu> wrote:

[Begging pardon if you have already read this in the Wikidata
project chat]

Hi everyone,

If you care about data quality, you probably know that high quality
is synonymous with references to trusted sources.

That's why the primary sources tool is out there as a Wikidata gadget:
https://www.wikidata.org/wiki/Wikidata:Primary_sources_tool

*The tool definitely needs an uplift.*
That's why I'm requesting a *renewal* of the StrepHit IEG.

Remember StrepHit, the Web agent that reads authoritative sources
and feeds Wikidata with references?
These 6 months of work have led to the release of the first version:
its datasets are now in the primary sources tool, together with
Freebase.
To support the IEG renewal, feel free to play with them!

Please follow the instructions in this request for comment to
activate the tool:

https://www.wikidata.org/wiki/Wikidata:Requests_for_comment/Semi-automatic_Addition_of_References_to_Wikidata_Statements

Are you satisfied with it? Do you agree with the current discussion?

If you have any remark for improvement, please help me refine the
renewal proposal via its talk page.
If you think the primary sources tool requires a boost, please
endorse the StrepHit IEG renewal!


https://meta.wikimedia.org/wiki/Grants:IEG/StrepHit:_Wikidata_Statements_Validation_via_References/Renewal

Cheers,

Marco

___
Wikidata mailing list
Wikidata@lists.wikimedia.org <mailto:Wikidata@lists.wikimedia.org>
https://lists.wikimedia.org/mailman/listinfo/wikidata




___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata



___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


[Wikidata] StrepHit IEG renewal: call for support

2016-07-24 Thread Marco Fossati

[Begging pardon if you have already read this in the Wikidata project chat]

Hi everyone,

If you care about data quality, you probably know that high quality is 
synonymous with references to trusted sources.


That's why the primary sources tool is out there as a Wikidata gadget:
https://www.wikidata.org/wiki/Wikidata:Primary_sources_tool

*The tool definitely needs an uplift.*
That's why I'm requesting a *renewal* of the StrepHit IEG.

Remember StrepHit, the Web agent that reads authoritative sources and 
feeds Wikidata with references?
These 6 months of work have led to the release of the first version: its 
datasets are now in the primary sources tool, together with Freebase.

To support the IEG renewal, feel free to play with them!

Please follow the instructions in this request for comment to activate 
the tool:

https://www.wikidata.org/wiki/Wikidata:Requests_for_comment/Semi-automatic_Addition_of_References_to_Wikidata_Statements

Are you satisfied with it? Do you agree with the current discussion?

If you have any remark for improvement, please help me refine the 
renewal proposal via its talk page.
If you think the primary sources tool requires a boost, please endorse 
the StrepHit IEG renewal!


https://meta.wikimedia.org/wiki/Grants:IEG/StrepHit:_Wikidata_Statements_Validation_via_References/Renewal

Cheers,

Marco

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Primary Sources Tool code has been moved to Wikidata org

2016-07-12 Thread Marco Fossati

Thanks for the heads-up, Lydia.
I assume future contributors won't have to sign a Google Contributor 
License Agreement, right?


Cheers,

Marco

On 7/12/16 17:11, Lydia Pintscher wrote:

Hey folks :)

Based on requests here Denny and I have worked on getting the Primary
Sources Tool code moved from the Google to the Wikidata organisation.
This has now happened and it is available at
https://github.com/Wikidata/primarysources from now on. I hope this
will lead to more contributions from more people as I believe it is an
important part of Wikidata's data flow.


Cheers
Lydia



___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] [ANNOUNCEMENT] StrepHit 1.0 Beta Release

2016-06-16 Thread Marco Fossati

Hi Satya,

the knowledge base produced by StrepHit could be queried by a QA system, 
pretty much as any structured knowledge base.

Not sure what you want to know though, could you please expand?
Cheers,

Marco

On 6/16/16 18:51, Satya Gadepalli wrote:

Can this be used as a factoid QA system?

thx


___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata



___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] [ANNOUNCEMENT] StrepHit 1.0 Beta Release

2016-06-15 Thread Marco Fossati

Hi Ben,

On 6/15/16 18:24, Benjamin Good wrote:

Hi Marco,

Where might we find some statistics on the current accuracy of the
automated claim and reference extractors?  I assume that information
must be in there somewhere, but I had trouble finding it.

The StrepHit pipeline (codebase) is ready, while the project is ongoing.
We are not there yet, and will publish performance values in the final 
report.


This is a very ambitious project covering a very large technical
territory (which I applaud).  It would be great if your results could be
synthesized a bit more clearly so we can understand where the
weak/strong points are and where we might be able to help improve or
make use of what you have done in other domains.

Sure, this will be done in the final report.
Up to now, you can have a look at the midpoint report summary:
https://meta.wikimedia.org/wiki/Grants:IEG/StrepHit:_Wikidata_Statements_Validation_via_References/Midpoint#Summary

Best,

Marco


-Ben


On Wed, Jun 15, 2016 at 9:06 AM, Marco Fossati <foss...@spaziodati.eu> wrote:

[Feel free to blame me if you read this more than once]

To whom it may interest,

Full of delight, I would like to announce the first beta release of
*StrepHit*:

https://github.com/Wikidata/StrepHit

TL;DR: StrepHit is an intelligent reading agent that understands
text and translates it into *referenced* Wikidata statements.
It is an IEG project funded by the Wikimedia Foundation.

Key features:
-Web spiders to harvest a collection of documents (corpus) from
reliable sources
-automatic corpus analysis to understand the most meaningful verbs
-sentences and semi-structured data extraction
-train a machine learning classifier via crowdsourcing
-*supervised and rule-based fact extraction from text*
-Natural Language Processing utilities
-parallel processing

You can find all the details here:

https://meta.wikimedia.org/wiki/Grants:IEG/StrepHit:_Wikidata_Statements_Validation_via_References

https://meta.wikimedia.org/wiki/Grants:IEG/StrepHit:_Wikidata_Statements_Validation_via_References/Midpoint

If you like it, star it on GitHub!

Best,

Marco

___
Wikidata mailing list
Wikidata@lists.wikimedia.org <mailto:Wikidata@lists.wikimedia.org>
https://lists.wikimedia.org/mailman/listinfo/wikidata




___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata



___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


[Wikidata] [ANNOUNCEMENT] StrepHit 1.0 Beta Release

2016-06-15 Thread Marco Fossati

[Feel free to blame me if you read this more than once]

To whom it may interest,

Full of delight, I would like to announce the first beta release of 
*StrepHit*:


https://github.com/Wikidata/StrepHit

TL;DR: StrepHit is an intelligent reading agent that understands text 
and translates it into *referenced* Wikidata statements.

It is an IEG project funded by the Wikimedia Foundation.

Key features:
-Web spiders to harvest a collection of documents (corpus) from reliable 
sources

-automatic corpus analysis to understand the most meaningful verbs
-sentences and semi-structured data extraction
-train a machine learning classifier via crowdsourcing
-*supervised and rule-based fact extraction from text*
-Natural Language Processing utilities
-parallel processing

You can find all the details here:
https://meta.wikimedia.org/wiki/Grants:IEG/StrepHit:_Wikidata_Statements_Validation_via_References
https://meta.wikimedia.org/wiki/Grants:IEG/StrepHit:_Wikidata_Statements_Validation_via_References/Midpoint

If you like it, star it on GitHub!

Best,

Marco

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] RFC - Primary Sources?

2016-06-15 Thread Marco Fossati

Hi Lydia,

On 6/15/16 07:42, Lydia Pintscher wrote:

Lydia - can you assign someone to come up to speed at whatever level

Denny requires to feel comfortable making the transfer?

I will take care of it with Denny in the next days.


Repasting part of a previous message with the list of requirements:

A. a developer to understand the back-end code [1], written in C++;
B. a developer to understand the front-end code [2], written in Javascript;
C. access to the WMF Labs machine to deploy the back-end [3];
D. a Wikidata administrator to deploy the front-end [4];
E. centralized and exhaustive documentation.

As part of the StrepHit project goals [5], my team is striving to help 
with A. (not exactly trivial) and C., but we really need B. and D. to be 
effective.


Best,

Marco

[1] https://github.com/google/primarysources/tree/master/backend
[2] https://github.com/google/primarysources/tree/master/frontend
[3] https://tools.wmflabs.org/wikidata-primary-sources
[4] 
https://github.com/google/primarysources/tree/master/frontend#deployment-on-wikidata
[5] 
https://meta.wikimedia.org/wiki/Grants:IEG/StrepHit:_Wikidata_Statements_Validation_via_References#Project_Goals


___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] RFC - Primary Sources?

2016-06-15 Thread Marco Fossati

Hi Tom,

On 6/14/16 19:26, Tom Morris wrote:

Marco - Centralizing the discussion is good, but why not pick one of the
three existing channels (issue tracker, project page, this mailing list)
rather than creating a fourth channel?
The RFC is meant to bring together low-level technical problems (issue 
tracker), usability discussions (project page), and less structured 
discussions (mailing list).

*And* comments on the uploaded datasets.

As much as I love playing and
watching soccer, I'm much more interested in the vast trove of
identifiers and other curated information in Freebase than I am in
improving Wikidata's soccer coverage, but the Primary Sources tool could
be useful for some portions of the Freebase data, if it could be usable.
I guess you are referring to the StrepHit prototype dataset 
'strephit-soccer'.
Why don't you try the 'strephit-testing' one? It deals with biographies 
and has much broader coverage.


Best,

Marco

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] RFC - Primary Sources?

2016-06-14 Thread Marco Fossati

Hi Tom and thanks Lydia for the clarification,

that request for comments (RFC) [1] aims at gathering feedback both on 
the primary sources tool and the available datasets (especially StrepHit 
[2]), which are closely intertwined: the dataset is in the tool, so 
people can play with both in one single interaction and leave their 
thoughts in the RFC.


Sorry if the title is misleading: the pipeline is indeed semi-automatic, 
as the StrepHit dataset is generated automatically, while its validation 
requires human attention.


Since I'm trying to centralize the discussion, it would be great if you 
could expand on the 3 fundamental questions you raised directly in the RFC.


Best,

Marco

[1] 
https://www.wikidata.org/wiki/Wikidata:Requests_for_comment/Semi-automatic_Addition_of_References_to_Wikidata_Statements
[2] 
https://meta.wikimedia.org/wiki/Grants:IEG/StrepHit:_Wikidata_Statements_Validation_via_References



On 6/14/16 08:27, Lydia Pintscher wrote:

On Tue, Jun 14, 2016 at 1:03 AM Tom Morris <tfmor...@gmail.com> wrote:

I'm confused by this from today's Wikidata weekly summary:

  * New request for comments: Semi-automatic Addition of References
to Wikidata Statements - feedback on the Primary Sources Tool



First of all, the title makes no sense because "semi-automatic
addition of references to Wikidata statements" is one of the main
things that the tool can't currently do. You'll almost always end up
with duplicate statements if there's an existing statement, rather
than the desired behavior of just adding the statement.

Second, I'm not sure who "Hjfocs" is (why does everyone have to make
up fake wikinames?), but why are they asking for more feedback when
there's been *ample* feedback already? There hasn't been an issue
with getting people to test the tool or provide feedback based on
the testing. The issue has been with getting anyone to *act* on the
feedback. Everything is a) "too hard," or b) "beyond our resources,"
or depends on something in category a or b, or is incompatible with
the arbitrary implementation scheme chosen, or some other excuse.

We're 12-18+ months into the project, depending on how you measure,
and not only is the tool not usable yet, but it's no longer
improving, so I think it's time to take a step back and ask some
fundamental questions.

- Is the current data pipeline and front end gadget the right
approach and the right technology for this task? Can they be fixed
to be suitable for users?
- If so, should Google continue to have sole responsibility for it
or should it be transferred to the Wikidata team or someone else
who'll actually work on it?
- If not, what should the data pipeline and tooling look like to
make maximum use of the Freebase data?

The whole project needs a reboot.


I realize you are upset but you are really barking up the wrong tree.
Marco is trying to give the whole thing more structure and sort through
all the requests to find a way forward. He is actually doing something
constructive about the issues you are raising.


Cheers
Lydia
--
Lydia Pintscher - http://about.me/lydia.pintscher
Product Manager for Wikidata

Wikimedia Deutschland e.V.
Tempelhofer Ufer 23-24
10963 Berlin
www.wikimedia.de 

Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e. V.

Eingetragen im Vereinsregister des Amtsgerichts Berlin-Charlottenburg
unter der Nummer 23855 Nz. Als gemeinnützig anerkannt durch das
Finanzamt für Körperschaften I Berlin, Steuernummer 27/029/42207.


___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata



___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


[Wikidata] Primary sources tool sustainability (was: Re: Primary sources tool "reject claim" broken?)

2016-05-31 Thread Marco Fossati

Dear all,

Currently, the primary sources tool maintenance and improvement 
processes are fairly sub-optimal, as:
1. the core team at Google does not have enough time to tackle bugs and 
feature requests;
2. the pull request/merge flow is insufficient on its own, since both the 
back-end and the front-end must also be deployed in production, 
possibly by someone else.


I would like to report here the requirements to make the tool sustainable:
A. a developer to understand the back-end code [1], written in C++;
B. a developer to understand the front-end code [2], written in Javascript;
C. access to the WMF Labs machine to deploy the back-end [3];
D. a Wikidata administrator to deploy the front-end [4];
E. centralized and exhaustive documentation.

As part of the StrepHit project goals [5], my team is striving to help 
with A. (not exactly trivial) and C., but we really need B. and D. to be 
effective.


Cheers,

Marco

[1] https://github.com/google/primarysources/tree/master/backend
[2] https://github.com/google/primarysources/tree/master/frontend
[3] https://tools.wmflabs.org/wikidata-primary-sources
[4] 
https://github.com/google/primarysources/tree/master/frontend#deployment-on-wikidata
[5] 
https://meta.wikimedia.org/wiki/Grants:IEG/StrepHit:_Wikidata_Statements_Validation_via_References#Project_Goals


On 5/31/16 14:00, wikidata-requ...@lists.wikimedia.org wrote:

Date: Tue, 31 May 2016 08:47:18 +
From: Sebastian Schaffert
To: Thomas Steiner, "Discussion list for the Wikidata project."
Subject: Re: [Wikidata] Primary sources tool "reject claim" broken?
Message-ID:
Content-Type: text/plain; charset="utf-8"

Hi Thomas and all, there might be a caching issue here. That part of the
code is here:
https://github.com/google/primarysources/blob/master/backend/service/SourcesToolBackend.cc#L115
and it still seems right to me, but I'll check again. I won't have much
time in the next days though :( I'll give it one hour now, maybe I
discover something.

Cheers,
Sebastian

On Tue, May 31, 2016 at 9:17 AM Thomas Steiner wrote:

>Hi Markus and Marco, all,
>
>Thanks for your support of and caring for the Primary Sources Tool.
>Please find my replies inline.
>

> >Dear ,

>
>I guess the core team still does, with the caveat explained by Denny
>in [1], the tl;dr is that we work on it on top of our regular jobs and
>that we are happy to hand it over to folks with more time on their
>hands.
>

> >The PS tool seems to break more and more. Besides the persisting issue with
> >duplicated claims being offered (even if they are already stored), there is
> >now also the issue that claims cannot be rejected. If I reject a claim, the
> >page reloads, but the suggestion still shows up after that.

>
>I checked both problems. It seems the writes from the front-end
>somehow do not make it to the back-end. I opened a random item Q632229
>and approved and rejected claims. The approval went through just fine
>[2].
>
>(i) However, I could reproduce the duplicate claims being shown, the
>reason is that the uniqueness comparison does not take references into
>account [3], a known @ToDo up for grabs.
>(ii) I could also in some cases reproduce the non-rejectable claims
>issue. I repeatedly disapproved statement 868483 [4], but if you query
>the back-end for incoming Freebase statements for Q632229, it keeps
>coming back as "unapproved" [5] (search for "868483").
>
>For (i), if someone wants to tackle this, happy to merge their Pull
>Request. For (ii), Sebastian, do you have a suspicion why this could
>be the case?
>
>Thanks,
>Tom
>
>--
>[1]
>https://lists.wikimedia.org/pipermail/wikidata/2016-February/008316.html
>[2]
>https://www.wikidata.org/w/index.php?title=Q632229&type=revision&diff=341928371&oldid=316931253
>[3]
>https://github.com/google/primarysources/blob/master/frontend/freebase2wikidata.js#L805
>[4]https://tools.wmflabs.org/wikidata-primary-sources/statements/868483
>[5]https://tools.wmflabs.org/wikidata-primary-sources/entities/Q632229
>
>--
>Dr. Thomas Steiner, Employee (http://blog.tomayac.com,
>https://twitter.com/tomayac)
>
>Google Germany GmbH, ABC-Str. 19, 20354 Hamburg, Germany
>Managing Directors: Matthew Scott Sucherman, Paul Terence Manicle
>Registration office and registration number: Hamburg, HRB 86891
>
>-BEGIN PGP SIGNATURE-
>Version: GnuPG v2.0.29 (GNU/Linux)
>
>
>iFy0uwAntT0bE3xtRa5AfeCheCkthAtTh3reSabiGbl0ck0fjumBl3DCharaCTersAttH3b0ttom
>hTtPs://xKcd.cOm/1181/
>-END PGP SIGNATURE-


___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Primary sources tool "reject claim" broken?

2016-05-30 Thread Marco Fossati

Sorry, I forgot to rename the "digest" subject, fixed now.

On 5/30/16 16:06, Marco Fossati wrote:

Hi Markus,

this is a known issue:
https://github.com/google/primarysources/issues/94

It seems to be related to the front-end: @Thomas, this and
https://github.com/google/primarysources/issues/107 are blocking the
usage of the tool.
Would it be possible for you to investigate them?
Cheers,

Marco

On 5/30/16 14:00, wikidata-requ...@lists.wikimedia.org wrote:

Date: Mon, 30 May 2016 10:19:32 +0200
From: Markus Kroetzsch
To: "Discussion list for the Wikidata project."

Subject: [Wikidata] Primary sources tool "reject claim" broken?
Message-ID:<574bf794.1060...@tu-dresden.de>
Content-Type: text/plain; charset=utf-8; format=flowed

Dear ,

The PS tool seems to break more and more. Besides the persisting issue
with duplicated claims being offered (even if they are already stored),
there is now also the issue that claims cannot be rejected. If I reject
a claim, the page reloads, but the suggestion still shows up after that.

Cheers,

Markus

-- Markus Kroetzsch Faculty of Computer Science Technische Universität
Dresden +49 351 463 38486 http://korrekt.org/


___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Wikidata Digest, Vol 54, Issue 23

2016-05-30 Thread Marco Fossati

Hi Markus,

this is a known issue:
https://github.com/google/primarysources/issues/94

It seems to be related to the front-end: @Thomas, this and 
https://github.com/google/primarysources/issues/107 are blocking the 
usage of the tool.

Would it be possible for you to investigate them?
Cheers,

Marco

On 5/30/16 14:00, wikidata-requ...@lists.wikimedia.org wrote:

Date: Mon, 30 May 2016 10:19:32 +0200
From: Markus Kroetzsch
To: "Discussion list for the Wikidata project."

Subject: [Wikidata] Primary sources tool "reject claim" broken?
Message-ID:<574bf794.1060...@tu-dresden.de>
Content-Type: text/plain; charset=utf-8; format=flowed

Dear ,

The PS tool seems to break more and more. Besides the persisting issue
with duplicated claims being offered (even if they are already stored),
there is now also the issue that claims cannot be rejected. If I reject
a claim, the page reloads, but the suggestion still shows up after that.

Cheers,

Markus

-- Markus Kroetzsch Faculty of Computer Science Technische Universität
Dresden +49 351 463 38486 http://korrekt.org/


___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] nice

2016-03-03 Thread Marco Fossati

Glad to see an effort that integrates data from both databases!

Marco

On 3/3/16 13:00, wikidata-requ...@lists.wikimedia.org wrote:

Date: Wed, 02 Mar 2016 22:00:03 +
From: Denny Vrandečić
To: "Discussion list for the Wikidata project."

Subject: Re: [Wikidata] nice
Message-ID:

Content-Type: text/plain; charset="utf-8"

(and to make it clear, it is unclear whether this is an error due to
DBpedia or due to the company's extraction framework; I did not dive into
the data)

On Wed, Mar 2, 2016 at 1:59 PM Denny Vrandečić  wrote:


>Depends how good the DBpedia data really is - as the BBC article says,
>some 2007 football match in the UK was extracted as a "Battle"...
>
>On Wed, Mar 2, 2016 at 1:54 PM Daniel Kinzler
>wrote:
>

>>"They found 12,703 battles which had an exact location and date, 2,657 of
>>them
>>are from Wikidata, the others are from DPpedia."
>>
>>Maybe we can do better?
>>
>>Am 02.03.2016 um 22:14 schrieb Lydia Pintscher:

>> >On Wed, Mar 2, 2016 at 8:14 PM Gerard Meijssen <

>>gerard.meijs...@gmail.com

>> >> wrote:
>> >
>> > Hoi,
>> > Yup I missed that one.. this [1] was my source:)
>> > Gerard
>> >
>> > [1]http://www.bbc.com/news/magazine-35685889
>> >
>> >
>> >This is really great. I am thrilled about this because this isn't

>>coverage about

>> >Wikidata but coverage_with_  Wikidata on major news sites for the

>>second time

>> >this week
>> >(

>>http://www.faz.net/aktuell/feuilleton/kino/academy-awards-die-oscars-von-1929-bis-heute-12820119.html
>>being

>> >the other one). They're using Wikidata data to do meaningful reporting.

>>Our data

>> >and the project as a whole got (at the very least) good enough for

>>this. It

>> >feels to me like we've broken through a wall.
>> >High5 everyone! :D
>> >
>> >Cheers
>> >Lydia
>> >--
>> >Lydia Pintscher -http://about.me/lydia.pintscher
>> >Product Manager for Wikidata
>> >
>> >Wikimedia Deutschland e.V.
>> >Tempelhofer Ufer 23-24
>> >10963 Berlin
>> >www.wikimedia.de  
>> >
>> >Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e. V.
>> >
>> >Eingetragen im Vereinsregister des Amtsgerichts Berlin-Charlottenburg

>>unter der

>> >Nummer 23855 Nz. Als gemeinnützig anerkannt durch das Finanzamt für
>> >Körperschaften I Berlin, Steuernummer 27/029/42207.
>> >
>> >
>> >___
>> >Wikidata mailing list
>> >Wikidata@lists.wikimedia.org
>> >https://lists.wikimedia.org/mailman/listinfo/wikidata
>> >

>>
>>
>>--
>>Daniel Kinzler
>>Senior Software Developer
>>
>>Wikimedia Deutschland
>>Gesellschaft zur Förderung Freien Wissens e.V.
>>
>>___
>>Wikidata mailing list
>>Wikidata@lists.wikimedia.org
>>https://lists.wikimedia.org/mailman/listinfo/wikidata
>>

>


___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


[Wikidata] Primary sources tool (was: Re: Freebase to Wikidata: Results from Tpt internship)

2016-02-22 Thread Marco Fossati

Hi Tom,

FYI, the primary sources tool is not dead: besides Freebase, it will 
also cater for other datasets.


The StrepHit team will take care of it in the next few months, as per 
one of the project goals [1].
The code repository is owned by Google, and the StrepHit team will 
collaborate with the maintainers via the standard pull 
request/review/merge process.


Cheers,

Marco

[1] 
https://meta.wikimedia.org/wiki/Grants:IEG/StrepHit:_Wikidata_Statements_Validation_via_References#Project_Goals


On 2/22/16 13:00, wikidata-requ...@lists.wikimedia.org wrote:

From: Tom Morris
To: "Discussion list for the Wikidata project."

Subject: Re: [Wikidata] Freebase to Wikidata: Results from Tpt
internship
Message-ID:

Content-Type: text/plain; charset="utf-8"

Are there plans for next steps or is this the end of the project as far as
>the two of you go?
>

I'm going to assume that the lack of answer to this question over the last
four months, the lack of updates on the project, and the fact no one is
even bothering to respond to issues means that this project
is dead and abandoned.  That's pretty sad. For an internship, it sounds
like a cool project and a decent result. As an actual serious attempt to
make productive use of the Freebase data, it's a weak, half-hearted effort
by Google.


___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] from Freebase to Wikidata: the great migration

2016-02-19 Thread Marco Fossati

I couldn't wait for a detailed description of the primary sources tool.
Thanks a lot to the authors for mentioning the StrepHit soccer dataset!

Cheers,

Marco

On 2/19/16 13:00, wikidata-requ...@lists.wikimedia.org wrote:

Date: Thu, 18 Feb 2016 11:07:41 -0600
From: Maximilian Klein
To: "Discussion list for the Wikidata project."

Subject: Re: [Wikidata] from Freebase to Wikidata: the great migration
Message-ID:

Content-Type: text/plain; charset="utf-8"

Congratulations on a fantastic project and on your acceptance at WWW 2016.

Make a great day,
Max Klein ‽http://notconfusing.com/

On Thu, Feb 18, 2016 at 10:54 AM, Federico Leva (Nemo)
wrote:


>Lydia Pintscher, 18/02/2016 15:59:
>

>>Thomas, Denny, Sebastian, Thomas, and I have published a paper which was
>>accepted for the industry track at WWW 2016. It covers the migration
>>from Freebase to Wikidata. You can now read it here:
>>http://research.google.com/pubs/archive/44818.pdf
>>
>>

>Nice!
>

> >Concluding, in a fairly short amount of time, we have been
> >able to provide the Wikidata community with more than
> >14 million new Wikidata statements using a customizable

>
>I must admit that, despite knowing the context, I wasn't able to
>understand whether this is the number of "mapped"/"translated" statements
>or the number of statements actually added via the primary sources tool. I
>assume the latter given paragraph 5.3:
>

> >after removing duplicates and facts already contained in Wikidata, we
> >obtain 14 million new statements. If all these statements were added to
> >Wikidata, we would see a 21% increase of the number of statements in
> >Wikidata.

>

I was confused about that too. "the [Primary Sources] tool has been
used by more than a hundred users who performed about
90,000 approval or rejection actions. More than 14 million
statements have been uploaded in total."  I think that means that ≤ 90,000
items or statements were added of 14 million available to be add through
Primary Sources tool.


>
>Nemo
>
>___
>Wikidata mailing list
>Wikidata@lists.wikimedia.org
>https://lists.wikimedia.org/mailman/listinfo/wikidata
>


___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


[Wikidata] Reliable sources list validation for StrepHit

2016-01-26 Thread Marco Fossati

Hi everyone,

The curated list of biographical sources for StrepHit has now passed the 
objective of 40 items [1].


Your help in validating the list is essential to ensure the reliability 
of the corpus that will be collected upon it.


In practice, are the sources:
1. *reliable* (cf. [2])?
2. *third-party*, i.e., not created by users of Wikimedia projects?

I kindly ask you to answer those questions in the discussion page of [1].
Cheers,

Marco

[1] 
https://meta.wikimedia.org/wiki/Grants:IEG/StrepHit:_Wikidata_Statements_Validation_via_References/Timeline#Biographies

[2] https://en.wikipedia.org/wiki/Wikipedia:Verifiability#Reliable_sources

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


[Wikidata] StrepHit domain + sources selection (was Re: [REMINDER] StrepHit IEG project kick-off seminar)

2016-01-25 Thread Marco Fossati

Hi Daniel,

Thanks for getting in touch and for the useful information.
After 2 rounds of feedback from the community via StrepHit's 
dissemination activities, I have opted for the biographical domain.
Cf. 
https://meta.wikimedia.org/wiki/Grants:IEG/StrepHit:_Wikidata_Statements_Validation_via_References/Timeline#Biographies


The biomedical domain is next in line: while it is out of scope for the 
project time frame (6 months), I believe it will be an excellent use 
case in case of an extension.


Thanks again.
Cheers,

Marco


Date: Sat, 23 Jan 2016 23:00:26 +0100
From: Daniel Mietchen
To: "Discussion list for the Wikidata project."

Subject: Re: [Wikidata] [REMINDER] StrepHit IEG project kick-off
seminar
Message-ID:

Content-Type: text/plain; charset=UTF-8

Thanks, Marco - I just watched the video and would be interested in
knowing whether you have picked your focus by now from the three
domains you suggested (biographies, companies, biomedical) or perhaps
something else. If you go for biomedical, there would be overlap with
https://www.wikidata.org/wiki/Wikidata:WikiProject_Source_MetaData
and
https://www.wikidata.org/wiki/Wikidata:WikiProject_Medicine
and probably also
https://www.wikidata.org/wiki/User:ProteinBoxBot  ,
all of which have a number of active people who I'd expect to be
interested in a test run of your pipeline on biomedical topics.

In any case, I would welcome it if you would include PubMed into your
set of reliable third-party sources. We already have about 17k
Wikidata items with a PubMed ID (cf.
http://tools.wmflabs.org/autolist/autolist1.html?q=claim%5B698%5D  ),
which are increasingly being used as references for Wikidata
statements (cf.http://tinyurl.com/zhauolt  ), but we have barely
scratched the surface of what needs to be done here (the English
Wikipedia alone has ca. 30k medical articles, many of them heavily
referenced to sources indexed by PubMed).

So far, the link of those 17k items to Wikipedia is just that almost
all of the corresponding papers have been cited on some Wikipedia
(some come also from Wikispecies, Wikisource or other Wikimedia
sites), albeit not always using a PubMed ID (sometimes via DOI, PubMed
Central ID or via a link or in some other way).

Daniel

On Fri, Jan 15, 2016 at 10:47 AM, Marco Fossati  wrote:

>Hi everyone,
>
>the seminar will start in a few minutes.
>Cheers,
>
>Marco
>
>On 1/11/16 16:52, Marco Fossati wrote:

>>
>>Here is the link for the online streaming:
>>https://youtu.be/uvfd_HmPOrc
>>
>>Cheers,
>>
>>Marco
>>
>>2016-01-11 16:11 GMT+01:00 Marco Fossati ><mailto:foss...@spaziodati.eu>>:
>>
>> Dear all,
>>
>> This is a kind reminder for the upcoming StrepHit IEG project
>> kick-off seminar.
>> Schedule: 15 January 2016, 11:00 am
>>
>> **Important update:** the location has moved to downtown Trento.
>> **New location:** Aula Grande - Fondazione Bruno Kessler, Via
>> S.Croce 77, Trento, Italy -http://www.openstreetmap.org/way/67197096
>>
>> The seminar will be streamed online, a link will be shared as soon
>> as it is available.
>>
>> See you in Trento!
>> Cheers,
>>
>> Marco
>>
>> 2015-12-23 17:03 GMT+01:00 Marco Fossati > <mailto:foss...@spaziodati.eu>>:
>>
>>
>> [Begging pardon if you read this multiple times]
>>
>> Hi everyone,
>>
>> I would like to announce with great pleasure the StrepHit IEG
>> project kick-off seminar.
>> Of course, you are all invited to attend.
>>
>> The event will be held in a special day: Wikipedia's birthday!
>>
>> Below you can find the details.
>>
>> Schedule: 15 January 2016, 11:00 am, Luigi Stringa Conference Room
>> Location: Fondazione Bruno Kessler, Via Sommarive 18, Povo,
>> Trento, Italy -http://www.openstreetmap.org/way/28933739
>>
>> Abstract: We kick-off StrepHit, a project funded by the
>> Wikimedia Foundation through the Individual Engagement Grants
>> program.
>> StrepHit is a Natural Language Processing pipeline that
>> understands human language, extracts facts from text and
>> produces Wikidata statements with reference URLs.
>> It will enhance the data quality of Wikidata by suggesting
>> references to validate statements, and will help Wikidata become
>> the gold-standard hub of the Open Data landscape.
>>
>> Link:
>>
>>https://meta.wikimedia.org/wiki/Grants:IEG/StrepHit:_Wikidata_Statements_Validation_via_References
&

[Wikidata] Sourcerer / Sourcery (was: Re: Mix'n'match tool catalogues data)

2016-01-22 Thread Marco Fossati

Hi Magnus,


>I was aware of the Sourcerer tool: I'm concerned with those references
>coming from Wikipedia articles though, since they stem from inside a
>Wikimedia project, and I want to make sure that everything comes from
>the outside.
>

The Sourcerer references do NOT come from Wikipedia! I am using third-party
sites for which we already have IDs (e.g. GND) to auto-validate values, and
add the appropriate reference if identical. Basically, what you want to do,
on the cheap;-)

Whoops, sorry, I was probably confused by:
1. the Sourcerer **user script** [1], which claims to "get a list of all 
external links from all language editions of Wikipedia for an item";
2. the **Sourcery** tool [2], which claims to load "all URLs in all 
associated Wikipedia pages".


So, third-party references indeed, but still curated by Wikipedians, right?
That's what I meant, I should have been more specific in my concern.

Is it correct that the Sourcerer **bot** is a different thing, or am I 
getting this completely wrong?


Cheers,

Marco

[1] https://www.wikidata.org/wiki/Wikidata:Tools/User_scripts#Sourcerer
[2] https://tools.wmflabs.org/wikidata-todo/sourcery.html

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Mix'n'match tool catalogues data

2016-01-21 Thread Marco Fossati
Thanks Magnus for the pointers, the mix'n'match data are like jewels for 
StrepHit.
Does the database you mentioned also contain the **body** of the 
catalog items, i.e., the raw text of biographies?

If so, I can avoid scraping all those sources, and it would be just perfect.

As a side note, I'm currently reaching out to GLAM people to ask for more 
biographical sources: links are coming, and I'll definitely import them 
into mix'n'match as well.


I was aware of the Sourcerer tool: I'm concerned with those references 
coming from Wikipedia articles though, since they stem from inside a 
Wikimedia project, and I want to make sure that everything comes from 
the outside.

I'm open for discussion with the community about this.

What do you think?

Cheers,

Marco

On 1/21/16 13:00, wikidata-requ...@lists.wikimedia.org wrote:

Date: Wed, 20 Jan 2016 16:44:46 +
From: Magnus Manske
To: "Discussion list for the Wikidata project."

Subject: Re: [Wikidata] Mix'n'match tool catalogues data
Message-ID:

Content-Type: text/plain; charset="utf-8"

I also have a bot that can add references from various web sources:
https://bitbucket.org/magnusmanske/wikidata-todo/src/f56dfdaaaee053abaadefb584fcb4f714bc82545/scripts/autosource/botsource.php?at=master&fileviewer=file-view-default

Edits so far:
https://www.wikidata.org/wiki/Special:Contributions/SourcererBot


On Wed, Jan 20, 2016 at 4:42 PM Magnus Manske
wrote:


>Hi Marco,
>
>I run this tool. Quick answers:
>
>1. Yes. If you have a Labs account, you can see everything in database
>s51434__mixnmatch_p . You can also get most of the data via the API
>(undocumented; ask me for specifics, check out the requests of the
>interface in the browser, or try the source code at
>https://bitbucket.org/magnusmanske/mixnmatch/src/63c9ba58dd236e0aeb5a7ad12315047d787530f0/public_html/api.php?at=master&fileviewer=file-view-default
>)
>
>2. Anyone can match entries to Wikidata items. I added most of the
>catalogs, but you can also do that yourself at
>https://tools.wmflabs.org/mix-n-match/import.php  .
>
>Cheers,
>Magnus
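
A minimal sketch of reading that database from a Labs/Tools account with
Python (the host alias and the table/column names below are assumptions,
not documented schema; credentials are taken from ~/replica.my.cnf):

import configparser
import os

import pymysql

cfg = configparser.ConfigParser()
cfg.read(os.path.expanduser("~/replica.my.cnf"))

conn = pymysql.connect(
    host="tools-db",  # assumed ToolsDB host alias
    database="s51434__mixnmatch_p",
    user=cfg["client"]["user"].strip("'"),
    password=cfg["client"]["password"].strip("'"),
    charset="utf8mb4",
)
try:
    with conn.cursor() as cur:
        # "entry" and its columns are guesses at the schema, for illustration.
        cur.execute("SELECT id, ext_id, ext_name, q FROM entry LIMIT 10")
        for row in cur.fetchall():
            print(row)
finally:
    conn.close()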
>
>On Wed, Jan 20, 2016 at 4:31 PM Marco Fossati
>wrote:
>

>>Hi everyone,
>>
>>The mix'n'match tool [1] provides a list of catalogues from different
>>sources with lots of biographical data.
>>
>>The list seems like a great starting point for the selection of reliable
>>sources that would feed the StrepHit pipeline [2].
>>
>>I was wondering 2 things:
>>1. Is it possible to directly access those datasets?
>>2. Who are the contributors that maintain the list?
>>
>>If you are involved into this effort, please get in touch with me.
>>Cheers,
>>
>>Marco
>>
>>[1]https://tools.wmflabs.org/mix-n-match/
>>[2]
>>
>>https://meta.wikimedia.org/wiki/Grants:IEG/StrepHit:_Wikidata_Statements_Validation_via_References
>>
>>___
>>Wikidata mailing list
>>Wikidata@lists.wikimedia.org
>>https://lists.wikimedia.org/mailman/listinfo/wikidata
>>

>


___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


[Wikidata] Mix'n'match tool catalogues data

2016-01-20 Thread Marco Fossati

Hi everyone,

The mix'n'match tool [1] provides a list of catalogues from different 
sources with lots of biographical data.


The list seems like a great starting point for the selection of reliable 
sources that would feed the StrepHit pipeline [2].


I was wondering 2 things:
1. Is it possible to directly access those datasets?
2. Who are the contributors that maintain the list?

If you are involved into this effort, please get in touch with me.
Cheers,

Marco

[1] https://tools.wmflabs.org/mix-n-match/
[2] 
https://meta.wikimedia.org/wiki/Grants:IEG/StrepHit:_Wikidata_Statements_Validation_via_References


___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] [REMINDER] StrepHit IEG project kick-off seminar

2016-01-15 Thread Marco Fossati

Hi everyone,

the seminar will start in a few minutes.
Cheers,

Marco

On 1/11/16 16:52, Marco Fossati wrote:

Here is the link for the online streaming:
https://youtu.be/uvfd_HmPOrc

Cheers,

Marco

2016-01-11 16:11 GMT+01:00 Marco Fossati <foss...@spaziodati.eu>:

Dear all,

This is a kind reminder for the upcoming StrepHit IEG project
kick-off seminar.
Schedule: 15 January 2016, 11:00 am

**Important update:** the location has moved to downtown Trento.
**New location:** Aula Grande - Fondazione Bruno Kessler, Via
S.Croce 77, Trento, Italy - http://www.openstreetmap.org/way/67197096

The seminar will be streamed online, a link will be shared as soon
as it is available.

See you in Trento!
Cheers,

Marco

2015-12-23 17:03 GMT+01:00 Marco Fossati <foss...@spaziodati.eu>:

[Begging pardon if you read this multiple times]

Hi everyone,

I would like to announce with great pleasure the StrepHit IEG
project kick-off seminar.
Of course, you are all invited to attend.

The event will be held on a special day: Wikipedia's birthday!

Below you can find the details.

Schedule: 15 January 2016, 11:00 am, Luigi Stringa Conference Room
Location: Fondazione Bruno Kessler, Via Sommarive 18, Povo,
Trento, Italy - http://www.openstreetmap.org/way/28933739

Abstract: We kick-off StrepHit, a project funded by the
Wikimedia Foundation through the Individual Engagement Grants
program.
StrepHit is a Natural Language Processing pipeline that
understands human language, extracts facts from text and
produces Wikidata statements with reference URLs.
It will enhance the data quality of Wikidata by suggesting
references to validate statements, and will help Wikidata become
the gold-standard hub of the Open Data landscape.

Link:

https://meta.wikimedia.org/wiki/Grants:IEG/StrepHit:_Wikidata_Statements_Validation_via_References

    Speaker's bio: Marco Fossati is a researcher with a double
background in Natural Languages and Information Technologies. He
works at the Data and Knowledge Management (DKM) research unit
at Fondazione Bruno Kessler, Trento, Italy. He is a member of the
DBpedia Association board of trustees, founder and
representative of its Italian chapter. He has interdisciplinary
skills both in linguistics and in programming. His research
focuses on bridging the gap between Natural Language Processing
techniques and Large Scale Structured Knowledge Bases in order
to drive the Web of Data towards its full potential.

See you in Trento and long live Wikipedia!
Cheers,

Marco





___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] [REMINDER] StrepHit IEG project kick-off seminar

2016-01-11 Thread Marco Fossati
Here is the link for the online streaming:
https://youtu.be/uvfd_HmPOrc

Cheers,

Marco

2016-01-11 16:11 GMT+01:00 Marco Fossati :

> Dear all,
>
> This is a kind reminder for the upcoming StrepHit IEG project kick-off
> seminar.
> Schedule: 15 January 2016, 11:00 am
>
> **Important update:** the location has moved to downtown Trento.
> **New location:** Aula Grande - Fondazione Bruno Kessler, Via S.Croce 77,
> Trento, Italy - http://www.openstreetmap.org/way/67197096
>
> The seminar will be streamed online, a link will be shared as soon as it
> is available.
>
> See you in Trento!
> Cheers,
>
> Marco
>
> 2015-12-23 17:03 GMT+01:00 Marco Fossati :
>
>> [Begging pardon if you read this multiple times]
>>
>> Hi everyone,
>>
>> I would like to announce with great pleasure the StrepHit IEG project
>> kick-off seminar.
>> Of course, you are all invited to attend.
>>
>> The event will be held on a special day: Wikipedia's birthday!
>>
>> Below you can find the details.
>>
>> Schedule: 15 January 2016, 11:00 am, Luigi Stringa Conference Room
>> Location: Fondazione Bruno Kessler, Via Sommarive 18, Povo, Trento, Italy
>> - http://www.openstreetmap.org/way/28933739
>>
>> Abstract: We kick-off StrepHit, a project funded by the Wikimedia
>> Foundation through the Individual Engagement Grants program.
>> StrepHit is a Natural Language Processing pipeline that understands human
>> language, extracts facts from text and produces Wikidata statements with
>> reference URLs.
>> It will enhance the data quality of Wikidata by suggesting references to
>> validate statements, and will help Wikidata become the gold-standard hub of
>> the Open Data landscape.
>>
>> Link:
>> https://meta.wikimedia.org/wiki/Grants:IEG/StrepHit:_Wikidata_Statements_Validation_via_References
>>
>> Speaker's bio: Marco Fossati is a researcher with a double background in
>> Natural Languages and Information Technologies. He works at the Data and
>> Knowledge Management (DKM) research unit at Fondazione Bruno Kessler,
>> Trento, Italy. He is a member of the DBpedia Association board of trustees,
>> founder and representative of its Italian chapter. He has interdisciplinary
>> skills both in linguistics and in programming. His research focuses on
>> bridging the gap between Natural Language Processing techniques and Large
>> Scale Structured Knowledge Bases in order to drive the Web of Data towards
>> its full potential.
>>
>> See you in Trento and long live Wikipedia!
>> Cheers,
>>
>> Marco
>>
>
>
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


[Wikidata] [REMINDER] StrepHit IEG project kick-off seminar

2016-01-11 Thread Marco Fossati
Dear all,

This is a kind reminder for the upcoming StrepHit IEG project kick-off
seminar.
Schedule: 15 January 2016, 11:00 am

**Important update:** the location has moved to downtown Trento.
**New location:** Aula Grande - Fondazione Bruno Kessler, Via S.Croce 77,
Trento, Italy - http://www.openstreetmap.org/way/67197096

The seminar will be streamed online, a link will be shared as soon as it is
available.

See you in Trento!
Cheers,

Marco

2015-12-23 17:03 GMT+01:00 Marco Fossati :

> [Begging pardon if you read this multiple times]
>
> Hi everyone,
>
> I would like to announce with great pleasure the StrepHit IEG project
> kick-off seminar.
> Of course, you are all invited to attend.
>
> The event will be held on a special day: Wikipedia's birthday!
>
> Below you can find the details.
>
> Schedule: 15 January 2016, 11:00 am, Luigi Stringa Conference Room
> Location: Fondazione Bruno Kessler, Via Sommarive 18, Povo, Trento, Italy
> - http://www.openstreetmap.org/way/28933739
>
> Abstract: We kick-off StrepHit, a project funded by the Wikimedia
> Foundation through the Individual Engagement Grants program.
> StrepHit is a Natural Language Processing pipeline that understands human
> language, extracts facts from text and produces Wikidata statements with
> reference URLs.
> It will enhance the data quality of Wikidata by suggesting references to
> validate statements, and will help Wikidata become the gold-standard hub of
> the Open Data landscape.
>
> Link:
> https://meta.wikimedia.org/wiki/Grants:IEG/StrepHit:_Wikidata_Statements_Validation_via_References
>
> Speaker's bio: Marco Fossati is a researcher with a double background in
> Natural Languages and Information Technologies. He works at the Data and
> Knowledge Management (DKM) research unit at Fondazione Bruno Kessler,
> Trento, Italy. He is a member of the DBpedia Association board of trustees,
> founder and representative of its Italian chapter. He has interdisciplinary
> skills both in linguistics and in programming. His research focuses on
> bridging the gap between Natural Language Processing techniques and Large
> Scale Structured Knowledge Bases in order to drive the Web of Data towards
> its full potential.
>
> See you in Trento and long live Wikipedia!
> Cheers,
>
> Marco
>
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] [ANNOUNCEMENT] StrepHit IEG project kick-off seminar

2015-12-27 Thread Marco Fossati
Hi Dario,

Date: Wed, 23 Dec 2015 08:04:33 -0800
> From: Dario Taraborelli 
> To: "Discussion list for the Wikidata project."
> 
> Subject: Re: [Wikidata] [ANNOUNCEMENT] StrepHit IEG project kick-off
> seminar
> Message-ID:
>  ugvjnbb-mjx...@mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"
>
> Hi Marco,
>
> will the seminar be streamed or recorded?
>
I have to check with FBK's staff, it should be straightforward.
I will take care of sharing the link with everyone.

Cheers,

Marco

>
> Dario
>
> On Wed, Dec 23, 2015 at 8:03 AM, Marco Fossati 
> wrote:
>
> > [Begging pardon if you read this multiple times]
> >
> > Hi everyone,
> >
> > I would like to announce with great pleasure the StrepHit IEG project
> > kick-off seminar.
> > Of course, you are all invited to attend.
> >
> > The event will be held on a special day: Wikipedia's birthday!
> >
> > Below you can find the details.
> >
> > Schedule: 15 January 2016, 11:00 am, Luigi Stringa Conference Room
> > Location: Fondazione Bruno Kessler, Via Sommarive 18, Povo, Trento, Italy
> > - http://www.openstreetmap.org/way/28933739
> >
> > Abstract: We kick off StrepHit, a project funded by the Wikimedia
> > Foundation through the Individual Engagement Grants program.
> > StrepHit is a Natural Language Processing pipeline that understands human
> > language, extracts facts from text and produces Wikidata statements with
> > reference URLs.
> > It will enhance the data quality of Wikidata by suggesting references to
> > validate statements, and will help Wikidata become the gold-standard hub
> of
> > the Open Data landscape.
> >
> > Link:
> >
> https://meta.wikimedia.org/wiki/Grants:IEG/StrepHit:_Wikidata_Statements_Validation_via_References
> >
> > Speaker's bio: Marco Fossati is a researcher with a double background in
> > Natural Languages and Information Technologies. He works at the Data and
> > Knowledge Management (DKM) research unit at Fondazione Bruno Kessler,
> > Trento, Italy. He is a member of the DBpedia Association board of trustees,
> > founder and representative of its Italian chapter. He has
> interdisciplinary
> > skills both in linguistics and in programming. His research focuses on
> > bridging the gap between Natural Language Processing techniques and Large
> > Scale Structured Knowledge Bases in order to drive the Web of Data
> towards
> > its full potential.
> >
> > See you in Trento and long live Wikipedia!
> > Cheers,
> >
> > Marco
> >
> > ___
> > Wikidata mailing list
> > Wikidata@lists.wikimedia.org
> > https://lists.wikimedia.org/mailman/listinfo/wikidata
> >
> >
>
>
> --
>
>
> *Dario Taraborelli  *Head of Research, Wikimedia Foundation
> wikimediafoundation.org • nitens.org • @readermeter
> <http://twitter.com/readermeter>
>
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


[Wikidata] [ANNOUNCEMENT] StrepHit IEG project kick-off seminar

2015-12-23 Thread Marco Fossati
[Begging pardon if you read this multiple times]

Hi everyone,

I would like to announce with great pleasure the StrepHit IEG project
kick-off seminar.
Of course, you are all invited to attend.

The event will be held on a special day: Wikipedia's birthday!

Below you can find the details.

Schedule: 15 January 2016, 11:00 am, Luigi Stringa Conference Room
Location: Fondazione Bruno Kessler, Via Sommarive 18, Povo, Trento, Italy -
http://www.openstreetmap.org/way/28933739

Abstract: We kick off StrepHit, a project funded by the Wikimedia
Foundation through the Individual Engagement Grants program.
StrepHit is a Natural Language Processing pipeline that understands human
language, extracts facts from text and produces Wikidata statements with
reference URLs.
It will enhance the data quality of Wikidata by suggesting references to
validate statements, and will help Wikidata become the gold-standard hub of
the Open Data landscape.

Link:
https://meta.wikimedia.org/wiki/Grants:IEG/StrepHit:_Wikidata_Statements_Validation_via_References

Speaker's bio: Marco Fossati is a researcher with a double background in
Natural Languages and Information Technologies. He works at the Data and
Knowledge Management (DKM) research unit at Fondazione Bruno Kessler,
Trento, Italy. He is a member of the DBpedia Association board of trustees,
founder and representative of its Italian chapter. He has interdisciplinary
skills both in linguistics and in programming. His research focuses on
bridging the gap between Natural Language Processing techniques and Large
Scale Structured Knowledge Bases in order to drive the Web of Data towards
its full potential.

See you in Trento and long live Wikipedia!
Cheers,

Marco
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


[Wikidata] StrepHit won the IEG grants selection

2015-12-07 Thread Marco Fossati
Dear all,

I have no words to express how happy I am: StrepHit [1] has been selected
as an IEG project [2]!!!

I'd like to express my gratitude to all the community members who have
provided feedback and endorsements.
Thanks, thanks, thanks for believing in the idea.

Cheers!
-- 
Marco Fossati
http://about.me/marco.fossati
Twitter: @hjfocs
Skype: hell_j

[1]
https://meta.wikimedia.org/wiki/Grants:IEG/StrepHit:_Wikidata_Statements_Validation_via_References
[2] https://blog.wikimedia.org/2015/12/04/ieg-funds-fourteen-projects/
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


[Wikidata] Wikidata Mentors at the Google Summer of Code Summit

2015-11-02 Thread Marco Fossati

Hi everyone,

I was just wondering whether any Wikidatan will be present at the 
upcoming Google Summer of Code Mentor summit:

https://sites.google.com/site/gsoc2015ms/
If so, it would be cool to meet there; just ping me before the summit.
Cheers,
--
Marco Fossati
http://about.me/marco.fossati
Twitter: @hjfocs
Skype: hell_j

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


[Wikidata] StrepHit IEG proposal: last call for support

2015-10-14 Thread Marco Fossati

Dear all,

This is a last call for supporting the StrepHit IEG proposal before the 
formal review period begins (October 20th):

https://meta.wikimedia.org/wiki/Grants:IEG/StrepHit:_Wikidata_Statements_Validation_via_References

StrepHit is a Natural Language Processing pipeline that harvests 
structured data from raw text and produces Wikidata statements with 
*reference URLs* from *reliable* sources.


We have already received many valuable endorsements, for which we are 
grateful, but every voice is crucial and we are still missing yours!
If you like the idea, please consider clicking on the blue *endorse* 
button on the project page.


Looking forward to your updates.
Cheers,
--
Marco Fossati
http://about.me/marco.fossati
Twitter: @hjfocs
Skype: hell_j

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Freebase to Wikidata: Results from Tpt internship

2015-10-02 Thread Marco Fossati

Hi Denny, Thomas,

I would like to thank you both for your support in making the StrepHit 
soccer dataset available! I owe you some hectolitres of beer :-)


There is one thing that was mentioned during our summer discussions and 
that I sadly forgot: shall the Freebase ontology mappings be added to 
Wikidata?
If a Freebase endpoint still exists, it may make sense to proceed as per 
the DBpedia mappings.

See for instance the "equivalent class" claim in Astronaut:
https://www.wikidata.org/wiki/Q11631
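
For illustration, here is a rough Python sketch of how one could check
which mappings are already in place for a given class via the public WDQS
SPARQL endpoint. I am assuming the "equivalent class" property is P1709;
the User-Agent string is just a placeholder.

import requests

# List existing "equivalent class" mappings (assumed P1709) for
# Q11631 (astronaut) via the public Wikidata SPARQL endpoint.
QUERY = """
SELECT ?mapping WHERE {
  wd:Q11631 wdt:P1709 ?mapping .
}
"""

response = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "mapping-check-example/0.1"},
)
response.raise_for_status()
for row in response.json()["results"]["bindings"]:
    print(row["mapping"]["value"])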

Cheers!

On 10/2/15 14:00, wikidata-requ...@lists.wikimedia.org wrote:

Date: Thu, 01 Oct 2015 18:09:21 +
From: Denny Vrandečić
To: "Discussion list for the Wikidata project."

Subject: [Wikidata] Freebase to Wikidata: Results from Tpt internship
Message-ID:

Content-Type: text/plain; charset="utf-8"

First, thanks to Tpt for his amazing work! I had not expected to see such
rich results. He has exceeded my expectations by far, and produced much
more transferable data than I expected. Additionally, he was also working
on the primary sources tool directly and helped Marco Fossati to upload a
second, sports-related dataset (you can select that by clicking on the
gears icon next to the Freebase item link in the sidebar on Wikidata, when
you switch on the Primary Sources tool).


--
Marco Fossati
http://about.me/marco.fossati
Twitter: @hjfocs
Skype: hell_j

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


[Wikidata] StrepHit IEG proposal: call for support (was Re: [ANNOUNCEMENT] first StrepHit dataset for the primary sources tool)

2015-09-21 Thread Marco Fossati

Dear all,

The StrepHit IEG proposal is now pretty much complete:
https://meta.wikimedia.org/wiki/Grants:IEG/StrepHit:_Wikidata_Statements_Validation_via_References

We have already received support and feedback, but you are the most 
relevant community for this work, and the project specifically needs your help.


Your voice is vital and it can be heard on the project page in multiple 
ways. If you:

1. like the idea, please click on the blue *endorse* button;
2. want to get involved, please click on the blue *join* button;
3. want to share your thoughts, please click on the *give feedback* link.

Looking forward to your updates.
Cheers!

On 9/9/15 11:39, Marco Fossati wrote:

Hi Markus, everyone,

The project proposal is currently in active development.
I would like to focus now on the dissemination of the idea and the
engagement of the Wikidata community.
Hence, I would love to gather feedback on the following question:

Does StrepHit sound interesting and useful to you?

It would be great if you could report your thoughts on the project talk
page:
https://meta.wikimedia.org/wiki/Grants_talk:IEG/StrepHit:_Wikidata_Statements_Validation_via_References


Cheers!

On 9/8/15 2:02 PM, wikidata-requ...@lists.wikimedia.org wrote:

Date: Mon, 07 Sep 2015 16:47:16 +0200
From: Markus Krötzsch
To: "Discussion list for the Wikidata project."

Subject: Re: [Wikidata] [ANNOUNCEMENT] first StrepHit dataset for the
primary sources tool
Message-ID:<55eda374.2090...@semantic-mediawiki.org>
Content-Type: text/plain; charset=utf-8; format=flowed

Dear Marco,

Sounds interesting, but the project page still has a lot of gaps. Will
you notify us again when you are done? It is a bit tricky to endorse a
proposal that is not finished yet;-)

Markus

On 04.09.2015 17:01, Marco Fossati wrote:

>[Begging pardon if you have already read this in the Wikidata
project chat]
>
>Hi everyone,
>
>As Wikidatans, we all know how much data quality matters.
>We all know what high quality stands for: statements need to be
>validated via references to external, non-wiki, sources.
>
>That's why the primary sources tool is being developed:
>https://www.wikidata.org/wiki/Wikidata:Primary_sources_tool
>And that's why I am preparing the StrepHit IEG proposal:
>https://meta.wikimedia.org/wiki/Grants:IEG/StrepHit:_Wikidata_Statements_Validation_via_References

>
>
>StrepHit (pronounced "strep hit", means "Statement? repherence it!") is
>a Natural Language Processing pipeline that understands human language,
>extracts structured data from raw text and produces Wikidata statements
>with reference URLs.
>
>As a demonstration to support the IEG proposal, you can find the
>**FBK-strephit-soccer** dataset uploaded to the primary sources tool
>backend.
>It's a small dataset serving the soccer domain use case.
>Please follow the instructions on the project page to activate it and
>start playing with the data.
>
>What is the biggest difference that sets StrepHit datasets apart from
>the currently uploaded ones?
>At least one reference URL is always guaranteed for each statement.
>This means that if StrepHit finds some new statement that was not there
>in Wikidata before, it will always propose its external references.
>We do not want to manually reject all the new statements with no
>reference, right?
>
>If you like the idea, please endorse the StrepHit IEG proposal!




--
Marco Fossati
http://about.me/marco.fossati
Twitter: @hjfocs
Skype: hell_j

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] [ANNOUNCEMENT] first StrepHit dataset for the primary sources tool

2015-09-09 Thread Marco Fossati

Hi Markus, everyone,

The project proposal is currently in active development.
I would like to focus now on the dissemination of the idea and the 
engagement of the Wikidata community.

Hence, I would love to gather feedback on the following question:

Does StrepHit sound interesting and useful to you?

It would be great if you could report your thoughts on the project talk 
page:

https://meta.wikimedia.org/wiki/Grants_talk:IEG/StrepHit:_Wikidata_Statements_Validation_via_References

Cheers!

On 9/8/15 2:02 PM, wikidata-requ...@lists.wikimedia.org wrote:

Date: Mon, 07 Sep 2015 16:47:16 +0200
From: Markus Krötzsch
To: "Discussion list for the Wikidata project."

Subject: Re: [Wikidata] [ANNOUNCEMENT] first StrepHit dataset for the
primary sources tool
Message-ID:<55eda374.2090...@semantic-mediawiki.org>
Content-Type: text/plain; charset=utf-8; format=flowed

Dear Marco,

Sounds interesting, but the project page still has a lot of gaps. Will
you notify us again when you are done? It is a bit tricky to endorse a
proposal that is not finished yet;-)

Markus

On 04.09.2015 17:01, Marco Fossati wrote:

>[Begging pardon if you have already read this in the Wikidata project chat]
>
>Hi everyone,
>
>As Wikidatans, we all know how much data quality matters.
>We all know what high quality stands for: statements need to be
>validated via references to external, non-wiki, sources.
>
>That's why the primary sources tool is being developed:
>https://www.wikidata.org/wiki/Wikidata:Primary_sources_tool
>And that's why I am preparing the StrepHit IEG proposal:
>https://meta.wikimedia.org/wiki/Grants:IEG/StrepHit:_Wikidata_Statements_Validation_via_References
>
>
>StrepHit (pronounced "strep hit", means "Statement? repherence it!") is
>a Natural Language Processing pipeline that understands human language,
>extracts structured data from raw text and produces Wikidata statements
>with reference URLs.
>
>As a demonstration to support the IEG proposal, you can find the
>**FBK-strephit-soccer** dataset uploaded to the primary sources tool
>backend.
>It's a small dataset serving the soccer domain use case.
>Please follow the instructions on the project page to activate it and
>start playing with the data.
>
>What is the biggest difference that sets StrepHit datasets apart from
>the currently uploaded ones?
>At least one reference URL is always guaranteed for each statement.
>This means that if StrepHit finds some new statement that was not there
>in Wikidata before, it will always propose its external references.
>We do not want to manually reject all the new statements with no
>reference, right?
>
>If you like the idea, please endorse the StrepHit IEG proposal!


--
Marco Fossati
http://about.me/marco.fossati
Twitter: @hjfocs
Skype: hell_j

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] [ANNOUNCEMENT] first StrepHit dataset for the primary sources tool

2015-09-05 Thread Marco Fossati

Hi Gerard,

Let me add a further reply to your comment.

On 9/5/15 2:01 PM, wikidata-requ...@lists.wikimedia.org wrote:

Message: 3
Date: Fri, 4 Sep 2015 19:26:38 +0200
From: Gerard Meijssen

No.
Quality is not determined by sources. Sources do lie.

When you want quality, you seek sources where they matter most. It is not
by going for "all" of them
I completely agree with you that many sources can be flawed. I had 
omitted the term "trustworthy" before "sources"; I have now added it in 
the Wikidata project chat.
The IEG proposal will also include an investigation phase to select a 
set of authoritative sources; see the first task in the proposal work 
package:

https://meta.wikimedia.org/wiki/Grants:IEG/StrepHit:_Wikidata_Statements_Validation_via_References#Work_Package

I'll expand on this.

Cheers,
--
Marco Fossati
http://about.me/marco.fossati
Twitter: @hjfocs
Skype: hell_j

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] [ANNOUNCEMENT] first StrepHit dataset for the primary sources tool

2015-09-05 Thread Marco Fossati

Dear all,

On 9/5/15 2:01 PM, wikidata-requ...@lists.wikimedia.org wrote:

Message: 3
Date: Fri, 4 Sep 2015 19:26:38 +0200
From: Gerard Meijssen

Quality is not determined by sources. Sources do lie.

When you want quality, you seek sources where they matter most.

Thanks @Gerard for your criticism; let me reply to your concerns.
The following references counter your points. They inspired me while I 
was developing the idea:


https://www.wikidata.org/wiki/Wikidata:Referencing_improvements_input
http://blog.wikimedia.de/2015/01/03/scaling-wikidata-success-means-making-the-pie-bigger/
https://tools.wmflabs.org/wikidata-todo/sourcery.html
https://phabricator.wikimedia.org/T76230
https://phabricator.wikimedia.org/T76232
https://phabricator.wikimedia.org/T76231
https://phabricator.wikimedia.org/T90881


Message: 4
Date: Fri, 4 Sep 2015 19:34:22 +0200
From: Lydia Pintscher

Thank you for working on this, Marco. This is a great step forward. I
wish you good luck for the IEG proposal!

Thanks @Lydia for your encouragement!

Cheers,
--
Marco Fossati
http://about.me/marco.fossati
Twitter: @hjfocs
Skype: hell_j

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


[Wikidata] [ANNOUNCEMENT] first StrepHit dataset for the primary sources tool

2015-09-04 Thread Marco Fossati

[Begging pardon if you have already read this in the Wikidata project chat]

Hi everyone,

As Wikidatans, we all know how much data quality matters.
We all know what high quality stands for: statements need to be 
validated via references to external, non-wiki sources.


That's why the primary sources tool is being developed:
https://www.wikidata.org/wiki/Wikidata:Primary_sources_tool
And that's why I am preparing the StrepHit IEG proposal:
https://meta.wikimedia.org/wiki/Grants:IEG/StrepHit:_Wikidata_Statements_Validation_via_References

StrepHit (pronounced "strep hit", meaning "Statement? repherence it!") is 
a Natural Language Processing pipeline that understands human language, 
extracts structured data from raw text and produces Wikidata statements 
with reference URLs.


As a demonstration to support the IEG proposal, you can find the 
**FBK-strephit-soccer** dataset uploaded to the primary sources tool 
backend.

It's a small dataset serving the soccer domain use case.
Please follow the instructions on the project page to activate it and 
start playing with the data.


What is the biggest difference that sets StrepHit datasets apart from 
the currently uploaded ones?

At least one reference URL is always guaranteed for each statement.
This means that if StrepHit finds some new statement that was not there 
in Wikidata before, it will always propose its external references.
We do not want to manually reject all the new statements with no 
reference, right?
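
To make the "guaranteed reference" point concrete, here is a toy Python
sketch of the kind of record StrepHit emits: a statement is never
serialized without its source URL. The item IDs, the "S854" reference
column, and the tab-separated QuickStatements-like layout below are
illustrative assumptions, not the exact backend format.

# Toy sketch: serialize one statement together with its mandatory
# reference URL. IDs and the column layout are placeholders.
def serialize_statement(subject_qid, property_id, value_qid, reference_url):
    if not reference_url:
        raise ValueError("StrepHit statements always carry a reference URL")
    # "S854" marks the reference-URL column (assumed convention).
    return "\t".join(
        [subject_qid, property_id, value_qid, "S854", '"%s"' % reference_url]
    )

# Hypothetical player item, "member of sports team" (P54), hypothetical club item:
print(serialize_statement("Q111111", "P54", "Q222222",
                          "http://example.org/player-profile"))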


If you like the idea, please endorse the StrepHit IEG proposal!

Cheers,
--
Marco Fossati
http://about.me/marco.fossati
Twitter: @hjfocs
Skype: hell_j

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


[Wikidata] On Human computation (was Re: Freebase is dead, long live :BaseKB)

2015-07-20 Thread Marco Fossati

Thanks @Benjamin for the pointers!
I completely agree with @Tom.

I've also been researching techniques for crowdsourcing micro-tasks, 
mostly for NLP activities like frame semantics annotation:

http://www.aclweb.org/anthology/P13-2130
http://ceur-ws.org/Vol-1030/paper-03.pdf

I found that a crowd of paid workers can really make a difference, even 
for such difficult and subjective tasks.


So here are my 2 cents on getting the best out of it:
1. Take extreme care with quality-check mechanisms: for instance, the 
CrowdFlower.com platform has a facility that automatically discards 
untrusted workers (see the sketch after this list);
2. The micro-task must be atomic, i.e., it should not contain multiple sub-tasks;
3. The UI design is always crucial: simple words, clear examples, and no 
screen scrolling.
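
As a toy illustration of point 1, here is a minimal Python sketch of the
usual gold-question approach: workers whose accuracy on hidden test
questions falls below a threshold are discarded. The 0.7 cut-off and the
data layout are assumptions for the example; CrowdFlower's real mechanism
is of course more elaborate.

from collections import defaultdict

ACCURACY_THRESHOLD = 0.7  # assumed cut-off for this example

def trusted_workers(judgments, gold):
    """judgments: (worker_id, task_id, answer) tuples; gold: {task_id: answer}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for worker_id, task_id, answer in judgments:
        if task_id in gold:
            total[worker_id] += 1
            correct[worker_id] += int(answer == gold[task_id])
    return {w for w in total if correct[w] / total[w] >= ACCURACY_THRESHOLD}

def filter_judgments(judgments, gold):
    """Keep only the judgments coming from trusted workers."""
    keep = trusted_workers(judgments, gold)
    return [j for j in judgments if j[0] in keep]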


Cheers!

On 7/18/15 2:00 PM, wikidata-requ...@lists.wikimedia.org wrote:

Date: Fri, 17 Jul 2015 13:42:55 -0400
From: Tom Morris
To: "Discussion list for the Wikidata project."

Subject: Re: [Wikidata] Freebase is dead, long live :BaseKB
Message-ID:

Content-Type: text/plain; charset="utf-8"

3,000 judgments per person per day sounds high to me, particularly on a
sustained basis, but it really depends on the type of task.  Some of the
tasks were very simple with custom high performance single purpose "games"
designed around them.  For example, Genderizer presented a person's
information and allowed choices of Male, Female, Other, and Skip.  Using
arrow key bindings for the four choices to allow quick selection without
moving one's hand, pipelining preloading the next topic in the background,
and allowing votes to be undone in case of error were all features which
allowed voters to make choices very quickly.

The figures quoted in the paper below (18 seconds per judgment) work out to
more like 1,800 judgments per eight hour day.  They collected 2.3 million
judgments over the course of a year from 555 volunteers (1.05 million
judgments) and 84 paid workers (1.25 million).

On Fri, Jul 17, 2015 at 12:35 PM, Benjamin Good
wrote:


>They wrote a really insightful paper about how their processes for
>large-scale data curation worked.  Among many other things, they
>investigated mechanical turk 'micro tasks' versus hourly workers and
>generally found the latter to be more cost effective.
>
>"The Anatomy of a Large-Scale Human Computation Engine"
>http://wiki.freebase.com/images/e/e0/Hcomp10-anatomy.pdf
>

The full citation, in case someone needs to track it down, is:

Kochhar, Shailesh, Stefano Mazzocchi, and Praveen Paritosh. "The anatomy of
a large-scale human computation engine." *Proceedings of the acm sigkdd
workshop on human computation*. ACM, 2010.

There's also a slide presentation by the same name which presents some
additional information:
http://www.slideshare.net/brixofglory/rabj-freebase-all-5049845

Praveen Paritosh has written a number of papers on the topic of human
computation, if you're interested in that (I am!):
https://scholar.google.com/citations?user=_wX4sFYJ&hl=en&oi=sra


--
Marco Fossati
http://about.me/marco.fossati
Twitter: @hjfocs
Skype: hell_j

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Help needed for Freebase to Wikidata migration

2015-06-17 Thread Marco Fossati

Hi Thomas,

1. Have the mappings [1, 2] coming from earlier discussions [3] already 
been validated/integrated?


2. As the maintainer of the DBpedia mapper bot, I can confirm that the 
procedure for adding schema mappings to Wikidata classes and properties 
has already been discussed [4].
The bot code can easily be adapted for Freebase (see the sketch below); 
let me know and I can volunteer to open a request for the new bot.
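
For the record, the write side is tiny. A rough pywikibot sketch, assuming
the "equivalent class" property is P1709; the Freebase class URI below is
a made-up placeholder for illustration only.

import pywikibot

# Connect to the Wikidata repository (requires a configured bot account).
site = pywikibot.Site("wikidata", "wikidata")
repo = site.data_repository()

# Add an "equivalent class" mapping (assumed P1709, URL datatype) to the
# astronaut item, mirroring the DBpedia-style mappings.
item = pywikibot.ItemPage(repo, "Q11631")
claim = pywikibot.Claim(repo, "P1709")
claim.setTarget("http://rdf.freebase.com/ns/example.placeholder_class")
item.addClaim(claim, summary="Add equivalent class mapping (example)")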


Hope this helps.
Cheers!

[1] 
https://docs.google.com/spreadsheets/d/1QXISrOrsr8EEjTtkIKsbIQGhg7jRXh_RKFlA-Xig84I/edit?usp=sharing
[2] 
https://docs.google.com/spreadsheets/d/1USoyyvgouOK8t7PjtP_yveVbKM2WZI64lhtvhG0oj3Y/edit?usp=sharing
[3] 
https://www.wikidata.org/wiki/Wikidata_talk:WikiProject_Freebase#Statistical_mappings
[4] 
https://www.wikidata.org/wiki/Wikidata:Requests_for_permissions/Bot/DBpedia-mapper-bot


On 6/16/15 2:01 PM, wikidata-requ...@lists.wikimedia.org wrote:

Date: Mon, 15 Jun 2015 16:24:20 -0700
From: Thomas Pellissier-Tanon
To:wikidata@lists.wikimedia.org
Subject: [Wikidata] Help needed for Freebase to Wikidata migration
Message-ID:

Content-Type: text/plain; charset="utf-8"

Hey everyone,

As you may already know, I am currently working on the importation of
Freebase content into Wikidata [1] using the primary source tool [2].

One of the big challenges of the migration is to build a good mapping of
the properties of Freebase to Wikidata ones. There are a few thousand
properties, so it is a task too big to be done alone. Your help is far more
than welcome for this task on this page:
https://www.wikidata.org/wiki/Wikidata:WikiProject_Freebase/Mapping

Cheers,

Thomas

[1]https://www.wikidata.org/wiki/Wikidata:WikiProject_Freebase
[2]https://www.wikidata.org/wiki/Wikidata:Primary_sources_tool


--
Marco Fossati
http://about.me/marco.fossati
Twitter: @hjfocs
Skype: hell_j

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata