---------- Forwarded message ----------
From: "Marco Fossati" <foss...@fbk.eu>
Date: 11 Nov 2016 1:23 PM
Subject: Fwd: Re: [wikicite-discuss] Entity tagging and fact extraction (from a scholarly publisher perspective)
To: "Marco Fossati" <foss...@spaziodati.eu>
Cc:
---------- Forwarded message ----------
From: "Marco Fossati" <foss...@fbk.eu>
Date: 11 Nov 2016 1:18 PM
Subject: Re: [wikicite-discuss] Entity tagging and fact extraction (from a scholarly publisher perspective)
To: "Andrew Smeall" <andrew.sme...@hindawi.com>
Cc: "Dario Taraborelli" <dtarabore...@wikimedia.org>, "Benjamin Good" <ben.mcgee.g...@gmail.com>, "Discussion list for the Wikidata project." <wikidata@lists.wikimedia.org>, "wikicite-discuss" <wikicite-disc...@wikimedia.org>, "Daniel Mietchen" <daniel.mietc...@googlemail.com>

Hi everyone,

Just a couple of thoughts, which are in line with Dario's first message:

1. The primary sources tool lets third-party providers release *full datasets* in a rather quick way. It is conceived to (a) ease the ingestion of *non-curated* data and (b) let the community directly decide which statements should be included, instead of holding potentially complex a priori discussions. Important: the datasets should comply with the Wikidata vocabulary/ontology.

2. I see the mix'n'match tool as a way to *link* datasets to Wikidata via ID mappings, thus only requiring statements that say "Wikidata entity X links to third-party dataset entity Y". This is pretty much what the linked data community has been doing so far. No need to comply with the Wikidata vocabulary/ontology.

Best,
Marco

On 11 Nov 2016 10:27 AM, "Andrew Smeall" <andrew.sme...@hindawi.com> wrote:

> Regarding the topics/vocabularies issue:
>
> A challenge we're working on is finding a set of controlled vocabularies
> for all the subject areas we cover.
>
> We do use MeSH for those subjects, but this only applies to about 40% of
> our papers. In Engineering, for example, we've had more trouble finding an
> open taxonomy with the same level of depth as MeSH. For most internal
> applications, we need 100% coverage of all subjects.
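[Editor's note: Marco's second point, statements of the form "Wikidata entity X links to third-party dataset entity Y", boils down to external-identifier claims. A minimal sketch of emitting such claims in QuickStatements tab-separated syntax; the property `P9999` and the mapping data are hypothetical placeholders, not real Wikidata values.]

```python
# Sketch: turn ID mappings (as produced by a Mix'n'match-style matching
# pass) into QuickStatements lines. P9999 and the IDs are placeholders.

def quickstatements_mappings(mappings, property_id):
    """Turn {wikidata_qid: external_id} pairs into QuickStatements lines."""
    return [f'{qid}\t{property_id}\t"{ext_id}"'
            for qid, ext_id in mappings.items()]

# Example: link two Wikidata items to entries in a third-party dataset.
for line in quickstatements_mappings(
        {"Q123": "ext-001", "Q456": "ext-002"}, "P9999"):
    print(line)
```

Note that, as Marco says, such linking statements need no ontology alignment beyond the external-ID property itself.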
>
> Machine learning for concept tagging is trendy now, partly because it
> doesn't require a preset vocabulary, but we are somewhat against this
> approach because we want to control the mapping of terms, and a taxonomic
> hierarchy can be useful. The current ML tools I've seen can match to a
> controlled vocabulary, but then they need the publisher to supply the terms.
>
> The temptation to build a new vocabulary is strong, because it's the
> fastest way to get to something that is non-proprietary and universal. We
> can merge existing open vocabularies like MeSH and PLOS to get most of the
> way there, but we then need to extend that with concepts from our corpus.
>
> Thanks Daniel and Benjamin for your responses. Any other feedback would be
> great, and I'm always happy to delve into issues from the publisher
> perspective if that can be helpful.
>
> On Fri, Nov 11, 2016 at 4:54 PM, Dario Taraborelli
> <dtarabore...@wikimedia.org> wrote:
>
>> Benjamin – agreed, I too see Wikidata as mainly a place to hold all the
>> mappings. Once we support federated queries in WDQS, the benefit of ID
>> mapping (over extensive data ingestion) will become even more apparent.
>>
>> Hope Andrew and other interested parties can pick up this thread.
>>
>> On Wed, Nov 2, 2016 at 12:11 PM, Benjamin Good <ben.mcgee.g...@gmail.com>
>> wrote:
>>
>>> Dario,
>>>
>>> One message you can send is that they can and should use existing
>>> controlled vocabularies and ontologies to construct the metadata they want
>>> to share. For example, MeSH descriptors would be a good way for them to
>>> organize the 'primary topic' assertions for their articles and would make
>>> it easy to find the corresponding items in Wikidata when uploading. Our
>>> group will be continuing to expand coverage of identifiers and concepts
>>> from vocabularies like that in Wikidata - and any help there from
>>> publishers would be appreciated!
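[Editor's note: matching publisher keywords against a controlled vocabulary such as MeSH, as Andrew and Benjamin discuss, can start as plain normalized lookup before any ML is involved. A minimal sketch; the tiny vocabulary and its descriptor IDs below are illustrative, not verified MeSH data.]

```python
# Sketch: map free-text keywords to a controlled vocabulary by
# normalized exact lookup. A real deployment would load the full
# MeSH descriptor file instead of this illustrative dictionary.

def normalize(term):
    """Lowercase and collapse whitespace for case-insensitive matching."""
    return " ".join(term.lower().split())

def match_keywords(keywords, vocabulary):
    """Return {keyword: descriptor_id} for keywords found in the vocabulary."""
    index = {normalize(label): desc_id for label, desc_id in vocabulary.items()}
    return {kw: index[normalize(kw)] for kw in keywords
            if normalize(kw) in index}

vocabulary = {"Neoplasms": "D009369", "Coronary Vessels": "D003331"}  # illustrative
print(match_keywords(["neoplasms", "finite element analysis"], vocabulary))
```

Unmatched keywords (like the Engineering term above) are exactly the gap Andrew describes: terms that would have to come from an extended or merged vocabulary.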
>>>
>>> My view here is that Wikidata can be a bridge to the terminologies and
>>> datasets that live outside it - not really a replacement for them. So, if
>>> they have good practices about using shared vocabularies already, it should
>>> (eventually) be relatively easy to move relevant assertions into the
>>> Wikidata graph while maintaining interoperability and integration with
>>> external software systems.
>>>
>>> -Ben
>>>
>>> On Wed, Nov 2, 2016 at 8:31 AM, 'Daniel Mietchen' via wikicite-discuss
>>> <wikicite-disc...@wikimedia.org> wrote:
>>>
>>>> I'm traveling (https://twitter.com/EvoMRI/status/793736211009536000),
>>>> so just in brief:
>>>> In terms of markup, some general comments are in
>>>> https://www.ncbi.nlm.nih.gov/books/NBK159964/ , which is not specific
>>>> to Hindawi but partly applies to them too.
>>>>
>>>> A problem specific to Hindawi (cf.
>>>> https://commons.wikimedia.org/wiki/Category:Media_from_Hindawi) is the
>>>> bundling of the descriptions of all supplementary files, which
>>>> translates into uploads like
>>>> https://commons.wikimedia.org/wiki/File:Evolution-of-Coronary-Flow-in-an-Experimental-Slow-Flow-Model-in-Swines-Angiographic-and-623986.f1.ogv
>>>> (with descriptions for nine files)
>>>> and eight files with no description, e.g.
>>>> https://commons.wikimedia.org/wiki/File:Evolution-of-Coronary-Flow-in-an-Experimental-Slow-Flow-Model-in-Swines-Angiographic-and-623986.f2.ogv .
>>>>
>>>> There are other problems in their JATS, and it would be good if they
>>>> would participate in http://jats4r.org/ . Happy to dig deeper with
>>>> Andrew or whoever is interested.
>>>>
>>>> Where they are ahead of the curve is licensing information, so they
>>>> could help us set up workflows to get that info into Wikidata.
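[Editor's note: a first step for the licensing workflow Daniel mentions would be extracting the license URI from a publisher's JATS XML. A minimal sketch; the XML fragment below is a hypothetical example following the JATS `permissions`/`license` pattern, not taken from a real Hindawi article.]

```python
# Sketch: pull the license URI (xlink:href) out of a JATS article's
# <permissions><license> element. Sample XML is hypothetical.
import xml.etree.ElementTree as ET

XLINK = "http://www.w3.org/1999/xlink"

def extract_license(jats_xml):
    """Return the xlink:href of the first <license> element, or None."""
    root = ET.fromstring(jats_xml)
    lic = root.find(".//permissions/license")
    return lic.get(f"{{{XLINK}}}href") if lic is not None else None

sample = """<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front><article-meta><permissions>
    <license license-type="open-access"
             xlink:href="http://creativecommons.org/licenses/by/4.0/"/>
  </permissions></article-meta></front>
</article>"""

print(extract_license(sample))  # http://creativecommons.org/licenses/by/4.0/
```

From there, the URI could be mapped to the corresponding Wikidata license item before being attached to an article's statements.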
>>>>
>>>> In terms of triple suggestions to Wikidata:
>>>> - as far as article metadata is concerned, I would prefer to
>>>> concentrate on integrating our workflows with the major repositories
>>>> of metadata, to which publishers are already posting. They could help
>>>> us by using more identifiers (e.g. for authors, affiliations, funders
>>>> etc.), potentially even from Wikidata (e.g. for keywords/P921, for
>>>> both journals and articles), and by contributing to the development of
>>>> tools (e.g. a bot that goes through the CrossRef database every day
>>>> and creates Wikidata items for newly published papers).
>>>> - if they have ways to extract statements from their publication
>>>> corpus, it would be good if they would let us/ContentMine/StrepHit
>>>> etc. know, so we could discuss how to move this forward.
>>>> d.
>>>>
>>>> On Wed, Nov 2, 2016 at 1:42 PM, Dario Taraborelli
>>>> <dtarabore...@wikimedia.org> wrote:
>>>> > I'm at the Crossref LIVE 16 event in London where I just gave a
>>>> > presentation on WikiCite and Wikidata targeted at scholarly publishers.
>>>> >
>>>> > Beside Crossref and Datacite people, I talked to a bunch of folks
>>>> > interested in collaborating on Wikidata integration, particularly from
>>>> > PLOS, Hindawi and Springer Nature. I started an interesting discussion
>>>> > with Andrew Smeall, who runs strategic projects at Hindawi, and I
>>>> > wanted to open it up to everyone on the lists.
>>>> >
>>>> > Andrew asked me if – aside from efforts like ContentMine and StrepHit –
>>>> > there are any recommendations for publishers (especially OA publishers)
>>>> > to mark up their contents and facilitate information extraction and
>>>> > entity matching or even push triples to Wikidata to be considered for
>>>> > ingestion.
>>>> >
>>>> > I don't think we have a recommended workflow for data providers for
>>>> > facilitating triple suggestions to Wikidata, other than leveraging the
>>>> > Primary Sources Tool. However, aligning keywords and terms with the
>>>> > corresponding Wikidata items via ID mapping sounds like a good first
>>>> > step. I pointed Andrew to Mix'n'Match as a handy way of mapping
>>>> > identifiers, but if you have other ideas on how to best support 2-way
>>>> > integration of Wikidata with scholarly contents, please chime in.
>>>> >
>>>> > Dario
>>>> >
>>>> > --
>>>> > Dario Taraborelli  Head of Research, Wikimedia Foundation
>>>> > wikimediafoundation.org • nitens.org • @readermeter
>>>> >
>>>> > --
>>>> > WikiCite 2016 – May 25-26, 2016, Berlin
>>>> > Meta: https://meta.wikimedia.org/wiki/WikiCite_2016
>>>> > Twitter: https://twitter.com/wikicite16
>>>> > ---
>>>> > You received this message because you are subscribed to the Google
>>>> > Groups "wikicite-discuss" group.
>>>> > To unsubscribe from this group and stop receiving emails from it,
>>>> > send an email to wikicite-discuss+unsubscr...@wikimedia.org.
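[Editor's note: Daniel's bot idea above, polling CrossRef daily for newly indexed papers, can be sketched against the public CrossRef REST API `/works` endpoint with a `from-index-date` filter. A minimal sketch; paging via cursors, error handling, and the actual Wikidata item creation are left out, and the date shown is just an example.]

```python
# Sketch of the daily CrossRef polling step for the proposed bot.
# Only query construction and a fetch helper are shown; creating the
# corresponding Wikidata items is out of scope here.
import json
import urllib.parse
import urllib.request

def crossref_works_url(from_index_date, rows=20):
    """Build a CrossRef /works query for items indexed since a given date."""
    params = urllib.parse.urlencode({
        "filter": f"from-index-date:{from_index_date}",
        "rows": rows,
    })
    return f"https://api.crossref.org/works?{params}"

def fetch_new_dois(from_index_date, rows=20):
    """Return DOIs of recently indexed works (network access required)."""
    with urllib.request.urlopen(crossref_works_url(from_index_date, rows)) as resp:
        message = json.load(resp)["message"]
    return [item["DOI"] for item in message["items"]]

print(crossref_works_url("2016-11-10"))
```

Each returned DOI would then be checked against existing Wikidata items before a new item is created, to avoid duplicates.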
>>
>> --
>> *Dario Taraborelli *Head of Research, Wikimedia Foundation
>> wikimediafoundation.org • nitens.org • @readermeter
>> <http://twitter.com/readermeter>
>
> --
> Andrew Smeall
> Head of Strategic Projects
>
> Hindawi Publishing Corporation
> Kirkman House
> 12-14 Whitfield Street, 3rd Floor
> London, W1T 2RF
> United Kingdom
_______________________________________________
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata