---------- Forwarded message ----------
From: "Marco Fossati" <foss...@fbk.eu>
Date: 11 Nov 2016 1:23 PM
Subject: Fwd: Re: [wikicite-discuss] Entity tagging and fact extraction (from a scholarly publisher perspective)
To: "Marco Fossati" <foss...@spaziodati.eu>
Cc:
---------- Forwarded message ----------
From: "Marco Fossati" <foss...@fbk.eu>
Date: 11 Nov 2016 1:18 PM
Subject: Re: [wikicite-discuss] Entity tagging and fact extraction (from a scholarly publisher perspective)
To: "Andrew Smeall" <andrew.sme...@hindawi.com>
Cc: "Dario Taraborelli" <dtarabore...@wikimedia.org>, "Benjamin Good" <ben.mcgee.g...@gmail.com>, "Discussion list for the Wikidata project." <wikidata@lists.wikimedia.org>, "wikicite-discuss" <wikicite-disc...@wikimedia.org>, "Daniel Mietchen" <daniel.mietc...@googlemail.com>

Hi everyone,

Just a couple of thoughts, which are in line with Dario's first message:

1. The primary sources tool lets third-party providers release *full datasets* in a rather quick way. It is conceived to (a) ease the ingestion of *non-curated* data and (b) let the community directly decide which statements should be included, instead of holding potentially complex a priori discussions. Important: the datasets should comply with the Wikidata vocabulary/ontology.

2. I see the mix'n'match tool as a way to *link* datasets to Wikidata via ID mappings, thus only requiring statements that say "Wikidata entity X links to third-party dataset entity Y". This is pretty much what the linked data community has been doing so far. No need to comply with the Wikidata vocabulary/ontology.

Best,
Marco

On 11 Nov 2016 10:27 AM, "Andrew Smeall" <andrew.sme...@hindawi.com> wrote:

> Regarding the topics/vocabularies issue:
>
> A challenge we're working on is finding a set of controlled vocabularies
> for all the subject areas we cover.
>
> We do use MeSH for those subjects, but this only applies to about 40% of
> our papers. In Engineering, for example, we've had more trouble finding an
> open taxonomy with the same level of depth as MeSH. For most internal
> applications, we need 100% coverage of all subjects.
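[Editor's note: Marco's second point, statements of the form "Wikidata entity X links to third-party dataset entity Y", boils down to external-identifier claims. A minimal sketch of emitting such claims in QuickStatements tab-separated syntax; the property `P9999` and the mapping data are hypothetical placeholders, not real Wikidata values.]

```python
# Sketch: turn ID mappings (as produced by a Mix'n'match-style matching
# pass) into QuickStatements lines. P9999 and the IDs are placeholders.

def quickstatements_mappings(mappings, property_id):
    """Turn {wikidata_qid: external_id} pairs into QuickStatements lines."""
    return [f'{qid}\t{property_id}\t"{ext_id}"'
            for qid, ext_id in mappings.items()]

# Example: link two Wikidata items to entries in a third-party dataset.
for line in quickstatements_mappings(
        {"Q123": "ext-001", "Q456": "ext-002"}, "P9999"):
    print(line)
```

Note that, as Marco says, such linking statements need no ontology alignment beyond the external-ID property itself.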
>
> Machine learning for concept tagging is trendy now, partly because it
> doesn't require a preset vocabulary, but we are somewhat against this
> approach because we want to control the mapping of terms, and a taxonomic
> hierarchy can be useful. The current ML tools I've seen can match to a
> controlled vocabulary, but then they need the publisher to supply the terms.
>
> The temptation to build a new vocabulary is strong, because it's the
> fastest way to get to something that is non-proprietary and universal. We
> can merge existing open vocabularies like MeSH and PLOS to get most of the
> way there, but we then need to extend that with concepts from our corpus.
>
> Thanks Daniel and Benjamin for your responses. Any other feedback would be
> great, and I'm always happy to delve into issues from the publisher
> perspective if that can be helpful.
>
> On Fri, Nov 11, 2016 at 4:54 PM, Dario Taraborelli
> <dtarabore...@wikimedia.org> wrote:
>
>> Benjamin – agreed, I too see Wikidata as mainly a place to hold all the
>> mappings. Once we support federated queries in WDQS, the benefit of ID
>> mapping (over extensive data ingestion) will become even more apparent.
>>
>> Hope Andrew and other interested parties can pick up this thread.
>>
>> On Wed, Nov 2, 2016 at 12:11 PM, Benjamin Good <ben.mcgee.g...@gmail.com>
>> wrote:
>>
>>> Dario,
>>>
>>> One message you can send is that they can and should use existing
>>> controlled vocabularies and ontologies to construct the metadata they want
>>> to share. For example, MeSH descriptors would be a good way for them to
>>> organize the 'primary topic' assertions for their articles and would make
>>> it easy to find the corresponding items in Wikidata when uploading. Our
>>> group will be continuing to expand coverage of identifiers and concepts
>>> from vocabularies like that in Wikidata - and any help there from
>>> publishers would be appreciated!
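[Editor's note: matching publisher keywords against a controlled vocabulary such as MeSH, as Andrew and Benjamin discuss, can start as plain normalized lookup before any ML is involved. A minimal sketch; the tiny vocabulary and its descriptor IDs below are illustrative, not verified MeSH data.]

```python
# Sketch: map free-text keywords to a controlled vocabulary by
# normalized exact lookup. A real deployment would load the full
# MeSH descriptor file instead of this illustrative dictionary.

def normalize(term):
    """Lowercase and collapse whitespace for case-insensitive matching."""
    return " ".join(term.lower().split())

def match_keywords(keywords, vocabulary):
    """Return {keyword: descriptor_id} for keywords found in the vocabulary."""
    index = {normalize(label): desc_id for label, desc_id in vocabulary.items()}
    return {kw: index[normalize(kw)] for kw in keywords
            if normalize(kw) in index}

vocabulary = {"Neoplasms": "D009369", "Coronary Vessels": "D003331"}  # illustrative
print(match_keywords(["neoplasms", "finite element analysis"], vocabulary))
```

Unmatched keywords (like the Engineering term above) are exactly the gap Andrew describes: terms that would have to come from an extended or merged vocabulary.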
>>>
>>> My view here is that Wikidata can be a bridge to the terminologies and
>>> datasets that live outside it - not really a replacement for them. So, if
>>> they have good practices about using shared vocabularies already, it should
>>> (eventually) be relatively easy to move relevant assertions into the
>>> Wikidata graph while maintaining interoperability and integration with
>>> external software systems.
>>>
>>> -Ben
>>>
>>> On Wed, Nov 2, 2016 at 8:31 AM, 'Daniel Mietchen' via wikicite-discuss
>>> <wikicite-disc...@wikimedia.org> wrote:
>>>
>>>> I'm traveling (https://twitter.com/EvoMRI/status/793736211009536000),
>>>> so just in brief:
>>>> In terms of markup, some general comments are in
>>>> https://www.ncbi.nlm.nih.gov/books/NBK159964/ , which is not specific
>>>> to Hindawi but partly applies to them too.
>>>>
>>>> A problem specific to Hindawi (cf.
>>>> https://commons.wikimedia.org/wiki/Category:Media_from_Hindawi) is the
>>>> bundling of the descriptions of all supplementary files, which
>>>> translates into uploads like
>>>> https://commons.wikimedia.org/wiki/File:Evolution-of-Coronary-Flow-in-an-Experimental-Slow-Flow-Model-in-Swines-Angiographic-and-623986.f1.ogv
>>>> (with descriptions for nine files)
>>>> and eight files with no description, e.g.
>>>> https://commons.wikimedia.org/wiki/File:Evolution-of-Coronary-Flow-in-an-Experimental-Slow-Flow-Model-in-Swines-Angiographic-and-623986.f2.ogv .
>>>>
>>>> There are other problems in their JATS, and it would be good if they
>>>> would participate in http://jats4r.org/ . Happy to dig deeper with
>>>> Andrew or whoever is interested.
>>>>
>>>> Where they are ahead of the curve is licensing information, so they
>>>> could help us set up workflows to get that info into Wikidata.
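[Editor's note: a first step for the licensing workflow Daniel mentions would be extracting the license URI from a publisher's JATS XML. A minimal sketch; the XML fragment below is a hypothetical example following the JATS `permissions`/`license` pattern, not taken from a real Hindawi article.]

```python
# Sketch: pull the license URI (xlink:href) out of a JATS article's
# <permissions><license> element. Sample XML is hypothetical.
import xml.etree.ElementTree as ET

XLINK = "http://www.w3.org/1999/xlink"

def extract_license(jats_xml):
    """Return the xlink:href of the first <license> element, or None."""
    root = ET.fromstring(jats_xml)
    lic = root.find(".//permissions/license")
    return lic.get(f"{{{XLINK}}}href") if lic is not None else None

sample = """<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front><article-meta><permissions>
    <license license-type="open-access"
             xlink:href="http://creativecommons.org/licenses/by/4.0/"/>
  </permissions></article-meta></front>
</article>"""

print(extract_license(sample))  # http://creativecommons.org/licenses/by/4.0/
```

From there, the URI could be mapped to the corresponding Wikidata license item before being attached to an article's statements.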
>>>>
>>>> In terms of triple suggestions to Wikidata:
>>>> - as far as article metadata is concerned, I would prefer to
>>>> concentrate on integrating our workflows with the major repositories
>>>> of metadata, to which publishers are already posting. They could help
>>>> us by using more identifiers (e.g. for authors, affiliations, funders
>>>> etc.), potentially even from Wikidata (e.g. for keywords/P921, for
>>>> both journals and articles), and by contributing to the development of
>>>> tools (e.g. a bot that goes through the CrossRef database every day
>>>> and creates Wikidata items for newly published papers).
>>>> - if they have ways to extract statements from their publication
>>>> corpus, it would be good if they would let us/ContentMine/StrepHit
>>>> etc. know, so we could discuss how to move this forward.
>>>> d.
>>>>
>>>> On Wed, Nov 2, 2016 at 1:42 PM, Dario Taraborelli
>>>> <dtarabore...@wikimedia.org> wrote:
>>>> > I'm at the Crossref LIVE 16 event in London where I just gave a
>>>> > presentation on WikiCite and Wikidata targeted at scholarly publishers.
>>>> >
>>>> > Beside Crossref and Datacite people, I talked to a bunch of folks
>>>> > interested in collaborating on Wikidata integration, particularly from
>>>> > PLOS, Hindawi and Springer Nature. I started an interesting discussion
>>>> > with Andrew Smeall, who runs strategic projects at Hindawi, and I
>>>> > wanted to open it up to everyone on the lists.
>>>> >
>>>> > Andrew asked me if – aside from efforts like ContentMine and StrepHit –
>>>> > there are any recommendations for publishers (especially OA publishers)
>>>> > to mark up their contents and facilitate information extraction and
>>>> > entity matching or even push triples to Wikidata to be considered for
>>>> > ingestion.
>>>> >
>>>> > I don't think we have a recommended workflow for data providers for
>>>> > facilitating triple suggestions to Wikidata, other than leveraging the
>>>> > Primary Sources Tool. However, aligning keywords and terms with the
>>>> > corresponding Wikidata items via ID mapping sounds like a good first
>>>> > step. I pointed Andrew to Mix'n'Match as a handy way of mapping
>>>> > identifiers, but if you have other ideas on how to best support 2-way
>>>> > integration of Wikidata with scholarly contents, please chime in.
>>>> >
>>>> > Dario
>>>> >
>>>> > --
>>>> > Dario Taraborelli  Head of Research, Wikimedia Foundation
>>>> > wikimediafoundation.org • nitens.org • @readermeter
>>>> >
>>>> > --
>>>> > WikiCite 2016 – May 25-26, 2016, Berlin
>>>> > Meta: https://meta.wikimedia.org/wiki/WikiCite_2016
>>>> > Twitter: https://twitter.com/wikicite16
>>>> > ---
>>>> > You received this message because you are subscribed to the Google
>>>> > Groups "wikicite-discuss" group.
>>>> > To unsubscribe from this group and stop receiving emails from it,
>>>> > send an email to wikicite-discuss+unsubscr...@wikimedia.org.
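[Editor's note: Daniel's bot idea above, polling CrossRef daily for newly indexed papers, can be sketched against the public CrossRef REST API `/works` endpoint with a `from-index-date` filter. A minimal sketch; paging via cursors, error handling, and the actual Wikidata item creation are left out, and the date shown is just an example.]

```python
# Sketch of the daily CrossRef polling step for the proposed bot.
# Only query construction and a fetch helper are shown; creating the
# corresponding Wikidata items is out of scope here.
import json
import urllib.parse
import urllib.request

def crossref_works_url(from_index_date, rows=20):
    """Build a CrossRef /works query for items indexed since a given date."""
    params = urllib.parse.urlencode({
        "filter": f"from-index-date:{from_index_date}",
        "rows": rows,
    })
    return f"https://api.crossref.org/works?{params}"

def fetch_new_dois(from_index_date, rows=20):
    """Return DOIs of recently indexed works (network access required)."""
    with urllib.request.urlopen(crossref_works_url(from_index_date, rows)) as resp:
        message = json.load(resp)["message"]
    return [item["DOI"] for item in message["items"]]

print(crossref_works_url("2016-11-10"))
```

Each returned DOI would then be checked against existing Wikidata items before a new item is created, to avoid duplicates.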
>>
>> --
>> *Dario Taraborelli *Head of Research, Wikimedia Foundation
>> wikimediafoundation.org • nitens.org • @readermeter
>> <http://twitter.com/readermeter>
>
> --
> Andrew Smeall
> Head of Strategic Projects
>
> Hindawi Publishing Corporation
> Kirkman House
> 12-14 Whitfield Street, 3rd Floor
> London, W1T 2RF
> United Kingdom
_______________________________________________
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata