Re: [Wikidata-l] WikiData Categories
On 04/07/14 14:49, Magnus Manske wrote:

On Fri, Jul 4, 2014 at 1:40 PM, Scott MacLeod <worlduniversityandsch...@gmail.com> wrote: Jane, Lydia and WikiDatans, these are great and helpful developments, which seem to be quite far along now. Jane and WikiDatans, can you point to similar helpful examples that would distinguish what one can extract from WikiData Categories with Magnus' Reasonator tool from what one can 'extract' with Semantic MediaWiki?

Can everyone please stop with the categories? Wikidata has items and properties; I assume you mean properties here. As for tools to get to the data:

* Reasonator [1] is for viewing a single item and seeing related items
* WDQ [2] is for machine-readable querying of Wikidata; basically, what SPARQL does on SMW
* Autolist [3] is for getting clickable results from WDQ, intersecting results with Wikipedia (!) categories, and semi-automated editing

Well, and of course some items are used as classes, which might be somewhat related to categories (in one of their many uses). For an overview of these, see http://tools.wmflabs.org/wikidata-exports/miga/?classes To find instances of a particular class, you can then use the tools Magnus already mentioned.

Cheers, Markus
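For readers who have not used WDQ before, here is a minimal sketch of calling it from Python. The endpoint, the query syntax (P31 = Q515 "city", with subclasses followed via tree[...]), and the shape of the JSON response are written down from memory and should be treated as assumptions; check the WDQ documentation before relying on them.

import json
import urllib.parse
import urllib.request

# WDQ query: all items whose P31 (instance of) points to Q515 (city)
# or to any transitive P279 (subclass of) descendant of it.
query = "claim[31:(tree[515][][279])]"
url = "https://wdq.wmflabs.org/api?q=" + urllib.parse.quote(query)

with urllib.request.urlopen(url) as response:
    result = json.loads(response.read().decode("utf-8"))

# WDQ returns plain numeric item ids, e.g. 64 for Q64 (Berlin).
print(result["items"][:10])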
Re: [Wikidata-l] Finding image URL from Commons image name
Brilliant, thanks for the useful and informative answers :-)

Markus

On 03/07/14 07:21, Legoktm wrote: And there's an API module for this too: https://commons.wikimedia.org/w/api.php?action=query&titles=File:Albert%20Einstein%20Head.jpg&prop=imageinfo&iiprop=url&format=jsonfm :) -- Legoktm

On 7/2/14, 1:59 PM, Liangent wrote: Also there are Special:FilePath and thumb.php. I'm not sure how this affects caching, though. http://commons.wikimedia.org/wiki/Special:FilePath/Example.svg http://commons.wikimedia.org/w/thumb.php?f=Example.svg&w=420 -Liangent

On Jul 3, 2014 4:50 AM, Emilio J. Rodríguez-Posada <emi...@gmail.com> wrote: Hello Markus; the URL of a Commons image is built like this: https://upload.wikimedia.org/wikipedia/commons/x/xy/File_name.ext where x and xy are the first character and the first two characters, respectively, of the md5sum of the file name (after replacing spaces with _). For a 200px thumb: https://upload.wikimedia.org/wikipedia/commons/thumb/x/xy/File_name.ext/200px-File_name.ext SVG files are a special case: .png is appended to the thumbnail name, giving .ext.png. For SVG files it does no harm to request big thumb sizes, but when the file is a JPG, don't try to generate a thumb bigger than the original file or you will get a beautiful error. Regards

2014-07-02 22:33 GMT+02:00 Markus Krötzsch <mar...@semantic-mediawiki.org>: Dear Wikidatarians, from Commons media properties I get the string name of a file on Commons. I can easily use it to build a link to the Commons page for that image. * But how do I get the raw image URL? * And can I also get the raw URL of a small-scale (thumbnail) image? I would like to beautify my Wikidata applications to show some images. I know this is more of a general MediaWiki question, but it is much more relevant in Wikidata, so I am posting it here first. I guess somebody has already solved this, since we have images in various Wikidata-based applications and gadgets. Thanks, Markus
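Putting Emilio's scheme into code, here is a minimal sketch in Python. The function name is made up for illustration, and proper percent-encoding of unusual characters in file names is left out; the hashing scheme itself is the one described above.

import hashlib

def commons_image_urls(file_name, thumb_width=200):
    """Return (raw URL, thumbnail URL) for a Commons file name."""
    name = file_name.replace(" ", "_")
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()
    base = "https://upload.wikimedia.org/wikipedia/commons"
    raw = "%s/%s/%s/%s" % (base, digest[0], digest[:2], name)
    # Thumbnails of SVG files are rendered as PNG, so ".png" is appended.
    thumb_file = name + (".png" if name.lower().endswith(".svg") else "")
    thumb = "%s/thumb/%s/%s/%s/%dpx-%s" % (
        base, digest[0], digest[:2], name, thumb_width, thumb_file)
    return raw, thumb

print(commons_image_urls("Albert Einstein Head.jpg"))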
Re: [Wikidata-l] Wikidata just got 10 times easier to use
On 02/07/14 16:29, David Cuenca wrote:

On Tue, Jul 1, 2014 at 11:07 PM, Markus Krötzsch <mar...@semantic-mediawiki.org> wrote: My hope is that with my other suggestion (using P31 values as features to correlate with), the property suggester will already be able to outperform my little toy algorithm anyway. One could also combine the two (my algorithm is really simple [1]), but maybe this is not needed.

Interesting. That could also help to identify values with a high deviation, and perhaps even do a better job than some template constraints. I was trying to check more classes, but the server seems to have trouble: Error: could not load file 'classes/Classes.csv' http://tools.wmflabs.org/wikidata-exports/miga/?classes#_cat=Classes/Id=Q2087181

Strange. It works for me. But we had some temporary service problems at WMF Labs recently, so maybe this was an aftermath of those. In any case, I should update the software -- Yaron has further improved Miga to lower the initial load times significantly. I'll send another email when I have new code/new data there.

Anyhow, many thanks for working on this.

My pleasure. :-)

Markus
Re: [Wikidata-l] Wikidata just got 10 times easier to use
On 01/07/14 21:47, Lydia Pintscher wrote:

On Tue, Jul 1, 2014 at 9:44 PM, Andy Mabbett <a...@pigsonthewing.org.uk> wrote: On 1 July 2014 20:20, Lydia Pintscher <lydia.pintsc...@wikimedia.de> wrote: We have just deployed the entity suggester. It helps you by suggesting properties: when you now add a new statement to an item, it will suggest what should most likely be added to that item. One example: you are on an item about a person, but it doesn't have a date of birth yet. Since a lot of other items about persons have a date of birth, it will suggest that you also add one to this item.

This is a great idea, but I've just tried it on Q4810979 (about an historic building) and it prompted me for a date of birth, gender, taxon rank or taxon name. Teething troubles?

We still need to tweak it a bit here and there, yeah. We're working on that right now. Also, it will get smarter as more statements are added to items.

I hope tweaking will suffice. At least it seems that there is already enough data to find slightly more related "related properties" ;-). Here is the list of properties that I get for the two classes of Q4810979 (recall that I compute related properties for each class).

(1) historic house museum http://tools.wmflabs.org/wikidata-exports/miga/?classes#_cat=Classes/Id=Q2087181 -- Related properties: English Heritage list number, OS grid reference, owned by, inspired by, coordinate location, visitors per year, Commons category, architect, mother house, manager/director, country, commissioned by, architectural style, MusicBrainz place ID, use, date of foundation or creation, street

(2) Grade I listed building http://tools.wmflabs.org/wikidata-exports/miga/?classes#_cat=Classes/Id=Q15700818 -- Related properties: English Heritage list number, masts, Minor Planet Center observatory code, home port, coordinate location, OS grid reference, mother house, architect, manager/director, Emporis ID, MusicBrainz place ID, country, architectural style, visitors per year, Commons category, Structurae ID (structure), officially opened by, floors above ground, inspired by, religious order, number of platforms, street, owned by, diocese

These are computed fully automatically from the data, with no manual filtering or user input. But don't get me wrong -- great work! It is brilliant to have such a thing integrated into the UI. In any case, my algorithm for computing the related properties is certainly very different from theirs; I am sure it also has its glitches.

Cheers, Markus
Re: [Wikidata-l] Wikidata just got 10 times easier to use
On 01/07/14 22:14, Markus Krötzsch wrote: [...] These are computed fully automatically from the data, with no manual filtering or user input. [...]

P.S. One weakness of my algorithm you can already see: it has trouble estimating the relevance of very rare properties, such as "Minor Planet Center observatory code" above. A single wrong annotation may then lead to wrong suggestions. Also, it seems from my list under (2) that some Grade I listed buildings are ships. This appears to be an error that is amplified by the fact that the property "masts" is used only 11 times in the dataset I evaluated (last week's data). I guess the new property suggester rather errs on the other side, being tricked into suggesting very frequent properties even in places that don't need them. -- Markus
Re: [Wikidata-l] Wikidata just got 10 times easier to use
On 01/07/14 22:43, Bene* wrote:

On 01.07.2014 22:23, Markus Krötzsch wrote: P.S. One weakness of my algorithm you can already see: [...]

However, it is obviously better if the algorithm performs well for frequently used properties. Isn't it possible to combine those two systems so they improve each other? One could check how often the property is used and then rely on Markus' or the students' algorithm.

My hope is that with my other suggestion (using P31 values as features to correlate with), the property suggester will already be able to outperform my little toy algorithm anyway. One could also combine the two (my algorithm is really simple [1]), but maybe this is not needed.

Cheers, Markus

[1] For each class C and property P, I count:
* #C: the number of items in class C
* #P: the number of items using property P
* #PC: the number of items in class C using property P
* #items: the total number of items

Then I compute two rates:
* rateCP = #PC / #C (the fraction of items in the class that have the property)
* rateP = #P / #items (the fraction of all items that have the property)

I then rank the properties for each class by the ratio rateCP/rateP (intuitively: by what factor does the rate of P increase for items in C?). Moreover, I apply two sigmoid functions [2] to the rates as additional factors, so as to make properties less relevant if they have very high or very low values for the rates. I don't care about things that almost everything/almost nothing has. Obviously, one can tweak this if one wants to include properties that almost everything has anyway.

[2] https://www.google.com/search?sclient=psy-ab&q=1+%2F+%281+%2B+exp%286+*+%28-2+*+x+%2B+0.5%29%29%29&btnG=
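Since the description in [1] is compact, here is a minimal sketch of the ranking in Python. The function names are made up, and the exact way the two sigmoid factors enter (one damping low rateCP, one damping high rateP) is my reading of the description above, not a copy of the actual implementation.

import math

def squash(x):
    # The sigmoid from [2]: close to 0 for small x, close to 1 for large x.
    return 1.0 / (1.0 + math.exp(6 * (-2 * x + 0.5)))

def rank_properties(n_c, n_p, n_pc, n_items):
    """Rank properties for one class C.

    n_c: #C, n_p: {P: #P}, n_pc: {P: #PC}, n_items: #items."""
    scores = {}
    for p, pc in n_pc.items():
        rate_cp = pc / n_c           # fraction of C-items using P
        rate_p = n_p[p] / n_items    # fraction of all items using P
        # Base score: by what factor is P more frequent within C than
        # overall? The two squashing factors suppress properties that
        # almost nothing in C has, or that almost everything has anyway.
        scores[p] = (rate_cp / rate_p) * squash(rate_cp) * squash(1 - rate_p)
    return sorted(scores, key=scores.get, reverse=True)

# Toy example with hypothetical counts (Python 3):
print(rank_properties(
    n_c=100, n_p={"P17": 900000, "P625": 400000},
    n_pc={"P17": 95, "P625": 80}, n_items=1000000))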
[Wikidata-l] Fwd: Babelfy: Word Sense Disambiguation and Entity Linking Together!
FYI: this project claims to use Wikidata (among other resources) for multilingual word-sense disambiguation. It is one of the first third-party uses of Wikidata that I am aware of (but other pointers are welcome if you have them). Wiktionary and OmegaWiki are also mentioned here. Cheers, Markus

-------- Original Message --------
Subject: Babelfy: Word Sense Disambiguation and Entity Linking Together!
Resent-Date: Mon, 16 Jun 2014 10:34:07 +0000
Resent-From: semantic-...@w3.org
Date: Mon, 16 Jun 2014 09:43:12 +0200
From: Andrea Moro <andrea8m...@gmail.com>
To: undisclosed-recipients:;

== Babelfy: Word Sense Disambiguation and Entity Linking together! http://babelfy.org ==

As an output of the MultiJEDI Starting Grant (http://multijedi.org), funded by the European Research Council and headed by Prof. Roberto Navigli, the Linguistic Computing Laboratory (http://lcl.uniroma1.it) of the Sapienza University of Rome is proud to announce the first release of Babelfy (http://babelfy.org).

Babelfy [1] is a joint, unified approach to Word Sense Disambiguation and Entity Linking for arbitrary languages. The approach is based on a loose identification of candidate meanings coupled with a densest-subgraph heuristic which selects high-coherence semantic interpretations. Its performance on both disambiguation and entity linking tasks is on a par with, or surpasses, that of task-specific state-of-the-art systems.

Babelfy draws primarily on BabelNet (http://babelnet.org), a very large encyclopedic dictionary and semantic network. BabelNet 2.5 covers 50 languages and provides both lexicographic and encyclopedic knowledge for all the open-class parts of speech, thanks to the seamless integration of WordNet, Wikipedia, Wiktionary, OmegaWiki, Wikidata and the Open Multilingual WordNet.

Features in Babelfy:
* 50 languages covered!
* Available via easy-to-use Java APIs.
* Disambiguation and entity linking are performed using BabelNet, thereby implicitly annotating according to several different inventories such as WordNet, Wikipedia, OmegaWiki, etc.

Babelfy the world (be there and get a free BabelNet t-shirt!):
* Monday, June 23 - ACL 2014 (Baltimore, MD, USA) - TACL paper presentation http://www.transacl.org/wp-content/uploads/2014/05/54.pdf
* Tuesday, August 19 - ECAI 2014 (Prague, Czech Republic) - Multilingual Semantic Processing with BabelNet http://www.ecai2014.org/tutorials/
* Sunday, August 24 - COLING 2014 (Dublin, Ireland) - Multilingual Word Sense Disambiguation and Entity Linking http://www.coling-2014.org/tutorials.php

[1] Andrea Moro, Alessandro Raganato, Roberto Navigli. Entity Linking meets Word Sense Disambiguation: a Unified Approach. Transactions of the Association for Computational Linguistics (TACL), 2, pp. 231-244 (2014). http://www.transacl.org/wp-content/uploads/2014/05/54.pdf
Re: [Wikidata-l] Wikidata RDF exports
Eric,

Two general remarks first:

(1) Protege is for small and medium ontologies, but not really for such large datasets. To get SPARQL support for the whole data, you could install Virtuoso. It also comes with a simple Web query UI. Virtuoso does not do much reasoning, but you can use SPARQL 1.1 transitive closure in queries (using * after properties), so you can find all subclasses there too. (You could also try this in Protege ...)

(2) If you want to explore the class hierarchy, you can also try our new class browser: http://tools.wmflabs.org/wikidata-exports/miga/?classes It has the whole class hierarchy, but without the leaves (= instances of classes + subclasses that have no own subclasses/instances). For example, it tells you that lepton has 5 direct subclasses, but shows only one: http://tools.wmflabs.org/wikidata-exports/miga/?classes#_item=3338 On the other hand, it includes relationships of classes and properties that are not part of the RDF (we extract this from the data by considering co-occurrence). Example: classes that have no superclasses but at least 10 instances, and which are often used with the property 'sex or gender': http://tools.wmflabs.org/wikidata-exports/miga/?classes#_cat=Classes/Direct%20superclasses=__null/Number%20of%20direct%20instances=10%20-%202/Related%20properties=sex%20or%20gender I have already added superclasses for some of those in Wikidata now -- the data in the browser is updated with some delay, based on dump files.

More answers below:

On 14/06/14 05:52, emw wrote: Markus, thank you very much for this. Translating Wikidata into the language of the Semantic Web is important. Being able to explore the Wikidata taxonomy [1] by doing SPARQL queries in Protege [2] (even primitive queries) is really neat, e.g.

SELECT ?subject WHERE { ?subject rdfs:subClassOf <http://www.wikidata.org/entity/Q82586> . }

This is more of an issue of my ignorance of Protege, but I notice that the above query returns only the direct subclasses of Q82586. The full set of subclasses for Q82586 (lepton) is visible at http://tools.wmflabs.org/wikidata-todo/tree.html?q=Q82586&rp=279&lang=en -- a few of the 2nd-level subclasses (muon neutrino, tau neutrino, electron neutrino) are shown there but not returned by that SPARQL query. It seems rdfs:subClassOf isn't being treated as a transitive property in Protege. Any ideas?

You need a reasoner to compute this properly. For a plain class hierarchy as in our case, ELK should be a good choice [1]. You can install the ELK Protege plugin and use it to classify the ontology [2]. Protege will then show the computed class hierarchy in the browser; I am not sure what happens to the SPARQL queries (it is quite possible that they don't use the reasoner).

[1] https://code.google.com/p/elk-reasoner/
[2] https://code.google.com/p/elk-reasoner/wiki/ElkProtege

Do you know when the taxonomy data in OWL will have labels available?

We had not thought of this as a use case. A challenge is that the label data is quite big because of the many languages. Should we maybe create an English label file for the classes? Descriptions too, or just labels?

Also, regarding the complete dumps, would it be possible to export a smaller subset of the faithful data? The files under Complete Data Dumps in http://tools.wmflabs.org/wikidata-exports/rdf/exports/20140526/ look too big to load into Protege on most personal computers, and would likely require adjusting JVM settings on higher-end computers to load.
If it's feasible to somehow prune those files -- and maybe even combine them into one file that could be easily loaded into Protege -- that would be especially nice.

What kind of pruning do you have in mind? You can of course take a subset of the data, but then some of the data will be missing.

A general remark on mixing and matching RDF files. We use N3 format, where every line in the ontology is self-contained (no multi-line constructs, no header, no namespaces). Therefore, any subset of the lines of any of our files is still a valid file. So if you want to have only a slice of the data (maybe to experiment with), then you could simply do something like:

gunzip -c wikidata-statements.nt.gz | head -10000 > partial-data.nt

head simply selects the first 10000 lines here. You could also use grep to select specific triples instead, such as:

zgrep "http://www.w3.org/2000/01/rdf-schema#label" wikidata-terms.nt.gz | grep '@en .' > en-labels.nt

This selects all English labels. I am using zgrep here for a change; you can also use gunzip as above. Similar methods can also be used to count things in the ontology (use grep -c to count lines = triples). Finally, you can combine multiple files into one by simply concatenating them in any order:

cat partial-data-1.nt > mydata.nt
cat partial-data-2.nt >> mydata.nt
...

Maybe you can experiment a bit and let us know if there is any export that would be particularly useful.
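If you would rather avoid Protege for the transitive query, the same subclass closure can be computed over one of these N-Triples slices with a SPARQL 1.1 property path. Here is a minimal sketch using Python and rdflib (any engine with SPARQL 1.1 support, such as Virtuoso, works the same way); the file name is just an example, and rdflib 4.x or later is assumed for property-path support.

from rdflib import Graph

g = Graph()
# A (possibly pruned) slice of the taxonomy export in N-Triples format.
g.parse("wikidata-taxonomy.nt", format="nt")

# "*" makes rdfs:subClassOf transitive, so indirect subclasses of
# Q82586 (lepton) are returned as well.
query = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?subject WHERE {
  ?subject rdfs:subClassOf* <http://www.wikidata.org/entity/Q82586> .
}
"""
for row in g.query(query):
    print(row.subject)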
Re: [Wikidata-l] Wikidata RDF exports
Hi Gerard,

On 13/06/14 11:08, Gerard Meijssen wrote: Hoi, when you leave out qualifiers, you will find that Ronald Reagan was never president of the United States and only an actor. Yes, omitting the statements with qualifiers is wrong, but as a consequence the total of the information is wrong as well. I do not see the point of this functionality. It is wrong any way I look at it. Without qualifiers the information is wrong. Without statements the information is wrong, and without the items involved the information is incomplete and wrong. As I see it you cannot win. Including this type of RDF export produces something that I fail to see serves any purpose -- or the purpose is just that you can.

Surely, Wikidata will never be complete. There will always be some statements missing. If we were to follow your reasoning, the data would therefore never be of any use. I think this is a bit drastic. Anyway, why argue? If you don't like the simplified exports, just use the full ones. We clearly say that simplified is not faithful, and we have detailed documentation about what is in each of the files. So it does not seem likely that people will be confused.

Best regards, Markus
Re: [Wikidata-l] Wikidata RDF exports
Hi Gerard,

As I said, I don't follow your arguments. Wikidata Query, for example, also started without any qualifiers at all, and yet it was a useful tool from the beginning. Your feedback is always welcome, but there is a point when critique is no longer constructive, and when it is best to agree to disagree. I think we have reached that point.

Markus

On 13/06/14 12:37, Gerard Meijssen wrote: Hoi, there is a huge difference between being complete and leaving out essential information. When you consider Ronald Reagan [1], it is essential information that he was a president of the USA and a governor of California. When you only make him an actor and a politician, the information you are left with gives the impression that he is more relevant as an actor. You brought attention to new functionality that is essentially broken. It does not give a fair impression of the Wikidata content. I have been arguing against overly referring to academic tools and standards. For me this announcement is yet another pointer that many of the tools are overrated and only have an academic relevance. Thanks, GerardM

[1] http://tools.wmflabs.org/reasonator/?q=9960

On 13 June 2014 11:41, Markus Krötzsch <mar...@semantic-mediawiki.org> wrote: [...]
Re: [Wikidata-l] Wikidata RDF exports
On 13/06/14 15:52, Bene* wrote: ... Did I understand you right, Markus, that you leave out all statements which have at least one qualifier? Wouldn't it make more sense to leave out the qualifiers only, but add the statements without qualifiers anyway? This would solve, e.g., Gerard's problem with Ronald Reagan.

But it would introduce other problems. Qualifiers are often used with time information, for example to record many historic population figures of one town. If you just leave out the qualifiers, you get many different population numbers that cannot be distinguished. Simply put:

* Leaving out statements makes the export incomplete (as Wikidata always is, just to a larger degree).
* Leaving out qualifiers makes the export incorrect (since it replaces statements by different statements that may or may not hold true).

We could do both and let the users choose what they find more acceptable (if any), but we started with the first approach. If someone says they need the second approach for their application to work, we could implement it, but I'd rather wait to see if anybody wants this.

Best, Markus
Re: [Wikidata-l] Wikidata RDF exports
Gerard,

You sometimes sound as if everything is lost just because somebody put an RDF file on the Web ;-) If you don't like the simplified export, why don't you just use our main export, which contains all the data? Can't we all be happy -- the people who want simple and the people who want complete?

Cheers, Markus
[Wikidata-l] New Wikidata classes browser (and updated property browser)
Hi all,

I have extended our new interactive property browser with a class browser and updated everything to the latest data. You can use these new services to answer questions like:

(1) What kinds of things do we actually have on Wikidata? (show all classes with more than 100 instances: humans, cities, galaxies, computer games, earthquakes, ... overall a very clear and readable list of our main subjects)
(2) Which properties are typically used on lighthouses (Q39715)? (or any other class)
(3) Which types of things have a patron saint (P417)? (or any other property)
(4) What are the most used string properties on Wikidata?
(5) Which properties are often used in qualifiers?
(6) Which properties are often used in statements that have qualifiers?
(7) Which properties are not used at all?
(8) What are the (direct and indirect) superclasses of ninja (Q9402)? (there are expected things like profession and soldier, but also thermodynamic process, which is probably not intended, although there might be some truth to it)
(9) What are the most used classes that do not have a superclass?
(10) What are the classes with the most subclasses? (some have almost 2000 subclasses!)

And many more ... I could play with it all day :-)

We now offer two datasets: one with properties and classes, and one with properties only. The one with classes takes longer to load (about 10 min on my machine) but is very fast after that. The other one is faster to load and can still answer questions like (4), (5), (6), (7). Here are the links:

* Classes+properties: http://tools.wmflabs.org/wikidata-exports/miga/?classes# (be patient, it will be fast once loaded)
* Properties only: http://tools.wmflabs.org/wikidata-exports/miga/

Each dataset has an about page with some example queries. To see additional related properties and classes, scroll down to the bottom of the pages of individual properties or classes.

Known limitations:
* We leave out classes without instances and subclasses, to make the dataset smaller (it is large enough as it is).
* Some classes are shown without labels. These are usually things that should not be classes anyway (filter them out by narrowing your search to things with more than 10 instances). In any case, this only happens to things that have no superclass. Maybe I will fix this in the future.

Feedback is welcome.

Cheers, Markus

On 11/06/14 14:36, Markus Krötzsch wrote: Hi all, we have prepared a new browser for Wikidata properties: http://tools.wmflabs.org/wikidata-exports/miga/ It is based on the Miga data browser [1]. This means it only works in Google Chrome/Chromium, Opera, Safari, and the Android Browser, but not in Internet Explorer, Firefox, and Rekonq. You can browse properties by datatype or usage numbers, and also find related properties for every property (using my own custom notion of relatedness based on relative co-occurrence and overall prevalence). When filtering by usage numbers, you sometimes see only very coarse filters, but you can always apply the same filter again to get more fine-grained steps. You can also edit the URL in the browser to modify filters if you cannot find the right one in the UI. This is still experimental. The data is based on the dump of 26 May. The data files for Miga were created using Wikidata Toolkit [2]; I will commit the specific code in due course. Feedback is welcome.
Cheers, Markus

[1] http://migadv.com/
[2] https://www.mediawiki.org/wiki/Wikidata_Toolkit
Re: [Wikidata-l] New Wikidata classes browser (and updated property browser)
[Including Yaron, the Miga developer, who is not on this list yet]

On 12/06/14 17:21, Thomas Douillard wrote: Hi Markus, first, thanks a lot for these tools. It would be cool to include a link to the property browser in some template ('Template:P', for example, as 'Template:Q' generates a link to Reasonator). Is there a way to get the database id of some property by its number?

The internal IDs of Miga are no good for linking, since they depend on the list of items and thus might change with updates. However, the following work:

http://tools.wmflabs.org/wikidata-exports/miga/#_cat=Properties/Id=P31
http://tools.wmflabs.org/wikidata-exports/miga/?classes#_cat=Classes/Id=Q39715

Etc. A nice feature of Miga is that all queries work, even if they are not clickable through the UI. If you look at the URL you get after some clicking, it is easy to see how to change it to get other results.

Cheers, Markus
Re: [Wikidata-l] Wikidata RDF exports
On 10/06/14 22:50, Gerard Meijssen wrote: Hoi, it is stated that there are no qualifiers included. In one of the articles you write that it is to be understood that the validity of the information is dependent on the existing qualifiers. What is the value of these RDF exports with the qualifiers missing?

Our normal exports include all the qualifiers and references. Our simplified exports include only those statements that don't have qualifiers. You are right that it would lead to wrong information to leave out qualifiers.

Cheers, Markus

On 10 June 2014 10:43, Markus Kroetzsch <markus.kroetz...@tu-dresden.de> wrote: Hi all, we are now offering regular RDF dumps for the content of Wikidata: http://tools.wmflabs.org/wikidata-exports/rdf/

RDF is the Resource Description Framework of the W3C that can be used to exchange data on the Web. The Wikidata RDF exports consist of several files that contain different parts and views of the data, and which can be used independently. Details on the available exports and the RDF encoding used in each can be found in the paper Introducing Wikidata to the Linked Data Web [1]. The available RDF exports can be found in the directory http://tools.wmflabs.org/wikidata-exports/rdf/exports/ New exports are generated regularly from current data dumps of Wikidata and will appear in this directory shortly afterwards. All dump files have been generated using Wikidata Toolkit [2].

There are some important differences in comparison to earlier dumps:
* Data is split into several dump files for convenience. Pick whatever you are most interested in.
* All dumps are generated using the OpenRDF library for Java (better quality than ad hoc serialization; much slower too ;-)
* All dumps are in N3 format, the simplest RDF serialization format that there is.
* In addition to the faithful dumps, some simplified dumps are also available (one statement = one triple; no qualifiers and references).
* Links to external data sets are added to the data for Wikidata properties that point to datasets with RDF exports. That's the Linked in Linked Open Data.

Suggestions for improvements and contributions on github are welcome.

Cheers, Markus

[1] http://korrekt.org/page/Introducing_Wikidata_to_the_Linked_Data_Web
[2] https://www.mediawiki.org/wiki/Wikidata_Toolkit

--
Markus Kroetzsch
Faculty of Computer Science
Technische Universität Dresden
+49 351 463 38486
http://korrekt.org/
Re: [Wikidata-l] Wikidata:List of properties/Summary table
On 11/06/14 17:13, Derric Atzrott wrote: You might also find the new property browser helpful: http://tools.wmflabs.org/wikidata-exports/miga/ (as mentioned before, it requires one of Google Chrome, Safari, Opera, or Android Browser to work).

While an excellent list and a neat tool, it sadly isn't organised in a way that fits my needs. I just needed a simple list, organised by type of object, that I could refer back to in order to make sure that I don't miss properties for which I do have data. I am pleased, though, that your tool gives the actual names of the properties. In some of the property proposal discussions on Wikidata, the property has not actually been given the exact same name as what was proposed, which can be quite confusing when you go to use it.

Yes, I know what you mean. I'd love to integrate property group information into our view as well, but I don't know where to get this information from (other than by scraping it from the wiki page, which does not seem right). Any pointers to where these groups are managed?

Regards, Markus
Re: [Wikidata-l] Wikidata Toolkit 0.2.0 released
On 11/06/14 19:52, Maximilian Klein wrote: Excellent work Markus. Your tools are helping me to debunk bad science the world over [1]. Keep up the great work.

Thanks :-)

Max

PS. By the way, if you do Stack Overflow you may want to chime in on this purpose-built question [2].

I lost my account when my OpenID provider ClaimID stopped providing OpenIDs ... if anybody knows how to recover these ids, drop me a line :-p

Markus

[1] https://medium.com/the-physics-arxiv-blog/wikipedia-mining-algorithm-reveals-the-most-influential-people-in-35-centuries-of-human-history-ede5ef827b76
[2] http://opendata.stackexchange.com/questions/107/when-will-the-wikidata-database-be-available-for-download/

Max Klein ‽ http://notconfusing.com/

On Tue, Jun 10, 2014 at 1:35 AM, Markus Krötzsch <mar...@semantic-mediawiki.org> wrote: [...]
Re: [Wikidata-l] Wikidata:List of properties/Summary table
On 11/06/14 20:49, Bene* wrote:

On 11.06.2014 17:27, Markus Krötzsch wrote: Yes, I know what you mean. I'd love to integrate property group information into our view as well, but I don't know where to get this information from (other than by scraping it from the wiki page, which does not seem right). Any pointers to where these groups are managed?

Currently, information about properties is only stored on their talk pages, but I think the Wikidata development team is working on claims for properties, so that we can store such information in a more structured way.

I know, but, for example, P303 is listed under "animal breeds" on https://www.wikidata.org/wiki/Wikidata:List_of_properties/Summary_table whereas I cannot find this information on https://www.wikidata.org/wiki/Property_talk:P303 I wonder if this information is really anywhere else but in the code of the bot maintainer ...

Regards, Markus

However, I am not sure how soon this will be released. Regards, Bene. PS: Does anyone know the tracking bug for this?
Re: [Wikidata-l] Wikidata query feature: status and plans
On 07/06/14 00:40, Joe Filceolaire wrote: Well, they can ask. As there is no real definition of what a city is and what the limits of each city are, I'm not sure they will get a useful answer. The population of the City of London (Q23311), for instance, is only 7,375! Should we change it from 'instance of: city' to 'instance of: village'?

Side remark: in the UK, city and town are special legal statuses of settlements. This terminology is what City of London refers to. There is a clear and crisp definition of what this means, but it is not what we mean by our class city in Wikidata. In particular, it has no direct relationship to size: the largest UK towns have over 100k inhabitants. The class city is used for "relatively large and permanent human settlement[s]" [1], which does not say much (because of the vagueness of "relatively").

Maybe we should even wonder if city is a good class to use in Wikidata. Saying that something has been awarded city status in the UK (Q1867820) has a clear meaning. Saying that something is a human settlement is also rather clear. But drawing the line between village, city and town is quite tricky, and will probably never be done uniformly across the data.

Conclusion: if you are looking for, say, human settlements with more than 100k inhabitants, then you should be searching for just that (which I think is basically what you are also saying below :-).

Markus

[1] https://en.wikipedia.org/wiki/City

Even a basic query like 'people born in the Czech Republic' has problems. Should it include people born in Czechoslovakia or the Austro-Hungarian provinces of Bohemia and Moravia? To exclude these, the query needs to check not just whether the 'place of birth' of an item is 'in the administrative entity: Czech Republic' today, but whether that was true on the 'date of birth' of each of those people. This isn't to say that such queries are not useful, just to point out that real-world data is tricky. The cool thing is that we are going to have the data in Wikidata to make it theoretically feasible to drill down and get answers to these tricky questions. Once the data is there, open-licensed for anyone to use, then it is just a matter of letting loose a thousand PhDs to devise clever ways to query it. If we build it they will come! At least that is my understanding.

Joe

On Fri, Jun 6, 2014 at 9:21 PM, Jeroen De Dauw <jeroended...@gmail.com> wrote: Hey Yury, we are indeed planning to use the Ask query language for Wikidata. People will be able to define queries on dedicated query pages that contain a query entity. These query entities will represent things such as "the cities with the highest population in Europe". People will then be able to access the results for those queries via the web API and be able to embed different views on them into wiki pages. These views will be much like SMW result formats, and we might indeed be able to share code between the two projects for that. This functionality is still some way off, though. We still need to do a lot of work, such as creating a nice visual query builder. To already get something out to the users, we plan to enable more simple queries via the web API in the near future.
Cheers

--
Jeroen De Dauw - http://www.bn2vs.com
Software craftsmanship advocate
Evil software architect at Wikimedia Germany
~=[,,_,,]:3
[Wikidata-l] Wikidata Toolkit 0.2.0 released
Dear all,

I am happy to announce the second release of Wikidata Toolkit [1], the Java library for programming with Wikidata and Wikibase. This release fixes bugs and improves features of the first release (download, parse, process Wikidata exports), and it adds new components for serializing JSON and RDF exports for Wikidata. A separate announcement regarding the RDF exports will be sent shortly.

Maven users can get the library directly from Maven Central (see [1]); this is the preferred method of installation. There is also an all-in-one JAR at github [2] and of course the sources [3].

Version 0.2.0 is still in alpha. For the next release, we will focus on the following tasks:
* Faster loading of Wikibase dumps + support for the new JSON format that will be used in the dumps soon
* Support for storing and querying data after loading it
* Initial steps towards storing data in a binary format after loading it

Feedback is welcome. Developers are also invited to contribute via github.

Cheers, Markus

[1] https://www.mediawiki.org/wiki/Wikidata_Toolkit
[2] https://github.com/Wikidata/Wikidata-Toolkit/releases (you'll also need to install the third-party dependencies manually when using this)
[3] https://github.com/Wikidata/Wikidata-Toolkit/
Re: [Wikidata-l] Wikidata query feature: status and plans
On 10/06/14 11:11, Luca Martinelli wrote: We could possibly use an ad hoc item "city of the United Kingdom", subclass of city and UK administrative division, couldn't we?

Sure, that's possible. Maybe it is even necessary. I had suggested linking to "city status in the UK" -- but there is no item "town status in the UK", so one would need helper items there as well. If we need new items in either case, the class-based modelling seems nicer, since it fits into the existing class hierarchy as you suggest.

Markus

L.

On 10 Jun 2014 10:21, Markus Krötzsch <mar...@semantic-mediawiki.org> wrote: [...]
Re: [Wikidata-l] What is the point of properties?
On 29/05/14 21:04, Andrew Gray wrote: One other issue to bear in mind: it's *simple* to have properties as a separate thing. I have been following this discussion with some interest but... well, I don't think I'm particularly stupid, but most of it is completely above my head. Saying "here are items, here is a set of properties you can define relating to them, here are some notes on how to use properties" is going to get a lot more people able to contribute than if they need to start by understanding theoretical aspects of semantic relationships...

Good point. The thread has really gone off in a rather philosophical direction :-) As Jane said, examples (of places where a property should be used *and* of places where it should not be used) are definitely much more useful for helping our editors on the ground. I usually use items I know as role models, or have a look for suitable showcase items.

Markus

On 28 May 2014 09:37, Daniel Kinzler <daniel.kinz...@wikimedia.de> wrote: Key differences between properties and items:
* Properties have a data type, items don't.
* Items have sitelinks, properties don't.
* Items have statements; properties will support claims (without sources).

The software needs these constraints/guarantees to be able to take shortcuts, provide specialized UI and API functionality, etc. Yes, it would be possible to use items as properties instead of having a separate entity type. But they are structurally and functionally different, so it makes sense to have a strict separation. This makes a lot of things easier, e.g.:
* setting different permissions for properties
* mapping to RDF vocabularies

More fundamentally, they are semantically different: an item describes a concept in the real world, while a property is a structural component used for such a description. Yes, properties are similar to data items, and in some cases there may be an item representing the same concept that is represented by a property entity. I don't see why that is a problem, while I can see a lot of confusion arising from mixing them. -- daniel

On 28.05.2014 09:25, David Cuenca wrote: Since the very beginning I have kept myself busy with properties, thinking about which ones fit, which ones are missing to better describe reality, and how to integrate them into the ones that we have. The thing is that the more I work with them, the less difference I see from normal items, and if statements are soon allowed on property pages, the difference will blur even more. I can understand that from the software development point of view it might make sense to have a clear difference. Or for the community, to get a deeper understanding of the underlying concepts represented by words. But semantically I see no difference between:

cement (Q45190) emissivity (P1295) 0.54
and
cement (Q45190) emissivity (Q899670) 0.54

Am I missing something here? Are properties really needed, or are we adding unnecessary artificial constraints?

Cheers, Micru

--
Daniel Kinzler
Senior Software Developer
Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.
Re: [Wikidata-l] What is the point of properties?
On 29/05/14 12:41, Thomas Douillard wrote: @David: I think you should have a look at fuzzy logic (https://www.wikidata.org/wiki/Q224821) :)

Or at probabilistic logic, possibilistic logic, epistemic logic, ... it's endless. Let's first complete the data we are sure of before we start to discuss whether Pluto is a planet with fuzzy degree 0.6 or 0.7 ;-)

(The problem with quantitative logics is that there is usually no reference for the numbers you need there, so they are not well suited for a secondary data collection like Wikidata that relies on other sources. The closest concept that might still work is probabilistic logic, since you can really get some probabilities from published data; but even there it is hard to use the probability as a raw value without specifying very clearly what the experiment looked like.)

Markus

2014-05-29 1:48 GMT+02:00 David Cuenca <dacu...@gmail.com>: Markus,

On Thu, May 29, 2014 at 12:53 AM, Markus Krötzsch <mar...@semantic-mediawiki.org> wrote: This is an easy question once you have been clear about what human behaviour is. According to enwiki, it is a range of behaviours *exhibited by* humans.

Settled :) Let's leave it at "defined as a trait of".

What would anybody do with this data? In what application could it be of interest?

Well, our goal is to gather the whole of human knowledge, not to use it. I can think of several applications, but let's leave that open. Never underestimate human creativity ;-)

Moreover, as a great Icelandic ontologist once said: "There is definitely, definitely, definitely no logic, to human behaviour" ;-)

Definitely, that is why we spend so much time in front of flickering squares making them flicker even more. It makes total sense :P

I think constraints are already understood in this way. The name comes from databases, where a constraint violation is indeed a rather hard error. On the other hand, ironically, constraints (as a technical term) are often considered a softer form of modelling than (onto)logical axioms: a constraint can be violated, while a logical axiom (as the name suggests) is always true -- if it is not backed by the given data, new data will be inferred. So as a technical term, "constraint" is quite appropriate for the mechanism we have, although it may not be the best term to clarify the intention.

Ok, I will not fight traditional labels or conventions. I was interested in pointing out the inappropriateness of using a word inside our community with a definition that doesn't match its use, when there is another word that matches perfectly and conveys its meaning better to users.

Some important ideas like classification (instance of/subclass of) belong completely to the analytical realm. We don't observe classes, we define them. A planet is what we call a planet, and this can change even if the actual lumps in space stay pretty much the same.

Agreed. Better labels could be "defined as instance of"/"defined as subclass of".

Now inferences are slightly different. If we know that X implies Y, then if A says X we can infer that (implicitly) A says Y. That is a logical relationship (or rule) on the level of what is claimed, rather than on the level of statements. Note that we still need to have a way to find out that X implies Y, which is a content-level claim that should have its own reference somewhere. We mainly use inference in this sense with "subclass of", in Reasonator or when checking constraints.
In this case, the implications are encoded as subclass-of statements ("If X is a piano, then X is an instrument"). This allows us to have references on the implications.

Nope, nope, nope. I was not referring to hard implications, but to heuristic ones. Consider that these properties in the item namespace:

defined as a trait of
defined as having
defined as instance of

would translate as these constraints in the property namespace:

likely to be a trait of
likely to have
likely to be an instance of

In general, an interesting question here is what the status of "subclass of" really is. Do we gather this information from external sources (surely there must be a book that tells us that pianos are instruments), or do we as a community define this for Wikidata (surely, the overall hierarchy we get is hardly the universal class hierarchy of the world, but a very specific classification that is different from other classifications that may exist elsewhere)? Best not to think about it too much, and to gather sources whenever possible.
Re: [Wikidata-l] What is the point of properties?
On 29/05/14 13:53, Thomas Douillard wrote: hehe, maybe some kind of inferences can lead to a good heuristic to suggest properties and values in the entity suggester. As they naturally become softer and softer by combination of uncertainties, this could also provide some kind of limits for inferences by fixing a probability below which we don't add a fuzzy fact to the set of facts. Maybe we could fix a heuristic starting fuzziness or probability score based on 1 sourced claim - big score; one disputed claim; based on ranks and so on. Sorry, I have to expand on this a bit ... My main point was that there are many fuzzy logics (depending on the t-norm you choose) and many probabilistic logics (depending on the stochastic assumptions you make). The meaning of a score crucially depends on which logic you are in. Moreover, at least in fuzzy logic, the scores are only relevant in comparison to other scores (there is no absolute meaning to 0.3) -- therefore you need to ensure that the scores are assigned in a globally consistent way (0.3 in Wikidata would have to mean exactly the same wherever it is used). This makes it extremely hard to implement such an approach in practice in a large, distributed knowledge base like ours. What's more, you cannot find these scores in books or newspapers, so you somehow have to make them up in another way. You suggested using this for statements that are not generally accepted, but how do you measure how disputed a statement is? If two thirds of references are for it and the rest is against it, do you assign 0.66 as a score? It's very tricky. Fuzzy logic has its main use in fuzzy control (the famous washing machine example), which is completely different and largely unrelated to fuzzy knowledge representation. In knowledge representation, fuzzy approaches are also studied, but their application is usually in a closed system (e.g., if you have one system that extracts data from a text and assigns certainties to all extracted facts in the same way). It's still unclear how to choose the right logic, but at least it will give you a uniform treatment of your data according to some fixed principles (whether they make sense or not). The situation is much clearer in probabilistic logics, where you define your assumptions first (e.g., you assume that events are independent or that dependencies are captured in some specific way). This makes it more rigorous, but also harder to apply, since in practice these assumptions rarely hold. This is somewhat tolerable if you have a rather uniform data set (e.g., a lot of sensor measurements that give you some probability for actual states of the underlying system). But if you have a huge, open, cross-domain system like Wikidata, it would be almost impossible to force it into a particular probability framework where 0.3 really means in 30% of all cases. Also note that scientific probability is always a limit of observed frequencies. It says: if you do something again and again, this is the rate you will get. Often-heard statements like We have an 80% chance to succeed! or Chances are almost zero that the Earth will blow up tomorrow! are scientifically pointless, since you cannot repeat the experiments that they claim to make statements about. 
Many things we have in Wikidata are much more on the level of such general statements than on the level that you normally use probability for (good example of a proper use of probability: based on the tests that we did so far, this patient has a 35% chance of having cancer -- these are not the things we normally have in Wikidata). Markus 2014-05-29 13:43 GMT+02:00 Markus Krötzsch mar...@semantic-mediawiki.org mailto:mar...@semantic-mediawiki.org: On 29/05/14 12:41, Thomas Douillard wrote: @David: I think you should have a look at fuzzy logic https://www.wikidata.org/wiki/Q224821 :) Or at probabilistic logic, possibilistic logic, epistemic logic, ... it's endless. Let's first complete the data we are sure of before we start to discuss whether Pluto is a planet with fuzzy degree 0.6 or 0.7 ;-) (The problem with quantitative logics is that there is usually no reference for the numbers you need there, so they are not well suited for a secondary data collection like Wikidata that relies on other sources. The closest concept that still might work is probabilistic logic, since you can really get some probabilities from published data; but even there it is hard to use the probability as a raw value without specifying very clearly what the experiment looked like.) Markus ___ Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l
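To see why the choice of t-norm matters: in a fuzzy logic, the degree of a conjunction is computed by a t-norm, and the three standard textbook choices already disagree about what a score means (a worked aside; these definitions are standard, not taken from the thread):

\[ x \otimes_{\mathrm{G}} y = \min(x, y), \qquad x \otimes_{\mathrm{P}} y = x \cdot y, \qquad x \otimes_{\mathrm{L}} y = \max(0,\ x + y - 1) \]

(the Gödel, product, and Łukasiewicz t-norms, respectively). Combining two claims that each hold with degree 0.7 yields 0.7, 0.49, or 0.4 depending on the choice, so a stored score such as 0.7 carries no meaning unless the whole knowledge base commits to one logic -- which is exactly the global-consistency problem described above.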
[Wikidata-l] What data should be in Wikidata? (Was: What is the point of properties?)
David, I need to answer your first assertion separately: On 29/05/14 01:48, David Cuenca wrote: Well, our goal is to gather the whole human knowledge, not to use it. No, that is really not the case. Our goal is to gather carefully selected parts of the human knowledge. Our community defines what these parts are. Just like in Wikipedia. Even if you wanted to gather all human knowledge this goal would not be a useful principle for deciding what to do first. For example, we know that every natural number is an element of the natural numbers. It is obviously not our goal to gather these infinitely many statements (if you disagree, you could try to propose a bot that starts to import this data ;-). Therefore, it is clear that gathering *all* knowledge is not even an abstract ideal of our community. Quite the contrary: we explicitly don't want it. The natural numbers are just an extreme example. Many other cases exist (for instance, we do not import all free databases into Wikidata, although they are finite). The question then is: How do we know what data we want and what data we don't want? What principles do we base our decision on? For me, there are two main principles: * practical utility (does it serve a purpose that we care about?) * simplicity and clarity (is it natural to express and easy to understand?) You said that we cannot foresee *all* applications, but that does not mean that we should start to create data for which we cannot foresee *any*. There is just too much data of the latter kind, and we need to make a choice. Don't get me wrong: I consider myself an inclusionist. Better to have some useless data than to miss some important content. But there is no neutral ground here -- we all must draw a line somewhere (or start writing the natural number import bot ;-). My position is: if we have data that is very hard to capture and at the same time has no conceivable use, then we should not spend our energy on it while there is so much clearly defined, important data that we are still missing. Markus ___ Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l
Re: [Wikidata-l] What is the point of properties?
The other answers, under the original subject: On 29/05/14 01:48, David Cuenca wrote: Settled :) Let's leave it at defined as a trait of I don't think it is very clear what the intention of this property is. What are the limits of its use? What is it meant to do? Can behaviour really be a trait of a species? If we allow it here, it seems to apply to all kinds of connections: density/car? eternity/time? time/reality? evil/devil? rigour/science? -- this is opening a can of worms. It will be hard to maintain this. Wikiuser13 recently added consists of: Neptune to Q1. It was fixed. But it is a good example of the kind of confusion that comes from such general ontological (in the philosophical sense) properties. And consists of is still very simple compared to defined as a trait of. Can't we focus on more obvious things like has social network account for a while? ;-) ... Some important ideas like classification (instance of/subclass of) belong completely to the analytical realm. We don't observe classes, we define them. A planet is what we call a planet, and this can change even if the actual lumps in space are pretty much the same. Agreed. Better labels could be defined as instance of/defined as subclass of I don't think this is better. The short names are fine. As I explained in my email, Wikidata statements are mainly about what the external references say. The distinction between defined and observed is not on the surface of this. The main question is Did the reference say that pianos are instruments? but not Did the reference say pianos are instruments because of the definition of 'piano'? Therefore, we don't need to put this information in our labels. Now inferences are slightly different. If we know that X implies Y, then if A says X we can infer that (implicitly) A says Y. That is a logical relationship (or rule) on the level of what is claimed, rather than on the level of statements. Note that we still need to have a way to find out that X implies Y, which is a content-level claim that should have its own reference somewhere. We mainly use inference in this sense with subclass of in reasonator or when checking constraints. In this case, the implications are encoded as subclass-of statements (If X is a piano, then X is an instrument). This allows us to have references on the implications. Nope, nope, nope. I was not referring to hard implications, but to heuristic ones. Consider that these properties in the item namespace: defined as a trait of, defined as having, defined as instance of would translate as these constraints in the property namespace: likely to be a trait of, likely to have, likely to be an instance of. I think you might have misunderstood my email. I was arguing *in favour* of soft constraints, but in the paragraph before the one about inferences that you reply to here. Inferences are hard ways for obtaining new knowledge from our own definitions. Example: If X is the father of Y according to reference A Then Y is the child of X according to reference A This is as hard as it can get. We are absolutely sure of this since this rule just explains the relationship between two different ways we have for encoding family relationships. Below, you said expectations inferred from definitions should not be treated as hard constraints -- maybe this mixture of terms indicates that I have not been clear enough about the distinction between inference and constraint. They are really completely different ways of looking at things. 
Inferences are something that adds (inevitable) conclusions to your knowledge, while constraints just tell you what to check for. If you accept the premises of an inference and the inference rule, then you must also accept the conclusion -- there is no soft way of reading this. To make it soft, you can start to formalise softness in your knowledge, using fuzzy logic or whatnot (see my other email with Thomas). I don't think we can use soft inferences (in the sense of fuzzy logic et al.) but I am in favour of soft constraints (in the sense of your expectations). I guess we agree on all of this, but have a bit of trouble in making ourselves clear :-) But it is rather subtle material after all. In general, an interesting question here is what the status of subclass of really is. Do we gather this information from external sources (surely there must be a book that tells us that pianos are instruments) or do we as a community define this for Wikidata (surely, the overall hierarchy we get is hardly the universal class hierarchy of the world but a very specific classification that is different from other classifications that may exist elsewhere)? Best not to think about it too much and to gather sources whenever we have them ;-) I think it is good to think about it and to consider options to deal with it. Like for instance:
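A minimal sketch of how mechanical the father/child rule above is, applied to simple (subject, property, value, reference) tuples -- illustrative code only, with made-up names, not Wikibase or Wikidata Toolkit code:

import java.util.ArrayList;
import java.util.List;

public class HardInferenceExample {

    // A claim plus the reference that backs it, read Wikidata-style:
    // (s, "father", v) means that v is the father of s.
    record Claim(String subject, String property, String value, String reference) {}

    public static void main(String[] args) {
        List<Claim> claims = List.of(new Claim("Q3", "father", "Q2", "refA"));

        // Hard rule: if X is the father of Y according to reference A,
        // then Y is the child of X according to reference A.
        List<Claim> derived = new ArrayList<>();
        for (Claim c : claims) {
            if (c.property().equals("father")) {
                derived.add(new Claim(c.value(), "child", c.subject(), c.reference()));
            }
        }

        // Prints: Claim[subject=Q2, property=child, value=Q3, reference=refA]
        derived.forEach(System.out::println);
    }
}

The rule fires unconditionally and carries the reference along unchanged; there is no score or threshold anywhere, which is what distinguishes such inferences from the soft constraints discussed in this thread.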
Re: [Wikidata-l] What is the point of properties?
Hi David, Interesting remark. Let's explore this idea a bit. I will give you two main reasons why we have properties separate, one practical and one conceptual. First the practical point. Certainly, everything that is used as a property needs to have a datatype, since otherwise the wiki would not know what kind of input UI to show. So you cannot use just any item as a property straight away -- it needs to have a datatype first. So, yes, you could abolish the namespace Property but you still would have a clear, crisp distinction between property items (those with datatype) and normal items (those without a datatype). Because of this, most of the other functions would work the same as before (for example, property autocompletion would still only show properties, not arbitrary items). A complication with this approach is that property datatypes cannot change in Wikibase. This design was picked since there is no way to convert existing data from one datatype to another in general. So changing the datatype would create problems by making a lot of data invalid, and require special handling and special UI to handle this situation. With properties living in a separate namespace, this is not a real restriction: you can just create a new property and give it the same label (after naming the old one differently, e.g., putting DEPRECATED in its name). Then you can migrate the data in some custom fashion. But if properties were items, we would have a problem here: the item is already linked to many Wikipedias and other projects, and it might be used in LUA scripts, queries, or even external applications like Denny's Javascript translation library. You cannot change item ids easily. Also, many items would not have a datatype, so the first datatype that is (accidentally?) entered would be fixed. So we would definitely need to rethink the whole idea of unchangeable datatypes. My other important reason is conceptual. Properties are not considered part of the (encyclopaedic) data but rather part of the schema that the community has picked to organise that data. As in your example, emissivity (Q899670) is a notion in physics as described in a Wikipedia article. There are many things to say about this notion (for example, it has a history: somebody must have defined this first -- although Wikipedia does not say it in this case). As in all cases, some statements might be disputed while others are widely acknowledged to be true. For the property emissivity (P1295), the situation is quite different. It was introduced as an element used to enter data, similar to a row in a database table or an infobox template in some Wikipedia. It does probably closely relate to the actual physical notion Q899670, but it still is a different thing. For example, it was first introduced by User:Jakec, who is probably not the person who introduced the physical concept ;-) Anything that we will say about P1295 in the future refers to the property -- a concept of our own making, that is not described in any external source (there are no publications discussing P1295). This is also the reason why properties are supposed to support *claims* not *statements*. That is, they will have property-value pairs and qualifiers, but no references or ranks. Indeed, anything we say about properties has the status of a definition. If we say it, it's true. There is no other authority on Wikidata properties. 
You could of course still have items and properties share a page and somehow define which statements/claims refer to which concept, but this does not seem to make things easier for users. These are, for me, the two main reasons why it makes sense to keep properties apart from items on a technical level. Besides this, it is also convenient to separate the 1000-something properties from the 15-million-something items for reasons of maintenance. Best regards, Markus On 28/05/14 09:25, David Cuenca wrote: Since the very beginning I have kept myself busy with properties, thinking about which ones fit, which ones are missing to better describe reality, how to integrate them into the ones that we have. The thing is that the more I work with them, the less difference I see with normal items, and if soon there will be statements allowed in property pages, the difference will blur even more. I can understand that from the software development point of view it might make sense to have a clear difference. Or for the community to get a deeper understanding of the underlying concepts represented by words. But semantically I see no difference between: cement (Q45190) emissivity (P1295) 0.54 and cement (Q45190) emissivity (Q899670) 0.54 Am I missing something here? Are properties really needed or are we adding unnecessary artificial constraints? Cheers, Micru ___ Wikidata-l mailing list Wikidata-l@lists.wikimedia.org
Re: [Wikidata-l] What is the point of properties?
On 28/05/14 10:37, Daniel Kinzler wrote: Key differences between Properties and Items: * Properties have a data type, items don't. * Items have sitelinks, Properties don't. * Items have Statements, Properties will support Claims (without sources). The software needs these constraints/guarantees to be able to take shortcuts, provide specialized UI and API functionality, etc. Yes, it would be possible to use items as properties instead of having a separate entity type. But they are structurally and functionally different, so it makes sense to have a strict separation. This makes a lot of things easier, e.g.: * setting different permissions for properties * mapping to rdf vocabularies This one point requires a tiny remark: there is no problem in OWL or RDF with using the same URI as a property, an individual, and a class in different contexts. The only thing that OWL (DL) forbids is to use one property for literal values (like string) and for object values (like other items), but this would not occur in our case anyway since we have clearly defined types. I completely agree with all the rest :-) Cheers, Markus More fundamentally, they are semantically different: an item describes a concept in the real world, while a property is a structural component used for such a description. Yes, properties are similar to data items, and in some cases, there may be an item representing the same concept that is represented by a property entity. I don't see why that is a problem, while I can see a lot of confusion arising from mixing them. -- daniel On 28.05.2014 09:25, David Cuenca wrote: Since the very beginning I have kept myself busy with properties, thinking about which ones fit, which ones are missing to better describe reality, how to integrate them into the ones that we have. The thing is that the more I work with them, the less difference I see with normal items, and if soon there will be statements allowed in property pages, the difference will blur even more. I can understand that from the software development point of view it might make sense to have a clear difference. Or for the community to get a deeper understanding of the underlying concepts represented by words. But semantically I see no difference between: cement (Q45190) emissivity (P1295) 0.54 and cement (Q45190) emissivity (Q899670) 0.54 Am I missing something here? Are properties really needed or are we adding unnecessary artificial constraints? Cheers, Micru ___ Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l ___ Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l
Re: [Wikidata-l] What is the point of properties?
David, Regarding the question of how to classify properties and how to relate them to items: * same as (in the sense of owl:sameAs) is not the right concept here. In fact, it has often been discouraged to use this on the Web, since it has very strong implications: it means that in all uses of the one identifier, one could just as well use the other identifier, and that it is indistinguishable whether something has been said about the one or the other. That seems too strong here, at least for most cases. * In the world of OWL DL, sameAs specifically refers to individuals, not to classes or properties. Saying P sameAs Q does not imply that P and Q have the same extension as properties. For the latter, OWL has the relationship owl:equivalentProperty. This distinction of instance level and schema level is similar to the distinction we have between instance of and subclass of. * Therefore, I would suggest using a property called subproperty of as one way of relating properties (analogously to subclass of). It has to be checked whether this actually occurs in Wikidata (do we have any properties that would be in this relation, or do we make it a modelling principle to have only the most specific properties in Wikidata?). * The relationship from properties to items could be modelled with the existing property subject of (P805). * It might be useful to also have a taxonomic classification of properties. For example, we already group properties into properties for people, organisations, etc. Such information could also be added with a specific property (this would be a bit more like a category system on property pages). On the other hand, some of this might coincide with constraint information that could be expressed as claims. For instance, person properties might be those with Type (i.e., rdfs:domain) constraint human. By the way, our constraint system could use some systematisation -- there are many overlaps in what you can do with one constraint or another. Cheers, Markus On 28/05/14 12:14, David Cuenca wrote: Markus, The explanation about the implications of renaming/deleting makes the most sense, and just that already justifies the separation in two. It is equally true that when we create a property, we might have cleaned the original concept so much that it might differ (even slightly) from the understood concept that the item represents. However, even after that process, the new concept is still an item... The process of imbuing a concept with permanent characteristics (adding a datatype) and the practical approach also seem to recommend keeping items and properties separate. Thanks for showing me that reasoning :) I am still wondering about how we are going to classify properties. Maybe it will require a broader discussion, but if they are the same (or mostly the same) as items, then we can just link them as same as, and build the classing structure just for the items. OTOH, if they are different, then we will need to mirror that classification for properties, which seems quite redundant. Plus adding a new datatype, property. All in all, my conclusion about this is that properties are just concepts with special qualities that justify the separation in the software (even if in real life there is no separation). many thanks for your detailed answer, and sorry if I'm bringing up already discussed topics. 
It is just that when you stare long into wikidata, wikidata stares back into you ;) Cheers, Micru On Wed, May 28, 2014 at 11:39 AM, Markus Krötzsch mar...@semantic-mediawiki.org mailto:mar...@semantic-mediawiki.org wrote: Hi David, Interesting remark. Let's explore this idea a bit. I will give you two main reasons why we have properties separate, one practical and one conceptual. First the practical point. Certainly, everything that is used as a property needs to have a datatype, since otherwise the wiki would not know what kind of input UI to show. So you cannot use just any item as a property straight away -- it needs to have a datatype first. So, yes, you could abolish the namespace Property but you still would have a clear, crisp distinction between property items (those with datatype) and normal items (those without a datatype). Because of this, most of the other functions would work the same as before (for example, property autocompletion would still only show properties, not arbitrary items). A complication with this approach is that property datatypes cannot change in Wikibase. This design was picked since there is no way to convert existing data from one datatype to another in general. So changing the datatype would create problems by making a lot of data invalid, and require special handling and special UI to handle this situation. With properties living in a separate namespace, this is not a real restriction: you can just create a new property and give it the same label (after naming the old one differently, e.g., putting DEPRECATED in its name).
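The semantic distinctions drawn in the message above can be spelled out in standard RDFS/OWL terms (a recap in notation, nothing Wikidata-specific): a subproperty is a one-directional inclusion of extensions, equivalence is inclusion in both directions, and owl:sameAs between two property IRIs only identifies them as individuals, which in OWL DL says nothing by itself about their extensions:

\[ P \sqsubseteq Q \;\text{ means }\; \forall x, y\, (P(x,y) \rightarrow Q(x,y)), \qquad P \equiv Q \;\text{ means }\; P \sqsubseteq Q \text{ and } Q \sqsubseteq P \]

This is why a subproperty of property would mirror subclass of on the schema level, while same as links are best reserved for the item level.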
Re: [Wikidata-l] Using external vocabularies (like RDA) in WikiData ?
On 28/05/14 15:56, Daniel Kinzler wrote: On 28.05.2014 15:05, Jean-Baptiste Pressac wrote: Hello, I am reading the documentation of WikiData where I learned that new properties could be suggested for discussion. But this means adding new properties to WikiData. However, is it possible to use existing RDF vocabularies Not directly. At the moment, you would just rely on a convention saying that a given wikibase property is equivalent to a concept from some other vocabulary. However, we are in the process of allowing claims on properties. Once this is possible, you will be able to connect properties to external identifiers, much in the way data items about people etc are cross-linked with external identifiers. This would allow you to model the equivalence between wikidata properties and other vocabularies. However, the software itself would not be aware of the equivalence, so it would not be explicit in the RDF representation of data items. But it would be easy for an external tool that knows how to interpret such claims on properties to build an appropriate mapping using owl:sameAs or a similar mechanism. Daniel is right about this mechanism (but, as I said earlier today, owl:equivalentProperty is the way to go here, not owl:sameAs). However, there is another important point to consider: statements in Wikidata cannot be expressed as single triples in RDF. You need auxiliary nodes for statements to represent qualifiers and references. For details, see our technical report http://korrekt.org/page/Introducing_Wikidata_to_the_Linked_Data_Web Due to this, you cannot just take external properties and use them to replace Wikidata properties: the RDF version of Wikidata does not have any property that links subjects (items) and objects (values) directly. There are several approaches to get back to single triples (mainly: named graphs and simplified exports); see the technical report for details. The other issue that one has to be aware of is that we use properties not just for the main part of a statement, but also for qualifiers and for references. One should be clear about whether an external property applies to all or only to some of these uses. For example, an external property that has Person as its domain should never be used in a reference, even if (maybe in error) somebody has used the Wikidata property in a reference. We plan to generate the RDF dumps described in the technical report regularly. This would be a possible place for implementing the re-use of external vocabularies. If you are interested in this, you are welcome to join -- basically, one could have a mechanism based on either a hard-coded mapping (in the export code) or based on templates on property talk pages (like constraints now). Cheers, Markus ___ Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l
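To illustrate the point about auxiliary nodes: a statement such as cement (Q45190) emissivity (P1295) 0.54 with a reference r comes out as a group of triples around a statement node s, roughly of the form below. The predicate names here are illustrative placeholders; the actual export vocabulary is the one defined in the technical report:

\[ (\texttt{Q45190},\ \texttt{P1295\_statement},\ s), \qquad (s,\ \texttt{P1295\_value},\ 0.54), \qquad (s,\ \texttt{reference},\ r) \]

So there is no single triple linking Q45190 directly to 0.54 into which an external property IRI could be swapped; any mapping to external vocabularies has to be defined on top of this structure (e.g., in a simplified export).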
Re: [Wikidata-l] What is the point of properties?
helpers (constraints) distinct from sourced information (statements about items). My recommendation is to rely mainly on the main taxonomy instead of creating a parallel property taxonomy, and then think of ways to extract information from the main taxonomy to convert it automatically into constraints. All the maintenance takes effort, so the more it can be automated, the more efficient volunteers will be. And if we can simplify the maintenance of properties, we will be able to simplify the creation of properties too, especially when we face the next surge, which will come with the datatype number with units. I agree with the general goals, but I don't think that things become any easier if we confuse information about properties with information about items. We can still re-use information we have about items (like the class hierarchy that we already use in constraints) to avoid duplication, but some things are clearly not part of the item taxonomy. Cheers, Markus On Wed, May 28, 2014 at 2:48 PM, Markus Krötzsch mar...@semantic-mediawiki.org mailto:mar...@semantic-mediawiki.org wrote: David, Regarding the question of how to classify properties and how to relate them to items: * same as (in the sense of owl:sameAs) is not the right concept here. In fact, it has often been discouraged to use this on the Web, since it has very strong implications: it means that in all uses of the one identifier, one could just as well use the other identifier, and that it is indistinguishable whether something has been said about the one or the other. That seems too strong here, at least for most cases. * In the world of OWL DL, sameAs specifically refers to individuals, not to classes or properties. Saying P sameAs Q does not imply that P and Q have the same extension as properties. For the latter, OWL has the relationship owl:equivalentProperty. This distinction of instance level and schema level is similar to the distinction we have between instance of and subclass of. * Therefore, I would suggest using a property called subproperty of as one way of relating properties (analogously to subclass of). It has to be checked whether this actually occurs in Wikidata (do we have any properties that would be in this relation, or do we make it a modelling principle to have only the most specific properties in Wikidata?). * The relationship from properties to items could be modelled with the existing property subject of (P805). * It might be useful to also have a taxonomic classification of properties. For example, we already group properties into properties for people, organisations, etc. Such information could also be added with a specific property (this would be a bit more like a category system on property pages). On the other hand, some of this might coincide with constraint information that could be expressed as claims. For instance, person properties might be those with Type (i.e., rdfs:domain) constraint human. By the way, our constraint system could use some systematisation -- there are many overlaps in what you can do with one constraint or another. Cheers, Markus On 28/05/14 12:14, David Cuenca wrote: Markus, The explanation about the implications of renaming/deleting makes the most sense, and just that already justifies the separation in two. It is equally true that when we create a property, we might have cleaned the original concept so much that it might differ (even slightly) from the understood concept that the item represents. However, even after that process, the new concept is still an item... 
The process of imbuing a concept with permanent characteristics (adding a datatype) and the practical approach also seem to recommend keeping items and properties separate. Thanks for showing me that reasoning :) I am still wondering about how we are going to classify properties. Maybe it will require a broader discussion, but if they are the same (or mostly the same) as items, then we can just link them as same as, and build the classing structure just for the items. OTOH, if they are different, then we will need to mirror that classification for properties, which seems quite redundant. Plus adding a new datatype, property. All in all, my conclusion about this is that properties are just concepts with special qualities that justify the separation in the software (even if in real life there is no separation). many thanks for your detailed answer, and sorry if I'm bringing up already discussed topics. It is just that when you stare long into wikidata, wikidata stares back into you ;) Cheers, Micru
Re: [Wikidata-l] What is the point of properties?
David, One of the uses is: what is the relationship between a human and his behavior? This is an easy question once you have been clear about what human behaviour is. According to enwiki, it is a range of behaviours *exhibited by* humans. The bigger question for me is whether it is useful to record this relationship (exhibited by) in Wikidata. What would anybody do with this data? In what application could it be of interest? Moreover, as a great Icelandic ontologist once said: There is definitely, definitely, definitely no logic, to human behaviour ;-) In that regard, I hate the word constraint, because it means that we are placing a straitjacket on reality, when it is the other way round: recurring patterns in the real world make us expect that a value will fall within the bounds of our expectations. I think constraints are already understood in this way. The name comes from databases, where a constraint violation is indeed a rather hard error. On the other hand, ironically, constraints (as a technical term) are often considered to be a softer form of modelling than (onto)logical axioms: a constraint can be violated while a logical axiom (as the name suggests) is always true -- if it is not backed by the given data, new data will be inferred. So as a technical term, constraint is quite appropriate for the mechanism we have, although it may not be the best term to clarify the intention. However, I would like to bring the conversation to a deeper level. ... With all this I want to make the point that there are two sources of expectations: - from our experience seeing repetitions and patterns in the values (male/female/etc between 10 and 50), which belong to the property - from the agreed definition of the concept itself, which belong to the data Yes. I agree with this as a basic dichotomy of things we may want to record in Wikidata. Some things are true by definition, while others are just very likely by observation. The exact population of Paris we will never know, but we are completely sure that a piano is an instrument. (Maybe somebody with a better philosophical background than me could give a better perspective of these notions -- analytical vs. empirical come to mind, but I am sure there is more.) Some important ideas like classification (instance of/subclass of) belong completely to the analytical realm. We don't observe classes, we define them. A planet is what we call a planet, and this can change even if the actual lumps in space are pretty much the same. However, there is yet a deeper level here (you asked for it ;-). Wikidata is not about facts but about statements with references. We do not record Pluto was a planet until 2006 but Pluto was a planet until 2006 *according to the IAU*. Likewise, we don't say Berlin has 3 million inhabitants but Berlin has 3 million inhabitants *according to the Amt fuer Statistik Berlin-Brandenburg*. If you compare these two statements, you can see that they are both empirical, based on our observation of a particular reference. We do not have analytical knowledge of what the IAU or the Amt fuer Statistik might say. So in this sense constraints can only ever be rough guidelines. It does not make logical sense to say if source A says X then source B must say Y -- even if we know that X implies Y (maybe by definition), we don't know what sources A and B say. All we can do with constraints is to uncover possible contradictions between sources, which might then be looked into. Now inferences are slightly different. 
If we know that X implies Y, then if A says X we can infer that (implicitly) A says Y. That is a logical relationship (or rule) on the level of what is claimed, rather than on the level of statements. Note that we still need to have a way to find out that X implies Y, which is a content-level claim that should have its own reference somewhere. We mainly use inference in this sense with subclass of in reasonator or when checking constraints. In this case, the implications are encoded as subclass-of statements (If X is a piano, then X is an instrument). This allows us to have references on the implications. In general, an interesting question here is what the status of subclass of really is. Do we gather this information from external sources (surely there must be a book that tells us that pianos are instruments) or do we as a community define this for Wikidata (surely, the overall hierarchy we get is hardly the universal class hierarchy of the world but a very specific classification that is different from other classifications that may exist elsewhere)? Best not to think about it too much and to gather sources whenever we have them ;-) Besides these two notions (constraints to uncover inconsistent references, and logical axioms to derive new statements from given ones), there is also a third type of constraint that is purely analytical. If we *define* that our
Re: [Wikidata-l] Subclass of/instance of
On 14/05/14 19:33, Joe Filceolaire wrote: Except that there are lots of people who have appeared in one movie who don't consider themselves actors and should not have the 'occupation=actor/actress'. There are good reasons for some constraints to be gadgets that can be overridden rather than hard coded semantic limits. Sure, we completely agree here. It was just an example. But it shows why we need any such feature to be controlled by the community ;-) I do think we should be able to have hard coded reverse properties and symmetric properties. By hard coded do you mean stored explicitly (as opposed to: inferred in some way)? It will always be possible to store anything explicitly in this sense (but I guess you know this; maybe I misunderstood what you said; feel free to clarify). In general, what I mentioned about inferencing is not supposed to alter the way in which the site works. It would be more like a layer on top that could be useful for asking queries. For example, imagine you want to query for the grandmother of a person: we don't have this property in Wikidata but we have enough information to answer the query. So you would have to research how to get this information by combining existing properties. The idea is that one could have a place to keep this information (= the definition of grandmother in terms of Wikidata properties). We would then have a community approved way of finding grandmothers in Wikidata, and you would be much faster with your query. At the same time, you could look up the definition to find out how Wikidata really stores this information. None of this would change how the underlying data works, but it could help with some data modelling problems because it gives you an option to support a property without the added maintenance cost on the data management level. Cheers, Markus On Wed, May 14, 2014 at 2:33 PM, Markus Krötzsch mar...@semantic-mediawiki.org mailto:mar...@semantic-mediawiki.org wrote: I guess there is already a group of people who deal w Hi Eric, Thanks for all the information. This was very helpful. I only get to answer now since we have been quite busy building RDF exports for Wikidata (and writing a paper about it). I will soon announce this here (we still need to fix a few details). You were asking about using these properties like rdfs:subClassOf and rdf:type. I think that's entirely possible, since the modelling is very reasonable and would probably yield good results. Our reasoner ELK could easily handle the class hierarchy in terms of size, but you don't really need such a highly optimized tool for this as long as you only have subClassOf. In fact, the page you linked to shows that it is perfectly possible to compute the class hierarchy with Wikidata Query and to display all of it on one page. ELK's main task is to compute class hierarchies for more complicated ontologies, which we do not have yet. OTOH, query answering and data access are different tasks that ELK is not really intended for (although it could do some of this as well). Regarding future perspectives: one thing that we have also done is to extract OWL axioms from property constraint templates on Wikidata talk pages (we will publish the result soon, when announcing the rest). This gives you only some specific types of OWL axioms, but it is making things a bit more interesting already. In particular, there are some constraints that tell you that an item should have a certain class, so this is something you could reason with. 
However, the current property constraint system does not work too well for stating axioms that are not related to a particular property (such as: Every [instance of] person who appears as an actor in some film should be [instance of] in the class 'actor' -- which property or item page should this be stated on?). But the constraints show that it makes sense to express such information somehow. In the end, however, the real use of OWL (and similar ontology languages) is to remove the need for making everything explicit. That is, instead of constraints (which say: if your data looks like X, then your data should also include Y) you have axioms (which say: if your data looks like X, then Y follows automatically). So this allows you to remove redundancy rather than to detect omissions. This would make more sense with derived notions that one does not want to store in the database, but which make sense for queries (like grandmother). One would need a bit more infrastructure for this; in particular, one would need to define grandmother (with labels in many languages) even if one does not want to use it as a property but only in queries. Maybe one could have a separate Wikibase installation for defining such derived notions without needing to change Wikidata?
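Both derived notions mentioned in this thread can be written down concisely. As a sketch in rule and description-logic notation (mother (P25), father (P22), and cast member (P161) are existing Wikidata properties; parent abbreviating the union of mother and father is a simplification of mine):

\[ \mathrm{grandmother}(x, z) \leftarrow \mathrm{parent}(x, y) \wedge \mathrm{mother}(y, z) \]

\[ \mathrm{Person} \sqcap \exists\, \mathrm{castMember}^{-}.\mathrm{Film} \sqsubseteq \mathrm{Actor} \]

The first is the kind of community-approved definition one could keep for queries without ever storing a grandmother statement; the second is the actor example read as an axiom, which would silently derive class memberships instead of flagging violations -- precisely the hard reading that, as argued above, one may prefer to keep as a soft constraint.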
Re: [Wikidata-l] Subclass of/instance of
Hi Eric, Thanks for all the information. This was very helpful. I only get to answer now since we have been quite busy building RDF exports for Wikidata (and writing a paper about it). I will soon announce this here (we still need to fix a few details). You were asking about using these properties like rdfs:subClassOf and rdf:type. I think that's entirely possible, since the modelling is very reasonable and would probably yield good results. Our reasoner ELK could easily handle the class hierarchy in terms of size, but you don't really need such a highly optimized tool for this as long as you only have subClassOf. In fact, the page you linked to shows that it is perfectly possible to compute the class hierarchy with Wikidata Query and to display all of it on one page. ELK's main task is to compute class hierarchies for more complicated ontologies, which we do not have yet. OTOH, query answering and data access are different tasks that ELK is not really intended for (although it could do some of this as well). Regarding future perspectives: one thing that we have also done is to extract OWL axioms from property constraint templates on Wikidata talk pages (we will publish the result soon, when announcing the rest). This gives you only some specific types of OWL axioms, but it is making things a bit more interesting already. In particular, there are some constraints that tell you that an item should have a certain class, so this is something you could reason with. However, the current property constraint system does not work too well for stating axioms that are not related to a particular property (such as: Every [instance of] person who appears as an actor in some film should be [instance of] in the class 'actor' -- which property or item page should this be stated on?). But the constraints show that it makes sense to express such information somehow. In the end, however, the real use of OWL (and similar ontology languages) is to remove the need for making everything explicit. That is, instead of constraints (which say: if your data looks like X, then your data should also include Y) you have axioms (which say: if your data looks like X, then Y follows automatically). So this allows you to remove redundancy rather than to detect omissions. This would make more sense with derived notions that one does not want to store in the database, but which make sense for queries (like grandmother). One would need a bit more infrastructure for this; in particular, one would need to define grandmother (with labels in many languages) even if one does not want to use it as a property but only in queries. Maybe one could have a separate Wikibase installation for defining such derived notions without needing to change Wikidata? There are no statements on properties yet, but one could also use item pages to define derived properties when using another site ... Best regards, Markus P.S. Thanks for all the work on the semantic modelling aspects of Wikidata. I have seen that you have done a lot in the discussions to clarify things there. On 06/05/14 04:53, emw wrote: Hi Markus, You asked who is creating all these [subclass of] statements and how is this done? The class hierarchy in http://tools.wmflabs.org/wikidata-todo/tree.html?q=Q35120&rp=279&lang=en shows a few relatively large subclass trees for specialist domains, including molecular biology and mineralogy. 
The several thousand 'gene' and 'protein' subclass of claims were created by members of WikiProject Molecular biology (WD:MB), based on discussions in [1] and [2]. The decision to use P279 instead of P31 there was based on the fact that the is-a relation in Gene Ontology maps to rdfs:subClassOf, which P279 is based on. The claims were added by a bot [3], with input from WD:MB members. The data ultimately comes from external biological databases. A glance at the mineralogy class hierarchy indicates it has been constructed by WikiProject Mineralogy [4] members through non-bot edits. I imagine most of the other subclass of claims are done manually or semi-automatically outside specific Wikiproject efforts. In other words, I think most of the other P279 claims are added by Wikidata users going into the UI and building usually-reasonable concept hierarchies on domains they're interested in. I've worked on constructing class hierarchies for health problems (e.g. diseases and injuries) [5] and medical procedures [6] based on classifications like ICD-10 and assertions and templates on Wikipedia (e.g. [8]). It's not incredibly surprising to me that Wikidata has about 36,000 subclass of (P279) claims [9]. The property has been around for over a year and is a regular topic of discussion [10] along with instance of (P31), which has over 6,600,000 claims. You noted a dubious subclass of claim for 'House of Staufen' (Q130875). I agree that instance of would probably be the better membership property
Re: [Wikidata-l] Wikidata Toolkit 0.1.0 released
Hi Gerard. On 09/04/14 10:54, Gerard Meijssen wrote: Hoi, What is the relevance of these tools when you have to have specialised environments to use them? Not sure what you mean. Wikidata Toolkit doesn't have any requirements other than plain old Java to run. Nevertheless, we'd also like to support people who are using some of the common Java development tools that are around, especially the free ones. Currently, we only have instructions for Eclipse users, but we could extend this. Which tools do you normally use to develop Java? Cheers Markus On 9 April 2014 10:41, Daniel Kinzler daniel.kinz...@wikimedia.de mailto:daniel.kinz...@wikimedia.de wrote: On 08.04.2014 23:34, Denny Vrandečić wrote: I was trying to use this, but my Java is a bit rusty. How do I run the DumpProcessingExample? I did the following steps:

git clone https://github.com/Wikidata/Wikidata-Toolkit
cd Wikidata-Toolkit
mvn install
mvn test

Now, how do I start DumpProcessingExample? Looks like you are supposed to run it from Eclipse. It would be very useful if maven would generate a jar with all dependencies for the examples, or if there was a shell script that would allow us to run classes without the need to specify the full class path. Finding out how to get all the libs you need into the classpath is one of the major annoyances of java... -- daniel -- Daniel Kinzler Senior Software Developer Wikimedia Deutschland Gesellschaft zur Förderung Freien Wissens e.V. ___ Wikidata-l mailing list Wikidata-l@lists.wikimedia.org mailto:Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l ___ Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l ___ Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l
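For reference, a common way to run a single example class without assembling the classpath by hand is the exec-maven-plugin. Assuming the example lives in a package like org.wikidata.wdtk.examples (the exact module and package name may differ from this sketch), something along these lines should work from the module that contains it:

mvn exec:java -Dexec.mainClass="org.wikidata.wdtk.examples.DumpProcessingExample"

mvn exec:java resolves the module's declared dependencies onto the classpath automatically, which addresses exactly the annoyance Daniel describes.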
[Wikidata-l] Wikidata submissions to Wikimania
Dear all, There are quite a few Wikidata-related submissions to Wikimania [0]. The selection of the program committee seems to be based on user votes to some extent, so don't forget to add your name to the submission pages you care about :-). I just added another two: * How to use Wikidata: Things to make and do with 30 million statements [1] A general introductory talk about Wikidata data reuse in all of its forms. * Wikidata Toolkit: A Java library for working with Wikidata [2] A tutorial for working with Wikidata Toolkit (expected to be much more feature rich at the time of Wikimania ;-) Feedback is welcome. Cheers, Markus [0] https://wikimania2014.wikimedia.org/wiki/Category:Submissions [1] https://wikimania2014.wikimedia.org/wiki/Submissions/How_to_use_Wikidata:_Things_to_make_and_do_with_30_million_statements [2] https://wikimania2014.wikimedia.org/wiki/Submissions/Wikidata_Toolkit:_A_Java_library_for_working_with_Wikidata ___ Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l
[Wikidata-l] What's up with our incremental (daily) dumps?
Hi, For a few weeks now, no daily dumps have been published for Wikidata. Only empty directories are created every day. I could not find a related email on any of the lists I scanned. Can anybody clarify what the situation is now? Cheers, Markus ___ Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l
Re: [Wikidata-l] What's up with our incremental (daily) dumps?
On 13/03/14 17:14, Katie Filbert wrote: On Thu, Mar 13, 2014 at 5:06 PM, Markus Krötzsch mar...@semantic-mediawiki.org mailto:mar...@semantic-mediawiki.org wrote: Hi, Since a few weeks now, no daily dumps have been published for Wikidata. Only empty directories are created every day. I could not find a related email on any list I scanned. Can anybody clarify what the situation is now? The issue is due to the dumps being moved to the Ashburn data center. https://bugzilla.wikimedia.org/show_bug.cgi?id=62315 They should be running again soonish. Good to know. Thanks, Markus ___ Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l
Re: [Wikidata-l] Large scale glitch in references
Hi ValterVB, On 04/03/14 20:17, ValterVB wrote: Hi Markus, it's an error by my bot (ValterVBot). Thanks for noting it. I can probably fix it on Friday or Saturday; the source should be Q11920, not Q11329. Sorry for this problem. Great, that should be fine. ValterVB PS: I'm not sure if I replied to the mail archive or to your private mail; in the second case, can you post this mail? Thanks. Done. Best regards, Markus ___ Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l
Re: [Wikidata-l] supported and planned wikidata uris( was Re:Meta header for asserting that a web page is about a Wikidata subject)
Hi, On 26/02/14 22:40, Michael Smethurst wrote: Hello *Really* not meaning to jump down any http-range-14 rabbit holes but wasn't there a plan for wikidata to have uris representing things and pages about those things? From conversations on this list I sketched a picture a while back of all the planned URIs: http://smethur.st/wp-uploads/2012/07/46159634-wikidata.png Where http://wikidata.org/id/Qetc Was the thing uri (which you could point a foaf:primaryTopic at) As Denny said in reply to another message, the preferred URI for this is http://www.wikidata.org/entity/Qetc This is also the form of URIs used within Wikidata data for certain things (e.g., coordinates that refer to earth use the URI http://www.wikidata.org/entity/Q2 to do so, even in JSON). and http://wikidata.org/wiki/Qetc Was the document uri Yes. However, for metadata it is usually preferred to use the entity URI, since the document http://wikidata.org/wiki/Qetc is just an automatic UI rendering of the data, and as such relatively uninteresting. One will eventually get (using content negotiation) all data in RDF from http://www.wikidata.org/entity/Qetc (JSON should already work, and html works of course, when opening the entity URI in normal browsers). The only reason for using the wiki URI directly would be if one uses a property that requires a document as its value, but in this case one should probably better use another property. Best regards, Markus Mainly asking not for the wikipedia wikidata relationships but wondering if there's a more up to date picture of supported wikidata uri patterns and redirects? Recently I was trying to find a way to programmatically get wikidata uris from wikipedia uris and tried various combinations of: http://wikidata.org/title/enwiki:Berlin http://en.wikidata.org/item/Berlin http://en.wikidata.org/title/Berlin (all mentioned on the list / wiki) but all of them return a 404 Is there a way to do this? Michael On 26/02/2014 19:09, Dan Brickley dan...@danbri.org wrote: On 26 February 2014 10:45, Joonas Suominen joonas.suomi...@wikimedia.fi wrote: How about using RDFa and foaf:primaryTopic like in this example https://en.wikipedia.org/wiki/RDFa#XHTML.2BRDFa_1.0_example 2014-02-26 20:18 GMT+02:00 Paul Houle ontolo...@gmail.com: Isn't there some way to do this with schema.org? The FOAF options were designed for relations between entities and documents - foaf:primaryTopic relates a Document to a thing that the doc is primarily about (i.e. assumes entity IDs as value, pedantically). the inverse, foaf:isPrimaryTopicOf, was designed to allow an entity description in a random page to anchor itself against well known pages. In particular we had Wikipedia in mind. http://xmlns.com/foaf/spec/#term_primaryTopic http://xmlns.com/foaf/spec/#term_isPrimaryTopicOf (Both of these share a classic Semantic Web pickiness about distinguishing things from pages about those things). Much more recently at schema.org we've added a new property/relationship called http://schema.org/sameAs It relates an entity to a reference page (e.g. wikipedia) that can be used as a kind of proxy identifier for the real world thing that it describes. Not to be confused with owl:sameAs which is for saying here are two ways of identifying the exact same real world entity. None of these are a perfect fit for a relationship between a random Web page and a reference page. But maybe close enough? 
Both FOAF and schema.org are essentially dictionaries of hopefully-useful terms, so you can use them in HTML head, or body, according to taste, policy, tooling etc. And you can choose a syntax (microdata, rdfa, json-ld etc.). I'd recommend using the new schema.org 'sameAs', e.g. in RDFa Lite, <link href="https://en.wikipedia.org/wiki/Buckingham_Palace" property="http://schema.org/sameAs" /> This technically says the thing we're describing in the current element is Buckingham_Palace. If you want to be more explicit and say this Web page is about a real world Place and that place is Buckingham_Palace ... you can do this too with a bit more nesting; the HTML body might be a better place for it. Dan ___ Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l ___ Wikidata-l mailing list Wikidata-l@lists.wikimedia.org
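A small sketch of the content negotiation described above, using only the JDK: it requests the conceptual entity URI with an Accept header for JSON and lets the redirect lead to the data document (this assumes the server behaves as described in this thread; error handling is omitted):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class EntityUriExample {
    public static void main(String[] args) throws Exception {
        // The entity URI denotes the thing itself; asking it for JSON
        // should redirect to a data document about that thing.
        URL url = new URL("https://www.wikidata.org/entity/Q42");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestProperty("Accept", "application/json");
        // HttpURLConnection follows same-protocol redirects by default.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}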
Re: [Wikidata-l] CFP - IEEE Co-sponsored CyberSec2014 - Lebanon Section
This call is a scam. The conference is not a legit academic event but aims at making money. It is a sad truth that there is an increasingly large amount of (more or less) academic conference spam these days. IEEE has been criticized for sponsoring events without sufficient quality control [1], and I tend to ignore all events that advertise in its name. Some typical signs of scam conferences: * Keywords: cyber, world, global, multiconference, as well as weird keyword combinations and neologisms (peacefare?!) * claimed(?) IEEE sponsoring * registration fees per accepted paper instead of per participant (pay to publish), see e.g. [2]; this is probably the one 100% sure sign of a fake conference; I have never seen any legit event doing this * non-committing choice of words (potential inclusion to IEEE Xplore, possible keynote speakers etc.) * lack of (trustworthy) names and institutions related to the event (that's hard to judge if you are not in a relevant research community); in some cases known names may be abused or suggested to cause confusion There are more hints on recognizing scam conferences and journals online, e.g. at [3]. In general, I am in favour of filtering all academic calls for papers from this list. Even if the event is legit and has a relevant topic, this is not a forum to ask for academic contributions; there are more than enough channels these days to advertise events. Calls for participations in community events are a different story. Markus [1] http://blog.lib.umn.edu/denis036/thisweekinevolution/2011/07/would_ieee_really_sponsor_a_fa.html [2] http://sdiwc.net/conferences/2014/cybersec2014/registration/ [3] http://www.cs.bris.ac.uk/Teaching/learning/junk.conferences.html On 16/01/14 08:44, Sven Manguard wrote: I am beginning to get tired of these types of solicitations, as they seem to be coming in regularly, and more often than not, have little to do with Wikidata. Do people on this list find them useful? If so, is this the most appropriate list? If not, is there any interest in prohibiting posts like this? Sven On Jan 16, 2014 2:38 AM, Liezelle Ann Canadilla lieze...@sdiwc.info mailto:lieze...@sdiwc.info wrote: All the registered papers will be submitted to IEEE for potential inclusion to IEEE Xplore as well as other Abstracting and Indexing (AI) databases. TITLE: The Third International Conference on Cyber Security, Cyber Warfare, and Digital Forensic (CyberSec2014) EVENT VENUE: Lebanese University, Lebanon CONFERENCE DATES: Apr. 29 – May 1, 2014 EVENT URL: http://sdiwc.net/conferences/2014/cybersec2014/ OBJECTIVE: To provide a medium for professionals, engineers, academicians, scientists, and researchers from over the world to present the result of their research activities in the field of Computer Science, Engineering and Information Technology. CyberSec2014 provides opportunities for the delegates to share the knowledge, ideas, innovations and problem solving techniques. Submitted papers will be reviewed by the technical program committee of the conference. KEYWORDS: Cyber Security, Digital Forensics, Information Assurance and Security Management, Cyber Peacefare and Physical Security, and many more... 
SUBMISSION URL: http://sdiwc.net/conferences/2014/cybersec2014/openconf/openconf.php FIRST SUBMISSION DEADLINE: March 29, 2014 CONTACT EMAIL: cyb2...@sdiwc.net ___ Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l
Re: [Wikidata-l] ontology Wikidata API, managing ontology structure and evolutions
On 10/01/14 03:21, emw wrote: What about monthly/dump-based aggregated property usage statistics? Property usage statistics would be very valuable, Dimitris. It would help inform community decisions about how to steer changes in property usage with less disruption. It would have other significant benefits as well. Getting daily counts like https://www.wikidata.org/wiki/Wikidata:Database_reports/Popular_properties back up and running would be a good place to start. That report hasn't been updated since October 2013. We could go further by showing counts for all properties, not just the top 100. More detailed data would be great, too. Wikidata editors recently posted a list of the most popular objects for 'instance of' (P31) claims at https://www.wikidata.org/w/index.php?title=Property_talk:P31&oldid=99405143#Value_statistics. Having daily data like that for all properties would be quite useful. Thanks for the suggestions. I will put all of these on the list for the Wikidata Toolkit development. Providing up-to-date analytics of this kind is a good basic use case for this project. (Btw, the project starts officially in mid-February and runs for six months, but we will start working before that already; there will be a bit more planning before we start hacking.) Markus If anyone does end up doing something like this, I would recommend archiving the data at http://dumps.wikimedia.org/other/ in addition to posting it in a regularly updated report in Wikidata. Cheers, Eric https://www.wikidata.org/wiki/User:Emw On Thu, Jan 9, 2014 at 12:59 PM, Dimitris Kontokostas kontokos...@informatik.uni-leipzig.de wrote: What about monthly/dump-based aggregated property usage statistics? People would be able to check property trends or maybe subscribe to specific properties via RSS. On Thu, Jan 9, 2014 at 3:55 PM, Daniel Kinzler daniel.kinz...@wikimedia.de wrote: On 08.01.2014 16:20, Thomas Douillard wrote: Hi, a problem seems (not very surprisingly) to emerge in Wikidata: managing the evolution of how we do things on Wikidata. Properties are deleted, which leaves some consumers of the data a little frustrated that they are not informed of this and cannot take part in the discussion. They are informed if they follow the relevant channels. There's no way to inform them if they don't. These channels can very likely be improved, yes. That being said: a property that is still widely used should very rarely be deleted, if at all. Usually, properties would be phased out by replacing them with another property, and only then they get deleted. Of course, 3rd parties that rely on specific properties would still face the problem that the property they use is simply no longer used (that's the actual problem - whether it is deleted doesn't really matter, I think). So, the question is really: how should 3rd party users be notified of changes in policy and best practice regarding the usage and meaning of properties? That's an interesting question, one that doesn't have a technical solution I can see. -- daniel -- Daniel Kinzler Senior Software Developer Wikimedia Deutschland Gesellschaft zur Förderung Freien Wissens e.V.
___ Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l -- Dimitris Kontokostas Department of Computer Science, University of Leipzig Research Group: http://aksw.org Homepage: http://aksw.org/DimitrisKontokostas ___ Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l
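A rough sketch of the kind of dump-based property counting discussed in this thread. The file name and the assumption of one JSON entity per line are for illustration only; the actual dump layout may differ:

    import bz2
    import json
    from collections import Counter

    counts = Counter()
    # Illustrative only: assumes a bz2-compressed dump with one JSON
    # entity per line ('entities.json.bz2' is a placeholder file name).
    with bz2.open('entities.json.bz2', 'rt', encoding='utf-8') as dump:
        for line in dump:
            line = line.strip().rstrip(',')
            if not line.startswith('{'):
                continue  # skip surrounding JSON list brackets, if any
            entity = json.loads(line)
            for prop, statements in entity.get('claims', {}).items():
                counts[prop] += len(statements)

    # Print the 100 most used properties with their statement counts.
    for prop, num in counts.most_common(100):
        print(prop, num)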
Re: [Wikidata-l] How are queries doing?
Hi, On a related note, there is also an upcoming project, Wikidata Toolkit [1], that will look into implementing query functionality over Wikidata content, not to replace the Wikidata query features but to provide functionality that is not a top priority for the core development. The first step towards this will be to collect concrete requirements (what queries exactly? on what part of the data?). I will send an email about this in due course, but input is always welcome. There is no query result integration into Wikipedia via this route, but a range of interesting Wikidata-driven Web services and query features could be created. The lack of tight coupling to Wikipedia deployment makes this project a lot more flexible, with room for experiments and new ideas that might also inspire future core features. Cheers, Markus [1] https://meta.wikimedia.org/wiki/Grants:IEG/Wikidata_Toolkit On 08/01/14 12:22, Gerard Meijssen wrote: Hoi, I agree that the integration of Wikidata in all the different Wikipedias, Wikivoyages, Wikisources, Wiktionaries and Commons is the most important objective. It is so important because this ensures that the data will be actually used. We are doing fine I think. However, not all the quirks of specific Wikis can be supported. Wikidata is data driven and consequently it matters a lot if a Wikipedia article is an article, a list or used for disambiguation. Thanks, GerardM On 8 January 2014 12:04, Dan Brickley dan...@danbri.org wrote: On 7 January 2014 22:08, Jan Kučera kozuc...@gmail.com wrote: nice to read all the reasoning why queries are yet still not possible, but I think we live in 2014 and not 1914 actually... seems like the problem is too small a budget or bad management... cannot really think of another reason. How much do you think it would cost to make queries a reality for production at Wikidata? Absolutely the most important thing about Wikidata is the deep integration (both technical and social) into the Wikipedia universe. Building a sensible query framework for a system working at Wikipedia scale (http://www.alexa.com/siteinfo/wikipedia.org) is far from trivial. I'm glad to hear that Wikidata are taking the time to do this carefully. Dan ___ Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l
Re: [Wikidata-l] Wikidata-Freebase mappings published under CC0
On 12/11/13 16:26, Sven Manguard wrote: Google would not have sent over a large chunk of cash to help get Wikidata started if it didn't think it could use Wikidata. That Russian search engine company would not have sent over a large chunk of cash to keep Wikidata going if it didn't think it could use Wikidata. That doesn't mean Google is being malicious (about this, at least), it means that they are making a business decision. As long as Google doesn't try to make decisions about Wikidata content or operations - something it would have no economic reason to do anyways - I don't have a problem with that. Just don't pretend that Google is doing this out of the goodness of their hearts. Another important thing to note in this context is that all funding for Wikidata so far had the form of donations, which is crucially different from sponsoring (where you get something in return). The donors who give their money hope, of course, that the project will do something that they will find useful, but they exercise no control whatsoever in the development process. The donations are not bound to any condition, not even reporting, and there is no way to retract them. So each donor's initial intentions, whatever they were, have no influence on the execution of the project (moreover, I Kant think of any reason why the intentions rather than the outcome should determine the value of the deed ;-). Cheers, Markus On Nov 11, 2013 6:27 PM, Cristian Consonni kikkocrist...@gmail.com wrote: 2013/11/11 Denny Vrandečić vrande...@google.com: as you know, I have recently started a new job. My mission to get more free data out to the world has not changed due to that, though. =)) I am very happy to hear this. Also, the mapping is awesome. 2013/11/11 Klein,Max kle...@oclc.org: I regretted writing what I did after thinking about it over lunch, since it is not Assume good faith towards Google. Maybe one of the reasons that I was sensitive to it was because I'm representing VIAF in Wikidata, which is kind of the same as Freebase in Wikidata, and I wouldn't want people assuming bad faith about VIAF. Thanks for being clear and open about your work, it's a real inspiration. With apologies, Yours too, Max. Thank you both for your very good work. Cristian ___ Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l
Re: [Wikidata-l] Questions about statement qualifiers
Hi Antoine, The main answer to your questions is that the data model of Wikidata defines a *data structure*, not the *informal meaning* that this data structure has in an application context (that is: what we, humans, want to say when we enter it). I try to explain this a bit better below. How the presence or absence of a qualifier contributes to the informal meaning of a statement is not something that is defined by Wikidata. Just like Wikidata does not define what the property office held means, it also does not define what it means if office held is used with additional qualifiers. This is entirely governed by the community who uses these structures to express something. Of course, the community tries to do this in a systematic, reasoned, and intuitive way. However, there will never be a general rule for how to interpret an arbitrary qualifier. In particular, it is not true that qualifiers are statements about statements. First, to avoid confusion, I need to explain Wikidata's terminology. A statement in Wikidata comprises the whole data structure: main property and value, qualifiers, and references. The structure without the references (the thing that the references provide evidence for) is called a claim. A claim thus contains a main property and value (or no value or unknown value) and zero or more qualifier properties with values. Every claim encodes something that is claimed about the subject of the page (the Wikidata entity), and the references given are supporting this claim (as a whole). You already illustrated yourself how this is different from making statements about statements: it would lead to confusion when several statements have the same main property-value but different qualifiers. This is also why our RDF export does not use the same resource for reifying statements with the same main property-value. Instead, we only share the same resource if two claims are completely identical (including qualifiers). It is true that many qualifiers have a certain meta flavour, but this is not always the case. An interesting case that you might have seen is P161 (cast member) that is used to denote the actors in a film. The typical qualifier there is P453 (role), used to name the role (character) that the person played in the film. If you look at this, this is more like a ternary relation hasActor(film,actor,role) than like a meta-statement. Indeed, an n-ary relationship cannot in general be represented by a meta-statement about a binary relation, again for the same reasons that you gave in your email. In this view, one should maybe also think of a relationship usPresident(person,start date,end date) rather than of an annotated assertion usPresident(person). Wikidata is special in that qualifiers are optional, yet the modelling view of n-ary relations might be closer to the pragmatic truth, since it avoids any meta-statements (it also elegantly justifies why there are no meta-meta-statements, i.e., qualifiers on qualifiers). Best regards, Markus On 31/10/13 11:39, Antoine Zimmermann wrote: Hello, I have a few questions about how statement qualifiers should be used. First, my understanding of qualifiers is that they define statements about statements. So, if I have the statement: Q17(Japan) P6(head of government) Q132345(Shinzō Abe) with the qualifier: P39(office held) Q274948(Prime Minister of Japan) it means that the statement holds an office, right? It seems to me that this is incorrect and that this qualifier should in fact be a statement about Shinzō Abe. Can you confirm this?
Second, concerning temporal qualifiers: what does it mean that the start or end is no value? I can imagine two interpretations: 1. the statement is true forever (a person is a dead person from the moment of their death till the end of the universe) 2. (for end date) the statement is still true, we cannot predict when it's going to end. For me, case number 2 should rather be marked as unknown value rather than no value. But again, what does unknown value mean in comparison to having no indicated value? Third, what if a statement is temporarily true (say, X held office from T1 to T2) then becomes false and becomes true again (like X held the same office from T3 to T4 with T3 > T2)? The situation exists for Q35171(Grover Cleveland) who has the following statement: Q35171 P39(position held) Q11696(President of the United States of America) with qualifiers, and a second occurrence of the same statement with different qualifiers. The Wikidata user interface makes it clear that there are two occurrences of the statement with different qualifiers, but how does the Wikidata data model allow me to distinguish between these two occurrences? How do I know that: P580(start date) March 4 1885 only applies to the first occurrence of the statement, while: P580(start date) March 4 1893 only applies to the second occurrence of the statement? I could have a
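To make the claim-identity point concrete, here is a minimal sketch (schematic Python dicts, not the full Wikidata JSON; the item ids are invented for illustration) of two claims that share a main property-value pair but differ in their qualifiers, as in the P161/P453 example:

    # Schematic claim records: same main property-value pair, different
    # qualifiers. Q123/Q456/Q789 are made-up ids for illustration only.
    claim_1 = {
        'mainsnak': {'property': 'P161', 'value': 'Q123'},  # cast member: some actor
        'qualifiers': {'P453': ['Q456']},                   # role: character A
    }
    claim_2 = {
        'mainsnak': {'property': 'P161', 'value': 'Q123'},  # same actor
        'qualifiers': {'P453': ['Q789']},                   # role: character B
    }

    # A claim's identity is the whole structure, including qualifiers, so
    # the two occurrences stay distinguishable although the main snaks agree.
    print(claim_1['mainsnak'] == claim_2['mainsnak'])  # True
    print(claim_1 == claim_2)                          # False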
Re: [Wikidata-l] Application: sexing people by name/research gender bias
On 14/10/13 17:52, Klein,Max wrote: Hi all, First of all I think this is fantastic research. It goes to show, it's not just properties that we can correlate, but also the Labels, Aliases, Sitelinks, and the connections between each field. I would like to point out, as Markus does in his discussion - the relatively disproportionate representation of sex in Academia is the motivation for studying this. Let us be sensitive to results in that field. Let's remember our simplifying assumptions. We have flattened sex and gender into one measure, and at that this research makes a binary male/female classification, where even the Wikidata sex property is trinary (intersex). I hope that in the future we can increase or change our view to how we model sex. Indeed, the debates on gender inequality and gender multiplicity look at things on very different zoom levels. The goal of my little experiment (I would not call it research, as it has neither a hypothesis nor any form of evaluation) was not to put individual people into rigid gender buckets but to estimate rough global distributions. My error margins are far too wide to make any realistic statement about minority genders even if I had a method to consider them. As far as social definitions of gender go, this is probably something to study in a wider context of representation of social minorities in certain professional fields. Cheers, Markus From: wikidata-l-boun...@lists.wikimedia.org on behalf of Paul A. Houle p...@ontology2.com Sent: Sunday, October 13, 2013 5:32 PM To: Discussion list for the Wikidata project. Subject: Re: [Wikidata-l] Application: sexing people by name/research gender bias Just as a suggestion, you can turn these kinds of numbers into a probability distribution using the beta distribution. If you use (1,1) as a prior you get something like beta(251,1) for the distribution of the probability that somebody named Aaron is male. -Original Message- From: Markus Krötzsch Sent: Sunday, October 13, 2013 6:16 PM To: Discussion list for the Wikidata project. Subject: [Wikidata-l] Application: sexing people by name/research gender bias Hi all, I'd like to share a little Wikidata application: I just used Wikidata to guess the sex of people based on their (first) name [1]. My goal was to determine gender bias among the authors in several research areas. This is how some people spend their free time on weekends ;-) In the process, I also created a long list of first names with associated sex information from Wikidata [2]. It is not super clean but it served its purpose. If you are a researcher, then maybe the gender bias of journals/conferences is interesting to you as well. Details and some discussion of the results are online [1]. Cheers, Markus [1] http://korrekt.org/page/Note:Sex_Distributions_in_Research [2] https://docs.google.com/spreadsheet/ccc?key=0AstQ5xfO-xXGdE9UVkxNc0JMVWJzNmJqNmhPRjc0cnc&usp=sharing ___ Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l
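The beta-distribution suggestion is easy to try with SciPy. A minimal sketch, using the Beta(1,1) prior and the 250-male/0-female counts implied by the Beta(251,1) 'Aaron' figure above:

    from scipy.stats import beta

    # Beta(1,1) prior plus observed counts; 250 male / 0 female yields
    # the Beta(251,1) posterior from the 'Aaron' example.
    males, females = 250, 0
    posterior = beta(1 + males, 1 + females)

    print(posterior.mean())          # about 0.996
    print(posterior.interval(0.95))  # central 95% credible interval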
Re: [Wikidata-l] Application: sexing people by name/research gender bias
On 14/10/13 18:18, Tom Morris wrote: Naming patterns change over time and geography. If you're interested in the gender of current day authors, you should probably constrain your name sampling to the same timeframe. I think geography has a much bigger impact than time here. Unfortunately, the names I try to find the sex for do not come with an obvious hint on their geographic origin, so I cannot really use this. I think filtering by time will not have a big impact, since most people on Wikipedia are from the 20th century anyway. So there should be a natural tendency to overrule older uses of names. There's an app that works off the Freebase data here: http://namegender.freebaseapps.com/ It also has an API that returns JSON: http://namegender.freebaseapps.com/gender_api?name=andrea Based on the top name stats, it looks like its sample is a little more than twice the size of Wikidata's. Nice. Christian Thiele also pointed me to a beautiful web service based on Wikipedia Personendaten (German language, but many things are easy to figure out, I guess): http://toolserver.org/~apper/pd/vorname/top http://toolserver.org/~apper/pd/vorname/Maria This illustrates nicely how to take the effect of time into account. Markus On Sun, Oct 13, 2013 at 6:16 PM, Markus Krötzsch mar...@semantic-mediawiki.org wrote: Hi all, I'd like to share a little Wikidata application: I just used Wikidata to guess the sex of people based on their (first) name [1]. My goal was to determine gender bias among the authors in several research areas. This is how some people spend their free time on weekends ;-) In the process, I also created a long list of first names with associated sex information from Wikidata [2]. It is not super clean but it served its purpose. If you are a researcher, then maybe the gender bias of journals/conferences is interesting to you as well. Details and some discussion of the results are online [1]. Cheers, Markus [1] http://korrekt.org/page/Note:Sex_Distributions_in_Research [2] https://docs.google.com/spreadsheet/ccc?key=0AstQ5xfO-xXGdE9UVkxNc0JMVWJzNmJqNmhPRjc0cnc&usp=sharing ___ Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l
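Querying the JSON API mentioned above is a small exercise; this sketch just prints the raw response, since the response schema is not described in the thread and nothing is assumed about its fields:

    import json
    from urllib.parse import urlencode
    from urllib.request import urlopen

    # Query the name-gender endpoint quoted above; the service may no
    # longer be available, and its response fields are not assumed here.
    url = 'http://namegender.freebaseapps.com/gender_api?' + urlencode({'name': 'andrea'})
    with urlopen(url) as response:
        print(json.loads(response.read().decode('utf-8')))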
Re: [Wikidata-l] Application: sexing people by name/research gender bias
On 13/10/13 23:21, Magnus Manske wrote: If you need to push through automated sexing for items without sex property, point to my similar attempt in June: https://www.wikidata.org/wiki/Wikidata:Bot_requests#Set_sex:male_for_item_list Thanks, the list I got from the items with sex is already longer than I need. My main problem is sexing Asian authors. Not sure if name-based approaches are promising there at all. Markus On Sun, Oct 13, 2013 at 11:16 PM, Markus Krötzsch mar...@semantic-mediawiki.org wrote: Hi all, I'd like to share a little Wikidata application: I just used Wikidata to guess the sex of people based on their (first) name [1]. My goal was to determine gender bias among the authors in several research areas. This is how some people spend their free time on weekends ;-) In the process, I also created a long list of first names with associated sex information from Wikidata [2]. It is not super clean but it served its purpose. If you are a researcher, then maybe the gender bias of journals/conferences is interesting to you as well. Details and some discussion of the results are online [1]. Cheers, Markus [1] http://korrekt.org/page/Note:Sex_Distributions_in_Research [2] https://docs.google.com/spreadsheet/ccc?key=0AstQ5xfO-xXGdE9UVkxNc0JMVWJzNmJqNmhPRjc0cnc&usp=sharing ___ Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l
Re: [Wikidata-l] Pushing Wikidata to the next level
Hi -- or better: Heya! -- Lydia: Congratulations on your new role! This is great news for the project, which allows Wikidata to proceed on its important mission in perfect continuity. Denny has made huge contributions to the project in the past 1.5 years -- a task that often involved balancing many forces, both on a technical and on a social level. Without his commitment and energy, we would not be in this encouraging position today. We would also have a lot less funding to draw from. We are really extremely fortunate to continue with a product manager who is perfectly prepared for this important role: someone who has the key skills as well as the specific experience, and who has a profound understanding of what open source and open knowledge are all about. So welcome again to your new job, and all the best for the next steps. Cheers, Markus On 01/10/13 15:30, Lydia Pintscher wrote: (crossposting from http://blog.wikimedia.de/?p=17250) In early 2010 I met Denny and Markus for the first time in a small room at the Karlsruhe Institute of Technology to talk about Semantic MediaWiki, its development and its community. I was intrigued by the idea they'd been pushing for since 2005 - bringing structured data to Wikipedia. So when the time came to assemble the team for the development of Wikidata and Denny approached me to do community communications for it, there was no way I could have said no. The project sounded amazing and the timing was perfect since I was about to finish my studies of computer science. In the one and a half years since then we have achieved something amazing. We've built a great technical base for Wikidata and much more importantly we've built an amazing community around it. We've built the foundation for something extraordinary. On a personal level I could never have dreamed where this one meeting in a small room in Karlsruhe has taken me now. From now on I will be taking over product ownership of Wikidata as its product manager. Up until today we've built the foundation for something extraordinary. But at the same time there are still a lot of things that need to be worked on by all of us together. The areas that we need to focus on now are: * Building trust in our data. The project is still young and the Wikipedia editors and others are still wary of using data from Wikidata on a large scale. We need to build tools and processes to make our data more trustworthy. * Improving the user experience around Wikidata. Building Wikidata to the point where it is today was a tremendous technical task that we achieved in a rather short time. This though meant that in places the user experience has not gotten as much attention. We need to make the experience of using Wikidata smoother. * Making Wikidata easier to understand. Wikidata is a very geeky and technical project. However, to be truly successful, it will need to be easy to get the ideas behind it. These are crucial for Wikidata to have the impact we all want it to have. And we will all need to work on those - both in the development team and in the rest of the Wikidata community. Let's make Wikidata a joy to use and get it used in places and ways we can't even imagine yet. Cheers Lydia ___ Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l
[Wikidata-l] Wikidata Toolkit: call for feedback/support
Dear Wikidatanions (*), I have just drafted a little proposal for creating more tools for external people to work with Wikidata, especially to build services on top of its data [1]. Your feedback and support is needed. Idea: Currently, this is quite hard for people, since we only have WDA for reading/analysing dumps [2] and Wikidata Query as a single web service to ask queries [3]. We should have more support for programmers who want to load, query, analyse, and otherwise use the data. The proposal is to start such a toolkit to enable more work with the data. The plan is to kickstart this project with a small team using Wikimedia's Individual Engagement program. For this we will need your support -- feel free to add your voice to the wiki page [1]. Of course, comments of all sorts are also great -- this email thread will be linked from the page. If you would like to be involved with the project, that's great too; let me know and I can add you to the proposal. The proposal will already be submitted tomorrow, but support should also be possible after that, I hope. Cheers, Markus (*) Do we have a demonym yet? Wikipedian sounds natural, Wikidatan less so. Maybe this should be another thread ... ;-) [1] https://meta.wikimedia.org/wiki/Grants:IEG/Wikidata_Toolkit [2] http://github.com/mkroetzsch/wda [3] http://208.80.153.172/wdq/ -- Markus Kroetzsch, Departmental Lecturer Department of Computer Science, University of Oxford Room 306, Parks Road, OX1 3QD Oxford, United Kingdom +44 (0)1865 283529 http://korrekt.org/ ___ Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l
Re: [Wikidata-l] claims Datatypes inconsistency suspicion
Hi Daniel, if I understand you correctly, you are in favour of equating datavalue types and property types. This would indeed solve the problems at hand. The reason why both kinds of types are distinct in SMW and also in Wikidata is that property types are naturally more extensible than datavalue types. CommonsMedia is a good example of this: all you need is a custom UI and you can handle new data without changing the underlying data model. This makes it easy for contributors to add new types without far-reaching ramifications in the backend (think of numbers, which could be decimal, natural, positive, range-restricted, etc. but would still be treated as a number in the backend). Using fewer datavalue types also improves interoperability. E.g., you want to compare two numbers, even if one is a natural number and another one is a decimal. There is no simple rule for deciding how many datavalue types there should be. The general guideline is to decide on datavalue types based on use cases. I am arguing for distinguishing IRIs and strings since there are many contexts and applications where this is a crucial difference. Conversely, I don't know of any application where it makes sense to keep the two similar (this would have to be something where we compare strings and IRIs on a data level, e.g., if you were looking for all websites with URLs that are alphabetically greater than the postcode of a city in England :-p). In general, however, it will be good to keep the set of basic datavalue types small, while allowing the set of property types to grow. The set of base datavalue types that we use is based on the experience in SMW as well as on existing formats like XSD (which also has many derived types but only a few base types). As for the possible confusion, I think some naming discipline would clarify this. In SMW, there is a stronger difference between both kinds of types, and a fixed schema for property type ids that makes it easy to recognise them. In any case, using string for IRIs does not seem to solve any problem. It does not simplify the type system in general and it does not help with the use cases that I mentioned. What I do not agree with are your arguments about all of this being internal. We would not have this discussion if it were. The data model of Wikidata is the primary conceptual model that specifies what Wikidata stores. You might still be right that some of the implementation is internal, but the arguments we both exchange are not really on the implementation level ;-). Best wishes Markus, offline soon for travelling On 26/08/13 10:35, Daniel Kinzler wrote: On 25.08.2013 19:19, Markus Krötzsch wrote: If we have an IRI DV, considering that URLs are special IRIs, it seems clear that IRI would be the best way of storing them. The best way of storing them really depends on the storage platform. It may be a string or something else. I think the real issue here is that we are exposing something that is really an internal detail (the data value type) instead of the high level information we actually should be exposing, namely property type. I think splitting the two was a mistake, and I think exposing the DV type while making the property type all but inaccessible makes things a lot worse. In my opinion, data should be self-descriptive, so the *semantic* type of the property should be included along with the value. People expect this, and assume that this is what the DV type is. But it's not, and should not be used or abused for this purpose.
Ideally, it should not matter at all to any 3rd party if we use a string or IRI DV internally. The (semantic) property type would be URL, and that's all that matters. I'm quite unhappy about the current situation; we are beginning to see the backlash of the decision not to include the property type inline. If we don't do anything about this now, I fear the confusion is going to get worse. -- daniel ___ Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l
Re: [Wikidata-l] claims Datatypes inconsistency suspicion
Dear Hady, On 22/08/13 14:44, Hady elsahar wrote: Hello Markus, thanks for pointing to the wda code, it's very useful. I guess by looking at the Wikidata glossary, property data types and data value types are the same thing: http://www.wikidata.org/wiki/Wikidata:Glossary#Datatypes This may be a little shallow, but what I saw is that (correct me if I'm mistaken): - they don't use the same names when you search for the datatype of the property item and the value type of the item that uses this property. Another problem is that they decided to represent commonsMedia as strings, for some purpose I don't know; that's why I didn't get it and thought it's some sort of inconsistency. In most cases, however, you can infer the property type from the datavalue type, but not in all. Unfortunately, you do not generally find the property type in a dump before you find its first use. Could you point out why depending on such mappings didn't always work, for just Wikipedia Commons files? 'wikibase-item' = 'wikibase-entityid' 'string' = 'string' 'time' = 'time' 'globe-coordinate' = 'globecoordinate' 'commonsMedia' = 'string' The key is to understand that property types and value types are *not* the same. They match in many cases, but not in all. In the future, there might be more property types that use the same value type. Property types are what the user sees; they define every detail of user interaction and UI. Value types are part of the underlying data model; they define what the content of the data is. For most data processing, you should not need to know the property type. The situation with commonsMedia is a bit bad because it should be a URL rather than a string. What I do in wda is effectively a type conversion from string to URI in this particular case. Maybe we can fix this somehow in the future when URIs are supported as a value datatype. Markus On Thu, Aug 22, 2013 at 11:33 AM, Markus Krötzsch mar...@semantic-mediawiki.org wrote: Hi all, I think one source of confusion here are the overlapping names of property datatypes and datavalue types. Basically, the mapping is as follows right now: [Format: property type = datavalue type occurring in current dumps] 'wikibase-item' = 'wikibase-entityid' 'string' = 'string' 'time' = 'time' 'globe-coordinate' = 'globecoordinate' 'commonsMedia' = 'string' The point is that string on the left is not the same as string on the right. (Also note the lack of a consistent naming scheme for these ids :-/ ...) In most cases, however, you can infer the property type from the datavalue type, but not in all. Unfortunately, you do not generally find the property type in a dump before you find its first use. The wda script's RDF export has code for dealing with this. It remembers all types that it finds (from P entities in the dump), it infers types from values where possible, and it uses the API to find out the type of a property if all else fails (typically, if you find a string value but don't know yet if the property is of type string or commonsMedia). In addition, the script has a hardcoded list of known types that can be extended (there are not so many properties and their types never change, hence one can do this quite easily). You can find all the code at [1]. Cheers, Markus [1] https://github.com/mkroetzsch/wda/blob/master/includes/epTurtleFileWriter.py (esp.
see __getPropertyType() and __fetchPropertyType()) On 21/08/13 21:00, Byrial Jensen wrote: On 21-08-2013 21:09, Hady elsahar wrote: Hello Jeroen, can I get from your words that this page: http://www.wikidata.org/wiki/Special:ListDatatypes is not up to date? If so, how can I get all the datatypes in Wikidata? Pages in the virtual Special namespace are generated by MediaWiki on demand, and are therefore always (in principle - there can be caching in some cases) up to date. string could be anything (so time could be a string), but there's a defined lower level representation of Commons media files. So is it wrong to represent it as string, Time cannot be a string, as there are several components in a time value (time, timezone, precision, calendar model, before and after precisions). I see nothing wrong in storing commonsMedia values as string values. You will know from the property's datatype that the string is a CommonsMedia string. Regards, - Byrial
Re: [Wikidata-l] claims Datatypes inconsistency suspicion
Hi all, I think one source of confusion here are the overlapping names of property datatypes and datavalue types. Basically, the mapping is as follows right now: [Format: property type = datavalue type occurring in current dumps] 'wikibase-item' = 'wikibase-entityid' 'string' = 'string' 'time' = 'time' 'globe-coordinate' = 'globecoordinate' 'commonsMedia' = 'string' The point is that string on the left is not the same as string on the right. (Also note the lack of a consistent naming scheme for these ids :-/ ...) In most cases, however, you can infer the property type from the datavalue type, but not in all. Unfortunately, you do not generally find the property type in a dump before you find its first use. The wda script's RDF export has code for dealing with this. It remembers all types that it finds (from P entities in the dump), it infers types from values where possible, and it uses the API to find out the type of a property if all else fails (typically, if you find a string value but don't know yet if the property is of type string or commonsMedia). In addition, the script has a hardcoded list of known types that can be extended (there are not so many properties and their types never change, hence one can do this quite easily). You can find all the code at [1]. Cheers, Markus [1] https://github.com/mkroetzsch/wda/blob/master/includes/epTurtleFileWriter.py (esp. see __getPropertyType() and __fetchPropertyType()) On 21/08/13 21:00, Byrial Jensen wrote: On 21-08-2013 21:09, Hady elsahar wrote: Hello Jeroen, can I get from your words that this page: http://www.wikidata.org/wiki/Special:ListDatatypes is not up to date? If so, how can I get all the datatypes in Wikidata? Pages in the virtual Special namespace are generated by MediaWiki on demand, and are therefore always (in principle - there can be caching in some cases) up to date. string could be anything (so time could be a string), but there's a defined lower level representation of Commons media files. So is it wrong to represent it as string, Time cannot be a string, as there are several components in a time value (time, timezone, precision, calendar model, before and after precisions). I see nothing wrong in storing commonsMedia values as string values. You will know from the property's datatype that the string is a CommonsMedia string. Regards, - Byrial ___ Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l
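The mapping above is small enough to write down directly; inverting it shows exactly where type inference from values alone breaks down (a minimal sketch):

    # The property type -> datavalue type mapping from the email, as a dict:
    PROPERTY_TO_VALUE_TYPE = {
        'wikibase-item': 'wikibase-entityid',
        'string': 'string',
        'time': 'time',
        'globe-coordinate': 'globecoordinate',
        'commonsMedia': 'string',
    }

    def candidate_property_types(value_type):
        # Inverting the mapping shows why inference can fail: a 'string'
        # value may belong to a 'string' or a 'commonsMedia' property.
        return sorted(p for p, v in PROPERTY_TO_VALUE_TYPE.items() if v == value_type)

    print(candidate_property_types('string'))  # ['commonsMedia', 'string'] -- ambiguous
    print(candidate_property_types('time'))    # ['time'] -- unambiguous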
Re: [Wikidata-l] Exporting RDF from Wikidata?
On 15/08/13 21:38, Dan Brickley wrote: ... FWIW there's also RDF/XML if you use a *.rdf suffix. This btw is of great interest to us over in the schema.org project; earlier today I was showing http://www.wikidata.org/wiki/Special:EntityData/Q199154.rdf to colleagues there... this is a Wikidata description of a particular sport. In schema.org we have a few places that hardcode a short list of well known sports, and we're interested in mechanisms that allow us to hand off to Wikidata for the long tail. So http://schema.org/SportsActivityLocation has 9 hand-designed subtypes; we have been discussing the idea of something like http://schema.org/SportsActivityLocation?sport=Q199154 to integrate Wikidata into the story for other sports. Similar issues arise with religions and places of worship (http://schema.org/PlaceOfWorship). Any thoughts on this from a Wikidata perspective would be great. This is definitely something that we would like to encourage. Wikidata ids are fairly stable (not based on labels or languages) and fairly well grounded (described and named in many languages + linked to many Wikipedia pages, authority files, and external databases). So they should make suitable identifiers. No identifier will ever be reused, but it can happen that a Wikidata item is deleted, in which case it is no longer a suitable identifier. In theory, it can also happen that the data of an item changes so completely that the meaning of the item is different, but this is quite unlikely. One can access historic data fairly easily as long as the item is not deleted completely (not sure if a historic RDF export [by revision number] is planned, but it would not be hard to implement). And of course one would want the identifiers to be somewhat dynamic to capture changes of ideas over time (sports change all the time, e.g., if official rules are modified, but probably one does not want new IDs for every version of football). I am not sure if one needs to use http://schema.org/SportsActivityLocation?sport=Q199154 instead of using http://www.wikidata.org/entity/Q199154 directly. Would these two have different meanings somehow? I guess they could, but there should not be a problem with long-term sustainability of the Wikidata URIs (just in case this is the main reason for creating new URIs here). Is there any prospect of inline RDFa within the main Wikidata per-entity pages? It would be great to have http://schema.org/sameAs in those pages linking to dbpedia, wikipedia, freebase etc. too... This is not currently planned. One interesting starting point could be to identify the Wikidata properties that express same as. For example, many properties link to other data collections by giving IDs (which often correspond to URIs, only that URL datavalues are not quite implemented yet). However, the granularity of other databases is often not the same, and it might not be true that these IDs unambiguously define the identity of the subject. For example, we had on this list a question recently whether Norman Cook should have individual entities for his various synonyms or not; MusicBrainz has several IDs for him based on synonyms, but Wikipedia has only one article about the person. In such cases, links to other datasets should probably not be interpreted as sameAs. We currently use schema.org's about for linking Wikipedia pages to Wikidata ids.
It seems wrong to say that an abstract URI (about a Wikidata entity) is the same as the URL of a Webpage that covers that topic. (This comment is about the links to Wikipedia you mentioned, not about cases with dedicated URIs that are not the web page URLs; the URIs for Wikipedia articles are in a strong sense the Wikidata URIs that we already start from ;-) Btw, it is planned (vaguely) that property pages can hold more information, which could be used to declare identifier properties in the system at some point. But this will still take a while to implement. Markus ___ Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l
Re: [Wikidata-l] Exporting RDF from Wikidata?
On 15/08/13 19:33, Jona Christopher Sahnwaldt wrote: http://www.wikidata.org/entity/Q215607.nt which redirects to http://www.wikidata.org/wiki/Special:EntityData/Q215607.nt The RDF stuff at Wikidata is in flux. The RDF you get probably won't contain all the data that the HTML page shows, and the RDF structure may change. Indeed, the feature is simply not fully implemented yet. The best preview you can get right now is the dump generated by the python script. The plan is to make essentially the same available on a per-item basis via the URIs and URLs as above (in several syntaxes, depending on URL or, when using the URI, content negotiation). Markus On 15 August 2013 20:25, Kingsley Idehen kide...@openlinksw.com wrote: All, How do I obtain an RDF rendition of the Wikidata document http://www.wikidata.org/wiki/Q215607 ? Naturally, I've scoured the Web for examples and I keep on coming up empty :-( -- Regards, Kingsley Idehen Founder CEO OpenLink Software Company Web: http://www.openlinksw.com Personal Weblog: http://www.openlinksw.com/blog/~kidehen Twitter/Identi.ca handle: @kidehen Google+ Profile: https://plus.google.com/112399767740508618350/about LinkedIn Profile: http://www.linkedin.com/in/kidehen ___ Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l
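A quick way to try the per-item exports named above (a sketch, subject to the caveat that the RDF output is still in flux; other suffixes such as .rdf or .json should work the same way):

    from urllib.request import urlopen

    # Fetch the N-Triples rendition of an item via the redirect target
    # named above; urlopen follows the redirect automatically.
    url = 'https://www.wikidata.org/wiki/Special:EntityData/Q215607.nt'
    with urlopen(url) as response:
        triples = response.read().decode('utf-8')

    print('\n'.join(triples.splitlines()[:5]))  # first few triples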
Re: [Wikidata-l] Wikidata RDF export available
On 12/08/13 17:56, Nicolas Torzec wrote: With respect to the RDF export I'd advocate for: 1) an RDF format with one fact per line. 2) the use of a mature/proven RDF generation framework. Optimizing too early based on a limited and/or biased view of the potential use cases may not be a good idea in the long run. I'd rather keep it simple and standard at the data publishing level, and let consumers access data easily and optimize processing to their needs. RDF has several official, standardised syntaxes, and one of them is Turtle. Using it is not a form of optimisation, just a choice of syntax. Every tool I have ever used for serious RDF work (triple stores, libraries, even OWL tools) supports any of the standard RDF syntaxes *just as well*. I do see that there are some advantages in some formats and others in others (I agree with most arguments that have been put forward). But would it not be better to first take a look at the actual content rather than debating the syntactic formatting now? As I said, this is not the final syntax anyway, which will be created with different code in a different programming language. Also, I should not have to run a preprocessing step for filtering out the pieces of data that do not follow the standard… To the best of our knowledge, there are no such pieces in the current dump. We should try to keep this conversation somewhat related to the actual Wikidata dump that is created by the current version of the Python script on github (I will also upload a dump again tomorrow; currently, you can only get the dump by running the script yourself). I know I suggested that one could parse Turtle in a robust way (which I still think one can) but I am not suggesting for a moment that this should be necessary for using Wikidata dumps in the future. I am committed to fixing any error as it is found, but so far I don't get much input in that direction. Note that I also understand the need for a format that groups all facts about a subject into one record, and serializes them one record per line. It sometimes makes life easier for bulk processing of large datasets. But that's a different discussion. As I said: advantages and disadvantages. This is why we will probably have all desired formats at some time. But someone needs to start somewhere. Markus -- Nicolas Torzec. On 8/12/13 1:49 AM, Markus Krötzsch mar...@semantic-mediawiki.org wrote: On 11/08/13 22:29, Tom Morris wrote: On Sat, Aug 10, 2013 at 2:30 PM, Markus Krötzsch mar...@semantic-mediawiki.org wrote: Anyway, if you restrict yourself to tools that are installed by default on your system, then it will be difficult to do many interesting things with a 4.5G RDF file ;-) Seriously, the RDF dump is really meant specifically for tools that take RDF inputs. It is not very straightforward to encode all of Wikidata in triples, and it leads to some inconvenient constructions (especially a lot of reification). If you don't actually want to use an RDF tool and you are just interested in the data, then there would be easier ways of getting it. A single fact per line seems like a pretty convenient format to me. What format do you recommend that's easier to process? I'd suggest some custom format that at least keeps single data values in one line. For example, in RDF, you have to do two joins to find all items that have a property with a date in the year 2010. Even with a line-by-line format, you will not be able to grep this.
So I think a less normalised representation would be nicer for direct text-based processing. For text-based processing, I would probably prefer a format where one statement is encoded on one line. But it really depends on what you want to do. Maybe you could also remove some data to obtain something that is easier to process. Markus ___ Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l
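To illustrate the point about joins versus grepping: with a hypothetical one-statement-per-line format (the layout below is made up for illustration), finding statements with a date in 2010 is a single streaming pass, whereas the normalised RDF encoding needs joins over the reified value nodes:

    import re

    # Hypothetical one-statement-per-line format, e.g.
    #   Q42 P569 "1952-03-11" ...
    # The file name and layout are assumptions for this sketch only.
    DATE_IN_2010 = re.compile(r'"2010-\d\d-\d\d')

    with open('statements-one-per-line.txt', encoding='utf-8') as dump:
        for line in dump:
            if DATE_IN_2010.search(line):
                print(line.rstrip())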
Re: [Wikidata-l] question about 2 different json formats
On 10/08/13 10:29, Byrial Jensen wrote: ... (BTW, the time values seem to be OK again, after many syntax errors in the beginning. But the coordinate values have some strange (probably erroneous?) variations: values where the precision and/or globe is given as null, and values where the globe is given as the string earth instead of an entity). Thanks for the warning. This was something that has been causing problems in the RDF dump too. I am now validating the globe settings more carefully. Cheers, Markus About the inconsistency in the dump file, is there any bug entry created for this? (I can create one, if anyone can point me to the proper place to do that). Not for my sake. I adapted to two entity formats in the dumps immediately when the new format started to appear. Best regards, - Byrial ___ Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l
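A sketch of what such stricter validation can look like, flagging exactly the variations reported above; the field names follow the JSON dump layout, but treat the details as illustrative rather than authoritative:

    # Flag null precision/globe, or a globe given as the plain string
    # "earth" instead of an entity URI.
    def coordinate_problems(value):
        problems = []
        if value.get('precision') is None:
            problems.append('precision is null')
        globe = value.get('globe')
        if globe is None:
            problems.append('globe is null')
        elif not str(globe).startswith('http://www.wikidata.org/entity/Q'):
            problems.append('globe is not an entity URI: %r' % globe)
        return problems

    print(coordinate_problems({'latitude': 51.5, 'longitude': -0.13,
                               'precision': None, 'globe': 'earth'}))
    # ['precision is null', "globe is not an entity URI: 'earth'"]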
Re: [Wikidata-l] Wikidata RDF export available
Good morning. I just found a bug that was caused by a bug in the Wikidata dumps (a value that should be a URI was not). This led to a few dozen lines with illegal qnames of the form w: . The updated script fixes this. Cheers, Markus On 09/08/13 18:15, Markus Krötzsch wrote: Hi Sebastian, On 09/08/13 15:44, Sebastian Hellmann wrote: Hi Markus, we just had a look at your python code and created a dump. We are still getting a syntax error for the turtle dump. You mean just as in at around 15:30 today ;-)? The code is under heavy development, so changes are quite frequent. Please expect things to be broken in some cases (this is just a little community project, not part of the official Wikidata development). I have just uploaded a new statements export (20130808) to http://semanticweb.org/RDF/Wikidata/ which you might want to try. I saw that you did not use a mature framework for serializing the turtle. Let me explain the problem: Over the last 4 years, I have seen about two dozen people (undergraduate and PhD students, as well as Post-Docs) implement simple serializers for RDF. They all failed. This was normally not due to a lack of skill, but due to a lack of time. They wanted to do it quick, but they didn't have the time to implement it correctly in the long run. There are some really nasty problems ahead like encoding or special characters in URIs. I would strongly advise you to: 1. use a Python RDF framework 2. do some syntax tests on the output, e.g. with rapper 3. use a line by line format, e.g. use turtle without prefixes and just one triple per line (It's like NTriples, but with Unicode) Yes, URI encoding could be difficult if we were doing it manually. Note, however, that we are already using a standard library for URI encoding in all non-trivial cases, so this does not seem to be a very likely cause of the problem (though some non-zero probability remains). In general, it is not unlikely that there are bugs in the RDF somewhere; please consider this export as an early prototype that is meant for experimentation purposes. If you want an official RDF dump, you will have to wait for the Wikidata project team to get around to doing it (this will surely be based on an RDF library). Personally, I already found the dump useful (I successfully imported some 109 million triples of some custom script into an RDF store), but I know that it can require some tweaking. We are having a problem currently, because we tried to convert the dump to NTriples (which would be handled by a framework as well) with rapper. We assume that the error is an extra < somewhere (not confirmed) and we are still searching for it since the dump is so big Ok, looking forward to hearing about the results of your search. A good tip for checking such things is to use grep. I did a quick grep on my current local statements export to count the numbers of < and > (this takes less than a minute on my laptop, including on-the-fly decompression). Both numbers were equal, making it unlikely that there is any unmatched < in the current dumps. Then I used grep to check that < and > only occur in the statements files in lines with commons URLs. These are created using urllib, so there should never be any < or > in them. so we can not provide a detailed bug report. If we had one triple per line, this would also be easier, plus there are advantages for stream reading. bzip2 compression is very good as well, no need for prefix optimization. Not sure what you mean here.
Turtle prefixes in general seem to be a Good Thing, not just for reducing the file size. The code has no easy way to get rid of prefixes, but if you want a line-by-line export you could subclass my exporter and overwrite the methods for incremental triple writing so that they remember the last subject (or property) and create full triples instead. This would give you a line-by-line export in (almost) no time (some uses of [...] blocks in object positions would remain, but maybe you could live with that). Best wishes, Markus All the best, Sebastian On 03.08.2013 23:22, Markus Krötzsch wrote: Update: the first bugs in the export have already been discovered -- and fixed in the script on github. The files I uploaded will be updated on Monday when I have a better upload again (the links file should be fine, the statements file requires a rather tolerant Turtle string literal parser, and the labels file has a malformed line that will hardly work anywhere). Markus On 03/08/13 14:48, Markus Krötzsch wrote: Hi, I am happy to report that an initial, yet fully functional RDF export for Wikidata is now available. The exports can be created using the wda-export-data.py script of the wda toolkit [1]. This script downloads recent Wikidata database dumps and processes them to create RDF/Turtle files. Various options are available to customize the output (e.g., to export statements but not references, or to export only texts in English and Wolof). The file
Re: [Wikidata-l] Wikidata language codes
On 10/08/13 11:07, John Erling Blad wrote: The language code no is the metacode for Norwegian, and nowiki was in the beginning used for both Norwegian Bokmål, Riksmål and Nynorsk. The latter split off and made nnwiki, but nowiki continued as before. After a while all Nynorsk content was migrated. Now nowiki has content in Bokmål and Riksmål; the first one is official in Norway and the latter is an unofficial variant. After the last additions to Bokmål there are very few forms that are only legal in Riksmål, so for all practical purposes nowiki has become a pure Bokmål wiki. I think all content in Wikidata should use either nn or nb, and all existing content with no as language code should be folded into nb. It would be nice if no could be used as an alias for nb, as this is the de facto situation now, but it is probably not necessary and could create a discussion with the Nynorsk community. The site code should be nowiki as long as the community does not ask for a change. Thanks for the clarification. I will keep no to mean no for now. What I wonder is: if users choose to enter a no label on Wikidata, what is the language setting that they see? Does this say Norwegian (any variant) or what? That's what puzzles me. I know that a Wikipedia can allow multiple languages (or dialects) to coexist, but in the Wikidata language selector I thought you can only select real languages, not language groups. Markus On 8/6/13, Markus Krötzsch mar...@semantic-mediawiki.org wrote: Hi Purodha, thanks for the helpful hints. I have implemented most of these now in the list on git (this is also where you can see the private codes I have created where needed). I don't see a big problem in changing the codes in future exports if better options become available (it's much easier than changing codes used internally). One open question that I still have is what it means if a language that usually has a script tag appears without such a tag (zh vs. zh-Hans/zh-Hant or sr vs. sr-Cyrl/sr-Latn). Does this really mean that we do not know which script is used under this code (either could appear)? The other question is about the duplicate language tags, such as 'crh' and 'crh-Latn', which both appear in the data but are mapped to the same code. Maybe one of the codes is just phased out and will disappear over time? I guess the Wikidata team needs to answer this. We also have some codes that mean the same according to IANA, namely kk and kk-Cyrl, but which are currently not mapped to the same canonical IANA code. Finally, I wondered about Norwegian. I gather that no.wikipedia.org is in Norwegian Bokmål (nb), which is how I map the site now. However, the language data in the dumps (not the site data) uses both no and nb. Moreover, many items have different texts for nb and no. I wonder if both are still Bokmål, and there is just a bug that allows people to enter texts for nb under two language settings (for descriptions this could easily be a different text, even if in the same language). We also have nn, and I did not check how this relates to no (same text or different?). Cheers, Markus
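Expressed as data, the handling discussed here boils down to a small mapping (a sketch; the direction of the crh folding is a guess for illustration, since the thread only says both tags are mapped to one code):

    # Site-to-language handling as discussed: nowiki content is treated
    # as Bokmål (nb), while plain 'no' labels are kept under 'no' for now.
    SITE_LANGUAGE = {
        'nowiki': 'nb',  # no.wikipedia.org content is Bokmål
        'nnwiki': 'nn',
    }

    # Duplicate tags in the data folded onto one code, as with 'crh' and
    # 'crh-Latn' (folding direction chosen here for illustration only):
    LANGUAGE_ALIASES = {
        'crh': 'crh-Latn',
    }

    def normalize_language(code):
        return LANGUAGE_ALIASES.get(code, code)

    print(SITE_LANGUAGE['nowiki'], normalize_language('crh'))  # nb crh-Latn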
Re: [Wikidata-l] Wikidata RDF export available
Hi Sebastian,

On 09/08/13 15:44, Sebastian Hellmann wrote: Hi Markus, we just had a look at your python code and created a dump. We are still getting a syntax error for the turtle dump.

You mean just as in "at around 15:30 today" ;-)? The code is under heavy development, so changes are quite frequent. Please expect things to be broken in some cases (this is just a little community project, not part of the official Wikidata development). I have just uploaded a new statements export (20130808) to http://semanticweb.org/RDF/Wikidata/ which you might want to try.

I saw that you did not use a mature framework for serializing the Turtle. Let me explain the problem: Over the last 4 years, I have seen about two dozen people (undergraduate and PhD students, as well as Post-Docs) implement simple serializers for RDF. They all failed. This was normally not due to a lack of skill, but due to a lack of time. They wanted to do it quickly, but they didn't have the time to implement it correctly in the long run. There are some really nasty problems ahead, like encoding or special characters in URIs. I would strongly advise you to: 1. use a Python RDF framework; 2. do some syntax tests on the output, e.g. with rapper; 3. use a line-by-line format, e.g. Turtle without prefixes and just one triple per line (it's like NTriples, but with Unicode).

Yes, URI encoding could be difficult if we were doing it manually. Note, however, that we are already using a standard library for URI encoding in all non-trivial cases, so this does not seem to be a very likely cause of the problem (though some non-zero probability remains). In general, it is not unlikely that there are bugs in the RDF somewhere; please consider this export as an early prototype that is meant for experimentation purposes. If you want an official RDF dump, you will have to wait for the Wikidata project team to get around to doing it (this will surely be based on an RDF library). Personally, I already found the dump useful (I successfully imported some 109 million triples with a custom script into an RDF store), but I know that it can require some tweaking.

We are having a problem currently, because we tried to convert the dump to NTriples (which would be handled by a framework as well) with rapper. We assume that the error is an extra '<' somewhere (not confirmed), and we are still searching for it since the dump is so big, so we cannot provide a detailed bug report. If we had one triple per line, this would also be easier, plus there are advantages for stream reading. bzip2 compression is very good as well, no need for prefix optimization.

Ok, looking forward to hearing about the results of your search. A good tip for checking such things is to use grep. I did a quick grep on my current local statements export to count the numbers of '<' and '>' (this takes less than a minute on my laptop, including on-the-fly decompression). Both numbers were equal, making it unlikely that there is any unmatched '<' in the current dumps. Then I used grep to check that '<' and '>' only occur in the statements files in lines with Commons URLs. These are created using urllib, so there should never be any '<' or '>' in them.
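A minimal sketch of such a balance check in Python, assuming a bzip2-compressed Turtle dump; the file name is illustrative:

    # Count '<' and '>' in a bzip2-compressed Turtle dump; if the two
    # totals differ, there is at least one unbalanced angle bracket.
    import bz2

    opens = closes = 0
    with bz2.open("wikidata-statements.ttl.bz2", mode="rt", encoding="utf-8") as dump:
        for line in dump:
            opens += line.count("<")
            closes += line.count(">")

    print("'<':", opens, "'>':", closes, "(should be equal)")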
Not sure what you mean regarding prefix optimization. Turtle prefixes in general seem to be a Good Thing, not just for reducing the file size. The code has no easy way to get rid of prefixes, but if you want a line-by-line export you could subclass my exporter and overwrite the methods for incremental triple writing so that they remember the last subject (or property) and create full triples instead. This would give you a line-by-line export in (almost) no time (some uses of [...] blocks in object positions would remain, but maybe you could live with that). Best wishes, Markus

All the best, Sebastian

On 03.08.2013 23:22, Markus Krötzsch wrote: Update: the first bugs in the export have already been discovered -- and fixed in the script on github. The files I uploaded will be updated on Monday when I have a better upload again (the links file should be fine, the statements file requires a rather tolerant Turtle string literal parser, and the labels file has a malformed line that will hardly work anywhere). Markus
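The subclassing idea can be illustrated with a small, self-contained sketch; the class name and the entity/property URIs below are hypothetical and do not reflect the actual wda exporter API:

    # Hypothetical sketch: instead of emitting Turtle's ';' and ','
    # abbreviations, remember the current subject and predicate and
    # write one complete triple per line (NTriples-like).
    import sys

    class FullTripleWriter(object):
        def __init__(self, out):
            self.out = out
            self.subject = None
            self.predicate = None

        def start_subject(self, subject):
            self.subject = subject

        def start_predicate(self, predicate):
            self.predicate = predicate

        def write_object(self, obj):
            # every line is a complete triple, so the file can be
            # split, streamed, or grepped line by line
            self.out.write("%s %s %s .\n" % (self.subject, self.predicate, obj))

    w = FullTripleWriter(sys.stdout)
    w.start_subject("<http://www.wikidata.org/entity/Q42>")
    w.start_predicate("<http://www.w3.org/2000/01/rdf-schema#label>")
    w.write_object('"Douglas Adams"@en')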
Re: [Wikidata-l] PoC: Combining Wikidata and Clojure logic programming
On 07/08/13 15:40, Mingli Yuan wrote: Also, something similar to Magnus' Wiri: here is a bot developed by us on Sina Weibo (a Twitter-like microblogging provider in China): http://weibo.com/n/%E6%9E%9C%E5%A3%B3%E5%A8%98 We use a dataset from Wikidata with some dirty hacks. It is only a few days' quick work.

Sounds exciting (and we always like to learn about uses of the data), but could you give a short description in English of what is happening there? The above link takes me to a Chinese registration form only ;-) Markus

We are really very excited about the availability of such a big dataset. The potential of Wikipedia and Wikidata is unlimited! Long live free knowledge! Regards, Mingli

On Wed, Aug 7, 2013 at 10:21 PM, Magnus Manske magnusman...@googlemail.com wrote: On Wed, Aug 7, 2013 at 3:20 PM, Mingli Yuan mingli.y...@gmail.com wrote: Very cool, Magnus! Does it do real queries on Wikidata, or is it only a UI thing?

It does use live Wikidata. Reasoning is hacked with a few hardcoded regular expressions ;-)
Re: [Wikidata-l] Related research and a working system
Dear Adam, thanks for the pointer. The paper gives an overview of how to design a wiki-based data curation platform for a specific target community. Some of the insights could also apply to Wikidata, while others won't transfer (e.g., you cannot invite the Wikidata community to a mini-workshop to gather requirements).

What I did not find in the paper are numbers of any kind. How do you know that they manage petabytes of data? I also could not figure out how many users they cater for (e.g., they write: 'A small number of testers we called the "seed community" were involved in the testing and experimentation phase. This community generated the initial wiki contents that could then be used to solicit further contributions from a larger community of users' -- but I cannot find how big this small community and this larger community were; this would be important to understand how similar their scenario is to ours).

Anyway, good to know about this recent work. I will send them an email to make them aware of this thread (and of Wikidata). Cheers, Markus

On 05/08/13 09:35, Adam Wight wrote: Dear comrades, I just learned of a system based on MediaWiki which shares many of the same objectives as Wikidata: collaborative data storage and analysis, tracking of provenance, and facilitating citations, to name a few. I'd like to encourage a dialogue with these scientists; I do not think they are aware of your initiative, and they definitely have valuable practical experience after seeing the real-world use of their system. Currently they are managing several petabytes of data. Research paper by the site creators: http://opensym.org/wsos2013/proceedings/p0301-sowe.pdf Sorry I cannot link to their site itself - it might require an account... -Adam Wight
Re: [Wikidata-l] PoC: Combining Wikidata and Clojure logic programming
Hi Mingli, thanks, this is very interesting, but I think I need a bit more context to understand what you are doing and why. Is your goal to create a library for accessing Wikidata from Clojure (like a Clojure API for Wikidata)? Or is your goal to use logical inference over Wikidata, and you just use Clojure as a tool since it was most convenient?

To your question: * Do we have a long-term plan to evolve Wikidata towards a semantically rich dataset?

There are no concrete designs for adding reasoning features to Wikidata so far (if this is what you mean). There are various open questions, especially related to inferencing over quantifiers. But there are also important technical questions, especially regarding performance. I intend to work out the theory in more detail soon (that is: how should logical rules over the Wikidata data model work in principle?). The implementation then is the next step. I don't think that any of this will be part of the core features of Wikidata soon, but hopefully we can set up a useful external service for Wikidata search and analytics (e.g., to check for property constraint violations in real time instead of using custom code and bots). Cheers, Markus

On 05/08/13 17:30, Mingli Yuan wrote: Hi, folks, After one night's quick work, I have put together a proof of concept to demonstrate that we can combine Wikidata and Clojure logic programming. The source code is here: https://github.com/mountain/knowledge An example of an entity: https://github.com/mountain/knowledge/blob/master/src/entities/albert_einstein.clj Example of types: https://github.com/mountain/knowledge/blob/master/src/meta/types.clj Example of predicates: https://github.com/mountain/knowledge/blob/master/src/meta/properties.clj Example of inference: https://github.com/mountain/knowledge/blob/master/test/knowledge/test.clj Also, we found it very easy to get language versions of the data other than English. So, thanks very much for your great work!

But I found that the semantic layer of Wikidata is shallow: it knows who Einstein's father and children are, but it cannot be inferred automatically from Wikidata that Einstein's father is the grandfather of Einstein's children. So my question is: * Do we have a long-term plan to evolve Wikidata towards a semantically rich dataset? Regards, Mingli
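The missing inference Mingli describes is easy to state as a rule over father/child statements. A minimal illustration in plain Python (rather than Clojure's logic programming), over hypothetical toy data:

    # Illustrative rule: grandfather(X) = father(father(X)).
    # The father relation below is toy data for the example only.
    father = {
        "Eduard Einstein": "Albert Einstein",
        "Hans Albert Einstein": "Albert Einstein",
        "Albert Einstein": "Hermann Einstein",
    }

    def grandfather(person):
        f = father.get(person)
        return father.get(f) if f is not None else None

    assert grandfather("Eduard Einstein") == "Hermann Einstein"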
Re: [Wikidata-l] Wikidata language codes
Hi Purodha, thanks for the helpful hints. I have implemented most of these now in the list on git (this is also where you can see the private codes I have created where needed). I don't see a big problem in changing the codes in future exports if better options become available (it's much easier than changing codes used internally).

One open question that I still have is what it means if a language that usually has a script tag appears without such a tag (zh vs. zh-Hans/zh-Hant, or sr vs. sr-Cyrl/sr-Latn). Does this really mean that we do not know which script is used under this code (either could appear)? The other question is about the duplicate language tags, such as 'crh' and 'crh-Latn', which both appear in the data but are mapped to the same code. Maybe one of the codes is just phased out and will disappear over time? I guess the Wikidata team needs to answer this. We also have some codes that mean the same according to IANA, namely kk and kk-Cyrl, but which are currently not mapped to the same canonical IANA code.

Finally, I wondered about Norwegian. I gather that no.wikipedia.org is in Norwegian Bokmål (nb), which is how I map the site now. However, the language data in the dumps (not the site data) uses both no and nb. Moreover, many items have different texts for nb and no. I wonder if both are still Bokmål, and there is just a bug that allows people to enter texts for nb under two language settings (for descriptions this could easily be a different text, even if in the same language). We also have nn, and I did not check how this relates to no (same text or different?). Cheers, Markus

On 05/08/13 15:41, P. Blissenbach wrote: Hi Markus, Our code 'sr-ec' is at this moment effectively equivalent to 'sr-Cyrl'; likewise, our code 'sr-el' is currently effectively equivalent to 'sr-Latn'. Both might change, once dialect codes of Serbian are added to the IANA subtag registry at http://www.iana.org/assignments/language-subtag-registry/language-subtag-registry

Our code 'nrm' is not being used for the Narom language as ISO 639-3 does, see: http://www-01.sil.org/iso639-3/documentation.asp?id=nrm We rather use it for Norman / Nourmaud, as described in http://en.wikipedia.org/wiki/Norman_language The Norman language is recognized by the Linguist List and many others, but is as of yet not present in ISO 639-3. It should probably be suggested to be added. We should probably map it to a private code meanwhile.

Our code 'ksh' is currently being used to represent a superset of what it stands for in ISO 639-3. Since ISO 639 lacks a group code for Ripuarian, we use the code of the only Ripuarian variety (of dozens) having a code to represent the whole lot. We should probably suggest adding a group code to ISO 639, and at least the dozen+ Ripuarian languages that we are using, and map 'ksh' to a private code for Ripuarian meanwhile.

Note also that for the ALS/GSW and the KSH Wikipedias, page titles are not guaranteed to be in the languages of the Wikipedias. They are often in German instead. Details are to be found in their respective page titling rules. Moreover, for the ksh Wikipedia, unlike some other multilingual or multidialectal Wikipedias, texts are not, or quite often incorrectly, labelled as belonging to a certain dialect. See also: http://meta.wikimedia.org/wiki/Special_language_codes Greetings -- Purodha
Re: [Wikidata-l] Wikidata RDF export available
On 04/08/13 13:17, Federico Leva (Nemo) wrote: Markus Krötzsch, 04/08/2013 12:32: * Wikidata uses be-x-old as a code, but MediaWiki messages for this language seem to use be-tarask as a language code. So there must be a mapping somewhere. Where?

Where I linked it.

Are you sure? The file you linked has mappings from site ids to language codes, not from language codes to language codes. Do you mean to say: if you take only the entries of the form 'XXXwiki' in the list, and extract a language code from the XXX, then you get a mapping from language codes to language codes that covers all exceptions in Wikidata? This approach would give us: 'als': 'gsw', 'bat-smg': 'sgs', 'be_x_old': 'be-tarask', 'crh': 'crh-latn', 'fiu_vro': 'vro', 'no': 'nb', 'roa-rup': 'rup', 'zh-classical': 'lzh', 'zh-min-nan': 'nan', 'zh-yue': 'yue'.

Each of the values on the left here also occurs as a language tag in Wikidata, so if we map them, we use the same tag for things that Wikidata has distinct tags for. For example, Q27 has a label for yue but also for zh-yue [1]. It seems to be wrong to export both of these with the same language tag if Wikidata uses them for different purposes. Maybe this is a bug in Wikidata and we should just not export texts with any of the above codes at all (since they are always given by another tag directly)?

* MediaWiki's http://www.mediawiki.org/wiki/Manual:$wgDummyLanguageCodes provides some mappings. For example, it maps zh-yue to yue. Yet, Wikidata uses both of these codes. What does this mean?

Answers to Nemo's points inline: On 04/08/13 06:15, Federico Leva (Nemo) wrote: Markus Krötzsch, 03/08/2013 15:48: ... Apart from the above, doesn't wgLanguageCode in https://noc.wikimedia.org/conf/highlight.php?file=InitialiseSettings.php have what you need?

Interesting. However, the list there does not contain all 300 sites that we currently find in Wikidata dumps (and it contains some that we do not find there, including things like dkwiki, which seem to be outdated). The full list of sites we support is also found in the file I mentioned above, just after the language list (variable siteLanguageCodes).

Of course not all wikis are there; that configuration is needed only when the subdomain is wrong. It's still not clear to me what codes you are considering wrong.

Well, the obvious: if a language used in Wikidata labels or on Wikimedia sites has an official IANA code [2], then we should use this code. Every other code would be wrong. For languages that do not have any accurate code, we should probably use a private code, following the requirements of BCP 47 for private use subtags (in particular, they should have a single x somewhere). This does not seem to be done correctly by my current code. For example, we now map 'map_bmswiki' to 'map-bms'. While both 'map' and 'bms' are IANA language tags, I am not sure that their combination makes sense. The language should be Basa Banyumasan, but bms is for Bilma Kanuri (and it is a language code, not a dialect code). Note that map-bms does not occur in the file you linked to, so I guess there is some more work to do. Markus

[1] http://www.wikidata.org/wiki/Special:Export/Q27 [2] http://www.iana.org/assignments/language-subtag-registry/language-subtag-registry
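For reference, the exception list from the discussion above as a Python dict; this is transcribed from the thread, not copied from the wda source:

    # Site-derived language-code exceptions as listed above; keys are
    # codes as they appear in Wikidata, values are the codes extracted
    # from the corresponding 'XXXwiki' site entries.
    SITE_LANGUAGE_EXCEPTIONS = {
        'als': 'gsw',
        'bat-smg': 'sgs',
        'be_x_old': 'be-tarask',
        'crh': 'crh-latn',
        'fiu_vro': 'vro',
        'no': 'nb',
        'roa-rup': 'rup',
        'zh-classical': 'lzh',
        'zh-min-nan': 'nan',
        'zh-yue': 'yue',
    }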
[Wikidata-l] Wikidata language codes (Was: Wikidata RDF export available)
Small update: I went through the language list at https://github.com/mkroetzsch/wda/blob/master/includes/epTurtleFileWriter.py#L472 and added a number of TODOs to the most obvious problematic cases. Typical problems are:

* Malformed language codes ('tokipona')
* Correctly formed language codes without any official meaning (e.g., 'cbk-zam')
* Correctly formed codes with the wrong meaning (e.g., 'sr-ec': Serbian from Ecuador?!)
* Language codes with redundant information (e.g., 'kk-cyrl' should be the same as 'kk' according to IANA, but we have both)
* Use of macrolanguages instead of languages (e.g., zh is not Mandarin but just Chinese; I guess we mean Mandarin; less sure about Kurdish ...)
* Language codes with incomplete information (e.g., sr should be sr-Cyrl or sr-Latn, both of which already exist; the same for zh and zh-Hans/zh-Hant, but also for zh-HK [is this simplified or traditional?]).

I invite any language experts to look at the file and add comments/improvements. Some of the issues should possibly also be considered on the implementation side: we don't want two distinct codes for the same thing. Cheers, Markus
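A hedged sketch of the two kinds of fixes these problems suggest: canonicalising to existing IANA codes, and falling back to BCP 47 private-use subtags where no accurate code exists. The private-use spellings below are hypothetical, not an agreed convention:

    # Sketch only; entries and spellings are illustrative.
    CANONICAL = {
        'kk-cyrl': 'kk',        # redundant per IANA (see above)
        'zh-classical': 'lzh',  # Literary Chinese has a real code
    }
    PRIVATE_USE = {
        'tokipona': 'x-tokipona',  # malformed code, no IANA entry: wholly private tag
        'map-bms': 'map-x-bms',    # Basa Banyumasan: private subtag under 'map'
    }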
[Wikidata-l] Wikidata RDF export available
Hi, I am happy to report that an initial, yet fully functional RDF export for Wikidata is now available. The exports can be created using the wda-export-data.py script of the wda toolkit [1]. This script downloads recent Wikidata database dumps and processes them to create RDF/Turtle files. Various options are available to customize the output (e.g., to export statements but not references, or to export only texts in English and Wolof). The file creation takes a few (about three) hours on my machine, depending on what exactly is exported.

For your convenience, I have created some example exports based on yesterday's dumps. These can be found at [2]. There are three Turtle files: site links only, labels/descriptions/aliases only, statements only. The fourth file is a preliminary version of the Wikibase ontology that is used in the exports.

The export format is based on our earlier proposal [3], but it adds a lot of details that had not been specified there yet (namespaces, references, ID generation, compound datavalue encoding, etc.). Details might still change, of course. We might provide regular dumps at another location once the format is stable. As a side effect of these activities, the wda toolkit [1] is also getting more convenient to use. Creating code for exporting the data into other formats is quite easy.

Features and known limitations of the wda RDF export:

(1) All current Wikidata datatypes are supported. Commons-media data is correctly exported as URLs (not as strings).

(2) One-pass processing. Dumps are processed only once, even though this means that we may not know the types of all properties when we first need them: the script queries wikidata.org to find missing information. This is only relevant when exporting statements.

(3) Limited language support. The script uses Wikidata's internal language codes for string literals in RDF. In some cases, this might not be correct. It would be great if somebody could create a mapping from Wikidata language codes to BCP 47 language codes (let me know if you think you can do this, and I'll tell you where to put it).

(4) Limited site language support. To specify the language of linked wiki sites, the script extracts a language code from the URL of the site. Again, this might not be correct in all cases, and it would be great if somebody had a proper mapping from Wikipedias/Wikivoyages to language codes.

(5) Some data excluded. Data that cannot currently be edited is not exported, even if it is found in the dumps. Examples include statement ranks and timezones for time datavalues. I also currently exclude labels and descriptions for simple English, formal German, and informal Dutch, since these would pollute the label space for English, German, and Dutch without adding much benefit (other than possibly for simple English descriptions, I cannot see any case where these languages should ever have different Wikidata texts at all).

Feedback is welcome. Cheers, Markus

[1] https://github.com/mkroetzsch/wda Run python wda-export-data.py --help for usage instructions
[2] http://semanticweb.org/RDF/Wikidata/
[3] http://meta.wikimedia.org/wiki/Wikidata/Development/RDF

-- Markus Kroetzsch, Departmental Lecturer, Department of Computer Science, University of Oxford, Room 306, Parks Road, OX1 3QD Oxford, United Kingdom, +44 (0)1865 283529, http://korrekt.org/
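For readers who want to experiment with the dumps, a minimal sketch of consuming one of the Turtle files with the rdflib library; the file name is illustrative:

    # Load a Turtle export and print a few English labels.
    from rdflib import Graph
    from rdflib.namespace import RDFS

    g = Graph()
    g.parse("wikidata-labels.ttl", format="turtle")

    for subject, label in g.subject_objects(RDFS.label):
        if getattr(label, "language", None) == "en":
            print(subject, label)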
Re: [Wikidata-l] list for bug emails
On 14/04/12 15:38, Gerard Meijssen wrote: Hoi, The Wikidata project is probably the software used by OmegaWiki, the original Wikidata.

Ah, great, this completes the confusion :-D Cheers, Markus

On 14 April 2012 16:12, Jeroen De Dauw jeroended...@gmail.com wrote: Hey, Wikidata WikidataClient WikidataRepo Although the project is called WikiData, the software is called Wikibase. So we should have Wikibase and Wikibase Client for the extensions, and Wikidata for the project, although I'm not sure we really need the latter. Cheers -- Jeroen De Dauw http://www.bn2vs.com Don't panic. Don't be evil. --
Re: [Wikidata-l] Fwd: [Wiki-research-l] Wikidata opinion piece in The Atlantic
On 12/04/12 21:10, Daniel Kinzler wrote: This is an interesting criticism, and there's an excellent retort by Denny in the comments. Just fyi.

Thanks, a very good discussion and a very good answer by Denny. I should have a chat with Mark at some point to check out what he thinks about it (it is a bit ironic that we use international news sources to communicate when sitting in offices that are 500m away from each other ;-) Markus

Original Message: Subject: [Wiki-research-l] Wikidata opinion piece in The Atlantic Date: Tue, 10 Apr 2012 16:50:49 -0700 From: En Pine deyntest...@hotmail.com Reply-To: Research into Wikimedia content and communities wiki-researc...@lists.wikimedia.org To: wiki-researc...@lists.wikimedia.org

Here's an opinion piece, The Problem with Wikidata, by Mark Graham, who is a Research Fellow at the Oxford Internet Institute, which appears on The Atlantic's website. I'm not personally supporting or opposing his views but I found this to be an interesting read. http://www.theatlantic.com/technology/archive/2012/04/the-problem-with-wikidata/255564/ Pine
Re: [Wikidata-l] Spatial data definition
Hi Andreas, thanks for the input. I have drafted the current text about geo-related datatypes, but I am far from being an expert in this area. Our mapping expert in Wikidata is Katie (Aude), who has also been working with OpenStreetMap, but further expert input on this topic would be quite valuable. As in all areas, we need to find a balance between generality and usability, so I am slightly in favour of committing to one SRS for now (as I understand, the data can be converted easily between SRSs, but -- as opposed to other cases where people measure something -- most of the world seems to be happy with one of them). I have now included a link to this thread in an editorial remark in the data model, so we do not forget about this discussion when working out the details. Markus

On 04/04/12 14:16, Andreas Trawoeger wrote: Hi everybody! As the guy who has the honor of shortly receiving some funding from Wikimedia Germany for handling spatial open government data [0], I would like to make some remarks on the current geo definitions in the Wikidata model:

1. A Spatial Reference System Identifier (SRID [1]) definition is missing. Every GeoCoordinatesValue field should either have a corresponding SRID field that defines the used spatial reference system (SRS [2]) or mandate the use of a single SRS like WGS84 [3], which is currently the standard used by GPS, OpenStreetMap and Wikipedia.

2. Geographic shapes should be defined in either Well-known text (WKT [4]) or GeoJSON [5]. WKT is the de facto standard for storing spatial data in a relational database, and GeoJSON is the de facto standard for accessing geo data via the web. Both formats can easily be transformed into each other, so which one you choose pretty much depends on your preferred choice of SQL vs. NoSQL database.

So in summary I would propose the following data model for spatial data:

Geographic locations: Datatype IRI: http://wikidata.org/vocabulary/datatype_geocoords Value: GeoCoordinatesValue Mandatory spatial reference system: EPSG 4326 (WGS 84/GPS) Type: Decimal

Geographic objects: Datatype IRI: http://wikidata.org/vocabulary/datatype_geoobjects Value: GeoObjectsValue Type: GeoJSON [5]

Geographic objects SRID: Datatype IRI: http://wikidata.org/vocabulary/datatype_geoobjects_srid Value: GeoObjectsSridValue Type: EPSG Spatial Reference System Identifier (SRID [1])

That model would allow a structure where every spatial object can have a complex geometry stored in its original geodetic system and still have an easily manageable location in GPS format. cu andreas

[0] http://de.wikipedia.org/wiki/Wikipedia:Community-Projektbudget#2._kartenwerkstatt.at [1] https://en.wikipedia.org/wiki/Spatial_reference_system_identifier [2] https://en.wikipedia.org/wiki/Spatial_reference_system [3] https://en.wikipedia.org/wiki/WGS84 [4] https://en.wikipedia.org/wiki/Well-known_text [5] https://en.wikipedia.org/wiki/GeoJSON
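To make the proposal concrete, a hedged sketch of what such a record could look like; all field names and values are illustrative, not part of the Wikidata data model:

    # Hypothetical record following the proposal above: a simple WGS 84
    # location for display, plus an exact geometry as GeoJSON with an
    # explicit SRID.
    place = {
        "location": {"latitude": 48.8584, "longitude": 2.2945},  # EPSG 4326 (WGS 84)
        "geometry": {  # GeoJSON geometry object
            "type": "Polygon",
            "coordinates": [[
                [2.2935, 48.8578], [2.2955, 48.8578],
                [2.2955, 48.8590], [2.2935, 48.8590],
                [2.2935, 48.8578],  # closed ring
            ]],
        },
        "geometry_srid": 4326,
    }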
Re: [Wikidata-l] SNAK - assertion?
Martynas, what you are proposing below is not W3C-recommended RDF but an extension of triples to quads. As far as I know, this extension is not yet compatible with existing standards such as SPARQL and OWL. Named graphs work with SPARQL, but are mostly used in another way than you suggest. Most RDF database tools would be *very* unhappy to get millions of named graphs in combination with queries that use variables as graph names. The syntax you use is not a W3C standard either. This is not to say that N-Quads aren't a good idea if one can get them to work with the rest of the Semantic Web stack, but it really defeats your own arguments. We are committed to supporting *existing* standards (as we have said many times already), but we will not base our software design on a non-standard RDF variant that works with neither OWL nor SPARQL. Markus

On 06/04/12 13:09, Martynas Jusevicius wrote: Hey Denny, I gave it a shot:

<http://dbpedia.org/resource/France> <http://dbpedia.org/ontology/PopulatedPlace/populationDensity> "116"^^<http://dbpedia.org/datatype/inhabitantsPerSquareKilometre> <http://wikidata.org/graphs/France2012> .
<http://dbpedia.org/resource/France> <http://dbpedia.org/ontology/populationDensity> "116"^^<http://www.w3.org/2001/XMLSchema#double> <http://wikidata.org/graphs/France2012> .
<http://wikidata.org/graphs/France2012> <http://purl.org/dc/terms/date> "2012"^^<http://www.w3.org/2001/XMLSchema#year> <http://wikidata.org/graphs/France2012> .
<http://wikidata.org/graphs/France2012> <http://purl.org/dc/terms/source> _:source <http://wikidata.org/graphs/France2012> .
_:source <http://purl.org/dc/terms/published> "2010"^^<http://www.w3.org/2001/XMLSchema#year> <http://wikidata.org/graphs/France2012> .
_:source <http://purl.org/dc/terms/title> "Bilan demographique"@fr <http://wikidata.org/graphs/France2012> .

The syntax is N-Quads. It does not use reification, but instead named graphs for provenance. The necessary concepts were already present in DBpedia. As you might know, temporal provenance is not the strongest point of RDF. However, conventions and solutions are available, and I am sure implementing them would require far less effort than creating a custom data model from scratch, not to mention the benefits of potential reuse. There's quite some research done on RDF provenance, which is worth looking into if provenance is really a key feature for Wikidata from day one. I see it as something that should work transparently behind the scenes, and therefore could be rolled out later on. You would get much better and more extensive advice than mine on semantic-...@w3.org -- the only prerequisite is willingness to cooperate.

RDF's strength is that it solves data integration problems by pivotal conversion, reducing the number of model transformations from quadratic to linear: http://en.wikipedia.org/wiki/Data_conversion#Pivotal_conversion A custom data model brings up questions which already have an answer in the Semantic Web stack:

# can data from different Wikidata instances be merged or interlinked natively?
# is there a native query language? In case of SQL, how performant will it be given many JOINs and the planned use of provenance?
# what and how many custom serialization formats and API mechanisms will have to follow?

Stacking one custom solution on top of another can eventually result in huge costs. I honestly think the energy of Wikidata could be directed in a more productive way.
Martynas graphity.org

2012/4/5 Denny Vrandečić denny.vrande...@wikimedia.de: Dear Martynas, if you try to model the following statement in RDF: "The population density of France, as of a 2012 estimate, is 116 per square kilometer, according to the Bilan demographique 2010." you might notice that RDF requires a reification of the statement. The data model that you have seen provides us with an abstract and concise way to talk about these reifications (i.e. via the statement model, just as in RDF). We still have not finished the document describing how to map our data model to OWL/RDF, but we have thought about this the whole time while discussing the data model. But if you find a simpler and more RDFish way to express the above statement, please feel free to enlighten me. I would indeed be very interested. Cheers, Denny

2012/4/5 Martynas Jusevicius marty...@graphity.org: it doesn't look like reuse of existing concepts and standards is a priority for this project. One cannot build a Semantic Web application by ignoring its main building block, which is the RDF data model.

-- Project director Wikidata, Wikimedia Deutschland e.V. | Eisenacher Straße 2 | 10777 Berlin Tel. +49-30-219 158 26-0 | http://wikimedia.de Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e.V. Eingetragen im Vereinsregister des Amtsgerichts Berlin-Charlottenburg unter der Nummer 23855 B. Als gemeinnützig anerkannt durch das Finanzamt für Körperschaften I Berlin, Steuernummer 27/681/51985.
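For experimentation, the named-graph modelling from the quads above can be reproduced with rdflib's ConjunctiveGraph; a minimal sketch, using the hypothetical graph URI from the thread (not an endorsement of either position):

    # Build the France example as quads and serialize as N-Quads.
    from rdflib import ConjunctiveGraph, Literal, Namespace, URIRef
    from rdflib.namespace import XSD

    DBR = Namespace("http://dbpedia.org/resource/")
    DBO = Namespace("http://dbpedia.org/ontology/")

    g = ConjunctiveGraph()
    ctx = g.get_context(URIRef("http://wikidata.org/graphs/France2012"))
    ctx.add((DBR["France"], DBO["populationDensity"],
             Literal("116", datatype=XSD.double)))

    print(g.serialize(format="nquads"))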
Re: [Wikidata-l] Data_model: Metamodel: Wikipedialink
On 04/04/12 23:23, Gregor Hagedorn wrote: Wikidata can (and probably will) store information about each moon of Uranus, e.g., its mass.

It probably does not make sense to store the mass of 'Moons of Uranus' if there is such an article. It does not help to know that the article 'Moons of Uranus' also talks (among other things) about some moon that has a particular mass: you need to know what *exactly* you are talking about to exploit this data. An article on 'Moons of Uranus' could still (eventually) embed Wikidata data to improve its display, but this data must refer to individual moons, not to the article as a whole.

The problem I see is that you have no definition to which real object the data are tied. We agree that the problem is not the interwiki links per se. It is what results from it. How do we tie data to a Wikidata page when we don't know what it is about?

This is a hard question. The best answer I can come up with now (on the bus to Oxford) is as follows: the meaning of Wikidata items is subject to social agreement, based on shared experience, communication, and human-language documentation. The latter is provided in labels and descriptions, in Wikipedia articles that are connected to a Wikidata item, and also in Wikidata property pages that document properties.

I know that this may not be a satisfactory answer to your question of how we can *really* *know* what a Wikidata item is about. If you want to dig deeper into this issue, there is a lot of interesting literature, which can give you many more details than I can. What we are dealing with is the well-known philosophical problem of /grounding/. In essence, the state of discussion boils down to the following: there is no known way of connecting the symbols of a purely symbolic system (such as a computer program) to real-world objects in a formal way. Going deeper into the discussion reveals that there is also no agreed-upon way to clarify the meaning of 'real' and 'object' in the first place. In spite of all this, humans somehow manage to understand each other, which brings us to the point of how amazing they all are :-) Wikidata is but a humble technical tool that provides an environment for articulating and (I hope) improving this understanding in a novel way. This cannot provide a formal grounding, but it might come as close to this ideal as we have gotten yet. Regards, Markus

-- Dr. Markus Kroetzsch, Department of Computer Science, University of Oxford, Room 306, Parks Road, OX1 3QD Oxford, United Kingdom, +44 (0)1865 283529, http://korrekt.org/
Re: [Wikidata-l] Notability in Wikidata
In general, policies for notability in Wikidata will be governed by the community of (all) Wikidata editors. On the technical side, we aim to achieve two things:

* The system should be able to handle a lot of data.
* The interfaces and data access features should minimize the negative impact that additional (correct but not very important) data has on usage.

Of course, both goals have their limits, and there will always be good (technical or social) reasons to not include everything. We would rather like to support linking and data integration with external databases than suggest that *every* fact of the world be copied to Wikidata. Markus

On 31/03/12 20:22, emijrp wrote: Hi all; I'm thinking about notability in Wikidata and how it may conflict with Wikipedia's current policies and community conceptions. Will Wikidata allow creating entities for small villages, asteroids, galaxies, stars, species, etc., that are not allowed today at Wikipedia? Including those that don't have an article in any Wikipedia? I will be happy if so. Regards, emijrp

-- Dr. Markus Kroetzsch, Department of Computer Science, University of Oxford, Room 306, Parks Road, OX1 3QD Oxford, United Kingdom, +44 (0)1865 283529, http://korrekt.org/