Re: [Wikidata-l] WikiData Categories
On 04/07/14 14:49, Magnus Manske wrote:

On Fri, Jul 4, 2014 at 1:40 PM, Scott MacLeod <worlduniversityandsch...@gmail.com> wrote: Jane, Lydia and WikiDatans, these are great and helpful developments, which seem to be quite far along now. Jane and WikiDatans, can you point to similar helpful examples that would distinguish what one can extract from WikiData Categories with Magnus' Reasonator tool from what one can 'extract' with Semantic MediaWiki?

Can everyone please stop with the categories? Wikidata has items and properties; I assume you mean properties here. As for tools to get to the data:

* Reasonator [1] is for viewing a single item and seeing related items
* WDQ [2] is for machine-readable querying of Wikidata; basically, what SPARQL does on SMW
* Autolist [3] is for getting clickable results from WDQ, intersecting results with Wikipedia (!) categories, and semi-automated editing

Well, and of course some items are used as classes, which might be somewhat related to categories (in one of their many uses). For an overview of these, see http://tools.wmflabs.org/wikidata-exports/miga/?classes To find instances of a particular class, you can then use the tools Magnus already mentioned.

Cheers, Markus
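For readers who have not used WDQ before, here is a minimal sketch of calling it from Python. The endpoint, the query syntax (P31 = Q515 "city", with subclasses followed via tree[...]), and the shape of the JSON response are written down from memory and should be treated as assumptions; check the WDQ documentation before relying on them.

import json
import urllib.parse
import urllib.request

# WDQ query: all items whose P31 (instance of) points to Q515 (city)
# or to any transitive P279 (subclass of) descendant of it.
query = "claim[31:(tree[515][][279])]"
url = "https://wdq.wmflabs.org/api?q=" + urllib.parse.quote(query)

with urllib.request.urlopen(url) as response:
    result = json.loads(response.read().decode("utf-8"))

# WDQ returns plain numeric item ids, e.g. 64 for Q64 (Berlin).
print(result["items"][:10])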
Re: [Wikidata-l] Finding image URL from Commons image name
Brilliant, thanks for the useful and informative answers :-)

Markus

On 03/07/14 07:21, Legoktm wrote: And there's an API module for this too: https://commons.wikimedia.org/w/api.php?action=query&titles=File:Albert%20Einstein%20Head.jpg&prop=imageinfo&iiprop=url&format=jsonfm :) -- Legoktm

On 7/2/14, 1:59 PM, Liangent wrote: Also there are Special:FilePath and thumb.php. I'm not sure how this affects caching, though. http://commons.wikimedia.org/wiki/Special:FilePath/Example.svg http://commons.wikimedia.org/w/thumb.php?f=Example.svg&w=420 -Liangent

On Jul 3, 2014 4:50 AM, Emilio J. Rodríguez-Posada <emi...@gmail.com> wrote: Hello Markus; the URL of a Commons image is built like this: https://upload.wikimedia.org/wikipedia/commons/x/xy/File_name.ext where x and xy are the first character and the first two characters, respectively, of the md5sum of the file name (after replacing spaces with _). For a 200px thumb: https://upload.wikimedia.org/wikipedia/commons/thumb/x/xy/File_name.ext/200px-File_name.ext SVG files are a special case: .png is appended to the thumbnail name, giving .ext.png. For SVG files it does no harm to request big thumb sizes, but when the file is a JPG, don't try to generate a thumb bigger than the original file or you will get a beautiful error. Regards

2014-07-02 22:33 GMT+02:00 Markus Krötzsch <mar...@semantic-mediawiki.org>: Dear Wikidatarians, from Commons media properties I get the string name of a file on Commons. I can easily use it to build a link to the Commons page for that image. * But how do I get the raw image URL? * And can I also get the raw URL of a small-scale (thumbnail) image? I would like to beautify my Wikidata applications to show some images. I know this is more of a general MediaWiki question, but it is much more relevant in Wikidata, so I am posting it here first. I guess somebody has already solved this, since we have images in various Wikidata-based applications and gadgets. Thanks, Markus
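Putting Emilio's scheme into code, here is a minimal sketch in Python. The function name is made up for illustration, and proper percent-encoding of unusual characters in file names is left out; the hashing scheme itself is the one described above.

import hashlib

def commons_image_urls(file_name, thumb_width=200):
    """Return (raw URL, thumbnail URL) for a Commons file name."""
    name = file_name.replace(" ", "_")
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()
    base = "https://upload.wikimedia.org/wikipedia/commons"
    raw = "%s/%s/%s/%s" % (base, digest[0], digest[:2], name)
    # Thumbnails of SVG files are rendered as PNG, so ".png" is appended.
    thumb_file = name + (".png" if name.lower().endswith(".svg") else "")
    thumb = "%s/thumb/%s/%s/%s/%dpx-%s" % (
        base, digest[0], digest[:2], name, thumb_width, thumb_file)
    return raw, thumb

print(commons_image_urls("Albert Einstein Head.jpg"))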
Re: [Wikidata-l] Wikidata just got 10 times easier to use
On 02/07/14 16:29, David Cuenca wrote:

On Tue, Jul 1, 2014 at 11:07 PM, Markus Krötzsch <mar...@semantic-mediawiki.org> wrote: My hope is that with my other suggestion (using P31 values as features to correlate with), the property suggester will already be able to outperform my little toy algorithm anyway. One could also combine the two (my algorithm is really simple [1]), but maybe this is not needed.

Interesting. That could also help to identify values with a high deviation, and perhaps even do a better job than some template constraints. I was trying to check more classes, but the server seems to have trouble: Error: could not load file 'classes/Classes.csv' http://tools.wmflabs.org/wikidata-exports/miga/?classes#_cat=Classes/Id=Q2087181

Strange. It works for me. But we had some temporary service problems at WMF Labs recently, so maybe this was an aftermath of those. In any case, I should update the software -- Yaron has further improved Miga to lower the initial load times significantly. I'll send another email when I have new code/new data there.

Anyhow, many thanks for working on this.

My pleasure. :-)

Markus
Re: [Wikidata-l] Wikidata just got 10 times easier to use
On 01/07/14 21:47, Lydia Pintscher wrote:

On Tue, Jul 1, 2014 at 9:44 PM, Andy Mabbett <a...@pigsonthewing.org.uk> wrote: On 1 July 2014 20:20, Lydia Pintscher <lydia.pintsc...@wikimedia.de> wrote: We have just deployed the entity suggester. It helps you by suggesting properties: when you now add a new statement to an item, it will suggest what should most likely be added to that item. One example: you are on an item about a person, but it doesn't have a date of birth yet. Since a lot of other items about persons have a date of birth, it will suggest that you also add one to this item.

This is a great idea, but I've just tried it on Q4810979 (about an historic building) and it prompted me for a date of birth, gender, taxon rank or taxon name. Teething troubles?

We still need to tweak it a bit here and there, yeah. We're working on that right now. Also, it will get smarter as more statements are added to items.

I hope tweaking will suffice. At least it seems that there is already enough data to find slightly more related "related properties" ;-). Here is the list of properties that I get for the two classes of Q4810979 (recall that I compute related properties for each class).

(1) historic house museum http://tools.wmflabs.org/wikidata-exports/miga/?classes#_cat=Classes/Id=Q2087181 -- Related properties: English Heritage list number, OS grid reference, owned by, inspired by, coordinate location, visitors per year, Commons category, architect, mother house, manager/director, country, commissioned by, architectural style, MusicBrainz place ID, use, date of foundation or creation, street

(2) Grade I listed building http://tools.wmflabs.org/wikidata-exports/miga/?classes#_cat=Classes/Id=Q15700818 -- Related properties: English Heritage list number, masts, Minor Planet Center observatory code, home port, coordinate location, OS grid reference, mother house, architect, manager/director, Emporis ID, MusicBrainz place ID, country, architectural style, visitors per year, Commons category, Structurae ID (structure), officially opened by, floors above ground, inspired by, religious order, number of platforms, street, owned by, diocese

These are computed fully automatically from the data, with no manual filtering or user input. But don't get me wrong -- great work! It is brilliant to have such a thing integrated into the UI. In any case, my algorithm for computing the related properties is certainly very different from theirs; I am sure it also has its glitches.

Cheers, Markus
Re: [Wikidata-l] Wikidata just got 10 times easier to use
On 01/07/14 22:14, Markus Krötzsch wrote: [...] These are computed fully automatically from the data, with no manual filtering or user input. [...]

P.S. One weakness of my algorithm you can already see: it has trouble estimating the relevance of very rare properties, such as "Minor Planet Center observatory code" above. A single wrong annotation may then lead to wrong suggestions. Also, it seems from my list under (2) that some Grade I listed buildings are ships. This appears to be an error that is amplified by the fact that the property "masts" is used only 11 times in the dataset I evaluated (last week's data). I guess the new property suggester rather errs on the other side, being tricked into suggesting very frequent properties even in places that don't need them. -- Markus
Re: [Wikidata-l] Wikidata just got 10 times easier to use
On 01/07/14 22:43, Bene* wrote:

On 01.07.2014 22:23, Markus Krötzsch wrote: P.S. One weakness of my algorithm you can already see: [...]

However, it is obviously better if the algorithm performs well for frequently used properties. Isn't it possible to combine those two systems so they improve each other? One could check how often the property is used and then rely on Markus' or the students' algorithm.

My hope is that with my other suggestion (using P31 values as features to correlate with), the property suggester will already be able to outperform my little toy algorithm anyway. One could also combine the two (my algorithm is really simple [1]), but maybe this is not needed.

Cheers, Markus

[1] For each class C and property P, I count:
* #C: the number of items in class C
* #P: the number of items using property P
* #PC: the number of items in class C using property P
* #items: the total number of items

Then I compute two rates:
* rateCP = #PC / #C (the fraction of items in the class that have the property)
* rateP = #P / #items (the fraction of all items that have the property)

I then rank the properties for each class by the ratio rateCP/rateP (intuitively: by what factor does the rate of P increase for items in C?). Moreover, I apply two sigmoid functions [2] to the rates as additional factors, so as to make properties less relevant if they have very high or very low values for the rates. I don't care about things that almost everything/almost nothing has. Obviously, one can tweak this if one wants to include properties that almost everything has anyway.

[2] https://www.google.com/search?sclient=psy-ab&q=1+%2F+%281+%2B+exp%286+*+%28-2+*+x+%2B+0.5%29%29%29&btnG=
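Since the description in [1] is compact, here is a minimal sketch of the ranking in Python. The function names are made up, and the exact way the two sigmoid factors enter (one damping low rateCP, one damping high rateP) is my reading of the description above, not a copy of the actual implementation.

import math

def squash(x):
    # The sigmoid from [2]: close to 0 for small x, close to 1 for large x.
    return 1.0 / (1.0 + math.exp(6 * (-2 * x + 0.5)))

def rank_properties(n_c, n_p, n_pc, n_items):
    """Rank properties for one class C.

    n_c: #C, n_p: {P: #P}, n_pc: {P: #PC}, n_items: #items."""
    scores = {}
    for p, pc in n_pc.items():
        rate_cp = pc / n_c           # fraction of C-items using P
        rate_p = n_p[p] / n_items    # fraction of all items using P
        # Base score: by what factor is P more frequent within C than
        # overall? The two squashing factors suppress properties that
        # almost nothing in C has, or that almost everything has anyway.
        scores[p] = (rate_cp / rate_p) * squash(rate_cp) * squash(1 - rate_p)
    return sorted(scores, key=scores.get, reverse=True)

# Toy example with hypothetical counts (Python 3):
print(rank_properties(
    n_c=100, n_p={"P17": 900000, "P625": 400000},
    n_pc={"P17": 95, "P625": 80}, n_items=1000000))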
[Wikidata-l] Fwd: Babelfy: Word Sense Disambiguation and Entity Linking Together!
FYI: this project claims to use Wikidata (among other resources) for multilingual word-sense disambiguation. It is one of the first third-party uses of Wikidata that I am aware of (but other pointers are welcome if you have them). Wiktionary and OmegaWiki are also mentioned here. Cheers, Markus

-------- Original Message --------
Subject: Babelfy: Word Sense Disambiguation and Entity Linking Together!
Resent-Date: Mon, 16 Jun 2014 10:34:07 +0000
Resent-From: semantic-...@w3.org
Date: Mon, 16 Jun 2014 09:43:12 +0200
From: Andrea Moro <andrea8m...@gmail.com>
To: undisclosed-recipients:;

== Babelfy: Word Sense Disambiguation and Entity Linking together! http://babelfy.org ==

As an output of the MultiJEDI Starting Grant (http://multijedi.org), funded by the European Research Council and headed by Prof. Roberto Navigli, the Linguistic Computing Laboratory (http://lcl.uniroma1.it) of the Sapienza University of Rome is proud to announce the first release of Babelfy (http://babelfy.org).

Babelfy [1] is a joint, unified approach to Word Sense Disambiguation and Entity Linking for arbitrary languages. The approach is based on a loose identification of candidate meanings coupled with a densest-subgraph heuristic which selects high-coherence semantic interpretations. Its performance on both disambiguation and entity linking tasks is on a par with, or surpasses, that of task-specific state-of-the-art systems.

Babelfy draws primarily on BabelNet (http://babelnet.org), a very large encyclopedic dictionary and semantic network. BabelNet 2.5 covers 50 languages and provides both lexicographic and encyclopedic knowledge for all the open-class parts of speech, thanks to the seamless integration of WordNet, Wikipedia, Wiktionary, OmegaWiki, Wikidata and the Open Multilingual WordNet.

Features in Babelfy:
* 50 languages covered!
* Available via easy-to-use Java APIs.
* Disambiguation and entity linking are performed using BabelNet, thereby implicitly annotating according to several different inventories such as WordNet, Wikipedia, OmegaWiki, etc.

Babelfy the world (be there and get a free BabelNet t-shirt!):
* Monday, June 23 - ACL 2014 (Baltimore, MD, USA) - TACL paper presentation http://www.transacl.org/wp-content/uploads/2014/05/54.pdf
* Tuesday, August 19 - ECAI 2014 (Prague, Czech Republic) - Multilingual Semantic Processing with BabelNet http://www.ecai2014.org/tutorials/
* Sunday, August 24 - COLING 2014 (Dublin, Ireland) - Multilingual Word Sense Disambiguation and Entity Linking http://www.coling-2014.org/tutorials.php

[1] Andrea Moro, Alessandro Raganato, Roberto Navigli. Entity Linking meets Word Sense Disambiguation: a Unified Approach. Transactions of the Association for Computational Linguistics (TACL), 2, pp. 231-244 (2014). http://www.transacl.org/wp-content/uploads/2014/05/54.pdf
Re: [Wikidata-l] Wikidata RDF exports
Eric,

Two general remarks first:

(1) Protege is for small and medium ontologies, but not really for such large datasets. To get SPARQL support for the whole data, you could install Virtuoso. It also comes with a simple Web query UI. Virtuoso does not do much reasoning, but you can use SPARQL 1.1 transitive closure in queries (using * after properties), so you can find all subclasses there too. (You could also try this in Protege ...)

(2) If you want to explore the class hierarchy, you can also try our new class browser: http://tools.wmflabs.org/wikidata-exports/miga/?classes It has the whole class hierarchy, but without the leaves (= instances of classes + subclasses that have no own subclasses/instances). For example, it tells you that lepton has 5 direct subclasses, but shows only one: http://tools.wmflabs.org/wikidata-exports/miga/?classes#_item=3338 On the other hand, it includes relationships of classes and properties that are not part of the RDF (we extract this from the data by considering co-occurrence). Example: classes that have no superclasses but at least 10 instances, and which are often used with the property 'sex or gender': http://tools.wmflabs.org/wikidata-exports/miga/?classes#_cat=Classes/Direct%20superclasses=__null/Number%20of%20direct%20instances=10%20-%202/Related%20properties=sex%20or%20gender I have already added superclasses for some of those in Wikidata now -- the data in the browser is updated with some delay, based on dump files.

More answers below:

On 14/06/14 05:52, emw wrote: Markus, thank you very much for this. Translating Wikidata into the language of the Semantic Web is important. Being able to explore the Wikidata taxonomy [1] by doing SPARQL queries in Protege [2] (even primitive queries) is really neat, e.g.

SELECT ?subject WHERE { ?subject rdfs:subClassOf <http://www.wikidata.org/entity/Q82586> . }

This is more of an issue of my ignorance of Protege, but I notice that the above query returns only the direct subclasses of Q82586. The full set of subclasses for Q82586 (lepton) is visible at http://tools.wmflabs.org/wikidata-todo/tree.html?q=Q82586&rp=279&lang=en -- a few of the 2nd-level subclasses (muon neutrino, tau neutrino, electron neutrino) are shown there but not returned by that SPARQL query. It seems rdfs:subClassOf isn't being treated as a transitive property in Protege. Any ideas?

You need a reasoner to compute this properly. For a plain class hierarchy as in our case, ELK should be a good choice [1]. You can install the ELK Protege plugin and use it to classify the ontology [2]. Protege will then show the computed class hierarchy in the browser; I am not sure what happens to the SPARQL queries (it is quite possible that they don't use the reasoner).

[1] https://code.google.com/p/elk-reasoner/
[2] https://code.google.com/p/elk-reasoner/wiki/ElkProtege

Do you know when the taxonomy data in OWL will have labels available?

We had not thought of this as a use case. A challenge is that the label data is quite big because of the many languages. Should we maybe create an English label file for the classes? Descriptions too, or just labels?

Also, regarding the complete dumps, would it be possible to export a smaller subset of the faithful data? The files under Complete Data Dumps in http://tools.wmflabs.org/wikidata-exports/rdf/exports/20140526/ look too big to load into Protege on most personal computers, and would likely require adjusting JVM settings on higher-end computers to load.
If it's feasible to somehow prune those files -- and maybe even combine them into one file that could be easily loaded into Protege -- that would be especially nice.

What kind of pruning do you have in mind? You can of course take a subset of the data, but then some of the data will be missing.

A general remark on mixing and matching RDF files. We use N3 format, where every line in the ontology is self-contained (no multi-line constructs, no header, no namespaces). Therefore, any subset of the lines of any of our files is still a valid file. So if you want to have only a slice of the data (maybe to experiment with), then you could simply do something like:

gunzip -c wikidata-statements.nt.gz | head -10000 > partial-data.nt

head simply selects the first 10000 lines here. You could also use grep to select specific triples instead, such as:

zgrep "http://www.w3.org/2000/01/rdf-schema#label" wikidata-terms.nt.gz | grep '@en .' > en-labels.nt

This selects all English labels. I am using zgrep here for a change; you can also use gunzip as above. Similar methods can also be used to count things in the ontology (use grep -c to count lines = triples). Finally, you can combine multiple files into one by simply concatenating them in any order:

cat partial-data-1.nt > mydata.nt
cat partial-data-2.nt >> mydata.nt
...

Maybe you can experiment a bit and let us know if there is any export that would be particularly useful.
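If you would rather avoid Protege for the transitive query, the same subclass closure can be computed over one of these N-Triples slices with a SPARQL 1.1 property path. Here is a minimal sketch using Python and rdflib (any engine with SPARQL 1.1 support, such as Virtuoso, works the same way); the file name is just an example, and rdflib 4.x or later is assumed for property-path support.

from rdflib import Graph

g = Graph()
# A (possibly pruned) slice of the taxonomy export in N-Triples format.
g.parse("wikidata-taxonomy.nt", format="nt")

# "*" makes rdfs:subClassOf transitive, so indirect subclasses of
# Q82586 (lepton) are returned as well.
query = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?subject WHERE {
  ?subject rdfs:subClassOf* <http://www.wikidata.org/entity/Q82586> .
}
"""
for row in g.query(query):
    print(row.subject)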
Re: [Wikidata-l] Wikidata RDF exports
Hi Gerard,

On 13/06/14 11:08, Gerard Meijssen wrote: Hoi, when you leave out qualifiers, you will find that Ronald Reagan was never president of the United States and only an actor. Yes, omitting the statements with qualifiers is wrong, but as a consequence the total of the information is wrong as well. I do not see the point of this functionality. It is wrong any way I look at it. Without qualifiers the information is wrong. Without statements the information is wrong, and without the items involved the information is incomplete and wrong. As I see it you cannot win. Including this type of RDF export produces something that I fail to see serves any purpose -- or the purpose is just that you can.

Surely, Wikidata will never be complete. There will always be some statements missing. If we were to follow your reasoning, the data would therefore never be of any use. I think this is a bit drastic. Anyway, why argue? If you don't like the simplified exports, just use the full ones. We clearly say that simplified is not faithful, and we have detailed documentation about what is in each of the files. So it does not seem likely that people will be confused.

Best regards, Markus
Re: [Wikidata-l] Wikidata RDF exports
Hi Gerard,

As I said, I don't follow your arguments. Wikidata Query, for example, also started without any qualifiers at all, and yet it was a useful tool from the beginning. Your feedback is always welcome, but there is a point when critique is no longer constructive, and when it is best to agree to disagree. I think we have reached that point.

Markus

On 13/06/14 12:37, Gerard Meijssen wrote: Hoi, there is a huge difference between being complete and leaving out essential information. When you consider Ronald Reagan [1], it is essential information that he was a president of the USA and a governor of California. When you only make him an actor and a politician, the information you are left with gives the impression that he is more relevant as an actor. You brought attention to new functionality that is essentially broken. It does not give a fair impression of the Wikidata content. I have been arguing against overly referring to academic tools and standards. For me this announcement is yet another pointer that many of the tools are overrated and only have an academic relevance. Thanks, GerardM

[1] http://tools.wmflabs.org/reasonator/?q=9960

On 13 June 2014 11:41, Markus Krötzsch <mar...@semantic-mediawiki.org> wrote: [...]
Re: [Wikidata-l] Wikidata RDF exports
On 13/06/14 15:52, Bene* wrote: ... Did I understand you right, Markus, that you leave out all statements which have at least one qualifier? Wouldn't it make more sense to leave out the qualifiers only, but add the statements without qualifiers anyway? This would solve, e.g., Gerard's problem with Ronald Reagan.

But it would introduce other problems. Qualifiers are often used with time information, for example to record many historic population figures of one town. If you just leave out the qualifiers, you get many different population numbers that cannot be distinguished. Simply put:

* Leaving out statements makes the export incomplete (as Wikidata always is, just to a larger degree).
* Leaving out qualifiers makes the export incorrect (since it replaces statements by different statements that may or may not hold true).

We could do both and let the users choose what they find more acceptable (if any), but we started with the first approach. If someone says they need the second approach for their application to work, we could implement it, but I'd rather wait to see if anybody wants this.

Best, Markus
Re: [Wikidata-l] Wikidata RDF exports
Gerard,

You sometimes sound as if everything is lost just because somebody put an RDF file on the Web ;-) If you don't like the simplified export, why don't you just use our main export, which contains all the data? Can't we all be happy -- the people who want simple and the people who want complete?

Cheers, Markus
[Wikidata-l] New Wikidata classes browser (and updated property browser)
Hi all,

I have extended our new interactive property browser with a class browser and updated everything to the latest data. You can use these new services to answer questions like:

(1) What kinds of things do we actually have on Wikidata? (show all classes with more than 100 instances: humans, cities, galaxies, computer games, earthquakes, ... overall a very clear and readable list of our main subjects)
(2) Which properties are typically used on lighthouses (Q39715)? (or any other class)
(3) Which types of things have a patron saint (P417)? (or any other property)
(4) What are the most used string properties on Wikidata?
(5) Which properties are often used in qualifiers?
(6) Which properties are often used in statements that have qualifiers?
(7) Which properties are not used at all?
(8) What are the (direct and indirect) superclasses of ninja (Q9402)? (there are expected things like profession and soldier, but also thermodynamic process, which is probably not intended, although there might be some truth to it)
(9) What are the most used classes that do not have a superclass?
(10) What are the classes with the most subclasses? (some have almost 2000 subclasses!)

And many more ... I could play with it all day :-)

We now offer two datasets: one with properties and classes, and one with properties only. The one with classes takes longer to load (about 10 min on my machine) but is very fast after that. The other one is faster to load and can still answer questions like (4), (5), (6), (7). Here are the links:

* Classes+properties: http://tools.wmflabs.org/wikidata-exports/miga/?classes# (be patient, it will be fast once loaded)
* Properties only: http://tools.wmflabs.org/wikidata-exports/miga/

Each dataset has an about page with some example queries. To see additional related properties and classes, scroll down to the bottom of the pages of individual properties or classes.

Known limitations:
* We leave out classes without instances and subclasses, to make the dataset smaller (it is large enough as it is).
* Some classes are shown without labels. These are usually things that should not be classes anyway (filter them out by narrowing your search to things with more than 10 instances). In any case, this only happens to things that have no superclass. Maybe I will fix this in the future.

Feedback is welcome.

Cheers, Markus

On 11/06/14 14:36, Markus Krötzsch wrote: Hi all, we have prepared a new browser for Wikidata properties: http://tools.wmflabs.org/wikidata-exports/miga/ It is based on the Miga data browser [1]. This means it only works in Google Chrome/Chromium, Opera, Safari, and the Android Browser, but not in Internet Explorer, Firefox, and Rekonq. You can browse properties by datatype or usage numbers, and also find related properties for every property (using my own custom notion of relatedness based on relative co-occurrence and overall prevalence). When filtering by usage numbers, you sometimes see only very coarse filters, but you can always apply the same filter again to get more fine-grained steps. You can also edit the URL in the browser to modify filters if you cannot find the right one in the UI. This is still experimental. The data is based on the dump of 26 May. The data files for Miga were created using Wikidata Toolkit [2]; I will commit the specific code in due course. Feedback is welcome.
Cheers, Markus

[1] http://migadv.com/
[2] https://www.mediawiki.org/wiki/Wikidata_Toolkit
Re: [Wikidata-l] New Wikidata classes browser (and updated property browser)
[Including Yaron, the Miga developer, who is not on this list yet]

On 12/06/14 17:21, Thomas Douillard wrote: Hi Markus, first, thanks a lot for these tools. It would be cool to include a link to the property browser in some template ('Template:P', for example, as 'Template:Q' generates a link to Reasonator). Is there a way to get the database id of some property by its number?

The internal IDs of Miga are no good for linking, since they depend on the list of items and thus might change with updates. However, the following work:

http://tools.wmflabs.org/wikidata-exports/miga/#_cat=Properties/Id=P31
http://tools.wmflabs.org/wikidata-exports/miga/?classes#_cat=Classes/Id=Q39715

Etc. A nice feature of Miga is that all queries work, even if they are not clickable through the UI. If you look at the URL you get after some clicking, it is easy to see how to change it to get other results.

Cheers, Markus
Re: [Wikidata-l] Wikidata RDF exports
On 10/06/14 22:50, Gerard Meijssen wrote: Hoi, it is stated that there are no qualifiers included. In one of the articles you write that it is to be understood that the validity of the information is dependent on the existing qualifiers. What is the value of these RDF exports with the qualifiers missing?

Our normal exports include all the qualifiers and references. Our simplified exports include only those statements that don't have qualifiers. You are right that it would lead to wrong information to leave out qualifiers.

Cheers, Markus

On 10 June 2014 10:43, Markus Kroetzsch <markus.kroetz...@tu-dresden.de> wrote: Hi all, we are now offering regular RDF dumps for the content of Wikidata: http://tools.wmflabs.org/wikidata-exports/rdf/

RDF is the Resource Description Framework of the W3C that can be used to exchange data on the Web. The Wikidata RDF exports consist of several files that contain different parts and views of the data, and which can be used independently. Details on the available exports and the RDF encoding used in each can be found in the paper Introducing Wikidata to the Linked Data Web [1]. The available RDF exports can be found in the directory http://tools.wmflabs.org/wikidata-exports/rdf/exports/ New exports are generated regularly from current data dumps of Wikidata and will appear in this directory shortly afterwards. All dump files have been generated using Wikidata Toolkit [2].

There are some important differences in comparison to earlier dumps:
* Data is split into several dump files for convenience. Pick whatever you are most interested in.
* All dumps are generated using the OpenRDF library for Java (better quality than ad hoc serialization; much slower too ;-)
* All dumps are in N3 format, the simplest RDF serialization format that there is.
* In addition to the faithful dumps, some simplified dumps are also available (one statement = one triple; no qualifiers and references).
* Links to external data sets are added to the data for Wikidata properties that point to datasets with RDF exports. That's the Linked in Linked Open Data.

Suggestions for improvements and contributions on github are welcome.

Cheers, Markus

[1] http://korrekt.org/page/Introducing_Wikidata_to_the_Linked_Data_Web
[2] https://www.mediawiki.org/wiki/Wikidata_Toolkit

--
Markus Kroetzsch
Faculty of Computer Science
Technische Universität Dresden
+49 351 463 38486
http://korrekt.org/
Re: [Wikidata-l] Wikidata:List of properties/Summary table
On 11/06/14 17:13, Derric Atzrott wrote: You might also find the new property browser helpful: http://tools.wmflabs.org/wikidata-exports/miga/ (as mentioned before, it requires one of Google Chrome, Safari, Opera, or Android Browser to work).

While an excellent list and a neat tool, it sadly isn't organised in a way that fits my needs. I just needed a simple list, organised by type of object, that I could refer back to in order to make sure that I don't miss properties for which I do have data. I am pleased, though, that your tool gives the actual names of the properties. In some of the property proposal discussions on Wikidata, the property has not actually been given the exact same name as what was proposed, which can be quite confusing when you go to use it.

Yes, I know what you mean. I'd love to integrate property group information into our view as well, but I don't know where to get this information from (other than by scraping it from the wiki page, which does not seem right). Any pointers to where these groups are managed?

Regards, Markus
Re: [Wikidata-l] Wikidata Toolkit 0.2.0 released
On 11/06/14 19:52, Maximilian Klein wrote: Excellent work Markus. Your tools are helping me to debunk bad science the world over [1]. Keep up the great work.

Thanks :-)

Max

PS. By the way, if you do Stack Overflow you may want to chime in on this purpose-built question [2].

I lost my account when my OpenID provider ClaimID stopped providing OpenIDs ... if anybody knows how to recover these ids, drop me a line :-p

Markus

[1] https://medium.com/the-physics-arxiv-blog/wikipedia-mining-algorithm-reveals-the-most-influential-people-in-35-centuries-of-human-history-ede5ef827b76
[2] http://opendata.stackexchange.com/questions/107/when-will-the-wikidata-database-be-available-for-download/

Max Klein ‽ http://notconfusing.com/

On Tue, Jun 10, 2014 at 1:35 AM, Markus Krötzsch <mar...@semantic-mediawiki.org> wrote: [...]
Re: [Wikidata-l] Wikidata:List of properties/Summary table
On 11/06/14 20:49, Bene* wrote:

On 11.06.2014 17:27, Markus Krötzsch wrote: Yes, I know what you mean. I'd love to integrate property group information into our view as well, but I don't know where to get this information from (other than by scraping it from the wiki page, which does not seem right). Any pointers to where these groups are managed?

Currently, information about properties is only stored on their talk pages, but I think the Wikidata development team is working on claims for properties, so that we can store such information in a more structured way.

I know, but, for example, P303 is listed under "animal breeds" on https://www.wikidata.org/wiki/Wikidata:List_of_properties/Summary_table whereas I cannot find this information on https://www.wikidata.org/wiki/Property_talk:P303 I wonder if this information is really anywhere else but in the code of the bot maintainer ...

Regards, Markus

However, I am not sure how soon this will be released. Regards, Bene. PS: Does anyone know the tracking bug for this?
Re: [Wikidata-l] Wikidata query feature: status and plans
On 07/06/14 00:40, Joe Filceolaire wrote: Well, they can ask. As there is no real definition of what a city is and what the limits of each city are, I'm not sure they will get a useful answer. The population of the City of London (Q23311), for instance, is only 7,375! Should we change it from 'instance of: city' to 'instance of: village'?

Side remark: in the UK, city and town are special legal statuses of settlements. This terminology is what City of London refers to. There is a clear and crisp definition of what this means, but it is not what we mean by our class city in Wikidata. In particular, it has no direct relationship to size: the largest UK towns have over 100k inhabitants. The class city is used for "relatively large and permanent human settlement[s]" [1], which does not say much (because of the vagueness of "relatively").

Maybe we should even wonder if city is a good class to use in Wikidata. Saying that something has been awarded city status in the UK (Q1867820) has a clear meaning. Saying that something is a human settlement is also rather clear. But drawing the line between village, city and town is quite tricky, and will probably never be done uniformly across the data.

Conclusion: if you are looking for, say, human settlements with more than 100k inhabitants, then you should be searching for just that (which I think is basically what you are also saying below :-).

Markus

[1] https://en.wikipedia.org/wiki/City

Even a basic query like 'people born in the Czech Republic' has problems. Should it include people born in Czechoslovakia or the Austro-Hungarian provinces of Bohemia and Moravia? To exclude these, the query needs to check not just whether the 'place of birth' of an item is 'in the administrative entity: Czech Republic' today, but whether that was true on the 'date of birth' of each of those people. This isn't to say that such queries are not useful, just to point out that real-world data is tricky. The cool thing is that we are going to have the data in Wikidata to make it theoretically feasible to drill down and get answers to these tricky questions. Once the data is there, open-licensed for anyone to use, then it is just a matter of letting loose a thousand PhDs to devise clever ways to query it. If we build it they will come! At least that is my understanding.

Joe

On Fri, Jun 6, 2014 at 9:21 PM, Jeroen De Dauw <jeroended...@gmail.com> wrote: Hey Yury, we are indeed planning to use the Ask query language for Wikidata. People will be able to define queries on dedicated query pages that contain a query entity. These query entities will represent things such as "the cities with the highest population in Europe". People will then be able to access the results for those queries via the web API and be able to embed different views on them into wiki pages. These views will be much like SMW result formats, and we might indeed be able to share code between the two projects for that. This functionality is still some way off, though. We still need to do a lot of work, such as creating a nice visual query builder. To already get something out to the users, we plan to enable more simple queries via the web API in the near future.
Cheers

--
Jeroen De Dauw - http://www.bn2vs.com
Software craftsmanship advocate
Evil software architect at Wikimedia Germany
~=[,,_,,]:3
[Wikidata-l] Wikidata Toolkit 0.2.0 released
Dear all,

I am happy to announce the second release of Wikidata Toolkit [1], the Java library for programming with Wikidata and Wikibase. This release fixes bugs and improves features of the first release (download, parse, process Wikidata exports), and it adds new components for serializing JSON and RDF exports for Wikidata. A separate announcement regarding the RDF exports will be sent shortly.

Maven users can get the library directly from Maven Central (see [1]); this is the preferred method of installation. There is also an all-in-one JAR at github [2] and of course the sources [3].

Version 0.2.0 is still in alpha. For the next release, we will focus on the following tasks:
* Faster loading of Wikibase dumps + support for the new JSON format that will be used in the dumps soon
* Support for storing and querying data after loading it
* Initial steps towards storing data in a binary format after loading it

Feedback is welcome. Developers are also invited to contribute via github.

Cheers, Markus

[1] https://www.mediawiki.org/wiki/Wikidata_Toolkit
[2] https://github.com/Wikidata/Wikidata-Toolkit/releases (you'll also need to install the third-party dependencies manually when using this)
[3] https://github.com/Wikidata/Wikidata-Toolkit/
Re: [Wikidata-l] Wikidata query feature: status and plans
On 10/06/14 11:11, Luca Martinelli wrote: We could possibly use an ad hoc item "city of the United Kingdom", subclass of city and UK administrative division, couldn't we?

Sure, that's possible. Maybe it is even necessary. I had suggested linking to "city status in the UK" -- but there is no item "town status in the UK", so one would need helper items there as well. If we need new items in either case, the class-based modelling seems nicer, since it fits into the existing class hierarchy as you suggest.

Markus

L.

On 10 Jun 2014 10:21, Markus Krötzsch <mar...@semantic-mediawiki.org> wrote: [...]
Re: [Wikidata-l] What is the point of properties?
On 29/05/14 21:04, Andrew Gray wrote: One other issue to bear in mind: it's *simple* to have properties as a separate thing. I have been following this discussion with some interest but... well, I don't think I'm particularly stupid, but most of it is completely above my head. Saying "here are items, here is a set of properties you can define relating to them, here are some notes on how to use properties" is going to get a lot more people able to contribute than if they need to start by understanding theoretical aspects of semantic relationships...

Good point. The thread has really gone off in a rather philosophical direction :-) As Jane said, examples (of places where a property should be used *and* of places where it should not be used) are definitely much more useful for helping our editors on the ground. I usually use items I know as role models, or have a look for suitable showcase items.

Markus

On 28 May 2014 09:37, Daniel Kinzler <daniel.kinz...@wikimedia.de> wrote: Key differences between properties and items:
* Properties have a data type, items don't.
* Items have sitelinks, properties don't.
* Items have statements; properties will support claims (without sources).

The software needs these constraints/guarantees to be able to take shortcuts, provide specialized UI and API functionality, etc. Yes, it would be possible to use items as properties instead of having a separate entity type. But they are structurally and functionally different, so it makes sense to have a strict separation. This makes a lot of things easier, e.g.:
* setting different permissions for properties
* mapping to RDF vocabularies

More fundamentally, they are semantically different: an item describes a concept in the real world, while a property is a structural component used for such a description. Yes, properties are similar to data items, and in some cases there may be an item representing the same concept that is represented by a property entity. I don't see why that is a problem, while I can see a lot of confusion arising from mixing them. -- daniel

On 28.05.2014 09:25, David Cuenca wrote: Since the very beginning I have kept myself busy with properties, thinking about which ones fit, which ones are missing to better describe reality, and how to integrate them into the ones that we have. The thing is that the more I work with them, the less difference I see from normal items, and if statements are soon allowed on property pages, the difference will blur even more. I can understand that from the software development point of view it might make sense to have a clear difference. Or for the community, to get a deeper understanding of the underlying concepts represented by words. But semantically I see no difference between:

cement (Q45190) emissivity (P1295) 0.54
and
cement (Q45190) emissivity (Q899670) 0.54

Am I missing something here? Are properties really needed, or are we adding unnecessary artificial constraints?

Cheers, Micru

--
Daniel Kinzler
Senior Software Developer
Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.
Re: [Wikidata-l] What is the point of properties?
On 29/05/14 12:41, Thomas Douillard wrote: @David: I think you should have a look at fuzzy logic (https://www.wikidata.org/wiki/Q224821) :)

Or at probabilistic logic, possibilistic logic, epistemic logic, ... it's endless. Let's first complete the data we are sure of before we start to discuss whether Pluto is a planet with fuzzy degree 0.6 or 0.7 ;-)

(The problem with quantitative logics is that there is usually no reference for the numbers you need there, so they are not well suited for a secondary data collection like Wikidata that relies on other sources. The closest concept that might still work is probabilistic logic, since you can really get some probabilities from published data; but even there it is hard to use the probability as a raw value without specifying very clearly what the experiment looked like.)

Markus

2014-05-29 1:48 GMT+02:00 David Cuenca <dacu...@gmail.com>: Markus,

On Thu, May 29, 2014 at 12:53 AM, Markus Krötzsch <mar...@semantic-mediawiki.org> wrote: This is an easy question once you have been clear about what human behaviour is. According to enwiki, it is a range of behaviours *exhibited by* humans.

Settled :) Let's leave it at "defined as a trait of".

What would anybody do with this data? In what application could it be of interest?

Well, our goal is to gather the whole of human knowledge, not to use it. I can think of several applications, but let's leave that open. Never underestimate human creativity ;-)

Moreover, as a great Icelandic ontologist once said: "There is definitely, definitely, definitely no logic, to human behaviour" ;-)

Definitely, that is why we spend so much time in front of flickering squares making them flicker even more. It makes total sense :P

I think constraints are already understood in this way. The name comes from databases, where a constraint violation is indeed a rather hard error. On the other hand, ironically, constraints (as a technical term) are often considered a softer form of modelling than (onto)logical axioms: a constraint can be violated, while a logical axiom (as the name suggests) is always true -- if it is not backed by the given data, new data will be inferred. So as a technical term, "constraint" is quite appropriate for the mechanism we have, although it may not be the best term to clarify the intention.

Ok, I will not fight traditional labels or conventions. I was interested in pointing out the inappropriateness of using a word inside our community with a definition that doesn't match its use, when there is another word that matches perfectly and conveys its meaning better to users.

Some important ideas like classification (instance of/subclass of) belong completely to the analytical realm. We don't observe classes, we define them. A planet is what we call a planet, and this can change even if the actual lumps in space stay pretty much the same.

Agreed. Better labels could be "defined as instance of"/"defined as subclass of".

Now inferences are slightly different. If we know that X implies Y, then if A says X we can infer that (implicitly) A says Y. That is a logical relationship (or rule) on the level of what is claimed, rather than on the level of statements. Note that we still need to have a way to find out that X implies Y, which is a content-level claim that should have its own reference somewhere. We mainly use inference in this sense with "subclass of", in Reasonator or when checking constraints.
In this case, the implications are encoded as subclass-of statements ("If X is a piano, then X is an instrument"). This allows us to have references on the implications.

Nope, nope, nope. I was not referring to hard implications, but to heuristic ones. Consider that these properties in the item namespace:

defined as a trait of
defined as having
defined as instance of

would translate as these constraints in the property namespace:

likely to be a trait of
likely to have
likely to be an instance of

In general, an interesting question here is what the status of "subclass of" really is. Do we gather this information from external sources (surely there must be a book that tells us that pianos are instruments), or do we as a community define this for Wikidata (surely, the overall hierarchy we get is hardly the universal class hierarchy of the world, but a very specific classification that is different from other classifications that may exist elsewhere)? Best not to think about it too much, and to gather sources whenever possible.
Re: [Wikidata-l] What is the point of properties?
On 29/05/14 13:53, Thomas Douillard wrote: hehe, maybe some kind of inferences can lead to a good heuristic to suggest properties and values in the entity suggester. As they naturally become softer and softer by combination of uncertainties, this could also provide some kind of limits for inferences by fixing a probability below which we don't add a fuzzy fact to the set of facts. Maybe we could fix a heuristic starting fuzziness or probability score based on 1 sourced claim - big score; one disputed claim; based on ranks and so on. Sorry, I have to expand on this a bit ... My main point was that there are many fuzzy logics (depending on the t-norm you choose) and many probabilistic logics (depending on the stochastic assumptions you make). The meaning of a score crucially depends on which logic you are in. Moreover, at least in fuzzy logic, the scores are only relevant in comparison to other scores (there is no absolute meaning to 0.3) -- therefore you need to ensure that the scores are assigned in a globally consistent way (0.3 in Wikidata would have to mean exactly the same wherever it is used). This makes it extremely hard to implement such an approach in practice in a large, distributed knowledge base like ours. What's more, you cannot find these scores in books or newspapers, so you somehow have to make them up in another way. You suggested using this for statements that are not generally accepted, but how do you measure how disputed a statement is? If two thirds of references are for it and the rest is against it, do you assign 0.66 as a score? It's very tricky. Fuzzy logic has its main use in fuzzy control (the famous washing machine example), which is completely different and largely unrelated to fuzzy knowledge representation. In knowledge representation, fuzzy approaches are also studied, but their application is usually in a closed system (e.g., if you have one system that extracts data from a text and assigns certainties to all extracted facts in the same way). It's still unclear how to choose the right logic, but at least it will give you a uniform treatment of your data according to some fixed principles (whether they make sense or not). The situation is much clearer in probabilistic logics, where you define your assumptions first (e.g., you assume that events are independent or that dependencies are captured in some specific way). This makes it more rigorous, but also harder to apply, since in practice these assumptions rarely hold. This is somewhat tolerable if you have a rather uniform data set (e.g., a lot of sensor measurements that give you some probability for actual states of the underlying system). But if you have a huge, open, cross-domain system like Wikidata, it would be almost impossible to force it into a particular probability framework where 0.3 really means in 30% of all cases. Also note that scientific probability is always a limit of observed frequencies. It says: if you do something again and again, this is the rate you will get. Often-heard statements like We have an 80% chance to succeed! or Chances are almost zero that the Earth will blow up tomorrow! are scientifically pointless, since you cannot repeat the experiments that they claim to make statements about. 
Many things we have in Wikidata are much more on the level of such general statements than on the level that you normally use probability for (good example of a proper use of probability: based on the tests that we did so far, this patient has a 35% chance of having cancer -- these are not the things we normally have in Wikidata). Markus 2014-05-29 13:43 GMT+02:00 Markus Krötzsch mar...@semantic-mediawiki.org mailto:mar...@semantic-mediawiki.org: On 29/05/14 12:41, Thomas Douillard wrote: @David: I think you should have a look at fuzzy logic https://www.wikidata.org/wiki/Q224821 :) Or at probabilistic logic, possibilistic logic, epistemic logic, ... it's endless. Let's first complete the data we are sure of before we start to discuss whether Pluto is a planet with fuzzy degree 0.6 or 0.7 ;-) (The problem with quantitative logics is that there is usually no reference for the numbers you need there, so they are not well suited for a secondary data collection like Wikidata that relies on other sources. The closest concept that still might work is probabilistic logic, since you can really get some probabilities from published data; but even there it is hard to use the probability as a raw value without specifying very clearly what the experiment looked like.) Markus ___ Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l
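To see why the choice of t-norm matters: in a fuzzy logic, the degree of a conjunction is computed by a t-norm, and the three standard textbook choices already disagree about what a score means (a worked aside; these definitions are standard, not taken from the thread):

\[ x \otimes_{\mathrm{G}} y = \min(x, y), \qquad x \otimes_{\mathrm{P}} y = x \cdot y, \qquad x \otimes_{\mathrm{L}} y = \max(0,\ x + y - 1) \]

(the Gödel, product, and Łukasiewicz t-norms, respectively). Combining two claims that each hold with degree 0.7 yields 0.7, 0.49, or 0.4 depending on the choice, so a stored score such as 0.7 carries no meaning unless the whole knowledge base commits to one logic -- which is exactly the global-consistency problem described above.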
[Wikidata-l] What data should be in Wikidata? (Was: What is the point of properties?)
David, I need to answer your first assertion separately: On 29/05/14 01:48, David Cuenca wrote: Well, our goal is to gather the whole human knowledge, not to use it. No, that is really not the case. Our goal is to gather carefully selected parts of the human knowledge. Our community defines what these parts are. Just like in Wikipedia. Even if you wanted to gather all human knowledge this goal would not be a useful principle for deciding what to do first. For example, we know that every natural number is an element of the natural numbers. It is obviously not our goal to gather these infinitely many statements (if you disagree, you could try to propose a bot that starts to import this data ;-). Therefore, it is clear that gathering *all* knowledge is not even an abstract ideal of our community. Quite the contrary: we explicitly don't want it. The natural numbers are just an extreme example. Many other cases exist (for instance, we do not import all free databases into Wikidata, although they are finite). The question then is: How do we know what data we want and what data we don't want? What principles do we base our decision on? For me, there are two main principles: * practical utility (does it serve a purpose that we care about?) * simplicity and clarity (is it natural to express and easy to understand?) You said that we cannot foresee *all* applications, but that does not mean that we should start to create data for which we cannot foresee *any*. There is just too much data of the latter kind, and we need to make a choice. Don't get me wrong: I consider myself an inclusionist. Better to have some useless data than to miss some important content. But there is no neutral ground here -- we all must draw a line somewhere (or start writing the natural number import bot ;-). My position is: if we have data that is very hard to capture and at the same time has no conceivable use, then we should not spend our energy on it while there is so much clearly defined, important data that we are still missing. Markus ___ Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l
Re: [Wikidata-l] What is the point of properties?
The other answers, under the original subject: On 29/05/14 01:48, David Cuenca wrote: Settled :) Let's leave it at defined as a trait of I don't think it is very clear what the intention of this property is. What are the limits of its use? What is it meant to do? Can behaviour really be a trait of a species? If we allow it here, it seems to apply to all kinds of connections: density/car? eternity/time? time/reality? evil/devil? rigour/science? -- this is opening a can of worms. It will be hard to maintain this. Wikiuser13 recently added consists of: Neptune to Q1. It was fixed. But it is a good example of the kind of confusion that comes from such general ontological (in the philosophical sense) properties. And consists of is still very simple compared to defined as a trait of. Can't we focus on more obvious things like has social network account for a while? ;-) ... Some important ideas like classification (instance of/subclass of) belong completely to the analytical realm. We don't observe classes, we define them. A planet is what we call a planet, and this can change even if the actual lumps in space are pretty much the same. Agreed. Better labels could be defined as instance of/defined as subclass of I don't think this is better. The short names are fine. As I explained in my email, Wikidata statements are mainly about what the external references say. The distinction between defined and observed is not on the surface of this. The main question is Did the reference say that pianos are instruments? but not Did the reference say pianos are instruments because of the definition of 'piano'? Therefore, we don't need to put this information in our labels. Now inferences are slightly different. If we know that X implies Y, then if A says X we can infer that (implicitly) A says Y. That is a logical relationship (or rule) on the level of what is claimed, rather than on the level of statements. Note that we still need to have a way to find out that X implies Y, which is a content-level claim that should have its own reference somewhere. We mainly use inference in this sense with subclass of in reasonator or when checking constraints. In this case, the implications are encoded as subclass-of statements (If X is a piano, then X is an instrument). This allows us to have references on the implications. Nope, nope, nope. I was not referring to hard implications, but to heuristic ones. Consider that these properties in the item namespace: defined as a trait of, defined as having, defined as instance of would translate as these constraints in the property namespace: likely to be a trait of, likely to have, likely to be an instance of. I think you might have misunderstood my email. I was arguing *in favour* of soft constraints, but in the paragraph before the one about inferences that you reply to here. Inferences are hard ways for obtaining new knowledge from our own definitions. Example: If X is the father of Y according to reference A Then Y is the child of X according to reference A This is as hard as it can get. We are absolutely sure of this since this rule just explains the relationship between two different ways we have for encoding family relationships. Below, you said expectations inferred from definitions should not be treated as hard constraints -- maybe this mixture of terms indicates that I have not been clear enough about the distinction between inference and constraint. They are really completely different ways of looking at things. 
Inferences are something that adds (inevitable) conclusions to your knowledge, while constraints just tell you what to check for. If you accept the premises of an inference and the inference rule, then you must also accept the conclusion -- there is no soft way of reading this. To make it soft, you can start to formalise softness in your knowledge, using fuzzy logic or whatnot (see my other email with Thomas). I don't think we can use soft inferences (in the sense of fuzzy logic et al.) but I am in favour of soft constraints (in the sense of your expectations). I guess we agree on all of this, but have a bit of trouble in making ourselves clear :-) But it is rather subtle material after all. In general, an interesting question here is what the status of subclass of really is. Do we gather this information from external sources (surely there must be a book that tells us that pianos are instruments) or do we as a community define this for Wikidata (surely, the overall hierarchy we get is hardly the universal class hierarchy of the world but a very specific classification that is different from other classifications that may exist elsewhere)? Best not to think about it too much and to gather sources whenever we have them ;-) I think it is good to think about it and to consider options to deal with it. Like for instance:
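A minimal sketch of how mechanical the father/child rule above is, applied to simple (subject, property, value, reference) tuples -- illustrative code only, with made-up names, not Wikibase or Wikidata Toolkit code:

import java.util.ArrayList;
import java.util.List;

public class HardInferenceExample {

    // A claim plus the reference that backs it, read Wikidata-style:
    // (s, "father", v) means that v is the father of s.
    record Claim(String subject, String property, String value, String reference) {}

    public static void main(String[] args) {
        List<Claim> claims = List.of(new Claim("Q3", "father", "Q2", "refA"));

        // Hard rule: if X is the father of Y according to reference A,
        // then Y is the child of X according to reference A.
        List<Claim> derived = new ArrayList<>();
        for (Claim c : claims) {
            if (c.property().equals("father")) {
                derived.add(new Claim(c.value(), "child", c.subject(), c.reference()));
            }
        }

        // Prints: Claim[subject=Q2, property=child, value=Q3, reference=refA]
        derived.forEach(System.out::println);
    }
}

The rule fires unconditionally and carries the reference along unchanged; there is no score or threshold anywhere, which is what distinguishes such inferences from the soft constraints discussed in this thread.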
Re: [Wikidata-l] What is the point of properties?
Hi David, Interesting remark. Let's explore this idea a bit. I will give you two main reasons why we have properties separate, one practical and one conceptual. First the practical point. Certainly, everything that is used as a property needs to have a datatype, since otherwise the wiki would not know what kind of input UI to show. So you cannot use just any item as a property straight away -- it needs to have a datatype first. So, yes, you could abolish the namespace Property but you still would have a clear, crisp distinction between property items (those with datatype) and normal items (those without a datatype). Because of this, most of the other functions would work the same as before (for example, property autocompletion would still only show properties, not arbitrary items). A complication with this approach is that property datatypes cannot change in Wikibase. This design was picked since there is no way to convert existing data from one datatype to another in general. So changing the datatype would create problems by making a lot of data invalid, and require special handling and special UI to handle this situation. With properties living in a separate namespace, this is not a real restriction: you can just create a new property and give it the same label (after naming the old one differently, e.g., putting DEPRECATED in its name). Then you can migrate the data in some custom fashion. But if properties were items, we would have a problem here: the item is already linked to many Wikipedias and other projects, and it might be used in LUA scripts, queries, or even external applications like Denny's Javascript translation library. You cannot change item ids easily. Also, many items would not have a datatype, so the first datatype that is (accidentally?) entered would be fixed. So we would definitely need to rethink the whole idea of unchangeable datatypes. My other important reason is conceptual. Properties are not considered part of the (encyclopaedic) data but rather part of the schema that the community has picked to organise that data. As in your example, emissivity (Q899670) is a notion in physics as described in a Wikipedia article. There are many things to say about this notion (for example, it has a history: somebody must have defined this first -- although Wikipedia does not say it in this case). As in all cases, some statements might be disputed while others are widely acknowledged to be true. For the property emissivity (P1295), the situation is quite different. It was introduced as an element used to enter data, similar to a row in a database table or an infobox template in some Wikipedia. It does probably closely relate to the actual physical notion Q899670, but it still is a different thing. For example, it was first introduced by User:Jakec, who is probably not the person who introduced the physical concept ;-) Anything that we will say about P1295 in the future refers to the property -- a concept of our own making, that is not described in any external source (there are no publications discussing P1295). This is also the reason why properties are supposed to support *claims* not *statements*. That is, they will have property-value pairs and qualifiers, but no references or ranks. Indeed, anything we say about properties has the status of a definition. If we say it, it's true. There is no other authority on Wikidata properties. 
You could of course still have items and properties share a page and somehow define which statements/claims refer to which concept, but this does not seem to make things easier for users. These are, for me, the two main reasons why it makes sense to keep properties apart from items on a technical level. Besides this, it is also convenient to separate the 1000-something properties from the 15-million-something items for reasons of maintenance. Best regards, Markus On 28/05/14 09:25, David Cuenca wrote: Since the very beginning I have kept myself busy with properties, thinking about which ones fit, which ones are missing to better describe reality, how to integrate them into the ones that we have. The thing is that the more I work with them, the less difference I see with normal items, and if soon there will be statements allowed in property pages, the difference will blur even more. I can understand that from the software development point of view it might make sense to have a clear difference. Or for the community to get a deeper understanding of the underlying concepts represented by words. But semantically I see no difference between: cement (Q45190) emissivity (P1295) 0.54 and cement (Q45190) emissivity (Q899670) 0.54 Am I missing something here? Are properties really needed or are we adding unnecessary artificial constraints? Cheers, Micru ___ Wikidata-l mailing list Wikidata-l@lists.wikimedia.org
Re: [Wikidata-l] What is the point of properties?
On 28/05/14 10:37, Daniel Kinzler wrote: Key differences between Properties and Items: * Properties have a data type, items don't. * Items have sitelinks, Properties don't. * Items have Statements, Properties will support Claims (without sources). The software needs these constraints/guarantees to be able to take shortcuts, provide specialized UI and API functionality, etc. Yes, it would be possible to use items as properties instead of having a separate entity type. But they are structurally and functionally different, so it makes sense to have a strict separation. This makes a lot of things easier, e.g.: * setting different permissions for properties * mapping to rdf vocabularies This one point requires a tiny remark: there is no problem in OWL or RDF with using the same URI as a property, an individual, and a class in different contexts. The only thing that OWL (DL) forbids is to use one property for literal values (like string) and for object values (like other items), but this would not occur in our case anyway since we have clearly defined types. I completely agree with all the rest :-) Cheers, Markus More fundamentally, they are semantically different: an item describes a concept in the real world, while a property is a structural component used for such a description. Yes, properties are similar to data items, and in some cases, there may be an item representing the same concept that is represented by a property entity. I don't see why that is a problem, while I can see a lot of confusion arising from mixing them. -- daniel On 28.05.2014 09:25, David Cuenca wrote: Since the very beginning I have kept myself busy with properties, thinking about which ones fit, which ones are missing to better describe reality, how to integrate them into the ones that we have. The thing is that the more I work with them, the less difference I see with normal items, and if soon there will be statements allowed in property pages, the difference will blur even more. I can understand that from the software development point of view it might make sense to have a clear difference. Or for the community to get a deeper understanding of the underlying concepts represented by words. But semantically I see no difference between: cement (Q45190) emissivity (P1295) 0.54 and cement (Q45190) emissivity (Q899670) 0.54 Am I missing something here? Are properties really needed or are we adding unnecessary artificial constraints? Cheers, Micru ___ Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l ___ Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l
Re: [Wikidata-l] What is the point of properties?
David, Regarding the question of how to classify properties and how to relate them to items: * same as (in the sense of owl:sameAs) is not the right concept here. In fact, it has often been discouraged to use this on the Web, since it has very strong implications: it means that in all uses of the one identifier, one could just as well use the other identifier, and that it is indistinguishable whether something has been said about the one or the other. That seems too strong here, at least for most cases. * In the world of OWL DL, sameAs specifically refers to individuals, not to classes or properties. Saying P sameAs Q does not imply that P and Q have the same extension as properties. For the latter, OWL has the relationship owl:equivalentProperty. This distinction of instance level and schema level is similar to the distinction we have between instance of and subclass of. * Therefore, I would suggest using a property called subproperty of as one way of relating properties (analogously to subclass of). It has to be checked whether this actually occurs in Wikidata (do we have any properties that would be in this relation, or do we make it a modelling principle to have only the most specific properties in Wikidata?). * The relationship from properties to items could be modelled with the existing property subject of (P805). * It might be useful to also have a taxonomic classification of properties. For example, we already group properties into properties for people, organisations, etc. Such information could also be added with a specific property (this would be a bit more like a category system on property pages). On the other hand, some of this might coincide with constraint information that could be expressed as claims. For instance, person properties might be those with Type (i.e., rdfs:domain) constraint human. By the way, our constraint system could use some systematisation -- there are many overlaps in what you can do with one constraint or another. Cheers, Markus On 28/05/14 12:14, David Cuenca wrote: Markus, The explanation about the implications of renaming/deleting makes the most sense, and just that already justifies the separation in two. It is equally true that when we create a property, we might have cleaned the original concept so much that it might differ (even slightly) from the understood concept that the item represents. However, even after that process, the new concept is still an item... The process of imbuing a concept with permanent characteristics (adding a datatype) and the practical approach also seem to recommend keeping items and properties separate. Thanks for showing me that reasoning :) I am still wondering about how we are going to classify properties. Maybe it will require a broader discussion, but if they are the same (or mostly the same) as items, then we can just link them as same as, and build the classing structure just for the items. OTOH, if they are different, then we will need to mirror that classification for properties, which seems quite redundant. Plus adding a new datatype, property. All in all, my conclusion about this is that properties are just concepts with special qualities that justify the separation in the software (even if in real life there is no separation). many thanks for your detailed answer, and sorry if I'm bringing up already discussed topics. 
It is just that when you stare long into wikidata, wikidata stares back into you ;) Cheers, Micru On Wed, May 28, 2014 at 11:39 AM, Markus Krötzsch mar...@semantic-mediawiki.org mailto:mar...@semantic-mediawiki.org wrote: Hi David, Interesting remark. Let's explore this idea a bit. I will give you two main reasons why we have properties separate, one practical and one conceptual. First the practical point. Certainly, everything that is used as a property needs to have a datatype, since otherwise the wiki would not know what kind of input UI to show. So you cannot use just any item as a property straight away -- it needs to have a datatype first. So, yes, you could abolish the namespace Property but you still would have a clear, crisp distinction between property items (those with datatype) and normal items (those without a datatype). Because of this, most of the other functions would work the same as before (for example, property autocompletion would still only show properties, not arbitrary items). A complication with this approach is that property datatypes cannot change in Wikibase. This design was picked since there is no way to convert existing data from one datatype to another in general. So changing the datatype would create problems by making a lot of data invalid, and require special handling and special UI to handle this situation. With properties living in a separate namespace, this is not a real restriction: you can just create a new property and give it the same label (after naming the old one differently, e.g., putting DEPRECATED in its name).
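The semantic distinctions drawn in the message above can be spelled out in standard RDFS/OWL terms (a recap in notation, nothing Wikidata-specific): a subproperty is a one-directional inclusion of extensions, equivalence is inclusion in both directions, and owl:sameAs between two property IRIs only identifies them as individuals, which in OWL DL says nothing by itself about their extensions:

\[ P \sqsubseteq Q \;\text{ means }\; \forall x, y\, (P(x,y) \rightarrow Q(x,y)), \qquad P \equiv Q \;\text{ means }\; P \sqsubseteq Q \text{ and } Q \sqsubseteq P \]

This is why a subproperty of property would mirror subclass of on the schema level, while same as links are best reserved for the item level.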
Re: [Wikidata-l] Using external vocabularies (like RDA) in WikiData ?
On 28/05/14 15:56, Daniel Kinzler wrote: On 28.05.2014 15:05, Jean-Baptiste Pressac wrote: Hello, I am reading the documentation of WikiData where I learned that new properties could be suggested for discussion. But this means adding new properties to WikiData. However, is it possible to use existing RDF vocabularies Not directly. At the moment, you would just rely on a convention saying that a given wikibase property is equivalent to a concept from some other vocabulary. However, we are in the process of allowing claims on properties. Once this is possible, you will be able to connect properties to external identifiers, much in the way data items about people etc are cross-linked with external identifiers. This would allow you to model the equivalence between wikidata properties and other vocabularies. However, the software itself would not be aware of the equivalence, so it would not be explicit in the RDF representation of data items. But it would be easy for an external tool that knows how to interpret such claims on properties to build an appropriate mapping using owl:sameAs or a similar mechanism. Daniel is right about this mechanism (but, as I said earlier today, owl:equivalentProperty is the way to go here, not owl:sameAs). However, there is another important point to consider: statements in Wikidata cannot be expressed as single triples in RDF. You need auxiliary nodes for statements to represent qualifiers and references. For details, see our technical report http://korrekt.org/page/Introducing_Wikidata_to_the_Linked_Data_Web Due to this, you cannot just take external properties and use them to replace Wikidata properties: the RDF version of Wikidata does not have any property that links subjects (items) and objects (values) directly. There are several approaches to get back to single triples (mainly: named graphs and simplified exports); see the technical report for details. The other issue that one has to be aware of is that we use properties not just for the main part of a statement, but also for qualifiers and for references. One should be clear about whether an external property applies to all or only to some of these uses. For example, an external property that has Person as its domain should never be used in a reference, even if (maybe in error) somebody has used the Wikidata property in a reference. We plan to generate the RDF dumps described in the technical report regularly. This would be a possible place for implementing the re-use of external vocabularies. If you are interested in this, you are welcome to join -- basically, one could have a mechanism based on either a hard-coded mapping (in the export code) or based on templates on property talk pages (like constraints now). Cheers, Markus ___ Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l
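To illustrate the point about auxiliary nodes: a statement such as cement (Q45190) emissivity (P1295) 0.54 with a reference r comes out as a group of triples around a statement node s, roughly of the form below. The predicate names here are illustrative placeholders; the actual export vocabulary is the one defined in the technical report:

\[ (\texttt{Q45190},\ \texttt{P1295\_statement},\ s), \qquad (s,\ \texttt{P1295\_value},\ 0.54), \qquad (s,\ \texttt{reference},\ r) \]

So there is no single triple linking Q45190 directly to 0.54 into which an external property IRI could be swapped; any mapping to external vocabularies has to be defined on top of this structure (e.g., in a simplified export).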
Re: [Wikidata-l] What is the point of properties?
helpers (constraints) distinct from sourced information (statements about items). My recommendation is to rely mainly on the main taxonomy instead of creating a parallel property taxonomy, and then think of ways to extract information from the main taxonomy to convert it automatically into constraints. All the maintenance takes effort, so the more it can be automated, the more efficient volunteers will be. And if we can simplify the maintenance of properties, we will be able to simplify the creation of properties too, especially when we face the next surge, which will come with the datatype number with units. I agree with the general goals, but I don't think that things become any easier if we confuse information about properties with information about items. We can still re-use information we have about items (like the class hierarchy that we already use in constraints) to avoid duplication, but some things are clearly not part of the item taxonomy. Cheers, Markus On Wed, May 28, 2014 at 2:48 PM, Markus Krötzsch mar...@semantic-mediawiki.org mailto:mar...@semantic-mediawiki.org wrote: David, Regarding the question of how to classify properties and how to relate them to items: * same as (in the sense of owl:sameAs) is not the right concept here. In fact, it has often been discouraged to use this on the Web, since it has very strong implications: it means that in all uses of the one identifier, one could just as well use the other identifier, and that it is indistinguishable whether something has been said about the one or the other. That seems too strong here, at least for most cases. * In the world of OWL DL, sameAs specifically refers to individuals, not to classes or properties. Saying P sameAs Q does not imply that P and Q have the same extension as properties. For the latter, OWL has the relationship owl:equivalentProperty. This distinction of instance level and schema level is similar to the distinction we have between instance of and subclass of. * Therefore, I would suggest using a property called subproperty of as one way of relating properties (analogously to subclass of). It has to be checked whether this actually occurs in Wikidata (do we have any properties that would be in this relation, or do we make it a modelling principle to have only the most specific properties in Wikidata?). * The relationship from properties to items could be modelled with the existing property subject of (P805). * It might be useful to also have a taxonomic classification of properties. For example, we already group properties into properties for people, organisations, etc. Such information could also be added with a specific property (this would be a bit more like a category system on property pages). On the other hand, some of this might coincide with constraint information that could be expressed as claims. For instance, person properties might be those with Type (i.e., rdfs:domain) constraint human. By the way, our constraint system could use some systematisation -- there are many overlaps in what you can do with one constraint or another. Cheers, Markus On 28/05/14 12:14, David Cuenca wrote: Markus, The explanation about the implications of renaming/deleting makes the most sense, and just that already justifies the separation in two. It is equally true that when we create a property, we might have cleaned the original concept so much that it might differ (even slightly) from the understood concept that the item represents. However, even after that process, the new concept is still an item... 
The process of imbuing a concept with permanent characteristics (adding a datatype) and the practical approach also seem to recommend keeping items and properties separate. Thanks for showing me that reasoning :) I am still wondering about how we are going to classify properties. Maybe it will require a broader discussion, but if they are the same (or mostly the same) as items, then we can just link them as same as, and build the classing structure just for the items. OTOH, if they are different, then we will need to mirror that classification for properties, which seems quite redundant. Plus adding a new datatype, property. All in all, my conclusion about this is that properties are just concepts with special qualities that justify the separation in the software (even if in real life there is no separation). many thanks for your detailed answer, and sorry if I'm bringing up already discussed topics. It is just that when you stare long into wikidata, wikidata stares back into you ;) Cheers, Micru
Re: [Wikidata-l] What is the point of properties?
David, One of the uses is: what is the relationship between a human and his behavior? This is an easy question once you have been clear about what human behaviour is. According to enwiki, it is a range of behaviours *exhibited by* humans. The bigger question for me is whether it is useful to record this relationship (exhibited by) in Wikidata. What would anybody do with this data? In what application could it be of interest? Moreover, as a great Icelandic ontologist once said: There is definitely, definitely, definitely no logic, to human behaviour ;-) In that regard, I hate the word constraint, because it means that we are placing a straitjacket on reality, when it is the other way round: recurring patterns in the real world make us expect that a value will fall within the bounds of our expectations. I think constraints are already understood in this way. The name comes from databases, where a constraint violation is indeed a rather hard error. On the other hand, ironically, constraints (as a technical term) are often considered to be a softer form of modelling than (onto)logical axioms: a constraint can be violated while a logical axiom (as the name suggests) is always true -- if it is not backed by the given data, new data will be inferred. So as a technical term, constraint is quite appropriate for the mechanism we have, although it may not be the best term to clarify the intention. However, I would like to bring the conversation to a deeper level. ... With all this I want to make the point that there are two sources of expectations: - from our experience seeing repetitions and patterns in the values (male/female/etc between 10 and 50), which belong to the property - from the agreed definition of the concept itself, which belong to the data Yes. I agree with this as a basic dichotomy of things we may want to record in Wikidata. Some things are true by definition, while others are just very likely by observation. The exact population of Paris we will never know, but we are completely sure that a piano is an instrument. (Maybe somebody with a better philosophical background than me could give a better perspective of these notions -- analytical vs. empirical come to mind, but I am sure there is more.) Some important ideas like classification (instance of/subclass of) belong completely to the analytical realm. We don't observe classes, we define them. A planet is what we call a planet, and this can change even if the actual lumps in space are pretty much the same. However, there is yet a deeper level here (you asked for it ;-). Wikidata is not about facts but about statements with references. We do not record Pluto was a planet until 2006 but Pluto was a planet until 2006 *according to the IAU*. Likewise, we don't say Berlin has 3 million inhabitants but Berlin has 3 million inhabitants *according to the Amt fuer Statistik Berlin-Brandenburg*. If you compare these two statements, you can see that they are both empirical, based on our observation of a particular reference. We do not have analytical knowledge of what the IAU or the Amt fuer Statistik might say. So in this sense constraints can only ever be rough guidelines. It does not make logical sense to say if source A says X then source B must say Y -- even if we know that X implies Y (maybe by definition), we don't know what sources A and B say. All we can do with constraints is to uncover possible contradictions between sources, which might then be looked into. Now inferences are slightly different. 
If we know that X implies Y, then if A says X we can infer that (implicitly) A says Y. That is a logical relationship (or rule) on the level of what is claimed, rather than on the level of statements. Note that we still need to have a way to find out that X implies Y, which is a content-level claim that should have its own reference somewhere. We mainly use inference in this sense with subclass of in reasonator or when checking constraints. In this case, the implications are encoded as subclass-of statements (If X is a piano, then X is an instrument). This allows us to have references on the implications. In general, an interesting question here is what the status of subclass of really is. Do we gather this information from external sources (surely there must be a book that tells us that pianos are instruments) or do we as a community define this for Wikidata (surely, the overall hierarchy we get is hardly the universal class hierarchy of the world but a very specific classification that is different from other classifications that may exist elsewhere)? Best not to think about it too much and to gather sources whenever we have them ;-) Besides these two notions (constraints to uncover inconsistent references, and logical axioms to derive new statements from given ones), there is also a third type of constraint that is purely analytical. If we *define* that our
Re: [Wikidata-l] Subclass of/instance of
On 14/05/14 19:33, Joe Filceolaire wrote: Except that there are lots of people who have appeared in one movie who don't consider themselves actors and should not have the 'occupation=actor/actress'. There are good reasons for some constraints to be gadgets that can be overridden rather than hard coded semantic limits. Sure, we completely agree here. It was just an example. But it shows why we need any such feature to be controlled by the community ;-) I do think we should be able to have hard coded reverse properties and symmetric properties. By hard coded do you mean stored explicitly (as opposed to: inferred in some way)? It will always be possible to store anything explicitly in this sense (but I guess you know this; maybe I misunderstood what you said; feel free to clarify). In general, what I mentioned about inferencing is not supposed to alter the way in which the site works. It would be more like a layer on top that could be useful for asking queries. For example, imagine you want to query for the grandmother of a person: we don't have this property in Wikidata but we have enough information to answer the query. So you would have to research how to get this information by combining existing properties. The idea is that one could have a place to keep this information (= the definition of grandmother in terms of Wikidata properties). We would then have a community approved way of finding grandmothers in Wikidata, and you would be much faster with your query. At the same time, you could look up the definition to find out how Wikidata really stores this information. None of this would change how the underlying data works, but it could help with some data modelling problems because it gives you an option to support a property without the added maintenance cost on the data management level. Cheers, Markus On Wed, May 14, 2014 at 2:33 PM, Markus Krötzsch mar...@semantic-mediawiki.org mailto:mar...@semantic-mediawiki.org wrote: I guess there is already a group of people who deal w Hi Eric, Thanks for all the information. This was very helpful. I only get to answer now since we have been quite busy building RDF exports for Wikidata (and writing a paper about it). I will soon announce this here (we still need to fix a few details). You were asking about using these properties like rdfs:subClassOf and rdf:type. I think that's entirely possible, since the modelling is very reasonable and would probably yield good results. Our reasoner ELK could easily handle the class hierarchy in terms of size, but you don't really need such a highly optimized tool for this as long as you only have subClassOf. In fact, the page you linked to shows that it is perfectly possible to compute the class hierarchy with Wikidata Query and to display all of it on one page. ELK's main task is to compute class hierarchies for more complicated ontologies, which we do not have yet. OTOH, query answering and data access are different tasks that ELK is not really intended for (although it could do some of this as well). Regarding future perspectives: one thing that we have also done is to extract OWL axioms from property constraint templates on Wikidata talk pages (we will publish the result soon, when announcing the rest). This gives you only some specific types of OWL axioms, but it is making things a bit more interesting already. In particular, there are some constraints that tell you that an item should have a certain class, so this is something you could reason with. 
However, the current property constraint system does not work too well for stating axioms that are not related to a particular property (such as: Every [instance of] person who appears as an actor in some film should be [instance of] in the class 'actor' -- which property or item page should this be stated on?). But the constraints show that it makes sense to express such information somehow. In the end, however, the real use of OWL (and similar ontology languages) is to remove the need for making everything explicit. That is, instead of constraints (which say: if your data looks like X, then your data should also include Y) you have axioms (which say: if your data looks like X, then Y follows automatically). So this allows you to remove redundancy rather than to detect omissions. This would make more sense with derived notions that one does not want to store in the database, but which make sense for queries (like grandmother). One would need a bit more infrastructure for this; in particular, one would need to define grandmother (with labels in many languages) even if one does not want to use it as a property but only in queries. Maybe one could have a separate Wikibase installation for defining such derived notions without needing to change Wikidata?
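Both derived notions mentioned in this thread can be written down concisely. As a sketch in rule and description-logic notation (mother (P25), father (P22), and cast member (P161) are existing Wikidata properties; parent abbreviating the union of mother and father is a simplification of mine):

\[ \mathrm{grandmother}(x, z) \leftarrow \mathrm{parent}(x, y) \wedge \mathrm{mother}(y, z) \]

\[ \mathrm{Person} \sqcap \exists\, \mathrm{castMember}^{-}.\mathrm{Film} \sqsubseteq \mathrm{Actor} \]

The first is the kind of community-approved definition one could keep for queries without ever storing a grandmother statement; the second is the actor example read as an axiom, which would silently derive class memberships instead of flagging violations -- precisely the hard reading that, as argued above, one may prefer to keep as a soft constraint.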
Re: [Wikidata-l] Subclass of/instance of
Hi Eric, Thanks for all the information. This was very helpful. I only get to answer now since we have been quite busy building RDF exports for Wikidata (and writing a paper about it). I will soon announce this here (we still need to fix a few details). You were asking about using these properties like rdfs:subClassOf and rdf:type. I think that's entirely possible, since the modelling is very reasonable and would probably yield good results. Our reasoner ELK could easily handle the class hierarchy in terms of size, but you don't really need such a highly optimized tool for this as long as you only have subClassOf. In fact, the page you linked to shows that it is perfectly possible to compute the class hierarchy with Wikidata Query and to display all of it on one page. ELK's main task is to compute class hierarchies for more complicated ontologies, which we do not have yet. OTOH, query answering and data access are different tasks that ELK is not really intended for (although it could do some of this as well). Regarding future perspectives: one thing that we have also done is to extract OWL axioms from property constraint templates on Wikidata talk pages (we will publish the result soon, when announcing the rest). This gives you only some specific types of OWL axioms, but it is making things a bit more interesting already. In particular, there are some constraints that tell you that an item should have a certain class, so this is something you could reason with. However, the current property constraint system does not work too well for stating axioms that are not related to a particular property (such as: Every [instance of] person who appears as an actor in some film should be [instance of] in the class 'actor' -- which property or item page should this be stated on?). But the constraints show that it makes sense to express such information somehow. In the end, however, the real use of OWL (and similar ontology languages) is to remove the need for making everything explicit. That is, instead of constraints (which say: if your data looks like X, then your data should also include Y) you have axioms (which say: if your data looks like X, then Y follows automatically). So this allows you to remove redundancy rather than to detect omissions. This would make more sense with derived notions that one does not want to store in the database, but which make sense for queries (like grandmother). One would need a bit more infrastructure for this; in particular, one would need to define grandmother (with labels in many languages) even if one does not want to use it as a property but only in queries. Maybe one could have a separate Wikibase installation for defining such derived notions without needing to change Wikidata? There are no statements on properties yet, but one could also use item pages to define derived properties when using another site ... Best regards, Markus P.S. Thanks for all the work on the semantic modelling aspects of Wikidata. I have seen that you have done a lot in the discussions to clarify things there. On 06/05/14 04:53, emw wrote: Hi Markus, You asked who is creating all these [subclass of] statements and how is this done? The class hierarchy in http://tools.wmflabs.org/wikidata-todo/tree.html?q=Q35120&rp=279&lang=en shows a few relatively large subclass trees for specialist domains, including molecular biology and mineralogy. 
The several thousand 'gene' and 'protein' subclass of claims were created by members of WikiProject Molecular biology (WD:MB), based on discussions in [1] and [2]. The decision to use P279 instead of P31 there was based on the fact that the is-a relation in Gene Ontology maps to rdfs:subClassOf, which P279 is based on. The claims were added by a bot [3], with input from WD:MB members. The data ultimately comes from external biological databases. A glance at the mineralogy class hierarchy indicates it has been constructed by WikiProject Mineralogy [4] members through non-bot edits. I imagine most of the other subclass of claims are done manually or semi-automatically outside specific Wikiproject efforts. In other words, I think most of the other P279 claims are added by Wikidata users going into the UI and building usually-reasonable concept hierarchies on domains they're interested in. I've worked on constructing class hierarchies for health problems (e.g. diseases and injuries) [5] and medical procedures [6] based on classifications like ICD-10 and assertions and templates on Wikipedia (e.g. [8]). It's not incredibly surprising to me that Wikidata has about 36,000 subclass of (P279) claims [9]. The property has been around for over a year and is a regular topic of discussion [10] along with instance of (P31), which has over 6,600,000 claims. You noted a dubious subclass of claim for 'House of Staufen' (Q130875). I agree that instance of would probably be the better membership property
Re: [Wikidata-l] Wikidata Toolkit 0.1.0 released
Hi Gerard. On 09/04/14 10:54, Gerard Meijssen wrote: Hoi, What is the relevance of these tools when you have to have specialised environments to use them? Not sure what you mean. Wikidata Toolkit doesn't have any requirements other than plain old Java to run. Nevertheless, we'd also like to support people who are using some of the common Java development tools that are around, especially the free ones. Currently, we only have instructions for Eclipse users, but we could extend this. Which tools do you normally use to develop Java? Cheers Markus On 9 April 2014 10:41, Daniel Kinzler daniel.kinz...@wikimedia.de mailto:daniel.kinz...@wikimedia.de wrote: On 08.04.2014 23:34, Denny Vrandečić wrote: I was trying to use this, but my Java is a bit rusty. How do I run the DumpProcessingExample? I did the following steps:

git clone https://github.com/Wikidata/Wikidata-Toolkit
cd Wikidata-Toolkit
mvn install
mvn test

Now, how do I start DumpProcessingExample? Looks like you are supposed to run it from Eclipse. It would be very useful if maven would generate a jar with all dependencies for the examples, or if there was a shell script that would allow us to run classes without the need to specify the full class path. Finding out how to get all the libs you need into the classpath is one of the major annoyances of java... -- daniel -- Daniel Kinzler Senior Software Developer Wikimedia Deutschland Gesellschaft zur Förderung Freien Wissens e.V. ___ Wikidata-l mailing list Wikidata-l@lists.wikimedia.org mailto:Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l ___ Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l ___ Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l
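For reference, a common way to run a single example class without assembling the classpath by hand is the exec-maven-plugin. Assuming the example lives in a package like org.wikidata.wdtk.examples (the exact module and package name may differ from this sketch), something along these lines should work from the module that contains it:

mvn exec:java -Dexec.mainClass="org.wikidata.wdtk.examples.DumpProcessingExample"

mvn exec:java resolves the module's declared dependencies onto the classpath automatically, which addresses exactly the annoyance Daniel describes.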
[Wikidata-l] Wikidata submissions to Wikimania
Dear all, There are quite a few Wikidata-related submissions to Wikimania [0]. The selection of the program committee seems to be based on user votes to some extent, so don't forget to add your name to the submission pages you care about :-). I just added another two: * How to use Wikidata: Things to make and do with 30 million statements [1] A general introductory talk about Wikidata data reuse in all of its forms. * Wikidata Toolkit: A Java library for working with Wikidata [2] A tutorial for working with Wikidata Toolkit (expected to be much more feature rich at the time of Wikimania ;-) Feedback is welcome. Cheers, Markus [0] https://wikimania2014.wikimedia.org/wiki/Category:Submissions [1] https://wikimania2014.wikimedia.org/wiki/Submissions/How_to_use_Wikidata:_Things_to_make_and_do_with_30_million_statements [2] https://wikimania2014.wikimedia.org/wiki/Submissions/Wikidata_Toolkit:_A_Java_library_for_working_with_Wikidata ___ Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l
[Wikidata-l] What's up with our incremental (daily) dumps?
Hi, For a few weeks now, no daily dumps have been published for Wikidata. Only empty directories are created every day. I could not find a related email on any of the lists I scanned. Can anybody clarify what the situation is now? Cheers, Markus ___ Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l
Re: [Wikidata-l] What's up with our incremental (daily) dumps?
On 13/03/14 17:14, Katie Filbert wrote: On Thu, Mar 13, 2014 at 5:06 PM, Markus Krötzsch mar...@semantic-mediawiki.org mailto:mar...@semantic-mediawiki.org wrote: Hi, Since a few weeks now, no daily dumps have been published for Wikidata. Only empty directories are created every day. I could not find a related email on any list I scanned. Can anybody clarify what the situation is now? The issue is due to the dumps being moved to the Ashburn data center. https://bugzilla.wikimedia.org/show_bug.cgi?id=62315 They should be running again soonish. Good to know. Thanks, Markus ___ Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l
Re: [Wikidata-l] Large scale glitch in references
Hi ValterVB, On 04/03/14 20:17, ValterVB wrote: Hi Markus, it's an error by my bot (ValterVBot). Thanks for noting it. I can probably fix it on Friday or Saturday; the source should be Q11920, not Q11329. Sorry for this problem. Great, that should be fine. ValterVB PS: I'm not sure if I replied to the mail archive or to your private mail; in the second case, can you post this mail? Thanks. Done. Best regards, Markus ___ Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l
Re: [Wikidata-l] supported and planned wikidata uris( was Re:Meta header for asserting that a web page is about a Wikidata subject)
Hi, On 26/02/14 22:40, Michael Smethurst wrote: Hello *Really* not meaning to jump down any http-range-14 rabbit holes but wasn't there a plan for wikidata to have uris representing things and pages about those things? From conversations on this list I sketched a picture a while back of all the planned URIs: http://smethur.st/wp-uploads/2012/07/46159634-wikidata.png Where http://wikidata.org/id/Qetc Was the thing uri (which you could point a foaf:primaryTopic at) As Denny said in reply to another message, the preferred URI for this is http://www.wikidata.org/entity/Qetc This is also the form of URIs used within Wikidata data for certain things (e.g., coordinates that refer to earth use the URI http://www.wikidata.org/entity/Q2 to do so, even in JSON). and http://wikidata.org/wiki/Qetc Was the document uri Yes. However, for metadata it is usually preferred to use the entity URI, since the document http://wikidata.org/wiki/Qetc is just an automatic UI rendering of the data, and as such relatively uninteresting. One will eventually get (using content negotiation) all data in RDF from http://www.wikidata.org/entity/Qetc (JSON should already work, and html works of course, when opening the entity URI in normal browsers). The only reason for using the wiki URI directly would be if one uses a property that requires a document as its value, but in this case one should probably better use another property. Best regards, Markus Mainly asking not for the wikipedia wikidata relationships but wondering if there's a more up to date picture of supported wikidata uri patterns and redirects? Recently I was trying to find a way to programmatically get wikidata uris from wikipedia uris and tried various combinations of: http://wikidata.org/title/enwiki:Berlin http://en.wikidata.org/item/Berlin http://en.wikidata.org/title/Berlin (all mentioned on the list / wiki) but all of them return a 404 Is there a way to do this? Michael On 26/02/2014 19:09, Dan Brickley dan...@danbri.org wrote: On 26 February 2014 10:45, Joonas Suominen joonas.suomi...@wikimedia.fi wrote: How about using RDFa and foaf:primaryTopic like in this example https://en.wikipedia.org/wiki/RDFa#XHTML.2BRDFa_1.0_example 2014-02-26 20:18 GMT+02:00 Paul Houle ontolo...@gmail.com: Isn't there some way to do this with schema.org? The FOAF options were designed for relations between entities and documents - foaf:primaryTopic relates a Document to a thing that the doc is primarily about (i.e. assumes entity IDs as value, pedantically). the inverse, foaf:isPrimaryTopicOf, was designed to allow an entity description in a random page to anchor itself against well known pages. In particular we had Wikipedia in mind. http://xmlns.com/foaf/spec/#term_primaryTopic http://xmlns.com/foaf/spec/#term_isPrimaryTopicOf (Both of these share a classic Semantic Web pickiness about distinguishing things from pages about those things). Much more recently at schema.org we've added a new property/relationship called http://schema.org/sameAs It relates an entity to a reference page (e.g. wikipedia) that can be used as a kind of proxy identifier for the real world thing that it describes. Not to be confused with owl:sameAs which is for saying here are two ways of identifying the exact same real world entity. None of these are a perfect fit for a relationship between a random Web page and a reference page. But maybe close enough? 
Both FOAF and schema.org are essentially dictionaries of hopefully-useful terms, so you can use them in HTML head, or body, according to taste, policy, tooling etc. And you can choose a syntax (microdata, rdfa, json-ld etc.). I'd recommend using the new schema.org 'sameAs', e.g. in RDFa Lite, <link href="https://en.wikipedia.org/wiki/Buckingham_Palace" property="http://schema.org/sameAs" /> This technically says the thing we're describing in the current element is Buckingham_Palace. If you want to be more explicit and say this Web page is about a real world Place and that place is Buckingham_Palace ... you can do this too with a bit more nesting; the HTML body might be a better place for it. Dan ___ Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l ___ Wikidata-l mailing list Wikidata-l@lists.wikimedia.org
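A small sketch of the content negotiation described above, using only the JDK: it requests the conceptual entity URI with an Accept header for JSON and lets the redirect lead to the data document (this assumes the server behaves as described in this thread; error handling is omitted):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class EntityUriExample {
    public static void main(String[] args) throws Exception {
        // The entity URI denotes the thing itself; asking it for JSON
        // should redirect to a data document about that thing.
        URL url = new URL("https://www.wikidata.org/entity/Q42");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestProperty("Accept", "application/json");
        // HttpURLConnection follows same-protocol redirects by default.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}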
Re: [Wikidata-l] CFP - IEEE Co-sponsored CyberSec2014 - Lebanon Section
This call is a scam. The conference is not a legit academic event but aims at making money. It is a sad truth that there is an increasingly large amount of (more or less) academic conference spam these days. IEEE has been criticized for sponsoring events without sufficient quality control [1], and I tend to ignore all events that advertise in its name. Some typical signs of scam conferences: * Keywords: cyber, world, global, multiconference, as well as weird keyword combinations and neologisms (peacefare?!) * claimed(?) IEEE sponsoring * registration fees per accepted paper instead of per participant (pay to publish), see e.g. [2]; this is probably the one 100% sure sign of a fake conference; I have never seen any legit event doing this * non-committing choice of words (potential inclusion to IEEE Xplore, possible keynote speakers etc.) * lack of (trustworthy) names and institutions related to the event (that's hard to judge if you are not in a relevant research community); in some cases known names may be abused or suggested to cause confusion There are more hints on recognizing scam conferences and journals online, e.g. at [3]. In general, I am in favour of filtering all academic calls for papers from this list. Even if the event is legit and has a relevant topic, this is not a forum to ask for academic contributions; there are more than enough channels these days to advertise events. Calls for participations in community events are a different story. Markus [1] http://blog.lib.umn.edu/denis036/thisweekinevolution/2011/07/would_ieee_really_sponsor_a_fa.html [2] http://sdiwc.net/conferences/2014/cybersec2014/registration/ [3] http://www.cs.bris.ac.uk/Teaching/learning/junk.conferences.html On 16/01/14 08:44, Sven Manguard wrote: I am beginning to get tired of these types of solicitations, as they seem to be coming in regularly, and more often than not, have little to do with Wikidata. Do people on this list find them useful? If so, is this the most appropriate list? If not, is there any interest in prohibiting posts like this? Sven On Jan 16, 2014 2:38 AM, Liezelle Ann Canadilla lieze...@sdiwc.info mailto:lieze...@sdiwc.info wrote: All the registered papers will be submitted to IEEE for potential inclusion to IEEE Xplore as well as other Abstracting and Indexing (AI) databases. TITLE: The Third International Conference on Cyber Security, Cyber Warfare, and Digital Forensic (CyberSec2014) EVENT VENUE: Lebanese University, Lebanon CONFERENCE DATES: Apr. 29 – May 1, 2014 EVENT URL: http://sdiwc.net/conferences/2014/cybersec2014/ OBJECTIVE: To provide a medium for professionals, engineers, academicians, scientists, and researchers from over the world to present the result of their research activities in the field of Computer Science, Engineering and Information Technology. CyberSec2014 provides opportunities for the delegates to share the knowledge, ideas, innovations and problem solving techniques. Submitted papers will be reviewed by the technical program committee of the conference. KEYWORDS: Cyber Security, Digital Forensics, Information Assurance and Security Management, Cyber Peacefare and Physical Security, and many more... 
SUBMISSION URL: http://sdiwc.net/conferences/2014/cybersec2014/openconf/openconf.php FIRST SUBMISSION DEADLINE: March 29, 2014 CONTACT EMAIL: cyb2...@sdiwc.net ___ Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l
Re: [Wikidata-l] ontology Wikidata API, managing ontology structure and evolutions
On 10/01/14 03:21, emw wrote: What about monthly/dump-based aggregated property usage statistics? Property usage statistics would be very valuable, Dimitris. It would help inform community decisions about how to steer changes in property usage with less disruption. It would have other significant benefits as well. Getting daily counts like https://www.wikidata.org/wiki/Wikidata:Database_reports/Popular_properties back up and running would be a good place to start. That report hasn't been updated since October 2013. We could go further by showing counts for all properties, not just the top 100. More detailed data would be great, too. Wikidata editors recently posted a list of the most popular objects for 'instance of' (P31) claims at https://www.wikidata.org/w/index.php?title=Property_talk:P31&oldid=99405143#Value_statistics. Having daily data like that for all properties would be quite useful. Thanks for the suggestions. I will put all of these on the list for the Wikidata Toolkit development. Providing up-to-date analytics of this kind is a good basic use case for this project. (Btw, the project starts officially in mid-February and runs for six months, but we will start working before that already; there will be a bit more planning before we start hacking.) Markus If anyone does end up doing something like this, I would recommend archiving the data at http://dumps.wikimedia.org/other/ in addition to posting it in a regularly updated report in Wikidata. Cheers, Eric https://www.wikidata.org/wiki/User:Emw On Thu, Jan 9, 2014 at 12:59 PM, Dimitris Kontokostas kontokos...@informatik.uni-leipzig.de wrote: What about monthly/dump-based aggregated property usage statistics? People would be able to check property trends or maybe subscribe to specific properties via RSS. On Thu, Jan 9, 2014 at 3:55 PM, Daniel Kinzler daniel.kinz...@wikimedia.de wrote: On 08.01.2014 16:20, Thomas Douillard wrote: Hi, a problem seems (not very surprisingly) to emerge in Wikidata: managing the evolution of how we do things on Wikidata. Properties are deleted, which leaves some consumers of the data a little frustrated that they are not informed of this and cannot take part in the discussion. They are informed if they follow the relevant channels. There's no way to inform them if they don't. These channels can very likely be improved, yes. That being said: a property that is still widely used should very rarely be deleted, if at all. Usually, properties would be phased out by replacing them with another property, and only then they get deleted. Of course, 3rd parties that rely on specific properties would still face the problem that the property they use is simply no longer used (that's the actual problem - whether it is deleted doesn't really matter, I think). So, the question is really: how should 3rd party users be notified of changes in policy and best practice regarding the usage and meaning of properties? That's an interesting question, one that doesn't have a technical solution I can see. -- daniel -- Daniel Kinzler Senior Software Developer Wikimedia Deutschland Gesellschaft zur Förderung Freien Wissens e.V.
___ Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l -- Dimitris Kontokostas Department of Computer Science, University of Leipzig Research Group: http://aksw.org Homepage: http://aksw.org/DimitrisKontokostas ___ Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l
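A rough sketch of the kind of dump-based property counting discussed in this thread. The file name and the assumption of one JSON entity per line are for illustration only; the actual dump layout may differ:

    import bz2
    import json
    from collections import Counter

    counts = Counter()
    # Illustrative only: assumes a bz2-compressed dump with one JSON
    # entity per line ('entities.json.bz2' is a placeholder file name).
    with bz2.open('entities.json.bz2', 'rt', encoding='utf-8') as dump:
        for line in dump:
            line = line.strip().rstrip(',')
            if not line.startswith('{'):
                continue  # skip surrounding JSON list brackets, if any
            entity = json.loads(line)
            for prop, statements in entity.get('claims', {}).items():
                counts[prop] += len(statements)

    # Print the 100 most used properties with their statement counts.
    for prop, num in counts.most_common(100):
        print(prop, num)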
Re: [Wikidata-l] How are queries doing?
Hi, On a related note, there is also an upcoming project, Wikidata Toolkit [1], that will look into implementing query functionality over Wikidata content, not to replace the Wikidata query features but to provide functionality that is not a top priority for the core development. The first step towards this will be to collect concrete requirements (what queries exactly? on what part of the data?). I will send an email about this in due course, but input is always welcome. There is no query result integration into Wikipedia via this route, but a range of interesting Wikidata-driven Web services and query features could be created. The lack of tight coupling to Wikipedia deployment makes this project a lot more flexible, with room for experiments and new ideas that might also inspire future core features. Cheers, Markus [1] https://meta.wikimedia.org/wiki/Grants:IEG/Wikidata_Toolkit On 08/01/14 12:22, Gerard Meijssen wrote: Hoi, I agree that the integration of Wikidata in all the different Wikipedias, Wikivoyages, Wikisources, Wiktionaries and Commons is the most important objective. It is so important because this ensures that the data will be actually used. We are doing fine I think. However, not all the quirks of specific Wikis can be supported. Wikidata is data driven and consequently it matters a lot if a Wikipedia article is an article, a list or used for disambiguation. Thanks, GerardM On 8 January 2014 12:04, Dan Brickley dan...@danbri.org wrote: On 7 January 2014 22:08, Jan Kučera kozuc...@gmail.com wrote: nice to read all the reasoning why queries are yet still not possible, but I think we live in 2014 and not 1914 actually... seems like the problem is too small a budget or bad management... cannot really think of another reason. How much do you think it would cost to make queries a reality for production at Wikidata? Absolutely the most important thing about Wikidata is the deep integration (both technical and social) into the Wikipedia universe. Building a sensible query framework for a system working at Wikipedia scale (http://www.alexa.com/siteinfo/wikipedia.org) is far from trivial. I'm glad to hear that Wikidata are taking the time to do this carefully. Dan ___ Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l
Re: [Wikidata-l] Wikidata-Freebase mappings published under CC0
On 12/11/13 16:26, Sven Manguard wrote: Google would not have sent over a large chunk of cash to help get Wikidata started if it didn't think it could use Wikidata. That Russian search engine company would not have sent over a large chunk of cash to keep Wikidata going if it didn't think it could use Wikidata. That doesn't mean Google is being malicious (about this, at least), it means that they are making a business decision. As long as Google doesn't try to make decisions about Wikidata content or operations - something it would have no economic reason to do anyways - I don't have a problem with that. Just don't pretend that Google is doing this out of the goodness of their hearts. Another important thing to note in this context is that all funding for Wikidata so far had the form of donations, which is crucially different from sponsoring (where you get something in return). The donors who give their money hope, of course, that the project will do something that they will find useful, but they exercise no control whatsoever in the development process. The donations are not bound to any condition, not even reporting, and there is no way to retract them. So each donor's initial intentions, whatever they were, have no influence on the execution of the project (moreover, I Kant think of any reason why the intentions rather than the outcome should determine the value of the deed ;-). Cheers, Markus On Nov 11, 2013 6:27 PM, Cristian Consonni kikkocrist...@gmail.com wrote: 2013/11/11 Denny Vrandečić vrande...@google.com: as you know, I have recently started a new job. My mission to get more free data out to the world has not changed due to that, though. =)) I am very happy to hear this. Also, the mapping is awesome. 2013/11/11 Klein,Max kle...@oclc.org: I regretted writing what I did after thinking about it over lunch, since it is not Assume good faith towards Google. Maybe one of the reasons that I was sensitive to it was because I'm representing VIAF in Wikidata, which is kind of the same as Freebase in Wikidata, and I wouldn't want people assuming bad faith about VIAF. Thanks for being clear and open about your work, it's a real inspiration. With apologies, Yours too, Max. Thank you both for your very good work. Cristian ___ Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l
Re: [Wikidata-l] Questions about statement qualifiers
Hi Antoine, The main answer to your questions is that the data model of Wikidata defines a *data structure*, not the *informal meaning* that this data structure has in an application context (that is: what we, humans, want to say when we enter it). I try to explain this a bit better below. How the presence or absence of a qualifier contributes to the informal meaning of a statement is not something that is defined by Wikidata. Just like Wikidata does not define what the property office held means, it also does not define what it means if office held is used with additional qualifiers. This is entirely governed by the community who uses these structures to express something. Of course, the community tries to do this in a systematic, reasoned, and intuitive way. However, there will never be a general rule for how to interpret an arbitrary qualifier. In particular, it is not true that qualifiers are statements about statements. First, to avoid confusion, I need to explain Wikidata's terminology. A statement in Wikidata comprises the whole data structure: main property and value, qualifiers, and references. The structure without the references (the thing that the references provide evidence for) is called a claim. A claim thus contains a main property and value (or no value or unknown value) and zero or more qualifier properties with values. Every claim encodes something that is claimed about the subject of the page (the Wikidata entity), and the references given are supporting this claim (as a whole). You already illustrated yourself how this is different from making statements about statements: it would lead to confusion when several statements have the same main property-value but different qualifiers. This is also why our RDF export does not use the same resource for reifying statements with the same main property-value. Instead, we only share the same resource if two claims are completely identical (including qualifiers). It is true that many qualifiers have a certain meta flavour, but this is not always the case. An interesting case that you might have seen is P161 (cast member) that is used to denote the actors in a film. The typical qualifier there is P453 (role), used to name the role (character) that the person played in the film. If you look at this, this is more like a ternary relation hasActor(film,actor,role) than like a meta-statement. Indeed, an n-ary relationship cannot in general be represented by a meta-statement about a binary relation, again for the same reasons that you gave in your email. In this view, one should maybe also think of a relationship usPresident(person,start date,end date) rather than of an annotated assertion usPresident(person). Wikidata is special in that qualifiers are optional, yet the modelling view of n-ary relations might be closer to the pragmatic truth, since it avoids any meta-statements (it also elegantly justifies why there are no meta-meta-statements, i.e., qualifiers on qualifiers). Best regards, Markus On 31/10/13 11:39, Antoine Zimmermann wrote: Hello, I have a few questions about how statement qualifiers should be used. First, my understanding of qualifiers is that they define statements about statements. So, if I have the statement: Q17(Japan) P6(head of government) Q132345(Shinzō Abe) with the qualifier: P39(office held) Q274948(Prime Minister of Japan) it means that the statement holds an office, right? It seems to me that this is incorrect and that this qualifier should in fact be a statement about Shinzō Abe. Can you confirm this?
Second, concerning temporal qualifiers: what does it mean that the start or end is no value? I can imagine two interpretations: 1. the statement is true forever (a person is a dead person from the moment of their death till the end of the universe) 2. (for end date) the statement is still true, we cannot predict when it's going to end. For me, case number 2 should rather be marked as unknown value rather than no value. But again, what does unknown value mean in comparison to having no indicated value? Third, what if a statement is temporarily true (say, X held office from T1 to T2) then becomes false and becomes true again (like X held the same office from T3 to T4 with T3 > T2)? The situation exists for Q35171(Grover Cleveland) who has the following statement: Q35171 P39(position held) Q11696(President of the United States of America) with qualifiers, and a second occurrence of the same statement with different qualifiers. The Wikidata user interface makes it clear that there are two occurrences of the statement with different qualifiers, but how does the Wikidata data model allow me to distinguish between these two occurrences? How do I know that: P580(start date) March 4 1885 only applies to the first occurrence of the statement, while: P580(start date) March 4 1893 only applies to the second occurrence of the statement? I could have a
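To make the claim-identity point concrete, here is a minimal sketch (schematic Python dicts, not the full Wikidata JSON; the item ids are invented for illustration) of two claims that share a main property-value pair but differ in their qualifiers, as in the P161/P453 example:

    # Schematic claim records: same main property-value pair, different
    # qualifiers. Q123/Q456/Q789 are made-up ids for illustration only.
    claim_1 = {
        'mainsnak': {'property': 'P161', 'value': 'Q123'},  # cast member: some actor
        'qualifiers': {'P453': ['Q456']},                   # role: character A
    }
    claim_2 = {
        'mainsnak': {'property': 'P161', 'value': 'Q123'},  # same actor
        'qualifiers': {'P453': ['Q789']},                   # role: character B
    }

    # A claim's identity is the whole structure, including qualifiers, so
    # the two occurrences stay distinguishable although the main snaks agree.
    print(claim_1['mainsnak'] == claim_2['mainsnak'])  # True
    print(claim_1 == claim_2)                          # False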
Re: [Wikidata-l] Application: sexing people by name/research gender bias
On 14/10/13 17:52, Klein,Max wrote: Hi all, First of all I think this is fantastic research. It goes to show, it's not just properties that we can correlate, but also the Labels, Aliases, Sitelinks, and the connections between each field. I would like to point out, as Markus does in his discussion - the relatively disproportionate representation of sex in Academia is the motivation for studying this. Let us be sensitive to results in that field. Let's remember our simplifying assumptions. We have flattened sex and gender into one measure, and at that this research makes a binary male/female classification, where even the Wikidata sex property is trinary (intersex). I hope that in the future we can increase or change our view to how we model sex. Indeed, the debates on gender inequality and gender multiplicity look at things on very different zoom levels. The goal of my little experiment (I would not call it research, as it has neither a hypothesis nor any form of evaluation) was not to put individual people into rigid gender buckets but to estimate rough global distributions. My error margins are far too wide to make any realistic statement about minority genders even if I had a method to consider them. As far as social definitions of gender go, this is probably something to study in a wider context of representation of social minorities in certain professional fields. Cheers, Markus From: wikidata-l-boun...@lists.wikimedia.org on behalf of Paul A. Houle p...@ontology2.com Sent: Sunday, October 13, 2013 5:32 PM To: Discussion list for the Wikidata project. Subject: Re: [Wikidata-l] Application: sexing people by name/research gender bias Just as a suggestion, you can turn these kinds of numbers into a probability distribution using the beta distribution. If you use (1,1) as a prior you get something like beta(251,1) for the distribution of the probability that somebody named Aaron is male. -Original Message- From: Markus Krötzsch Sent: Sunday, October 13, 2013 6:16 PM To: Discussion list for the Wikidata project. Subject: [Wikidata-l] Application: sexing people by name/research gender bias Hi all, I'd like to share a little Wikidata application: I just used Wikidata to guess the sex of people based on their (first) name [1]. My goal was to determine gender bias among the authors in several research areas. This is how some people spend their free time on weekends ;-) In the process, I also created a long list of first names with associated sex information from Wikidata [2]. It is not super clean but it served its purpose. If you are a researcher, then maybe the gender bias of journals/conferences is interesting to you as well. Details and some discussion of the results are online [1]. Cheers, Markus [1] http://korrekt.org/page/Note:Sex_Distributions_in_Research [2] https://docs.google.com/spreadsheet/ccc?key=0AstQ5xfO-xXGdE9UVkxNc0JMVWJzNmJqNmhPRjc0cnc&usp=sharing ___ Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l
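The beta-distribution suggestion is easy to try with SciPy. A minimal sketch, using the Beta(1,1) prior and the 250-male/0-female counts implied by the Beta(251,1) 'Aaron' figure above:

    from scipy.stats import beta

    # Beta(1,1) prior plus observed counts; 250 male / 0 female yields
    # the Beta(251,1) posterior from the 'Aaron' example.
    males, females = 250, 0
    posterior = beta(1 + males, 1 + females)

    print(posterior.mean())          # about 0.996
    print(posterior.interval(0.95))  # central 95% credible interval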
Re: [Wikidata-l] Application: sexing people by name/research gender bias
On 14/10/13 18:18, Tom Morris wrote: Naming patterns change over time and geography. If you're interested in the gender of current day authors, you should probably constrain your name sampling to the same timeframe. I think geography has a much bigger impact than time here. Unfortunately, the names I try to find the sex for do not come with an obvious hint on their geographic origin, so I cannot really use this. I think filtering by time will not have a big impact, since most people on Wikipedia are from the 20th century anyway. So there should be a natural tendency to overrule older uses of names. There's an app that works off the Freebase data here: http://namegender.freebaseapps.com/ It also has an API that returns JSON: http://namegender.freebaseapps.com/gender_api?name=andrea Based on the top name stats, it looks like its sample is a little more than twice the size of Wikidata's. Nice. Christian Thiele also pointed me to a beautiful web service based on Wikipedia Personendaten (German language, but many things are easy to figure out, I guess): http://toolserver.org/~apper/pd/vorname/top http://toolserver.org/~apper/pd/vorname/Maria This illustrates nicely how to take the effect of time into account. Markus On Sun, Oct 13, 2013 at 6:16 PM, Markus Krötzsch mar...@semantic-mediawiki.org wrote: Hi all, I'd like to share a little Wikidata application: I just used Wikidata to guess the sex of people based on their (first) name [1]. My goal was to determine gender bias among the authors in several research areas. This is how some people spend their free time on weekends ;-) In the process, I also created a long list of first names with associated sex information from Wikidata [2]. It is not super clean but it served its purpose. If you are a researcher, then maybe the gender bias of journals/conferences is interesting to you as well. Details and some discussion of the results are online [1]. Cheers, Markus [1] http://korrekt.org/page/Note:Sex_Distributions_in_Research [2] https://docs.google.com/spreadsheet/ccc?key=0AstQ5xfO-xXGdE9UVkxNc0JMVWJzNmJqNmhPRjc0cnc&usp=sharing ___ Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l
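Querying the JSON API mentioned above is a small exercise; this sketch just prints the raw response, since the response schema is not described in the thread and nothing is assumed about its fields:

    import json
    from urllib.parse import urlencode
    from urllib.request import urlopen

    # Query the name-gender endpoint quoted above; the service may no
    # longer be available, and its response fields are not assumed here.
    url = 'http://namegender.freebaseapps.com/gender_api?' + urlencode({'name': 'andrea'})
    with urlopen(url) as response:
        print(json.loads(response.read().decode('utf-8')))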
Re: [Wikidata-l] Application: sexing people by name/research gender bias
On 13/10/13 23:21, Magnus Manske wrote: If you need to push through automated sexing for items without sex property, point to my similar attempt in June: https://www.wikidata.org/wiki/Wikidata:Bot_requests#Set_sex:male_for_item_list Thanks, the list I got from the items with sex is already longer than I need. My main problem is sexing Asian authors. Not sure if name-based approaches are promising there at all. Markus On Sun, Oct 13, 2013 at 11:16 PM, Markus Krötzsch mar...@semantic-mediawiki.org wrote: Hi all, I'd like to share a little Wikidata application: I just used Wikidata to guess the sex of people based on their (first) name [1]. My goal was to determine gender bias among the authors in several research areas. This is how some people spend their free time on weekends ;-) In the process, I also created a long list of first names with associated sex information from Wikidata [2]. It is not super clean but it served its purpose. If you are a researcher, then maybe the gender bias of journals/conferences is interesting to you as well. Details and some discussion of the results are online [1]. Cheers, Markus [1] http://korrekt.org/page/Note:Sex_Distributions_in_Research [2] https://docs.google.com/spreadsheet/ccc?key=0AstQ5xfO-xXGdE9UVkxNc0JMVWJzNmJqNmhPRjc0cnc&usp=sharing ___ Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l
Re: [Wikidata-l] Pushing Wikidata to the next level
Hi -- or better: Heya! -- Lydia: Congratulations on your new role! This is great news for the project, which allows Wikidata to proceed on its important mission in perfect continuity. Denny has made huge contributions to the project in the past 1.5 years -- a task that often involved balancing many forces, both on a technical and on a social level. Without his commitment and energy, we would not be in this encouraging position today. We would also have a lot less funding to draw from. We are really extremely fortunate to continue with a product manager who is perfectly prepared for this important role: someone who has the key skills as well as the specific experience, and who has a profound understanding of what open source and open knowledge are all about. So welcome again to your new job, and all the best for the next steps. Cheers, Markus On 01/10/13 15:30, Lydia Pintscher wrote: (crossposting from http://blog.wikimedia.de/?p=17250) In early 2010 I met Denny and Markus for the first time in a small room at the Karlsruhe Institute of Technology to talk about Semantic MediaWiki, its development and its community. I was intrigued by the idea they'd been pushing for since 2005 - bringing structured data to Wikipedia. So when the time came to assemble the team for the development of Wikidata and Denny approached me to do community communications for it, there was no way I could have said no. The project sounded amazing and the timing was perfect since I was about to finish my studies of computer science. In the one and a half years since then we have achieved something amazing. We've built a great technical base for Wikidata and much more importantly we've built an amazing community around it. We've built the foundation for something extraordinary. On a personal level I could never have dreamed where this one meeting in a small room in Karlsruhe has taken me now. From now on I will be taking over product ownership of Wikidata as its product manager. Up until today we've built the foundation for something extraordinary. But at the same time there are still a lot of things that need to be worked on by all of us together. The areas that we need to focus on now are: * Building trust in our data. The project is still young and the Wikipedia editors and others are still wary of using data from Wikidata on a large scale. We need to build tools and processes to make our data more trustworthy. * Improving the user experience around Wikidata. Building Wikidata to the point where it is today was a tremendous technical task that we achieved in a rather short time. This though meant that in places the user experience has not gotten as much attention. We need to make the experience of using Wikidata smoother. * Making Wikidata easier to understand. Wikidata is a very geeky and technical project. However, to be truly successful, it will need to be easy to get the ideas behind it. These are crucial for Wikidata to have the impact we all want it to have. And we will all need to work on those - both in the development team and in the rest of the Wikidata community. Let's make Wikidata a joy to use and get it used in places and ways we can't even imagine yet. Cheers Lydia ___ Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l
[Wikidata-l] Wikidata Toolkit: call for feedback/support
Dear Wikidatanions (*), I have just drafted a little proposal for creating more tools for external people to work with Wikidata, especially to build services on top of its data [1]. Your feedback and support is needed. Idea: Currently, this is quite hard for people, since we only have WDA for reading/analysing dumps [2] and Wikidata Query as a single web service to ask queries [3]. We should have more support for programmers who want to load, query, analyse, and otherwise use the data. The proposal is to start such a toolkit to enable more work with the data. The plan is to kickstart this project with a small team using Wikimedia's Individual Engagement program. For this we will need your support -- feel free to add your voice to the wiki page [1]. Of course, comments of all sorts are also great -- this email thread will be linked from the page. If you would like to be involved with the project, that's great too; let me know and I can add you to the proposal. The proposal will already be submitted tomorrow, but support should also be possible after that, I hope. Cheers, Markus (*) Do we have a demonym yet? Wikipedian sounds natural, Wikidatan less so. Maybe this should be another thread ... ;-) [1] https://meta.wikimedia.org/wiki/Grants:IEG/Wikidata_Toolkit [2] http://github.com/mkroetzsch/wda [3] http://208.80.153.172/wdq/ -- Markus Kroetzsch, Departmental Lecturer Department of Computer Science, University of Oxford Room 306, Parks Road, OX1 3QD Oxford, United Kingdom +44 (0)1865 283529 http://korrekt.org/ ___ Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l
Re: [Wikidata-l] claims Datatypes inconsistency suspicion
Hi Daniel, if I understand you correctly, you are in favour of equating datavalue types and property types. This would indeed solve the problems at hand. The reason why both kinds of types are distinct in SMW and also in Wikidata is that property types are naturally more extensible than datavalue types. CommonsMedia is a good example of this: all you need is a custom UI and you can handle new data without changing the underlying data model. This makes it easy for contributors to add new types without far-reaching ramifications in the backend (think of numbers, which could be decimal, natural, positive, range-restricted, etc. but would still be treated as a number in the backend). Using fewer datavalue types also improves interoperability. E.g., you want to compare two numbers, even if one is a natural number and another one is a decimal. There is no simple rule for deciding how many datavalue types there should be. The general guideline is to decide on datavalue types based on use cases. I am arguing for distinguishing IRIs and strings since there are many contexts and applications where this is a crucial difference. Conversely, I don't know of any application where it makes sense to keep the two similar (this would have to be something where we compare strings and IRIs on a data level, e.g., if you were looking for all websites with URLs that are alphabetically greater than the postcode of a city in England :-p). In general, however, it will be good to keep the set of basic datavalue types small, while allowing the set of property types to grow. The set of base datavalue types that we use is based on the experience in SMW as well as on existing formats like XSD (which also has many derived types but only a few base types). As for the possible confusion, I think some naming discipline would clarify this. In SMW, there is a stronger difference between both kinds of types, and a fixed schema for property type ids that makes it easy to recognise them. In any case, using string for IRIs does not seem to solve any problem. It does not simplify the type system in general and it does not help with the use cases that I mentioned. What I do not agree with are your arguments about all of this being internal. We would not have this discussion if it were. The data model of Wikidata is the primary conceptual model that specifies what Wikidata stores. You might still be right that some of the implementation is internal, but the arguments we both exchange are not really on the implementation level ;-). Best wishes Markus, offline soon for travelling On 26/08/13 10:35, Daniel Kinzler wrote: On 25.08.2013 19:19, Markus Krötzsch wrote: If we have an IRI DV, considering that URLs are special IRIs, it seems clear that IRI would be the best way of storing them. The best way of storing them really depends on the storage platform. It may be a string or something else. I think the real issue here is that we are exposing something that is really an internal detail (the data value type) instead of the high level information we actually should be exposing, namely property type. I think splitting the two was a mistake, and I think exposing the DV type while making the property type all but inaccessible makes things a lot worse. In my opinion, data should be self-descriptive, so the *semantic* type of the property should be included along with the value. People expect this, and assume that this is what the DV type is. But it's not, and should not be used or abused for this purpose.
Ideally, it should not matter at all to any 3rd party if we use a string or IRI DV internally. The (semantic) property type would be URL, and that's all that matters. I'm quite unhappy about the current situation; we are beginning to see the backlash of the decision not to include the property type inline. If we don't do anything about this now, I fear the confusion is going to get worse. -- daniel ___ Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l
Re: [Wikidata-l] claims Datatypes inconsistency suspicion
Dear Hady, On 22/08/13 14:44, Hady elsahar wrote: Hello Markus, thanks for pointing to the wda code, it's very useful. I guess by looking at the Wikidata glossary, property data types and data value types are the same thing: http://www.wikidata.org/wiki/Wikidata:Glossary#Datatypes This may be a little shallow, but what I saw is that (correct me if I'm mistaken): - they don't use the same names when you search for the datatype of the property item and the value type of the item that uses this property. Another problem is that they decided to represent commonsMedia as strings, for some purpose I don't know; that's why I didn't get it and thought it's some sort of inconsistency. In most cases, however, you can infer the property type from the datavalue type, but not in all. Unfortunately, you do not generally find the property type in a dump before you find its first use. Could you point out why depending on such mappings didn't always work, for just Wikipedia Commons files? 'wikibase-item' = 'wikibase-entityid' 'string' = 'string' 'time' = 'time' 'globe-coordinate' = 'globecoordinate' 'commonsMedia' = 'string' The key is to understand that property types and value types are *not* the same. They match in many cases, but not in all. In the future, there might be more property types that use the same value type. Property types are what the user sees; they define every detail of user interaction and UI. Value types are part of the underlying data model; they define what the content of the data is. For most data processing, you should not need to know the property type. The situation with commonsMedia is a bit bad because it should be a URL rather than a string. What I do in wda is effectively a type conversion from string to URI in this particular case. Maybe we can fix this somehow in the future when URIs are supported as a value datatype. Markus On Thu, Aug 22, 2013 at 11:33 AM, Markus Krötzsch mar...@semantic-mediawiki.org wrote: Hi all, I think one source of confusion here are the overlapping names of property datatypes and datavalue types. Basically, the mapping is as follows right now: [Format: property type = datavalue type occurring in current dumps] 'wikibase-item' = 'wikibase-entityid' 'string' = 'string' 'time' = 'time' 'globe-coordinate' = 'globecoordinate' 'commonsMedia' = 'string' The point is that string on the left is not the same as string on the right. (Also note the lack of a consistent naming scheme for these ids :-/ ...) In most cases, however, you can infer the property type from the datavalue type, but not in all. Unfortunately, you do not generally find the property type in a dump before you find its first use. The wda script's RDF export has code for dealing with this. It remembers all types that it finds (from P entities in the dump), it infers types from values where possible, and it uses the API to find out the type of a property if all else fails (typically, if you find a string value but don't know yet if the property is of type string or commonsMedia). In addition, the script has a hardcoded list of known types that can be extended (there are not so many properties and their types never change, hence one can do this quite easily). You can find all the code at [1]. Cheers, Markus [1] https://github.com/mkroetzsch/wda/blob/master/includes/epTurtleFileWriter.py (esp.
see __getPropertyType() and __fetchPropertyType()) On 21/08/13 21:00, Byrial Jensen wrote: On 21-08-2013 21:09, Hady elsahar wrote: Hello Jeroen, can I get from your words that this page: http://www.wikidata.org/wiki/Special:ListDatatypes is not up to date? If so, how can I get all the datatypes in Wikidata? Pages in the virtual Special namespace are generated by MediaWiki on demand, and are therefore always (in principle - there can be caching in some cases) up to date. string could be anything (so time could be a string), but there's a defined lower level representation of Commons media files. So is it wrong to represent it as string, Time cannot be a string, as there are several components in a time value (time, timezone, precision, calendar model, before and after precisions). I see nothing wrong in storing commonsMedia values as string values. You will know from the property's datatype that the string is a CommonsMedia string. Regards, - Byrial
Re: [Wikidata-l] claims Datatypes inconsistency suspicion
Hi all, I think one source of confusion here are the overlapping names of property datatypes and datavalue types. Basically, the mapping is as follows right now: [Format: property type = datavalue type occurring in current dumps] 'wikibase-item' = 'wikibase-entityid' 'string' = 'string' 'time' = 'time' 'globe-coordinate' = 'globecoordinate' 'commonsMedia' = 'string' The point is that string on the left is not the same as string on the right. (Also note the lack of a consistent naming scheme for these ids :-/ ...) In most cases, however, you can infer the property type from the datavalue type, but not in all. Unfortunately, you do not generally find the property type in a dump before you find its first use. The wda script's RDF export has code for dealing with this. It remembers all types that it finds (from P entities in the dump), it infers types from values where possible, and it uses the API to find out the type of a property if all else fails (typically, if you find a string value but don't know yet if the property is of type string or commonsMedia). In addition, the script has a hardcoded list of known types that can be extended (there are not so many properties and their types never change, hence one can do this quite easily). You can find all the code at [1]. Cheers, Markus [1] https://github.com/mkroetzsch/wda/blob/master/includes/epTurtleFileWriter.py (esp. see __getPropertyType() and __fetchPropertyType()) On 21/08/13 21:00, Byrial Jensen wrote: On 21-08-2013 21:09, Hady elsahar wrote: Hello Jeroen, can I get from your words that this page: http://www.wikidata.org/wiki/Special:ListDatatypes is not up to date? If so, how can I get all the datatypes in Wikidata? Pages in the virtual Special namespace are generated by MediaWiki on demand, and are therefore always (in principle - there can be caching in some cases) up to date. string could be anything (so time could be a string), but there's a defined lower level representation of Commons media files. So is it wrong to represent it as string, Time cannot be a string, as there are several components in a time value (time, timezone, precision, calendar model, before and after precisions). I see nothing wrong in storing commonsMedia values as string values. You will know from the property's datatype that the string is a CommonsMedia string. Regards, - Byrial ___ Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l
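The mapping above is small enough to write down directly; inverting it shows exactly where type inference from values alone breaks down (a minimal sketch):

    # The property type -> datavalue type mapping from the email, as a dict:
    PROPERTY_TO_VALUE_TYPE = {
        'wikibase-item': 'wikibase-entityid',
        'string': 'string',
        'time': 'time',
        'globe-coordinate': 'globecoordinate',
        'commonsMedia': 'string',
    }

    def candidate_property_types(value_type):
        # Inverting the mapping shows why inference can fail: a 'string'
        # value may belong to a 'string' or a 'commonsMedia' property.
        return sorted(p for p, v in PROPERTY_TO_VALUE_TYPE.items() if v == value_type)

    print(candidate_property_types('string'))  # ['commonsMedia', 'string'] -- ambiguous
    print(candidate_property_types('time'))    # ['time'] -- unambiguous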
Re: [Wikidata-l] Exporting RDF from Wikidata?
On 15/08/13 21:38, Dan Brickley wrote: ... FWIW there's also RDF/XML if you use a *.rdf suffix. This btw is of great interest to us over in the schema.org project; earlier today I was showing http://www.wikidata.org/wiki/Special:EntityData/Q199154.rdf to colleagues there... this is a Wikidata description of a particular sport. In schema.org we have a few places that hardcode a short list of well known sports, and we're interested in mechanisms that allow us to hand off to Wikidata for the long tail. So http://schema.org/SportsActivityLocation has 9 hand-designed subtypes; we have been discussing the idea of something like http://schema.org/SportsActivityLocation?sport=Q199154 to integrate Wikidata into the story for other sports. Similar issues arise with religions and places of worship (http://schema.org/PlaceOfWorship). Any thoughts on this from a Wikidata perspective would be great. This is definitely something that we would like to encourage. Wikidata ids are fairly stable (not based on labels or languages) and fairly well grounded (described and named in many languages + linked to many Wikipedia pages, authority files, and external databases). So they should make suitable identifiers. No identifier will ever be reused, but it can happen that a Wikidata item is deleted, in which case it is no longer a suitable identifier. In theory, it can also happen that the data of an item changes so completely that the meaning of the item is different, but this is quite unlikely. One can access historic data fairly easily as long as the item is not deleted completely (not sure if a historic RDF export [by revision number] is planned, but it would not be hard to implement). And of course one would want the identifiers to be somewhat dynamic to capture changes of ideas over time (sports change all the time, e.g., if official rules are modified, but probably one does not want new IDs for every version of football). I am not sure if one needs to use http://schema.org/SportsActivityLocation?sport=Q199154 instead of using http://www.wikidata.org/entity/Q199154 directly. Would these two have different meanings somehow? I guess they could, but there should not be a problem with long-term sustainability of the Wikidata URIs (just in case this is the main reason for creating new URIs here). Is there any prospect of inline RDFa within the main Wikidata per-entity pages? It would be great to have http://schema.org/sameAs in those pages linking to dbpedia, wikipedia, freebase etc. too... This is not currently planned. One interesting starting point could be to identify the Wikidata properties that express same as. For example, many properties link to other data collections by giving IDs (which often correspond to URIs, only that URL datavalues are not quite implemented yet). However, the granularity of other databases is often not the same, and it might not be true that these IDs unambiguously define the identity of the subject. For example, we had on this list a question recently whether Norman Cook should have individual entities for his various synonyms or not; MusicBrainz has several IDs for him based on synonyms, but Wikipedia has only one article about the person. In such cases, links to other datasets should probably not be interpreted as sameAs. We currently use schema.org's about for linking Wikipedia pages to Wikidata ids.
It seems wrong to say that an abstract URI (about a Wikidata entity) is the same as the URL of a Webpage that covers that topic. (This comment is about the links to Wikipedia you mentioned, not about cases with dedicated URIs that are not the web page URLs; the URIs for Wikipedia articles are in a strong sense the Wikidata URIs that we already start from ;-) Btw, it is planned (vaguely) that property pages can hold more information, which could be used to declare identifier properties in the system at some point. But this will still take a while to implement. Markus ___ Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l
Re: [Wikidata-l] Exporting RDF from Wikidata?
On 15/08/13 19:33, Jona Christopher Sahnwaldt wrote: http://www.wikidata.org/entity/Q215607.nt which redirects to http://www.wikidata.org/wiki/Special:EntityData/Q215607.nt The RDF stuff at Wikidata is in flux. The RDF you get probably won't contain all the data that the HTML page shows, and the RDF structure may change. Indeed, the feature is simply not fully implemented yet. The best preview you can get right now is the dump generated by the python script. The plan is to make essentially the same available on a per-item basis via the URIs and URLs as above (in several syntaxes, depending on URL or, when using the URI, content negotiation). Markus On 15 August 2013 20:25, Kingsley Idehen kide...@openlinksw.com wrote: All, How do I obtain an RDF rendition of the Wikidata document http://www.wikidata.org/wiki/Q215607 ? Naturally, I've scoured the Web for examples and I keep on coming up empty :-( -- Regards, Kingsley Idehen Founder CEO OpenLink Software Company Web: http://www.openlinksw.com Personal Weblog: http://www.openlinksw.com/blog/~kidehen Twitter/Identi.ca handle: @kidehen Google+ Profile: https://plus.google.com/112399767740508618350/about LinkedIn Profile: http://www.linkedin.com/in/kidehen ___ Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l
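A quick way to try the per-item exports named above (a sketch, subject to the caveat that the RDF output is still in flux; other suffixes such as .rdf or .json should work the same way):

    from urllib.request import urlopen

    # Fetch the N-Triples rendition of an item via the redirect target
    # named above; urlopen follows the redirect automatically.
    url = 'https://www.wikidata.org/wiki/Special:EntityData/Q215607.nt'
    with urlopen(url) as response:
        triples = response.read().decode('utf-8')

    print('\n'.join(triples.splitlines()[:5]))  # first few triples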
Re: [Wikidata-l] Wikidata RDF export available
On 12/08/13 17:56, Nicolas Torzec wrote: With respect to the RDF export I'd advocate for: 1) an RDF format with one fact per line. 2) the use of a mature/proven RDF generation framework. Optimizing too early based on a limited and/or biased view of the potential use cases may not be a good idea in the long run. I'd rather keep it simple and standard at the data publishing level, and let consumers access data easily and optimize processing to their needs. RDF has several official, standardised syntaxes, and one of them is Turtle. Using it is not a form of optimisation, just a choice of syntax. Every tool I have ever used for serious RDF work (triple stores, libraries, even OWL tools) supports any of the standard RDF syntaxes *just as well*. I do see that there are some advantages in some formats and others in others (I agree with most arguments that have been put forward). But would it not be better to first take a look at the actual content rather than debating the syntactic formatting now? As I said, this is not the final syntax anyway, which will be created with different code in a different programming language. Also, I should not have to run a preprocessing step for filtering out the pieces of data that do not follow the standard… To the best of our knowledge, there are no such pieces in the current dump. We should try to keep this conversation somewhat related to the actual Wikidata dump that is created by the current version of the Python script on github (I will also upload a dump again tomorrow; currently, you can only get the dump by running the script yourself). I know I suggested that one could parse Turtle in a robust way (which I still think one can) but I am not suggesting for a moment that this should be necessary for using Wikidata dumps in the future. I am committed to fixing any error as it is found, but so far I don't get much input in that direction. Note that I also understand the need for a format that groups all facts about a subject into one record, and serializes them one record per line. It sometimes makes life easier for bulk processing of large datasets. But that's a different discussion. As I said: advantages and disadvantages. This is why we will probably have all desired formats at some time. But someone needs to start somewhere. Markus -- Nicolas Torzec. On 8/12/13 1:49 AM, Markus Krötzsch mar...@semantic-mediawiki.org wrote: On 11/08/13 22:29, Tom Morris wrote: On Sat, Aug 10, 2013 at 2:30 PM, Markus Krötzsch mar...@semantic-mediawiki.org wrote: Anyway, if you restrict yourself to tools that are installed by default on your system, then it will be difficult to do many interesting things with a 4.5G RDF file ;-) Seriously, the RDF dump is really meant specifically for tools that take RDF inputs. It is not very straightforward to encode all of Wikidata in triples, and it leads to some inconvenient constructions (especially a lot of reification). If you don't actually want to use an RDF tool and you are just interested in the data, then there would be easier ways of getting it. A single fact per line seems like a pretty convenient format to me. What format do you recommend that's easier to process? I'd suggest some custom format that at least keeps single data values in one line. For example, in RDF, you have to do two joins to find all items that have a property with a date in the year 2010. Even with a line-by-line format, you will not be able to grep this.
So I think a less normalised representation would be nicer for direct text-based processing. For text-based processing, I would probably prefer a format where one statement is encoded on one line. But it really depends on what you want to do. Maybe you could also remove some data to obtain something that is easier to process. Markus ___ Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l
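To illustrate the point about joins versus grepping: with a hypothetical one-statement-per-line format (the layout below is made up for illustration), finding statements with a date in 2010 is a single streaming pass, whereas the normalised RDF encoding needs joins over the reified value nodes:

    import re

    # Hypothetical one-statement-per-line format, e.g.
    #   Q42 P569 "1952-03-11" ...
    # The file name and layout are assumptions for this sketch only.
    DATE_IN_2010 = re.compile(r'"2010-\d\d-\d\d')

    with open('statements-one-per-line.txt', encoding='utf-8') as dump:
        for line in dump:
            if DATE_IN_2010.search(line):
                print(line.rstrip())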
Re: [Wikidata-l] question about 2 different json formats
On 10/08/13 10:29, Byrial Jensen wrote: ... (BTW, the time values seem to be OK again, after many syntax errors in the beginning. But the coordinate values have some strange (probably erroneous?) variations: values where the precision and/or globe is given as null, and values where the globe is given as the string earth instead of an entity). Thanks for the warning. This was something that has been causing problems in the RDF dump too. I am now validating the globe settings more carefully. Cheers, Markus About the inconsistency in the dump file, is there any bug entry created for this? (I can create one, if anyone can point me to the proper place to do that). Not for my sake. I adapted to two entity formats in the dumps immediately when the new format started to appear. Best regards, - Byrial ___ Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l
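A sketch of what such stricter validation can look like, flagging exactly the variations reported above; the field names follow the JSON dump layout, but treat the details as illustrative rather than authoritative:

    # Flag null precision/globe, or a globe given as the plain string
    # "earth" instead of an entity URI.
    def coordinate_problems(value):
        problems = []
        if value.get('precision') is None:
            problems.append('precision is null')
        globe = value.get('globe')
        if globe is None:
            problems.append('globe is null')
        elif not str(globe).startswith('http://www.wikidata.org/entity/Q'):
            problems.append('globe is not an entity URI: %r' % globe)
        return problems

    print(coordinate_problems({'latitude': 51.5, 'longitude': -0.13,
                               'precision': None, 'globe': 'earth'}))
    # ['precision is null', "globe is not an entity URI: 'earth'"]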
Re: [Wikidata-l] Wikidata RDF export available
Good morning. I just found a bug that was caused by a bug in the Wikidata dumps (a value that should be a URI was not). This led to a few dozen lines with illegal qnames of the form w: . The updated script fixes this. Cheers, Markus On 09/08/13 18:15, Markus Krötzsch wrote: Hi Sebastian, On 09/08/13 15:44, Sebastian Hellmann wrote: Hi Markus, we just had a look at your python code and created a dump. We are still getting a syntax error for the turtle dump. You mean just as in at around 15:30 today ;-)? The code is under heavy development, so changes are quite frequent. Please expect things to be broken in some cases (this is just a little community project, not part of the official Wikidata development). I have just uploaded a new statements export (20130808) to http://semanticweb.org/RDF/Wikidata/ which you might want to try. I saw that you did not use a mature framework for serializing the turtle. Let me explain the problem: Over the last 4 years, I have seen about two dozen people (undergraduate and PhD students, as well as Post-Docs) implement simple serializers for RDF. They all failed. This was normally not due to a lack of skill, but due to a lack of time. They wanted to do it quick, but they didn't have the time to implement it correctly in the long run. There are some really nasty problems ahead like encoding or special characters in URIs. I would strongly advise you to: 1. use a Python RDF framework 2. do some syntax tests on the output, e.g. with rapper 3. use a line by line format, e.g. use turtle without prefixes and just one triple per line (It's like NTriples, but with Unicode) Yes, URI encoding could be difficult if we were doing it manually. Note, however, that we are already using a standard library for URI encoding in all non-trivial cases, so this does not seem to be a very likely cause of the problem (though some non-zero probability remains). In general, it is not unlikely that there are bugs in the RDF somewhere; please consider this export as an early prototype that is meant for experimentation purposes. If you want an official RDF dump, you will have to wait for the Wikidata project team to get around to doing it (this will surely be based on an RDF library). Personally, I already found the dump useful (I successfully imported some 109 million triples of some custom script into an RDF store), but I know that it can require some tweaking. We are having a problem currently, because we tried to convert the dump to NTriples (which would be handled by a framework as well) with rapper. We assume that the error is an extra < somewhere (not confirmed) and we are still searching for it since the dump is so big Ok, looking forward to hearing about the results of your search. A good tip for checking such things is to use grep. I did a quick grep on my current local statements export to count the numbers of < and > (this takes less than a minute on my laptop, including on-the-fly decompression). Both numbers were equal, making it unlikely that there is any unmatched < in the current dumps. Then I used grep to check that < and > only occur in the statements files in lines with commons URLs. These are created using urllib, so there should never be any < or > in them. so we can not provide a detailed bug report. If we had one triple per line, this would also be easier, plus there are advantages for stream reading. bzip2 compression is very good as well, no need for prefix optimization. Not sure what you mean here.
Turtle prefixes in general seem to be a Good Thing, not just for reducing the file size. The code has no easy way to get rid of prefixes, but if you want a line-by-line export you could subclass my exporter and overwrite the methods for incremental triple writing so that they remember the last subject (or property) and create full triples instead. This would give you a line-by-line export in (almost) no time (some uses of [...] blocks in object positions would remain, but maybe you could live with that). Best wishes, Markus All the best, Sebastian On 03.08.2013 23:22, Markus Krötzsch wrote: Update: the first bugs in the export have already been discovered -- and fixed in the script on github. The files I uploaded will be updated on Monday when I have a better upload again (the links file should be fine, the statements file requires a rather tolerant Turtle string literal parser, and the labels file has a malformed line that will hardly work anywhere). Markus On 03/08/13 14:48, Markus Krötzsch wrote: Hi, I am happy to report that an initial, yet fully functional RDF export for Wikidata is now available. The exports can be created using the wda-export-data.py script of the wda toolkit [1]. This script downloads recent Wikidata database dumps and processes them to create RDF/Turtle files. Various options are available to customize the output (e.g., to export statements but not references, or to export only texts in English and Wolof). The file
Re: [Wikidata-l] Wikidata language codes
On 10/08/13 11:07, John Erling Blad wrote: The language code no is the metacode for Norwegian, and nowiki was in the beginning used for both Norwegian Bokmål, Riksmål and Nynorsk. The latter split off and made nnwiki, but nowiki continued as before. After a while all Nynorsk content was migrated. Now nowiki has content in Bokmål and Riksmål; the first one is official in Norway and the latter is an unofficial variant. After the last additions to Bokmål there are very few forms that are only legal in Riksmål, so for all practical purposes nowiki has become a pure Bokmål wiki. I think all content in Wikidata should use either nn or nb, and all existing content with no as language code should be folded into nb. It would be nice if no could be used as an alias for nb, as this is the de facto situation now, but it is probably not necessary and could create a discussion with the Nynorsk community. The site code should be nowiki as long as the community does not ask for a change. Thanks for the clarification. I will keep no to mean no for now. What I wonder is: if users choose to enter a no label on Wikidata, what is the language setting that they see? Does this say Norwegian (any variant) or what? That's what puzzles me. I know that a Wikipedia can allow multiple languages (or dialects) to coexist, but in the Wikidata language selector I thought you can only select real languages, not language groups. Markus On 8/6/13, Markus Krötzsch mar...@semantic-mediawiki.org wrote: Hi Purodha, thanks for the helpful hints. I have implemented most of these now in the list on git (this is also where you can see the private codes I have created where needed). I don't see a big problem in changing the codes in future exports if better options become available (it's much easier than changing codes used internally). One open question that I still have is what it means if a language that usually has a script tag appears without such a tag (zh vs. zh-Hans/zh-Hant or sr vs. sr-Cyrl/sr-Latn). Does this really mean that we do not know which script is used under this code (either could appear)? The other question is about the duplicate language tags, such as 'crh' and 'crh-Latn', which both appear in the data but are mapped to the same code. Maybe one of the codes is just phased out and will disappear over time? I guess the Wikidata team needs to answer this. We also have some codes that mean the same according to IANA, namely kk and kk-Cyrl, but which are currently not mapped to the same canonical IANA code. Finally, I wondered about Norwegian. I gather that no.wikipedia.org is in Norwegian Bokmål (nb), which is how I map the site now. However, the language data in the dumps (not the site data) uses both no and nb. Moreover, many items have different texts for nb and no. I wonder if both are still Bokmål, and there is just a bug that allows people to enter texts for nb under two language settings (for descriptions this could easily be a different text, even if in the same language). We also have nn, and I did not check how this relates to no (same text or different?). Cheers, Markus
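Expressed as data, the handling discussed here boils down to a small mapping (a sketch; the direction of the crh folding is a guess for illustration, since the thread only says both tags are mapped to one code):

    # Site-to-language handling as discussed: nowiki content is treated
    # as Bokmål (nb), while plain 'no' labels are kept under 'no' for now.
    SITE_LANGUAGE = {
        'nowiki': 'nb',  # no.wikipedia.org content is Bokmål
        'nnwiki': 'nn',
    }

    # Duplicate tags in the data folded onto one code, as with 'crh' and
    # 'crh-Latn' (folding direction chosen here for illustration only):
    LANGUAGE_ALIASES = {
        'crh': 'crh-Latn',
    }

    def normalize_language(code):
        return LANGUAGE_ALIASES.get(code, code)

    print(SITE_LANGUAGE['nowiki'], normalize_language('crh'))  # nb crh-Latn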
Re: [Wikidata-l] Wikidata RDF export available
Hi Sebastian,

On 09/08/13 15:44, Sebastian Hellmann wrote: Hi Markus, we just had a look at your python code and created a dump. We are still getting a syntax error for the turtle dump.

You mean just as in "at around 15:30 today" ;-)? The code is under heavy development, so changes are quite frequent. Please expect things to be broken in some cases (this is just a little community project, not part of the official Wikidata development). I have just uploaded a new statements export (20130808) to http://semanticweb.org/RDF/Wikidata/ which you might want to try.

I saw that you did not use a mature framework for serializing the Turtle. Let me explain the problem: Over the last 4 years, I have seen about two dozen people (undergraduate and PhD students, as well as Post-Docs) implement simple serializers for RDF. They all failed. This was normally not due to a lack of skill, but due to a lack of time. They wanted to do it quickly, but they didn't have the time to implement it correctly in the long run. There are some really nasty problems ahead, like encoding or special characters in URIs. I would strongly advise you to: 1. use a Python RDF framework; 2. do some syntax tests on the output, e.g. with rapper; 3. use a line-by-line format, e.g. Turtle without prefixes and just one triple per line (it's like NTriples, but with Unicode).

Yes, URI encoding could be difficult if we were doing it manually. Note, however, that we are already using a standard library for URI encoding in all non-trivial cases, so this does not seem to be a very likely cause of the problem (though some non-zero probability remains). In general, it is not unlikely that there are bugs in the RDF somewhere; please consider this export as an early prototype that is meant for experimentation purposes. If you want an official RDF dump, you will have to wait for the Wikidata project team to get around to doing it (this will surely be based on an RDF library). Personally, I already found the dump useful (I successfully imported some 109 million triples with a custom script into an RDF store), but I know that it can require some tweaking.

We are having a problem currently, because we tried to convert the dump to NTriples (which would be handled by a framework as well) with rapper. We assume that the error is an extra '<' somewhere (not confirmed), and we are still searching for it since the dump is so big, so we cannot provide a detailed bug report. If we had one triple per line, this would also be easier, plus there are advantages for stream reading. bzip2 compression is very good as well, no need for prefix optimization.

Ok, looking forward to hearing about the results of your search. A good tip for checking such things is to use grep. I did a quick grep on my current local statements export to count the numbers of '<' and '>' (this takes less than a minute on my laptop, including on-the-fly decompression). Both numbers were equal, making it unlikely that there is any unmatched '<' in the current dumps. Then I used grep to check that '<' and '>' only occur in the statements files in lines with Commons URLs. These are created using urllib, so there should never be any '<' or '>' in them.
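A minimal sketch of such a balance check in Python, assuming a bzip2-compressed Turtle dump; the file name is illustrative:

    # Count '<' and '>' in a bzip2-compressed Turtle dump; if the two
    # totals differ, there is at least one unbalanced angle bracket.
    import bz2

    opens = closes = 0
    with bz2.open("wikidata-statements.ttl.bz2", mode="rt", encoding="utf-8") as dump:
        for line in dump:
            opens += line.count("<")
            closes += line.count(">")

    print("'<':", opens, "'>':", closes, "(should be equal)")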
Not sure what you mean regarding prefix optimization. Turtle prefixes in general seem to be a Good Thing, not just for reducing the file size. The code has no easy way to get rid of prefixes, but if you want a line-by-line export you could subclass my exporter and overwrite the methods for incremental triple writing so that they remember the last subject (or property) and create full triples instead. This would give you a line-by-line export in (almost) no time (some uses of [...] blocks in object positions would remain, but maybe you could live with that). Best wishes, Markus

All the best, Sebastian

On 03.08.2013 23:22, Markus Krötzsch wrote: Update: the first bugs in the export have already been discovered -- and fixed in the script on github. The files I uploaded will be updated on Monday when I have a better upload again (the links file should be fine, the statements file requires a rather tolerant Turtle string literal parser, and the labels file has a malformed line that will hardly work anywhere). Markus
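The subclassing idea can be illustrated with a small, self-contained sketch; the class name and the entity/property URIs below are hypothetical and do not reflect the actual wda exporter API:

    # Hypothetical sketch: instead of emitting Turtle's ';' and ','
    # abbreviations, remember the current subject and predicate and
    # write one complete triple per line (NTriples-like).
    import sys

    class FullTripleWriter(object):
        def __init__(self, out):
            self.out = out
            self.subject = None
            self.predicate = None

        def start_subject(self, subject):
            self.subject = subject

        def start_predicate(self, predicate):
            self.predicate = predicate

        def write_object(self, obj):
            # every line is a complete triple, so the file can be
            # split, streamed, or grepped line by line
            self.out.write("%s %s %s .\n" % (self.subject, self.predicate, obj))

    w = FullTripleWriter(sys.stdout)
    w.start_subject("<http://www.wikidata.org/entity/Q42>")
    w.start_predicate("<http://www.w3.org/2000/01/rdf-schema#label>")
    w.write_object('"Douglas Adams"@en')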
Re: [Wikidata-l] PoC: Combining Wikidata and Clojure logic programming
On 07/08/13 15:40, Mingli Yuan wrote: Also, something similar to Magnus' Wiri: here is a bot developed by us on Sina Weibo (a Twitter-like microblogging provider in China): http://weibo.com/n/%E6%9E%9C%E5%A3%B3%E5%A8%98 We use a dataset from Wikidata with some dirty hacks. It is only a few days' quick work.

Sounds exciting (and we always like to learn about uses of the data), but could you give a short description in English of what is happening there? The above link takes me to a Chinese registration form only ;-) Markus

We are really very excited about the availability of such a big dataset. The potential of Wikipedia and Wikidata is unlimited! Long live free knowledge! Regards, Mingli

On Wed, Aug 7, 2013 at 10:21 PM, Magnus Manske magnusman...@googlemail.com wrote: On Wed, Aug 7, 2013 at 3:20 PM, Mingli Yuan mingli.y...@gmail.com wrote: Very cool, Magnus! Does it do real queries on Wikidata, or is it only a UI thing?

It does use live Wikidata. Reasoning is hacked with a few hardcoded regular expressions ;-)
Re: [Wikidata-l] Related research and a working system
Dear Adam, thanks for the pointer. The paper gives an overview of how to design a wiki-based data curation platform for a specific target community. Some of the insights could also apply to Wikidata, while others won't transfer (e.g., you cannot invite the Wikidata community to a mini-workshop to gather requirements).

What I did not find in the paper are numbers of any kind. How do you know that they manage petabytes of data? I also could not figure out how many users they cater for (e.g., they write: 'A small number of testers we called the "seed community" were involved in the testing and experimentation phase. This community generated the initial wiki contents that could then be used to solicit further contributions from a larger community of users' -- but I cannot find how big this small community and this larger community were; this would be important to understand how similar their scenario is to ours).

Anyway, good to know about this recent work. I will send them an email to make them aware of this thread (and of Wikidata). Cheers, Markus

On 05/08/13 09:35, Adam Wight wrote: Dear comrades, I just learned of a system based on MediaWiki which shares many of the same objectives as Wikidata: collaborative data storage and analysis, tracking of provenance, and facilitating citations, to name a few. I'd like to encourage a dialogue with these scientists; I do not think they are aware of your initiative, and they definitely have valuable practical experience after seeing the real-world use of their system. Currently they are managing several petabytes of data. Research paper by the site creators: http://opensym.org/wsos2013/proceedings/p0301-sowe.pdf Sorry I cannot link to their site itself - it might require an account... -Adam Wight
Re: [Wikidata-l] PoC: Combining Wikidata and Clojure logic programming
Hi Mingli, thanks, this is very interesting, but I think I need a bit more context to understand what you are doing and why. Is your goal to create a library for accessing Wikidata from Clojure (like a Clojure API for Wikidata)? Or is your goal to use logical inference over Wikidata, and you just use Clojure as a tool since it was most convenient?

To your question: * Do we have a long-term plan to evolve Wikidata towards a semantically rich dataset?

There are no concrete designs for adding reasoning features to Wikidata so far (if this is what you mean). There are various open questions, especially related to inferencing over quantifiers. But there are also important technical questions, especially regarding performance. I intend to work out the theory in more detail soon (that is: how should logical rules over the Wikidata data model work in principle?). The implementation then is the next step. I don't think that any of this will be part of the core features of Wikidata soon, but hopefully we can set up a useful external service for Wikidata search and analytics (e.g., to check for property constraint violations in real time instead of using custom code and bots). Cheers, Markus

On 05/08/13 17:30, Mingli Yuan wrote: Hi, folks, After one night's quick work, I have put together a proof of concept to demonstrate that we can combine Wikidata and Clojure logic programming. The source code is here: https://github.com/mountain/knowledge An example of an entity: https://github.com/mountain/knowledge/blob/master/src/entities/albert_einstein.clj Example of types: https://github.com/mountain/knowledge/blob/master/src/meta/types.clj Example of predicates: https://github.com/mountain/knowledge/blob/master/src/meta/properties.clj Example of inference: https://github.com/mountain/knowledge/blob/master/test/knowledge/test.clj Also, we found it very easy to get language versions of the data other than English. So, thanks very much for your great work!

But I found that the semantic layer of Wikidata is shallow: it knows who Einstein's father and children are, but it cannot be inferred automatically from Wikidata that Einstein's father is the grandfather of Einstein's children. So my question is: * Do we have a long-term plan to evolve Wikidata towards a semantically rich dataset? Regards, Mingli
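The missing inference Mingli describes is easy to state as a rule over father/child statements. A minimal illustration in plain Python (rather than Clojure's logic programming), over hypothetical toy data:

    # Illustrative rule: grandfather(X) = father(father(X)).
    # The father relation below is toy data for the example only.
    father = {
        "Eduard Einstein": "Albert Einstein",
        "Hans Albert Einstein": "Albert Einstein",
        "Albert Einstein": "Hermann Einstein",
    }

    def grandfather(person):
        f = father.get(person)
        return father.get(f) if f is not None else None

    assert grandfather("Eduard Einstein") == "Hermann Einstein"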
Re: [Wikidata-l] Wikidata language codes
Hi Purodha, thanks for the helpful hints. I have implemented most of these now in the list on git (this is also where you can see the private codes I have created where needed). I don't see a big problem in changing the codes in future exports if better options become available (it's much easier than changing codes used internally).

One open question that I still have is what it means if a language that usually has a script tag appears without such a tag (zh vs. zh-Hans/zh-Hant, or sr vs. sr-Cyrl/sr-Latn). Does this really mean that we do not know which script is used under this code (either could appear)? The other question is about the duplicate language tags, such as 'crh' and 'crh-Latn', which both appear in the data but are mapped to the same code. Maybe one of the codes is just phased out and will disappear over time? I guess the Wikidata team needs to answer this. We also have some codes that mean the same according to IANA, namely kk and kk-Cyrl, but which are currently not mapped to the same canonical IANA code.

Finally, I wondered about Norwegian. I gather that no.wikipedia.org is in Norwegian Bokmål (nb), which is how I map the site now. However, the language data in the dumps (not the site data) uses both no and nb. Moreover, many items have different texts for nb and no. I wonder if both are still Bokmål, and there is just a bug that allows people to enter texts for nb under two language settings (for descriptions this could easily be a different text, even if in the same language). We also have nn, and I did not check how this relates to no (same text or different?). Cheers, Markus

On 05/08/13 15:41, P. Blissenbach wrote: Hi Markus, Our code 'sr-ec' is at this moment effectively equivalent to 'sr-Cyrl'; likewise, our code 'sr-el' is currently effectively equivalent to 'sr-Latn'. Both might change, once dialect codes of Serbian are added to the IANA subtag registry at http://www.iana.org/assignments/language-subtag-registry/language-subtag-registry

Our code 'nrm' is not being used for the Narom language as ISO 639-3 does, see: http://www-01.sil.org/iso639-3/documentation.asp?id=nrm We rather use it for Norman / Nourmaud, as described in http://en.wikipedia.org/wiki/Norman_language The Norman language is recognized by the Linguist List and many others, but is as of yet not present in ISO 639-3. It should probably be suggested to be added. We should probably map it to a private code meanwhile.

Our code 'ksh' is currently being used to represent a superset of what it stands for in ISO 639-3. Since ISO 639 lacks a group code for Ripuarian, we use the code of the only Ripuarian variety (of dozens) having a code to represent the whole lot. We should probably suggest adding a group code to ISO 639, and at least the dozen+ Ripuarian languages that we are using, and map 'ksh' to a private code for Ripuarian meanwhile.

Note also that for the ALS/GSW and the KSH Wikipedias, page titles are not guaranteed to be in the languages of the Wikipedias. They are often in German instead. Details are to be found in their respective page titling rules. Moreover, for the ksh Wikipedia, unlike some other multilingual or multidialectal Wikipedias, texts are not, or quite often incorrectly, labelled as belonging to a certain dialect. See also: http://meta.wikimedia.org/wiki/Special_language_codes Greetings -- Purodha
Re: [Wikidata-l] Wikidata RDF export available
On 04/08/13 13:17, Federico Leva (Nemo) wrote: Markus Krötzsch, 04/08/2013 12:32: * Wikidata uses be-x-old as a code, but MediaWiki messages for this language seem to use be-tarask as a language code. So there must be a mapping somewhere. Where?

Where I linked it.

Are you sure? The file you linked has mappings from site ids to language codes, not from language codes to language codes. Do you mean to say: if you take only the entries of the form 'XXXwiki' in the list, and extract a language code from the XXX, then you get a mapping from language codes to language codes that covers all exceptions in Wikidata? This approach would give us: 'als': 'gsw', 'bat-smg': 'sgs', 'be_x_old': 'be-tarask', 'crh': 'crh-latn', 'fiu_vro': 'vro', 'no': 'nb', 'roa-rup': 'rup', 'zh-classical': 'lzh', 'zh-min-nan': 'nan', 'zh-yue': 'yue'.

Each of the values on the left here also occurs as a language tag in Wikidata, so if we map them, we use the same tag for things that Wikidata has distinct tags for. For example, Q27 has a label for yue but also for zh-yue [1]. It seems to be wrong to export both of these with the same language tag if Wikidata uses them for different purposes. Maybe this is a bug in Wikidata and we should just not export texts with any of the above codes at all (since they are always given by another tag directly)?

* MediaWiki's http://www.mediawiki.org/wiki/Manual:$wgDummyLanguageCodes provides some mappings. For example, it maps zh-yue to yue. Yet, Wikidata uses both of these codes. What does this mean?

Answers to Nemo's points inline: On 04/08/13 06:15, Federico Leva (Nemo) wrote: Markus Krötzsch, 03/08/2013 15:48: ... Apart from the above, doesn't wgLanguageCode in https://noc.wikimedia.org/conf/highlight.php?file=InitialiseSettings.php have what you need?

Interesting. However, the list there does not contain all 300 sites that we currently find in Wikidata dumps (and it contains some that we do not find there, including things like dkwiki, which seem to be outdated). The full list of sites we support is also found in the file I mentioned above, just after the language list (variable siteLanguageCodes).

Of course not all wikis are there; that configuration is needed only when the subdomain is wrong. It's still not clear to me what codes you are considering wrong.

Well, the obvious: if a language used in Wikidata labels or on Wikimedia sites has an official IANA code [2], then we should use this code. Every other code would be wrong. For languages that do not have any accurate code, we should probably use a private code, following the requirements of BCP 47 for private use subtags (in particular, they should have a single x somewhere). This does not seem to be done correctly by my current code. For example, we now map 'map_bmswiki' to 'map-bms'. While both 'map' and 'bms' are IANA language tags, I am not sure that their combination makes sense. The language should be Basa Banyumasan, but bms is for Bilma Kanuri (and it is a language code, not a dialect code). Note that map-bms does not occur in the file you linked to, so I guess there is some more work to do. Markus

[1] http://www.wikidata.org/wiki/Special:Export/Q27 [2] http://www.iana.org/assignments/language-subtag-registry/language-subtag-registry
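For reference, the exception list from the discussion above as a Python dict; this is transcribed from the thread, not copied from the wda source:

    # Site-derived language-code exceptions as listed above; keys are
    # codes as they appear in Wikidata, values are the codes extracted
    # from the corresponding 'XXXwiki' site entries.
    SITE_LANGUAGE_EXCEPTIONS = {
        'als': 'gsw',
        'bat-smg': 'sgs',
        'be_x_old': 'be-tarask',
        'crh': 'crh-latn',
        'fiu_vro': 'vro',
        'no': 'nb',
        'roa-rup': 'rup',
        'zh-classical': 'lzh',
        'zh-min-nan': 'nan',
        'zh-yue': 'yue',
    }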
[Wikidata-l] Wikidata language codes (Was: Wikidata RDF export available)
Small update: I went through the language list at https://github.com/mkroetzsch/wda/blob/master/includes/epTurtleFileWriter.py#L472 and added a number of TODOs to the most obvious problematic cases. Typical problems are:

* Malformed language codes ('tokipona')
* Correctly formed language codes without any official meaning (e.g., 'cbk-zam')
* Correctly formed codes with the wrong meaning (e.g., 'sr-ec': Serbian from Ecuador?!)
* Language codes with redundant information (e.g., 'kk-cyrl' should be the same as 'kk' according to IANA, but we have both)
* Use of macrolanguages instead of languages (e.g., zh is not Mandarin but just Chinese; I guess we mean Mandarin; less sure about Kurdish ...)
* Language codes with incomplete information (e.g., sr should be sr-Cyrl or sr-Latn, both of which already exist; the same for zh and zh-Hans/zh-Hant, but also for zh-HK [is this simplified or traditional?]).

I invite any language experts to look at the file and add comments/improvements. Some of the issues should possibly also be considered on the implementation side: we don't want two distinct codes for the same thing. Cheers, Markus
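A hedged sketch of the two kinds of fixes these problems suggest: canonicalising to existing IANA codes, and falling back to BCP 47 private-use subtags where no accurate code exists. The private-use spellings below are hypothetical, not an agreed convention:

    # Sketch only; entries and spellings are illustrative.
    CANONICAL = {
        'kk-cyrl': 'kk',        # redundant per IANA (see above)
        'zh-classical': 'lzh',  # Literary Chinese has a real code
    }
    PRIVATE_USE = {
        'tokipona': 'x-tokipona',  # malformed code, no IANA entry: wholly private tag
        'map-bms': 'map-x-bms',    # Basa Banyumasan: private subtag under 'map'
    }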
[Wikidata-l] Wikidata RDF export available
Hi, I am happy to report that an initial, yet fully functional RDF export for Wikidata is now available. The exports can be created using the wda-export-data.py script of the wda toolkit [1]. This script downloads recent Wikidata database dumps and processes them to create RDF/Turtle files. Various options are available to customize the output (e.g., to export statements but not references, or to export only texts in English and Wolof). The file creation takes a few (about three) hours on my machine, depending on what exactly is exported.

For your convenience, I have created some example exports based on yesterday's dumps. These can be found at [2]. There are three Turtle files: site links only, labels/descriptions/aliases only, statements only. The fourth file is a preliminary version of the Wikibase ontology that is used in the exports.

The export format is based on our earlier proposal [3], but it adds a lot of details that had not been specified there yet (namespaces, references, ID generation, compound datavalue encoding, etc.). Details might still change, of course. We might provide regular dumps at another location once the format is stable. As a side effect of these activities, the wda toolkit [1] is also getting more convenient to use. Creating code for exporting the data into other formats is quite easy.

Features and known limitations of the wda RDF export:

(1) All current Wikidata datatypes are supported. Commons-media data is correctly exported as URLs (not as strings).

(2) One-pass processing. Dumps are processed only once, even though this means that we may not know the types of all properties when we first need them: the script queries wikidata.org to find missing information. This is only relevant when exporting statements.

(3) Limited language support. The script uses Wikidata's internal language codes for string literals in RDF. In some cases, this might not be correct. It would be great if somebody could create a mapping from Wikidata language codes to BCP 47 language codes (let me know if you think you can do this, and I'll tell you where to put it).

(4) Limited site language support. To specify the language of linked wiki sites, the script extracts a language code from the URL of the site. Again, this might not be correct in all cases, and it would be great if somebody had a proper mapping from Wikipedias/Wikivoyages to language codes.

(5) Some data excluded. Data that cannot currently be edited is not exported, even if it is found in the dumps. Examples include statement ranks and timezones for time datavalues. I also currently exclude labels and descriptions for simple English, formal German, and informal Dutch, since these would pollute the label space for English, German, and Dutch without adding much benefit (other than possibly for simple English descriptions, I cannot see any case where these languages should ever have different Wikidata texts at all).

Feedback is welcome. Cheers, Markus

[1] https://github.com/mkroetzsch/wda Run python wda-export-data.py --help for usage instructions
[2] http://semanticweb.org/RDF/Wikidata/
[3] http://meta.wikimedia.org/wiki/Wikidata/Development/RDF

-- Markus Kroetzsch, Departmental Lecturer, Department of Computer Science, University of Oxford, Room 306, Parks Road, OX1 3QD Oxford, United Kingdom, +44 (0)1865 283529, http://korrekt.org/
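For readers who want to experiment with the dumps, a minimal sketch of consuming one of the Turtle files with the rdflib library; the file name is illustrative:

    # Load a Turtle export and print a few English labels.
    from rdflib import Graph
    from rdflib.namespace import RDFS

    g = Graph()
    g.parse("wikidata-labels.ttl", format="turtle")

    for subject, label in g.subject_objects(RDFS.label):
        if getattr(label, "language", None) == "en":
            print(subject, label)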
Re: [Wikidata-l] list for bug emails
On 14/04/12 15:38, Gerard Meijssen wrote: Hoi, The Wikidata project is probably the software used by OmegaWiki, the original Wikidata.

Ah, great, this completes the confusion :-D Cheers, Markus

On 14 April 2012 16:12, Jeroen De Dauw jeroended...@gmail.com wrote: Hey, Wikidata WikidataClient WikidataRepo Although the project is called WikiData, the software is called Wikibase. So we should have Wikibase and Wikibase Client for the extensions, and Wikidata for the project, although I'm not sure we really need the latter. Cheers -- Jeroen De Dauw http://www.bn2vs.com Don't panic. Don't be evil. --
Re: [Wikidata-l] Fwd: [Wiki-research-l] Wikidata opinion piece in The Atlantic
On 12/04/12 21:10, Daniel Kinzler wrote: This is an interesting criticism, and there's an excellent retort by Denny in the comments. Just fyi.

Thanks, a very good discussion and a very good answer by Denny. I should have a chat with Mark at some point to check out what he thinks about it (it is a bit ironic that we use international news sources to communicate when sitting in offices that are 500m away from each other ;-) Markus

Original Message: Subject: [Wiki-research-l] Wikidata opinion piece in The Atlantic Date: Tue, 10 Apr 2012 16:50:49 -0700 From: En Pine deyntest...@hotmail.com Reply-To: Research into Wikimedia content and communities wiki-researc...@lists.wikimedia.org To: wiki-researc...@lists.wikimedia.org

Here's an opinion piece, The Problem with Wikidata, by Mark Graham, who is a Research Fellow at the Oxford Internet Institute, which appears on The Atlantic's website. I'm not personally supporting or opposing his views but I found this to be an interesting read. http://www.theatlantic.com/technology/archive/2012/04/the-problem-with-wikidata/255564/ Pine
Re: [Wikidata-l] Spatial data definition
Hi Andreas, thanks for the input. I have drafted the current text about geo-related datatypes, but I am far from being an expert in this area. Our mapping expert in Wikidata is Katie (Aude), who has also been working with OpenStreetMap, but further expert input on this topic would be quite valuable. As in all areas, we need to find a balance between generality and usability, so I am slightly in favour of committing to one SRS for now (as I understand, the data can be converted easily between SRSs, but -- as opposed to other cases where people measure something -- most of the world seems to be happy with one of them). I have now included a link to this thread in an editorial remark in the data model, so we do not forget about this discussion when working out the details. Markus

On 04/04/12 14:16, Andreas Trawoeger wrote: Hi everybody! As the guy who has the honor of shortly receiving some funding from Wikimedia Germany for handling spatial open government data [0], I would like to make some remarks on the current geo definitions in the Wikidata model:

1. A Spatial Reference System Identifier (SRID [1]) definition is missing. Every GeoCoordinatesValue field should either have a corresponding SRID field that defines the used spatial reference system (SRS [2]) or mandate the use of a single SRS like WGS84 [3], which is currently the standard used by GPS, OpenStreetMap and Wikipedia.

2. Geographic shapes should be defined in either Well-known text (WKT [4]) or GeoJSON [5]. WKT is the de facto standard for storing spatial data in a relational database, and GeoJSON is the de facto standard for accessing geo data via the web. Both formats can easily be transformed into each other, so which one you choose pretty much depends on your preferred choice of SQL vs. NoSQL database.

So in summary I would propose the following data model for spatial data:

Geographic locations: Datatype IRI: http://wikidata.org/vocabulary/datatype_geocoords Value: GeoCoordinatesValue Mandatory spatial reference system: EPSG 4326 (WGS 84/GPS) Type: Decimal

Geographic objects: Datatype IRI: http://wikidata.org/vocabulary/datatype_geoobjects Value: GeoObjectsValue Type: GeoJSON [5]

Geographic objects SRID: Datatype IRI: http://wikidata.org/vocabulary/datatype_geoobjects_srid Value: GeoObjectsSridValue Type: EPSG Spatial Reference System Identifier (SRID [1])

That model would allow a structure where every spatial object can have a complex geometry stored in its original geodetic system and still have an easily manageable location in GPS format. cu andreas

[0] http://de.wikipedia.org/wiki/Wikipedia:Community-Projektbudget#2._kartenwerkstatt.at [1] https://en.wikipedia.org/wiki/Spatial_reference_system_identifier [2] https://en.wikipedia.org/wiki/Spatial_reference_system [3] https://en.wikipedia.org/wiki/WGS84 [4] https://en.wikipedia.org/wiki/Well-known_text [5] https://en.wikipedia.org/wiki/GeoJSON
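To make the proposal concrete, a hedged sketch of what such a record could look like; all field names and values are illustrative, not part of the Wikidata data model:

    # Hypothetical record following the proposal above: a simple WGS 84
    # location for display, plus an exact geometry as GeoJSON with an
    # explicit SRID.
    place = {
        "location": {"latitude": 48.8584, "longitude": 2.2945},  # EPSG 4326 (WGS 84)
        "geometry": {  # GeoJSON geometry object
            "type": "Polygon",
            "coordinates": [[
                [2.2935, 48.8578], [2.2955, 48.8578],
                [2.2955, 48.8590], [2.2935, 48.8590],
                [2.2935, 48.8578],  # closed ring
            ]],
        },
        "geometry_srid": 4326,
    }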
Re: [Wikidata-l] SNAK - assertion?
Martynas, what you are proposing below is not W3C-recommended RDF but an extension of triples to quads. As far as I know, this extension is not yet compatible with existing standards such as SPARQL and OWL. Named graphs work with SPARQL, but are mostly used in another way than you suggest. Most RDF database tools would be *very* unhappy to get millions of named graphs in combination with queries that use variables as graph names. The syntax you use is not a W3C standard either. This is not to say that N-Quads aren't a good idea if one can get them to work with the rest of the Semantic Web stack, but it really defeats your own arguments. We are committed to supporting *existing* standards (as we have said many times already), but we will not base our software design on a non-standard RDF variant that works with neither OWL nor SPARQL. Markus

On 06/04/12 13:09, Martynas Jusevicius wrote: Hey Denny, I gave it a shot:

<http://dbpedia.org/resource/France> <http://dbpedia.org/ontology/PopulatedPlace/populationDensity> "116"^^<http://dbpedia.org/datatype/inhabitantsPerSquareKilometre> <http://wikidata.org/graphs/France2012> .
<http://dbpedia.org/resource/France> <http://dbpedia.org/ontology/populationDensity> "116"^^<http://www.w3.org/2001/XMLSchema#double> <http://wikidata.org/graphs/France2012> .
<http://wikidata.org/graphs/France2012> <http://purl.org/dc/terms/date> "2012"^^<http://www.w3.org/2001/XMLSchema#year> <http://wikidata.org/graphs/France2012> .
<http://wikidata.org/graphs/France2012> <http://purl.org/dc/terms/source> _:source <http://wikidata.org/graphs/France2012> .
_:source <http://purl.org/dc/terms/published> "2010"^^<http://www.w3.org/2001/XMLSchema#year> <http://wikidata.org/graphs/France2012> .
_:source <http://purl.org/dc/terms/title> "Bilan demographique"@fr <http://wikidata.org/graphs/France2012> .

The syntax is N-Quads. It does not use reification, but instead named graphs for provenance. The necessary concepts were already present in DBpedia. As you might know, temporal provenance is not the strongest point of RDF. However, conventions and solutions are available, and I am sure implementing them would require far less effort than creating a custom data model from scratch, not to mention the benefits of potential reuse. There's quite some research done on RDF provenance, which is worth looking into if provenance is really a key feature for Wikidata from day one. I see it as something that should work transparently behind the scenes, and therefore could be rolled out later on. You would get much better and more extensive advice than mine on semantic-...@w3.org -- the only prerequisite is willingness to cooperate.

RDF's strength is that it solves data integration problems by pivotal conversion, reducing the number of model transformations from quadratic to linear: http://en.wikipedia.org/wiki/Data_conversion#Pivotal_conversion A custom data model brings up questions which already have an answer in the Semantic Web stack:

# can data from different Wikidata instances be merged or interlinked natively?
# is there a native query language? In case of SQL, how performant will it be given many JOINs and the planned use of provenance?
# what and how many custom serialization formats and API mechanisms will have to follow?

Stacking one custom solution on top of another can eventually result in huge costs. I honestly think the energy of Wikidata could be directed in a more productive way.
Martynas graphity.org

2012/4/5 Denny Vrandečić denny.vrande...@wikimedia.de: Dear Martynas, if you try to model the following statement in RDF: "The population density of France, as of a 2012 estimate, is 116 per square kilometer, according to the Bilan demographique 2010." you might notice that RDF requires a reification of the statement. The data model that you have seen provides us with an abstract and concise way to talk about these reifications (i.e. via the statement model, just as in RDF). We still have not finished the document describing how to map our data model to OWL/RDF, but we have thought about this the whole time while discussing the data model. But if you find a simpler and more RDFish way to express the above statement, please feel free to enlighten me. I would indeed be very interested. Cheers, Denny

2012/4/5 Martynas Jusevicius marty...@graphity.org: it doesn't look like reuse of existing concepts and standards is a priority for this project. One cannot build a Semantic Web application by ignoring its main building block, which is the RDF data model.

-- Project director Wikidata, Wikimedia Deutschland e.V. | Eisenacher Straße 2 | 10777 Berlin Tel. +49-30-219 158 26-0 | http://wikimedia.de Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e.V. Eingetragen im Vereinsregister des Amtsgerichts Berlin-Charlottenburg unter der Nummer 23855 B. Als gemeinnützig anerkannt durch das Finanzamt für Körperschaften I Berlin, Steuernummer 27/681/51985.
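For experimentation, the named-graph modelling from the quads above can be reproduced with rdflib's ConjunctiveGraph; a minimal sketch, using the hypothetical graph URI from the thread (not an endorsement of either position):

    # Build the France example as quads and serialize as N-Quads.
    from rdflib import ConjunctiveGraph, Literal, Namespace, URIRef
    from rdflib.namespace import XSD

    DBR = Namespace("http://dbpedia.org/resource/")
    DBO = Namespace("http://dbpedia.org/ontology/")

    g = ConjunctiveGraph()
    ctx = g.get_context(URIRef("http://wikidata.org/graphs/France2012"))
    ctx.add((DBR["France"], DBO["populationDensity"],
             Literal("116", datatype=XSD.double)))

    print(g.serialize(format="nquads"))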
Re: [Wikidata-l] Data_model: Metamodel: Wikipedialink
On 04/04/12 23:23, Gregor Hagedorn wrote: Wikidata can (and probably will) store information about each moon of Uranus, e.g., its mass.

It probably does not make sense to store the mass of 'Moons of Uranus' if there is such an article. It does not help to know that the article 'Moons of Uranus' also talks (among other things) about some moon that has a particular mass: you need to know what *exactly* you are talking about to exploit this data. An article on 'Moons of Uranus' could still (eventually) embed Wikidata data to improve its display, but this data must refer to individual moons, not to the article as a whole.

The problem I see is that you have no definition to which real object the data are tied. We agree that the problem is not the interwiki links per se. It is what results from it. How do we tie data to a Wikidata page when we don't know what it is about?

This is a hard question. The best answer I can come up with now (on the bus to Oxford) is as follows: the meaning of Wikidata items is subject to social agreement, based on shared experience, communication, and human-language documentation. The latter is provided in labels and descriptions, in Wikipedia articles that are connected to a Wikidata item, and also in Wikidata property pages that document properties.

I know that this may not be a satisfactory answer to your question of how we can *really* *know* what a Wikidata item is about. If you want to dig deeper into this issue, there is a lot of interesting literature, which can give you many more details than I can. What we are dealing with is the well-known philosophical problem of /grounding/. In essence, the state of discussion boils down to the following: there is no known way of connecting the symbols of a purely symbolic system (such as a computer program) to real-world objects in a formal way. Going deeper into the discussion reveals that there is also no agreed-upon way to clarify the meaning of 'real' and 'object' in the first place. In spite of all this, humans somehow manage to understand each other, which brings us to the point of how amazing they all are :-) Wikidata is but a humble technical tool that provides an environment for articulating and (I hope) improving this understanding in a novel way. This cannot provide a formal grounding, but it might come as close to this ideal as we have gotten yet. Regards, Markus

-- Dr. Markus Kroetzsch, Department of Computer Science, University of Oxford, Room 306, Parks Road, OX1 3QD Oxford, United Kingdom, +44 (0)1865 283529, http://korrekt.org/
Re: [Wikidata-l] Notability in Wikidata
In general, policies for notability in Wikidata will be governed by the community of (all) Wikidata editors. On the technical side, we aim to achieve two things:

* The system should be able to handle a lot of data.
* The interfaces and data access features should minimize the negative impact that additional (correct but not very important) data has on usage.

Of course, both goals have their limits, and there will always be good (technical or social) reasons to not include everything. We would rather like to support linking and data integration with external databases than suggest that *every* fact of the world be copied to Wikidata. Markus

On 31/03/12 20:22, emijrp wrote: Hi all; I'm thinking about notability in Wikidata and how it may conflict with Wikipedia's current policies and community conceptions. Will Wikidata allow creating entities for small villages, asteroids, galaxies, stars, species, etc., that are not allowed today at Wikipedia? Including those that don't have an article in any Wikipedia? I will be happy if so. Regards, emijrp

-- Dr. Markus Kroetzsch, Department of Computer Science, University of Oxford, Room 306, Parks Road, OX1 3QD Oxford, United Kingdom, +44 (0)1865 283529, http://korrekt.org/