Re: [Wikidata-l] Wikidata RDF Issues

2013-10-16 Thread Paul A. Houle
For a while I've noticed that your messages don't show up properly in Windows Live Mail. Many circumstances are pushing me towards Gmail, but you should correct this, because you can assume that you hear about a problem maybe 1% of the time that people have one.

Re: [Wikidata-l] Application: sexing people by name/research gender bias

2013-10-13 Thread Paul A. Houle
Just as a suggestion, you can turn these kinds of numbers into a probability distribution using the beta distribution. If you use (1,1) as a prior you get something like beta(251,1) for the distribution of the probability that somebody named "Aaron" is male.
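
A minimal sketch of the update described above, in Python. The counts (250 people named "Aaron" recorded as male, 0 as female) are an assumption, chosen only because they turn a (1,1) prior into beta(251,1); the function names are hypothetical. The closed-form lower bound is valid only when the second parameter is 1, where the CDF is p**alpha.

```python
from fractions import Fraction


def beta_posterior(successes, failures, prior=(1, 1)):
    """Add observed counts to a Beta prior and return the posterior (alpha, beta)."""
    prior_alpha, prior_beta = prior
    return prior_alpha + successes, prior_beta + failures


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution, kept exact as a fraction."""
    return Fraction(alpha, alpha + beta)


def lower_bound_when_beta_is_one(alpha, tail=0.05):
    """Value p such that P(X < p) == tail for Beta(alpha, 1), whose CDF is p**alpha."""
    return tail ** (1.0 / alpha)


# Assumed counts: 250 male, 0 female (not taken from the original message).
alpha, beta = beta_posterior(250, 0)
print("posterior: beta(%d,%d)" % (alpha, beta))
print("posterior mean:", float(beta_mean(alpha, beta)))
print("95% of the mass lies above:", lower_bound_when_beta_is_one(alpha))
```

The point of keeping the whole distribution rather than a single ratio is that a name seen only a handful of times gives a much wider posterior than one seen hundreds of times, even when the raw ratio is the same.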

Re: [Wikidata-l] The Day the Knowledge Graph Exploded

2013-08-23 Thread Paul A. Houle
I’d say that Wikidata is almost implied by the fundamental flaw of DBpedia: since DBpedia is based on parsing inexact and varied markup, there is a lot of complexity in getting the accuracy high, particularly because it’s hard to interact with Wikipedia with an automated syste

Re: [Wikidata-l] Make Commons a wikidata client

2013-08-13 Thread Paul A. Houle
I’d like to see assertions of the sort “Picture B represents topic X” in Commons. One can easily infer this for some pictures by noticing that “Picture B is included in the encyclopedia entry for topic X”, but often there are so many pictures of the topic that they aren’t all included in the

Re: [Wikidata-l] Best practices for large RDF dumps, was: Re: Wikidata RDF export available

2013-08-12 Thread Paul A. Houle
My feelings are strong towards one-line-per-fact. Large RDF data sets have validity problems, and the difficulty of convincing publishers that this matters indicates that this situation will continue. I’ve thought a bit about the problem of the “streaming converter from Turtle to
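
One practical consequence of a one-line-per-fact format such as N-Triples is that each line can be checked on its own, so a bad fact can be set aside without losing the rest of the dump. The sketch below illustrates that idea in Python. The regular expression is only a rough approximation of the N-Triples grammar, not a conforming parser, and the function name and sample data are hypothetical.

```python
import re

# Rough approximations of N-Triples terms (an assumption, not the full grammar).
_IRI = r'<[^<>"{}|^`\\\s]*>'
_BNODE = r'_:[A-Za-z0-9_][A-Za-z0-9_.-]*'
_LITERAL = (
    r'"(?:[^"\\\n\r]|\\.)*"'
    r'(?:@[A-Za-z]+(?:-[A-Za-z0-9]+)*|\^\^' + _IRI + r')?'
)
TRIPLE = re.compile(
    r'^\s*(?:' + _IRI + '|' + _BNODE + r')\s+' + _IRI
    + r'\s+(?:' + _IRI + '|' + _BNODE + '|' + _LITERAL + r')\s*\.\s*$'
)


def split_valid(lines):
    """Return (good, bad) lists of (line_number, text); blank and comment lines are skipped."""
    good, bad = [], []
    for number, line in enumerate(lines, start=1):
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        target = good if TRIPLE.match(line) else bad
        target.append((number, line.rstrip("\n")))
    return good, bad


sample = [
    '<http://example.org/Q1> <http://example.org/label> "universe"@en .\n',
    '<http://example.org/Q2> <http://example.org/label> "broken literal .\n',
    '<http://example.org/Q3> <http://example.org/sameAs> <http://example.org/Q1> .\n',
]
good, bad = split_valid(sample)
print("kept lines:", [number for number, _ in good])
print("rejected lines:", [number for number, _ in bad])
```

Because the check never needs state from earlier lines, the same loop works on a file handle of any size and can be run in parallel over chunks split at line boundaries.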

Re: [Wikidata-l] Wikidata RDF export available

2013-08-09 Thread Paul A. Houle
Over time people have gotten the message that you shouldn't write XML like System.out.println(""+someString+"") because it is something that usually ends in tears. Although (most) RDF toolkits are like XML toolkits in that they choke on invalid data, people who write RDF seem to
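
The failure mode of building XML by string concatenation can be shown in a few lines. This is a Python analogue of the println pattern above, not the original code; the element name `name`, the helper names, and the sample value are all hypothetical. It uses only the standard library: `xml.sax.saxutils.escape` and `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def naive_element(tag, text):
    """Concatenate markup and data directly, with no escaping."""
    return "<" + tag + ">" + text + "</" + tag + ">"


def escaped_element(tag, text):
    """Same concatenation, but with &, < and > in the data escaped first."""
    return "<" + tag + ">" + escape(text) + "</" + tag + ">"


def is_well_formed(xml_text):
    """Return True if the standard-library parser accepts the text."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


value = "AT&T <Research>"  # hypothetical data containing markup characters
print(is_well_formed(naive_element("name", value)))
print(is_well_formed(escaped_element("name", value)))
print(ET.fromstring(escaped_element("name", value)).text)
```

The same reasoning applies to RDF serializations: literals and IRIs need the escaping rules of the target syntax, which is why writing through a serializer is usually safer than printing strings.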

Re: [Wikidata-l] Accelerating software innovation with Wikidata and improved Wikicode

2013-07-08 Thread Paul A. Houle
Here are my 2 cents. I have paid my dues writing CRUD apps for business. They all want the same thing, something that keeps track of entities and controls how the organization interacts with those entities. In one year, for instance, I worked on systems for an academic department and a lo

Re: [Wikidata-l] A solution with finality is needed for P107 - maintype (GND)

2013-07-01 Thread Paul A. Houle
I would say that GND is a “good enough” answer. Most named entities are persons, organizations, events, creative works and places and these are all mutually exclusive. There ought to be a system interlock to prevent confusion between them. “Organism Classification” or whatever you

Re: [Wikidata-l] Geoccordinates are live

2013-06-13 Thread Paul A. Houle
You’ll do better dealing with bad coordinates if your system can recognize how bad particular cases are. The worst error I see in Wikipedia is that sometimes people get east and west confused, so there is this mirror image of Europe reflected across the U.K. You find cute little Cze
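
One way to recognize the east/west mix-up is to test whether a point that falls outside its expected region lands inside it once the sign of the longitude is flipped, which is the mirror image across the prime meridian. The Python sketch below assumes you already know a rough bounding box for where each entity should be; the box, the coordinates, and the names are hypothetical illustrations.

```python
from typing import NamedTuple


class BBox(NamedTuple):
    """Axis-aligned bounding box in decimal degrees."""
    south: float
    west: float
    north: float
    east: float

    def contains(self, lat, lon):
        return self.south <= lat <= self.north and self.west <= lon <= self.east


def classify(lat, lon, expected):
    """Label a coordinate as 'ok', 'east_west_flipped', or 'unexplained'."""
    if expected.contains(lat, lon):
        return "ok"
    if expected.contains(lat, -lon):
        # Mirroring across the prime meridian repairs it: likely an E/W mix-up.
        return "east_west_flipped"
    return "unexplained"


# Assumed rough box for central Europe, for illustration only.
CENTRAL_EUROPE = BBox(south=45.0, west=5.0, north=55.0, east=25.0)

print(classify(48.21, 16.37, CENTRAL_EUROPE))
print(classify(48.21, -16.37, CENTRAL_EUROPE))
print(classify(40.0, -75.0, CENTRAL_EUROPE))
```

Separating the repairable case from the unexplained one gives a crude measure of how bad each error is: a flipped sign can be corrected automatically or queued for a quick review, while an unexplained point needs a human.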

Re: [Wikidata-l] Visualisations of The Most Unique Wikipedias According to Wikidata

2013-06-13 Thread Paul A. Houle
I think Poland may do better than average because Polish people, out of national pride, have made a special effort to be well documented in English Wikipedia and represent a Polish point-of-view on topics like the city of Gdansk. One fascinating thing about Wikidata is that it provides

Re: [Wikidata-l] Wikidata dumps

2013-05-23 Thread Paul A. Houle
I took a look at that and was concerned about the plan to release a future data dump in the RDF/XML format. Most people these days think of RDF/XML as obsolete and the future is in the Turtle family of languages. RDF/XML has various problems: it can't express all legal RDF statements and it

Re: [Wikidata-l] Question about wikipedia categories.

2013-05-07 Thread Paul A. Houle
Statistical methods can deal with black swans, but you've got to get away from normal distributions and also model the risk that your model is wrong. Since training sets come from the same place sausage comes from, training sets in machine learning rarely teach the algorithm the correct

Re: [Wikidata-l] Question about wikipedia categories.

2013-05-06 Thread Paul A. Houle
From my viewpoint, biases are an issue of statistical sampling. Wikipedia is an encyclopedia by humans for humans so of course it has an anthropocentric background, in which the mass of all the concepts swirling around the Earth like an atmosphere curves the graph, keeping the Sun in o

Re: [Wikidata-l] less hard-core version of the data model

2012-06-21 Thread Paul A. Houle
On 6/20/2012 6:39 AM, Lydia Pintscher wrote: > Heya folks :) > > We've published a version of the data model that is less technical and > that should be easier to understand than the very detailed existing > one. You can find it at > http://meta.wikimedia.org/wiki/Wikidata/Notes/Data_model_primer >

[Wikidata-l] Tail end testing & quality

2012-06-13 Thread Paul A. Houle
Hey guys, I think for a long time semantic projects have focused on getting data out but haven't incorporated the 'voice of the consumer'; yet, if you think of data quality as 'suitable for customer requirements' instead of 'process conforms to specification,' this is the first st