Amir,

What is the false positive rate of your algorithm when dealing with
fictitious humans and (non-fictitious) non-human organisms?  That is, how
often does your program classify such non-humans as humans?

Regarding the latter, note that items about individual dogs, elephants,
chimpanzees and even trees can use properties that are otherwise extremely
skewed towards humans.  For example, Prometheus (Q590010) [1], an extremely
old tree, has claims for *date of birth* (P569), *date of death* (P570),
and even *killed by* (P157).  Non-human animals can also have kinship claims
(e.g. *mother*, *brother*, *child*), among other properties typically used on
humans.
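
For concreteness, one way to measure that would be to run the classifier over
a held-out set of items already known not to be (real) humans and count how
often it says "human".  A minimal sketch (the classify() function below is
hypothetical and stands in for whatever Kian outputs):

    # Sketch: false positive rate on a labelled set of non-human items.
    # classify(qid) is assumed to return "human", "not human" or "undetermined".
    def false_positive_rate(non_human_qids, classify):
        decisions = [classify(q) for q in non_human_qids]
        decided = [d for d in decisions if d != "undetermined"]
        if not decided:
            return 0.0
        return sum(1 for d in decided if d == "human") / float(len(decided))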

Best,
Eric

https://www.wikidata.org/wiki/User:Emw

1.  Prometheus.  https://www.wikidata.org/wiki/Q590010

On Sat, Mar 7, 2015 at 1:44 PM, Amir Ladsgroup <ladsgr...@gmail.com> wrote:

> Hey Markus,
> Thanks for your insight :)
>
> On Sat, Mar 7, 2015 at 9:52 PM, Markus Krötzsch <
> mar...@semantic-mediawiki.org> wrote:
>
>> Hi Amir,
>>
>> In spite of all due enthusiasm, please evaluate your results (with
>> humans!) before making automated edits. In fact, I would contradict Magnus
>> here and say that such an approach would best be suited to provide
>> meaningful (pre-filtered) *input* to people who play a Wikidata game,
>> rather than bypassing the game (and humans) altogether. The expected error
>> rates are quite high for such an approach, but it can still save a lot of
>> work for humans.
>>
> There is a "certainty factor": by acting only when that certainty is high
> enough, it can save a lot of work without making such errors.
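> As a rough sketch of what I mean (the thresholds below are made up):
>
>     # Only edit automatically when the network is very certain; otherwise
>     # hand the item to humans (e.g. via the Wikidata game) or skip it.
>     def route(certainty, edit_threshold=0.99, suggest_threshold=0.7):
>         if certainty >= edit_threshold:
>             return "edit automatically"
>         if certainty >= suggest_threshold:
>             return "suggest to a human reviewer"
>         return "skip"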
>
>
>> As for the next steps, I would suggest that you have a look at the works
>> that others have done already. Try Google Scholar:
>>
>> https://scholar.google.com/scholar?q=machine+learning+wikipedia
>>
>> As you can see, there are countless works on using machine learning
>> techniques on Wikipedia, both for information extraction (e.g.,
>> understanding link semantics) and for things like vandalism detection. I am
>> sure that one could get a lot of inspiration from there, both on potential
>> applications and on technical hints on how to improve result quality.
>>
> Yes, I will definitely look at them, thanks.
>
>
>> You will find that people are using many different approaches in these
>> works. The good old ANN is still a relevant algorithm in practice, but
>> there are many other techniques, such as SVMs, Markov models, or random
>> forests, which have been found to work better than ANNs in many cases. I am
>> not saying that a three-layer feed-forward ANN cannot do some jobs as well,
>> but you should not restrict yourself to one ML approach when you have a
>> whole arsenal of algorithms available, most of them pre-implemented in
>> libraries (the first Google hit lists a lot of relevant projects:
>> http://daoudclarke.github.io/machine%20learning%20in%20practice/2013/10/08/machine-learning-libraries/).
>> I would certainly
>> recommend that you don't implement any of the standard ML algorithms from
>> scratch.
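>>
>> For instance, with a library like scikit-learn, comparing a few of these on
>> the same data takes only a handful of lines (just a sketch on toy data; in
>> practice X and y would be your feature matrix and labels):
>>
>>     # Sketch: try several off-the-shelf classifiers on the same features.
>>     import numpy as np
>>     from sklearn.ensemble import RandomForestClassifier
>>     from sklearn.svm import SVC
>>
>>     X = np.random.rand(200, 10)          # toy feature matrix
>>     y = np.random.randint(0, 2, 200)     # toy binary labels
>>     X_train, y_train, X_test, y_test = X[:150], y[:150], X[150:], y[150:]
>>     for clf in (RandomForestClassifier(n_estimators=100), SVC()):
>>         clf.fit(X_train, y_train)
>>         print(clf.__class__.__name__, clf.score(X_test, y_test))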
>>
> I use the backpropagation algorithm, and for my personal work I use Octave,
> but for Wikipedia work I use Python, for two main reasons: it integrates
> with other Wikipedia-related tools like pywikibot, and Octave and Matlab
> perform badly on big data sets. I had to write those parts from scratch
> because I couldn't find a suitable library in Python; even algorithms like
> BFGS were missing (I could find something in scipy, but I wasn't sure it
> worked correctly, and there was no documentation).
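>
> For reference, the scipy routine in question is called roughly like this
> (a minimal sketch on a toy function, not the code I actually use):
>
>     from scipy.optimize import minimize, rosen, rosen_der
>     result = minimize(rosen, x0=[1.3, 0.7, 0.8], method='BFGS', jac=rosen_der)
>     print(result.x, result.fun)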
>
>> In practice, the most challenging task for successful ML is often feature
>> engineering: the question of which features you use as input to your
>> learning algorithm. This is far more important than the choice of
>> algorithm. Wikipedia in particular offers you so many relevant pieces of
>> information with each article that are not just mere keywords (links,
>> categories, in-links, ...), and it is not easy to decide which of these to
>> feed into your learner. This will be different for each task you solve
>> (subject classification is fundamentally different from vandalism
>> detection, and even different types of vandalism would require very
>> different techniques). You should pick hard or very large tasks to make
>> sure that the tweaking you need in each case takes less time than you would
>> need as a human to solve the task manually ;-)
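>>
>> As a trivial illustration of what "features" means here, a bag-of-categories
>> encoding of an article could look like this (only a sketch; the vocabulary
>> is made up and would in reality be derived from the data):
>>
>>     # Sketch: turn an article's categories into a fixed-length 0/1 vector.
>>     vocabulary = ["Living people", "1972 births", "Rivers of Germany"]
>>     def category_features(article_categories):
>>         cats = set(article_categories)
>>         return [1 if c in cats else 0 for c in vocabulary]
>>
>>     print(category_features(["Living people", "1972 births"]))  # [1, 1, 0]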
>>
> Yes, feature engineering is the most important thing and it can be tricky,
> but feature engineering in Wikidata is a lot easier than in Wikipedia (and
> Wikipedia itself is easier than other places). Anti-vandalism bots are a lot
> easier in Wikidata than in Wikipedia, because edits in Wikidata are limited
> to certain kinds (like removing a sitelink, etc.), which is not the case in
> Wikipedia.
>
>
>> Anyway, it's an interesting field, and we could certainly use some effort
>> to exploit the countless works in this field for Wikidata. But you should
>> be aware that this is no small challenge and that there is no universal
>> solution that will work well even for all the tasks that you have mentioned
>> in your email.
>>
> Of course. I have spent a lot of time studying this, and I would be happy if
> anyone who knows about neural networks or AI contributed too.
>
>
>> Best wishes,
>>
>> Markus
>>
>>
>> On 07.03.2015 18:21, Magnus Manske wrote:
>>
>>> Congratulations on this bold step towards the Singularity :-)
>>>
>>> As for tasks, basically everything us mere humans do in the Wikidata
>>> game:
>>> https://tools.wmflabs.org/wikidata-game/
>>>
>>> Some may require text parsing. Not sure how to get that working; haven't
>>> spent much time with (artificial) neural nets in a while.
>>>
>>>
>>>
>>> On Sat, Mar 7, 2015 at 12:36 PM Amir Ladsgroup <ladsgr...@gmail.com
>>> <mailto:ladsgr...@gmail.com>> wrote:
>>>
>>>     Some useful tasks that I'm looking for a way to do are:
>>>     * Anti-vandal bot (or how we can quantify an edit).
>>>     * Auto-labeling for humans (that's the next task).
>>>     * Add more :)
>>>
>>>
>>>     On Sat, Mar 7, 2015 at 3:54 PM, Amir Ladsgroup <ladsgr...@gmail.com
>>>     <mailto:ladsgr...@gmail.com>> wrote:
>>>
>>>         Hey,
>>>         I spent the last few weeks working on this with the lights off
>>>         [1], and now it's ready to work!
>>>
>>>         Kian is a three-layer neural network with a flexible number of
>>>         inputs and outputs. So if we can parameterize a job, we can teach
>>>         him easily and get the job done.
>>>
>>>         For example, as the first job, we want to add P31:Q5 (human)
>>>         to Wikidata items based on the categories of their articles in
>>>         Wikipedia. The only thing we need to do is get a list of items
>>>         with P31:Q5 and a list of items that are not humans (P31 exists
>>>         but does not include Q5), then get the category links from any
>>>         wiki we want [2], and finally feed these files to Kian and let
>>>         him learn. Afterwards, if we give Kian other articles and their
>>>         categories, he classifies them as human, not human, or failed
>>>         to determine. As a test I gave him the categories of ckb wiki
>>>         (a small wiki) and it worked pretty well; now I'm creating the
>>>         training set from the German Wikipedia, and the next step will
>>>         be the English Wikipedia. The number of P31:Q5 claims will
>>>         increase drastically this week.
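>>>
>>>         Roughly, the data-preparation step looks like this (only a
>>>         sketch with made-up names; the real code is in Kian):
>>>
>>>             # Sketch: pair each item's categories with a human/not-human
>>>             # label, taken from the two P31 lists described above.
>>>             def build_training_set(item_categories, humans, non_humans):
>>>                 examples = []
>>>                 for qid, cats in item_categories.items():
>>>                     if qid in humans:
>>>                         examples.append((cats, 1))
>>>                     elif qid in non_humans:
>>>                         examples.append((cats, 0))
>>>                 return examples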
>>>
>>>         I would love comments or ideas for tasks that Kian can do.
>>>
>>>
>>>         [1]: Because I love surprises
>>>         [2]: "select pp_value, cl_to from page_props join categorylinks
>>>         on pp_page = cl_from where pp_propname = 'wikibase_item';"
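>>>         (this returns, for each article, its Wikidata item ID paired
>>>         with one of its category names)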
>>>         Best
>>>         --
>>>         Amir
>>>
>>>
>>>
>>>
>>>     --
>>>     Amir
>>>
>>
>
>
>
> --
> Amir
>
>
>
_______________________________________________
Wikidata-l mailing list
Wikidata-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-l
