Today I made a couple of patches that should address most of the reported problems, as well as handle RTL languages and a multilingual blacklist. I'm mostly relying on some Unicode magic that is quite well hidden in some obscure libraries; we'll see if it works. :)

In case it's not clear, for now I'm focusing on the *MediaWiki* side of the matter; the Wikimedia side, i.e. where to use what and how, is something we'll worry about when we actually have this option (or others) available in the codebase.

A couple questions below.

P. Blissenbach, 31/03/2014 17:13:
> captchas having two lines
> of identical text [...] and accept either input.

This would need to be filed as a separate enhancement request.

Shimmin, 31/03/2014 20:02:
> If you actually want the captchas to make any sense in terms of word
> combination and construction, that would be a whole different issue.
> There's inflection, rules on what happens when words are run together
> (spelling changes for one), and so on.

I suppose you're only talking about the morphological side here, right? The current patch contains a couple of lines to handle hyphenation for Finnish, because it was originally provided by Nikerabbit, but we're definitely not going to build a universal grammar of univerbation in a MediaWiki script. Unless someone comes up with a general solution, I think we'll drop that part.

If this turns out to be confusing, I'd rather just show the two (or N) words as separate words. What do you think? This can be done in a separate patch; once we introduce some other security improvements, I think the challenge of identifying where one word ends and the next starts may become redundant.


> Quite a few of the l look like i in this font, which seems problematic.

This is indeed a problem with sans-serif fonts, but the broad majority thinks they are better. We can try to pick clearer fonts, but most of the help will come from the words being familiar to humans. I may upload more tests with this font, though: https://commons.wikimedia.org/wiki/File:AndBasR.pdf

> Should this be "leigh"?

Yes. If incorrect, please edit: https://en.wiktionary.org/?oldid=23059687


> Looks like "neuscanshoil" with a random -y added, a hangover from
> English behaviour?

Same problem as with Malayalam and others; the latest version will avoid combining single letters with other words.
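
To make the idea concrete, here is a rough Python sketch of skipping single-letter entries before joining two words. It is not the code in the patch: the names `base_length` and `pick_pair` are made up for illustration, and counting non-mark code points is only an approximation of "one letter" that I am assuming is good enough for scripts with combining signs.

```python
import random
import unicodedata


def base_length(word):
    """Count code points that are not combining marks (approximate letter count)."""
    return sum(1 for ch in word if not unicodedata.category(ch).startswith("M"))


def pick_pair(words, rng=random):
    """Join two distinct dictionary entries, ignoring single-letter ones."""
    # Assumption: stray suffixes such as "-y" come from one-letter entries.
    candidates = [w for w in words if base_length(w) > 1]
    if len(candidates) < 2:
        raise ValueError("need at least two multi-letter words")
    first, second = rng.sample(candidates, 2)
    return first + second
```

With a word list like `["y", "neuscanshoil", "leigh"]`, the "y" entry is never picked, so it cannot be glued onto another word.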


> [...]
> (though Aaue is a proper name) [...]
>
> Perick is also a proper name [...]

Do others think proper names are a problem? If so, they should be easy enough to remove, since they're usually tagged as such on Wiktionary. Otherwise, they add some cheap variety to our dictionaries.


> The form "vaayl" is a rare grammar-induced form of an unusual word

In this case it's again a proper noun, no idea how correct or how current: <https://en.wiktionary.org/?oldid=21902154>


> Hard to read, could be "hiu shee" or "niu shee"

It was "hiu": there is no "niu" in our dictionary. If the latter is a valid word, you should add it to Wiktionary and then we can try to figure out something to exclude confusable words.
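
One possible way to exclude confusable words, sketched in Python under heavy assumptions: the `LOOKALIKES` table below is hand-picked and hypothetical (a real one would have to be tuned per font and script), and `skeleton` and `drop_confusable` are illustrative names, not anything in the patch. The idea is to map look-alike letters to one representative and drop every word whose "skeleton" collides with that of a different word.

```python
from collections import defaultdict

# Hypothetical look-alike letters, mapped to a single representative.
LOOKALIKES = {"n": "h", "i": "l", "I": "l", "1": "l"}


def skeleton(word):
    """Replace each letter by the representative of its look-alike group."""
    return "".join(LOOKALIKES.get(ch, ch) for ch in word)


def drop_confusable(words):
    """Keep only words whose skeleton is not shared with a different word."""
    groups = defaultdict(set)
    for w in words:
        groups[skeleton(w)].add(w)
    return [w for w in words if len(groups[skeleton(w)]) == 1]
```

With this table, a dictionary containing both "hiu" and "niu" would lose both, while a dictionary with only "hiu" keeps it.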

Once again, the proposed approach is to rely on a mix of Unicode magic and a self-healing (wiki) dictionary. Neither is enough on its own.


> This one means "arctic castration" (spoiy = castration).  Not obscene,
> but maybe not for everyone?

Well, it could fall under "obscene" for some definition of the word. I'm now also blacklisting "pejorative" and "offensive" words; those who care can try and see whether their label edits survive on the wiki.
https://en.wiktionary.org/wiki/Wiktionary:Context_labels
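
Roughly, the label-based blacklist could look like the Python sketch below. This is only an illustration: I am assuming the dictionary has already been extracted into a mapping from each word to its context labels, and `BLOCKED_LABELS` and `allowed_words` are hypothetical names rather than the ones used in the patch.

```python
# Labels mentioned in this thread; the real list may well be longer.
BLOCKED_LABELS = frozenset({"obscene", "pejorative", "offensive"})


def allowed_words(entries, blocked=BLOCKED_LABELS):
    """Return the words that carry none of the blocked context labels.

    `entries` maps each word to an iterable of its context labels
    (an assumed, pre-extracted structure).
    """
    return [
        word
        for word, labels in entries.items()
        if not blocked.intersection(label.lower() for label in labels)
    ]
```

Since the labels come from the wiki, fixing a label there changes the outcome of the next dictionary build without touching the script.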

Nemo

_______________________________________________
Mediawiki-i18n mailing list
https://lists.wikimedia.org/mailman/listinfo/mediawiki-i18n
