Please stop calling this an “AI” system; it is not. It is statistical
learning.

This is probably not going to make me popular…

In some jurisdictions you will need a permit to create, manage, and store
biometric identifiers, regardless of whether the biometric identifier is
for a known person or not. If you want to create biometric identifiers and
use them, make darn sure you follow every applicable law and rule. I'm not
amused by the idea of CUs using illegal tools to vet ordinary users.

Any system that tries to remove the anonymity of users on Wikipedia should
have an RfC where the community can make their concerns heard. This is not
the proper forum to get acceptance from Wikipedia's community.

And by the way, systems for cleaning up prose exist for a whole bunch of
languages, not only English. Grammarly is one, LanguageTool another, and
there are plenty of other such tools.
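
For what it's worth, LanguageTool even exposes a public HTTP API covering
dozens of languages. A minimal sketch in Python (the endpoint and fields
come from the LanguageTool HTTP API docs; the rest is just illustration):

    # Minimal sketch: prose checking via the public LanguageTool API.
    import requests

    def check_prose(text, language):
        resp = requests.post(
            "https://api.languagetool.org/v2/check",
            data={"text": text, "language": language},
        )
        resp.raise_for_status()
        # Each "match" describes one suspected grammar or style issue.
        return resp.json()["matches"]

    for match in check_prose("Das ist ein kleine Test.", "de-DE"):
        print(match["message"], [r["value"] for r in match["replacements"]])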

On Sat, Aug 8, 2020 at 19:42 Amir Sarabadani <ladsgr...@gmail.com> wrote:

> Thank you all for the responses; I'll try to summarize my responses here.
>
> * By closed source, I don't mean it will be accessible only to me. It's
> already accessible to another CU and one WMF staff member, and I would
> gladly share the code with anyone who has signed an NDA; they are of
> course more than welcome to change it. GitHub has a really low limit on
> how many people can access a private repo, but I would be fine with any
> means of fixing this.
>
> * I have read that people say there are already public tools to analyze
> text. I disagree: 1) the tools you mentioned are for English and not other
> languages (maybe I missed something), and even if we imagine there were
> such tools for big languages like German and/or French, they don't cover
> lots of languages, unlike my tool, which is basically language agnostic
> and depends only on the volume of discussions that have happened on the
> wiki.
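>
> To illustrate why it can be language agnostic: here is a toy sketch (not
> the tool's actual code) that compares two accounts by character n-gram
> profiles, which need no tokenizer or dictionary for any language:
>
>     # Toy sketch (not the real tool): character n-grams are language
>     # agnostic because they need no tokenizer or dictionary.
>     from collections import Counter
>     from math import sqrt
>
>     def ngram_profile(text, n=3):
>         text = text.lower()
>         return Counter(text[i:i + n] for i in range(len(text) - n + 1))
>
>     def cosine_similarity(a, b):
>         dot = sum(a[g] * b[g] for g in set(a) & set(b))
>         norm_a = sqrt(sum(v * v for v in a.values()))
>         norm_b = sqrt(sum(v * v for v in b.values()))
>         return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
>
>     text_a = "I reverted this, please discuss on the talk page first."
>     text_b = "Please discuss on the talk page first, then revert."
>     print(cosine_similarity(ngram_profile(text_a), ngram_profile(text_b)))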
>
> * I also disagree that it's not hard to build. I have lots of experience
> with NLP (my favorite work being a tool that finds swear words in every
> language based on the history of vandalism in that Wikipedia [1]), and
> still it took me more than a year (a couple of hours almost every weekend)
> to build this. Analyzing pure, clean text is not hard; cleaning up
> wikitext, templates, and links to get only the text people "spoke" is
> doubly hard, and analyzing user signatures brings only suffering and
> sorrow.
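>
> To give a taste of the cleanup problem: with a wikitext parser like
> mwparserfromhell, the first step looks roughly like the sketch below, but
> the real pipeline is much messier, and expanded signatures still need
> extra handling on top.
>
>     # Rough shape of the cleanup step (the real pipeline is messier):
>     # drop templates and markup, keep only the text people "spoke".
>     import mwparserfromhell
>
>     raw = ('{{re|Example}} I [[WP:AGF|assume good faith]] here. '
>            '[[User:SomeUser|Some User]] ([[User talk:SomeUser|talk]]) '
>            '19:42, 8 August 2020 (UTC)')
>     plain = mwparserfromhell.parse(raw).strip_code()
>     # Templates are dropped and links keep their display text; note the
>     # signature residue "Some User (talk) ..." still has to be removed.
>     print(plain)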
>
> * While in general I agree that if a government wants to build this, it
> can, reality is more complicated, and this situation is similar to
> security. You can never be 100% secure, but you can increase the cost of
> hacking you so much that it becomes pointless for a major actor to do it.
> Governments have a limited budget, dictatorships are by design corrupt and
> filled with incompetent people [2], and sanctions put another restraint on
> such governments too, so I would not hand them such an opportunity for
> oppression on a silver platter for free. If they really want it, they must
> pay for it (which means they can't use that money and those resources on
> oppressing some other group).
>
> * People have said this AI is easy to game. While it's not that easy, and
> the tools you mentioned are limited to English, it's still a big win for
> the integrity of our projects. It boils down again to increasing the cost.
> If a major actor wants to spread disinformation, so far they only need to
> fake their UA and IP, which is a piece of cake, and I already see that (as
> a CU); but now they would have to mess with the UA/IP AND change their
> manner of speaking (which is an order of magnitude harder than changing an
> IP). As I said, increasing this cost might not prevent it from happening,
> but at least it takes away the ability to oppress other groups.
>
> * This tool will never be the only reason to block a sock. It's more than
> anything a helper: if a CU check brings up a large range and the accounts
> are similar but the result is not conclusive, this tool can help. Or when
> we are 90% sure it's a WP:DUCK, this tool can help too. But blocking just
> because this tool said so would imply a "Minority Report" situation, and
> to be honest I would really like to avoid that. It is supposed to empower
> CUs.
>
> * Banning the use of this tool is not legally possible. The content of
> Wikipedia is published under CC BY-SA, which allows such analysis, and
> especially you can't ban an off-wiki action. Also, if a university
> professor can do it, I don't see the point of banning its use by the most
> trusted group of users (CUs). You can ban blocking based on this tool, but
> I don't think we should block solely based on it anyway.
>
> * It has been pointed out by people on the checkuser mailing list that
> there's no point in logging access to this tool: since the code is
> accessible to CUs (if they want it), they can download it and run it on
> their own computers without any logging anyway.
>
> * There is a huge difference between CU and this AI tool in matters of
> privacy. While both are privacy sensitive, CU reveals much more. As a CU,
> I know where lots of people live or study because they showed up in my
> checks, and while I won't tell a soul about them, it makes me
> uncomfortable (I'm also not implying CUs are not trusted; it's just that
> we should respect people's privacy and avoid "unreasonable search and
> seizure" [3]). This tool only reveals a connection between accounts when
> one of them is linked to a public identity and the other is not, which I
> wholeheartedly agree is not great, but it's not on the same level as
> seeing people's IPs. So I even think that in an ideal world where the AI
> model is more accurate than CU, we should stop using CU and rely solely on
> the AI instead (important: I'm not implying the current model is better;
> I'm saying if it were better). This would help us understand why, for
> example, fishing for sock puppets with CU is bad (and banned by policy)
> while fishing for socks using this AI is not bad and can be a good
> starting point. In other words, this tool, used right, can reduce
> checkuser actions and protect people's privacy instead.
>
> * People have been saying you need to teach AI to people so that, for
> example, CUs don't make wrong judgments based on this. I want to point out
> that the examples mentioned in the discussion are supervised machine
> learning, which is AI but not all of AI. This tool is not machine
> learning, but it's AI (by heavily relying on NLP); for example, it
> produces graphs and so on, and it doesn't give a number like "95% sure
> these two users are the same", which a supervised machine learning model
> would do. I think reducing people's fingerprints to just a number is
> inaccurate and harmful (life is not like a TV crime series where a
> forensic scientist gives you the truth using some magic). I will write
> detailed instructions on how to use it, but it's not as bad as you'd
> think; it leaves huge room for human judgment.
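>
> To make that concrete, here is a toy example (again, not the real tool) of
> reporting separate, interpretable signals side by side instead of one
> opaque score:
>
>     # Toy example: surface several interpretable signals and let a
>     # human weigh them, instead of one "95% the same user" number.
>     import re
>
>     def signals(text):
>         words = re.findall(r"\w+", text.lower())
>         return {
>             "average word length": round(sum(map(len, words)) / len(words), 2),
>             "commas per word": round(text.count(",") / len(words), 3),
>             "sentences starting with 'Also,'": text.count("Also,"),
>         }
>
>     text_a = "Also, I think this is wrong. Also, please stop doing it."
>     text_b = "This looks fine to me, no objection, thanks for asking."
>     a, b = signals(text_a), signals(text_b)
>     for name in a:
>         print(f"{name}: {a[name]} vs {b[name]}")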
>
> [1] Have fun (warning, explicit language):
> https://gist.github.com/Ladsgroup/cc22515f55ae3d868f47#file-enwiki
> [2] To understand why, you can read this political science book called
> "The Dictator's Handbook":
> https://en.wikipedia.org/wiki/The_Dictator%27s_Handbook
> [3] From the Fourth Amendment to the US Constitution; you can find a
> similar clause in every constitution.
>
> Hope this addresses some of the concerns. Sorry for the long email.
_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
