[Wikidata-bugs] [Maniphest] [Updated] T105126: Evaluate pattern constraints (safely)
thiemowmde added a project: Wikidata-Sprint-2015-11-03. TASK DETAIL https://phabricator.wikimedia.org/T105126 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: thiemowmde Cc: thiemowmde, Nikki, Popcorndude, Aklapper, daniel, Wikidata-bugs, aude, GWicke, csteipp ___ Wikidata-bugs mailing list Wikidata-bugs@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs
[Wikidata-bugs] [Maniphest] [Updated] T105126: Evaluate pattern constraints (safely)
thiemowmde added a blocking task: T102892: Collect all regular expressions used in Wikidata's Template:Constraint:Format.
[Wikidata-bugs] [Maniphest] [Updated] T105126: Evaluate pattern constraints (safely)
Popcorndude added a comment. I messed with the constraints a bit, and it would be pretty easy to get up to ~50% with the constraints you outlined (the numbers I gave before may have forgotten to skip newlines, lowering the count). Adding + and * covers 3/4, and most of the rest could be rewritten without too much trouble (other than https://phabricator.wikimedia.org/P1793 and possibly a few others that are basically impossible). (\\.|[^(){}\[\]\\]|\{\d+(,\d+)?\}|\[(?!^)(\\\[|\\\]|[^\[\]])*\])* should work (though it does let character classes through). After a bit more fiddling: https://phabricator.wikimedia.org/P234, https://phabricator.wikimedia.org/P274, https://phabricator.wikimedia.org/P305, P353, https://phabricator.wikimedia.org/P395, https://phabricator.wikimedia.org/P428, https://phabricator.wikimedia.org/P529, https://phabricator.wikimedia.org/P866, https://phabricator.wikimedia.org/P998, https://phabricator.wikimedia.org/P1190, https://phabricator.wikimedia.org/P1256, https://phabricator.wikimedia.org/P1257, https://phabricator.wikimedia.org/P1472, https://phabricator.wikimedia.org/P1612, https://phabricator.wikimedia.org/P1662, https://phabricator.wikimedia.org/P1793, https://phabricator.wikimedia.org/P1986, https://phabricator.wikimedia.org/P2015 are the only properties where I don't see a way to rewrite the format with your constraints plus infinite repetition (i.e. no groups). For a lot of these there probably is an upper limit; it's just that no one bothered to write it into the format. With that in mind, the list shortens to https://phabricator.wikimedia.org/P234, https://phabricator.wikimedia.org/P274, https://phabricator.wikimedia.org/P428, https://phabricator.wikimedia.org/P1472, https://phabricator.wikimedia.org/P1612, https://phabricator.wikimedia.org/P1793 (most of which are probably things you didn't intend to support anyway).
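A minimal sketch of how the checking regex quoted above could be applied to a format string, assuming Python's `re` module is an acceptable stand-in for whatever engine would really run the check. The names `SAFE_FORMAT` and `is_safe_format` are hypothetical and only used for illustration; the pattern itself is copied verbatim from the comment.

```python
import re

# The checking regex from the comment, copied verbatim.
# SAFE_FORMAT and is_safe_format are hypothetical names for this sketch.
SAFE_FORMAT = re.compile(
    r"(\\.|[^(){}\[\]\\]|\{\d+(,\d+)?\}|\[(?!^)(\\\[|\\\]|[^\[\]])*\])*"
)


def is_safe_format(fmt: str) -> bool:
    """Return True if the whole format string is accepted by the checking regex."""
    return SAFE_FORMAT.fullmatch(fmt) is not None


if __name__ == "__main__":
    for fmt in (r"0?\d{8}", r"[a-z]+\d{2,4}", r"oai:.*:.*", r"(a|b)+"):
        print(fmt, is_safe_format(fmt))
```

Formats that contain groups, such as `(a|b)+`, are rejected because an unescaped parenthesis is not matched by any alternative of the checking regex.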
[Wikidata-bugs] [Maniphest] [Updated] T105126: Evaluate pattern constraints (safely)
Popcorndude added a comment. Those criteria accept 62 (8%) of the current constraints. Adding character classes (\d is everywhere) brings it up to 166 (23%).

I would suggest allowing infinite repetition if the thing being repeated cannot overlap with the next thing. To prevent that from requiring the checking regex to execute the format on itself, though, maybe infinite repetition should only be allowed on character classes that don't overlap with the following thing (e.g. \d*[a-z] but not \d*0), and these should then also be treated as atomic (automatically convert \d+ to (?>\d+)).

It would probably make things a lot easier for the people writing the formats if each property were allowed to have multiple format strings (match this or this).

As for grouping, maybe allow it but only non-nested atomic, since the alternative is to rewrite this (admittedly slightly extreme) example as a flat list of alternatives:

https?://((academia|android|anime|apple|arduino|astronomy|aviation|beer|bicycles|biology|bitcoin|blender|boardgames|bricks|buddhism|chemistry|chess|chinese|christianity|codegolf|codereview|cogsci|cooking|craftcms|crypto|cs|cstheory|datascience|dba|diy|drupal|dsp|earthscience|ebooks|electronics|ell|emacs|english|expatriates|expressionengine|fitness|freelancing|french|gamedev|gaming|gardening|genealogy|german|gis|graphicdesign|ham|hermeneutics|hinduism|history|homebrew|islam|italian|japanese|joomla|judaism|linguistics|magento|martialarts|math|matheducators|mathematica|mechanics|meta|moderators|money|movies|music|networkengineering|opendata|outdoors|parenting|patents|pets|philosophy|photo|physics|pm|poker|politics|productivity|programmers|puzzling|quant|raspberrypi|reverseengineering|robotics|rpg|russian|salesforce|scicomp|scifi|security|sharepoint|skeptics|softwarerecs|sound|space|spanish|sports|sqa|startups|stats|sustainability|tex|tor|travel|tridion|unix|ux|video|webapps|webmasters|windowsphone|wordpress|workplace|writers)\.stackexchange\.com|askubuntu\.com|mathoverflow\.net|pt\.stackoverflow\.com|serverfault\.com|stackapps\.com|stackoverflow\.com|superuser\.com)/(tags|questions/tagged)/.* (https://phabricator.wikimedia.org/P1482)

I'll see if I can make some regexes for this in a bit.
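The suggested automatic conversion of \d+ to (?>\d+) could be sketched roughly as follows. This is only an illustration under stated assumptions: `make_atomic` is a hypothetical helper, it handles only the shorthand classes \d \D \s \S \w \W followed by + or *, it does not check whether the repeated class overlaps with whatever follows, and it does not special-case escapes inside [...] or lazy quantifiers such as \d+?.

```python
import re

# Matches one escape sequence plus an optional unbounded quantifier.
# Scanning escape by escape means an escaped backslash (\\) is consumed
# as a unit, so the "d+" in r"\\d+" is not mistaken for the class \d.
_ESCAPE = re.compile(r"\\(.)([+*]?)")


def make_atomic(fmt: str) -> str:
    """Wrap unbounded repetitions of shorthand classes in atomic groups.

    Hypothetical helper for illustration; not a complete regex rewriter.
    """
    def repl(match: re.Match) -> str:
        cls, quant = match.group(1), match.group(2)
        if cls in "dDsSwW" and quant:
            return f"(?>\\{cls}{quant})"
        return match.group(0)  # leave every other escape untouched

    return _ESCAPE.sub(repl, fmt)


if __name__ == "__main__":
    print(make_atomic(r"\d+[a-z]"))   # wraps \d+ in an atomic group
    print(make_atomic(r"0?\d{8}"))    # bounded repetition is left alone
```

Atomic groups are available in PCRE and, from version 3.11, in Python's `re` module, so the rewritten format can only be executed on an engine that supports that syntax.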
[Wikidata-bugs] [Maniphest] [Updated] T105126: Evaluate pattern constraints (safely)
Popcorndude added a comment. I did some analysis of what regex features are actually used: https://www.wikidata.org/wiki/User:Popcorndude/formats Only 6 properties use backrefs, lookahead, or lookbehind: https://phabricator.wikimedia.org/P1256, https://phabricator.wikimedia.org/P1257, https://phabricator.wikimedia.org/P1297, https://phabricator.wikimedia.org/P1472, https://phabricator.wikimedia.org/P1612, https://phabricator.wikimedia.org/P1793. Everything else uses only "?*+ [] () | {} {,} \d\D\s\S\w ." Is this analysis helpful? If so, is there anything else you'd like me to check?
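A rough sketch of how such a feature check might be automated, assuming a simple token scan is good enough. The function name `advanced_features` is hypothetical, and the scan does not understand character classes, so a literal "(?=" inside [...] would be reported as a lookahead.

```python
import re

# One token per iteration: an escape sequence, a lookaround opener,
# or any other single character.
_TOKEN = re.compile(r"\\(?P<esc>.)|(?P<look>\(\?<?[=!])|.", re.DOTALL)


def advanced_features(fmt: str) -> set:
    """Report which of backreference / lookahead / lookbehind appear in fmt."""
    found = set()
    for token in _TOKEN.finditer(fmt):
        esc, look = token.group("esc"), token.group("look")
        if esc is not None and esc in "123456789":
            found.add("backreference")
        elif look is not None:
            found.add("lookbehind" if "<" in look else "lookahead")
    return found


if __name__ == "__main__":
    for fmt in (r"(\d)\1", r"(?!0)\d+", r"(?<=a)b", r"\d{8}"):
        print(fmt, sorted(advanced_features(fmt)))
```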
[Wikidata-bugs] [Maniphest] [Updated] T105126: Evaluate pattern constraints (safely)
Popcorndude added a comment. I'm letting things like 0?\d{8} through the filter, and most of what's left is checking file extensions. I can make them not backtrack at all if Commons filenames don't contain periods (I don't know what characters are allowed). They are generally of the form .*\.()

Besides files and the 6 I mentioned before, the only ones I see are:

https://phabricator.wikimedia.org/P1949 oai:.*:.*
https://phabricator.wikimedia.org/P274 ([αβγδφωλμπ]-)?([([]*[A-Z☐][ub]?[a-z]?[₁₂₃₄₅₆₇₈₉₀]*(\)?[¹²³⁴⁵⁶⁷⁸⁹⁰]*[⁺⁻]?)?[])|,₁₂₃₄₅₆₇₈₉₀]*(·\(?[-0-9.]*n?\)?)?)+
https://phabricator.wikimedia.org/P367 .+ (S|s)ymbol.*\.svg
https://phabricator.wikimedia.org/P395 ([A-Z0-9\p{L}]+(,\w)?)+
https://phabricator.wikimedia.org/P866 (([a-z]+|39)\-)*([1-9][0-9]*|[a-z]+|[1-9][a-z0-9][a-z0-9])

Everything else either doesn't have this problem (as far as I can tell) or is guaranteed to be looking for fewer than 10 characters.
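To illustrate the file-extension case, here is a small sketch comparing the general form .*\.() with a variant that cannot backtrack across periods. The extension list (svg|png) and both constant names are assumptions made for the example; the real formats list other extensions.

```python
import re

# Hypothetical extension list; the actual formats differ per property.
GREEDY = re.compile(r".*\.(svg|png)")        # general form from the comment
NO_PERIODS = re.compile(r"[^.]*\.(svg|png)")  # no backtracking over periods

if __name__ == "__main__":
    for name in ("Flag.svg", "a.b.svg", "Flag.txt"):
        print(
            name,
            GREEDY.fullmatch(name) is not None,
            NO_PERIODS.fullmatch(name) is not None,
        )
```

The two patterns agree on filenames with a single period and differ as soon as a name such as a.b.svg contains another one, which is why the rewrite is only valid if filenames never contain periods before the extension.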
[Wikidata-bugs] [Maniphest] [Updated] T105126: Evaluate pattern constraints (safely)
daniel added projects: Security-General, Performance.