[Wikidata-bugs] [Maniphest] [Commented On] T105126: Evaluate pattern constraints (safely)

2015-10-27 Thread thiemowmde
thiemowmde added a comment. Weeks ago I played around with the current set of patterns given in F180605: pattern.txt and tried to create a parser that evaluates these regex patterns. This is the best I could come up with:

[Wikidata-bugs] [Maniphest] [Commented On] T105126: Evaluate pattern constraints (safely)

2015-10-27 Thread Popcorndude
Popcorndude added a comment. What if backtracking, grouping, and alternatives were all disabled, and each property could have multiple format constraints, and a value would pass if it matched any of them? Most paterns would need rewriting, but only for about 18 of them would it actually be

[Wikidata-bugs] [Maniphest] [Commented On] T105126: Evaluate pattern constraints (safely)

2015-10-26 Thread daniel
daniel added a comment. In https://phabricator.wikimedia.org/T105126#1752074, @Popcorndude wrote: > Those criteria accept 62 (8%) of the current constraints. > Adding character classes (\d is everywhere) brings it up to 166 (23%) Yea, I can imagine. My goal was to suggest a minimal set of

[Wikidata-bugs] [Maniphest] [Commented On] T105126: Evaluate pattern constraints (safely)

2015-10-25 Thread Popcorndude
Popcorndude added a comment. this matches the constraints I suggested: ^(?!.*?\.(\+|\*|\{\d+,\})\()(\\.|[^()\\\[\]]|\[([^\\\[\]]|\\.)*\]|\((?!\?)(\\.|[^()\\]|\[([^\\\[\]]|\\.)*\])*\))+$ these convert infinite repetition (other than ##.*()##, ##.+()##, and ##.{n,}()##) to atomic groups: |

[Wikidata-bugs] [Maniphest] [Commented On] T105126: Evaluate pattern constraints (safely)

2015-10-25 Thread Nikki
Nikki added a subscriber: Nikki. Nikki added a comment. @Popcorndude: I don't really know what's going on here (sorry if I've completely misunderstood), are you only looking at a subset of the format constraints? I know `P1814` is using \p and `P898` is using \x and you didn't mention either

[Wikidata-bugs] [Maniphest] [Commented On] T105126: Evaluate pattern constraints (safely)

2015-10-25 Thread Popcorndude
Popcorndude added a comment. My apologies. I eliminated those in my initial analysis and forgot to mention it. The full list of things with backslashes in front of them: bdDpsSwx2()[]{}|^\/$?+*,-. TASK DETAIL https://phabricator.wikimedia.org/T105126 EMAIL PREFERENCES

[Wikidata-bugs] [Maniphest] [Commented On] T105126: Evaluate pattern constraints (safely)

2015-10-25 Thread daniel
daniel added a comment. Maybe we should spell out the pattern matching features we actually need to cover our primary use case, matching identifiers. I'd suggest: - Literal, e.g. //abc// - Character set, e.g. //a[bc]//. - Character range, e.g. //foo[0-9]// - Repetition for an exact number of

[Wikidata-bugs] [Maniphest] [Commented On] T105126: Evaluate pattern constraints (safely)

2015-10-25 Thread daniel
daniel added a comment. How about using (?>) independent sub-expressions to avoid backtracking? As to https://phabricator.wikimedia.org/P1949, /oai:.*?:.*/ or

[Wikidata-bugs] [Maniphest] [Commented On] T105126: Evaluate pattern constraints (safely)

2015-10-24 Thread daniel
daniel added a comment. @Popcorndude Thanks for the analysis! Knowing what kinds of patterns are used how often is definitely a good thing. The most important question however is a bit harder to answer: which constraints use patterns that can trigger catastrophic backtracking

[Wikidata-bugs] [Maniphest] [Commented On] T105126: Evaluate pattern constraints (safely)

2015-10-24 Thread Popcorndude
Popcorndude added a comment. Of those 6 properties, 2 have "optionally the same character twice", 2 have "does not start with", and the other 2 are actually non-capturing groups that I misidentified as lookarounds. TASK DETAIL https://phabricator.wikimedia.org/T105126 EMAIL PREFERENCES

[Wikidata-bugs] [Maniphest] [Commented On] T105126: Evaluate pattern constraints (safely)

2015-10-24 Thread Popcorndude
Popcorndude added a comment. \(?\|?([^*+(){}\[\]]|\[(\\\]|[^\]])*\]|\(([^()*+]|[+*]\|)*\)|(?!<\\)\\[+*]|(\\d|\[0-9\])([*+]|\{\d+(,\d+|)\})(?!\\d|\d|\[0-9)|\{\d+\})*([*+]|\{\d+(,\d+|)\})?\|?\)? matches 624 of the ok ones, and should only match ok ones, though some will fail. TASK DETAIL

[Wikidata-bugs] [Maniphest] [Commented On] T105126: Evaluate pattern constraints (safely)

2015-07-08 Thread daniel
daniel added a comment. Glob patterns may be an option, using fnmatch http://php.net/manual/en/function.fnmatch.php. However, glob doesn't have a way to define a fixed number of repetitions, so four digits would have to be written as [0-9][0-9][0-9][0-9]. Also, the common * and ? wildcards are

[Wikidata-bugs] [Maniphest] [Commented On] T105126: Evaluate pattern constraints (safely)

2015-07-08 Thread daniel
daniel added a comment. Using PCRE limits http://php.net/manual/en/pcre.configuration.php to restrict backtracking and recursion would be one possible approach. However, this means some expressions will fail to evaluate on some input. There is no way to indicate this to the user who created