Thank you very much, Alex and Benjamin. Your answers were really helpful. I wrongly thought the bracket syntax was only allowed for very basic ranges like [0-9] or [a-z], since the “Introduction to WebKit Content Blockers” only mentioned “Matching ranges with the range syntax [a-b]”.
I’m glad to know the full bracket syntax is actually supported.

Romain

> On Aug 17, 2015, at 9:59 PM, Benjamin Poulain <[email protected]> wrote:
>
> Hi Romain,
>
> On 8/17/15 11:03 AM, Romain Jacquinot wrote:
>> For now, the following regular expression features are supported by
>> content blockers:
>>
>>   * Matching any character with “.”.
>>   * Matching ranges with the range syntax [a-b].
>>   * Quantifying expressions with “?”, “+” and “*”.
>>   * Groups with parenthesis.
>>   * Beginning of line (“^”) and end of line (“$”) marker
>>
>> However, there doesn’t seem to be a way to find any of the alternatives
>> specified with “|” or find any character not between the brackets "[^]".
>
> Actually the "[^]" character set syntax is supported.
>
> It could cause compile time issues on previous betas. That has been fixed in
> beta 5.
>
>> This is an issue when you want to block addresses like
>> http://www.example.com/, https://example.com/foobar.jpg,
>> http://example.com:8080 but not http://example.com.hk.
>
> The URLs are canonicalized before being processed by Content Blockers. That
> ensures some invariants on the format. For example, the domain name is
> always followed by ":" or "/". The domain name is always lowercase.
>
> Typically, I write domain triggers like this:
>
>     "trigger": {
>         "url-filter": "^https://([^:/]+\\.)example.com[:/]",
>         "url-filter-is-case-sensitive": true
>     }
>
>> With at least one of those features, you could write something like:
>>
>> {
>>     "action" : {
>>         "type" : "block"
>>     },
>>     "trigger" : {
>>         "url-filter": "^https?://(www\\.)?example\\.com(/|:|?)+"
>
> This does not work but
>     "^https?://(www\\.)?example\\.com[/:?]+"
> is equivalent.
>
>>     }
>> }
>>
>> or:
>>
>> {
>>     "action" : {
>>         "type" : "block"
>>     },
>>     "trigger" : {
>>         "url-filter" : "^https?://(www\\.)?example\\.com[^.]"
>
> This pattern should work fine in beta 5.
>
>>     }
>> }
>>
>> Please note that in this case, the if-domain field wouldn’t help for
>> embedded content.
>>
>> Should I write the same rule many times for the different cases (“/”,
>> “:”, “?”)? (doesn’t feel like a very elegant solution though). Since
>> they share the same prefix, will these rules be optimized? On the webkit
>> blog, it is written “The rules are grouped by the prefix “https?://”,
>> and it only counts as one rule with quantifiers.” Does it mean that it
>> will only count as one rule against the 50,000 rule limit?
>
> Having 3 rules with 3 different endings is fine as long as they are not
> quantified. Their prefix would be merged in the compiler frontend.
>
> Having 3 rules with quantifiers per URL would likely cause your rules to be
> rejected by the compiler even under the 50k rule limit.
>
> In any case, the 50k rule limit is on the number of triggers. The number of
> rules is counted before rules are merged.
>
>> Do you see an elegant solution to handle this case? If not, could you
>> please consider adding at least one of those regular expression features
>> for content blockers in Safari?
>
> Are the solutions above good enough for your use case?
>
> Benjamin

On Aug 17, 2015, at 8:48 PM, Alex Christensen <[email protected]> wrote:

> On Aug 17, 2015, at 11:03 AM, Romain Jacquinot <[email protected]> wrote:
>
> Hi,
>
> For now, the following regular expression features are supported by content
> blockers:
>   * Matching any character with “.”.
>   * Matching ranges with the range syntax [a-b].
>   * Quantifying expressions with “?”, “+” and “*”.
>   * Groups with parenthesis.
>   * Beginning of line (“^”) and end of line (“$”) marker
> However, there doesn’t seem to be a way to find any of the alternatives
> specified with “|” or find any character not between the brackets "[^]".

| is indeed not implemented yet. If I’m not mistaken, [^a] should work, though.
You could always do tricky things with ranges, like [\u0001-.0-9;->@-\u007F] but this doesn’t read very well and it might lead to hard-to-find errors for those of us that don’t have ASCII memorized.

> This is an issue when you want to block addresses like
> http://www.example.com/, https://example.com/foobar.jpg,
> http://example.com:8080 but not http://example.com.hk.
>
> With at least one of those features, you could write something like:
>
> {
>     "action" : {
>         "type" : "block"
>     },
>     "trigger" : {
>         "url-filter" : "^https?://(www\\.)?example\\.com(/|:|?)+"
>     }
> }
>
> or:
>
> {
>     "action" : {
>         "type" : "block"
>     },
>     "trigger" : {
>         "url-filter" : "^https?://(www\\.)?example\\.com[^.]"
>     }
> }
>
> Please note that in this case, the if-domain field wouldn’t help for embedded
> content.
>
> Should I write the same rule many times for the different cases (“/”, “:”,
> “?”)? (doesn’t feel like a very elegant solution though). Since they share
> the same prefix, will these rules be optimized? On the webkit blog, it is
> written “The rules are grouped by the prefix “https?://”, and it only counts
> as one rule with quantifiers.” Does it mean that it will only count as one
> rule against the 50,000 rule limit?

Rules sharing a prefix are combined into the same DFA when compiling the combined regular expressions. Fewer DFAs means faster performance. A prefix in this case is all the terms of a regular expression up to the last quantified term, so ab?c and ab?d would be combined into the same DFA and there wouldn’t be much of a performance penalty for adding more regular expressions with ab? at the beginning and no other quantified terms, but ab?cd?e has another quantified term, so it would be put into a separate DFA in our implementation. In your case, if all your rules start with ^https?
with no other quantified terms, then they will all be optimized well, but if all the rules have unique terms before the last quantified term like
^https?://a\.(com)?
^https://b\.(com)?
^https://c\.(com)?
etc. then these rules will not be combined well and it will hurt performance when checking if a URL matches the rules. To make it simple, the less you use ?, *, or +, the faster it will be. You could write a rule many times, but the 50000 rule limit applies when parsing the rules, so each rule will count towards that limit.

> Do you see an elegant solution to handle this case? If not, could you please
> consider adding at least one of those regular expression features for content
> blockers in Safari?

You could do something like ^https?://(www\.)?example\.com[/:?]

> Thanks.
>
> _______________________________________________
> webkit-help mailing list
> [email protected]
> https://lists.webkit.org/mailman/listinfo/webkit-help
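
For anyone who wants to sanity-check the character-class pattern from this thread before putting it in a rule, here is a minimal sketch in Python. It assumes that Python's `re` module treats this particular pattern the same way the content blocker regex subset does, which is only an approximation of WebKit's compiler. The sample URLs are the ones from the original question, and the helper name `is_blocked` is made up for illustration.

```python
import json
import re

# Pattern suggested in the thread, written as a plain regex
# (single backslashes; JSON escaping is handled further down).
URL_FILTER = r"^https?://(www\.)?example\.com[/:?]"
PATTERN = re.compile(URL_FILTER)

# Sample URLs from the original question (illustrative only).
SHOULD_BLOCK = [
    "http://www.example.com/",
    "https://example.com/foobar.jpg",
    "http://example.com:8080",
]
SHOULD_ALLOW = ["http://example.com.hk/"]


def is_blocked(url: str) -> bool:
    """Hypothetical helper: True if the url-filter pattern matches the URL."""
    return PATTERN.search(url) is not None


for url in SHOULD_BLOCK + SHOULD_ALLOW:
    print(url, "->", "blocked" if is_blocked(url) else "allowed")

# Serialize the rule so the backslashes are doubled the way the JSON
# rule file needs them ("example\\.com" rather than "example\.com").
rule_json = json.dumps(
    {"action": {"type": "block"}, "trigger": {"url-filter": URL_FILTER}},
    indent=4,
)
print(rule_json)
```

Generating the JSON from the plain regex avoids hand-escaping the backslashes. Note that this checks only the matching behavior; it says nothing about how the rule is merged with others or whether the compiler accepts it.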
