Thank you very much Alex and Benjamin. Your answers were really helpful.

I wrongly thought the bracket syntax was only allowed for very basic ranges 
like [0-9] or [a-z], since the “Introduction to WebKit Content Blockers” only 
mentioned “Matching ranges with the range syntax [a-b]”.

I’m glad to know the full bracket syntax is actually supported.

Romain

> On Aug 17, 2015, at 9:59 PM, Benjamin Poulain <[email protected]> wrote:
> 
> Hi Romain,
> 
> On 8/17/15 11:03 AM, Romain Jacquinot wrote:
>> For now, the following regular expression features are supported by
>> content blockers:
>> 
>>  * Matching any character with “.”.
>>  * Matching ranges with the range syntax [a-b].
>>  * Quantifying expressions with “?”, “+” and “*”.
>>  * Groups with parentheses.
>>  * Beginning of line (“^”) and end of line (“$”) markers.
>> 
>> However, there doesn’t seem to be a way to find any of the alternatives
>> specified with “|” or find any character not between the brackets “[^]”.
> 
> Actually the "[^]" character set syntax is supported.
> 
> It could cause compile time issues on previous betas. That has been fixed in 
> beta 5.
> 
>> This is an issue when you want to block addresses like
>> http://www.example.com/, https://example.com/foobar.jpg,
>> http://example.com:8080 but not http://example.com.hk.
> 
> The URLs are canonicalized before being processed by Content Blockers. That 
> ensures some invariants on the format. For example, the domain name is always 
> followed by ":" or "/", and the domain name is always lowercase.
> 
> Typically, I write domain triggers like this:
> 
> "trigger": {
>    "url-filter": "^https://([^:/]+\\.)example.com[:/]",
>    "url-filter-is-case-sensitive": true
> }
> 
> 
>> With at least one of those features, you could write something like:
>> 
>>     {
>>         "action" : {
>>             "type" : "block"
>>         },
>>         "trigger" : {
>>             "url-filter" : "^https?://(www\\.)?example\\.com(/|:|?)+"
> 
> This does not work but
>    "^https?://(www\\.)?example\\.com[/:?]+"
> is equivalent.
> 
>>         }
>>     }
>> 
>> or:
>> 
>>     {
>>         "action" : {
>>             "type" : "block"
>>         },
>>         "trigger" : {
>>             "url-filter" : "^https?://(www\\.)?example\\.com[^.]"
> 
> This pattern should work fine in beta 5.
> 
>>         }
>>     }
>> 
>> Please note that in this case, the if-domain field wouldn’t help for
>> embedded content.
>> 
>> Should I write the same rule many times for the different cases (“/”,
>> “:”, “?”)? (It doesn’t feel like a very elegant solution, though.) Since
>> they share the same prefix, will these rules be optimized? On the WebKit
>> blog, it is written “The rules are grouped by the prefix ‘https?://’,
>> and it only counts as one rule with quantifiers.” Does it mean that it
>> will only count as one rule against the 50,000 rule limit?
> 
> Having 3 rules with 3 different endings is fine as long as they are not 
> quantified. Their shared prefix would be merged in the compiler frontend.
> 
> Having 3 rules with quantifiers per URL would likely cause your rules to be 
> rejected by the compiler even under the 50k rule limit.
> 
> In any case, the 50k rule limit is on the number of triggers. The number of 
> rules is counted before rules are merged.
> 
>> Do you see an elegant solution to handle this case? If not, could you
>> please consider adding at least one of those regular expression features
>> for content blockers in Safari?
> 
> Are the solutions above good enough for your use case?
> 
> Benjamin


On Aug 17, 2015, at 8:48 PM, Alex Christensen <[email protected]> wrote:


> On Aug 17, 2015, at 11:03 AM, Romain Jacquinot <[email protected]> wrote:
> 
> Hi,
> 
> For now, the following regular expression features are supported by content 
> blockers:
> 
>  * Matching any character with “.”.
>  * Matching ranges with the range syntax [a-b].
>  * Quantifying expressions with “?”, “+” and “*”.
>  * Groups with parentheses.
>  * Beginning of line (“^”) and end of line (“$”) markers.
> 
> However, there doesn’t seem to be a way to find any of the alternatives 
> specified with “|” or find any character not between the brackets “[^]”.
| is indeed not implemented yet.
If I’m not mistaken, [^a] should work, though.  You could always do tricky 
things with ranges, like [\u0001-.0-9;->@-\u007F] but this doesn’t read very 
well and it might lead to hard-to-find errors for those of us that don’t have 
ASCII memorized.
> 
> This is an issue when you want to block addresses like 
> http://www.example.com/, https://example.com/foobar.jpg, 
> http://example.com:8080 but not http://example.com.hk.
> 
> With at least one of those features, you could write something like:
> 
>     {
>         "action" : {
>             "type" : "block"
>         },
>         "trigger" : {
>             "url-filter" : "^https?://(www\\.)?example\\.com(/|:|?)+"
>         }
>     }
> 
> or:
> 
>     {
>         "action" : {
>             "type" : "block"
>         },
>         "trigger" : {
>             "url-filter" : "^https?://(www\\.)?example\\.com[^.]"
>         }
>     }
> 
> Please note that in this case, the if-domain field wouldn’t help for embedded 
> content.
> 
> Should I write the same rule many times for the different cases (“/”, “:”, 
> “?”)? (It doesn’t feel like a very elegant solution, though.) Since they 
> share the same prefix, will these rules be optimized? On the WebKit blog, it 
> is written “The rules are grouped by the prefix ‘https?://’, and it only 
> counts as one rule with quantifiers.” Does it mean that it will only count as 
> one rule against the 50,000 rule limit?
Rules sharing a prefix are combined into the same DFA when compiling the 
combined regular expressions.  Fewer DFAs means faster performance.  A prefix 
in this case is all the terms of a regular expression up to the last quantified 
term.  So ab?c and ab?d would be combined into the same DFA, and there wouldn’t 
be much of a performance penalty for adding more regular expressions with ab? 
at the beginning and no other quantified terms.  But ab?cd?e has another 
quantified term, so it would be put into a separate DFA in our implementation.

In your case, if all your rules start with ^https? with no other quantified 
terms, then they will all be optimized well.  But if all the rules have unique 
terms before the last quantified term, like ^https?://a\.(com)?, 
^https://b\.(com)?, ^https://c\.(com)?, etc., then these rules will not be 
combined well and it will hurt performance when checking if a URL matches the 
rules.  To make it simple, the less you use ?, *, or +, the faster it will be.
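
To make the grouping rule above concrete, here is a rough Python sketch. It 
is only an illustration of the rule as described in this message, not the 
actual WebKit compiler code. The helper names quantified_prefix and 
group_by_prefix are made up for this example, and the scan assumes a 
quantifier is any unescaped ?, * or + outside a [...] character set.

```python
from collections import defaultdict

QUANTIFIERS = "?*+"


def quantified_prefix(pattern: str) -> str:
    """Return the pattern up to and including its last quantifier.

    Escaped characters and characters inside a [...] set are not
    counted as quantifiers. Returns "" if nothing is quantified.
    """
    last = -1
    in_set = False
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\":
            i += 2  # skip the backslash and the character it escapes
            continue
        if in_set:
            if ch == "]":
                in_set = False
        elif ch == "[":
            in_set = True
        elif ch in QUANTIFIERS:
            last = i
        i += 1
    return pattern[: last + 1]


def group_by_prefix(patterns):
    """Group patterns that share the same quantified prefix."""
    groups = defaultdict(list)
    for pattern in patterns:
        groups[quantified_prefix(pattern)].append(pattern)
    return dict(groups)


if __name__ == "__main__":
    for prefix, members in group_by_prefix(["ab?c", "ab?d", "ab?cd?e"]).items():
        print(repr(prefix), members)
```

With the three patterns from the explanation, ab?c and ab?d land in one group 
and ab?cd?e lands in its own, which mirrors the "same DFA" versus "separate 
DFA" split described above.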

You could write a rule many times, but the 50000 rule limit applies when 
parsing the rules, so each rule will count towards that limit.
> 
> Do you see an elegant solution to handle this case? If not, could you please 
> consider adding at least one of those regular expression features for content 
> blockers in Safari?
You could do something like ^https?://(www\.)?example\.com[/:?]
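
As a quick sanity check of that pattern against the URLs from the question, 
here is a small Python sketch. Python's re module is not the content blocker 
regex engine, so this only checks the logic of the pattern, under the 
assumption that both engines agree on this simple subset. The is_blocked 
helper is a made-up name for this example. Note that inside the JSON rule 
file the backslashes have to be doubled.

```python
import re

# Same pattern as above; single backslashes because this is not JSON.
PATTERN = re.compile(r"^https?://(www\.)?example\.com[/:?]")


def is_blocked(url: str) -> bool:
    """Return True if the URL matches the pattern (Python semantics)."""
    return PATTERN.search(url) is not None


if __name__ == "__main__":
    urls = (
        "http://www.example.com/",
        "https://example.com/foobar.jpg",
        "http://example.com:8080",
        "http://example.com.hk/",
    )
    for url in urls:
        print(url, is_blocked(url))
```

The first three URLs should match and the .hk one should not, because the 
character right after "example.com" has to be "/", ":" or "?".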
> 
> Thanks.
> 
> _______________________________________________
> webkit-help mailing list
> [email protected] <mailto:[email protected]>
> https://lists.webkit.org/mailman/listinfo/webkit-help

