[Wikidata-bugs] [Maniphest] [Commented On] T240884: Standalone service to evaluate user-provided regular expressions

2020-01-22 Thread Joe
Joe added a comment.


  In T240884#5813174, @Daimona wrote:
  
  > In T240884#5810160, @sbassett wrote:
  >
  >> In T240884#5810094, @Ladsgroup wrote:
  >>
  >>> One complicating factor here is that AbuseFilter and SpamBlacklist both 
don't have a clear maintainer.
  >>
  >> I think @Daimona is understood to be the de facto AF maintainer these days 
(trusted dev, wmf-NDA, etc.) and is pretty active in its current development.
  >
  > So, I'm going to answer for myself. I think a re2-like solution would 
indeed improve performance [1] for regexps-related extensions. AbuseFilter and 
SpamBlacklist for sure, but also TitleBlacklist, and CentralAuth as of T101615. 
Given the number of possible consumers, I believe that a reusable service would 
be the best choice.
  > Of note, there's also T187669 about adding a static ReDoS validator, in 
case you want to explore it as an alternative.
  > [1] - About AbuseFilter performance, some numbers are on grafana, and 
there's also a dashboard on logstash, although regexps aren't the only cause of 
slowness.
  
  Performance would be better if you interpreted regexes locally with re2, 
e.g. by using a PHP extension as @tstarling suggested.
  
  The advantage of a standalone service is that it works better if we need to 
use re2 from different services, so not just within MediaWiki.

TASK DETAIL
  https://phabricator.wikimedia.org/T240884

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Joe
Cc: Daimona, daniel, tstarling, Bawolff, Joe, WMDE-leszek, Volans, sbassett, 
Krinkle, Agabi10, Lucas_Werkmeister_WMDE, Addshore, Aklapper, Ladsgroup, 
darthmon_wmde, DannyS712, Nandana, kostajh, Lahi, Gq86, GoranSMilovanovic, 
RazeSoldier, QZanden, LawExplorer, _jensen, rosalieper, D3r1ck01, Scott_WUaS, 
Izno, SBisson, Perhelion, Wikidata-bugs, Base, aude, GWicke, jayvdb, fbstj, 
santhosh, Jdforrester-WMF, Mbch331, Rxy, Jay8g, Ltrlg, bd808, Legoktm
___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T240884: Standalone service to evaluate user-provided regular expressions

2020-01-16 Thread Ladsgroup
Ladsgroup added a comment.


  In T240884#5808716 , 
@daniel wrote:
  
  > So, one key question to answer in this RFC is: Are there other 
people/projects/teams interested in re2 or gRPC? What are their needs and plans?
  
  One complicating factor here is that neither AbuseFilter nor SpamBlacklist 
has a clear maintainer.



[Wikidata-bugs] [Maniphest] [Commented On] T240884: Standalone service to evaluate user-provided regular expressions

2020-01-16 Thread daniel
daniel added a comment.


  Condensed outcome of a conversation between @Ladsgroup, @Addshore, @Joe, 
@krinkle, @tstarling, and myself:
  
  - If there are other use cases for re2 in MediaWiki, go for a native PHP 
binding for re2.
  - If there are other use cases for gRPC on our cluster, try to use it instead 
of JSON-over-HTTP REST.
  - Otherwise, the most straightforward option seems to be a simple REST 
service written in node.js using the re2 npm package.
  
  So, one key question to answer in this RFC is: Are there other 
people/projects/teams interested in re2 or gRPC? What are their needs and plans?
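
  To make the JSON-over-HTTP option concrete, here is a minimal sketch of what a single request/response exchange could look like. It is written in Python rather than node.js, and it uses Python's standard `re` module purely as a stand-in: `re` is a backtracking engine and does NOT provide RE2's linear-time guarantee. The field names `pattern`, `text`, `match` and `error`, the function name `evaluate`, and the choice of whole-string matching are all assumptions made for illustration, not part of any agreed API.

```python
import json
import re


def evaluate(request_body: str) -> str:
    """Handle one hypothetical JSON request and return a JSON response string."""
    try:
        request = json.loads(request_body)
        pattern = request["pattern"]
        text = request["text"]
    except (json.JSONDecodeError, KeyError, TypeError):
        # Malformed JSON, missing field, or a non-object payload.
        return json.dumps({"error": "expected a JSON object with 'pattern' and 'text'"})

    if not isinstance(pattern, str) or not isinstance(text, str):
        return json.dumps({"error": "'pattern' and 'text' must be strings"})

    try:
        # Stand-in only: a real service would compile the pattern with RE2 here.
        compiled = re.compile(pattern)
    except re.error as exc:
        return json.dumps({"error": f"invalid pattern: {exc}"})

    # Assumption: the caller wants to know whether the whole string matches.
    return json.dumps({"match": compiled.fullmatch(text) is not None})


if __name__ == "__main__":
    print(evaluate('{"pattern": "[0-9]+", "text": "12345"}'))
```

  Keeping the handler a pure function from request body to response body means the same logic could sit behind HTTP/JSON or gRPC, which is the transport question raised above.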



[Wikidata-bugs] [Maniphest] [Commented On] T240884: Standalone service to evaluate user-provided regular expressions

2020-01-15 Thread tstarling
tstarling added a comment.


  There is https://pecl.php.net/package/re2 . It was written for PHP 5 and was 
never updated after its initial release in 2011, but we have the skills to 
update it for PHP 7 and review it for security. If we believe in RE2 then we 
shouldn't be afraid to invest in it.



[Wikidata-bugs] [Maniphest] [Commented On] T240884: Standalone service to evaluate user-provided regular expressions

2020-01-14 Thread Bawolff
Bawolff added a comment.


  In T240884#5796687, @Joe wrote:
  
  > I think the main question to answer is "does it make sense to create a safe 
regex evaluation service?".
  > I think in a void the answer is "no". It could make sense to create a small 
C++ program wrapping the main re2 functionality and shell out to it from php.
  > On the other hand, we have to consider the wikimedia infrastructure for 
this and there are two counterpoints to be made:
  >
  > - Is this a service we can only expect MediaWiki to call? If not, that's a 
point in favour of creating a separate service
  > - Shelling out for us works well by using a combination of firejail and 
cgroups creation that won't work well in the future with cgroups v2 and 
containerization
  > - Performance might not be extremely relevant
  >
  > Now on the last point: this proposal seems to worry a lot about 
performance, but I see no performance requirement spelled out. Without more 
context, both the choice of shelling out vs. an RPC service and the proposal 
to use gRPC for said service seem to me like premature optimizations.
  > So my questions are:
  >
  > - What is the 95th percentile of latency in validating all the constraints 
on an item when editing it?
  > - What is the average, median and max number of regexes we need to validate 
per item?
  >
  > Without answering those questions, we would just make choices by principle, 
while I think we should have a more pragmatic approach.
  
  I think there is a question of whether only the Wikidata use case is going to 
use this, or eventually all user-provided (including admin-provided) regexes. 
If we also eventually move AbuseFilter & SpamBlacklist to this, I could see 
performance possibly being a concern, since these features often involve 
running quite a large number of regexes that block saving a page until they 
complete (that said, I don't have any specific numbers beyond a gut feeling 
that this is so).



[Wikidata-bugs] [Maniphest] [Commented On] T240884: Standalone service to evaluate user-provided regular expressions

2020-01-14 Thread Addshore
Addshore added a comment.


  In T240884#5802950, @Lucas_Werkmeister_WMDE wrote:
  
  > If I understand correctly, this is just the time of the individual format 
constraint check itself. A full constraint check for an item will typically 
involve several format constraint checks, as well as other constraint checks of 
various types. (I think that’s what you meant as well?
  
  Yep
  
  > The phrasing “constraint check that includes regex” confuses me a bit, 
because there isn’t really much //else// happening there besides the regex, 
except for a bit of housekeeping like getting the string out of the Wikibase 
value, and assembling the result.)
  
  That phrasing is meant to distinguish these checks from other constraint 
check types that have nothing to do with regexes at all.



[Wikidata-bugs] [Maniphest] [Commented On] T240884: Standalone service to evaluate user-provided regular expressions

2020-01-14 Thread Lucas_Werkmeister_WMDE
Lucas_Werkmeister_WMDE added a comment.


  > I believe every cache miss in that graph will result in 1 call to SPARQL to 
check 1 regular expression against 1 string.
  
  I think that’s correct, we don’t batch these requests at the moment.
  
  > The second panel is the p95 timing for the constraint check that includes a 
regex run.
  > Again, multiple of these checks will happen in a single edit / for a single 
item.
  
  If I understand correctly, this is just the time of the individual format 
constraint check itself. A full constraint check for an item will typically 
involve several format constraint checks, as well as other constraint checks of 
various types. (I think that’s what you meant as well? The phrasing “constraint 
check that includes regex” confuses me a bit, because there isn’t really much 
//else// happening there besides the regex, except for a bit of housekeeping 
like getting the string out of the Wikibase value, and assembling the result.)



[Wikidata-bugs] [Maniphest] [Commented On] T240884: Standalone service to evaluate user-provided regular expressions

2020-01-14 Thread Addshore
Addshore added a comment.


  I have created a temporary dashboard at 
https://grafana.wikimedia.org/d/HUGEtYPWz/t240884?orgId=1 with some of these 
numbers pulled out.
  
  The "Individual regex runs" panel covers what I said in T240884#5802852 about 
individual regexes being run on strings.
  As said in the panel description, this will happen multiple times per edit / 
item.
  
  The second panel is the p95 timing for the constraint check that includes a 
regex run.
  Again, multiple of these checks will happen in a single edit / for a single 
item.
  
  The third panel is my very poor first attempt at figuring out roughly how 
many of these checks run per edit.
  These seem to range from 5 to ~80 (peak), but normally stay in the 5-30 
range.
  I might be able to come up with a better number for this based on a dump 
before the meeting.
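
  As a rough worked example of what those counts could mean per edit: if checks are not batched and every cache miss costs one remote call, the added time is approximately (checks per edit) x (share of cache misses) x (time per call). The sketch below only does that arithmetic. The check counts 5, 30 and 80 come from the panel described above; the per-call latency and the cache hit ratio are hypothetical placeholders, not measurements.

```python
def sequential_overhead_ms(checks_per_edit: int,
                           per_call_ms: float,
                           cache_hit_ratio: float = 0.0) -> float:
    """Estimate added time per edit when regex checks run one after another.

    cache_hit_ratio is the fraction of checks answered from cache (0.0 to 1.0);
    only the remaining checks are assumed to cost a remote call.
    """
    if not 0.0 <= cache_hit_ratio <= 1.0:
        raise ValueError("cache_hit_ratio must be between 0.0 and 1.0")
    expected_calls = checks_per_edit * (1.0 - cache_hit_ratio)
    return expected_calls * per_call_ms


if __name__ == "__main__":
    HYPOTHETICAL_PER_CALL_MS = 10.0  # placeholder, not taken from the dashboard
    for checks in (5, 30, 80):       # low end, usual upper end, observed peak
        estimate = sequential_overhead_ms(checks, HYPOTHETICAL_PER_CALL_MS)
        print(f"{checks} checks -> about {estimate:.0f} ms if none are cached")
```

  Plugging in real p95 figures from the dashboard instead of the placeholder would give an upper bound for the uncached case.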



[Wikidata-bugs] [Maniphest] [Commented On] T240884: Standalone service to evaluate user-provided regular expressions

2020-01-13 Thread Joe
Joe added a comment.


  I think the main question to answer is "does it make sense to create a safe 
regex evaluation service?".
  
  I think in a void the answer is "no". It could make sense to create a small 
C++ program wrapping the main re2 functionality and shell out to it from php.
  
  On the other hand, we have to consider the wikimedia infrastructure for this 
and there are two counterpoints to be made:
  
  - Is this a service we can only expect MediaWiki to call? If not, that's a 
point in favour of creating a separate service
  - Shelling out for us works well by using a combination of firejail and 
cgroups creation that won't work well in the future with cgroups v2 and 
containerization
  - Performance might not be extremely relevant
  
  Now on the last point: this proposal seems to worry a lot about performance, 
but I see no performance requirement spelled out. Without more context, both 
the choice of shelling out vs. an RPC service and the proposal to use gRPC for 
said service seem to me like premature optimizations.
  
  So my questions are:
  
  - What is the 95th percentile of latency in validating all the constraints on 
an item when editing it?
  - What is the average, median and max number of regexes we need to validate 
per item?
  
  Without answering those questions, we would just make choices by principle, 
while I think we should have a more pragmatic approach.
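
  For reference, the statistics asked for above can be computed from raw samples along these lines. This is a sketch only: the function names and the dictionary keys are hypothetical, the percentile uses the simple nearest-rank definition (monitoring tools may interpolate differently), and both inputs are assumed to be non-empty lists of numbers.

```python
from statistics import mean, median


def percentile_nearest_rank(samples, p: int):
    """Return the p-th percentile (integer p, 1-100) using the nearest-rank method."""
    ordered = sorted(samples)
    # Integer form of ceil(p / 100 * n); avoids floating-point rounding surprises.
    rank = max(1, (p * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def summarize(latencies_ms, regexes_per_item):
    """Summarize per-item validation latency and per-item regex counts."""
    return {
        "latency_p95_ms": percentile_nearest_rank(latencies_ms, 95),
        "regexes_mean": mean(regexes_per_item),
        "regexes_median": median(regexes_per_item),
        "regexes_max": max(regexes_per_item),
    }
```

  The p95 is taken over whole-item validation time, while the mean, median and max are taken over the number of regexes per item, matching the two questions.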



[Wikidata-bugs] [Maniphest] [Commented On] T240884: Standalone service to evaluate user-provided regular expressions

2020-01-13 Thread Joe
Joe added a comment.


  In T240884#5789392, @Ladsgroup wrote:
  
  >> Though this is mainly an implementation detail and not significant in 
terms of requirements or pros/cons.
  >
  > I disagree for a couple of reasons: gRPC is faster. According to some 
measurements in ASP.net (not in PHP) it's seven times faster than HTTP/JSON. 
That would be an important factor in deciding whether we should go with a 
standalone service or another direction.
  > The other reason I think it's important is that this would be the first 
time we use gRPC in production, which means introducing new dependencies (in 
PHP) and services. This is cross-cutting and would involve more work from 
services and SRE than an HTTP+JSON solution. Another reason is that the API 
spec of the regex implementation is hard to undo, as it'll be used in several 
places, not just one part of Wikibase.
  
  I think you're both right - gRPC is an implementation detail in answering the 
first question:
  
  - Does a standalone service make sense?
  
  but if the answer is yes, then it's something we should discuss in the RFC 
process for the reasons @Ladsgroup outlined.



[Wikidata-bugs] [Maniphest] [Commented On] T240884: Standalone service to evaluate user-provided regular expressions

2020-01-09 Thread Ladsgroup
Ladsgroup added a comment.


  > Though this is mainly an implementation detail and not significant in terms 
of requirements or pros/cons.
  
  I disagree for a couple of reasons: gRPC is faster. According to some 
measurements in ASP.net (not in php) it's seven times faster than HTTP/JSON. 
That would be an important factor in deciding whether we should go with 
standalone service or another direction.
  
  The other reason I think it's important is that this would be the first time 
we use gRPC in production, which means introducing new dependencies (in PHP) 
and services. This is cross-cutting and would involve more work from services 
and SRE than an HTTP+JSON solution. Another reason is that the API spec of the 
regex implementation is hard to undo, as it'll be used in several places, not 
just one part of Wikibase.



[Wikidata-bugs] [Maniphest] [Commented On] T240884: Standalone service to evaluate user-provided regular expressions

2020-01-08 Thread Krinkle
Krinkle added a comment.


  Based on today's TechCom meeting I've updated the task description to better 
separate the three proposals, and added a **Requirements** section.
  
  I've also fleshed out the re2-based solution description and clarified that 
gRPC is not itself a critical part of any of the solutions. Rather, it is 
brought up here because there is interest from involved parties to try out a 
different way of communicating over the network between client and server. And, 
if we end up with an HTTP-based standalone service, we can consider gRPC as an 
alternative to HTTP/JSON. Though this is mainly an implementation detail and 
not significant in terms of requirements or pros/cons.
