Grammar correction:
On Sat, 30 Mar 2024 at 16:43, Kaxil Naik <kaxiln...@gmail.com> wrote:
> I have this at the end of the email too, but in case folks don't read until
> the end, quoting Maxime from the use-case blog [1]:
>
> "I think people often ask ‘how do I contribute to open source?’, ‘I've got
> to get into the code’, or ‘I’ve got to be an engineer.’ Actually, the very
> simplest thing that you can do is just say, ‘my organization gets real
> value from this piece of software.’ There are a bunch of ways to let the
> people know about it – and now Scarf is there. If your organization is
> getting a lot of value from a piece of open source software, make sure the
> devs know about it."
>
> What kind of edge cases are you thinking about? I don't think it makes
> sense to have "opt-in" at all. Since the goal is to collect data from most
> Airflow installations, except for those that don't want to share data,
> "opt-out" is the only way to maximize it. As long as we don't collect any
> PII, this is compliant as well.
>
> Imagine someone learning Airflow: if they have to opt in via a config,
> they wouldn't even know or care about it, so we would lose most of the data.
> I understand why some orgs & individuals may want to opt out.
>
> Scarf provides tracking pixels (essentially an HTML image tag) that you
> can place in your website or product to track visitors to that URL. If
> there were any concerns about privacy, the ASF wouldn't have approved it at
> all.
>
> A few key details to note about the pixel:
>
> - No PII is tracked… Scarf does not capture/retain IP information…
>   this information is discarded by the platform upon processing/aggregating
> - Scarf pixels respect the Do Not Track (DNT) settings of browsers -
>   these users will not be tracked whatsoever.
>
> All the ASF projects I had listed (whether they use Scarf gateway or Scarf
> pixel in product) are using opt-out.
>
>> 1. Short opt-in period before opt-out.
>> Test this feature with users who
>> trust and if it works great - make it public. I think it's wise to handle
>> edge cases and configure collected data more accurately.
>
> It would be a pixel in the webserver, so it should affect nothing at all,
> even in an air-gapped environment.
>
>> 2. It should not affect anything if access to the internet is restricted
>> which is the default for many companies.
>
> 100% agreed on the below:
>
>> I think we have a very good blueprint to follow including at least 5 other
>> ASF projects that also passed the review of the privacy@asf. And while I
>> understand (and concur) the urge for opt-in by default coming from the
>> consumer market (where it makes perfect sense), Airflow is not consumer
>> software and is used in a "corporate environment", which has slightly
>> different expectations and a broad assumption that the company can make
>> decisions on such telemetry on behalf of the employees using it.
>
> Couldn't agree more; even though what we collect shouldn't hamper
> security (and we should aim to do the same), most security-conscious folks
> don't just upgrade, so we can rely on them to read release notes or
> announcements, and we can make it very clear in our announcements and in
> our installation guides too.
>
>> We should assume that those who deploy and upgrade Airflow actually read
>> and take into account what is written in the release notes - especially if
>> they have security guys breathing down their necks - similarly as we have
>> to assume they follow CVE announcements about security issues fixed. If we
>> are very straightforward and outgoing about the change, and inform very
>> clearly how to opt out, I don't see a big problem with opt-out.
>
> To be clear, the collection of data, or at least the data we should gather
> here, should help all the consumers without violating any regulations.
> I will quote Maxime from the use-case doc [1]:
>
> "*Another Form of Contributing*
> “I think people often ask ‘how do I contribute to open source?’, ‘I've got
> to get into the code’, or ‘I’ve got to be an engineer.’ Actually, the very
> simplest thing that you can do is just say, ‘my organization gets real
> value from this piece of software.’ There are a bunch of ways to let the
> people know about it – and now Scarf is there. If your organization is
> getting a lot of value from a piece of open source software, make sure the
> devs know about it.”"
>
> [1] https://about.scarf.sh/post/scarf-case-study-apache-superset
>
> On Sat, 30 Mar 2024 at 14:02, Alexander Shorin <kxe...@apache.org> wrote:
>
>> Hi Jarek!
>>
>> I understand the reasons for opt-out from the project's point of view. I
>> just suddenly imagined the situation when an upgrade happens and data
>> starts flowing to some third-party service - that's the view from the user
>> side at some big company.
>>
>> There could be good alternatives to handle this:
>> 1. Short opt-in period before opt-out. Test this feature with users who
>> trust and if it works great - make it public. I think it's wise to handle
>> edge cases and configure collected data more accurately.
>> 2. Somehow explicitly warn about this feature so that it does not go
>> unnoticed. Just to reduce possible frustration.
>>
>> Just some personal thoughts for discussion (:
>>
>> --
>> ,,,^..^,,,
>>
>> On Sat, Mar 30, 2024 at 4:36 PM Jarek Potiuk <ja...@potiuk.com> wrote:
>>
>> > Hello everyone,
>> >
>> > it has to be:
>> >
>> > 1. Opt-in by default to not trigger security guys about new unplanned
>> > > activity after regular upgrade.
>> > >
>> >
>> > That's a very good point about security triggering, Alexander, but I am
>> > not so sure it means that we "have to" do opt-in. There are other ways of
>> > communicating with the "deployment managers" who install and upgrade
>> > Airflow - e.g. release notes,
>> > blogs, our social media, Slack announcements etc. We have plenty of
>> > channels we can use to communicate the change.
>> >
>> > I think we have a very good blueprint to follow including at least 5 other
>> > ASF projects that also passed the review of the privacy@asf. And while I
>> > understand (and concur) the urge for opt-in by default coming from the
>> > consumer market (where it makes perfect sense), Airflow is not consumer
>> > software and is used in a "corporate environment", which has slightly
>> > different expectations and a broad assumption that the company can make
>> > decisions on such telemetry on behalf of the employees using it.
>> >
>> > We should assume that those who deploy and upgrade Airflow actually read
>> > and take into account what is written in the release notes - especially if
>> > they have security guys breathing down their necks - similarly as we have
>> > to assume they follow CVE announcements about security issues fixed. If we
>> > are very straightforward and outgoing about the change, and inform very
>> > clearly how to opt out, I don't see a big problem with opt-out.
>> >
>> > We should of course check with privacy@a.o (but I've spent a good deal of
>> > time reading the Superset and other use cases and explanations in detail
>> > to make a better informed decision) - and it looks like they also went the
>> > opt-out way and got cleared by privacy@a.o. And if we cannot reach
>> > consensus, we should - as usual - make a voting decision on it (because
>> > yes, it is an important decision), but - after reading and understanding
>> > why others also did it - for me personally, opt-out is a good path.
>> >
>> > Also because it will rather increase the amount of data to gather, and in
>> > our case - counterintuitively - it will be even better for privacy and
>> > corporate anonymity, because the more data we get, the more difficult it
>> > will be to get any non-statistical/non-aggregated insight from it. Imagine
>> > if only a few corporate users enable it consciously - then we will be
>> > able to draw many more conclusions if we find out who they are than if
>> > everyone has it enabled by default.
>> >
>> > That's my take on it - but again, it's up to us to vote; for me opt-in is
>> > not a "has to", and I am rather for opt-out.
>> >
>> > J.
>> >
>> > > Hi all,
>> > >
>> > >
>> > > > I want to propose gathering telemetry for Airflow installations. As the
>> > > > Airflow community, we have been relying heavily on the yearly Airflow
>> > > > Survey and anecdotes to answer a few key questions about Airflow usage.
>> > > > Questions like the following:
>> > > >
>> > > >
>> > > > - Which versions of Airflow are people installing/using now (i.e.
>> > > >   whether people have primarily made the jump from version X to
>> > > >   version Y)
>> > > > - Which DB is used as the Metadata DB and which version, e.g. Pg 14?
>> > > > - What Python version is being used?
>> > > > - Which Executor is being used?
>> > > > - Approximately how many people out there in the world are installing
>> > > >   Airflow
>> > > >
>> > > >
>> > > > There is a solution that should help answer these questions: Scarf [1].
>> > > > The ASF already approves Scarf [2][3], and it is already used by other
>> > > > ASF projects (Superset [4], Dolphin Scheduler [5], Dubbo Kubernetes,
>> > > > DevLake, Skywalking), as it follows GDPR and other regulations.
>> > > >
>> > > > Similar to Superset, we probably can use it as follows:
>> > > >
>> > > >
>> > > > 1. Install the `scarf js` npm package and bundle it in the Webserver.
>> > > >    When the package is downloaded & the Airflow webserver is opened,
>> > > >    metadata is recorded to the Scarf dashboard.
>> > > > 2. Utilize the Scarf Gateway [6], which we can use in front of Docker
>> > > >    containers. While it’s possible people go around this gateway, we can
>> > > >    probably configure and encourage most traffic to go through these
>> > > >    gateways.
>> > > >
>> > > > While Scarf does not store any personally identifying information from
>> > > > SDK telemetry data, it does send various bits of IP-derived information
>> > > > as outlined here [7]. This data should be made as transparent as
>> > > > possible by granting dashboard access to the Airflow PMC and any other
>> > > > relevant means of sharing/surfacing it that we encounter (Town Hall,
>> > > > Slack, Newsletter etc).
>> > > >
>> > > > The following case studies are worth reading:
>> > > >
>> > > > 1. https://about.scarf.sh/post/scarf-case-study-apache-superset (From Maxime)
>> > > > 2. https://about.scarf.sh/post/haskell-org-bridging-the-gap-between-language-innovation-and-community-understanding
>> > > >
>> > > > Similar to them, this could help in various ways that come with using
>> > > > data for decision-making. With clear guidelines on "how to opt-out"
>> > > > [8][9][10] & "what data is being collected" on the Airflow website, this
>> > > > can be beneficial to the entire community as we would be making more
>> > > > informed decisions.
>> > > >
>> > > > Regards,
>> > > > Kaxil
>> > > >
>> > > >
>> > > > [1] https://about.scarf.sh/
>> > > > [2] https://privacy.apache.org/policies/privacy-policy-public.html
>> > > > [3] https://privacy.apache.org/faq/committers.html
>> > > > [4] https://github.com/apache/superset/issues/25639
>> > > > [5] https://github.com/search?q=repo%3Aapache%2Fdolphinscheduler%20scarf.sh&type=code
>> > > > [6] https://about.scarf.sh/scarf-gateway
>> > > > [7] https://about.scarf.sh/privacy-policy
>> > > > [8] https://superset.apache.org/docs/frequently-asked-questions/#does-superset-collect-any-telemetry-data
>> > > > [9] https://superset.apache.org/docs/installation/installing-superset-using-docker-compose
>> > > > [10] https://docs.scarf.sh/package-analytics/#as-a-user-of-a-package-using-scarf-js-how-can-i-opt-out-of-analytics
>> > > >