+1 to this. It would be really useful. As long as we can opt out, I think we’re good.
Best,
Wei

> On Mar 31, 2024, at 12:47 AM, Kaxil Naik <kaxiln...@gmail.com> wrote:
>
> Grammar Correction:
>
>> We should assume that those who deploy and upgrade Airflow - actually read
>> and take into account what is written in the release notes - especially if
>> they have security guys breathing down their necks, similarly as we have to
>> assume they follow CVE announcements about security issues fixed. If we
>> are very straightforward and out-going about the change, inform very
>> clearly how to opt-out, I don't see a big problem with opt-out.
>
> I couldn't agree more; even though we shouldn't collect any data that
> hampers security (and we should aim to do the same), most security concerned
> folks don't just upgrade, and we can rely on them regarding release notes
> or announcements and we can make it very clear in our announcements too;
> and in our installation guides.
>
> On Sat, 30 Mar 2024 at 16:47, Kaxil Naik <kaxiln...@gmail.com> wrote:
>
>> Grammar correction:
>>
>> On Sat, 30 Mar 2024 at 16:43, Kaxil Naik <kaxiln...@gmail.com> wrote:
>>
>>> Have this at the end of the email too: but if folks don't read until the
>>> end and quoting Maxime from the use-case blog[1]:
>>>
>>> "I think people often ask ‘how do I contribute to open source?’, ‘I've
>>> got to get into the code’, or ‘I’ve got to be an engineer.’ Actually, the
>>> very simplest thing that you can do is just say, ‘my organization gets real
>>> value from this piece of software.’ There are a bunch of ways to let the
>>> people know about it – and now Scarf is there. If your organization is
>>> getting a lot of value from a piece of open source software, make sure the
>>> devs know about it."
>>>
>>> What kind of edge cases are you thinking about? I don't think it makes
>>> sense to have "opt-in" at all. As the goal is to collect data for most
>>> Airflow installations except for those that don't want to give data, then
>>> "opt-out" is the only way to maximize it.
>>> As long as we don't collect any PII data, this is in compliance as well.
>>>
>>> Imagine someone learning Airflow, if they have to opt-in via a config,
>>> they wouldn't even know or care about it, hence us losing most of the data.
>>> I understand why some orgs & individuals may want to opt-out.
>>>
>>> Scarf provides tracking pixels (essentially an HTML image tag) that you
>>> can place in your website or product to track visitors to that URL. If
>>> there were any concerns about Privacy, ASF wouldn't have approved it at all.
>>>
>>> A few key details to note about the pixel:
>>>
>>> - No PII is tracked… Scarf does not capture/retain IP information…
>>>   this information is discarded by the platform upon processing/aggregating
>>> - Scarf pixels respect the Do Not Track (DNT) settings of browsers -
>>>   these users will not be tracked whatsoever.
>>>
>>> All the ASF projects I had listed (whether they use Scarf gateway or
>>> Scarf pixel in product) are using opt-out.
>>>
>>>> 1. Short opt-in period before opt-out. Test this feature with users who
>>>> trust and if it works great - make it public. I think it's wise to handle
>>>> edge cases and configure collected data more accurately.
>>>
>>> It would be a pixel in the webserver, should affect nothing at all even
>>> in an air-gapped environment.
>>>
>>>> 2. It should not affect anything if access to the internet is restricted
>>>> which is default for many companies.
>>>
>>> 100% agreed on the below:
>>>
>>>> I think we have a very good blueprint to follow including at least 5 other
>>>> ASF projects that also passed the review of the privacy@asf.
>>>> And while I understand (and concur) the urge for opt-in by default coming
>>>> from consumer market (where it makes perfect sense) Airflow is not a consumer
>>>> software and is used in "corporate environment" which has a little
>>>> different expectations and broad assumption that the company can make
>>>> decisions on such telemetry on behalf of the employees using it.
>>>
>>> Couldn't agree more; even though there shouldn't we collect hamper
>>> security (and we should aim to do the same), most security concerned folks
>>> don't just upgrade, and we can rely on them regarding release notes or
>>> announcements and we can make it very clear in our announcements too; and
>>> in our installation guides.
>>>
>>>> We should assume that those who deploy and upgrade Airflow - actually read
>>>> and take into account what is written in the release notes - especially if
>>>> they have security guys breathing down their necks, similarly as we have to
>>>> assume they follow CVE announcements about security issues fixed. If we
>>>> are very straightforward and out-going about the change, inform very
>>>> clearly how to opt-out, I don't see a big problem with opt-out.
>>>
>>> To be clear, the collection of data, or at least the data we should
>>> gather here should help all the consumers without violating any
>>> regulations. I will quote Maxime's quote in the use-case doc [1]
>>>
>>> "*Another Form of Contributing*
>>> “I think people often ask ‘how do I contribute to open source?’, ‘I've
>>> got to get into the code’, or ‘I’ve got to be an engineer.’ Actually, the
>>> very simplest thing that you can do is just say, ‘my organization gets real
>>> value from this piece of software.’ There are a bunch of ways to let the
>>> people know about it – and now Scarf is there.
>>> If your organization is getting a lot of value from a piece of open source
>>> software, make sure the devs know about it.”"
>>>
>>> [1] https://about.scarf.sh/post/scarf-case-study-apache-superset
>>>
>>> On Sat, 30 Mar 2024 at 14:02, Alexander Shorin <kxe...@apache.org> wrote:
>>>
>>>> Hi Jarek!
>>>>
>>>> I understand the reasons for opt-out from a project view. I just suddenly
>>>> imagined the situation when an upgrade happens and here comes the data to
>>>> some third party service - that's a view from a user side of some big
>>>> company.
>>>>
>>>> There could be good alternatives to handle this:
>>>> 1. Short opt-in period before opt-out. Test this feature with users who
>>>> trust and if it works great - make it public. I think it's wise to handle
>>>> edge cases and configure collected data more accurately.
>>>> 2. Explicitly somehow warn about this feature to make this feature not get
>>>> unnoticed. Just to reduce possible frustration.
>>>>
>>>> Just some personal thoughts for discussion (:
>>>>
>>>> --
>>>> ,,,^..^,,,
>>>>
>>>> On Sat, Mar 30, 2024 at 4:36 PM Jarek Potiuk <ja...@potiuk.com> wrote:
>>>>
>>>>> Hello everyone,
>>>>>
>>>>> it has to be:
>>>>>
>>>>>> 1. Opt-in by default to not trigger security guys about new unplanned
>>>>>> activity after regular upgrade.
>>>>>
>>>>> That's a very good point about security triggering Alexander, but I am not
>>>>> so sure it means that we "have to" do opt-in. There are other ways of
>>>>> communicating with the "deployment managers" who install and upgrade
>>>>> airflow - i.e. release notes, blogs, social media of ours, slack
>>>>> announcements etc. We have plenty of channels we can use to communicate the
>>>>> change.
>>>>>
>>>>> I think we have a very good blueprint to follow including at least 5 other
>>>>> ASF projects that also passed the review of the privacy@asf.
>>>>> And while I understand (and concur) the urge for opt-in by default coming
>>>>> from consumer market (where it makes perfect sense) Airflow is not a consumer
>>>>> software and is used in "corporate environment" which has a little
>>>>> different expectations and broad assumption that the company can make
>>>>> decisions on such telemetry on behalf of the employees using it.
>>>>>
>>>>> We should assume that those who deploy and upgrade Airflow - actually read
>>>>> and take into account what is written in the release notes - especially if
>>>>> they have security guys breathing down their necks, similarly as we have to
>>>>> assume they follow CVE announcements about security issues fixed. If we
>>>>> are very straightforward and out-going about the change, inform very
>>>>> clearly how to opt-out, I don't see a big problem with opt-out.
>>>>>
>>>>> We should of course check with privacy@a.o (but I've spent a good deal of
>>>>> time reading the Superset and other use cases and explanations in detail to
>>>>> make a better informed decision) - and it looks like they also went opt-out
>>>>> way and got cleared by privacy@a.o. And if we cannot reach consensus, we
>>>>> should - as usual - make a voting decision on it (because yes, it is an
>>>>> important decision), but - after reading and understanding why others also
>>>>> did it - for me personally, opt-out is a good path.
>>>>>
>>>>> Also because it will rather increase the amount of data to gather, and in
>>>>> our case - counterintuitively - it will be even better for privacy and
>>>>> corporate anonymity, because the more data we get, the more difficult it
>>>>> will be to get any non-statistical/non-aggregated insight from it. Imagine
>>>>> if only a few corporate users will enable it consciously - then we will be
>>>>> able to draw many more conclusions if we find out who they are, than if
>>>>> everyone has it enabled by default.
>>>>>
>>>>> That's my take on it - but again, it's up to us to vote, for me opt-in is
>>>>> not "has to", and I am rather for opt-out.
>>>>>
>>>>> J.
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>>> I want to propose gathering telemetry for Airflow installations. As the
>>>>>>> Airflow community, we have been relying heavily on the yearly Airflow
>>>>>>> Survey and anecdotes to answer a few key questions about Airflow usage.
>>>>>>> Questions like the following:
>>>>>>>
>>>>>>> - Which versions of Airflow are people installing/using now (i.e.
>>>>>>>   whether people have primarily made the jump from version X to version Y)
>>>>>>> - Which DB is used as the Metadata DB and which version, e.g. Pg 14?
>>>>>>> - What Python version is being used?
>>>>>>> - Which Executor is being used?
>>>>>>> - Approximately how many people out there in the world are installing
>>>>>>>   Airflow
>>>>>>>
>>>>>>> There is a solution that should help answer these questions: Scarf [1]. The
>>>>>>> ASF already approves Scarf [2][3], and it is already used by other ASF
>>>>>>> projects: Superset [4], Dolphin Scheduler [5], Dubbo Kubernetes, DevLake,
>>>>>>> Skywalking as it follows GDPR and other regulations.
>>>>>>>
>>>>>>> Similar to Superset, we probably can use it as follows:
>>>>>>>
>>>>>>> 1. Install the `scarf js` npm package and bundle it in the Webserver.
>>>>>>>    When the package is downloaded & Airflow webserver is opened, metadata is
>>>>>>>    recorded to the Scarf dashboard.
>>>>>>> 2. Utilize the Scarf Gateway [6], which we can use in front of docker
>>>>>>>    containers. While it’s possible people go around this gateway, we can
>>>>>>>    probably configure and encourage most traffic to go through these
>>>>>>>    gateways.
>>>>>>>
>>>>>>> While Scarf does not store any personally identifying information from SDK
>>>>>>> telemetry data, it does send various bits of IP-derived information as
>>>>>>> outlined here [7]. This data should be made as transparent as possible by
>>>>>>> granting dashboard access to the Airflow PMC and any other relevant means
>>>>>>> of sharing/surfacing it that we encounter (Town Hall, Slack, Newsletter
>>>>>>> etc).
>>>>>>>
>>>>>>> The following case studies are worth reading:
>>>>>>>
>>>>>>> 1. https://about.scarf.sh/post/scarf-case-study-apache-superset (From
>>>>>>>    Maxime)
>>>>>>> 2. https://about.scarf.sh/post/haskell-org-bridging-the-gap-between-language-innovation-and-community-understanding
>>>>>>>
>>>>>>> Similar to them, this could help in various ways that come with using data
>>>>>>> for decision-making. With clear guidelines on "how to opt-out" [8][9][10] &
>>>>>>> "what data is being collected" on the Airflow website, this can be
>>>>>>> beneficial to the entire community as we would be making more informed
>>>>>>> decisions.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Kaxil
>>>>>>>
>>>>>>> [1] https://about.scarf.sh/
>>>>>>> [2] https://privacy.apache.org/policies/privacy-policy-public.html
>>>>>>> [3] https://privacy.apache.org/faq/committers.html
>>>>>>> [4] https://github.com/apache/superset/issues/25639
>>>>>>> [5] https://github.com/search?q=repo%3Aapache%2Fdolphinscheduler%20scarf.sh&type=code
>>>>>>> [6] https://about.scarf.sh/scarf-gateway
>>>>>>> [7] https://about.scarf.sh/privacy-policy
>>>>>>> [8] https://superset.apache.org/docs/frequently-asked-questions/#does-superset-collect-any-telemetry-data
>>>>>>> [9] https://superset.apache.org/docs/installation/installing-superset-using-docker-compose
>>>>>>> [10] https://docs.scarf.sh/package-analytics/#as-a-user-of-a-package-using-scarf-js-how-can-i-opt-out-of-analytics

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@airflow.apache.org
For additional commands, e-mail: dev-h...@airflow.apache.org