potiuk commented on issue #27476: URL: https://github.com/apache/airflow/issues/27476#issuecomment-1310191484
> @potiuk, the biggest gap with git-sync is that we effectively have no way to poll less frequently. This isn't a concern at small scale, but there is a point with many instances using a monorepo where it becomes problematic.

Hmm. Unless you mean something else, this is entirely possible. You could even set one period for workers and a different one for the scheduler, and even disable it completely for K8S pods (they only need the init container to run the sync once when they are started): https://github.com/kubernetes/git-sync

```
--period <duration>, $GIT_SYNC_PERIOD
    How long to wait between sync attempts. This must be at least 10ms.
    This flag obsoletes --wait, but if --wait is specified, it will take
    precedence. If not specified, this defaults to 10 seconds ("10s").
```

And let's not forget that running scheduler(s) and even workers on persistent volumes perform far more frequent polling.

Polling by GitSync (if no files are modified) is just a single HTTP call that checks the latest commit hash of the branch; if the files were not updated, it just returns "no change". Polling of the persistency server by the scheduler and workers is far worse, because both the scheduler (continuously) and the worker (every time a new task is picked up) scan the remote directory listing of the persistent volume. For example, when it is NFS, which is the vast majority of persistent volume implementations, the scheduler is basically continuously polling and flooding the NFS server, non-stop, with sometimes hundreds or even thousands of roundtrips per second (see my article). Git Sync, on the other hand (unless there are changes coming), is pretty much one very small HTTP request per 10 seconds by default, and you can make the period longer. And even if there are changes, Git does a fantastic job of compressing and sending only the incremental changes that are coming. This is in stark contrast to Persistent Volumes.
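To make the "different period per component" idea concrete, here is a rough sketch in Python. The `SyncPolicy` class, the component names and the replica counts are all hypothetical and exist only for illustration. `GIT_SYNC_PERIOD` comes from the help text quoted above, and `GIT_SYNC_ONE_TIME` is assumed to be the environment form of git-sync's `--one-time` flag, so check it against the git-sync version you run.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class SyncPolicy:
    """Hypothetical per-component git-sync policy (illustrative names only)."""

    replicas: int
    # None means: sync once at start-up (init container) and never poll again.
    period_seconds: Optional[float]

    def env(self) -> Dict[str, str]:
        """Environment variables for this component's git-sync container."""
        if self.period_seconds is None:
            # Assumed env form of git-sync's --one-time flag.
            return {"GIT_SYNC_ONE_TIME": "true"}
        return {"GIT_SYNC_PERIOD": f"{self.period_seconds:g}s"}

    def polls_per_minute(self) -> float:
        """Steady-state polls per minute that this component sends to the Git server."""
        if self.period_seconds is None:
            return 0.0
        return self.replicas * 60.0 / self.period_seconds


# Made-up deployment: numbers are for illustration, not a recommendation.
policies = {
    "scheduler": SyncPolicy(replicas=2, period_seconds=10),
    "worker": SyncPolicy(replicas=20, period_seconds=60),
    "k8s_task_pod": SyncPolicy(replicas=500, period_seconds=None),
}

total_polls_per_minute = sum(p.polls_per_minute() for p in policies.values())
```

With these made-up numbers the Git server sees 32 "anything new?" requests per minute in steady state, and the figure scales down linearly as the periods grow. The estimate ignores the one-off sync each pod performs at start-up.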
Persistent volumes are far less optimised and, in the vast majority of implementations, if a single line changes in your DAG file, the whole DAG file has to be sent. It's very different when you use Git's synchronisation protocol. I'd argue that because of those optimisations, and the fact that Git Sync has a super-efficient way of checking whether "anything" changed, the polling when you have persistency is a few orders of magnitude worse than that of Git.

Question @jedcunningham: Have you ever had a case, or seen a single user complaining, about the "efficiency" of their Git server? I have not seen any. But I have seen many, many, many users complaining that EFS (the NFS-based persistency used in Amazon) was unstable and inefficient for them and that they had to pay a lot for IOPS to make it stable. This is because the Airflow use case is "low number of DAG changes, huge amount of polling of DAG volumes even if no DAG changed". This makes Git a far superior solution. It's not "generally" better, but it's IMHO always better for Airflow.

> There are probably better solutions to that problem, but polling less frequently when you aren't on 1 LocalExecutor is asking for heartache eventually. And "polling less frequently" is a natural knob to reach for, unfortunately, if you hit that case or think you might, try and get ahead of the game, etc.

> Relying on (presumably) syncing from github also introduces risk in your production system, persistence is at least "local" to some degree. I'm generally pro gitsync without persistence, but there are definitely scenarios where it isn't enough.

This can be very easily solved by polling a local Git server. Git has a fantastic, super-efficient synchronisation protocol, and if users are keeping the files in GitHub anyway, it's extremely simple to have a local mirror. As with persistent volumes, all major cloud providers let you set up not only "persistency" but also a "GitHub local mirror" with a single click, simply because this is all built into the Git protocol.
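For anyone who prefers to run such a local mirror by hand rather than through a cloud provider, a minimal sketch could look like the following. It uses only plain `git clone --mirror` and `git remote update --prune`; the function names, the upstream URL and the mirror path are hypothetical, and a real setup would also need scheduling and authentication.

```python
import subprocess
from pathlib import Path
from typing import List


def mirror_commands(upstream_url: str, mirror_dir: Path) -> List[List[str]]:
    """Return the git command(s) needed to create or refresh a bare mirror."""
    # A bare (mirror) repository keeps its HEAD file at the top level.
    if (mirror_dir / "HEAD").exists():
        # Mirror already exists: fetch all refs and drop the deleted ones.
        return [["git", "-C", str(mirror_dir), "remote", "update", "--prune"]]
    # First run: create the mirror.
    return [["git", "clone", "--mirror", upstream_url, str(mirror_dir)]]


def refresh_mirror(upstream_url: str, mirror_dir: Path) -> None:
    """Run the commands; raises CalledProcessError if git fails."""
    for cmd in mirror_commands(upstream_url, mirror_dir):
        subprocess.run(cmd, check=True)


# Example call (hypothetical URL and path), e.g. from a cron job:
# refresh_mirror("https://github.com/example-org/dags.git", Path("/srv/git/dags.git"))
```

git-sync instances would then point at the mirror instead of the upstream repository, so only the mirror itself polls GitHub.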
In case you need to achieve GitHub independence, you just set up a mirror and use that mirror as your Git repo source. Case solved. Works.

Setting up a mirror is also an easy way to distribute the "polling" workload if you actually hit the limits of your Git server. And it is much better and more controllable than shared, persistent volumes. With Git-sync you can set up mirrors and use them if you do not want to use the main repo and "overload" it. With EFS, for example, the only way to turn the knobs when you start hitting the limits is to start paying A LOT for IOPS provisioning. Many of our users complained that, once they reached a certain size, they could not get stability without paying a lot for pre-provisioned IOPS. It was a completely unexpected and huge expense that they had not foreseen, but because they were already invested in persistency, they were simply "forced" to pay it. With the pure Git sync approach you have many more knobs to turn, and they are super easy, both on premises and in the cloud.

I am still not convinced at all of any benefit of Persistency + Git Sync.

* Just persistency - for some cases yes.
* But when you already use Git to store your files, and you want to use GitSync, IMHO adding persistency only makes things worse, and never better.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
