bolkedebruin commented on PR #34729: URL: https://github.com/apache/airflow/pull/34729#issuecomment-1754626593
@jens-scheffler-bosch AFS works on a fundamentally different layer than Datasets do. Datasets are currently just metadata pointing to arbitrary data and are treated as such in Airflow. They are also only evaluated at parsing time, which is a major shortcoming in my opinion. AFS, on the other hand, gives a consistent, generic API on top of storage. While the two might look the same or closely related at first, they are not once you consider that a Dataset could point to a Table, which is a structured way of representing a consistent set of data, which in turn is backed by a (set of) file(s) in a filesystem residing on storage. As mentioned on the AIP, I do think bringing Datasets closer to AFS makes sense, but I also think it is out of scope for this AIP. It is more a "for further research" kind of thing than something to be included now.

On mount points, I am willing to consider letting them go, but I think it is too early for that as well. In the arguments against them I have only seen "it is confusing for users", where "users" means DAG authors _only_. As mentioned, AFS targets DAGs, but also XCom and DAG processing, and that perspective seems not to be included when arguing against mount points.

Then, on the content level, there are other examples that use a virtual file system (as AFS is). The primary one that comes to mind is Databricks FS (dbfs), which does use (virtual) mount points. Their users do not seem to find it confusing, so the available user data does not necessarily point towards "confusing". Another reason is efficiency in **cross** file system operations, for example s3 to gcs, and in **same** file system operations, like s3 to s3, without requiring the user to deal with that (see the first sketch at the end of this comment). Again, this is not considered when arguing against mount points. Finally, mount points allow users **not** to think about Airflow's complexity: by just changing the target of a mount point, your code operates on local storage, s3, or gcs (see the second sketch below).

I am not sure you have read the PR correctly, as it already supports Connections and Providers. `gs, adls, s3` are filesystems obtained from their respective Providers, and Connections are used if provided, or are otherwise obtained from the environment. That is how Operators in general should behave, even though not all of them do. Even if we deal with file operations solely through Paths, we still need to store connection details somewhere. This needs to be done centrally, both to be able to reuse connections and for efficiency. It could be done in a cache without exposing it to the user, or it could be exposed through something like mount points so it can be referenced by the user (see the third sketch below).

I intend to keep the PR updated. There are some disadvantages to doing so, but it keeps the discussion grounded and allows people to play and experiment with it as well. Otherwise we will just be doing a paper exercise on pros and cons that no one has practical experience with. DAG code is in this PR, although the code is not even required to be in a DAG(!). It works the same everywhere, which was the intention.

Note that `Path` and mount points can coexist; they are not mutually exclusive. In fact, as mentioned above, a `Path` interface requires a central registration. Whether or not that registration is exposed is up for discussion.
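To ground the cross/same filesystem point, here is a minimal sketch using plain fsspec (the library AFS builds on). The `copy_url` helper and its strategy selection are illustrative assumptions, not the PR's actual implementation:

```python
import shutil

import fsspec


def copy_url(src_url: str, dst_url: str) -> None:
    """Copy between arbitrary fsspec URLs, picking the cheapest strategy."""
    src_fs, src_path = fsspec.core.url_to_fs(src_url)
    dst_fs, dst_path = fsspec.core.url_to_fs(dst_url)

    if src_fs is dst_fs or src_fs.protocol == dst_fs.protocol:
        # Same filesystem (e.g. s3 -> s3): let the backend do a native,
        # potentially server-side copy instead of pulling the bytes
        # through the worker.
        src_fs.copy(src_path, dst_path)
    else:
        # Cross filesystem (e.g. s3 -> gcs): stream the bytes across,
        # without the user touching either storage SDK directly.
        with src_fs.open(src_path, "rb") as src, dst_fs.open(dst_path, "wb") as dst:
            shutil.copyfileobj(src, dst)


# e.g. copy_url("s3://src-bucket/data.csv", "gs://dst-bucket/data.csv")
```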
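The mount-point portability argument, again as a hedged sketch on top of fsspec: the `mount`/`afs_open` helpers and the mount registry are hypothetical stand-ins for what the PR exposes, not its API.

```python
import fsspec

MOUNTS = {}  # mount point -> backing URL


def mount(source_url: str, mount_point: str) -> None:
    """Register a backing URL under a mount point (hypothetical helper)."""
    MOUNTS[mount_point] = source_url.rstrip("/")


def afs_open(path: str, mode: str = "r"):
    """Open a path relative to its mount point, wherever it actually lives."""
    head, _, rest = path.lstrip("/").partition("/")
    return fsspec.open(f"{MOUNTS['/' + head]}/{rest}", mode).open()


# Development: back the mount with local storage...
mount("file:///tmp", "/warehouse")
# ...production: back the same mount with object storage instead:
# mount("s3://prod-bucket/warehouse", "/warehouse")

# Task code is identical either way; it only ever sees the mount point.
with afs_open("/warehouse/events.json", "w") as f:
    f.write('{"event": "example"}')
```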
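Finally, a minimal sketch of the "central registration" point: whether or not mounts are exposed, a `Path`-only interface still needs one shared place that resolves and reuses (protocol, connection) pairs. `get_filesystem` is illustrative; in Airflow, the respective provider would turn the Connection or environment credentials into fsspec storage options.

```python
from functools import lru_cache
from typing import Optional

import fsspec


@lru_cache(maxsize=None)
def get_filesystem(protocol: str, conn_id: Optional[str] = None):
    """Resolve a filesystem once per (protocol, conn_id) pair and reuse it.

    Hypothetical: in Airflow, conn_id (or the environment) would be turned
    into credentials here; no credentials are passed in this sketch, which
    works for the local filesystem.
    """
    return fsspec.filesystem(protocol)


# Repeated lookups share one cached instance -- connections are reused:
assert get_filesystem("file") is get_filesystem("file")
```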