dstandish commented on issue #25186: URL: https://github.com/apache/airflow/issues/25186#issuecomment-1191854525
I am now thinking that we should abandon URI as the identifier for datasets.

I think part of the motivation for choosing URI was [this document in openlineage](https://github.com/OpenLineage/OpenLineage/blob/main/spec/Naming.md). But openlineage doesn't actually use URI for datasets! The document is misleading. In OpenLineage, what defines a dataset is actually `namespace + name`. To me, this is much better, and I think we should change Dataset to do something similar to that, if not exactly that.

As we've seen, URI is problematic for a number of reasons. To name a few:

- It isn't "uniquely" defined by a string: there are _many_ URIs that correspond to the exact same resource.
- It's not human friendly, and really requires that you _also_ have a friendly _name_ to refer to it with (e.g. to present in the UI).
- It can be confused for a "connection string", when we're merely looking for a label for a dataset.

URI makes more sense as something generated by machines, not as the human-entered label for a dataset. But openlineage has no need for it anyway.

I plan to look at some other lineage standards to see how they identify datasets. But for now I am really feeling that we shouldn't use URI for this purpose.
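
To make the uniqueness point concrete, here is a minimal sketch. The three example URIs are equivalent under the normalization rules of RFC 3986 (case-insensitive scheme and host, default port), yet they are three different strings. `DatasetId` is a hypothetical class invented for this illustration; it is not Airflow's `Dataset` and not an OpenLineage API, and the `warehouse` / `sales.orders` values are made up.

```python
from dataclasses import dataclass

# Three spellings of one resource: RFC 3986 treats the scheme and host as
# case-insensitive, and port 80 is the default for http.
EQUIVALENT_URIS = [
    "http://example.com/data.csv",
    "HTTP://EXAMPLE.COM/data.csv",
    "http://example.com:80/data.csv",
]


def distinct_as_strings(uris):
    """Count how many distinct identifiers a plain string comparison sees."""
    return len(set(uris))


@dataclass(frozen=True)
class DatasetId:
    """Hypothetical identifier: a dataset is simply namespace + name."""

    namespace: str
    name: str

    def label(self) -> str:
        # A human-friendly label, e.g. for display in a UI.
        return f"{self.namespace}/{self.name}"


orders_a = DatasetId(namespace="warehouse", name="sales.orders")
orders_b = DatasetId(namespace="warehouse", name="sales.orders")

print(distinct_as_strings(EQUIVALENT_URIS))  # three "different" identifiers
print(orders_a == orders_b)                  # equal, field by field
print(orders_a.label())
```

A `namespace + name` pair is still compared literally, but it carries no implied equivalence rules the way a URI does, so two labels are the same dataset exactly when their fields match.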