[GitHub] [airflow] r-richmond commented on issue #14396: Make context less nebulous
r-richmond commented on issue #14396: URL: https://github.com/apache/airflow/issues/14396#issuecomment-937445404 >I believe so far simply no-one else found it really useful or needed. What is the bar to clear here?.. i.e. 5 thumbs up on opening issue doesn't meet it so I'm curious at what point your opinion would change. >We usually do not plan work which is not "urgent", "important" or "part of bigger feature we work on". Makes sense, ty for info >why don't you create a Draft PR with it and share it here to see how it can look like so that you could reiterate your points showing them with a real code. If you convince others (including myself), we will merge it. If not - we will drop it. I'm interested in seeing it tackled and am willing to take a stab at it. I'm just really busy for the next couple months. I.e. please don't close the issue yet. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@airflow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [airflow] r-richmond commented on issue #14396: Make context less nebulous
r-richmond commented on issue #14396: URL: https://github.com/apache/airflow/issues/14396#issuecomment-937445404 >I believe so far simply no-one else found it really useful or needed. What is the bar to clear here?.. i.e. 5 thumbs up on opening issue doesn't meet it so I'm curious at what point your opinion would change. >We usually do not plan work which is not "urgent", "important" or "part of bigger feature we work on". Makes sense, ty for info >why don't you create a Draft PR with it and share it here to see how it can look like so that you could reiterate your points showing them with a real code. If you convince others (including myself), we will merge it. If not - we will drop it. I'm interested in seeing it tackled and am willing to take a stab at it. I'm just really busy for the next couple months. I.e. please don't close the issue yet. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@airflow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [airflow] r-richmond commented on issue #14396: Make context less nebulous
r-richmond commented on issue #14396: URL: https://github.com/apache/airflow/issues/14396#issuecomment-928642301 @kaxil Noticed this keeps getting pushed. Can you provide any additional context? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@airflow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [airflow] r-richmond commented on issue #14396: Make context less nebulous
r-richmond commented on issue #14396: URL: https://github.com/apache/airflow/issues/14396#issuecomment-928642301 @kaxil Noticed this keeps getting pushed. Can you provide any additional context? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@airflow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [airflow] r-richmond commented on issue #14396: Make context less nebulous
r-richmond commented on issue #14396: URL: https://github.com/apache/airflow/issues/14396#issuecomment-786194069 >Not really. We are not planning to add anything to the context any time soon. and even if we do it's the same for dict/field. If someone extends data class with a new field the problem is the same. >`A safer way for maintainers to add new fields to context` > Not really. it's the same kind of problems you get. I don't follow here. In the V2 dataclass design users would add stuff to the `user_defined` field which has no chance of collision with anything airflow adds later to the base dataclass. The same can't be said for dicts. What am I missing? > `A clean way to implement deprecation warnings with detailed warning messages about potential silent bugs` We do not need deprecation warnings in case we do not change from Dict True regarding the deprecation warning, but isn't their value in warning users about doing things could cause bugs? (i.e. overwriting the wrong keys in context). > `More flexibility down the road (dataclasses are more flexible than dictionaries)` This sentence is meaningless. I argue that dicts are more flexible and probably we would both be right. I like the meet in the middle vibe but I kind of cheated with the `user_defined` field being a dict which I think pushes dataclass to the winners column. > `A solution that is easier to maintain in the future` Again - meaningless - maintenance is also to go trough the hassle of changing and informing users. I think we are just stuck on 2 different points here, you rightly point to the burden of migrating users and I'm stuck on what I perceive are the long-term benefits of switching (safer modifications, better IDE integration usability improvements, warning users about things that could cause bugs). > `Or said another way we shouldn't optimize for airflow 2.x maintainability we should optimize for airflow maintainability.` I do not agree. I carefully weighted pros/cons and as maintainer i agree with @kaxil TypedDict is much better solution and we will have no plans to change to Dataclass. You have not convinced us. I'm just a user so apologies if I came across as forceful. It is clear we have a difference of opinion in which case the maintainer usually and rightly decides upon the path forward for the project. Lastly at this point I'm gonna slow down on this topic for a while to allow other users & maintainers a chance to give their opinion/thoughts. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [airflow] r-richmond commented on issue #14396: Make context less nebulous
r-richmond commented on issue #14396: URL: https://github.com/apache/airflow/issues/14396#issuecomment-786041115 >Yeah. I was expecting exactly this answer :). So summarizing - what you've done now, you created hybrid dataclass and dictionary put together. Now, my goal is to show you what further consequences you have to deal with. [0_O](https://www.youtube.com/watch?v=4F4qzPbcFiA) >There are hundreds of thousands custom operators our there that are using this "public API". This means that any change we introduce now is going to be around at least a year from now. And until then 1.10 is still there as well (and will be there for quite some time). So if someone develops custom operators for their 1.10 Airflow, they will still use dictionary - so we have probably tens of thousands custom operators created still using the 'Dictionary' context for another year or two. Yes, this is true. Thats why I mentioned that now is the "easiest" time for this change as it only gets harder in the future; (In a perfect world, this would have been raised before 2.0 :s). > Most likely many of the currently released operators are using the context to pass data (as custom dictionary values) between those methods - one can set a custom value in pre_execute() and retrieve it in execute() or post_execute() reads whatever execute sets in the context. It was easy to use, we have not forbidden it, it is part of the API (this is the basic "property" of dictionary - unlike dataclass - that you can set any value with any key there). By introducing Dataclass we are breaking this property. You will not be able to set arbitrary key in the context in pre_execute so that it is available in execute. If we implement the interrim (lasting at least a year or more) hybrid dataclass <-> dictionary proposed above, this will continue to work but with deprecation warnings. Again dataclass does not break this property and in fact after thinking, I think dataclasses provides the following significant advantages in this area which dict and typedDict do not. 1. A safer way for users to set custom fields in context 1. A safer way for airflow maintainers to add new fields to context Take the following example Context Dataclass MVP V2 ```python @dataclass class Demo: # context replacement id: str value_dc: int user_defined: Dict[str, Any] = field(default_factory=dict) def __getitem__(self, item): if item in self.__dict__.keys(): logging.warning(msg=f"dictionary interface getitem on context is deprecated; update to use the dataclass interface for standard fields like `{item}`") return self.__dict__[item] elif item in self.user_defined: logging.warning(msg=f"dictionary interface getitem on context is deprecated; update to use context.user_defined for custom fields like `{item}`") return self.user_defined[item] else: raise KeyError def __setitem__(self, key: str, value): if key in self.__dict__.keys(): msg = f"""dictionary interface setitem for standard fields is deprecated; update to use the dataclass interface for standard fields like `{key}` note: changing standard context fields is not supported and may have undefined behavior. If this is meant to be a custom field use context.user_defined instead""" logging.warning(msg=msg) self.__dict__[key] = value else: logging.warning( msg=f"dictionary interface setitem on context is deprecated; update to use context.user_defined for custom fields like `{key}`") self.user_defined[key] = value def keys(self): # added as an example to show how far we could go to have a non-breaking change for 2.1 logging.warning(msg=f"dictionary interface keys is deprecated; update this to use the dataclass interface") temp = self.__dict__ temp.update(self.user_defined) return temp d = Demo(id="long_id", value_dc=1337) print(d["id"]) d["new"] = 3 print(d["new"]) print(d.keys()) d["id"] = "warn" ``` returns ``` WARNING:root:dictionary interface getitem on context is deprecated; update to use the dataclass interface for standard fields like `id` WARNING:root:dictionary interface setitem on context is deprecated; update to use context.user_defined for custom fields like `new` WARNING:root:dictionary interface getitem on context is deprecated; update to use context.user_defined for custom fields like `new` WARNING:root:dictionary interface keys is deprecated; update this to use the dataclass interface WARNING:root:dictionary interface setitem for standard fields is deprecated; update to use the dataclass interface for standard fields
[GitHub] [airflow] r-richmond commented on issue #14396: Make context less nebulous
r-richmond commented on issue #14396: URL: https://github.com/apache/airflow/issues/14396#issuecomment-785517704 >Seems like we end up with ever compiicating interfaces for no benefit other than "newer" approach where we have simpler alternative. I'd argue that the dataclass interface is clear than the dictionary interface and thus simpler to use, although perhaps not as simple to author. >Now, let me see how would you like to implement backward-compatibility of this (this is the more standard case of custom operator. This is just copy of what we have in our documentation: >What solution would be good here if we use Dataclasses, we want to keep backwards compatibility and we want to give users typehint for context. >Any idea how to do it short of converting the Dataclass to TypeDict before calling execute (which would defeat the purpose of having Dataclass in the first place).? First off great challenge :) , 2nd I'd probably replace my earlier suggestion with the following as well ```python from dataclasses import dataclass, asdict @dataclass class Demo: # context replacement id: str value_dc: int def __getitem__(self, item): logging.warning(msg=f"dictionary interface is deprecated please update this to use the dataclass interface") return asdict(self)[item] def items(self): logging.warning(msg=f"dictionary interface is deprecated please update this to use the dataclass interface") return asdict(self).items() def keys(self): logging.warning(msg=f"dictionary interface is deprecated please update this to use the dataclass interface") return asdict(self).keys() def values(self): logging.warning(msg=f"dictionary interface is deprecated please update this to use the dataclass interface") return asdict(self).values() d = Demo(id="long_id", value_dc=1337) print(d["id"]) ``` returns ``` WARNING:root:dictionary interface is deprecated please update this to use the dataclass interface long_id ``` Seems like this would allow us to go straight to dataclasses without changing any functions or operators anywhere. We could then deprecate the dictionary api at the next major version. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [airflow] r-richmond commented on issue #14396: Make context less nebulous
r-richmond commented on issue #14396: URL: https://github.com/apache/airflow/issues/14396#issuecomment-785437346 > Let me explain why this would be a breaking change Surely you can map from Dataclass to Dict in "get_current_context()" method (otherwise you'd have to also correct the context["ti"] into context.ti. Thus it would be a breaking change. Changing `get_current_context` to have the following signature would prevent any breaking changes ```python def get_current_context(return_dataclass: bool=False) -> Union[Dict[str, Any], ~Dataclass.Context]: context: Dataclass = # if return_dataclass: return context else: logging.warning(msg="dictionary context has been deprecated in 2.1 please use return_dataclass = True") return as_dict(context) ``` existing code continues just fine ```python @task def my_task(): context = get_current_context() ti = context["ti"] ``` this would then allow the users to do the following to get type hinting and at some point we could update the option's default value before finally removing it. ```python @task def my_task(return_dataclass=True): context = get_current_context() ti = context.ti ``` >But if you do that (i.e map dataclass to dict), then you loose the type hinting. This is precisely the place where typehinting is needed (in the user code). And I think Typed Dict (if used everywhere including return value from get_current_context() gives you what you need from the user point of view and there is no need to implement the > Dataclass step - as you have everything you need by implementing TypeDict. You're right that the returned dictionary appears to lose the type information. However: 1. No type information is what we have today in user code so no change (although we'd have a nice clear definition of context findable from the function signature). 1. Users would have an easy way to switch to strict typing mode with a simple migration path one function call at a time via the parameter. 3. We would also give them the ability to use the nicer dataclass api & better IDE support Find references with dataclass (shows writes and reads and highlights field names) https://user-images.githubusercontent.com/9246654/109076238-1ba3b500-76af-11eb-8223-5ba6ea09f69f.png;> vs Find references with TypedDict (Misses read and doesn't highlight field names) https://user-images.githubusercontent.com/9246654/109076512-8c4ad180-76af-11eb-8a7c-79609c098abc.png;> 4. Users would also have a clear definition of context (This comes with either TypedDict or Dataclass) just wanted to call out this improvement The advantage to dataclasses is that long term we will be providing users with a better api and developer experience. Making this change now is the easiest it will ever be as it only gets harder later in the project's life cycle. Even in the [Typed Dict Pep 589](https://www.python.org/dev/peps/pep-0589/) Dataclasses are called out as a newer alternative for this use case. `Dataclasses are a more recent alternative to solve this use case` >In a number of cases yes. But in others flexibility of the dict and fact that you can add any value there, trumps. >While we do not encourage this behavior in any way, it is tempting to use the context this way and as a user you could do that and you could rely on context being dictionary able to hold any key you want. Dataclass breaks this assumption. You can do this with dataclasses as well; although as you mentioned it isn't encouraged. The better way would be to subclass the Dataclass but that is up to end users. Demo of adding a field dynamically ```python @dataclass class Demo: id: str value: int d = Demo("a", 1) # can add new fields d.new_field = "hello world" print(d.new_field) # prints hello world ``` One nice thing is that mypy complains about it however the python code still runs `main.py:21: error: "Demo" has no attribute "new_field"` I'm unaware of anything a dictionary can do that a dataclass can't. (I believe dataclasses internally are just dictionaries like a lot of python internals) >Or maybe I am missing something? To summarize here either dataclass or typedDict will have working code but leaning on [Raymond](https://twitter.com/raymondh/status/1175124784339808256?s=20) here in my assessment it feels like `context` is a class/instance problem not a dict problem hence my personal preference to use a dataclass. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact
[GitHub] [airflow] r-richmond commented on issue #14396: Make context less nebulous
r-richmond commented on issue #14396: URL: https://github.com/apache/airflow/issues/14396#issuecomment-785259895 Ah interesting kaxil thanks to your link I can see where a TypedDict / Dataclass implementation would go https://github.com/apache/airflow/blob/352b970010846eb4168aa7b2e332d9a95e2facb3/airflow/models/taskinstance.py#L86 Since the TypedDict implementation is so straightforward given that link I wonder if its better to consider Dict > TypeDict > DataClass as an evolution path. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [airflow] r-richmond commented on issue #14396: Make context less nebulous
r-richmond commented on issue #14396: URL: https://github.com/apache/airflow/issues/14396#issuecomment-785250252 >TypedDict is far more pythonic (Hints are the way how types are usually added in Python) Hints are how dataclasses uses types as well. They aren't enforced at runtime, unless you use something like [pydantic](https://pydantic-docs.helpmanual.io). (side note [Dataclass PEP-557](https://www.python.org/dev/peps/pep-0557/) had support from Raymond & Guido which as far as I can tell gives it a lot of points in the pythonic category. Also Raymond gave a fantastic [talk on Dataclasses](https://www.youtube.com/watch?v=T-TwcmT6Rcw) that I'd recommend if you have time). >We should basically never remove/rename anything in the context (maybe we should use some deprecation mechanism). Yep makes sense to me. Although, this isn't a point for or against dataclasses, typedDicts, or dicts. >The context is used by pretty much every single custom operator out there, and the change you propose will immediately break all of them.. If anything else - this is a single reason why we should not do it. While switching to dataclasses would be harder than TypedDicts it does not have to be a breaking change. You can convert dictionaries to dataclasses and vice versa pretty seamlessly. Although, implementing Dataclasses would certainly be more work to get right in the initial pr. Example showing how you can switch between dataclasses & dictionaries ```Python from dataclasses import dataclass, asdict from dacite import from_dict @dataclass class Demo: id: str value: int d = Demo("a", 1) examples = [d, asdict(d), from_dict(data_class=Demo, data=asdict(d))] for example in examples: print(example, type(example)) ``` prints ``` Demo(id='a', value=1) {'id': 'a', 'value': 1} Demo(id='a', value=1) ``` Personally I'm of the opinion that working with the dataclass api is superior to working with the dictionary api due to the reasons outlined above but [thats just like my opinion man](https://www.youtube.com/watch?v=1vBesOFURek). >But the idea of improving developer's productivity is sound and since we have TypedDict which is backwards compatible and gives most benefits you mentioned Either option could be backwards compatible but I agree that the TypedDict change is more straight forward and is an alternative, hence the tiny blurb about it in the opener. As you mentioned TypedDict should give most of the benefits, although to my knowledge find references does not work as well dictionary keys which is a bummer. Anyways the decision on which to use is above my Airflow status so I'll let the maintainers decide. Most importantly I think either a TypedDict or Dataclass implementation would be a big improvement over the current situation. >Would youl like to pick this up @r-richmond ? Actually, the reason I submitted this issue was because me and a coworker were frustrated with the "magic context dictionary". TLDR I'm not actually sure what all the keys, values, & types are in context so I wouldn't even know how to start this one. My end goal is to be able to do the following when writing callable that can have context passed to them ```Python def generate_is_latest_callable(tasks_if_latest: List[str] , tasks_if_not_latest: List[str]) -> Callable: def result(context: Context) -> List[str]: context. #and be able to get strong autocomplete and typing while in an ide here # because currently all I can do is context: Dict[str, Any] which isn't very helpful if context.something: return tasks_if_latest else: return tasks_if_not_latest return result t_branch = PythonBranchingOperator( task_id="branch", python_callable=generate_is_latest_callable(["yes"], ["no"]), provide_context=True, dag=dag, ) ``` Note: moving this goal to the opening comment This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [airflow] r-richmond commented on issue #14396: Make context less nebulous
r-richmond commented on issue #14396: URL: https://github.com/apache/airflow/issues/14396#issuecomment-784646194 >Dataclass is not pythonic at all As a counter point [Dacite](https://github.com/konradhalas/dacite) is a python library that shows how dataclasses can be used for something like this. Their config dataclass is a similar object to our context and is defined as follows. https://github.com/konradhalas/dacite/blob/d2206b2e4711859da0ea5862c395940f33693e80/dacite/config.py#L5-L12 ```python @dataclass class Config: type_hooks: Dict[Type, Callable[[Any], Any]] = field(default_factory=dict) cast: List[Type] = field(default_factory=list) forward_references: Optional[Dict[str, Any]] = None check_types: bool = True strict: bool = False strict_unions_match: bool = False ``` which is then consumed in their function `from_dict` https://github.com/konradhalas/dacite/blob/d2206b2e4711859da0ea5862c395940f33693e80/dacite/core.py#L34 ```python def from_dict(data_class: Type[T], data: Data, config: Optional[Config] = None) -> T: """Create a data class instance from a dictionary. :param data_class: a data class type :param data: a dictionary of a input data :param config: a configuration of the creation process :return: an instance of a data class ``` As a user it feels so much better to be able to see the config in the function definition and then follow the type definition straight to the config objects strict definition (fields, types, optionally comments). Alternatively when a dictionary is used its harder for an end user to reason about `what keys are present?`, `what are the values?`, `what are the types of the values?`, `where do I find this information?`. (Perhaps its my own ignorance but it isn't obvious to me today where I can look up this info for airflow's context dictionary and I've tried). As a developer tools like [mypy](https://github.com/python/mypy/blob/538d36481526135c44b90383663eaa177cfc32e3/docs/source/additional_features.rst), [pydantic](https://github.com/samuelcolvin/pydantic/) & various ides ([pycharm](https://blog.jetbrains.com/pycharm/2018/04/python-37-introducing-data-class/), vscode) also handle dataclasses quite well which helps prevent frustrating runtime errors. >makes it difficult to evolve Not sure about this one. * To add new fields to data classes you simply add a new field to the dataclass definition with a default value. * To remove old fields you can use an ide's `find references` to remove all the old references and then delete the field from the dataclass. * Optionally, use a static analyzer like mypy before commit/merge to ensure you've gotten rid of all the old references Perhaps I'm still a rookie but for dictionaries it feels much harder to be sure you've gotten all the old key references out of the code, before you can safely remove it. >I think it's very pythonic to have dictionaries in Python and it brings a number of benefits, with little drawback I've called out a couple advantages I think dataclasses have over dictionaries. But I'm likely missing the dictionary advantages. Is Airflow using any of these advantages today? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org