[GitHub] [airflow] r-richmond commented on issue #14396: Make context less nebulous

2021-10-07 Thread GitBox


r-richmond commented on issue #14396:
URL: https://github.com/apache/airflow/issues/14396#issuecomment-937445404


   >I believe so far simply no-one else found it really useful or needed. 
   
   What is the bar to clear here?.. i.e. 5 thumbs up on opening issue doesn't 
meet it so I'm curious at what point your opinion would change.
   
   >We usually do not plan work which is not "urgent", "important" or "part of 
bigger feature we work on".
   
   Makes sense, ty for info
   
   >why don't you create a Draft PR with it and share it here to see how it can 
look like so that you could reiterate your points showing them with a real 
code. If you convince others (including myself), we will merge it. If not - we 
will drop it.
   
   I'm interested in seeing it tackled and am willing to take a stab at it. I'm 
just really busy for the next couple months. I.e. please don't close the issue 
yet.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@airflow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [airflow] r-richmond commented on issue #14396: Make context less nebulous

2021-10-06 Thread GitBox


r-richmond commented on issue #14396:
URL: https://github.com/apache/airflow/issues/14396#issuecomment-937445404


   >I believe so far simply no-one else found it really useful or needed. 
   
   What is the bar to clear here?.. i.e. 5 thumbs up on opening issue doesn't 
meet it so I'm curious at what point your opinion would change.
   
   >We usually do not plan work which is not "urgent", "important" or "part of 
bigger feature we work on".
   
   Makes sense, ty for info
   
   >why don't you create a Draft PR with it and share it here to see how it can 
look like so that you could reiterate your points showing them with a real 
code. If you convince others (including myself), we will merge it. If not - we 
will drop it.
   
   I'm interested in seeing it tackled and am willing to take a stab at it. I'm 
just really busy for the next couple months. I.e. please don't close the issue 
yet.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@airflow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [airflow] r-richmond commented on issue #14396: Make context less nebulous

2021-09-28 Thread GitBox


r-richmond commented on issue #14396:
URL: https://github.com/apache/airflow/issues/14396#issuecomment-928642301


   @kaxil Noticed this keeps getting pushed. Can you provide any additional 
context?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@airflow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [airflow] r-richmond commented on issue #14396: Make context less nebulous

2021-09-27 Thread GitBox


r-richmond commented on issue #14396:
URL: https://github.com/apache/airflow/issues/14396#issuecomment-928642301


   @kaxil Noticed this keeps getting pushed. Can you provide any additional 
context?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@airflow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [airflow] r-richmond commented on issue #14396: Make context less nebulous

2021-02-25 Thread GitBox


r-richmond commented on issue #14396:
URL: https://github.com/apache/airflow/issues/14396#issuecomment-786194069


   >Not really. We are not planning to add anything to the context any time 
soon. and even if we do it's the same for dict/field. If someone extends data 
class with a new field the problem is the same.
   
   >`A safer way for maintainers to add new fields to context`
   > Not really. it's the same kind of problems you get.
   
   I don't follow here. In the V2 dataclass design users would add stuff to the 
`user_defined` field which has no chance of collision with anything airflow 
adds later to the base dataclass. The same can't be said for dicts. What am I 
missing?
   
   > `A clean way to implement deprecation warnings with detailed warning 
messages about potential silent bugs`
   We do not need deprecation warnings in case we do not change from Dict
   
   True regarding the deprecation warning, but isn't their value in warning 
users about doing things could cause bugs? (i.e. overwriting the wrong keys in 
context).
   
   > `More flexibility down the road (dataclasses are more flexible than 
dictionaries)`
   This sentence is meaningless. I argue that dicts are more flexible and 
probably we would both be right.
   
   I like the meet in the middle vibe but I kind of cheated with the 
`user_defined` field being a dict which I think pushes dataclass to the winners 
column.
   
   > `A solution that is easier to maintain in the future`
   Again - meaningless - maintenance is also to go trough the hassle of 
changing and informing users.
   
   I think we are just stuck on 2 different points here, you rightly point to 
the burden of migrating users and I'm stuck on what I perceive are the 
long-term benefits of switching (safer modifications, better IDE integration 
usability improvements, warning users about things that could cause bugs).
   
   
   > `Or said another way we shouldn't optimize for airflow 2.x maintainability 
we should optimize for airflow maintainability.`
   I do not agree. I carefully weighted pros/cons and as maintainer i agree 
with @kaxil TypedDict is much better solution and we will have no plans to 
change to Dataclass. You have not convinced us.
   
   I'm just a user so apologies if I came across as forceful. It is clear we 
have a difference of opinion in which case the maintainer usually and rightly 
decides upon the path forward for the project. 
   
   Lastly at this point I'm gonna slow down on this topic for a while to allow 
other users & maintainers a chance to give their opinion/thoughts.
   
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [airflow] r-richmond commented on issue #14396: Make context less nebulous

2021-02-25 Thread GitBox


r-richmond commented on issue #14396:
URL: https://github.com/apache/airflow/issues/14396#issuecomment-786041115


   >Yeah. I was expecting exactly this answer :). So summarizing - what you've 
done now, you created hybrid dataclass and dictionary put together. Now, my 
goal is to show you what further consequences you have to deal with.
   
   [0_O](https://www.youtube.com/watch?v=4F4qzPbcFiA)
   
   >There are hundreds of thousands custom operators our there that are using 
this "public API". This means that any change we introduce now is going to be 
around at least a year from now. And until then 1.10 is still there as well 
(and will be there for quite some time). So if someone develops custom 
operators for their 1.10 Airflow, they will still use dictionary - so we have 
probably tens of thousands custom operators created still using the 
'Dictionary' context for another year or two.
   
   Yes, this is true. Thats why I mentioned that now is the "easiest" time for 
this change as it only gets harder in the future; (In a perfect world, this 
would have been raised before 2.0 :s).
   
   > Most likely many of the currently released operators are using the context 
to pass data (as custom dictionary values) between those methods - one can set 
a custom value in pre_execute() and retrieve it in execute() or post_execute() 
reads whatever execute sets in the context. It was easy to use, we have not 
forbidden it, it is part of the API (this is the basic "property" of dictionary 
- unlike dataclass - that you can set any value with any key there). By 
introducing Dataclass we are breaking this property. You will not be able to 
set arbitrary key in the context in pre_execute so that it is available in 
execute. If we implement the interrim (lasting at least a year or more) hybrid 
dataclass <-> dictionary proposed above, this will continue to work but with 
deprecation warnings.
   
   Again dataclass does not break this property and in fact after thinking, I 
think dataclasses provides the following significant advantages in this area 
which dict and typedDict do not.
   1. A safer way for users to set custom fields in context
   1. A safer way for airflow maintainers to add new fields to context
   
   Take the following example
   
   Context Dataclass MVP V2
   
   ```python
   
   @dataclass
   class Demo:  # context replacement
   id: str
   value_dc: int
   user_defined: Dict[str, Any] = field(default_factory=dict)
   
   def __getitem__(self, item):
   if item in self.__dict__.keys():
   logging.warning(msg=f"dictionary interface getitem on context is 
deprecated; update to use the dataclass interface for standard fields like 
`{item}`")
   return self.__dict__[item]
   elif item in self.user_defined:
   logging.warning(msg=f"dictionary interface getitem on context is 
deprecated; update to use context.user_defined for custom fields like `{item}`")
   return self.user_defined[item]
   else:
   raise KeyError
   
   def __setitem__(self, key: str, value):
   if key in self.__dict__.keys():
   msg = f"""dictionary interface setitem for standard fields is 
deprecated; update to use the dataclass interface for standard fields like 
`{key}`
   note: changing standard context fields is not supported and may 
have undefined behavior. If this is meant to be a custom field use 
context.user_defined instead"""
   logging.warning(msg=msg)
   self.__dict__[key] = value
   else:
   logging.warning(
   msg=f"dictionary interface setitem on context is deprecated; 
update to use context.user_defined for custom fields like `{key}`")
   self.user_defined[key] = value
   
   def keys(self):
   # added as an example to show how far we could go to have a 
non-breaking change for 2.1
   logging.warning(msg=f"dictionary interface keys is deprecated; 
update this to use the dataclass interface")
   temp = self.__dict__
   temp.update(self.user_defined)
   return temp
   
   
   d = Demo(id="long_id", value_dc=1337)
   print(d["id"])
   d["new"] = 3
   print(d["new"])
   print(d.keys())
   d["id"] = "warn"
   ```
   returns
   ```
   WARNING:root:dictionary interface getitem on context is deprecated; update 
to use the dataclass interface for standard fields like `id`
   WARNING:root:dictionary interface setitem on context is deprecated; update 
to use context.user_defined for custom fields like `new`
   WARNING:root:dictionary interface getitem on context is deprecated; update 
to use context.user_defined for custom fields like `new`
   WARNING:root:dictionary interface keys is deprecated; update this to use the 
dataclass interface
   WARNING:root:dictionary interface setitem for standard fields is deprecated; 
update to use the dataclass interface for standard fields 

[GitHub] [airflow] r-richmond commented on issue #14396: Make context less nebulous

2021-02-24 Thread GitBox


r-richmond commented on issue #14396:
URL: https://github.com/apache/airflow/issues/14396#issuecomment-785517704


   >Seems like we end up with ever compiicating interfaces for no benefit other 
than "newer" approach where we have simpler alternative.
   
   I'd argue that the dataclass interface is clear than the dictionary 
interface and thus simpler to use, although perhaps not as simple to author.
   
   >Now, let me see how would you like to implement backward-compatibility of 
this (this is the more standard case of custom operator. This is just copy of 
what we have in our documentation:
   
   >What solution would be good here if we use Dataclasses, we want to keep 
backwards compatibility and we want to give users typehint for context.
   
   >Any idea how to do it short of converting the Dataclass to TypeDict before 
calling execute (which would defeat the purpose of having Dataclass in the 
first place).?
   
   First off great challenge :) , 2nd I'd probably replace my earlier 
suggestion with the following as well
   
   ```python
   from dataclasses import dataclass, asdict
   
   @dataclass
   class Demo:  # context replacement
   id: str
   value_dc: int
   
   def __getitem__(self, item):
   logging.warning(msg=f"dictionary interface is deprecated please 
update this to use the dataclass interface")
   return asdict(self)[item]
   
   def items(self):
   logging.warning(msg=f"dictionary interface is deprecated please 
update this to use the dataclass interface")
   return asdict(self).items()
   
   def keys(self):
   logging.warning(msg=f"dictionary interface is deprecated please 
update this to use the dataclass interface")
   return asdict(self).keys()
   
   def values(self):
   logging.warning(msg=f"dictionary interface is deprecated please 
update this to use the dataclass interface")
   return asdict(self).values()
   
   d = Demo(id="long_id", value_dc=1337)
   print(d["id"])
   ``` 
   returns
   ```
   WARNING:root:dictionary interface is deprecated please update this to use 
the dataclass interface
   long_id
   ```
   
   Seems like this would allow us to go straight to dataclasses without 
changing any functions or operators anywhere. We could then deprecate the 
dictionary api at the next major version. 
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [airflow] r-richmond commented on issue #14396: Make context less nebulous

2021-02-24 Thread GitBox


r-richmond commented on issue #14396:
URL: https://github.com/apache/airflow/issues/14396#issuecomment-785437346


   > Let me explain why this would be a breaking change Surely you can map 
from Dataclass to Dict in "get_current_context()" method (otherwise you'd have 
to also correct the context["ti"] into context.ti. Thus it would be a breaking 
change.
   
   Changing `get_current_context` to have the following 
signature would prevent any breaking changes
   
   ```python
   def get_current_context(return_dataclass: bool=False) -> Union[Dict[str, 
Any], ~Dataclass.Context]:
 context: Dataclass = #
 if return_dataclass:
   return context
 else:
   logging.warning(msg="dictionary context has been deprecated in 2.1 
please use return_dataclass = True")
   return as_dict(context)
   ```
   existing code continues just fine
   ```python
   @task
   def my_task():
   context = get_current_context()
   ti = context["ti"]
   ```
   this would then allow the users to do the following to get type hinting and 
at some point we could update the option's default value before finally 
removing it.
   ```python
   @task
   def my_task(return_dataclass=True):
   context = get_current_context()
   ti = context.ti
   ```
   
   
   >But if you do that (i.e map dataclass to dict), then you loose the type 
hinting. This is precisely the place where typehinting is needed (in the user 
code). And I think Typed Dict (if used everywhere including return value from 
get_current_context() gives you what you need from the user point of view and 
there is no need to implement the > Dataclass step - as you have everything you 
need by implementing TypeDict.
   
   You're right that the returned dictionary appears to lose the type 
information. However:
   1. No type information is what we have today in user code so no change 
(although we'd have a nice clear definition of context findable from the 
function signature).
   1. Users would have an easy way to switch to strict typing mode with a 
simple migration path one function call at a time via the parameter.
   3. We would also give them the ability to use the nicer 
dataclass api & better IDE support
   Find references with dataclass (shows writes and reads and highlights field 
names)
   https://user-images.githubusercontent.com/9246654/109076238-1ba3b500-76af-11eb-8223-5ba6ea09f69f.png;>
   vs Find references with TypedDict (Misses read and doesn't highlight field 
names)
   https://user-images.githubusercontent.com/9246654/109076512-8c4ad180-76af-11eb-8a7c-79609c098abc.png;>
   
   
   
   4. Users would also have a clear definition of context (This comes with 
either TypedDict or Dataclass) just wanted to call out this improvement
   
   The advantage to dataclasses is that long term we will be providing users 
with a better api and developer experience. Making this change now is the 
easiest it will ever be as it only gets harder later in the project's life 
cycle. 
   
   Even in the [Typed Dict Pep 589](https://www.python.org/dev/peps/pep-0589/) 
Dataclasses are called out as a newer alternative for this use case. 
`Dataclasses are a more recent alternative to solve this use case`
   
   >In a number of cases yes. But in others flexibility of the dict and fact 
that you can add any value there, trumps.
   
   >While we do not encourage this behavior in any way, it is tempting to use 
the context this way and as a user you could do that and you could rely on 
context being dictionary able to hold any key you want. Dataclass breaks this 
assumption.
   
   You can do this with dataclasses as well; although as you mentioned it isn't 
encouraged. The better way would be to subclass the Dataclass but that is up to 
end users.
   
Demo of adding a field dynamically
   
   ```python
   @dataclass
   class Demo:
   id: str
   value: int
   
   
   d = Demo("a", 1)
   
   # can add new fields
   d.new_field = "hello world"
   print(d.new_field)
   
   # prints hello world
   ```
   
   One nice thing is that mypy complains about it however the python code still 
runs
   `main.py:21: error: "Demo" has no attribute "new_field"`
   
   
   
   I'm unaware of anything a dictionary can do that a dataclass can't. (I 
believe dataclasses internally are just dictionaries like a lot of python 
internals)
   
   >Or maybe I am missing something?
   
   To summarize here either dataclass or typedDict will have working code but 
leaning on 
[Raymond](https://twitter.com/raymondh/status/1175124784339808256?s=20) here in 
my assessment it feels like `context` is a class/instance problem not a dict 
problem hence my personal preference to use a dataclass. 
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact 

[GitHub] [airflow] r-richmond commented on issue #14396: Make context less nebulous

2021-02-24 Thread GitBox


r-richmond commented on issue #14396:
URL: https://github.com/apache/airflow/issues/14396#issuecomment-785259895


   Ah interesting kaxil thanks to your link I can see where a TypedDict / 
Dataclass implementation would go
   
   
https://github.com/apache/airflow/blob/352b970010846eb4168aa7b2e332d9a95e2facb3/airflow/models/taskinstance.py#L86
   
   Since the TypedDict implementation is so straightforward given that link I 
wonder if its better to consider Dict > TypeDict > DataClass as an evolution 
path.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [airflow] r-richmond commented on issue #14396: Make context less nebulous

2021-02-24 Thread GitBox


r-richmond commented on issue #14396:
URL: https://github.com/apache/airflow/issues/14396#issuecomment-785250252


   >TypedDict is far more pythonic (Hints are the way how types are usually 
added in Python)
   
   Hints are how dataclasses uses types as well. They aren't enforced at 
runtime, unless you use something like 
[pydantic](https://pydantic-docs.helpmanual.io). (side note [Dataclass 
PEP-557](https://www.python.org/dev/peps/pep-0557/) had support from Raymond & 
Guido which as far as I can tell gives it a lot of points in the pythonic 
category. Also Raymond gave a fantastic [talk on 
Dataclasses](https://www.youtube.com/watch?v=T-TwcmT6Rcw) that I'd recommend if 
you have time). 
   
   >We should basically never remove/rename anything in the context (maybe we 
should use some deprecation mechanism).
   
   Yep makes sense to me. Although, this isn't a point for or against 
dataclasses, typedDicts, or dicts.
   
   >The context is used by pretty much every single custom operator out there, 
and the change you propose will immediately break all of them.. If anything 
else - this is a single reason why we should not do it.
   
   While switching to dataclasses would be harder than TypedDicts it does not 
have to be a breaking change. You can convert dictionaries to dataclasses and 
vice versa pretty seamlessly. Although, implementing Dataclasses would 
certainly be more work to get right in the initial pr.
   
   Example showing how you can switch between dataclasses & 
dictionaries
 
 ```Python
   from dataclasses import dataclass, asdict
   from dacite import from_dict
   
   
   @dataclass
   class Demo:
   id: str
   value: int
   
   
   d = Demo("a", 1)
   
   examples = [d, asdict(d), from_dict(data_class=Demo, data=asdict(d))]
   for example in examples:
   print(example, type(example))
   ```
   prints
   ```
   Demo(id='a', value=1) 
   {'id': 'a', 'value': 1} 
   Demo(id='a', value=1) 
   ```
   
   
   Personally I'm of the opinion that working with the dataclass api is 
superior to working with the dictionary api due to the reasons outlined above 
but [thats just like my opinion 
man](https://www.youtube.com/watch?v=1vBesOFURek). 
   
   >But the idea of improving developer's productivity is sound and since we 
have TypedDict which is backwards compatible and gives most benefits you 
mentioned
   
   Either option could be backwards compatible but I agree that the TypedDict 
change is more straight forward and is an alternative, hence the tiny blurb 
about it in the opener.
   
   As you mentioned TypedDict should give most of the benefits, although to my 
knowledge find references does not work as well dictionary keys which is a 
bummer. Anyways the decision on which to use is above my Airflow status so I'll 
let the maintainers decide. Most importantly I think either a TypedDict or 
Dataclass implementation would be a big improvement over the current situation.
   
   >Would youl like to pick this up @r-richmond ?
   
   Actually, the reason I submitted this issue was because me and a coworker 
were frustrated with the "magic context dictionary". TLDR I'm not actually sure 
what all the keys, values, & types are in context so I wouldn't even know how 
to start this one.
   
   
   My end goal is to be able to do the following when writing 
callable that can have context passed to them
 
   ```Python
   def generate_is_latest_callable(tasks_if_latest: List[str] , 
tasks_if_not_latest: List[str]) -> Callable:
 def result(context: Context) -> List[str]:
   context. #and be able to get strong autocomplete and typing while in an 
ide here
   # because currently all I can do is context: Dict[str, Any] which isn't 
very helpful
   if context.something:
 return tasks_if_latest
   else:
 return tasks_if_not_latest
 return result
   
   t_branch = PythonBranchingOperator(
 task_id="branch",
 python_callable=generate_is_latest_callable(["yes"], ["no"]),
 provide_context=True,
 dag=dag,
   )
   ```
   
   Note: moving this goal to the opening comment
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [airflow] r-richmond commented on issue #14396: Make context less nebulous

2021-02-23 Thread GitBox


r-richmond commented on issue #14396:
URL: https://github.com/apache/airflow/issues/14396#issuecomment-784646194


   >Dataclass is not pythonic at all
   
   As a counter point [Dacite](https://github.com/konradhalas/dacite) is a 
python library that shows how dataclasses can be used for something like this. 
Their config dataclass is a similar object to our context and is defined as 
follows.
   
   
https://github.com/konradhalas/dacite/blob/d2206b2e4711859da0ea5862c395940f33693e80/dacite/config.py#L5-L12
   ```python
   @dataclass
   class Config:
   type_hooks: Dict[Type, Callable[[Any], Any]] = 
field(default_factory=dict)
   cast: List[Type] = field(default_factory=list)
   forward_references: Optional[Dict[str, Any]] = None
   check_types: bool = True
   strict: bool = False
   strict_unions_match: bool = False
   ```
   
   which is then consumed in their function `from_dict` 
   
https://github.com/konradhalas/dacite/blob/d2206b2e4711859da0ea5862c395940f33693e80/dacite/core.py#L34
   ```python
   def from_dict(data_class: Type[T], data: Data, config: Optional[Config] = 
None) -> T:
   """Create a data class instance from a dictionary.
   :param data_class: a data class type
   :param data: a dictionary of a input data
   :param config: a configuration of the creation process
   :return: an instance of a data class
   ```
   
   As a user it feels so much better to be able to see the config in the 
function definition and then follow the type definition straight to the config 
objects strict definition (fields, types, optionally comments). 
   
   Alternatively when a dictionary is used its harder for an end user to reason 
about `what keys are present?`, `what are the values?`, `what are the types of 
the values?`, `where do I find this information?`. (Perhaps its my own 
ignorance but it isn't obvious to me today where I can look up this info for 
airflow's context dictionary and I've tried).
   
   As a developer tools like 
[mypy](https://github.com/python/mypy/blob/538d36481526135c44b90383663eaa177cfc32e3/docs/source/additional_features.rst),
 [pydantic](https://github.com/samuelcolvin/pydantic/) & various ides 
([pycharm](https://blog.jetbrains.com/pycharm/2018/04/python-37-introducing-data-class/),
 vscode) also handle dataclasses quite well which helps prevent frustrating 
runtime errors. 
   
   
   >makes it difficult to evolve
   
   Not sure about this one.
   * To add new fields to data classes you simply add a new field to the 
dataclass definition with a default value.
   * To remove old fields you can use an ide's `find references` to remove all 
the old references and then delete the field from the dataclass.
   * Optionally, use a static analyzer like mypy before commit/merge to 
ensure you've gotten rid of all the old references
   
   Perhaps I'm still a rookie but for dictionaries it feels much harder to be 
sure you've gotten all the old key references out of the code, before you can 
safely remove it.
   
   >I think it's very pythonic to have dictionaries in Python and it brings a 
number of benefits, with little drawback
   
   I've called out a couple advantages I think dataclasses have over 
dictionaries. But I'm likely missing the dictionary advantages. Is Airflow 
using any of these advantages today?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org