potiuk commented on code in PR #27264:
URL: https://github.com/apache/airflow/pull/27264#discussion_r1008030506


##########
Dockerfile:
##########
@@ -1215,7 +1215,7 @@ ARG ADDITIONAL_PYTHON_DEPS=""
 # are compatible with the new protobuf version. All the google python client 
libraries need
 # to be upgraded to >=2.0.0 in order to able to lift that limitation
 # 
https://developers.google.com/protocol-buffers/docs/news/2022-05-06#python-updates
-ARG EAGER_UPGRADE_ADDITIONAL_REQUIREMENTS="dill<0.3.3 pyarrow>=6.0.0 
protobuf<4.21.0"
+ARG EAGER_UPGRADE_ADDITIONAL_REQUIREMENTS="dill<0.3.3 
pyarrow>=6.0.0;python_version==\"3.10\" protobuf<4.21.0"

Review Comment:
   Not really. This one is really interesting and what we have is just a 
workaround to strange `pip` behaviour I noticed.
   
   The problem is that  `pip` cannot properly resolve pyarrow version ONLY on 
3.10 -> 3.7, 3.8, 3.9 are all fine and produce pyarrow > 6. But 3.10 is not. I 
do not know about 3.11 is (there is no pyarrow wheel compatible for 3.11 yet 
and trying to compile current version fails in our Docker image). We are 
waiting for release and I think PyArrow team is close to release 3.11 
compatible binary wheels (so the fact it does not compile is nothing strange - 
it simply is not compatible with 3.11 and they had some difficulties to make it 
work) - see that comment from yesterday 
https://github.com/apache/arrow/pull/14499#issuecomment-1293574736 (I am 
following up the issues in all the dependencies of ours).
   
   The reason why I added `==` is that I just checked if it is stil needed 
(because I hope it will fix itself eventually). It is still the case, so I 
decided to make it more explicit that this is python 3.10 only. See the comment 
above - it always been about Python 3.10 being the reason for adding pyarrow 
lower bound, but I  previoulsly added for all the versions - unnecessarily. 
Once pyarrow has 3.11 compatible version, we might change it to >= 3.10, but I 
hope we will not have to - because this might be very specific to Python 3.10 
dependency quirks.
   
   What happens when we do not have this entry is that on Python 3.10 (and only 
thre) we end-up with an older version of Pyarrow when installing airflow with 
all extras from the scratch. Why? No idea. But we do - and it fails tests. 
Since we do not have pyarrow at all as direct dependency (they are all 
transitive) - we cannot limit it for airflow in install_requires (we do not 
want to depend on pyarrow). And since it is used transitively by many 
depenencies, we have no single place to add it. 
   
   This is likely result of some complex dependency chain/imperfect resolution 
algorithm. And it is not happening always - only when we choose to install 
"all" providers at the same time. So this pyarrow>=6.0.0 is really a "helping 
hand" I give to `pip` to get resolution better. I do not know the root cause of 
it (and It would take quite some time I think to figure out and make an easy 
reproducible case so I have not even reported it to `pip` hoping it will be 
resolved with time - either by newer dependency versions or `pip` resolver 
improvements.
   
   This is a bit different case than dill and protobuf in the same line (as you 
see they have lower-bound not upper-bount. In case of "dill" and "protobuf" - 
itis absolutely necessary for eager upgrade, because `pip` resolves direct 
dependencies when there are eaager upgrade before looking at "extra" 
requirements and if we do not add those two limitations, eager upgrade will 
upgrade them to newer versions and then it will conflict with limits that are 
in extras. In case of pyarrow it is an opposite problem - we WANT pyarrow to be 
newer (because some of our dependencies need it) but somehow `pip` chosoes 
older version when resolving so we need to help `pip` to make better decisions. 
 
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to