As part of an upgrade, I was looking to run the upstream PySpark unit tests on v3.2.1-rc2 before applying some downstream patches and testing those. However, I’m running into some failing unit tests, and I’m not sure whether they are also failing upstream or are due to a step I missed in the build.
The current failing tests (at least so far, since I believe the Python script exits on test failure):

======================================================================
FAIL: test_train_prediction (pyspark.mllib.tests.test_streaming_algorithms.StreamingLinearRegressionWithTests)
Test that error on test data improves as model is trained.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 474, in test_train_prediction
    eventually(condition, timeout=180.0)
  File "/Users/adam/OSS/spark/python/pyspark/testing/utils.py", line 86, in eventually
    lastValue = condition()
  File "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 469, in condition
    self.assertGreater(errors[1] - errors[-1], 2)
AssertionError: 1.8960983527735014 not greater than 2

======================================================================
FAIL: test_parameter_accuracy (pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests)
Test that the final value of weights is close to the desired value.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 229, in test_parameter_accuracy
    eventually(condition, timeout=60.0, catch_assertions=True)
  File "/Users/adam/OSS/spark/python/pyspark/testing/utils.py", line 91, in eventually
    raise lastValue
  File "/Users/adam/OSS/spark/python/pyspark/testing/utils.py", line 82, in eventually
    lastValue = condition()
  File "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 226, in condition
    self.assertAlmostEqual(rel, 0.1, 1)
AssertionError: 0.23052813480829393 != 0.1 within 1 places (0.13052813480829392 difference)

======================================================================
FAIL: test_training_and_prediction (pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests)
Test that the model improves on toy data with no. of batches
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 334, in test_training_and_prediction
    eventually(condition, timeout=180.0)
  File "/Users/adam/OSS/spark/python/pyspark/testing/utils.py", line 93, in eventually
    raise AssertionError(
AssertionError: Test failed due to timeout after 180 sec, with last condition returning: Latest errors: 0.67, 0.71, 0.78, 0.7, 0.75, 0.74, 0.73, 0.69, 0.62, 0.71, 0.69, 0.75, 0.72, 0.77, 0.71, 0.74, 0.76, 0.78, 0.7, 0.78, 0.8, 0.74, 0.77, 0.75, 0.76, 0.76, 0.75, 0.78, 0.74, 0.64, 0.64, 0.71, 0.78, 0.76, 0.64, 0.68, 0.69, 0.72, 0.77

----------------------------------------------------------------------
Ran 13 tests in 661.536s

FAILED (failures=3, skipped=1)

Had test failures in pyspark.mllib.tests.test_streaming_algorithms with /usr/local/bin/python3; see logs.
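For context, my reading of the tracebacks above is that these tests poll a condition until it holds or a timeout expires, which is why the failures surface through eventually. Below is a minimal sketch of that polling pattern as I understand it. It is not the actual pyspark/testing/utils.py code; the name eventually_sketch and the interval parameter are hypothetical and only there for illustration.

```python
import time


def eventually_sketch(condition, timeout=5.0, interval=0.01, catch_assertions=False):
    """Poll `condition` until it returns True or `timeout` seconds elapse.

    Hypothetical illustration only; not the pyspark.testing.utils implementation.
    """
    deadline = time.monotonic() + timeout
    last_value = None
    while time.monotonic() < deadline:
        if catch_assertions:
            try:
                last_value = condition()
            except AssertionError as exc:
                # Remember the failure and retry until the deadline.
                last_value = exc
        else:
            # Without catch_assertions, a failing assert propagates immediately.
            last_value = condition()
        if last_value is True:
            return
        time.sleep(interval)
    if isinstance(last_value, AssertionError):
        # Re-raise the most recent assertion failure seen inside the condition.
        raise last_value
    raise AssertionError(
        "Test failed due to timeout after %g sec, with last condition returning: %s"
        % (timeout, last_value)
    )
```

If that reading is right, the three failures differ only in how they exit the loop: the first asserts inside the condition with no retry, the second retries until the deadline and re-raises the last assertion, and the third times out with the last returned message.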
Here’s how I’m currently building Spark, using the building-spark docs <https://spark.apache.org/docs/3..1/building-spark.html> as a reference:

$ git clone g...@github.com:apache/spark.git
$ git checkout -b spark-321 v3.2.1
$ ./build/mvn -DskipTests clean package -Phive
$ export JAVA_HOME=$(path/to/jdk/11)
$ ./python/run-tests

Current Java version:

$ java -version
openjdk version "11.0.17" 2022-10-18
OpenJDK Runtime Environment Homebrew (build 11.0.17+0)
OpenJDK 64-Bit Server VM Homebrew (build 11.0.17+0, mixed mode)

Alternatively, I’ve also tried building Spark, creating a Python 3.9 venv, installing the requirements with pip install -r dev/requirements.txt, and using that interpreter to run the tests (./python/run-tests --python-executables=/path/to/venv/bin/python3.9). However, I ran into some failing pandas tests, which seemed to come from a pandas version difference, since requirements.txt didn’t specify a version.

I have a few questions about this:

1. Am I missing a step needed to build Spark and run the PySpark unit tests? FWIW, I’m looking to build on Debian 10 and macOS Ventura 13.0.1.
2. Where can I find out whether an upstream test is failing for a specific release?
3. Is it possible to configure the run-tests script to run all tests regardless of test failures?

Any help would be much appreciated!

Best,
Adam Chhina
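P.S. To make question 3 concrete, the fallback I have in mind is a small wrapper that invokes the script once per test name and records the exit code instead of stopping. This is only a sketch: run_each and failed are hypothetical helper names, and it assumes the --testnames option of ./python/run-tests accepts a dotted test module name the way I think it does.

```python
import subprocess


def run_each(test_names, run=subprocess.call, script="./python/run-tests"):
    """Invoke the test script once per test name and collect the exit codes.

    Hypothetical helper; assumes `script` accepts `--testnames <name>`.
    """
    results = {}
    for name in test_names:
        # One invocation per name, so a failure in one run does not
        # prevent the remaining names from being executed.
        results[name] = run([script, "--testnames", name])
    return results


def failed(results):
    """Return the test names whose run exited with a non-zero code."""
    return [name for name, code in results.items() if code != 0]
```

From the root of the Spark checkout I would call something like failed(run_each(["pyspark.mllib.tests.test_streaming_algorithms"])), adding the other test names I care about. A built-in option would obviously be nicer than this.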