Building Spark to run PySpark Tests?

Adam Chhina Tue, 27 Dec 2022 09:01:52 -0800

As part of an upgrade, I was looking to run the upstream PySpark unit tests on
v3.2.1-rc2 before applying some downstream patches and testing those.
However, I’m running into some failing unit tests, and I’m not sure whether
they also fail upstream or whether I missed a step in the build.


The currently failing tests (at least so far, since I believe the Python
script exits on test failure):

======================================================================
FAIL: test_train_prediction
(pyspark.mllib.tests.test_streaming_algorithms.StreamingLinearRegressionWithTests)
Test that error on test data improves as model is trained.
----------------------------------------------------------------------
Traceback (most recent call last):
  File 
"/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
line 474, in test_train_prediction
    eventually(condition, timeout=180.0)
  File "/Users/adam/OSS/spark/python/pyspark/testing/utils.py", line
86, in eventually
    lastValue = condition()
  File 
"/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
line 469, in condition
    self.assertGreater(errors[1] - errors[-1], 2)
AssertionError: 1.8960983527735014 not greater than 2

======================================================================
FAIL: test_parameter_accuracy
(pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests)
Test that the final value of weights is close to the desired value.
----------------------------------------------------------------------
Traceback (most recent call last):
  File 
"/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
line 229, in test_parameter_accuracy
    eventually(condition, timeout=60.0, catch_assertions=True)
  File "/Users/adam/OSS/spark/python/pyspark/testing/utils.py", line
91, in eventually
    raise lastValue
  File "/Users/adam/OSS/spark/python/pyspark/testing/utils.py", line
82, in eventually
    lastValue = condition()
  File 
"/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
line 226, in condition
    self.assertAlmostEqual(rel, 0.1, 1)
AssertionError: 0.23052813480829393 != 0.1 within 1 places
(0.13052813480829392 difference)

======================================================================
FAIL: test_training_and_prediction
(pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests)
Test that the model improves on toy data with no. of batches
----------------------------------------------------------------------
Traceback (most recent call last):
  File 
"/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
line 334, in test_training_and_prediction
    eventually(condition, timeout=180.0)
  File "/Users/adam/OSS/spark/python/pyspark/testing/utils.py", line
93, in eventually
    raise AssertionError(
AssertionError: Test failed due to timeout after 180 sec, with last
condition returning: Latest errors: 0.67, 0.71, 0.78, 0.7, 0.75, 0.74,
0.73, 0.69, 0.62, 0.71, 0.69, 0.75, 0.72, 0.77, 0.71, 0.74, 0.76,
0.78, 0.7, 0.78, 0.8, 0.74, 0.77, 0.75, 0.76, 0.76, 0.75, 0.78, 0.74,
0.64, 0.64, 0.71, 0.78, 0.76, 0.64, 0.68, 0.69, 0.72, 0.77

----------------------------------------------------------------------
Ran 13 tests in 661.536s

FAILED (failures=3, skipped=1)

Had test failures in pyspark.mllib.tests.test_streaming_algorithms
with /usr/local/bin/python3; see logs.
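
For context on the tracebacks above: each failing test wraps its check in
eventually(condition, timeout=...), which polls a condition until it holds or
the timeout expires. Below is a simplified sketch of that polling pattern,
inferred only from the frames and messages shown above. It is not the actual
pyspark.testing.utils source, and eventually_sketch and its interval
parameter are hypothetical names.

```python
import time


def eventually_sketch(condition, timeout=30.0, catch_assertions=False, interval=0.01):
    """Poll `condition` until it returns True or `timeout` seconds elapse.

    Hypothetical stand-in for the helper seen in the tracebacks; the real
    implementation may differ.
    """
    deadline = time.monotonic() + timeout
    last_value = None
    while time.monotonic() < deadline:
        if catch_assertions:
            # Remember the assertion failure and retry on the next poll.
            try:
                last_value = condition()
            except AssertionError as exc:
                last_value = exc
        else:
            # An AssertionError raised here propagates immediately.
            last_value = condition()
        if last_value is True:
            return
        time.sleep(interval)
    # Timed out: re-raise a remembered assertion, or report the last value.
    if isinstance(last_value, AssertionError):
        raise last_value
    raise AssertionError(
        "Test failed due to timeout after %g sec, with last condition returning: %s"
        % (timeout, last_value)
    )
```

Under this reading, a condition that raises without catch_assertions=True
fails the test on that evaluation rather than being retried, while a
condition that keeps returning a non-True value (such as the "Latest errors:
..." string) fails only once the timeout is reached.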

Here’s how I’m currently building Spark; I was using the building-spark
<https://spark.apache.org/docs/3..1/building-spark.html> docs as a
reference.

$ git clone git@github.com:apache/spark.git
$ git checkout -b spark-321 v3.2.1
$ ./build/mvn -DskipTests clean package -Phive
$ export JAVA_HOME=/path/to/jdk/11
$ ./python/run-tests

Current Java version

$ java -version
openjdk version "11.0.17" 2022-10-18
OpenJDK Runtime Environment Homebrew (build 11.0.17+0)
OpenJDK 64-Bit Server VM Homebrew (build 11.0.17+0, mixed mode)

Alternatively, I’ve also tried simply building Spark, creating a Python 3.9
venv, installing the requirements with pip install -r dev/requirements.txt,
and using that interpreter to run the tests (./python/run-tests
--python-executables=/path/to/venv/bin/python3.9). However, I ran into some
failing pandas tests, which seemed to me to come from a pandas version
difference, since requirements.txt doesn’t specify a version.
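
To check the suspected version difference, it may help to record exactly
which package versions the venv interpreter resolved. A minimal sketch,
assuming it is run with the venv's Python; installed_versions and
format_report are hypothetical helper names, and any package list beyond
pandas is only an illustrative assumption.

```python
from importlib import metadata


def installed_versions(package_names):
    """Map each package name to its installed version, or None if absent."""
    versions = {}
    for name in package_names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def format_report(versions):
    """Render the mapping as sorted, pip-style lines for pasting into a mail."""
    lines = []
    for name in sorted(versions):
        if versions[name]:
            lines.append(f"{name}=={versions[name]}")
        else:
            lines.append(f"{name}: not installed")
    return "\n".join(lines)
```

Calling print(format_report(installed_versions(["pandas"]))) from the venv
would give a line that can be compared against whatever versions the upstream
CI for the release branch installs.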

I suppose I have a couple of questions regarding this:

   1. Am I missing a build step to build Spark and run PySpark unit tests?
   FWIW I'm looking to build on Debian 10 and macOS Ventura 13.0.1.
   2. Where could I find out whether an upstream test is failing for a
   specific release?
   3. Would it be possible to configure the run-tests script to run all
   tests regardless of test failures?
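
On the third question, one workaround that needs no change to run-tests
might be to invoke it once per module and collect the failures afterwards.
The sketch below assumes the script's --modules and --python-executables
options and a non-zero exit status on failure; build_command and run_all are
hypothetical names, and the module names passed in should be checked against
./python/run-tests --help for the release being tested.

```python
import subprocess
import sys


def build_command(module, python_exe, spark_home="."):
    """Build the run-tests invocation for a single test module."""
    return [
        f"{spark_home}/python/run-tests",
        f"--modules={module}",
        f"--python-executables={python_exe}",
    ]


def run_all(modules, python_exe=sys.executable, spark_home=".", runner=subprocess.call):
    """Run every module in turn and return the names of those that failed.

    `runner` takes the command list and returns an exit status; it is
    injectable so the loop can be exercised without a Spark checkout.
    """
    failed = []
    for module in modules:
        status = runner(build_command(module, python_exe, spark_home))
        if status != 0:
            # Record the failure and keep going instead of stopping here.
            failed.append(module)
    return failed
```

For example, run_all(["pyspark-mllib", "pyspark-sql"], spark_home="/path/to/spark")
would run both modules and return whichever of them exited with a failure.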

Any help would be much appreciated!

Best,

Adam Chhina
