That isn't the released version either, but rather the head of the 3.2
branch (which is beyond 3.2.3).
You may want to check out the v3.2.3 tag instead of 3.2.1:
https://github.com/apache/spark/tree/v3.2.3
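For example, something like this should get you onto the released tag
(assuming you already have a clone):

```
git fetch --tags
git checkout -b spark-323 v3.2.3
```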
But note, of course, that 3.3.x is the current release branch anyway.

Hard to say what the error is without seeing more of the error log.

That final warning is fine; it just means you are using Java 11+.


On Wed, Jan 18, 2023 at 3:59 PM Adam Chhina <amanschh...@gmail.com> wrote:

> Oh, whoops, didn’t realize that wasn’t the release version, thanks!
>
> > git clone --branch branch-3.2 https://github.com/apache/spark.git
>
> Ah, so the old failing tests are passing now, but I am seeing failures in 
> `pyspark.tests.test_broadcast`
> such as  `test_broadcast_value_against_gc`, with a majority of them
> failing due to `ConnectionRefusedError: [Errno 61] Connection refused`.
> Maybe these tests are not meant to be run locally, and only in the pipeline?
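>
> In case it's useful, I believe something like this re-runs just that module
> with an explicit interpreter (assuming branch-3.2's `run-tests` accepts the
> same flags as master):
>
> ```
> ./python/run-tests --python-executables=/usr/local/bin/python3 \
>     --testnames 'pyspark.tests.test_broadcast'
> ```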
>
> Also, I see this warning that mentions notifying the maintainers, so
> flagging it here:
>
> ```
>
> Starting test(/usr/local/bin/python3): pyspark.tests.test_broadcast
>
> WARNING: An illegal reflective access operation has occurred
>
> WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform
> (file:/$path/spark/common/unsafe/target/scala-2.12/classes/) to constructor
> java.nio.DirectByteBuffer(long,int)
> ```
>
> FWIW, not sure if this matters, but the Python executable used for running
> these tests is `Python 3.10.9` under `/usr/local/bin/python3`.
>
> Best,
>
> Adam Chhina
>
> On Jan 18, 2023, at 3:05 PM, Bjørn Jørgensen <bjornjorgen...@gmail.com>
> wrote:
>
> Replace
> > > git clone g...@github.com:apache/spark.git
> > > git checkout -b spark-321 v3.2.1
>
> with
> git clone --branch branch-3.2 https://github.com/apache/spark.git
> This will give you branch-3.2 as of today, which I suppose is what you call upstream
>
> https://github.com/apache/spark/commits/branch-3.2
> and right now all tests in GitHub Actions are passing :)
>
>
> ons. 18. jan. 2023 kl. 18:07 skrev Sean Owen <sro...@gmail.com>:
>
>> Never seen those, but it's probably a difference in pandas/numpy
>> versions. You can see the current CI/CD test results in GitHub Actions. But
>> you want to use release versions, not an RC. 3.2.1 is not the latest
>> version, and it's possible the tests were actually failing in the RC.
>>
>> On Wed, Jan 18, 2023, 10:57 AM Adam Chhina <amanschh...@gmail.com> wrote:
>>
>>> Bump,
>>>
>>> Just trying to see where I can find which tests are known to be failing for
>>> a particular release, to ensure I'm building upstream correctly following
>>> the build docs. I figured this would be the best place to ask, as it
>>> pertains to building and testing upstream (also more than happy to provide
>>> a PR for any docs if required afterwards); however, if there is a more
>>> appropriate place, please let me know.
>>>
>>> Best,
>>>
>>> Adam Chhina
>>>
>>> > On Dec 27, 2022, at 11:37 AM, Adam Chhina <amanschh...@gmail.com>
>>> wrote:
>>> >
>>> > As part of an upgrade I was looking to run upstream PySpark unit tests
>>> on `v3.2.1-rc2` before I applied some downstream patches and tested those.
>>> However, I'm running into some issues with failing unit tests, which I'm
>>> not sure are failing upstream or due to some step I missed in the build.
>>> >
>>> > The current failing tests (at least so far, since I believe the python
>>> script exits on test failure):
>>> > ```
>>> > ======================================================================
>>> > FAIL: test_train_prediction
>>> (pyspark.mllib.tests.test_streaming_algorithms.StreamingLinearRegressionWithTests)
>>> > Test that error on test data improves as model is trained.
>>> > ----------------------------------------------------------------------
>>> > Traceback (most recent call last):
>>> >   File
>>> "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>>> line 474, in test_train_prediction
>>> >     eventually(condition, timeout=180.0)
>>> >   File "/Users/adam/OSS/spark/python/pyspark/testing/utils.py", line
>>> 86, in eventually
>>> >     lastValue = condition()
>>> >   File
>>> "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>>> line 469, in condition
>>> >     self.assertGreater(errors[1] - errors[-1], 2)
>>> > AssertionError: 1.8960983527735014 not greater than 2
>>> >
>>> > ======================================================================
>>> > FAIL: test_parameter_accuracy
>>> (pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests)
>>> > Test that the final value of weights is close to the desired value.
>>> > ----------------------------------------------------------------------
>>> > Traceback (most recent call last):
>>> >   File
>>> "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>>> line 229, in test_parameter_accuracy
>>> >     eventually(condition, timeout=60.0, catch_assertions=True)
>>> >   File "/Users/adam/OSS/spark/python/pyspark/testing/utils.py", line
>>> 91, in eventually
>>> >     raise lastValue
>>> >   File "/Users/adam/OSS/spark/python/pyspark/testing/utils.py", line
>>> 82, in eventually
>>> >     lastValue = condition()
>>> >   File
>>> "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>>> line 226, in condition
>>> >     self.assertAlmostEqual(rel, 0.1, 1)
>>> > AssertionError: 0.23052813480829393 != 0.1 within 1 places
>>> (0.13052813480829392 difference)
>>> >
>>> > ======================================================================
>>> > FAIL: test_training_and_prediction
>>> (pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests)
>>> > Test that the model improves on toy data with no. of batches
>>> > ----------------------------------------------------------------------
>>> > Traceback (most recent call last):
>>> >   File
>>> "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>>> line 334, in test_training_and_prediction
>>> >     eventually(condition, timeout=180.0)
>>> >   File "/Users/adam/OSS/spark/python/pyspark/testing/utils.py", line
>>> 93, in eventually
>>> >     raise AssertionError(
>>> > AssertionError: Test failed due to timeout after 180 sec, with last
>>> condition returning: Latest errors: 0.67, 0.71, 0.78, 0.7, 0.75, 0.74,
>>> 0.73, 0.69, 0.62, 0.71, 0.69, 0.75, 0.72, 0.77, 0.71, 0.74, 0.76, 0.78,
>>> 0.7, 0.78, 0.8, 0.74, 0.77, 0.75, 0.76, 0.76, 0.75, 0.78, 0.74, 0.64, 0.64,
>>> 0.71, 0.78, 0.76, 0.64, 0.68, 0.69, 0.72, 0.77
>>> >
>>> > ----------------------------------------------------------------------
>>> > Ran 13 tests in 661.536s
>>> >
>>> > FAILED (failures=3, skipped=1)
>>> >
>>> > Had test failures in pyspark.mllib.tests.test_streaming_algorithms
>>> with /usr/local/bin/python3; see logs.
>>> > ```
>>> >
>>> > Here's how I'm currently building Spark; I was using the
>>> [building-spark](https://spark.apache.org/docs/3.2.1/building-spark.html)
>>> docs as a reference.
>>> > ```
>>> > > git clone g...@github.com:apache/spark.git
>>> > > git checkout -b spark-321 v3.2.1
>>> > > ./build/mvn -DskipTests clean package -Phive
>>> > > export JAVA_HOME=$(path/to/jdk/11)
>>> > > ./python/run-tests
>>> > ```
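>>> >
>>> > To narrow things down, I assume the same script can also target a single
>>> module and interpreter with something like:
>>> > ```
>>> > > ./python/run-tests --python-executables=python3 --modules=pyspark-mllib
>>> > ```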
>>> >
>>> > Current Java version
>>> > ```
>>> > java -version
>>> > openjdk version "11.0.17" 2022-10-18
>>> > OpenJDK Runtime Environment Homebrew (build 11.0.17+0)
>>> > OpenJDK 64-Bit Server VM Homebrew (build 11.0.17+0, mixed mode)
>>> > ```
>>> >
>>> > Alternatively, I've also tried simply building Spark, creating a
>>> Python 3.9 venv, installing the requirements with `pip install -r
>>> dev/requirements.txt`, and using that as the interpreter to run the tests.
>>> However, I was running into some failing pandas tests, which seemed to come
>>> from a pandas version difference, as `requirements.txt` didn't pin a
>>> version.
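>>> >
>>> > If pinning versions is the fix, I imagine the venv setup would look roughly
>>> like this (the pandas version is just a guess at what the 3.2 CI used):
>>> > ```
>>> > > python3.9 -m venv .venv && source .venv/bin/activate
>>> > > pip install -r dev/requirements.txt 'pandas==1.3.5'
>>> > ```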
>>> >
>>> > I suppose I have a couple of questions in regards to this:
>>> > 1. Am I missing a build step to build Spark and run PySpark unit tests?
>>> > 2. Where could I find whether an upstream test is failing for a
>>> specific release?
>>> > 3. Would it be possible to configure the `run-tests` script to run all
>>> tests regardless of test failures?
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>
>
> --
> Bjørn Jørgensen
> Vestre Aspehaug 4, 6010 Ålesund
> Norge
>
> +47 480 94 297
>
>
>
