Release _branches_ are tested as commits arrive on the branch, yes. That's what you see at https://github.com/apache/spark/actions. Released versions are fixed; they don't change, and they were also manually tested before release, so no, they are not re-tested; there is no need.
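The branch-vs-tag distinction Sean is drawing can be seen with a throwaway repo (a sketch; the repo, file, and tag names here are hypothetical): a release tag keeps pointing at the release commit even as the branch advances past it.

```python
import os
import subprocess
import tempfile
from pathlib import Path

def git(*args, cwd):
    """Run a git command and return its stripped stdout."""
    out = subprocess.run(["git", *args], cwd=cwd, check=True,
                         capture_output=True, text=True)
    return out.stdout.strip()

repo = os.path.join(tempfile.mkdtemp(), "demo")
os.makedirs(repo)
git("init", "-q", cwd=repo)
git("config", "user.email", "dev@example.com", cwd=repo)
git("config", "user.name", "dev", cwd=repo)

# First commit plays the role of a release; tag it like a Spark release tag.
Path(repo, "f").write_text("v1")
git("add", "f", cwd=repo)
git("commit", "-qm", "release commit", cwd=repo)
git("tag", "v3.2.3", cwd=repo)

# The branch keeps moving after the release...
Path(repo, "f").write_text("v2")
git("commit", "-qam", "post-release commit", cwd=repo)

# ...but the tag still points at the fixed release snapshot.
print(git("log", "-1", "--format=%s", "v3.2.3", cwd=repo))  # release commit
print(git("log", "-1", "--format=%s", "HEAD", cwd=repo))    # post-release commit
```

This is why `git clone --branch branch-3.2` gives a moving head, while checking out the `v3.2.3` tag gives exactly the tested release source.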
You presumably have some local env issue, because the source of Spark 3.2.3 was passing CI/CD at the time of release, as well as the manual tests of the PMC.

On Wed, Jan 18, 2023 at 5:24 PM Adam Chhina <amanschh...@gmail.com> wrote:

> Hi Sean,
>
> That's fair with regard to 3.3.x being the current release branch. I'm not
> familiar with the testing schedule, but I had assumed all currently
> supported release versions would have some nightly/weekly tests run; is
> that not the case? I only ask because, when I saw these test failures, I
> assumed they were either known or unknown from some recurring testing
> pipeline.
>
> Also, unfortunately using v3.2.3 also had the same test failures.
>
> > git clone --branch v3.2.3 https://github.com/apache/spark.git
>
> I've posted the traceback below for one of the tests run. At the end it
> says to check the logs (`see logs`). However, I wasn't sure whether that
> just meant the traceback or some more detailed logs elsewhere? I wasn't
> able to see any files that looked relevant running `find . -name "*logs*"`
> afterwards. Sorry if I'm missing something obvious.
>
> ```
> test_broadcast_no_encryption (pyspark.tests.test_broadcast.BroadcastTest) ... ERROR
> test_broadcast_value_against_gc (pyspark.tests.test_broadcast.BroadcastTest) ... ERROR
> test_broadcast_value_driver_encryption (pyspark.tests.test_broadcast.BroadcastTest) ... ERROR
> test_broadcast_value_driver_no_encryption (pyspark.tests.test_broadcast.BroadcastTest) ... ERROR
> test_broadcast_with_encryption (pyspark.tests.test_broadcast.BroadcastTest) ... ERROR
>
> ======================================================================
> ERROR: test_broadcast_with_encryption (pyspark.tests.test_broadcast.BroadcastTest)
> ----------------------------------------------------------------------
> Traceback (most recent call last):
>   File "$path/spark/python/pyspark/tests/test_broadcast.py", line 67, in test_broadcast_with_encryption
>     self._test_multiple_broadcasts(("spark.io.encryption.enabled", "true"))
>   File "$path/spark/python/pyspark/tests/test_broadcast.py", line 58, in _test_multiple_broadcasts
>     conf = SparkConf()
>   File "$path/spark/python/pyspark/conf.py", line 120, in __init__
>     self._jconf = _jvm.SparkConf(loadDefaults)
>   File "$path/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1709, in __getattr__
>     answer = self._gateway_client.send_command(
>   File "$path/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1036, in send_command
>     connection = self._get_connection()
>   File "$path/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/clientserver.py", line 284, in _get_connection
>     connection = self._create_new_connection()
>   File "$path/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/clientserver.py", line 291, in _create_new_connection
>     connection.connect_to_java_server()
>   File "$path/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/clientserver.py", line 438, in connect_to_java_server
>     self.socket.connect((self.java_address, self.java_port))
> ConnectionRefusedError: [Errno 61] Connection refused
>
> ----------------------------------------------------------------------
> Ran 7 tests in 12.950s
>
> FAILED (errors=7)
> sys:1: ResourceWarning: unclosed file <_io.BufferedWriter name=4>
>
> Had test failures in pyspark.tests.test_broadcast with /usr/local/bin/python3; see logs.
> ```
>
> Best,
>
> Adam Chhina
>
> On Jan 18, 2023, at 5:03 PM, Sean Owen <sro...@gmail.com> wrote:
>
> That isn't the released version either, but rather the head of the 3.2
> branch (which is beyond 3.2.3).
> You may want to check out the v3.2.3 tag instead:
> https://github.com/apache/spark/tree/v3.2.3
> ... instead of 3.2.1.
> But note, of course, that 3.3.x is the current release branch anyway.
>
> Hard to say what the error is without seeing more of the error log.
>
> That final warning is fine; it just means you are using Java 11+.
>
> On Wed, Jan 18, 2023 at 3:59 PM Adam Chhina <amanschh...@gmail.com> wrote:
>
>> Oh, whoops, didn't realize that wasn't the release version, thanks!
>>
>> > git clone --branch branch-3.2 https://github.com/apache/spark.git
>>
>> Ah, so the old failing tests are passing now, but I am seeing failures in
>> `pyspark.tests.test_broadcast`, such as `test_broadcast_value_against_gc`,
>> with a majority of them failing due to `ConnectionRefusedError: [Errno 61]
>> Connection refused`. Maybe these tests are not meant to be run locally,
>> and only in the pipeline?
>>
>> Also, I see this warning that mentions to notify the maintainers here:
>>
>> ```
>> Starting test(/usr/local/bin/python3): pyspark.tests.test_broadcast
>> WARNING: An illegal reflective access operation has occurred
>> WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform
>> (file:/$path/spark/common/unsafe/target/scala-2.12/classes/) to constructor
>> java.nio.DirectByteBuffer(long,int)
>> ```
>>
>> FWIW, not sure if this matters, but the python executable used for running
>> these tests is `Python 3.10.9` under `/usr/local/bin/python3`.
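The `ConnectionRefusedError` in the traceback above is py4j's Python client failing to reach the JVM gateway it expects Spark to have launched: if the Java side never starts listening, the plain TCP connect fails. A minimal sketch of the same failure, with no Spark involved (the port is chosen by the OS and assumed free after we close it):

```python
import socket

# Bind a listener to an OS-chosen free port, then close it so that nothing
# is listening there anymore.
probe = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
probe.bind(("127.0.0.1", 0))
port = probe.getsockname()[1]
probe.close()

# Connecting to a port with no listener raises ConnectionRefusedError --
# the same exception py4j surfaces when the Spark JVM gateway never came up.
result = "connected"
client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    client.connect(("127.0.0.1", port))
except ConnectionRefusedError:
    result = "refused"
finally:
    client.close()
print(result)
```

So the broadcast tests themselves are likely fine; the question is why the JVM side of the gateway is not starting in this local environment.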
>>
>> Best,
>>
>> Adam Chhina
>>
>> On Jan 18, 2023, at 3:05 PM, Bjørn Jørgensen <bjornjorgen...@gmail.com> wrote:
>>
>> Replace
>> > > git clone g...@github.com:apache/spark.git
>> > > git checkout -b spark-321 v3.2.1
>> with
>> git clone --branch branch-3.2 https://github.com/apache/spark.git
>> This will give you branch-3.2 as of today, which is what I suppose you
>> call upstream: https://github.com/apache/spark/commits/branch-3.2
>> And right now all tests in GitHub Actions are passing :)
>>
>> On Wed, Jan 18, 2023 at 6:07 PM Sean Owen <sro...@gmail.com> wrote:
>>
>>> Never seen those, but it's probably a difference in pandas/numpy
>>> versions. You can see the current CI/CD test results in GitHub Actions.
>>> But you want to use release versions, not an RC. 3.2.1 is not the latest
>>> version, and it's possible the tests were actually failing in the RC.
>>>
>>> On Wed, Jan 18, 2023, 10:57 AM Adam Chhina <amanschh...@gmail.com> wrote:
>>>
>>>> Bump,
>>>>
>>>> Just trying to see where I can find which tests are known to be failing
>>>> for a particular release, to ensure I'm building upstream correctly
>>>> following the build docs. I figured this would be the best place to ask,
>>>> as it pertains to building and testing upstream (also more than happy to
>>>> provide a PR for any docs if required afterwards); however, if there is
>>>> a more appropriate place, please let me know.
>>>>
>>>> Best,
>>>>
>>>> Adam Chhina
>>>>
>>>> > On Dec 27, 2022, at 11:37 AM, Adam Chhina <amanschh...@gmail.com> wrote:
>>>> >
>>>> > As part of an upgrade I was looking to run upstream PySpark unit tests
>>>> > on `v3.2.1-rc2` before I applied some downstream patches and tested
>>>> > those. However, I'm running into some issues with failing unit tests,
>>>> > which I'm not sure are failing upstream or due to some step I missed
>>>> > in the build.
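Sean's point about pandas/numpy version differences is quick to check locally; a small sketch that prints the interpreter and library versions to compare against whatever the branch's GitHub Actions workflow installs:

```python
import sys

# Print the Python-side versions that most often cause local PySpark test
# drift relative to CI; compare these against the CI workflow's pins.
print("python:", sys.version.split()[0])
for mod in ("pandas", "numpy"):
    try:
        m = __import__(mod)
        print(mod + ":", m.__version__)
    except ImportError:
        print(mod + ": not installed")
```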
>>>> >
>>>> > The current failing tests (at least so far, since I believe the
>>>> > python script exits on test failure):
>>>> > ```
>>>> > ======================================================================
>>>> > FAIL: test_train_prediction (pyspark.mllib.tests.test_streaming_algorithms.StreamingLinearRegressionWithTests)
>>>> > Test that error on test data improves as model is trained.
>>>> > ----------------------------------------------------------------------
>>>> > Traceback (most recent call last):
>>>> >   File "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 474, in test_train_prediction
>>>> >     eventually(condition, timeout=180.0)
>>>> >   File "/Users/adam/OSS/spark/python/pyspark/testing/utils.py", line 86, in eventually
>>>> >     lastValue = condition()
>>>> >   File "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 469, in condition
>>>> >     self.assertGreater(errors[1] - errors[-1], 2)
>>>> > AssertionError: 1.8960983527735014 not greater than 2
>>>> >
>>>> > ======================================================================
>>>> > FAIL: test_parameter_accuracy (pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests)
>>>> > Test that the final value of weights is close to the desired value.
>>>> > ----------------------------------------------------------------------
>>>> > Traceback (most recent call last):
>>>> >   File "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 229, in test_parameter_accuracy
>>>> >     eventually(condition, timeout=60.0, catch_assertions=True)
>>>> >   File "/Users/adam/OSS/spark/python/pyspark/testing/utils.py", line 91, in eventually
>>>> >     raise lastValue
>>>> >   File "/Users/adam/OSS/spark/python/pyspark/testing/utils.py", line 82, in eventually
>>>> >     lastValue = condition()
>>>> >   File "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 226, in condition
>>>> >     self.assertAlmostEqual(rel, 0.1, 1)
>>>> > AssertionError: 0.23052813480829393 != 0.1 within 1 places (0.13052813480829392 difference)
>>>> >
>>>> > ======================================================================
>>>> > FAIL: test_training_and_prediction (pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests)
>>>> > Test that the model improves on toy data with no. of batches
>>>> > ----------------------------------------------------------------------
>>>> > Traceback (most recent call last):
>>>> >   File "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 334, in test_training_and_prediction
>>>> >     eventually(condition, timeout=180.0)
>>>> >   File "/Users/adam/OSS/spark/python/pyspark/testing/utils.py", line 93, in eventually
>>>> >     raise AssertionError(
>>>> > AssertionError: Test failed due to timeout after 180 sec, with last
>>>> > condition returning: Latest errors: 0.67, 0.71, 0.78, 0.7, 0.75, 0.74,
>>>> > 0.73, 0.69, 0.62, 0.71, 0.69, 0.75, 0.72, 0.77, 0.71, 0.74, 0.76, 0.78,
>>>> > 0.7, 0.78, 0.8, 0.74, 0.77, 0.75, 0.76, 0.76, 0.75, 0.78, 0.74, 0.64,
>>>> > 0.64, 0.71, 0.78, 0.76, 0.64, 0.68, 0.69, 0.72, 0.77
>>>> >
>>>> > ----------------------------------------------------------------------
>>>> > Ran 13 tests in 661.536s
>>>> >
>>>> > FAILED (failures=3, skipped=1)
>>>> >
>>>> > Had test failures in pyspark.mllib.tests.test_streaming_algorithms with /usr/local/bin/python3; see logs.
>>>> > ```
>>>> >
>>>> > Here's how I'm currently building Spark; I was using the
>>>> > [building-spark](https://spark.apache.org/docs/3..1/building-spark.html)
>>>> > docs as a reference.
>>>> > ```
>>>> > > git clone g...@github.com:apache/spark.git
>>>> > > git checkout -b spark-321 v3.2.1
>>>> > > ./build/mvn -DskipTests clean package -Phive
>>>> > > export JAVA_HOME=$(path/to/jdk/11)
>>>> > > ./python/run-tests
>>>> > ```
>>>> >
>>>> > Current Java version:
>>>> > ```
>>>> > java -version
>>>> > openjdk version "11.0.17" 2022-10-18
>>>> > OpenJDK Runtime Environment Homebrew (build 11.0.17+0)
>>>> > OpenJDK 64-Bit Server VM Homebrew (build 11.0.17+0, mixed mode)
>>>> > ```
>>>> >
>>>> > Alternatively, I've also tried simply building Spark, creating a
>>>> > python=3.9 venv, installing the requirements with `pip install -r
>>>> > dev/requirements.txt`, and using that as the interpreter to run the
>>>> > tests. However, I was running into some failing pandas tests, which
>>>> > seemed to me to come from a pandas version difference, as
>>>> > `requirements.txt` didn't specify a version.
>>>> >
>>>> > I suppose I have a couple of questions in regards to this:
>>>> > 1. Am I missing a build step to build Spark and run PySpark unit tests?
>>>> > 2. Where could I find whether an upstream test is failing for a specific release?
>>>> > 3. Would it be possible to configure the `run-tests` script to run all tests regardless of test failures?
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>
>>
>> --
>> Bjørn Jørgensen
>> Vestre Aspehaug 4, 6010 Ålesund
>> Norge
>>
>> +47 480 94 297
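All three streaming-mllib failures quoted earlier funnel through `eventually` in `pyspark/testing/utils.py`, which retries a flaky condition until it passes or a timeout expires. A rough sketch of that retry pattern (the parameter names match the calls in the tracebacks, but the body is illustrative, not Spark's actual code):

```python
import time

def eventually(condition, timeout=30.0, catch_assertions=False, interval=0.01):
    """Poll `condition` until it returns True or `timeout` seconds elapse.

    Sketch of the retry pattern seen in the tracebacks; on timeout it
    raises an AssertionError describing the last observed value, which is
    the "Test failed due to timeout after ... sec" message in the logs.
    """
    deadline = time.monotonic() + timeout
    last_value = None
    while time.monotonic() < deadline:
        if catch_assertions:
            # Capture assertion failures so the last one can be re-raised
            # at timeout instead of aborting the first retry.
            try:
                last_value = condition()
            except AssertionError as exc:
                last_value = exc
        else:
            last_value = condition()
        if last_value is True:
            return
        time.sleep(interval)
    if isinstance(last_value, AssertionError):
        raise last_value
    raise AssertionError(
        f"Test failed due to timeout after {timeout} sec, "
        f"with last condition returning: {last_value}")
```

Seen this way, the three failures are convergence conditions that never became true within the window, which is consistent with an environment-dependent (or simply flaky) numeric threshold rather than a broken build.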