Hi Sean,

That's fair with regard to 3.3.x being the current release branch. I'm not familiar with the testing schedule, but I had assumed all currently supported release versions would have some nightly/weekly tests run; is that not the case? I only ask because, seeing these test failures, I assumed they would already be known (or newly caught) by some recurring testing pipeline.
Also, unfortunately, checking out v3.2.3 produced the same test failures.

> git clone --branch v3.2.3 https://github.com/apache/spark.git

I've posted the traceback below for one of the failing tests. At the end it mentions checking the logs (`see logs`), but I wasn't sure whether that just meant the traceback or some more detailed logs elsewhere. I wasn't able to find any files that looked relevant by running `find . -name "*logs*"` afterwards. Sorry if I'm missing something obvious.

```
test_broadcast_no_encryption (pyspark.tests.test_broadcast.BroadcastTest) ... ERROR
test_broadcast_value_against_gc (pyspark.tests.test_broadcast.BroadcastTest) ... ERROR
test_broadcast_value_driver_encryption (pyspark.tests.test_broadcast.BroadcastTest) ... ERROR
test_broadcast_value_driver_no_encryption (pyspark.tests.test_broadcast.BroadcastTest) ... ERROR
test_broadcast_with_encryption (pyspark.tests.test_broadcast.BroadcastTest) ... ERROR

======================================================================
ERROR: test_broadcast_with_encryption (pyspark.tests.test_broadcast.BroadcastTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "$path/spark/python/pyspark/tests/test_broadcast.py", line 67, in test_broadcast_with_encryption
    self._test_multiple_broadcasts(("spark.io.encryption.enabled", "true"))
  File "$path/spark/python/pyspark/tests/test_broadcast.py", line 58, in _test_multiple_broadcasts
    conf = SparkConf()
  File "$path/spark/python/pyspark/conf.py", line 120, in __init__
    self._jconf = _jvm.SparkConf(loadDefaults)
  File "$path/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1709, in __getattr__
    answer = self._gateway_client.send_command(
  File "$path/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1036, in send_command
    connection = self._get_connection()
  File "$path/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/clientserver.py", line 284, in _get_connection
    connection = self._create_new_connection()
  File "$path/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/clientserver.py", line 291, in _create_new_connection
    connection.connect_to_java_server()
  File "$path/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/clientserver.py", line 438, in connect_to_java_server
    self.socket.connect((self.java_address, self.java_port))
ConnectionRefusedError: [Errno 61] Connection refused

----------------------------------------------------------------------
Ran 7 tests in 12.950s

FAILED (errors=7)
sys:1: ResourceWarning: unclosed file <_io.BufferedWriter name=4>
Had test failures in pyspark.tests.test_broadcast with /usr/local/bin/python3; see logs.
```
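For reference, here is a minimal check (just a sketch; it assumes the freshly built tree's `python` directory and its bundled py4j zip are on `PYTHONPATH`) to see whether the JVM gateway can come up at all outside the test harness:

```
# Minimal gateway check (sketch): start a bare SparkContext outside the
# test runner. If the JVM side fails to launch, this raises the same
# ConnectionRefusedError as the broadcast tests above.
from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local[2]").setAppName("gateway-check")
sc = SparkContext(conf=conf)
print(sc.parallelize(range(10)).sum())  # expect 45 if the gateway is healthy
sc.stop()
```

If this also fails with `Connection refused`, the problem is the gateway launch itself (e.g. `JAVA_HOME` or the build) rather than anything specific to the broadcast tests.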
Best,

Adam Chhina

> On Jan 18, 2023, at 5:03 PM, Sean Owen <sro...@gmail.com> wrote:
> 
> That isn't the released version either, but rather the head of the 3.2 branch (which is beyond 3.2.3).
> You may want to check out the v3.2.3 tag instead:
> https://github.com/apache/spark/tree/v3.2.3
> ... instead of 3.2.1.
> But note, of course, that 3.3.x is the current release branch anyway.
> 
> Hard to say what the error is without seeing more of the error log.
> 
> That final warning is fine; it just means you are using Java 11+.
> 
> On Wed, Jan 18, 2023 at 3:59 PM Adam Chhina <amanschh...@gmail.com> wrote:
>> Oh, whoops, didn't realize that wasn't the release version, thanks!
>> 
>> > git clone --branch branch-3.2 https://github.com/apache/spark.git
>> 
>> Ah, so the old failing tests are passing now, but I am seeing failures in `pyspark.tests.test_broadcast` such as `test_broadcast_value_against_gc`, with a majority of them failing due to `ConnectionRefusedError: [Errno 61] Connection refused`. Maybe these tests are not meant to be run locally, only in the pipeline?
>> 
>> Also, I see this warning that mentions notifying the maintainers:
>> 
>> ```
>> Starting test(/usr/local/bin/python3): pyspark.tests.test_broadcast
>> WARNING: An illegal reflective access operation has occurred
>> WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/$path/spark/common/unsafe/target/scala-2.12/classes/) to constructor java.nio.DirectByteBuffer(long,int)
>> ```
>> 
>> FWIW, not sure if this matters, but the Python executable used for running these tests is `Python 3.10.9` under `/usr/local/bin/python3`.
>> 
>> Best,
>> 
>> Adam Chhina
>> 
>>> On Jan 18, 2023, at 3:05 PM, Bjørn Jørgensen <bjornjorgen...@gmail.com> wrote:
>>> 
>>> Replace
>>> 
>>> > > git clone g...@github.com:apache/spark.git
>>> > > git checkout -b spark-321 v3.2.1
>>> 
>>> with
>>> 
>>> git clone --branch branch-3.2 https://github.com/apache/spark.git
>>> 
>>> This will give you branch-3.2 as of today, which I suppose is what you call upstream:
>>> https://github.com/apache/spark/commits/branch-3.2
>>> and right now all tests in GitHub Actions are passing :)
>>> 
>>> On Wed, Jan 18, 2023 at 18:07, Sean Owen <sro...@gmail.com> wrote:
>>>> Never seen those, but it's probably a difference in pandas/numpy versions. You can see the current CI/CD test results in GitHub Actions.
>>>> But you want to use release versions, not an RC. 3.2.1 is not the latest version, and it's possible the tests were actually failing in the RC.
>>>> 
>>>> On Wed, Jan 18, 2023, 10:57 AM Adam Chhina <amanschh...@gmail.com> wrote:
>>>>> Bump,
>>>>> 
>>>>> Just trying to see where I can find which tests are known to be failing for a particular release, to make sure I'm building upstream correctly following the build docs. I figured this would be the best place to ask, as it pertains to building and testing upstream (and I'm more than happy to provide a PR for any docs afterwards if needed); however, if there is a more appropriate place, please let me know.
>>>>> 
>>>>> Best,
>>>>> 
>>>>> Adam Chhina
>>>>> 
>>>>> > On Dec 27, 2022, at 11:37 AM, Adam Chhina <amanschh...@gmail.com> wrote:
>>>>> > 
>>>>> > As part of an upgrade I was looking to run the upstream PySpark unit tests on `v3.2.1-rc2` before applying some downstream patches and testing those. However, I'm running into some failing unit tests, and I'm not sure whether they are failing upstream or due to some step I missed in the build.
>>>>> > 
>>>>> > The failing tests so far (I believe the Python script exits on the first test failure):
>>>>> > ```
>>>>> > ======================================================================
>>>>> > FAIL: test_train_prediction (pyspark.mllib.tests.test_streaming_algorithms.StreamingLinearRegressionWithTests)
>>>>> > Test that error on test data improves as model is trained.
>>>>> > ----------------------------------------------------------------------
>>>>> > Traceback (most recent call last):
>>>>> >   File "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 474, in test_train_prediction
>>>>> >     eventually(condition, timeout=180.0)
>>>>> >   File "/Users/adam/OSS/spark/python/pyspark/testing/utils.py", line 86, in eventually
>>>>> >     lastValue = condition()
>>>>> >   File "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 469, in condition
>>>>> >     self.assertGreater(errors[1] - errors[-1], 2)
>>>>> > AssertionError: 1.8960983527735014 not greater than 2
>>>>> > 
>>>>> > ======================================================================
>>>>> > FAIL: test_parameter_accuracy (pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests)
>>>>> > Test that the final value of weights is close to the desired value.
>>>>> > ----------------------------------------------------------------------
>>>>> > Traceback (most recent call last):
>>>>> >   File "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 229, in test_parameter_accuracy
>>>>> >     eventually(condition, timeout=60.0, catch_assertions=True)
>>>>> >   File "/Users/adam/OSS/spark/python/pyspark/testing/utils.py", line 91, in eventually
>>>>> >     raise lastValue
>>>>> >   File "/Users/adam/OSS/spark/python/pyspark/testing/utils.py", line 82, in eventually
>>>>> >     lastValue = condition()
>>>>> >   File "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 226, in condition
>>>>> >     self.assertAlmostEqual(rel, 0.1, 1)
>>>>> > AssertionError: 0.23052813480829393 != 0.1 within 1 places (0.13052813480829392 difference)
>>>>> > 
>>>>> > ======================================================================
>>>>> > FAIL: test_training_and_prediction (pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests)
>>>>> > Test that the model improves on toy data with no. of batches
>>>>> > ----------------------------------------------------------------------
>>>>> > Traceback (most recent call last):
>>>>> >   File "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 334, in test_training_and_prediction
>>>>> >     eventually(condition, timeout=180.0)
>>>>> >   File "/Users/adam/OSS/spark/python/pyspark/testing/utils.py", line 93, in eventually
>>>>> >     raise AssertionError(
>>>>> > AssertionError: Test failed due to timeout after 180 sec, with last condition returning: Latest errors: 0.67, 0.71, 0.78, 0.7, 0.75, 0.74, 0.73, 0.69, 0.62, 0.71, 0.69, 0.75, 0.72, 0.77, 0.71, 0.74, 0.76, 0.78, 0.7, 0.78, 0.8, 0.74, 0.77, 0.75, 0.76, 0.76, 0.75, 0.78, 0.74, 0.64, 0.64, 0.71, 0.78, 0.76, 0.64, 0.68, 0.69, 0.72, 0.77
>>>>> > 
>>>>> > ----------------------------------------------------------------------
>>>>> > Ran 13 tests in 661.536s
>>>>> > 
>>>>> > FAILED (failures=3, skipped=1)
>>>>> > 
>>>>> > Had test failures in pyspark.mllib.tests.test_streaming_algorithms with /usr/local/bin/python3; see logs.
>>>>> > ```
>>>>> > 
>>>>> > Here's how I'm currently building Spark; I used the [building-spark](https://spark.apache.org/docs/3.2.1/building-spark.html) docs as a reference.
>>>>> > ```
>>>>> > > git clone g...@github.com:apache/spark.git
>>>>> > > git checkout -b spark-321 v3.2.1
>>>>> > > ./build/mvn -DskipTests clean package -Phive
>>>>> > > export JAVA_HOME=$(path/to/jdk/11)
>>>>> > > ./python/run-tests
>>>>> > ```
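>>>>> > 
>>>>> > (For what it's worth, while iterating I believe a single suite can be re-run on its own, assuming the `--testnames` and `--python-executables` options of `python/run-tests` documented for master are also available on this branch: `./python/run-tests --python-executables=python3 --testnames pyspark.mllib.tests.test_streaming_algorithms`.)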
>>>>> > 
>>>>> > Current Java version:
>>>>> > ```
>>>>> > java -version
>>>>> > openjdk version "11.0.17" 2022-10-18
>>>>> > OpenJDK Runtime Environment Homebrew (build 11.0.17+0)
>>>>> > OpenJDK 64-Bit Server VM Homebrew (build 11.0.17+0, mixed mode)
>>>>> > ```
>>>>> > 
>>>>> > Alternatively, I've also tried simply building Spark, creating a Python 3.9 venv, installing the requirements via `pip install -r dev/requirements.txt`, and using that interpreter to run the tests. However, I was running into some failing pandas tests, which seemed to come from a pandas version difference, since `requirements.txt` doesn't pin a version.
>>>>> > 
>>>>> > I suppose I have a few questions regarding this:
>>>>> > 1. Am I missing a build step to build Spark and run the PySpark unit tests?
>>>>> > 2. Where can I find whether an upstream test is failing for a specific release?
>>>>> > 3. Would it be possible to configure the `run-tests` script to run all tests regardless of test failures?
>>>>> 
>>>>> 
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>> 
>>> 
>>> 
>>> --
>>> Bjørn Jørgensen
>>> Vestre Aspehaug 4, 6010 Ålesund
>>> Norge
>>> 
>>> +47 480 94 297
>> 