It's not clear what error you're facing from this info (ConnectionError
could mean lots of things), so it would be hard to generalize an answer. How
much memory do you have on your Mac?
-Xmx2g sounds low, but it also probably doesn't matter much.
Spark builds work on my Mac, FWIW.
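
If memory really is the bottleneck, a quick check plus a bigger heap is cheap to try. A minimal sketch (the 4g value is an arbitrary bump, not a documented recommendation):

```
# physical memory on macOS, in bytes
sysctl -n hw.memsize
# retry the build/tests with a larger JVM heap
export MAVEN_OPTS="-Xss64m -Xmx4g -XX:ReservedCodeCacheSize=1g"
```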

On Thu, Jan 19, 2023 at 10:15 AM Adam Chhina <amanschh...@gmail.com> wrote:

> Hmm, would there be a list of common env issues that would interfere with
> builds? Looking up the error message, it seemed like often the issue was
> the JVM process hitting OOM. I’m not sure if that’s what’s happening here,
> since the config should have allocated enough memory during the build and
> test setup?
>
> I’ve been just trying to follow the build docs, and so far I’m running as
> such:
>
> > git clone --branch v3.2.3 https://github.com/apache/spark.git
> > cd spark
> > export MAVEN_OPTS="-Xss64m -Xmx2g -XX:ReservedCodeCacheSize=1g" # was
> unset, but set to be safe
> > export OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES # I saw on the
> developer tools page that some pyspark tests were having issues on macOS
> > export JAVA_HOME=`/usr/libexec/java_home -v 11`
> > ./build/mvn -DskipTests clean package -Phive
> > ./python/run-tests --python-executables=python3 --testnames
> 'pyspark.tests.test_broadcast'
>
> > java -version
>
> openjdk version "11.0.17" 2022-10-18
>
> OpenJDK Runtime Environment Homebrew (build 11.0.17+0)
>
> OpenJDK 64-Bit Server VM Homebrew (build 11.0.17+0, mixed mode)
>
>
> > OS
>
> Ventura 13.1 (22C65)
>
>
> Best,
>
>
> Adam Chhina
>
> On Jan 18, 2023, at 6:50 PM, Sean Owen <sro...@gmail.com> wrote:
>
> Release _branches_ are tested as commits arrive to the branch, yes. That's
> what you see at https://github.com/apache/spark/actions
> Released versions are fixed, they don't change, and were also manually
> tested before release, so no they are not re-tested; there is no need.
>
> You presumably have some local env issue, because the source of Spark
> 3.2.3 was passing CI/CD at time of release as well as manual tests of the
> PMC.
>
>
> On Wed, Jan 18, 2023 at 5:24 PM Adam Chhina <amanschh...@gmail.com> wrote:
>
>> Hi Sean,
>>
>> That’s fair in regards to 3.3.x being the current release branch. I’m not
>> familiar with the testing schedule, but I had assumed all currently
>> supported release versions would have some nightly/weekly tests run; is
>> that not the case? I only ask because, seeing these test failures, I
>> assumed they would already be known (or not) from some recurring
>> testing pipeline.
>>
>> Also, unfortunately using v3.2.3 also had the same test failures.
>>
>> > git clone --branch v3.2.3 https://github.com/apache/spark.git
>>
>> I’ve posted the traceback below for one of the tests that ran. At the end
>> it mentions checking the logs - `see logs`. However, I wasn’t sure whether
>> that meant just the traceback or more detailed logs elsewhere. I wasn’t
>> able to find any relevant-looking files by running `find . -name
>> "*logs*"` afterwards. Sorry if I’m missing something obvious.
>>
>> ```
>> test_broadcast_no_encryption (pyspark.tests.test_broadcast.BroadcastTest)
>> ... ERROR
>> test_broadcast_value_against_gc
>> (pyspark.tests.test_broadcast.BroadcastTest) ... ERROR
>> test_broadcast_value_driver_encryption
>> (pyspark.tests.test_broadcast.BroadcastTest) ... ERROR
>> test_broadcast_value_driver_no_encryption
>> (pyspark.tests.test_broadcast.BroadcastTest) ... ERROR
>> test_broadcast_with_encryption
>> (pyspark.tests.test_broadcast.BroadcastTest) ... ERROR
>>
>> ======================================================================
>> ERROR: test_broadcast_with_encryption
>> (pyspark.tests.test_broadcast.BroadcastTest)
>> ----------------------------------------------------------------------
>> Traceback (most recent call last):
>>   File "$path/spark/python/pyspark/tests/test_broadcast.py", line 67, in
>> test_broadcast_with_encryption
>>     self._test_multiple_broadcasts(("spark.io.encryption.enabled",
>> "true"))
>>   File "$path/spark/python/pyspark/tests/test_broadcast.py", line 58, in
>> _test_multiple_broadcasts
>>     conf = SparkConf()
>>   File "$path/spark/python/pyspark/conf.py", line 120, in __init__
>>     self._jconf = _jvm.SparkConf(loadDefaults)
>>   File
>> "$path/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line
>> 1709, in __getattr__
>>     answer = self._gateway_client.send_command(
>>   File
>> "$path/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line
>> 1036, in send_command
>>     connection = self._get_connection()
>>   File
>> "$path/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/clientserver.py", line
>> 284, in _get_connection
>>     connection = self._create_new_connection()
>>   File
>> "$path/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/clientserver.py", line
>> 291, in _create_new_connection
>>     connection.connect_to_java_server()
>>   File
>> "$path/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/clientserver.py", line
>> 438, in connect_to_java_server
>>     self.socket.connect((self.java_address, self.java_port))
>> ConnectionRefusedError: [Errno 61] Connection refused
>>
>> ----------------------------------------------------------------------
>> Ran 7 tests in 12.950s
>>
>> FAILED (errors=7)
>> sys:1: ResourceWarning: unclosed file <_io.BufferedWriter name=4>
>>
>> Had test failures in pyspark.tests.test_broadcast with
>> /usr/local/bin/python3; see logs.
>> ```
>>
>> Best,
>>
>> Adam Chhina
>>
>> On Jan 18, 2023, at 5:03 PM, Sean Owen <sro...@gmail.com> wrote:
>>
>> That isn't the released version either, but rather the head of the 3.2
>> branch (which is beyond 3.2.3).
>> You may want to check out the v3.2.3 tag instead:
>> https://github.com/apache/spark/tree/v3.2.3
>> ... instead of 3.2.1.
>> But note, of course, that 3.3.x is the current release branch anyway.
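>>
>> For example (just a sketch; either form lands you on the released 3.2.3
>> source):
>>
>> ```
>> git clone --branch v3.2.3 https://github.com/apache/spark.git
>> # or, in an existing clone:
>> git fetch --tags && git checkout v3.2.3
>> ```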
>>
>> Hard to say what the error is without seeing more of the error log.
>>
>> That final warning is fine, just means you are using Java 11+.
>>
>>
>> On Wed, Jan 18, 2023 at 3:59 PM Adam Chhina <amanschh...@gmail.com>
>> wrote:
>>
>>> Oh, whoops, didn’t realize that wasn’t the release version, thanks!
>>>
>>> > git clone --branch branch-3.2 https://github.com/apache/spark.git
>>>
>>> Ah, so the old failing tests are passing now, but I am seeing failures
>>> in `pyspark.tests.test_broadcast`, such as `test_broadcast_value_against_gc`,
>>> with a majority of them failing due to `ConnectionRefusedError: [Errno
>>> 61] Connection refused`. Maybe these tests are not meant to be run locally,
>>> and only in the pipeline?
>>>
>>> Also, I see this warning that mentions to notify the maintainers here:
>>>
>>> ```
>>> Starting test(/usr/local/bin/python3): pyspark.tests.test_broadcast
>>> WARNING: An illegal reflective access operation has occurred
>>> WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform
>>> (file:/$path/spark/common/unsafe/target/scala-2.12/classes/) to constructor
>>> java.nio.DirectByteBuffer(long,int)
>>> ```
>>>
>>> FWIW, not sure if this matters, but the Python executable used for running
>>> these tests is `Python 3.10.9` under `/usr/local/bin/python3`.
>>>
>>> Best,
>>>
>>> Adam Chhina
>>>
>>> On Jan 18, 2023, at 3:05 PM, Bjørn Jørgensen <bjornjorgen...@gmail.com>
>>> wrote:
>>>
>>> Replace
>>> > > git clone git@github.com:apache/spark.git
>>> > > git checkout -b spark-321 v3.2.1
>>>
>>> with
>>> git clone --branch branch-3.2 https://github.com/apache/spark.git
>>> This will give you branch-3.2 as of today, which I suppose is what you
>>> call upstream:
>>> https://github.com/apache/spark/commits/branch-3.2
>>> and right now all tests in GitHub Actions are passing :)
>>>
>>>
>>> ons. 18. jan. 2023 kl. 18:07 skrev Sean Owen <sro...@gmail.com>:
>>>
>>>> Never seen those, but it's probably a difference in pandas/numpy
>>>> versions. You can see the current CI/CD test results in GitHub Actions. But
>>>> you want to use release versions, not an RC. 3.2.1 is not the latest
>>>> version, and it's possible the tests were actually failing in the RC.
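>>>>
>>>> If it is a version mismatch, pinning pandas/numpy in a fresh venv is one
>>>> way to test that theory; the pins below are illustrative guesses, not the
>>>> versions CI actually uses:
>>>>
>>>> ```
>>>> python3 -m venv .venv && source .venv/bin/activate
>>>> pip install 'pandas==1.3.5' 'numpy==1.21.6'  # hypothetical pins
>>>> ```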
>>>>
>>>> On Wed, Jan 18, 2023, 10:57 AM Adam Chhina <amanschh...@gmail.com>
>>>> wrote:
>>>>
>>>>> Bump,
>>>>>
>>>>> Just trying to see where I can find which tests are known to be failing
>>>>> for a particular release, to ensure I’m building upstream correctly
>>>>> following the build docs. I figured this would be the best place to ask
>>>>> as it pertains to building and testing upstream (also more than happy to
>>>>> provide a PR for any docs if required afterwards); however, if there is a
>>>>> more appropriate place, please let me know.
>>>>>
>>>>> Best,
>>>>>
>>>>> Adam Chhina
>>>>>
>>>>> > On Dec 27, 2022, at 11:37 AM, Adam Chhina <amanschh...@gmail.com>
>>>>> wrote:
>>>>> >
>>>>> > As part of an upgrade I was looking to run upstream PySpark unit
>>>>> tests on `v3.2.1-rc2` before I applied some downstream patches and tested
>>>>> those. However, I'm running into some issues with failing unit tests, 
>>>>> which
>>>>> I'm not sure are failing upstream or due to some step I missed in the 
>>>>> build.
>>>>> >
>>>>> > The current failing tests (at least so far, since I believe the
>>>>> python script exits on test failure):
>>>>> > ```
>>>>> >
>>>>> ======================================================================
>>>>> > FAIL: test_train_prediction
>>>>> (pyspark.mllib.tests.test_streaming_algorithms.StreamingLinearRegressionWithTests)
>>>>> > Test that error on test data improves as model is trained.
>>>>> >
>>>>> ----------------------------------------------------------------------
>>>>> > Traceback (most recent call last):
>>>>> >   File
>>>>> "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>>>>> line 474, in test_train_prediction
>>>>> >     eventually(condition, timeout=180.0)
>>>>> >   File "/Users/adam/OSS/spark/python/pyspark/testing/utils.py", line
>>>>> 86, in eventually
>>>>> >     lastValue = condition()
>>>>> >   File
>>>>> "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>>>>> line 469, in condition
>>>>> >     self.assertGreater(errors[1] - errors[-1], 2)
>>>>> > AssertionError: 1.8960983527735014 not greater than 2
>>>>> >
>>>>> >
>>>>> ======================================================================
>>>>> > FAIL: test_parameter_accuracy
>>>>> (pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests)
>>>>> > Test that the final value of weights is close to the desired value.
>>>>> >
>>>>> ----------------------------------------------------------------------
>>>>> > Traceback (most recent call last):
>>>>> >   File
>>>>> "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>>>>> line 229, in test_parameter_accuracy
>>>>> >     eventually(condition, timeout=60.0, catch_assertions=True)
>>>>> >   File "/Users/adam/OSS/spark/python/pyspark/testing/utils.py", line
>>>>> 91, in eventually
>>>>> >     raise lastValue
>>>>> >   File "/Users/adam/OSS/spark/python/pyspark/testing/utils.py", line
>>>>> 82, in eventually
>>>>> >     lastValue = condition()
>>>>> >   File
>>>>> "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>>>>> line 226, in condition
>>>>> >     self.assertAlmostEqual(rel, 0.1, 1)
>>>>> > AssertionError: 0.23052813480829393 != 0.1 within 1 places
>>>>> (0.13052813480829392 difference)
>>>>> >
>>>>> >
>>>>> ======================================================================
>>>>> > FAIL: test_training_and_prediction
>>>>> (pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests)
>>>>> > Test that the model improves on toy data with no. of batches
>>>>> >
>>>>> ----------------------------------------------------------------------
>>>>> > Traceback (most recent call last):
>>>>> >   File
>>>>> "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>>>>> line 334, in test_training_and_prediction
>>>>> >     eventually(condition, timeout=180.0)
>>>>> >   File "/Users/adam/OSS/spark/python/pyspark/testing/utils.py", line
>>>>> 93, in eventually
>>>>> >     raise AssertionError(
>>>>> > AssertionError: Test failed due to timeout after 180 sec, with last
>>>>> condition returning: Latest errors: 0.67, 0.71, 0.78, 0.7, 0.75, 0.74,
>>>>> 0.73, 0.69, 0.62, 0.71, 0.69, 0.75, 0.72, 0.77, 0.71, 0.74, 0.76, 0.78,
>>>>> 0.7, 0.78, 0.8, 0.74, 0.77, 0.75, 0.76, 0.76, 0.75, 0.78, 0.74, 0.64, 
>>>>> 0.64,
>>>>> 0.71, 0.78, 0.76, 0.64, 0.68, 0.69, 0.72, 0.77
>>>>> >
>>>>> >
>>>>> ----------------------------------------------------------------------
>>>>> > Ran 13 tests in 661.536s
>>>>> >
>>>>> > FAILED (failures=3, skipped=1)
>>>>> >
>>>>> > Had test failures in pyspark.mllib.tests.test_streaming_algorithms
>>>>> with /usr/local/bin/python3; see logs.
>>>>> > ```
>>>>> >
>>>>> > Here's how I'm currently building Spark, I was using the
>>>>> [building-spark](
>>>>> https://spark.apache.org/docs/3.2.1/building-spark.html) docs as a
>>>>> reference.
>>>>> > ```
>>>>> > > git clone git@github.com:apache/spark.git
>>>>> > > git checkout -b spark-321 v3.2.1
>>>>> > > ./build/mvn -DskipTests clean package -Phive
>>>>> > > export JAVA_HOME=$(path/to/jdk/11)
>>>>> > > ./python/run-tests
>>>>> > ```
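>>>>> >
>>>>> > As an aside, the same script can also target individual modules, which
>>>>> makes iterating on a single failure faster; this uses only the
>>>>> `--testnames` flag that appears elsewhere in this thread:
>>>>> > ```
>>>>> > ./python/run-tests --testnames 'pyspark.tests.test_broadcast'
>>>>> > ```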
>>>>> >
>>>>> > Current Java version
>>>>> > ```
>>>>> > java -version
>>>>> > openjdk version "11.0.17" 2022-10-18
>>>>> > OpenJDK Runtime Environment Homebrew (build 11.0.17+0)
>>>>> > OpenJDK 64-Bit Server VM Homebrew (build 11.0.17+0, mixed mode)
>>>>> > ```
>>>>> >
>>>>> > Alternatively, I've also tried simply building Spark, creating a
>>>>> python=3.9 venv, installing the requirements via `pip install -r
>>>>> dev/requirements.txt`, and using that as the interpreter to run tests.
>>>>> However, I was running into some failing pandas tests, which to me seemed
>>>>> like they were coming from a pandas version difference, as `requirements.txt`
>>>>> didn't specify a version.
>>>>> >
>>>>> > I suppose I have a couple of questions in regards to this:
>>>>> > 1. Am I missing a build step to build Spark and run PySpark unit
>>>>> tests?
>>>>> > 2. Where could I find whether an upstream test is failing for a
>>>>> specific release?
>>>>> > 3. Would it be possible to configure the `run-tests` script to run
>>>>> all tests regardless of test failures?
>>>>>
>>>>>
>>>>>
>>>
>>> --
>>> Bjørn Jørgensen
>>> Vestre Aspehaug 4, 6010 Ålesund
>>> Norge
>>>
>>> +47 480 94 297
>>>
>>>
>>>
>>
>
