Re: Building Spark to run PySpark Tests?

2023-01-19 Thread Sean Owen
It's not clear what error you're facing from this info (ConnectionError
could mean lots of things), so it would be hard to generalize answers. How
much memory do you have on your Mac?
-Xmx2g sounds low, but also probably doesn't matter much.
Spark builds work on my Mac, FWIW.
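
For what it's worth, one way to rule out JVM memory pressure during the build is
to give Maven noticeably more heap before rebuilding. A minimal sketch (the 4g
value is just an illustration, not a number from the build docs):

```
# Illustrative only: larger heap than -Xmx2g, purely to rule out build-time OOM
export MAVEN_OPTS="-Xss64m -Xmx4g -XX:ReservedCodeCacheSize=1g"
./build/mvn -DskipTests clean package -Phive
```

If the failures persist with a visibly larger heap, memory is probably not the culprit.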

On Thu, Jan 19, 2023 at 10:15 AM Adam Chhina  wrote:

> Hmm, would there be a list of common env issues that would interfere with
> builds? Looking up the error message, it seemed like the issue was often the
> JVM process running out of memory (OOM). I’m not sure if that’s what’s
> happening here, since during the build and while setting up the tests the
> config should have allocated enough memory?
>
> I’ve been just trying to follow the build docs, and so far I’m running as
> such:
>
> > git clone --branch v3.2.3 https://github.com/apache/spark.git
> > cd spark
> > export MAVEN_OPTS="-Xss64m -Xmx2g -XX:ReservedCodeCacheSize=1g" // was
> unset, but set to be safe
> > export OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES // I saw in the
> developer tools that some pyspark tests were having issues on macOS
> > export JAVA_HOME=`/usr/libexec/java_home -v 11`
> > ./build/mvn -DskipTests clean package -Phive
> > ./python/run-tests --python-executables --testnames
> 'pyspark.tests.test_broadcast'
>
> > java -version
>
> openjdk version "11.0.17" 2022-10-18
>
> OpenJDK Runtime Environment Homebrew (build 11.0.17+0)
>
> OpenJDK 64-Bit Server VM Homebrew (build 11.0.17+0, mixed mode)
>
>
> > OS
>
> Ventura 13.1 (22C65)
>
>
> Best,
>
>
> Adam Chhina
>
> On Jan 18, 2023, at 6:50 PM, Sean Owen  wrote:
>
> Release _branches_ are tested as commits arrive to the branch, yes. That's
> what you see at https://github.com/apache/spark/actions
> Released versions are fixed, they don't change, and were also manually
> tested before release, so no they are not re-tested; there is no need.
>
> You presumably have some local env issue, because the source of Spark
> 3.2.3 was passing CI/CD at time of release as well as manual tests of the
> PMC.
>
>
> On Wed, Jan 18, 2023 at 5:24 PM Adam Chhina  wrote:
>
>> Hi Sean,
>>
>> That’s fair with regard to 3.3.x being the current release branch. I’m not
>> familiar with the testing schedule, but I had assumed all currently
>> supported release versions would have some nightly/weekly tests run; is
>> that not the case? I only ask because when I see these test failures, I
>> assumed these would be either known or unknown from some recurring
>> testing pipeline.
>>
>> Also, unfortunately using v3.2.3 also had the same test failures.
>>
>> > git clone --branch v3.2.3 https://github.com/apache/spark.git
>>
>> I’ve posted the traceback below for one of the tests that ran. At the end it
>> mentions checking the logs (`see logs`); however, I wasn’t sure whether
>> that just meant the traceback or some more detailed logs elsewhere. I
>> wasn’t able to see any files that looked relevant by running `find . -name
>> "*logs*"` afterwards. Sorry if I’m missing something obvious.
>>
>> ```
>> test_broadcast_no_encryption (pyspark.tests.test_broadcast.BroadcastTest)
>> ... ERROR
>> test_broadcast_value_against_gc
>> (pyspark.tests.test_broadcast.BroadcastTest) ... ERROR
>> test_broadcast_value_driver_encryption
>> (pyspark.tests.test_broadcast.BroadcastTest) ... ERROR
>> test_broadcast_value_driver_no_encryption
>> (pyspark.tests.test_broadcast.BroadcastTest) ... ERROR
>> test_broadcast_with_encryption
>> (pyspark.tests.test_broadcast.BroadcastTest) ... ERROR
>>
>> ==
>> ERROR: test_broadcast_with_encryption
>> (pyspark.tests.test_broadcast.BroadcastTest)
>> --
>> Traceback (most recent call last):
>>   File "$path/spark/python/pyspark/tests/test_broadcast.py", line 67, in
>> test_broadcast_with_encryption
>> self._test_multiple_broadcasts(("spark.io.encryption.enabled",
>> "true"))
>>   File "$path/spark/python/pyspark/tests/test_broadcast.py", line 58, in
>> _test_multiple_broadcasts
>> conf = SparkConf()
>>   File "$path/spark/python/pyspark/conf.py", line 120, in __init__
>> self._jconf = _jvm.SparkConf(loadDefaults)
>>   File
>> "$path/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line
>> 1709, in __getattr__
>> answer = self._gateway_client.send_command(
>>   File
>> "$path/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line
>> 1036, in send_command
>> connection = self._get_connection()
>>   File
>> "$path/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/clientserver.py", line
>> 284, in _get_connection
>> connection = self._create_new_connection()
>>   File
>> "$path/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/clientserver.py", line
>> 291, in _create_new_connection
>> connection.connect_to_java_server()
>>   File
>> "$path/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/clientserver.py", line
>> 438, in connect_to_java_server
>> self.socket.connect((self.java_address, 

Re: Building Spark to run PySpark Tests?

2023-01-19 Thread Adam Chhina
Hmm, would there be a list of common env issues that would interfere with 
builds? Looking up the error message, it seemed like the issue was often the JVM 
process running out of memory (OOM). I’m not sure if that’s what’s happening here, 
since during the build and while setting up the tests the config should have 
allocated enough memory?

I’ve been just trying to follow the build docs, and so far I’m running as such:

> git clone --branch v3.2.3 https://github.com/apache/spark.git 
> cd spark
> export MAVEN_OPTS="-Xss64m -Xmx2g -XX:ReservedCodeCacheSize=1g" // was unset, 
> but set to be safe
> export OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES // I saw in the developer 
> tools that some pyspark tests were having issues on macOS
> export JAVA_HOME=`/usr/libexec/java_home -v 11`
> ./build/mvn -DskipTests clean package -Phive
> ./python/run-tests --python-executables --testnames 
> 'pyspark.tests.test_broadcast'
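
(For reference, I believe `--python-executables` normally takes a comma-separated
list of interpreters; a hedged sketch of the invocation with the interpreter
spelled out, in case that matters:)

```
# Sketch: run a single PySpark test module against an explicit interpreter
export JAVA_HOME=$(/usr/libexec/java_home -v 11)
./python/run-tests --python-executables=python3 --testnames 'pyspark.tests.test_broadcast'
```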

> java -version
openjdk version "11.0.17" 2022-10-18
OpenJDK Runtime Environment Homebrew (build 11.0.17+0)
OpenJDK 64-Bit Server VM Homebrew (build 11.0.17+0, mixed mode)

> OS
Ventura 13.1 (22C65)

Best,

Adam Chhina

> On Jan 18, 2023, at 6:50 PM, Sean Owen  wrote:
> 
> Release _branches_ are tested as commits arrive to the branch, yes. That's 
> what you see at https://github.com/apache/spark/actions
> Released versions are fixed, they don't change, and were also manually tested 
> before release, so no they are not re-tested; there is no need.
> 
> You presumably have some local env issue, because the source of Spark 3.2.3 
> was passing CI/CD at time of release as well as manual tests of the PMC.
> 
> 
> On Wed, Jan 18, 2023 at 5:24 PM Adam Chhina  > wrote:
>> Hi Sean,
>> 
>> That’s fair with regard to 3.3.x being the current release branch. I’m not 
>> familiar with the testing schedule, but I had assumed all currently 
>> supported release versions would have some nightly/weekly tests run; is that 
>> not the case? I only ask because when I see these test failures, I 
>> assumed these would be either known or unknown from some recurring testing pipeline.
>> 
>> Also, unfortunately using v3.2.3 also had the same test failures.
>> 
>> > git clone --branch v3.2.3 https://github.com/apache/spark.git
>> 
>> I’ve posted the traceback below for one of the tests that ran. At the end it 
>> mentions checking the logs (`see logs`); however, I wasn’t sure whether that 
>> just meant the traceback or some more detailed logs elsewhere. I wasn’t able 
>> to see any files that looked relevant by running `find . -name "*logs*"` 
>> afterwards. Sorry if I’m missing something obvious.
>> 
>> ```
>> test_broadcast_no_encryption (pyspark.tests.test_broadcast.BroadcastTest) 
>> ... ERROR
>> test_broadcast_value_against_gc (pyspark.tests.test_broadcast.BroadcastTest) 
>> ... ERROR
>> test_broadcast_value_driver_encryption 
>> (pyspark.tests.test_broadcast.BroadcastTest) ... ERROR
>> test_broadcast_value_driver_no_encryption 
>> (pyspark.tests.test_broadcast.BroadcastTest) ... ERROR
>> test_broadcast_with_encryption (pyspark.tests.test_broadcast.BroadcastTest) 
>> ... ERROR
>> 
>> ==
>> ERROR: test_broadcast_with_encryption 
>> (pyspark.tests.test_broadcast.BroadcastTest)
>> --
>> Traceback (most recent call last):
>>   File "$path/spark/python/pyspark/tests/test_broadcast.py", line 67, in 
>> test_broadcast_with_encryption
>> self._test_multiple_broadcasts(("spark.io.encryption.enabled", "true"))
>>   File "$path/spark/python/pyspark/tests/test_broadcast.py", line 58, in 
>> _test_multiple_broadcasts
>> conf = SparkConf()
>>   File "$path/spark/python/pyspark/conf.py", line 120, in __init__
>> self._jconf = _jvm.SparkConf(loadDefaults)
>>   File "$path/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", 
>> line 1709, in __getattr__
>> answer = self._gateway_client.send_command(
>>   File "$path/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", 
>> line 1036, in send_command
>> connection = self._get_connection()
>>   File "$path/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/clientserver.py", 
>> line 284, in _get_connection
>> connection = self._create_new_connection()
>>   File "$path/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/clientserver.py", 
>> line 291, in _create_new_connection
>> connection.connect_to_java_server()
>>   File "$path/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/clientserver.py", 
>> line 438, in connect_to_java_server
>> self.socket.connect((self.java_address, self.java_port))
>> ConnectionRefusedError: [Errno 61] Connection refused
>> 
>> --
>> Ran 7 tests in 12.950s
>> 
>> FAILED (errors=7)
>> sys:1: ResourceWarning: unclosed file <_io.BufferedWriter name=4>
>> 
>> Had test failures in pyspark.tests.test_broadcast with 
>> 

Re: Building Spark to run PySpark Tests?

2023-01-18 Thread Sean Owen
Release _branches_ are tested as commits arrive to the branch, yes. That's
what you see at https://github.com/apache/spark/actions
Released versions are fixed, they don't change, and were also manually
tested before release, so no they are not re-tested; there is no need.

You presumably have some local env issue, because the source of Spark 3.2.3
was passing CI/CD at time of release as well as manual tests of the PMC.


On Wed, Jan 18, 2023 at 5:24 PM Adam Chhina  wrote:

> Hi Sean,
>
> That’s fair with regard to 3.3.x being the current release branch. I’m not
> familiar with the testing schedule, but I had assumed all currently
> supported release versions would have some nightly/weekly tests run; is
> that not the case? I only ask because when I see these test failures, I
> assumed these would be either known or unknown from some recurring
> testing pipeline.
>
> Also, unfortunately using v3.2.3 also had the same test failures.
>
> > git clone --branch v3.2.3 https://github.com/apache/spark.git
>
> I’ve posted the traceback below for one of the tests that ran. At the end it
> mentions checking the logs (`see logs`); however, I wasn’t sure whether
> that just meant the traceback or some more detailed logs elsewhere. I
> wasn’t able to see any files that looked relevant by running `find . -name
> "*logs*"` afterwards. Sorry if I’m missing something obvious.
>
> ```
> test_broadcast_no_encryption (pyspark.tests.test_broadcast.BroadcastTest)
> ... ERROR
> test_broadcast_value_against_gc
> (pyspark.tests.test_broadcast.BroadcastTest) ... ERROR
> test_broadcast_value_driver_encryption
> (pyspark.tests.test_broadcast.BroadcastTest) ... ERROR
> test_broadcast_value_driver_no_encryption
> (pyspark.tests.test_broadcast.BroadcastTest) ... ERROR
> test_broadcast_with_encryption
> (pyspark.tests.test_broadcast.BroadcastTest) ... ERROR
>
> ==
> ERROR: test_broadcast_with_encryption
> (pyspark.tests.test_broadcast.BroadcastTest)
> --
> Traceback (most recent call last):
>   File "$path/spark/python/pyspark/tests/test_broadcast.py", line 67, in
> test_broadcast_with_encryption
> self._test_multiple_broadcasts(("spark.io.encryption.enabled", "true"))
>   File "$path/spark/python/pyspark/tests/test_broadcast.py", line 58, in
> _test_multiple_broadcasts
> conf = SparkConf()
>   File "$path/spark/python/pyspark/conf.py", line 120, in __init__
> self._jconf = _jvm.SparkConf(loadDefaults)
>   File
> "$path/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line
> 1709, in __getattr__
> answer = self._gateway_client.send_command(
>   File
> "$path/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line
> 1036, in send_command
> connection = self._get_connection()
>   File
> "$path/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/clientserver.py", line
> 284, in _get_connection
> connection = self._create_new_connection()
>   File
> "$path/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/clientserver.py", line
> 291, in _create_new_connection
> connection.connect_to_java_server()
>   File
> "$path/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/clientserver.py", line
> 438, in connect_to_java_server
> self.socket.connect((self.java_address, self.java_port))
> ConnectionRefusedError: [Errno 61] Connection refused
>
> --
> Ran 7 tests in 12.950s
>
> FAILED (errors=7)
> sys:1: ResourceWarning: unclosed file <_io.BufferedWriter name=4>
>
> Had test failures in pyspark.tests.test_broadcast with
> /usr/local/bin/python3; see logs.
> ```
>
> Best,
>
> Adam Chhina
>
> On Jan 18, 2023, at 5:03 PM, Sean Owen  wrote:
>
> That isn't the released version either, but rather the head of the 3.2
> branch (which is beyond 3.2.3).
> You may want to check out the v3.2.3 tag instead:
> https://github.com/apache/spark/tree/v3.2.3
> ... instead of 3.2.1.
> But note of course the 3.3.x is the current release branch anyway.
>
> Hard to say what the error is without seeing more of the error log.
>
> That final warning is fine, just means you are using Java 11+.
>
>
> On Wed, Jan 18, 2023 at 3:59 PM Adam Chhina  wrote:
>
>> Oh, whoops, didn’t realize that wasn’t the release version, thanks!
>>
>> > git clone --branch branch-3.2 https://github.com/apache/spark.git
>>
>> Ah, so the old failing tests are passing now, but I am seeing failures in
>> `pyspark.tests.test_broadcast` such as `test_broadcast_value_against_gc`,
>> with a majority of them failing due to `ConnectionRefusedError: [Errno
>> 61] Connection refused`. Maybe these tests are not meant to be run locally,
>> and only in the pipeline?
>>
>> Also, I see this warning that mentions to notify the maintainers here:
>>
>> ```
>> Starting test(/usr/local/bin/python3): pyspark.tests.test_broadcast
>> WARNING: An illegal reflective access operation has occurred
>> 

Re: Building Spark to run PySpark Tests?

2023-01-18 Thread Adam Chhina
Hi Sean,

That’s fair with regard to 3.3.x being the current release branch. I’m not 
familiar with the testing schedule, but I had assumed all currently supported 
release versions would have some nightly/weekly tests run; is that not the 
case? I only ask because when I see these test failures, I assumed these would 
be either known or unknown from some recurring testing pipeline.

Also, unfortunately using v3.2.3 also had the same test failures.

> git clone --branch v3.2.3 https://github.com/apache/spark.git

I’ve posted the traceback below for one of the tests that ran. At the end it 
mentions checking the logs (`see logs`); however, I wasn’t sure whether that 
just meant the traceback or some more detailed logs elsewhere. I wasn’t able to 
see any files that looked relevant by running `find . -name "*logs*"` afterwards. 
Sorry if I’m missing something obvious.
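
A generic sweep for recently written log files is the best I can think of here
(a rough sketch, not a documented location; adjust the age filter as needed):

```
# Sketch: list log-like files modified in the last hour anywhere under the checkout
find . -type f \( -name '*.log' -o -name '*logs*' \) -mmin -60 2>/dev/null
```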

```
test_broadcast_no_encryption (pyspark.tests.test_broadcast.BroadcastTest) ... 
ERROR
test_broadcast_value_against_gc (pyspark.tests.test_broadcast.BroadcastTest) 
... ERROR
test_broadcast_value_driver_encryption 
(pyspark.tests.test_broadcast.BroadcastTest) ... ERROR
test_broadcast_value_driver_no_encryption 
(pyspark.tests.test_broadcast.BroadcastTest) ... ERROR
test_broadcast_with_encryption (pyspark.tests.test_broadcast.BroadcastTest) ... 
ERROR

==
ERROR: test_broadcast_with_encryption 
(pyspark.tests.test_broadcast.BroadcastTest)
--
Traceback (most recent call last):
  File "$path/spark/python/pyspark/tests/test_broadcast.py", line 67, in 
test_broadcast_with_encryption
self._test_multiple_broadcasts(("spark.io.encryption.enabled", "true"))
  File "$path/spark/python/pyspark/tests/test_broadcast.py", line 58, in 
_test_multiple_broadcasts
conf = SparkConf()
  File "$path/spark/python/pyspark/conf.py", line 120, in __init__
self._jconf = _jvm.SparkConf(loadDefaults)
  File "$path/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", 
line 1709, in __getattr__
answer = self._gateway_client.send_command(
  File "$path/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", 
line 1036, in send_command
connection = self._get_connection()
  File "$path/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/clientserver.py", 
line 284, in _get_connection
connection = self._create_new_connection()
  File "$path/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/clientserver.py", 
line 291, in _create_new_connection
connection.connect_to_java_server()
  File "$path/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/clientserver.py", 
line 438, in connect_to_java_server
self.socket.connect((self.java_address, self.java_port))
ConnectionRefusedError: [Errno 61] Connection refused

--
Ran 7 tests in 12.950s

FAILED (errors=7)
sys:1: ResourceWarning: unclosed file <_io.BufferedWriter name=4>

Had test failures in pyspark.tests.test_broadcast with /usr/local/bin/python3; 
see logs.
```

Best,

Adam Chhina

> On Jan 18, 2023, at 5:03 PM, Sean Owen  wrote:
> 
> That isn't the released version either, but rather the head of the 3.2 branch 
> (which is beyond 3.2.3).
> You may want to check out the v3.2.3 tag instead: 
> https://github.com/apache/spark/tree/v3.2.3
> ... instead of 3.2.1. 
> But note of course the 3.3.x is the current release branch anyway.
> 
> Hard to say what the error is without seeing more of the error log.
> 
> That final warning is fine, just means you are using Java 11+.
> 
> 
> On Wed, Jan 18, 2023 at 3:59 PM Adam Chhina  > wrote:
>> Oh, whoops, didn’t realize that wasn’t the release version, thanks!
>> 
>> > git clone --branch branch-3.2 https://github.com/apache/spark.git 
>> 
>> Ah, so the old failing tests are passing now, but I am seeing failures in 
>> `pyspark.tests.test_broadcast` such as `test_broadcast_value_against_gc`, 
>> with a majority of them failing due to `ConnectionRefusedError: [Errno 61] 
>> Connection refused`. Maybe these tests are not meant to be run locally, and 
>> only in the pipeline?
>> 
>> Also, I see this warning that mentions to notify the maintainers here:
>> 
>> ```
>> Starting test(/usr/local/bin/python3): pyspark.tests.test_broadcast
>> WARNING: An illegal reflective access operation has occurred
>> WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform 
>> (file:/$path/spark/common/unsafe/target/scala-2.12/classes/) to constructor 
>> java.nio.DirectByteBuffer(long,int)
>> ```
>> 
>> FWIW, not sure if this matters, but the Python executable used for running these 
>> tests is `Python 3.10.9` under `/usr/local/bin/python3`.
>> 
>> Best,
>> 
>> Adam Chhina
>> 
>>> On Jan 18, 2023, at 3:05 PM, Bjørn Jørgensen >> > wrote:
>>> 
>>> Replace 
>>> > > git clone g...@github.com:apache/spark.git
>>> > > git 

Re: Building Spark to run PySpark Tests?

2023-01-18 Thread Sean Owen
That isn't the released version either, but rather the head of the 3.2
branch (which is beyond 3.2.3).
You may want to check out the v3.2.3 tag instead:
https://github.com/apache/spark/tree/v3.2.3
... instead of 3.2.1.
But note of course the 3.3.x is the current release branch anyway.
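
A quick sketch of checking out the release tag and confirming what is actually
built (plain git, nothing Spark-specific):

```
git fetch --tags
git checkout v3.2.3
git describe --tags   # should print v3.2.3 if the working tree is on the release tag
```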

Hard to say what the error is without seeing more of the error log.

That final warning is fine, just means you are using Java 11+.


On Wed, Jan 18, 2023 at 3:59 PM Adam Chhina  wrote:

> Oh, whoops, didn’t realize that wasn’t the release version, thanks!
>
> > git clone --branch branch-3.2 https://github.com/apache/spark.git
>
> Ah, so the old failing tests are passing now, but I am seeing failures in 
> `pyspark.tests.test_broadcast`
> such as `test_broadcast_value_against_gc`, with a majority of them
> failing due to `ConnectionRefusedError: [Errno 61] Connection refused`.
> Maybe these tests are not meant to be run locally, and only in the pipeline?
>
> Also, I see this warning that mentions to notify the maintainers here:
>
> ```
>
> Starting test(/usr/local/bin/python3): pyspark.tests.test_broadcast
>
> WARNING: An illegal reflective access operation has occurred
>
> WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform
> (file:/$path/spark/common/unsafe/target/scala-2.12/classes/) to constructor
> java.nio.DirectByteBuffer(long,int)
> ```
>
> FWIW, not sure if this matters, but the Python executable used for running
> these tests is `Python 3.10.9` under `/usr/local/bin/python3`.
>
> Best,
>
> Adam Chhina
>
> On Jan 18, 2023, at 3:05 PM, Bjørn Jørgensen 
> wrote:
>
> Replace
> > > git clone g...@github.com:apache/spark.git
> > > git checkout -b spark-321 v3.2.1
>
> with
> git clone --branch branch-3.2 https://github.com/apache/spark.git
> This will give you branch-3.2 as of today, which I suppose is what you call upstream
>
> https://github.com/apache/spark/commits/branch-3.2
> and right now all tests in GitHub Actions are passing :)
>
>
> ons. 18. jan. 2023 kl. 18:07 skrev Sean Owen :
>
>> Never seen those, but it's probably a difference in pandas, numpy
>> versions. You can see the current CICD test results in GitHub Actions. But,
>> you want to use release versions, not an RC. 3.2.1 is not the latest
>> version, and it's possible the tests were actually failing in the RC.
>>
>> On Wed, Jan 18, 2023, 10:57 AM Adam Chhina  wrote:
>>
>>> Bump,
>>>
>>> Just trying to see where I can find what tests are known failing for a
>>> particular release, to ensure I’m building upstream correctly following the
>>> build docs. I figured this would be the best place to ask as it pertains to
>>> building and testing upstream (also more than happy to provide a PR for any
>>> docs if required afterwards), however if there would be a more appropriate
>>> place, please let me know.
>>>
>>> Best,
>>>
>>> Adam Chhina
>>>
>>> > On Dec 27, 2022, at 11:37 AM, Adam Chhina 
>>> wrote:
>>> >
>>> > As part of an upgrade I was looking to run upstream PySpark unit tests
>>> on `v3.2.1-rc2` before I applied some downstream patches and tested those.
>>> However, I'm running into some issues with failing unit tests, which I'm
>>> not sure are failing upstream or due to some step I missed in the build.
>>> >
>>> > The current failing tests (at least so far, since I believe the python
>>> script exits on test failure):
>>> > ```
>>> > ==
>>> > FAIL: test_train_prediction
>>> (pyspark.mllib.tests.test_streaming_algorithms.StreamingLinearRegressionWithTests)
>>> > Test that error on test data improves as model is trained.
>>> > --
>>> > Traceback (most recent call last):
>>> >   File
>>> "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>>> line 474, in test_train_prediction
>>> > eventually(condition, timeout=180.0)
>>> >   File "/Users/adam/OSS/spark/python/pyspark/testing/utils.py", line
>>> 86, in eventually
>>> > lastValue = condition()
>>> >   File
>>> "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>>> line 469, in condition
>>> > self.assertGreater(errors[1] - errors[-1], 2)
>>> > AssertionError: 1.8960983527735014 not greater than 2
>>> >
>>> > ==
>>> > FAIL: test_parameter_accuracy
>>> (pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests)
>>> > Test that the final value of weights is close to the desired value.
>>> > --
>>> > Traceback (most recent call last):
>>> >   File
>>> "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>>> line 229, in test_parameter_accuracy
>>> > eventually(condition, timeout=60.0, catch_assertions=True)
>>> >   File "/Users/adam/OSS/spark/python/pyspark/testing/utils.py", line
>>> 91, in eventually
>>> >   

Re: Building Spark to run PySpark Tests?

2023-01-18 Thread Adam Chhina
Oh, whoops, didn’t realize that wasn’t the release version, thanks!

> git clone --branch branch-3.2 https://github.com/apache/spark.git 

Ah, so the old failing tests are passing now, but I am seeing failures in 
`pyspark.tests.test_broadcast` such as `test_broadcast_value_against_gc`, with 
a majority of them failing due to `ConnectionRefusedError: [Errno 61] 
Connection refused`. Maybe these tests are not meant to be run locally, and only 
in the pipeline?

Also, I see this warning that mentions to notify the maintainers here:

```
Starting test(/usr/local/bin/python3): pyspark.tests.test_broadcast
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform 
(file:/$path/spark/common/unsafe/target/scala-2.12/classes/) to constructor 
java.nio.DirectByteBuffer(long,int)
```

FWIW, not sure if this matters, but the Python executable used for running these 
tests is `Python 3.10.9` under `/usr/local/bin/python3`.

Best,

Adam Chhina

> On Jan 18, 2023, at 3:05 PM, Bjørn Jørgensen  wrote:
> 
> Replace 
> > > git clone g...@github.com:apache/spark.git
> > > git checkout -b spark-321 v3.2.1
> 
> with 
> git clone --branch branch-3.2 https://github.com/apache/spark.git
> This will give you branch-3.2 as of today, which I suppose is what you call upstream:
> https://github.com/apache/spark/commits/branch-3.2
> and right now all tests in GitHub Actions are passing :)
> 
> 
> ons. 18. jan. 2023 kl. 18:07 skrev Sean Owen  >:
>> Never seen those, but it's probably a difference in pandas, numpy versions. 
>> You can see the current CICD test results in GitHub Actions. But, you want 
>> to use release versions, not an RC. 3.2.1 is not the latest version, and 
>> it's possible the tests were actually failing in the RC.
>> 
>> On Wed, Jan 18, 2023, 10:57 AM Adam Chhina > > wrote:
>>> Bump,
>>> 
>>> Just trying to see where I can find what tests are known failing for a 
>>> particular release, to ensure I’m building upstream correctly following the 
>>> build docs. I figured this would be the best place to ask as it pertains to 
>>> building and testing upstream (also more than happy to provide a PR for any 
>>> docs if required afterwards), however if there would be a more appropriate 
>>> place, please let me know.
>>> 
>>> Best,
>>> 
>>> Adam Chhina
>>> 
>>> > On Dec 27, 2022, at 11:37 AM, Adam Chhina >> > > wrote:
>>> > 
>>> > As part of an upgrade I was looking to run upstream PySpark unit tests on 
>>> > `v3.2.1-rc2` before I applied some downstream patches and tested those. 
>>> > However, I'm running into some issues with failing unit tests, which I'm 
>>> > not sure are failing upstream or due to some step I missed in the build.
>>> > 
>>> > The current failing tests (at least so far, since I believe the python 
>>> > script exits on test failure):
>>> > ```
>>> > ==
>>> > FAIL: test_train_prediction 
>>> > (pyspark.mllib.tests.test_streaming_algorithms.StreamingLinearRegressionWithTests)
>>> > Test that error on test data improves as model is trained.
>>> > --
>>> > Traceback (most recent call last):
>>> >   File 
>>> > "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>>> >  line 474, in test_train_prediction
>>> > eventually(condition, timeout=180.0)
>>> >   File "/Users/adam/OSS/spark/python/pyspark/testing/utils.py", line 86, 
>>> > in eventually
>>> > lastValue = condition()
>>> >   File 
>>> > "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>>> >  line 469, in condition
>>> > self.assertGreater(errors[1] - errors[-1], 2)
>>> > AssertionError: 1.8960983527735014 not greater than 2
>>> > 
>>> > ==
>>> > FAIL: test_parameter_accuracy 
>>> > (pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests)
>>> > Test that the final value of weights is close to the desired value.
>>> > --
>>> > Traceback (most recent call last):
>>> >   File 
>>> > "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>>> >  line 229, in test_parameter_accuracy
>>> > eventually(condition, timeout=60.0, catch_assertions=True)
>>> >   File "/Users/adam/OSS/spark/python/pyspark/testing/utils.py", line 91, 
>>> > in eventually
>>> > raise lastValue
>>> >   File "/Users/adam/OSS/spark/python/pyspark/testing/utils.py", line 82, 
>>> > in eventually
>>> > lastValue = condition()
>>> >   File 
>>> > "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>>> >  line 226, in condition
>>> > self.assertAlmostEqual(rel, 0.1, 1)
>>> > AssertionError: 

Re: Building Spark to run PySpark Tests?

2023-01-18 Thread Bjørn Jørgensen
Replace
> > git clone g...@github.com:apache/spark.git
> > git checkout -b spark-321 v3.2.1

with
git clone --branch branch-3.2 https://github.com/apache/spark.git
This will give you branch-3.2 as of today, which I suppose is what you call upstream

https://github.com/apache/spark/commits/branch-3.2
and right now all tests in GitHub Actions are passing :)


ons. 18. jan. 2023 kl. 18:07 skrev Sean Owen :

> Never seen those, but it's probably a difference in pandas, numpy
> versions. You can see the current CICD test results in GitHub Actions. But,
> you want to use release versions, not an RC. 3.2.1 is not the latest
> version, and it's possible the tests were actually failing in the RC.
>
> On Wed, Jan 18, 2023, 10:57 AM Adam Chhina  wrote:
>
>> Bump,
>>
>> Just trying to see where I can find what tests are known failing for a
>> particular release, to ensure I’m building upstream correctly following the
>> build docs. I figured this would be the best place to ask as it pertains to
>> building and testing upstream (also more than happy to provide a PR for any
>> docs if required afterwards), however if there would be a more appropriate
>> place, please let me know.
>>
>> Best,
>>
>> Adam Chhina
>>
>> > On Dec 27, 2022, at 11:37 AM, Adam Chhina 
>> wrote:
>> >
>> > As part of an upgrade I was looking to run upstream PySpark unit tests
>> on `v3.2.1-rc2` before I applied some downstream patches and tested those.
>> However, I'm running into some issues with failing unit tests, which I'm
>> not sure are failing upstream or due to some step I missed in the build.
>> >
>> > The current failing tests (at least so far, since I believe the python
>> script exits on test failure):
>> > ```
>> > ==
>> > FAIL: test_train_prediction
>> (pyspark.mllib.tests.test_streaming_algorithms.StreamingLinearRegressionWithTests)
>> > Test that error on test data improves as model is trained.
>> > --
>> > Traceback (most recent call last):
>> >   File
>> "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>> line 474, in test_train_prediction
>> > eventually(condition, timeout=180.0)
>> >   File "/Users/adam/OSS/spark/python/pyspark/testing/utils.py", line
>> 86, in eventually
>> > lastValue = condition()
>> >   File
>> "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>> line 469, in condition
>> > self.assertGreater(errors[1] - errors[-1], 2)
>> > AssertionError: 1.8960983527735014 not greater than 2
>> >
>> > ==
>> > FAIL: test_parameter_accuracy
>> (pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests)
>> > Test that the final value of weights is close to the desired value.
>> > --
>> > Traceback (most recent call last):
>> >   File
>> "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>> line 229, in test_parameter_accuracy
>> > eventually(condition, timeout=60.0, catch_assertions=True)
>> >   File "/Users/adam/OSS/spark/python/pyspark/testing/utils.py", line
>> 91, in eventually
>> > raise lastValue
>> >   File "/Users/adam/OSS/spark/python/pyspark/testing/utils.py", line
>> 82, in eventually
>> > lastValue = condition()
>> >   File
>> "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>> line 226, in condition
>> > self.assertAlmostEqual(rel, 0.1, 1)
>> > AssertionError: 0.23052813480829393 != 0.1 within 1 places
>> (0.13052813480829392 difference)
>> >
>> > ==
>> > FAIL: test_training_and_prediction
>> (pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests)
>> > Test that the model improves on toy data with no. of batches
>> > --
>> > Traceback (most recent call last):
>> >   File
>> "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>> line 334, in test_training_and_prediction
>> > eventually(condition, timeout=180.0)
>> >   File "/Users/adam/OSS/spark/python/pyspark/testing/utils.py", line
>> 93, in eventually
>> > raise AssertionError(
>> > AssertionError: Test failed due to timeout after 180 sec, with last
>> condition returning: Latest errors: 0.67, 0.71, 0.78, 0.7, 0.75, 0.74,
>> 0.73, 0.69, 0.62, 0.71, 0.69, 0.75, 0.72, 0.77, 0.71, 0.74, 0.76, 0.78,
>> 0.7, 0.78, 0.8, 0.74, 0.77, 0.75, 0.76, 0.76, 0.75, 0.78, 0.74, 0.64, 0.64,
>> 0.71, 0.78, 0.76, 0.64, 0.68, 0.69, 0.72, 0.77
>> >
>> > --
>> > Ran 13 tests in 661.536s
>> >
>> > FAILED (failures=3, skipped=1)
>> >
>> > Had test failures in 

Re: Building Spark to run PySpark Tests?

2023-01-18 Thread Sean Owen
Never seen those, but it's probably a difference in pandas, numpy versions.
You can see the current CICD test results in GitHub Actions. But, you want
to use release versions, not an RC. 3.2.1 is not the latest version, and
it's possible the tests were actually failing in the RC.
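
A one-liner to capture the interpreter and the pandas/numpy versions being
compared against CI (assuming both packages are installed for that interpreter):

```
# Sketch: print the interpreter version and the pandas/numpy versions it sees
/usr/local/bin/python3 -c "import sys, pandas, numpy; print(sys.version); print(pandas.__version__, numpy.__version__)"
```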

On Wed, Jan 18, 2023, 10:57 AM Adam Chhina  wrote:

> Bump,
>
> Just trying to see where I can find what tests are known failing for a
> particular release, to ensure I’m building upstream correctly following the
> build docs. I figured this would be the best place to ask as it pertains to
> building and testing upstream (also more than happy to provide a PR for any
> docs if required afterwards), however if there would be a more appropriate
> place, please let me know.
>
> Best,
>
> Adam Chhina
>
> > On Dec 27, 2022, at 11:37 AM, Adam Chhina  wrote:
> >
> > As part of an upgrade I was looking to run upstream PySpark unit tests
> on `v3.2.1-rc2` before I applied some downstream patches and tested those.
> However, I'm running into some issues with failing unit tests, which I'm
> not sure are failing upstream or due to some step I missed in the build.
> >
> > The current failing tests (at least so far, since I believe the python
> script exits on test failure):
> > ```
> > ==
> > FAIL: test_train_prediction
> (pyspark.mllib.tests.test_streaming_algorithms.StreamingLinearRegressionWithTests)
> > Test that error on test data improves as model is trained.
> > --
> > Traceback (most recent call last):
> >   File
> "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
> line 474, in test_train_prediction
> > eventually(condition, timeout=180.0)
> >   File "/Users/adam/OSS/spark/python/pyspark/testing/utils.py", line 86,
> in eventually
> > lastValue = condition()
> >   File
> "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
> line 469, in condition
> > self.assertGreater(errors[1] - errors[-1], 2)
> > AssertionError: 1.8960983527735014 not greater than 2
> >
> > ==
> > FAIL: test_parameter_accuracy
> (pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests)
> > Test that the final value of weights is close to the desired value.
> > --
> > Traceback (most recent call last):
> >   File
> "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
> line 229, in test_parameter_accuracy
> > eventually(condition, timeout=60.0, catch_assertions=True)
> >   File "/Users/adam/OSS/spark/python/pyspark/testing/utils.py", line 91,
> in eventually
> > raise lastValue
> >   File "/Users/adam/OSS/spark/python/pyspark/testing/utils.py", line 82,
> in eventually
> > lastValue = condition()
> >   File
> "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
> line 226, in condition
> > self.assertAlmostEqual(rel, 0.1, 1)
> > AssertionError: 0.23052813480829393 != 0.1 within 1 places
> (0.13052813480829392 difference)
> >
> > ==
> > FAIL: test_training_and_prediction
> (pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests)
> > Test that the model improves on toy data with no. of batches
> > --
> > Traceback (most recent call last):
> >   File
> "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
> line 334, in test_training_and_prediction
> > eventually(condition, timeout=180.0)
> >   File "/Users/adam/OSS/spark/python/pyspark/testing/utils.py", line 93,
> in eventually
> > raise AssertionError(
> > AssertionError: Test failed due to timeout after 180 sec, with last
> condition returning: Latest errors: 0.67, 0.71, 0.78, 0.7, 0.75, 0.74,
> 0.73, 0.69, 0.62, 0.71, 0.69, 0.75, 0.72, 0.77, 0.71, 0.74, 0.76, 0.78,
> 0.7, 0.78, 0.8, 0.74, 0.77, 0.75, 0.76, 0.76, 0.75, 0.78, 0.74, 0.64, 0.64,
> 0.71, 0.78, 0.76, 0.64, 0.68, 0.69, 0.72, 0.77
> >
> > --
> > Ran 13 tests in 661.536s
> >
> > FAILED (failures=3, skipped=1)
> >
> > Had test failures in pyspark.mllib.tests.test_streaming_algorithms with
> /usr/local/bin/python3; see logs.
> > ```
> >
> > Here's how I'm currently building Spark; I was using the
> [building-spark](https://spark.apache.org/docs/3..1/building-spark.html)
> docs as a reference.
> > ```
> > > git clone g...@github.com:apache/spark.git
> > > git checkout -b spark-321 v3.2.1
> > > ./build/mvn -DskipTests clean package -Phive
> > > export JAVA_HOME=$(path/to/jdk/11)
> > > ./python/run-tests
> > ```
> >
> > Current Java version
> > ```
> > java 

Re: Building Spark to run PySpark Tests?

2023-01-18 Thread Adam Chhina
Bump,

Just trying to see where I can find what tests are known failing for a 
particular release, to ensure I’m building upstream correctly following the 
build docs. I figured this would be the best place to ask as it pertains to 
building and testing upstream (also more than happy to provide a PR for any 
docs if required afterwards), however if there would be a more appropriate 
place, please let me know.

Best,

Adam Chhina

> On Dec 27, 2022, at 11:37 AM, Adam Chhina  wrote:
> 
> As part of an upgrade I was looking to run upstream PySpark unit tests on 
> `v3.2.1-rc2` before I applied some downstream patches and tested those. 
> However, I'm running into some issues with failing unit tests, which I'm not 
> sure are failing upstream or due to some step I missed in the build.
> 
> The current failing tests (at least so far, since I believe the python script 
> exits on test failure):
> ```
> ==
> FAIL: test_train_prediction 
> (pyspark.mllib.tests.test_streaming_algorithms.StreamingLinearRegressionWithTests)
> Test that error on test data improves as model is trained.
> --
> Traceback (most recent call last):
>   File 
> "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>  line 474, in test_train_prediction
> eventually(condition, timeout=180.0)
>   File "/Users/adam/OSS/spark/python/pyspark/testing/utils.py", line 86, in 
> eventually
> lastValue = condition()
>   File 
> "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>  line 469, in condition
> self.assertGreater(errors[1] - errors[-1], 2)
> AssertionError: 1.8960983527735014 not greater than 2
> 
> ==
> FAIL: test_parameter_accuracy 
> (pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests)
> Test that the final value of weights is close to the desired value.
> --
> Traceback (most recent call last):
>   File 
> "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>  line 229, in test_parameter_accuracy
> eventually(condition, timeout=60.0, catch_assertions=True)
>   File "/Users/adam/OSS/spark/python/pyspark/testing/utils.py", line 91, in 
> eventually
> raise lastValue
>   File "/Users/adam/OSS/spark/python/pyspark/testing/utils.py", line 82, in 
> eventually
> lastValue = condition()
>   File 
> "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>  line 226, in condition
> self.assertAlmostEqual(rel, 0.1, 1)
> AssertionError: 0.23052813480829393 != 0.1 within 1 places 
> (0.13052813480829392 difference)
> 
> ==
> FAIL: test_training_and_prediction 
> (pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests)
> Test that the model improves on toy data with no. of batches
> --
> Traceback (most recent call last):
>   File 
> "/Users/adam/OSS/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>  line 334, in test_training_and_prediction
> eventually(condition, timeout=180.0)
>   File "/Users/adam/OSS/spark/python/pyspark/testing/utils.py", line 93, in 
> eventually
> raise AssertionError(
> AssertionError: Test failed due to timeout after 180 sec, with last condition 
> returning: Latest errors: 0.67, 0.71, 0.78, 0.7, 0.75, 0.74, 0.73, 0.69, 
> 0.62, 0.71, 0.69, 0.75, 0.72, 0.77, 0.71, 0.74, 0.76, 0.78, 0.7, 0.78, 0.8, 
> 0.74, 0.77, 0.75, 0.76, 0.76, 0.75, 0.78, 0.74, 0.64, 0.64, 0.71, 0.78, 0.76, 
> 0.64, 0.68, 0.69, 0.72, 0.77
> 
> --
> Ran 13 tests in 661.536s
> 
> FAILED (failures=3, skipped=1)
> 
> Had test failures in pyspark.mllib.tests.test_streaming_algorithms with 
> /usr/local/bin/python3; see logs.
> ```
> 
> Here's how I'm currently building Spark; I was using the 
> [building-spark](https://spark.apache.org/docs/3..1/building-spark.html) docs 
> as a reference.
> ```
> > git clone g...@github.com:apache/spark.git
> > git checkout -b spark-321 v3.2.1
> > ./build/mvn -DskipTests clean package -Phive
> > export JAVA_HOME=$(path/to/jdk/11)
> > ./python/run-tests
> ```
> 
> Current Java version
> ```
> java -version
> openjdk version "11.0.17" 2022-10-18
> OpenJDK Runtime Environment Homebrew (build 11.0.17+0)
> OpenJDK 64-Bit Server VM Homebrew (build 11.0.17+0, mixed mode)
> ```
> 
> Alternatively, I've also tried simply building Spark and using a python=3.9 
> venv and installing the requirements from `pip install -r 
> dev/requirements.txt` and using that as the interpreter to run tests. 
> However, I was running into some failing pandas 

Re: Building spark master failed

2016-05-23 Thread Ovidiu-Cristian MARCU
You’re right, I thought the latest would only compile against Java 8.
Thanks
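
For completeness, a quick way to confirm a checkout really is at the current
master head before rebuilding (plain git; the remote is assumed to be named origin):

```
git fetch origin
git rev-parse HEAD origin/master   # the two hashes should match if you are on the latest master
```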
 
> On 23 May 2016, at 11:35, Dongjoon Hyun  wrote:
> 
> Hi, 
> 
> That is not the latest. 
> 
> The bug was fixed 5 days ago.
> 
> Regards,
> Dongjoon.
> 
> 
> On Mon, May 23, 2016 at 2:16 AM, Ovidiu-Cristian MARCU 
> > 
> wrote:
> Hi
> 
> I have the following issue when trying to build the latest spark source code 
> on master:
> 
> /spark/common/network-common/src/main/java/org/apache/spark/network/util/JavaUtils.java:147:
>  error: cannot find symbol
> [error]   if (process != null && process.isAlive()) {
> [error] ^
> [error]   symbol:   method isAlive()
> [error]   location: variable process of type Process
> [error] 1 error
> [error] Compile failed at May 23, 2016 11:13:58 AM [1.319s]
> 
> related to [INFO] Spark Project Networking ... 
> FAILURE [  1.495 s]
> 
> Am I missing some fix?
> 
> Thanks
> 
> Best,
> Ovidiu
> 



Re: Building spark master failed

2016-05-23 Thread Dongjoon Hyun
Hi,

That is not the latest.

The bug was fixed 5 days ago.

Regards,
Dongjoon.


On Mon, May 23, 2016 at 2:16 AM, Ovidiu-Cristian MARCU <
ovidiu-cristian.ma...@inria.fr> wrote:

> Hi
>
> I have the following issue when trying to build the latest spark source
> code on master:
>
> /spark/common/network-common/src/main/java/org/apache/spark/network/util/JavaUtils.java:147:
> error: cannot find symbol
> [error]   if (process != null && process.isAlive()) {
> [error] ^
> [error]   symbol:   method isAlive()
> [error]   location: variable process of type Process
> [error] 1 error
> [error] Compile failed at May 23, 2016 11:13:58 AM [1.319s]
>
> related to [INFO] Spark Project Networking ...
> FAILURE [  1.495 s]
>
> Am I missing some fix?
>
> Thanks
>
> Best,
> Ovidiu
>


Re: Building Spark with a Custom Version of Hadoop: HDFS ClassNotFoundException

2016-02-11 Thread Ted Yu
Hdfs class is in hadoop-hdfs-XX.jar

Can you check the classpath to see if the above jar is there ?

Please describe the command lines you used for building hadoop / Spark.

Cheers
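
A rough way to check whether a hadoop-hdfs jar made it into the Spark build
being launched (the path below is a placeholder for your Spark checkout or
distribution):

```
# Sketch: look for the hadoop-hdfs jar anywhere under the Spark build
find /path/to/spark -name 'hadoop-hdfs-*.jar'
```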

On Thu, Feb 11, 2016 at 5:15 PM, Charlie Wright 
wrote:

> I am having issues trying to run a test job on a built version of Spark
> with a custom Hadoop JAR.
> My custom hadoop version runs without issues and I can run jobs from a
> precompiled version of Spark (with Hadoop) no problem.
>
> However, whenever I try to run the same Spark example on the Spark version
> with my custom hadoop JAR - I get this error:
> "Exception in thread "main" java.lang.RuntimeException:
> java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.Hdfs not found"
>
> Does anybody know why this is happening?
>
> Thanks,
> Charles.
>
>


Re: Building Spark with Custom Hadoop Version

2016-02-05 Thread Steve Loughran

> On 4 Feb 2016, at 23:11, Ted Yu  wrote:
> 
> Assuming your change is based on hadoop-2 branch, you can use 'mvn install' 
> command which would put artifacts under 2.8.0-SNAPSHOT subdir in your local 
> maven repo.
> 
> Here is an example:
> ~/.m2/repository/org/apache/hadoop/hadoop-hdfs/2.8.0-SNAPSHOT
> 
> Then you can use the following command to build Spark:
> 
> -Pyarn -Phadoop-2.4 -Dhadoop.version=2.8.0-SNAPSHOT
> 

Better to choose the hadoop-2.6 profile, e.g.

mvn test -Pyarn,hadoop-2.6 -Dhadoop.version=2.7.1  -pl yarn -Dtest=m  
-DwildcardSuites=org.apache.spark.deploy.yarn.YarnClusterSuite

(the -Dtest= assignment skips all java tests)

if you are playing with -SNAPSHOT sources

(a) rebuild them every morning
(b) never do a test run that spans midnight
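
Putting those pieces together, a sketch of a full Spark package build against a
locally installed Hadoop snapshot (profile and version are only an example):

```
# Sketch: after `mvn install -DskipTests` in the Hadoop tree, build Spark against that snapshot
./build/mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.8.0-SNAPSHOT -DskipTests clean package
```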

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Building Spark with Custom Hadoop Version

2016-02-05 Thread Steve Loughran

> On 4 Feb 2016, at 23:11, Ted Yu  wrote:
> 
> Assuming your change is based on hadoop-2 branch, you can use 'mvn install' 
> command which would put artifacts under 2.8.0-SNAPSHOT subdir in your local 
> maven repo.
> 


+ generally, unless you want to run all the hadoop tests, set the  -DskipTests 
on the mvn commands. The HDFS ones take a while and can use up all your file 
handles.

mvn install -DskipTests

here's the aliases I use


export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m 
-Xms256m -Djava.awt.headless=true"
alias mi="mvn install -DskipTests"
alias mci="mvn clean install -DskipTests"
alias mvt="mvn test"
alias mvct="mvn clean test"
alias mvp="mvn package -DskipTests"
alias mvcp="mvn clean package -DskipTests"
alias mvnsite="mvn site:site -Dmaven.javadoc.skip=true -DskipTests"
alias mvndep="mvn dependency:tree -Dverbose"


mvndep > target/dependencies.txt is my command of choice to start working out 
where some random dependency is coming in from
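
If it helps, dependency:tree can also be narrowed to a single group or artifact;
a sketch (the filter pattern is just an example):

```
# Sketch: trace where anything under the javax.servlet group is pulled in from
mvn dependency:tree -Dverbose -Dincludes=javax.servlet > target/dependencies.txt
```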

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Building Spark with Custom Hadoop Version

2016-02-04 Thread Ted Yu
Assuming your change is based on hadoop-2 branch, you can use 'mvn install'
command which would put artifacts under 2.8.0-SNAPSHOT subdir in your local
maven repo.

Here is an example:
~/.m2/repository/org/apache/hadoop/hadoop-hdfs/2.8.0-SNAPSHOT

Then you can use the following command to build Spark:

-Pyarn -Phadoop-2.4 -Dhadoop.version=2.8.0-SNAPSHOT

FYI
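
A compact sketch of that Hadoop-side step, run from the Hadoop source tree (the
path is hypothetical):

```
cd /path/to/hadoop                # hypothetical checkout of the modified Hadoop branch
mvn install -DskipTests           # publishes the 2.8.0-SNAPSHOT artifacts to the local repo
ls ~/.m2/repository/org/apache/hadoop/hadoop-hdfs/2.8.0-SNAPSHOT   # confirm they landed
```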

On Thu, Feb 4, 2016 at 3:03 PM, Charles Wright 
wrote:

> Hello,
>
> I have made some modifications to the YARN source code that I want to test
> with Spark, how do I do this? I know that I need to include my custom
> hadoop jar as a dependency but I don't know how to do this as I am not very
> familiar with maven.
>
> Any help is appreciated.
>
> Thanks,
> Charles.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: Building Spark

2015-10-16 Thread Annabel Melongo
Can someone please provide insight into why I get an access-denied error when I do the 
build according to the documentation?
Ted said I have to provide the credentials, but there's nothing mentioned about 
that in the build documentation. 


On Thursday, October 15, 2015 8:39 PM, Annabel Melongo 
<melongo_anna...@yahoo.com> wrote:

Ted,
How do I check the permission? I just ran the command as prescribed.


Sent from my Verizon Wireless 4G LTE smartphone
Original message
From: Ted Yu yuzhih...@gmail.com
Date: 10/15/2015 18:46 (GMT-05:00)
To: Annabel Melongo melongo_anna...@yahoo.com
Cc: dev@spark.apache.org
Subject: Re: Building Spark
bq. Access is denied
Please check permission of the path mentioned.
On Thu, Oct 15, 2015 at 3:45 PM, Annabel Melongo 
<melongo_anna...@yahoo.com.invalid> wrote:

I was trying to build a cloned version of Spark on my local machine using the command:
    mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package
However I got the error:
    [ERROR] Failed to execute goal org.apache.maven.plugins:maven-shade-plugin:2.4.1:shade 
    (default) on project spark-network-common_2.10: Error creating shaded jar: 
    C:\Users\Annabel\git\spark\network\common\dependency-reduced-pom.xml (Access is denied) -> [Help 1]
Any idea? I'm running a 64-bit Windows 8 machine.
Thanks




Re: Building Spark

2015-10-16 Thread Jean-Baptiste Onofré

Hi Annabel,

with the user that you use to launch mvn, do you have write permission in 
the C:\Users\Annabel\git\spark\network\common folder?


Regards
JB

On 10/16/2015 05:19 PM, Annabel Melongo wrote:

Can someone please provide insight into why I get an access-denied error when I do
the build according to the documentation?

Ted said I have to provide the credentials, but there's nothing mentioned
about that in the build documentation.



On Thursday, October 15, 2015 8:39 PM, Annabel Melongo
<melongo_anna...@yahoo.com> wrote:


Ted,

How do I check the permission? I just ran the command as prescribed.



Sent from my Verizon Wireless 4G LTE smartphone

Original message
From: Ted Yu yuzhih...@gmail.com
Date: 10/15/2015 18:46 (GMT-05:00)
To: Annabel Melongo melongo_anna...@yahoo.com
Cc: dev@spark.apache.org
Subject: Re: Building Spark
bq. Access is denied

Please check permission of the path mentioned.

On Thu, Oct 15, 2015 at 3:45 PM, Annabel Melongo
<melongo_anna...@yahoo.com.invalid
<mailto:melongo_anna...@yahoo.com.invalid>> wrote:

I was trying to build a cloned version of Spark on my local machine
using the command:
mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package
However I got the error:
[ERROR] Failed to execute goal
org.apache.maven.plugins:maven-shade-plugin:2.4.1
:shade (default) on project spark-network-common_2.10: Error
creating shaded jar :
C:\Users\Annabel\git\spark\network\common\dependency-reduced-pom.xml
(Access is denied) -> [Help 1]

Any idea, I'm running a 64-bit Windows 8 machine

Thanks









--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Building Spark

2015-10-15 Thread Ted Yu
bq. Access is denied

Please check permission of the path mentioned.

On Thu, Oct 15, 2015 at 3:45 PM, Annabel Melongo <
melongo_anna...@yahoo.com.invalid> wrote:

> I was trying to build a cloned version of Spark on my local machine using
> the command:
> mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean
> package
> However I got the error:
> [ERROR] Failed to execute goal
> org.apache.maven.plugins:maven-shade-plugin:2.4.1:shade (default) on project
> spark-network-common_2.10: Error creating shaded jar:
> C:\Users\Annabel\git\spark\network\common\dependency-reduced-pom.xml
> (Access is denied) -> [Help 1]
>
> Any idea, I'm running a 64-bit Windows 8 machine
>
> Thanks
>
>


Re: Building spark 1.2 from source requires more dependencies

2015-03-30 Thread yash datta
Hi all,


When selecting large data in Spark SQL (a `SELECT *` query), I see a buffer
overflow exception from Kryo:


15/03/27 10:32:19 WARN scheduler.TaskSetManager: Lost task 6.0 in stage 3.0
(TID 30, machine159): com.esotericsoftware.kryo.KryoException: Buffer
overflow. Available: 1, required: 2
Serialization trace:
values (org.apache.spark.sql.catalyst.expressions.GenericRow)
at com.esotericsoftware.kryo.io.Output.require(Output.java:138)
at com.esotericsoftware.kryo.io.Output.writeInt(Output.java:247)
at
com.esotericsoftware.kryo.serializers.DefaultSerializers$IntSerializer.write(DefaultSerializers.java:95)
at
com.esotericsoftware.kryo.serializers.DefaultSerializers$IntSerializer.write(DefaultSerializers.java:89)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
at
com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:318)
at
com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:293)
at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501)
at
com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:564)
at
com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
at
com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:318)
at
com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:293)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
at
org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:167)
at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:210)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown
Source)
at java.lang.Thread.run(Unknown Source)



I thought maybe increasing these would resolve the problem, but the same
exception is seen:

set spark.kryoserializer.buffer.mb=4;
set spark.kryoserializer.buffer.max.mb=1024;
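
If the Kryo buffer settings are read from the application's SparkConf when the
serializer is created, a `set` issued in the SQL shell may arrive too late; a
hedged sketch of passing them at submit time instead (Spark 1.x-era property
names; the application file is a placeholder):

```
# Sketch: pass the Kryo buffer sizes at launch time (Spark 1.x-era property names)
spark-submit \
  --conf spark.kryoserializer.buffer.mb=4 \
  --conf spark.kryoserializer.buffer.max.mb=1024 \
  your_app.py   # hypothetical application; the same --conf flags work for spark-sql and spark-shell
```

In general `spark.kryoserializer.buffer.max.mb` has to be at least as large as the
single largest object that gets serialized.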


I have a Parquet table with 5 Int columns and 100 million rows.

Can somebody explain why this exception is seen; am I missing some
configuration?

Thanks
Yash


On Mon, Mar 30, 2015 at 3:05 AM, Sean Owen so...@cloudera.com wrote:

 Given that it's an internal error from scalac, I think it may be
 something to take up with the Scala folks to really fix. We can just
 look for workarounds. Try blowing away your .m2 and .ivy cache for
 example. FWIW I was running on Linux with Java 8u31, latest scala 2.11
 AFAIK.

 On Sun, Mar 29, 2015 at 10:29 PM, Pala M Muthaia
 mchett...@rocketfuelinc.com wrote:
  Sean,
 
  I did a mvn clean and then a build; it produces the same error. I also did a
  fresh git clone of Spark and invoked the same build command, and it resulted
  in an identical error (I also had a colleague do the same thing, in case there
  was some machine-specific issue, and they saw the same error). Unless I
  misunderstood something, it doesn't look like a clean build fixes this.
 
  On Fri, Mar 27, 2015 at 10:20 PM, Sean Owen so...@cloudera.com wrote:
 
  This is not a compile error, but an error from the scalac compiler.
  That is, the code and build are fine, but scalac is not compiling it.
  Usually when this happens, a clean build fixes it.
 
  On Fri, Mar 27, 2015 at 7:09 PM, Pala M Muthaia
  mchett...@rocketfuelinc.com wrote:
   No, I am running from the root directory, the parent of core.
  
   Here is the first set of errors that I see when I compile from source
   (sorry the error message is very long, but I am adding it in case it helps
   in diagnosis). After I manually add a javax.servlet dependency for version
   3.0, this set of errors goes away and I get the next set of errors about
   missing classes under eclipse-jetty.
  
   I am on Maven 3.2.5 and Java 1.7.
  
   Error:
  
   [INFO] --- scala-maven-plugin:3.2.0:compile (scala-compile-first) @ spark-core_2.10 ---
   [WARNING] Zinc server is not available at port 3030 - reverting to normal incremental compile
   [INFO] Using incremental compilation
   [INFO] compiler plugin: BasicArtifact(org.scalamacros,paradise_2.10.4,2.0.1,null)
   [INFO] Compiling 403 Scala sources and 33 Java sources to /Users/mchettiar/code/spark/core/target/scala-2.10/classes...
   [WARNING] Class javax.servlet.ServletException not found - continuing with a stub.
   [ERROR]
        while compiling: /Users/mchettiar/code/spark/core/src/main/scala/org/apache/spark/HttpServer.scala
           during phase: typer
        library version: version 2.10.4
       compiler version: version 2.10.4
     reconstructed args: -deprecation 

Re: Building spark 1.2 from source requires more dependencies

2015-03-29 Thread Sean Owen
Given that it's an internal error from scalac, I think it may be
something to take up with the Scala folks to really fix. We can just
look for workarounds. Try blowing away your .m2 and .ivy caches, for
example. FWIW, I was running on Linux with Java 8u31 and the latest
Scala 2.11, AFAIK.

On Sun, Mar 29, 2015 at 10:29 PM, Pala M Muthaia
mchett...@rocketfuelinc.com wrote:
 Sean,

 I did a mvn clean and then a build; it produces the same error. I also did a
 fresh git clone of Spark and invoked the same build command, and it resulted
 in an identical error (I also had a colleague do the same thing, in case there
 was some machine-specific issue, and they saw the same error). Unless I
 misunderstood something, it doesn't look like a clean build fixes this.

 On Fri, Mar 27, 2015 at 10:20 PM, Sean Owen so...@cloudera.com wrote:

 This is not a compile error, but an error from the scalac compiler.
 That is, the code and build are fine, but scalac is not compiling it.
 Usually when this happens, a clean build fixes it.

 On Fri, Mar 27, 2015 at 7:09 PM, Pala M Muthaia
 mchett...@rocketfuelinc.com wrote:
  No, I am running from the root directory, the parent of core.
 
  Here is the first set of errors that I see when I compile from source
  (sorry the error message is very long, but I am adding it in case it helps
  in diagnosis). After I manually add a javax.servlet dependency for version
  3.0, this set of errors goes away and I get the next set of errors about
  missing classes under eclipse-jetty.
 
  I am on Maven 3.2.5 and Java 1.7.
 
  Error:
 
  [INFO] --- scala-maven-plugin:3.2.0:compile (scala-compile-first) @ spark-core_2.10 ---
  [WARNING] Zinc server is not available at port 3030 - reverting to normal incremental compile
  [INFO] Using incremental compilation
  [INFO] compiler plugin: BasicArtifact(org.scalamacros,paradise_2.10.4,2.0.1,null)
  [INFO] Compiling 403 Scala sources and 33 Java sources to /Users/mchettiar/code/spark/core/target/scala-2.10/classes...
  [WARNING] Class javax.servlet.ServletException not found - continuing with a stub.
  [ERROR]
       while compiling: /Users/mchettiar/code/spark/core/src/main/scala/org/apache/spark/HttpServer.scala
          during phase: typer
       library version: version 2.10.4
      compiler version: version 2.10.4
    reconstructed args: -deprecation -feature
  -



-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Building spark 1.2 from source requires more dependencies

2015-03-27 Thread Sean Owen
I built from the head of branch-1.2 and spark-core compiled correctly
with your exact command. You have something wrong with how you are
building. For example, you're not trying to run this from the core
subdirectory, are you?

On Thu, Mar 26, 2015 at 10:36 PM, Pala M Muthaia
mchett...@rocketfuelinc.com wrote:
 Hi,

 We are trying to build spark 1.2 from source (tip of the branch-1.2 at the
 moment). I tried to build spark using the following command:

 mvn -U -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive -Phive-thriftserver
 -DskipTests clean package

 I encountered various missing class definition exceptions (e.g., class
 javax.servlet.ServletException not found).

 I eventually got the build to succeed after adding the following set of
 dependencies to spark-core's pom.xml:

 <dependency>
   <groupId>javax.servlet</groupId>
   <artifactId>servlet-api</artifactId>
   <version>3.0</version>
 </dependency>

 <dependency>
   <groupId>org.eclipse.jetty</groupId>
   <artifactId>jetty-io</artifactId>
 </dependency>

 <dependency>
   <groupId>org.eclipse.jetty</groupId>
   <artifactId>jetty-http</artifactId>
 </dependency>

 <dependency>
   <groupId>org.eclipse.jetty</groupId>
   <artifactId>jetty-servlet</artifactId>
 </dependency>

 Pretty much all of the missing class definition errors came up while
 building HttpServer.scala, and went away after the above dependencies were
 included.

 My guess is that the official build for Spark 1.2 already works. My question
 is: what is wrong with my environment or setup that requires me to add
 dependencies to pom.xml in this manner to get the build to succeed?

 Also, I am not sure whether this build will work at runtime for us; I am
 still testing that.


 Thanks,
 pala

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Building Spark with Pants

2015-02-16 Thread Ryan Williams
I worked on Pants at Foursquare for a while, and when coming up to speed on
Spark I was interested in the possibility of building it with Pants,
particularly because letting developers share and reuse each other's
compilation artifacts seems like it would be a boon to productivity; that
was, and is, Pants' killer feature for Foursquare, as mentioned on the
pants-devel thread.

Given the monumental nature of the task of making Spark build with Pants,
most of my enthusiasm was deflected to SPARK-1517
https://issues.apache.org/jira/browse/SPARK-1517, which deals with
publishing nightly builds (or better, exposing all assembly JARs built by
Jenkins?) that people could use rather than having to assemble their own.

Anyway, it's an intriguing idea, Nicholas, I'm glad you are pursuing it!

On Sat Feb 14 2015 at 4:21:16 AM Nicholas Chammas 
nicholas.cham...@gmail.com wrote:

 FYI: Here is the matching discussion over on the Pants dev list.
 https://groups.google.com/forum/#!topic/pants-devel/rTaU-iIOIFE

 On Mon Feb 02 2015 at 4:50:33 PM Nicholas Chammas
 nicholas.cham...@gmail.com wrote:

 To reiterate, I'm asking from an experimental perspective. I'm not
  proposing we change Spark to build with Pants or anything like that.
 
  I'm interested in trying Pants out and I'm wondering if anyone else
 shares
  my interest or already has experience with Pants that they can share.
 
  On Mon Feb 02 2015 at 4:40:45 PM Nicholas Chammas 
  nicholas.cham...@gmail.com wrote:
 
  I'm asking from an experimental standpoint; this is not happening
 anytime
  soon.
 
  Of course, if the experiment turns out very well, Pants would replace
  both sbt and Maven (like it has at Twitter, for example). Pants also
 works
  with IDEs http://pantsbuild.github.io/index.html#using-pants-with.
 
  On Mon Feb 02 2015 at 4:33:11 PM Stephen Boesch java...@gmail.com
  wrote:
 
  There is a significant investment in sbt and maven - and they are not
 at
  all likely to be going away. A third build tool?  Note that there is
 also
  the perspective of building within an IDE - which actually works
 presently
  for sbt and with a little bit of tweaking with maven as well.
 
  2015-02-02 16:25 GMT-08:00 Nicholas Chammas 
 nicholas.cham...@gmail.com
  :
 
  Does anyone here have experience with Pants
  http://pantsbuild.github.io/index.html or interest in trying to build
  Spark with it?
 
  Pants has an interesting story. It was born at Twitter to help them
  build
  their Scala, Java, and Python projects as several independent
  components in
  one monolithic repo. (It was inspired by a similar build tool at
 Google
  called blaze.) The mix of languages and sub-projects at Twitter seems
  similar to the breakdown we have in Spark.
 
  Pants has an interesting take on how a build system should work, and
  Twitter and Foursquare (who use Pants as their primary build tool)
  claim it
  helps enforce better build hygiene and maintainability.
 
  Some relevant talks:
 
 - Building Scala Hygienically with Pants
 https://www.youtube.com/watch?v=ukqke8iTuH0
 - The Pants Build Tool at Twitter
 https://engineering.twitter.com/university/videos/the-pants-build-tool-at-twitter
 - Getting Started with the Pants Build System: Why Pants?
 https://engineering.twitter.com/university/videos/getting-started-with-the-pants-build-system-why-pants
 
 
 
  At some point I may take a shot at converting Spark to use Pants as an
  experiment and just see what it’s like.
 
  Nick
  ​
 
  ​



Re: Building Spark with Pants

2015-02-14 Thread Nicholas Chammas
FYI: Here is the matching discussion over on the Pants dev list.
https://groups.google.com/forum/#!topic/pants-devel/rTaU-iIOIFE

On Mon Feb 02 2015 at 4:50:33 PM Nicholas Chammas nicholas.cham...@gmail.com wrote:

To reiterate, I'm asking from an experimental perspective. I'm not
 proposing we change Spark to build with Pants or anything like that.

 I'm interested in trying Pants out and I'm wondering if anyone else shares
 my interest or already has experience with Pants that they can share.

 On Mon Feb 02 2015 at 4:40:45 PM Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 I'm asking from an experimental standpoint; this is not happening anytime
 soon.

 Of course, if the experiment turns out very well, Pants would replace
 both sbt and Maven (like it has at Twitter, for example). Pants also works
 with IDEs http://pantsbuild.github.io/index.html#using-pants-with.

 On Mon Feb 02 2015 at 4:33:11 PM Stephen Boesch java...@gmail.com
 wrote:

 There is a significant investment in sbt and maven - and they are not at
 all likely to be going away. A third build tool?  Note that there is also
 the perspective of building within an IDE - which actually works presently
 for sbt and with a little bit of tweaking with maven as well.

 2015-02-02 16:25 GMT-08:00 Nicholas Chammas nicholas.cham...@gmail.com
 :

 Does anyone here have experience with Pants
 http://pantsbuild.github.io/index.html or interest in trying to build
 Spark with it?

 Pants has an interesting story. It was born at Twitter to help them
 build
 their Scala, Java, and Python projects as several independent
 components in
 one monolithic repo. (It was inspired by a similar build tool at Google
 called blaze.) The mix of languages and sub-projects at Twitter seems
 similar to the breakdown we have in Spark.

 Pants has an interesting take on how a build system should work, and
 Twitter and Foursquare (who use Pants as their primary build tool)
 claim it
 helps enforce better build hygiene and maintainability.

 Some relevant talks:

- Building Scala Hygienically with Pants
https://www.youtube.com/watch?v=ukqke8iTuH0
- The Pants Build Tool at Twitter
 https://engineering.twitter.com/university/videos/the-pants-build-tool-at-twitter
- Getting Started with the Pants Build System: Why Pants?
 https://engineering.twitter.com/university/videos/getting-started-with-the-pants-build-system-why-pants



 At some point I may take a shot at converting Spark to use Pants as an
 experiment and just see what it’s like.

 Nick
 ​

 ​


Re: Building Spark with Pants

2015-02-02 Thread Nicholas Chammas
I'm asking from an experimental standpoint; this is not happening anytime
soon.

Of course, if the experiment turns out very well, Pants would replace both
sbt and Maven (like it has at Twitter, for example). Pants also works with
IDEs http://pantsbuild.github.io/index.html#using-pants-with.

On Mon Feb 02 2015 at 4:33:11 PM Stephen Boesch java...@gmail.com wrote:

 There is a significant investment in sbt and maven - and they are not at
 all likely to be going away. A third build tool?  Note that there is also
 the perspective of building within an IDE - which actually works presently
 for sbt and with a little bit of tweaking with maven as well.

 2015-02-02 16:25 GMT-08:00 Nicholas Chammas nicholas.cham...@gmail.com:

 Does anyone here have experience with Pants
 http://pantsbuild.github.io/index.html or interest in trying to build
 Spark with it?

 Pants has an interesting story. It was born at Twitter to help them build
 their Scala, Java, and Python projects as several independent components
 in
 one monolithic repo. (It was inspired by a similar build tool at Google
 called blaze.) The mix of languages and sub-projects at Twitter seems
 similar to the breakdown we have in Spark.

 Pants has an interesting take on how a build system should work, and
 Twitter and Foursquare (who use Pants as their primary build tool) claim
 it
 helps enforce better build hygiene and maintainability.

 Some relevant talks:

- Building Scala Hygienically with Pants
https://www.youtube.com/watch?v=ukqke8iTuH0
- The Pants Build Tool at Twitter

 https://engineering.twitter.com/university/videos/the-pants-build-tool-at-twitter
 
- Getting Started with the Pants Build System: Why Pants?

 https://engineering.twitter.com/university/videos/getting-started-with-the-pants-build-system-why-pants
 



 At some point I may take a shot at converting Spark to use Pants as an
 experiment and just see what it’s like.

 Nick
 ​




Re: Building Spark with Pants

2015-02-02 Thread Stephen Boesch
There is a significant investment in sbt and maven - and they are not at
all likely to be going away. A third build tool?  Note that there is also
the perspective of building within an IDE - which actually works presently
for sbt and with a little bit of tweaking with maven as well.

2015-02-02 16:25 GMT-08:00 Nicholas Chammas nicholas.cham...@gmail.com:

 Does anyone here have experience with Pants
 http://pantsbuild.github.io/index.html or interest in trying to build
 Spark with it?

 Pants has an interesting story. It was born at Twitter to help them build
 their Scala, Java, and Python projects as several independent components in
 one monolithic repo. (It was inspired by a similar build tool at Google
 called blaze.) The mix of languages and sub-projects at Twitter seems
 similar to the breakdown we have in Spark.

 Pants has an interesting take on how a build system should work, and
 Twitter and Foursquare (who use Pants as their primary build tool) claim it
 helps enforce better build hygiene and maintainability.

 Some relevant talks:

- Building Scala Hygienically with Pants
https://www.youtube.com/watch?v=ukqke8iTuH0
- The Pants Build Tool at Twitter

 https://engineering.twitter.com/university/videos/the-pants-build-tool-at-twitter
 
- Getting Started with the Pants Build System: Why Pants?

 https://engineering.twitter.com/university/videos/getting-started-with-the-pants-build-system-why-pants
 

 At some point I may take a shot at converting Spark to use Pants as an
 experiment and just see what it’s like.

 Nick
 ​



Re: Building Spark with Pants

2015-02-02 Thread Nicholas Chammas
To reiterate, I'm asking from an experimental perspective. I'm not
proposing we change Spark to build with Pants or anything like that.

I'm interested in trying Pants out and I'm wondering if anyone else shares
my interest or already has experience with Pants that they can share.

On Mon Feb 02 2015 at 4:40:45 PM Nicholas Chammas 
nicholas.cham...@gmail.com wrote:

 I'm asking from an experimental standpoint; this is not happening anytime
 soon.

 Of course, if the experiment turns out very well, Pants would replace both
 sbt and Maven (like it has at Twitter, for example). Pants also works
 with IDEs http://pantsbuild.github.io/index.html#using-pants-with.

 On Mon Feb 02 2015 at 4:33:11 PM Stephen Boesch java...@gmail.com wrote:

 There is a significant investment in sbt and maven - and they are not at
 all likely to be going away. A third build tool?  Note that there is also
 the perspective of building within an IDE - which actually works presently
 for sbt and with a little bit of tweaking with maven as well.

 2015-02-02 16:25 GMT-08:00 Nicholas Chammas nicholas.cham...@gmail.com:

 Does anyone here have experience with Pants
 http://pantsbuild.github.io/index.html or interest in trying to build
 Spark with it?

 Pants has an interesting story. It was born at Twitter to help them build
 their Scala, Java, and Python projects as several independent components
 in
 one monolithic repo. (It was inspired by a similar build tool at Google
 called blaze.) The mix of languages and sub-projects at Twitter seems
 similar to the breakdown we have in Spark.

 Pants has an interesting take on how a build system should work, and
 Twitter and Foursquare (who use Pants as their primary build tool) claim
 it
 helps enforce better build hygiene and maintainability.

 Some relevant talks:

- Building Scala Hygienically with Pants
https://www.youtube.com/watch?v=ukqke8iTuH0
- The Pants Build Tool at Twitter
 https://engineering.twitter.com/university/videos/the-pants-build-tool-at-twitter
- Getting Started with the Pants Build System: Why Pants?
 https://engineering.twitter.com/university/videos/getting-started-with-the-pants-build-system-why-pants



 At some point I may take a shot at converting Spark to use Pants as an
 experiment and just see what it’s like.

 Nick
 ​




Re: Building Spark against Scala 2.10.1 virtualized

2014-07-18 Thread Meisam Fathi
Sorry for resurrecting this thread, but project/SparkBuild.scala was
completely rewritten recently (after this commit
https://github.com/apache/spark/tree/628932b). Should library
dependencies be defined in pom.xml files after this commit?

Thanks
Meisam

On Thu, Jun 5, 2014 at 4:51 PM, Matei Zaharia matei.zaha...@gmail.com wrote:
 You can modify project/SparkBuild.scala and build Spark with sbt instead of 
 Maven.


 On Jun 5, 2014, at 12:36 PM, Meisam Fathi meisam.fa...@gmail.com wrote:

 Hi community,

 How should I change sbt to compile the Spark core module with a different
 version of Scala? I see the Maven pom files define dependencies on Scala
 2.10.4. I need to override/ignore the Maven dependencies and use Scala
 virtualized, which needs these lines in a build.sbt file:

 scalaOrganization := "org.scala-lang.virtualized"

 scalaVersion := "2.10.1"

 libraryDependencies += "EPFL" %% "lms" % "0.3-SNAPSHOT"

 scalacOptions += "-Yvirtualize"


 Thanks,
 Meisam



Re: Building Spark against Scala 2.10.1 virtualized

2014-06-05 Thread Matei Zaharia
You can modify project/SparkBuild.scala and build Spark with sbt instead of 
Maven.
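
For a concrete starting point, here is a rough sbt 0.13-style sketch of what such an override could look like; the object, val, and project names are placeholders, not the actual contents of project/SparkBuild.scala:

```
import sbt._
import Keys._

// Placeholder Build definition, not the real SparkBuild.scala: the idea is
// simply to mix settings like these into the shared project settings.
object SparkBuildSketch extends Build {

  // Point sbt at the virtualized Scala compiler and add the LMS dependency,
  // mirroring the build.sbt lines quoted below.
  lazy val virtualizedSettings: Seq[Setting[_]] = Seq(
    scalaOrganization := "org.scala-lang.virtualized",
    scalaVersion := "2.10.1",
    libraryDependencies += "EPFL" %% "lms" % "0.3-SNAPSHOT",
    scalacOptions += "-Yvirtualize"
  )

  // Any module built with these settings uses the virtualized compiler,
  // regardless of what the Maven poms declare.
  lazy val core = Project("example-core", file("core"))
    .settings(virtualizedSettings: _*)
}
```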


On Jun 5, 2014, at 12:36 PM, Meisam Fathi meisam.fa...@gmail.com wrote:

 Hi community,
 
 How should I change sbt to compile the Spark core module with a different
 version of Scala? I see the Maven pom files define dependencies on Scala
 2.10.4. I need to override/ignore the Maven dependencies and use Scala
 virtualized, which needs these lines in a build.sbt file:
 
 scalaOrganization := "org.scala-lang.virtualized"

 scalaVersion := "2.10.1"

 libraryDependencies += "EPFL" %% "lms" % "0.3-SNAPSHOT"

 scalacOptions += "-Yvirtualize"
 
 
 Thanks,
 Meisam



Re: Building Spark AMI

2014-04-11 Thread Mayur Rustagi
I am creating a fully configured and synced one, but you still need to send
over the configuration. Do you plan to use Chef for that?
 On Apr 10, 2014 6:58 PM, Jim Ancona j...@anconafamily.com wrote:

 Are there scripts to build the AMI used by the spark-ec2 script?

 Alternatively, is there a place to download the AMI? I'm interested in
 using it to deploy into an internal OpenStack cloud.

 Thanks,

 Jim