Re: Spark performance tests

2017-01-10 Thread Adam Roberts
Hi, I suggest HiBench and SparkSqlPerf. HiBench features many benchmarks that exercise several components of Spark (great for stressing core, SQL, and MLlib capabilities), while SparkSqlPerf features the 99 TPC-DS queries (stressing the DataFrame API and therefore the Catalyst optimiser); both
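
For illustration, a minimal sketch (not taken from either benchmark) of the kind of DataFrame query these suites stress; it assumes Spark 2.x with a SparkSession, and the table and column names are hypothetical:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.sum

    val spark = SparkSession.builder().appName("catalyst-sketch").getOrCreate()
    import spark.implicits._

    // Tiny stand-in for a benchmark table; TPC-DS queries are far larger
    val sales = Seq((1, 10.0), (1, 20.0), (2, 5.0)).toDF("store_id", "amount")

    // A filter plus aggregation: Catalyst optimises the logical plan and
    // whole-stage code generation compiles the physical plan to bytecode
    val perStore = sales.filter($"amount" > 6.0)
                        .groupBy($"store_id")
                        .agg(sum($"amount").as("total"))
    perStore.explain()  // prints the optimised physical plan
    perStore.show()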

Re: [VOTE] Apache Spark 2.1.0 (RC5)

2016-12-18 Thread Adam Roberts
+1 (non-binding). Functional: looks good, tested with OpenJDK 8 (1.8.0_111) and IBM's latest SDK for Java (8 SR3 FP21). Tests run clean on Ubuntu 16.04 and 14.04, SUSE 12, and CentOS 7.2 on x86, as well as IBM-specific platforms including big-endian. On slower machines I see these failing but nothing to be

Re: [VOTE] Apache Spark 2.1.0 (RC2)

2016-12-13 Thread Adam Roberts
I've never seen the ReplSuite test OOMing with IBM's latest SDK for Java, but I have always noticed this particular test failing with the following instead: java.lang.AssertionError: assertion failed: deviation too large: 0.8506807397223823, first size: 180392, second size: 333848. This

Re: [VOTE] Release Apache Spark 2.0.2 (RC3)

2016-11-11 Thread Adam Roberts
+1 (non-binding)
Build: mvn -T 1C -Psparkr -Pyarn -Phadoop-2.7 -Phive -Phive-thriftserver -DskipTests clean package
Test: mvn -Pyarn -Phadoop-2.7 -Phive -Phive-thriftserver -Dtest.exclude.tags=org.apache.spark.tags.DockerTest -fn test
Test options: -Xss2048k -Dspark.buffer.pageSize=1048576

Re: [VOTE] Release Apache Spark 2.0.2 (RC2)

2016-11-02 Thread Adam Roberts
I'm seeing the same failure, but manifesting itself as a stack overflow, on various operating systems and architectures (RHEL 7.1, CentOS 7.2, SUSE 12, Ubuntu 14.04 and 16.04 LTS). Build and test options:
mvn -T 1C -Psparkr -Pyarn -Phadoop-2.7 -Phive -Phive-thriftserver -DskipTests clean package
mvn

Re: Spark 2.0.0 performance; potential large Spark core regression

2016-07-11 Thread Adam Roberts
vs 2.0 with HiBench (large profile, 25g executor memory, 4g driver); again, we will be carefully checking how these benchmarks are being run and what difference the options and configurations can make. Cheers,

Re: Spark performance regression test suite

2016-07-11 Thread Adam Roberts
Agreed, this is something we do regularly when producing our own Spark distributions at IBM, so it will be beneficial to share updates with the wider community. So far it looks like Spark 1.6.2 is the best out of the box on spark-perf and HiBench (of course this may vary for real

Re: Spark 2.0.0 performance; potential large Spark core regression

2016-07-08 Thread Adam Roberts
WholeStageCodegen: on, I think (we turned it off when fixing a bug). offHeap.enabled: false, offHeap.size: 0. Cheers,
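
For reference, a minimal sketch of how those settings are expressed in Spark 2.0 (these are the standard configuration keys; the values shown simply mirror the ones quoted above):

    import org.apache.spark.sql.SparkSession

    // Sketch only: the standard Spark 2.0 keys for the settings discussed
    val spark = SparkSession.builder()
      .appName("perf-config-sketch")
      .config("spark.sql.codegen.wholeStage", "true")   // WholeStageCodegen on
      .config("spark.memory.offHeap.enabled", "false")  // off-heap disabled
      .config("spark.memory.offHeap.size", "0")
      .getOrCreate()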

Re: Spark 2.0.0 performance; potential large Spark core regression

2016-07-08 Thread Adam Roberts
, executor memory 16g, Kryo, 0.66 memory fraction, 100 trials. We can post the 1.6.2 comparison early next week, running lots of iterations over the weekend once we get the dedicated time again. Cheers,
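
A minimal sketch of how a run like that might be configured. Note this is an assumption about the setup, not the author's actual harness; also, whether the 0.66 maps to the legacy spark.storage.memoryFraction (1.x) or spark.memory.fraction (1.6+/2.x) depends on the release being tested:

    import org.apache.spark.{SparkConf, SparkContext}

    // Sketch of the configuration described above (16g executors, Kryo,
    // 0.66 memory fraction); key names assume the 2.x unified memory manager
    val conf = new SparkConf()
      .setAppName("hibench-style-run")
      .set("spark.executor.memory", "16g")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.memory.fraction", "0.66")
    val sc = new SparkContext(conf)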

Spark 2.0.0 performance; potential large Spark core regression

2016-07-08 Thread Adam Roberts
Hi, we've been testing the performance of Spark 2.0 compared to previous releases; unfortunately there are no Spark 2.0-compatible versions of HiBench and SparkPerf apart from those I'm working on (see https://github.com/databricks/spark-perf/issues/108). With the Spark 2.0 version of SparkPerf

Re: Understanding pyspark data flow on worker nodes

2016-07-08 Thread Adam Roberts
Hi, sharing what I discovered with PySpark too; it corroborates what Amit noticed, and I'm also interested in the pipe question: https://mail-archives.apache.org/mod_mbox/spark-dev/201603.mbox/%3c201603291521.u2tflbfo024...@d06av05.portsmouth.uk.ibm.com%3E // Start a thread to feed the process
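
For anyone following along, the pattern that comment refers to (a dedicated JVM thread writing input into the external worker's stdin while the parent thread reads results from its stdout) looks roughly like this stripped-down sketch; it is not the actual PythonRDD code, and "worker.py" is a hypothetical script:

    import java.io.{BufferedReader, InputStreamReader, PrintWriter}

    // One thread pushes input into the child process while the parent
    // consumes its output, avoiding a deadlock on full OS pipe buffers
    val process = new ProcessBuilder("python", "-u", "worker.py").start()

    val feeder = new Thread("stdin-feeder") {
      override def run(): Unit = {
        val out = new PrintWriter(process.getOutputStream)
        try Iterator("one", "two", "three").foreach(s => out.println(s))
        finally out.close()
      }
    }
    feeder.start()

    // Meanwhile, read results back from the worker's stdout
    val in = new BufferedReader(new InputStreamReader(process.getInputStream))
    Iterator.continually(in.readLine()).takeWhile(_ != null).foreach(println)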

Re: Databricks SparkPerf with Spark 2.0

2016-06-14 Thread Adam Roberts
e is super important), so the emails here will at least point people there. Cheers,

Databricks SparkPerf with Spark 2.0

2016-06-14 Thread Adam Roberts
Hi, I'm working on having "SparkPerf" (https://github.com/databricks/spark-perf) run with Spark 2.0. I noticed a few pull requests not yet accepted, so I'm concerned this project has been abandoned - it's proven very useful in the past for quality assurance, as we can easily exercise lots of Spark

Caching behaviour and deserialized size

2016-05-04 Thread Adam Roberts
Hi, given a very simple test that uses a bigger version of the pom.xml file in our Spark home directory (cat it into itself with a bash for loop so it becomes 100 MB), I've noticed that with larger heap sizes more RDDs are reported as being cached - is this intended behaviour? What
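
A minimal sketch of the kind of test described, assuming an existing SparkContext sc (e.g. in spark-shell) and that the 100 MB file has already been generated; "bigpom.xml" is a hypothetical name. sc.getRDDStorageInfo is the standard way to see what is actually cached:

    import org.apache.spark.storage.StorageLevel

    // Cache a large text file and report what the block manager holds
    val lines = sc.textFile("bigpom.xml").persist(StorageLevel.MEMORY_ONLY)
    lines.count()  // force materialisation

    sc.getRDDStorageInfo.foreach { info =>
      println(s"${info.name}: ${info.numCachedPartitions}/${info.numPartitions} " +
              s"partitions cached, ${info.memSize} bytes in memory")
    }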

Re: BytesToBytes and unaligned memory

2016-04-18 Thread Adam Roberts
for shorts/ints/longs. If these tests continue to pass, then I think the Spark tests don't exercise unaligned memory access. Cheers

BytesToBytes and unaligned memory

2016-04-15 Thread Adam Roberts
Hi, I'm testing Spark 2.0.0 on various architectures and have a question: are we sure core/src/test/java/org/apache/spark/unsafe/map/AbstractBytesToBytesMapSuite.java really is attempting to use unaligned memory access (for the BytesToBytesMapOffHeapSuite tests specifically)? Our JDKs on
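
For context, a hedged sketch of the kind of check involved here: JVMs of this era report whether unaligned access is safe via the package-private java.nio.Bits#unaligned method, reachable only through reflection. This mirrors the general approach rather than quoting Spark's Platform code:

    // Probe the JDK-internal flag for unaligned memory access support
    val unaligned: Boolean =
      try {
        val bitsClass = Class.forName("java.nio.Bits", false, getClass.getClassLoader)
        val m = bitsClass.getDeclaredMethod("unaligned")
        m.setAccessible(true)
        m.invoke(null).asInstanceOf[Boolean]
      } catch {
        case _: Throwable => false  // conservatively assume no unaligned support
      }
    println(s"Unaligned memory access supported: $unaligned")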

Understanding PySpark Internals

2016-03-29 Thread Adam Roberts
Hi, I'm interested in figuring out how the Python API for Spark works. I've come to the following conclusion and want to share it with the community; it could be of use in the PySpark docs here, specifically the "Execution and pipelining" part. Any sanity checking would be much appreciated,
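
As a toy illustration of the pipelining idea (pure Scala, nothing PySpark-specific): consecutive narrow transformations such as map are fused, so each element flows through the composed function in a single pass rather than materialising an intermediate collection between stages:

    // rdd.map(f).map(g) is executed as one pass applying g(f(x))
    val f = (x: Int) => x + 1
    val g = (x: Int) => x * 2
    val fused = f.andThen(g)
    println((1 to 5).map(fused).toList)  // List(4, 6, 8, 10, 12)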

Tungsten in a mixed endian environment

2016-01-12 Thread Adam Roberts
Hi all, I've been experimenting with DataFrame operations in a mixed-endian environment - a big-endian master with little-endian workers. With Tungsten enabled I'm encountering data corruption issues. For example, with this simple test code: import org.apache.spark.SparkContext import
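
The preview cuts the test code off. A minimal reconstruction of the kind of DataFrame operation that could surface such corruption (a hypothetical sketch using the 1.x-era SQLContext API, not the author's original code):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext
    import org.apache.spark.sql.functions.sum

    // Any aggregation that ships Tungsten's raw binary row format between
    // the big-endian driver and little-endian workers could be affected
    val sc = new SparkContext(new SparkConf().setAppName("endian-test"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val df = sc.parallelize(1 to 1000000).toDF("value")
    println(df.agg(sum("value")).first())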

Test workflow - blacklist entire suites and run any independently

2015-09-21 Thread Adam Roberts
Hi, is there an existing way to blacklist any test suite? Ideally we'd have a text file with a series of names (say, comma-separated), and if a name matches the fully qualified class name of a suite, that suite would be skipped. Perhaps we can achieve this via ScalaTest or Maven?
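
For reference, a sketch of ScalaTest's tag mechanism, which is the closest existing hook; Spark's own build already excludes tags this way (see the -Dtest.exclude.tags=org.apache.spark.tags.DockerTest flag in the 2.0.2 RC3 vote entry above). This tags individual tests rather than whole suites; the tag name is hypothetical:

    import org.scalatest.{FunSuite, Tag}

    // Hypothetical tag; tests carrying it can be excluded at run time
    // with ScalaTest's -l option (or Spark's -Dtest.exclude.tags property)
    object SlowTest extends Tag("org.example.tags.SlowTest")

    class ExampleSuite extends FunSuite {
      test("quick sanity check") {
        assert(1 + 1 == 2)
      }

      test("expensive end-to-end run", SlowTest) {
        // skipped when run with: -l org.example.tags.SlowTest
        assert(true)
      }
    }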

Re: Test workflow - blacklist entire suites and run any independently

2015-09-21 Thread Adam Roberts
errors. Must be an easier way...