Re: Spark performance tests

2017-01-10 Thread Adam Roberts
Hi, I suggest HiBench and SparkSqlPerf. HiBench features many benchmarks 
that exercise several components of Spark (great for stressing core, SQL 
and MLlib capabilities), while SparkSqlPerf features 99 TPC-DS queries 
(stressing the DataFrame API and therefore the Catalyst optimiser). Both 
work well with Spark 2.

HiBench: https://github.com/intel-hadoop/HiBench
SparkSqlPerf: https://github.com/databricks/spark-sql-perf
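
If you just want a rough standalone micro-benchmark in the meantime, here 
is a minimal sketch (illustrative only, not taken from HiBench, 
SparkSqlPerf or the in-tree benchmark suites) that times a couple of 
DataFrame operations:

import org.apache.spark.sql.SparkSession

object QuickMicroBench {
  // Crude wall-clock timer: runs the body once and prints the elapsed time
  def time[T](label: String)(body: => T): T = {
    val start = System.nanoTime()
    val result = body
    println(f"$label: ${(System.nanoTime() - start) / 1e9}%.3f s")
    result
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("quick-microbench")
      .master("local[4]")
      .getOrCreate()

    // 10 million rows with 1000 distinct keys
    val df = spark.range(10L * 1000 * 1000)
      .selectExpr("id % 1000 AS key", "id AS value")
    df.cache().count() // materialise the cache before timing anything

    time("groupBy + sum")(df.groupBy("key").sum("value").collect())
    time("sort + head")(df.sort("value").head())

    spark.stop()
  }
}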




From:   "Kazuaki Ishizaki" 
To: Prasun Ratn 
Cc: Apache Spark Dev 
Date:   10/01/2017 09:22
Subject:Re: Spark performance tests



Hi,
You may find several micro-benchmarks under 
https://github.com/apache/spark/tree/master/sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark
.

Regards,
Kazuaki Ishizaki



From:Prasun Ratn 
To:Apache Spark Dev 
Date:2017/01/10 12:52
Subject:Spark performance tests



Hi

Are there performance tests or microbenchmarks for Spark - especially
directed towards the CPU specific parts? I looked at spark-perf but
that doesn't seem to have been updated recently.

Thanks
Prasun

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org




Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number 
741598. 
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU


Re: [VOTE] Apache Spark 2.1.0 (RC5)

2016-12-18 Thread Adam Roberts
+1 (non-binding)

Functional: looks good, tested with OpenJDK 8 (1.8.0_111) and IBM's latest 
SDK for Java (8 SR3 FP21).

Tests run clean on Ubuntu 16.04, Ubuntu 14.04, SUSE 12 and CentOS 7.2 on 
x86, and on IBM-specific platforms including big-endian. On slower 
machines I see the following tests failing, but nothing to be concerned 
over (timeouts):

org.apache.spark.DistributedSuite.caching on disk
org.apache.spark.rdd.LocalCheckpointSuite.missing checkpoint block fails 
with informative message
org.apache.spark.sql.streaming.StreamingAggregationSuite.prune results by 
current_time, complete mode
org.apache.spark.sql.streaming.StreamingAggregationSuite.prune results by 
current_date, complete mode
org.apache.spark.sql.hive.HiveSparkSubmitSuite.set 
hive.metastore.warehouse.dir

Performance vs 2.0.2: lots of improvements seen using the HiBench and 
SparkSqlPerf benchmarks, tested on a 48-core Intel machine with the Kryo 
serializer in a controlled test environment. These are all open source 
benchmarks anyone can use and experiment with. Elapsed times were 
measured; + scores are an improvement (the workload is that many percent 
faster) and - scores indicate regressions I'm seeing.

K-means: Java API +22% (100 sec to 78 sec), Scala API +30% (34 seconds to 
24 seconds), Python API unchanged
PageRank: minor improvement from 40 seconds to 38 seconds, +5%
Sort: minor improvement, 10.8 seconds to 9.8 seconds, +10%
WordCount: unchanged
Bayes: mixed bag, sometimes much slower (95 sec to 140 sec) which is -47%, 
other times marginally faster by 15%, something to keep an eye on
Terasort: +18% (39 seconds to 32 seconds) with the Java/Scala APIs
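
To make the scoring explicit: the percentages are simply the relative 
change in elapsed time. A tiny illustrative snippet (the numbers are the 
K-means figures above; the helper itself is not part of either benchmark):

object ImprovementCalc {
  // A +x% score means the 2.1.0 run took x% less elapsed time than 2.0.2
  def improvementPercent(baselineSec: Double, candidateSec: Double): Double =
    (baselineSec - candidateSec) / baselineSec * 100.0

  def main(args: Array[String]): Unit = {
    println(f"K-means Java API:  +${improvementPercent(100, 78)}%.1f%%") // +22.0%
    println(f"K-means Scala API: +${improvementPercent(34, 24)}%.1f%%")  // +29.4%, roughly the +30% quoted
  }
}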

For the TPC-DS SQL queries the results are a mixed bag again: I see > 10% 
boosts for q9, q68, q75 and q96, and > 10% slowdowns for q7, q39a, q43, 
q52, q57 and q89. Five iterations were run and the average times compared, 
changing only which version of Spark we're using.



From:   Holden Karau 
To: Denny Lee , Liwei Lin , 
"dev@spark.apache.org" 
Date:   18/12/2016 20:05
Subject:Re: [VOTE] Apache Spark 2.1.0 (RC5)



+1 (non-binding) - checked Python artifacts with virtual env.

On Sun, Dec 18, 2016 at 11:42 AM Denny Lee  wrote:
+1 (non-binding)


On Sat, Dec 17, 2016 at 11:45 PM Liwei Lin  wrote:
+1

Cheers,
Liwei



On Sat, Dec 17, 2016 at 10:29 AM, Yuming Wang  wrote:
I hope https://github.com/apache/spark/pull/16252 can be fixed until 
release 2.1.0. It's a fix for broadcast cannot fit in memory.

On Sat, Dec 17, 2016 at 10:23 AM, Joseph Bradley  
wrote:
+1

On Fri, Dec 16, 2016 at 3:21 PM, Herman van Hövell tot Westerflier <
hvanhov...@databricks.com> wrote:
+1

On Sat, Dec 17, 2016 at 12:14 AM, Xiao Li  wrote:
+1

Xiao Li

2016-12-16 12:19 GMT-08:00 Felix Cheung :

For R we have a license field in the DESCRIPTION, and this is standard 
practice (and requirement) for R packages.

https://cran.r-project.org/doc/manuals/R-exts.html#Licensing

From: Sean Owen 
Sent: Friday, December 16, 2016 9:57:15 AM
To: Reynold Xin; dev@spark.apache.org
Subject: Re: [VOTE] Apache Spark 2.1.0 (RC5)

(If you have a template for these emails, maybe update it to use https 
links. They work for apache.org domains. After all we are asking people to 
verify the integrity of release artifacts, so it might as well be secure.)

(Also the new archives use .tar.gz instead of .tgz like the others. No big 
deal, my OCD eye just noticed it.)

I don't see an Apache license / notice for the Pyspark or SparkR 
artifacts. It would be good practice to include this in a convenience 
binary. I'm not sure if it's strictly mandatory, but something to adjust 
in any event. I think that's all there is to do for SparkR. For Pyspark, 
which packages a bunch of dependencies, it does include the licenses 
(good) but I think it should include the NOTICE file.

This is the first time I recall getting 0 test failures off the bat!

I'm using Java 8 / Ubuntu 16 and yarn/hive/hadoop-2.7 profiles.

I think I'd +1 this therefore unless someone knows that the license issue 
above is real and a blocker.

On Fri, Dec 16, 2016 at 5:17 AM Reynold Xin  wrote:

Please vote on releasing the following candidate as Apache Spark version 
2.1.0. The vote is open until Sun, December 18, 2016 at 21:30 PT and 
passes if a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 2.1.0
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see 
http://spark.apache.org/

The tag to be voted on is v2.1.0-rc5 
(cd0a08361e2526519e7c131c42116bf56fa62c76)

List of JIRA tickets resolved are:  

Re: [VOTE] Apache Spark 2.1.0 (RC2)

2016-12-13 Thread Adam Roberts
I've never seen the ReplSuite test OoMing with IBM's latest SDK for Java 
but have always noticed this particular test failing with the following 
instead:

java.lang.AssertionError: assertion failed: deviation too large: 
0.8506807397223823, first size: 180392, second size: 333848
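
For reference, the reported deviation is consistent with the relative 
difference between the two sizes; a quick illustrative check (not the 
ReplSuite code itself):

object DeviationCheck {
  // (second - first) / first reproduces the figure in the assertion message
  def deviation(firstSize: Long, secondSize: Long): Double =
    (secondSize - firstSize).toDouble / firstSize

  def main(args: Array[String]): Unit = {
    println(deviation(180392L, 333848L)) // prints roughly 0.8506807397223823
  }
}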

This particular test could be improved and I don't think it should hold 
up releases. I commented on [SPARK-14558] a while back and the discussion 
ended with: 

"A better check would be to run with and without the closure cleaner 
change
-> Yea, this is what I did locally, but how to write a test for it?"

It will reliably fail in this particular way with OpenJDK/Oracle JDK as 
well if you use Kryo.

We don't see this test failing (either the OoM or the above failure) with 
OpenJDK 8 in our test farm; this is with OpenJDK 1.8.0_51-b16 and I'm 
running with -Xmx4g -Xss2048k -Dspark.buffer.pageSize=1048576.

All other Spark unit tests pass (we see a grand total of 11,980 tests) 
except for the Kafka stress test already mentioned, across various 
platforms/operating systems including big-endian.

I've never seen the NoSuchMethod error mentioned in JavaUDFSuite and 
haven't seen the failure Alan mentions below either.

I also have performance data to share (HiBench and SparkSqlPerf with 
TPC-DS queries) comparing this release to Spark 2.0.2; I'll wait until the 
next RC before commenting (it is positive). It looks like we'll have 
another RC, as this RC2 vote should be closed by now, and in RC3 we'd also 
have the [SPARK-18091] fix included to prevent a test's generated code 
exceeding the 64k constant pool size limit.




From:   akchin 
To: dev@spark.apache.org
Date:   13/12/2016 19:51
Subject:Re: [VOTE] Apache Spark 2.1.0 (RC2)



Hello, 

I am seeing this error as well except during "define case class and create
Dataset together with paste mode *** FAILED ***" 
Starts throwing OOM and GC errors after running for several minutes. 





-
Alan Chin 
IBM Spark Technology Center 
--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Apache-Spark-2-1-0-RC2-tp20175p20215.html

Sent from the Apache Spark Developers List mailing list archive at 
Nabble.com.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number 
741598. 
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU


Re: [VOTE] Release Apache Spark 2.0.2 (RC3)

2016-11-11 Thread Adam Roberts
+1 (non-binding)

Build: mvn -T 1C -Psparkr -Pyarn -Phadoop-2.7 -Phive -Phive-thriftserver 
-DskipTests clean package
Test: mvn -Pyarn -Phadoop-2.7 -Phive -Phive-thriftserver 
-Dtest.exclude.tags=org.apache.spark.tags.DockerTest -fn test
Test options: -Xss2048k -Dspark.buffer.pageSize=1048576 -Xmx4g

No problems with OpenJDK 8 on x86.

No problems with the latest IBM Java 8 on various architectures including 
big- and little-endian, and various operating systems including RHEL 7.2, 
CentOS 7.2, Ubuntu 14.04, Ubuntu 16.04 and SUSE 12.

No issues with the Python tests.

No performance concerns with HiBench large.




From:   Mingjie Tang 
To: Tathagata Das 
Cc: Kousuke Saruta , Reynold Xin 
, dev 
Date:   11/11/2016 03:44
Subject:Re: [VOTE] Release Apache Spark 2.0.2 (RC3)



+1 (non-binding)

On Thu, Nov 10, 2016 at 6:06 PM, Tathagata Das <
tathagata.das1...@gmail.com> wrote:
+1 binding

On Thu, Nov 10, 2016 at 6:05 PM, Kousuke Saruta  wrote:
+1 (non-binding)


On 2016年11月08日 15:09, Reynold Xin wrote:
Please vote on releasing the following candidate as Apache Spark version 
2.0.2. The vote is open until Thu, Nov 10, 2016 at 22:00 PDT and passes if 
a majority of at least 3+1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 2.0.2
[ ] -1 Do not release this package because ...


The tag to be voted on is v2.0.2-rc3 
(584354eaac02531c9584188b143367ba694b0c34)

This release candidate resolves 84 issues: 
https://s.apache.org/spark-2.0.2-jira

The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc3-bin/ <
http://people.apache.org/%7Epwendell/spark-releases/spark-2.0.2-rc3-bin/>

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1214/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc3-docs/ <
http://people.apache.org/%7Epwendell/spark-releases/spark-2.0.2-rc3-docs/>


Q: How can I help test this release?
A: If you are a Spark user, you can help us test this release by taking an 
existing Spark workload and running on this release candidate, then 
reporting any regressions from 2.0.1.

Q: What justifies a -1 vote for this release?
A: This is a maintenance release in the 2.0.x series. Bugs already present 
in 2.0.1, missing features, or bugs related to new features will not 
necessarily block this release.

Q: What fix version should I use for patches merging into branch-2.0 from 
now on?
A: Please mark the fix version as 2.0.3, rather than 2.0.2. If a new RC 
(i.e. RC4) is cut, I will change the fix version of those patches to 
2.0.2.


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org




Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number 
741598. 
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU


Re: [VOTE] Release Apache Spark 2.0.2 (RC2)

2016-11-02 Thread Adam Roberts
I'm seeing the same failure, but manifesting itself as a stack overflow, 
on various operating systems and architectures (RHEL 7.1, CentOS 7.2, 
SUSE 12, Ubuntu 14.04 and 16.04 LTS).

Build and test options:
mvn -T 1C -Psparkr -Pyarn -Phadoop-2.7 -Phive -Phive-thriftserver 
-DskipTests clean package

mvn -Pyarn -Phadoop-2.7 -Phive -Phive-thriftserver 
-Dtest.exclude.tags=org.apache.spark.tags.DockerTest -fn test

-Xss2048k -Dspark.buffer.pageSize=1048576 -Xmx4g

Stacktrace (this is with IBM's latest SDK for Java 8):

  scala> org.apache.spark.SparkException: Job aborted due to stage 
failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost 
task 0.0 in stage 0.0 (TID 0, localhost): 
com.google.common.util.concurrent.ExecutionError: 
java.lang.StackOverflowError: operating system stack overflow
at 
com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2261)
at com.google.common.cache.LocalCache.get(LocalCache.java:4000)
at 
com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:4004)
at 
com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:849)
at 
org.apache.spark.sql.catalyst.expressions.codegen.GenerateSafeProjection$.create(GenerateSafeProjection.scala:188)
at 
org.apache.spark.sql.catalyst.expressions.codegen.GenerateSafeProjection$.create(GenerateSafeProjection.scala:36)
at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:833)
at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:830)
at 
org.apache.spark.sql.execution.ObjectOperator$.deserializeRowToObject(objects.scala:137)
... omitted the rest for brevity

It would also be useful to include this small change that looks to have 
only just missed the cut: https://github.com/apache/spark/pull/14409




From:   Reynold Xin 
To: Dongjoon Hyun 
Cc: "dev@spark.apache.org" 
Date:   02/11/2016 18:37
Subject:Re: [VOTE] Release Apache Spark 2.0.2 (RC2)



Looks like there is an issue with Maven (likely just the test itself 
though). We should look into it.


On Wed, Nov 2, 2016 at 11:32 AM, Dongjoon Hyun  
wrote:
Hi, Sean.

The same failure blocks me, too.

- SPARK-18189: Fix serialization issue in KeyValueGroupedDataset *** 
FAILED ***

I used `-Pyarn -Phadoop-2.7 -Pkinesis-asl -Phive -Phive-thriftserver 
-Dsparkr` on CentOS 7 / OpenJDK1.8.0_111.

Dongjoon.

On 2016-11-02 10:44 (-0700), Sean Owen  wrote:
> Sigs, license, etc are OK. There are no Blockers for 2.0.2, though here 
are
> the 4 issues still open:
>
> SPARK-14387 Enable Hive-1.x ORC compatibility with
> spark.sql.hive.convertMetastoreOrc
> SPARK-17957 Calling outer join and na.fill(0) and then inner join will 
miss
> rows
> SPARK-17981 Incorrectly Set Nullability to False in FilterExec
> SPARK-18160 spark.files & spark.jars should not be passed to driver in 
yarn
> mode
>
> Running with Java 8, -Pyarn -Phive -Phive-thriftserver -Phadoop-2.7 on
> Ubuntu 16, I am seeing consistent failures in this test below. I think 
we
> very recently changed this so it could be legitimate. But does anyone 
else
> see something like this? I have seen other failures in this test due to 
OOM
> but my MAVEN_OPTS allows 6g of heap, which ought to be plenty.
>
>
> - SPARK-18189: Fix serialization issue in KeyValueGroupedDataset *** 
FAILED
> ***
>   isContain was true Interpreter output contained 'Exception':
>   Welcome to
>   __
>/ __/__  ___ _/ /__
>   _\ \/ _ \/ _ `/ __/  '_/
>  /___/ .__/\_,_/_/ /_/\_\   version 2.0.2
> /_/
>
>   Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_102)
>   Type in expressions to have them evaluated.
>   Type :help for more information.
>
>   scala>
>   scala> keyValueGrouped:
> org.apache.spark.sql.KeyValueGroupedDataset[Int,(Int, Int)] =
> org.apache.spark.sql.KeyValueGroupedDataset@70c30f72
>
>   scala> mapGroups: org.apache.spark.sql.Dataset[(Int, Int)] = [_1: int,
> _2: int]
>
>   scala> broadcasted: org.apache.spark.broadcast.Broadcast[Int] =
> Broadcast(0)
>
>   scala>
>   scala>
>   scala> dataset: org.apache.spark.sql.Dataset[Int] = [value: int]
>
>   scala> org.apache.spark.SparkException: Job aborted due to stage 
failure:
> Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 
in
> stage 0.0 (TID 0, localhost):
> com.google.common.util.concurrent.ExecutionError:
> java.lang.ClassCircularityError:
> 
io/netty/util/internal/__matchers__/org/apache/spark/network/protocol/MessageMatcher
>   at 
com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2261)
>   at com.google.common.cache.LocalCache.get(LocalCache.java:4000)
>   at 

Re: Spark 2.0.0 performance; potential large Spark core regression

2016-07-11 Thread Adam Roberts
Ted,

That bug was https://issues.apache.org/jira/browse/SPARK-15822 and was 
only found as part of running an sql-flights application (not with the 
unit tests); I don't know if this has anything to do with the regression 
we're seeing.

One update: we see the same ballpark regression for 1.6.2 vs 2.0 with 
HiBench (large profile, 25g executor memory, 4g driver). Again, we will be 
carefully checking how these benchmarks are being run and what difference 
the options and configurations can make.

Cheers,




From:   Ted Yu <yuzhih...@gmail.com>
To:     Adam Roberts/UK/IBM@IBMGB
Cc: Michael Allman <mich...@videoamp.com>, dev <dev@spark.apache.org>
Date:   08/07/2016 17:26
Subject:Re: Spark 2.0.0 performance; potential large Spark core 
regression



bq. we turned it off when fixing a bug

Adam:
Can you refer to the bug JIRA ?

Thanks

On Fri, Jul 8, 2016 at 9:22 AM, Adam Roberts <arobe...@uk.ibm.com> wrote:
Thanks Michael, we can give your options a try and aim for a 2.0.0 tuned 
vs 2.0.0 default vs 1.6.2 default comparison. For future reference, the 
defaults in Spark 2 RC2 look to be: 

sql.shuffle.partitions: 200 
Tungsten enabled: true 
Executor memory: 1 GB (we set to 18 GB) 
kryo buffer max: 64mb 
WholeStageCodegen: on I think, we turned it off when fixing a bug 
offHeap.enabled: false 
offHeap.size: 0 

Cheers, 




From:Michael Allman <mich...@videoamp.com> 
To:Adam Roberts/UK/IBM@IBMGB 
Cc:dev <dev@spark.apache.org> 
Date:08/07/2016 17:05 
Subject:Re: Spark 2.0.0 performance; potential large Spark core 
regression 



Here are some settings we use for some very large GraphX jobs. These are 
based on using EC2 c3.8xl workers: 

.set("spark.sql.shuffle.partitions", "1024")
   .set("spark.sql.tungsten.enabled", "true")
   .set("spark.executor.memory", "24g")
   .set("spark.kryoserializer.buffer.max","1g")
   .set("spark.sql.codegen.wholeStage", "true")
   .set("spark.memory.offHeap.enabled", "true")
   .set("spark.memory.offHeap.size", "25769803776") // 24 GB

Some of these are in fact default configurations. Some are not. 

Michael


On Jul 8, 2016, at 9:01 AM, Michael Allman <mich...@videoamp.com> wrote: 

Hi Adam, 

From our experience we've found the default Spark 2.0 configuration to be 
highly suboptimal. I don't know if this affects your benchmarks, but I 
would consider running some tests with tuned and alternate configurations. 


Michael 


On Jul 8, 2016, at 8:58 AM, Adam Roberts <arobe...@uk.ibm.com> wrote: 

Hi Michael, the two Spark configuration files aren't very exciting 

spark-env.sh 
Same as the template apart from a JAVA_HOME setting 

spark-defaults.conf 
spark.io.compression.codec lzf 

config.py has the Spark home set, is running Spark standalone mode, we run 
and prep Spark tests only, driver 8g, executor memory 16g, Kryo, 0.66 
memory fraction, 100 trials 

We can post the 1.6.2 comparison early next week, running lots of 
iterations over the weekend once we get the dedicated time again 

Cheers, 





From:Michael Allman <mich...@videoamp.com> 
To:Adam Roberts/UK/IBM@IBMGB 
Cc:dev <dev@spark.apache.org> 
Date:08/07/2016 16:44 
Subject:Re: Spark 2.0.0 performance; potential large Spark core 
regression 



Hi Adam, 

Do you have your spark confs and your spark-env.sh somewhere where we can 
see them? If not, can you make them available? 

Cheers, 

Michael 

On Jul 8, 2016, at 3:17 AM, Adam Roberts <arobe...@uk.ibm.com> wrote: 

Hi, we've been testing the performance of Spark 2.0 compared to previous 
releases, unfortunately there are no Spark 2.0 compatible versions of 
HiBench and SparkPerf apart from those I'm working on (see 
https://github.com/databricks/spark-perf/issues/108) 

With the Spark 2.0 version of SparkPerf we've noticed a 30% geomean 
regression with a very small scale factor and so we've generated a couple 
of profiles comparing 1.5.2 vs 2.0.0. Same JDK version and same platform. 
We will gather a 1.6.2 comparison and increase the scale factor. 

Has anybody noticed a similar problem? My changes for SparkPerf and Spark 
2.0 are very limited and AFAIK don't interfere with Spark core 
functionality, so any feedback on the changes would be much appreciated 
and welcome, I'd much prefer it if my changes are the problem. 

A summary for your convenience follows (this matches what I've mentioned 
on the SparkPerf issue above) 

1. spark-perf/config/config.py : SCALE_FACTOR=0.05
No. Of Workers: 1
Executor per Worker : 1
Executor Memory: 18G
Driver Memory : 8G
Serializer: kryo 

2. $SPARK_HOME/conf/spark-defaults.conf: executor Java Options: 
-Xdisableexplicitgc -Xcompressedrefs 

Main changes I made for the benchmark itself 
Use Scala 2.11.8 and Spar

Re: Spark performance regression test suite

2016-07-11 Thread Adam Roberts
Agreed, this is something we do regularly when producing our own Spark 
distributions in IBM, so it will be beneficial to share updates with the 
wider community. So far it looks like Spark 1.6.2 is the best out of the 
box on spark-perf and HiBench (of course this may vary for real workloads, 
individual applications and tuning efforts), but we have more 2.0 tests to 
perform, and we're not aware of any regressions between previous versions 
except perhaps for the one described in the Spark 2.0.0 post I made.

I'm looking for testing and feedback from any Spark gurus with my 2.0 
changes for spark-perf (have a look at the open issue Holden's mentioned: 
https://github.com/databricks/spark-perf/issues/108) and the same goes for 
HiBench (FWIW we see the same regression on HiBench too: 
https://github.com/intel-hadoop/HiBench/issues/221).

One idea for us is that the benchmarking could be run optionally as part 
of the existing contribution process; an ideal solution IMO would involve 
an additional parameter for the Jenkins job that, when ticked, results in 
a performance run being done with and without the change. As we don't have 
direct access to the Jenkins build button in the community, users 
contributing a change could mark it with something like @performance or 
"jenkins performance test this please". 

Alternatively, the influential Spark folk could notice a change with a 
potential performance impact and have it tested accordingly. While 
microbenchmarks are useful, it will be important to test the whole of 
Spark. Then there's also the use of tags in JIRA - lots for us to work 
with if we wanted this.

This probably means the addition, and therefore maintenance, of dedicated 
machines in the build farm, although this would highlight any regressions 
fast, as opposed to later in the development cycle.

If there is indeed a regression we may have the fun task of binary 
chopping (bisecting) commits between 1.6.2 and now... again TBC but a real 
possibility, so I'm interested to see if anybody else is doing regression 
testing and whether they see a similar problem.

If we don't go down the "benchmark as you contribute" route, having such a 
suite will be perfect - it would clone the latest versions of each 
benchmark, build them for the current version of Spark (can identify the 
release from the pom), run the benchmarks we care about (let's say in 
Spark standalone mode with a couple of executors) and produce a geomean 
score - highlighting any significant deviations.
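
For what it's worth the geomean score itself is trivial to compute; a 
minimal sketch (the sample times below are made up for illustration):

object GeomeanScore {
  // Geometric mean of per-benchmark elapsed times (seconds)
  def geomean(times: Seq[Double]): Double =
    math.exp(times.map(math.log).sum / times.size)

  def main(args: Array[String]): Unit = {
    val elapsedSeconds = Seq(7.08, 1.01, 1.19, 2.02, 0.80) // hypothetical run
    println(f"geomean score: ${geomean(elapsedSeconds)}%.2f s")
  }
}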

I'm happy to help with designing/reviewing this

Cheers,







From:   Michael Gummelt 
To: Eric Liang 
Cc: Holden Karau , Ted Yu , 
Michael Allman , dev 
Date:   11/07/2016 17:00
Subject:Re: Spark performance regression test suite



I second any effort to update, automate, and communicate the results of 
spark-perf (https://github.com/databricks/spark-perf)

On Fri, Jul 8, 2016 at 12:28 PM, Eric Liang  wrote:
Something like speed.pypy.org or the Chrome performance dashboards would 
be very useful.

On Fri, Jul 8, 2016 at 9:50 AM Holden Karau  wrote:
There are also the spark-perf and spark-sql-perf projects in the 
Databricks github (although I see an open issue for Spark 2.0 support in 
one of them).

On Friday, July 8, 2016, Ted Yu  wrote:
Found a few issues:

[SPARK-6810] Performance benchmarks for SparkR
[SPARK-2833] performance tests for linear regression
[SPARK-15447] Performance test for ALS in Spark 2.0

Haven't found one for Spark core.

On Fri, Jul 8, 2016 at 8:58 AM, Michael Allman  
wrote:
Hello,

I've seen a few messages on the mailing list regarding Spark performance 
concerns, especially regressions from previous versions. It got me 
thinking that perhaps an automated performance regression suite would be a 
worthwhile contribution? Is anyone working on this? Do we have a Jira 
issue for it?

I cannot commit to taking charge of such a project. I just thought it 
would be a great contribution for someone who does have the time and the 
chops to build it.

Cheers,

Michael
-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org




-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau




-- 
Michael Gummelt
Software Engineer
Mesosphere

Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number 
741598. 
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU


Re: Spark 2.0.0 performance; potential large Spark core regression

2016-07-08 Thread Adam Roberts
Thanks Michael, we can give your options a try and aim for a 2.0.0 tuned 
vs 2.0.0 default vs 1.6.2 default comparison. For future reference, the 
defaults in Spark 2 RC2 look to be:

sql.shuffle.partitions: 200
Tungsten enabled: true
Executor memory: 1 GB (we set to 18 GB)
kryo buffer max: 64mb
WholeStageCodegen: on I think, we turned it off when fixing a bug
offHeap.enabled: false
offHeap.size: 0

Cheers,




From:   Michael Allman <mich...@videoamp.com>
To:     Adam Roberts/UK/IBM@IBMGB
Cc: dev <dev@spark.apache.org>
Date:   08/07/2016 17:05
Subject:Re: Spark 2.0.0 performance; potential large Spark core 
regression



Here are some settings we use for some very large GraphX jobs. These are 
based on using EC2 c3.8xl workers:

.set("spark.sql.shuffle.partitions", "1024")
.set("spark.sql.tungsten.enabled", "true")
.set("spark.executor.memory", "24g")
.set("spark.kryoserializer.buffer.max","1g")
.set("spark.sql.codegen.wholeStage", "true")
.set("spark.memory.offHeap.enabled", "true")
.set("spark.memory.offHeap.size", "25769803776") // 24 GB

Some of these are in fact default configurations. Some are not.

Michael


On Jul 8, 2016, at 9:01 AM, Michael Allman <mich...@videoamp.com> wrote:

Hi Adam,

From our experience we've found the default Spark 2.0 configuration to be 
highly suboptimal. I don't know if this affects your benchmarks, but I 
would consider running some tests with tuned and alternate configurations.

Michael


On Jul 8, 2016, at 8:58 AM, Adam Roberts <arobe...@uk.ibm.com> wrote:

Hi Michael, the two Spark configuration files aren't very exciting 

spark-env.sh 
Same as the template apart from a JAVA_HOME setting 

spark-defaults.conf 
spark.io.compression.codec lzf 

config.py has the Spark home set, is running Spark standalone mode, we run 
and prep Spark tests only, driver 8g, executor memory 16g, Kryo, 0.66 
memory fraction, 100 trials 

We can post the 1.6.2 comparison early next week, running lots of 
iterations over the weekend once we get the dedicated time again 

Cheers, 





From:Michael Allman <mich...@videoamp.com> 
To:Adam Roberts/UK/IBM@IBMGB 
Cc:dev <dev@spark.apache.org> 
Date:08/07/2016 16:44 
Subject:Re: Spark 2.0.0 performance; potential large Spark core 
regression 



Hi Adam, 

Do you have your spark confs and your spark-env.sh somewhere where we can 
see them? If not, can you make them available? 

Cheers, 

Michael 

On Jul 8, 2016, at 3:17 AM, Adam Roberts <arobe...@uk.ibm.com> wrote: 

Hi, we've been testing the performance of Spark 2.0 compared to previous 
releases, unfortunately there are no Spark 2.0 compatible versions of 
HiBench and SparkPerf apart from those I'm working on (see 
https://github.com/databricks/spark-perf/issues/108) 

With the Spark 2.0 version of SparkPerf we've noticed a 30% geomean 
regression with a very small scale factor and so we've generated a couple 
of profiles comparing 1.5.2 vs 2.0.0. Same JDK version and same platform. 
We will gather a 1.6.2 comparison and increase the scale factor. 

Has anybody noticed a similar problem? My changes for SparkPerf and Spark 
2.0 are very limited and AFAIK don't interfere with Spark core 
functionality, so any feedback on the changes would be much appreciated 
and welcome, I'd much prefer it if my changes are the problem. 

A summary for your convenience follows (this matches what I've mentioned 
on the SparkPerf issue above) 

1. spark-perf/config/config.py : SCALE_FACTOR=0.05
No. Of Workers: 1
Executor per Worker : 1
Executor Memory: 18G
Driver Memory : 8G
Serializer: kryo 

2. $SPARK_HOME/conf/spark-defaults.conf: executor Java Options: 
-Xdisableexplicitgc -Xcompressedrefs 

Main changes I made for the benchmark itself 
Use Scala 2.11.8 and Spark 2.0.0 RC2 on our local filesystem 
MLAlgorithmTests use Vectors.fromML 
For streaming-tests in HdfsRecoveryTest we use wordStream.foreachRDD not 
wordStream.foreach 
KVDataTest uses awaitTerminationOrTimeout in a SparkStreamingContext 
instead of awaitTermination 
Trivial: we use compact not compact.render for outputting json

In Spark 2.0 the top five methods where we spend our time is as follows, 
the percentage is how much of the overall processing time was spent in 
this particular method: 
1.AppendOnlyMap.changeValue 44% 
2.SortShuffleWriter.write 19% 
3.SizeTracker.estimateSize 7.5% 
4.SizeEstimator.estimate 5.36% 
5.Range.foreach 3.6% 

and in 1.5.2 the top five methods are: 
1.AppendOnlyMap.changeValue 38% 
2.ExternalSorter.insertAll 33% 
3.Range.foreach 4% 
4.SizeEstimator.estimate 2% 
5.SizeEstimator.visitSingleObject 2% 

I see the following scores, on the left I have the test name followed by 
th

Re: Spark 2.0.0 performance; potential large Spark core regression

2016-07-08 Thread Adam Roberts
Hi Michael, the two Spark configuration files aren't very exciting

spark-env.sh
Same as the template apart from a JAVA_HOME setting

spark-defaults.conf
spark.io.compression.codec lzf

config.py has the Spark home set, is running Spark standalone mode, we run 
and prep Spark tests only, driver 8g, executor memory 16g, Kryo, 0.66 
memory fraction, 100 trials

We can post the 1.6.2 comparison early next week, running lots of 
iterations over the weekend once we get the dedicated time again

Cheers,





From:   Michael Allman <mich...@videoamp.com>
To:     Adam Roberts/UK/IBM@IBMGB
Cc: dev <dev@spark.apache.org>
Date:   08/07/2016 16:44
Subject:Re: Spark 2.0.0 performance; potential large Spark core 
regression



Hi Adam,

Do you have your spark confs and your spark-env.sh somewhere where we can 
see them? If not, can you make them available?

Cheers,

Michael

On Jul 8, 2016, at 3:17 AM, Adam Roberts <arobe...@uk.ibm.com> wrote:

Hi, we've been testing the performance of Spark 2.0 compared to previous 
releases, unfortunately there are no Spark 2.0 compatible versions of 
HiBench and SparkPerf apart from those I'm working on (see 
https://github.com/databricks/spark-perf/issues/108) 

With the Spark 2.0 version of SparkPerf we've noticed a 30% geomean 
regression with a very small scale factor and so we've generated a couple 
of profiles comparing 1.5.2 vs 2.0.0. Same JDK version and same platform. 
We will gather a 1.6.2 comparison and increase the scale factor. 

Has anybody noticed a similar problem? My changes for SparkPerf and Spark 
2.0 are very limited and AFAIK don't interfere with Spark core 
functionality, so any feedback on the changes would be much appreciated 
and welcome, I'd much prefer it if my changes are the problem. 

A summary for your convenience follows (this matches what I've mentioned 
on the SparkPerf issue above) 

1. spark-perf/config/config.py : SCALE_FACTOR=0.05
No. Of Workers: 1
Executor per Worker : 1
Executor Memory: 18G
Driver Memory : 8G
Serializer: kryo 

2. $SPARK_HOME/conf/spark-defaults.conf: executor Java Options: 
-Xdisableexplicitgc -Xcompressedrefs 

Main changes I made for the benchmark itself 
Use Scala 2.11.8 and Spark 2.0.0 RC2 on our local filesystem 
MLAlgorithmTests use Vectors.fromML 
For streaming-tests in HdfsRecoveryTest we use wordStream.foreachRDD not 
wordStream.foreach 
KVDataTest uses awaitTerminationOrTimeout in a SparkStreamingContext 
instead of awaitTermination 
Trivial: we use compact not compact.render for outputting json

In Spark 2.0 the top five methods where we spend our time is as follows, 
the percentage is how much of the overall processing time was spent in 
this particular method: 
1.AppendOnlyMap.changeValue 44% 
2.SortShuffleWriter.write 19% 
3.SizeTracker.estimateSize 7.5% 
4.SizeEstimator.estimate 5.36% 
5.Range.foreach 3.6% 

and in 1.5.2 the top five methods are: 
1.AppendOnlyMap.changeValue 38% 
2.ExternalSorter.insertAll 33% 
3.Range.foreach 4% 
4.SizeEstimator.estimate 2% 
5.SizeEstimator.visitSingleObject 2% 

I see the following scores, on the left I have the test name followed by 
the 1.5.2 time and then the 2.0.0 time
scheduling throughput: 5.2s vs 7.08s
agg by key: 0.72s vs 1.01s
agg by key int: 0.93s vs 1.19s
agg by key naive: 1.88s vs 2.02s
sort by key: 0.64s vs 0.8s
sort by key int: 0.59s vs 0.64s
scala count: 0.09s vs 0.08s
scala count w fltr: 0.31s vs 0.47s 

This is only running the Spark core tests (scheduling throughput through 
scala-count-w-filtr, including all inbetween). 

Cheers, 


Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number 
741598. 
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU




Spark 2.0.0 performance; potential large Spark core regression

2016-07-08 Thread Adam Roberts
Hi, we've been testing the performance of Spark 2.0 compared to previous 
releases; unfortunately there are no Spark 2.0 compatible versions of 
HiBench and SparkPerf apart from those I'm working on (see 
https://github.com/databricks/spark-perf/issues/108).

With the Spark 2.0 version of SparkPerf we've noticed a 30% geomean 
regression with a very small scale factor and so we've generated a couple 
of profiles comparing 1.5.2 vs 2.0.0. Same JDK version and same platform. 
We will gather a 1.6.2 comparison and increase the scale factor.

Has anybody noticed a similar problem? My changes for SparkPerf and Spark 
2.0 are very limited and AFAIK don't interfere with Spark core 
functionality, so any feedback on the changes would be much appreciated 
and welcome; I'd much prefer it if my changes were the problem.

A summary for your convenience follows (this matches what I've mentioned 
on the SparkPerf issue above)

1. spark-perf/config/config.py : SCALE_FACTOR=0.05
No. Of Workers: 1
Executor per Worker : 1
Executor Memory: 18G
Driver Memory : 8G
Serializer: kryo

2. $SPARK_HOME/conf/spark-defaults.conf: executor Java Options: 
-Xdisableexplicitgc -Xcompressedrefs

Main changes I made for the benchmark itself
Use Scala 2.11.8 and Spark 2.0.0 RC2 on our local filesystem
MLAlgorithmTests use Vectors.fromML
For streaming-tests in HdfsRecoveryTest we use wordStream.foreachRDD not 
wordStream.foreach
KVDataTest uses awaitTerminationOrTimeout in a SparkStreamingContext 
instead of awaitTermination
Trivial: we use compact not compact.render for outputting json
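
To make those changes concrete, here is a minimal, self-contained sketch 
of the Spark 2.0 API usage involved (names such as wordStream are 
stand-ins for the spark-perf test code, not quotes from it):

import org.apache.spark.SparkConf
import org.apache.spark.mllib.linalg.{Vectors => MLlibVectors}
import org.apache.spark.ml.linalg.{Vectors => MLVectors}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.json4s.JsonDSL._
import org.json4s.jackson.JsonMethods.compact

import scala.collection.mutable

object MigrationSketch {
  def main(args: Array[String]): Unit = {
    // Vectors.fromML: convert an ml.linalg vector back to the old mllib type
    println(MLlibVectors.fromML(MLVectors.dense(1.0, 2.0, 3.0)))

    // json4s: compact(...) applied directly to a JValue built with JsonDSL
    println(compact(("test" -> "agg-by-key") ~ ("seconds" -> 1.01)))

    // Streaming: foreachRDD rather than the removed foreach, and
    // awaitTerminationOrTimeout rather than awaitTermination
    val ssc = new StreamingContext(
      new SparkConf().setAppName("migration-sketch").setMaster("local[2]"),
      Seconds(1))
    val wordStream = ssc.queueStream(
      mutable.Queue(ssc.sparkContext.makeRDD(Seq("a", "b", "c"))))
    wordStream.foreachRDD(rdd => println(s"batch size: ${rdd.count()}"))
    ssc.start()
    ssc.awaitTerminationOrTimeout(5000L) // give up after at most 5 seconds
    ssc.stop()
  }
}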

In Spark 2.0 the top five methods where we spend our time are as follows; 
the percentage is how much of the overall processing time was spent in 
that particular method:
1.  AppendOnlyMap.changeValue 44%
2.  SortShuffleWriter.write 19%
3.  SizeTracker.estimateSize 7.5%
4.  SizeEstimator.estimate 5.36%
5.  Range.foreach 3.6%

and in 1.5.2 the top five methods are:
1.  AppendOnlyMap.changeValue 38%
2.  ExternalSorter.insertAll 33%
3.  Range.foreach 4%
4.  SizeEstimator.estimate 2%
5.  SizeEstimator.visitSingleObject 2%

I see the following scores; on the left I have the test name, followed by 
the 1.5.2 time and then the 2.0.0 time:
scheduling throughput: 5.2s vs 7.08s
agg by key: 0.72s vs 1.01s
agg by key int: 0.93s vs 1.19s
agg by key naive: 1.88s vs 2.02s
sort by key: 0.64s vs 0.8s
sort by key int: 0.59s vs 0.64s
scala count: 0.09s vs 0.08s
scala count w fltr: 0.31s vs 0.47s

This is only running the Spark core tests (scheduling throughput through 
scala-count-w-filtr, including everything in between).

Cheers,


Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number 
741598. 
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU


Re: Understanding pyspark data flow on worker nodes

2016-07-08 Thread Adam Roberts
Hi, sharing what I discovered with PySpark too; it corroborates what Amit 
notices, and I'm also interested in the pipe question:
https://mail-archives.apache.org/mod_mbox/spark-dev/201603.mbox/%3c201603291521.u2tflbfo024...@d06av05.portsmouth.uk.ibm.com%3E


// Start a thread to feed the process input from our parent's iterator 
  val writerThread = new WriterThread(env, worker, inputIterator, 
partitionIndex, context)

...

// Return an iterator that read lines from the process's stdout 
  val stream = new DataInputStream(new 
BufferedInputStream(worker.getInputStream, bufferSize))

The above code and what follows look to be the important parts.



Note that Josh Rosen replied to my comment with more information:

"One clarification: there are Python interpreters running on executors so 
that Python UDFs and RDD API code can be executed. Some slightly-outdated 
but mostly-correct reference material for this can be found at 
https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals. 

See also: search the Spark codebase for PythonRDD and look at 
python/pyspark/worker.py"




From:   Reynold Xin 
To: Amit Rana 
Cc: "dev@spark.apache.org" 
Date:   08/07/2016 07:03
Subject:Re: Understanding pyspark data flow on worker nodes



You can look into its source code: 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala


On Thu, Jul 7, 2016 at 11:01 PM, Amit Rana  
wrote:
Hi all,
Did anyone get a chance to look into it??
Any sort of guidance will be much appreciated.
Thanks,
Amit Rana
On 7 Jul 2016 14:28, "Amit Rana"  wrote:
As mentioned in the documentation:
PythonRDD objects launch Python subprocesses and communicate with them 
using pipes, sending the user's code and the data to be processed.
I am trying to understand  the implementation of how this data transfer is 
happening  using pipes.
Can anyone please guide me along that line??
Thanks, 
Amit Rana
On 7 Jul 2016 13:44, "Sun Rui"  wrote:
You can read 
https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals
For pySpark data flow on worker nodes, you can read the source code of 
PythonRDD.scala. Python worker processes communicate with Spark executors 
via sockets instead of pipes.

On Jul 7, 2016, at 15:49, Amit Rana  wrote:

Hi all,
I am trying  to trace the data flow in pyspark. I am using intellij IDEA 
in windows 7.
I had submitted  a python  job as follows:
--master local[4]  
I have made the following  insights after running the above command in 
debug mode:
->Locally when a pyspark's interpreter starts, it also starts a JVM with 
which it communicates through socket.
->py4j is used to handle this communication 
->Now this JVM acts as actual spark driver, and loads a JavaSparkContext 
which communicates with the spark executors in cluster.
In cluster I have read that the data flow between spark executors and 
python interpreter happens using pipes. But I am not able to trace that 
data flow.
Please correct me if my understanding is wrong. It would be very helpful 
if someone can help me understand the code flow for data transfer between 
the JVM and Python workers.
Thanks,
Amit Rana



Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number 
741598. 
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU


Re: Databricks SparkPerf with Spark 2.0

2016-06-14 Thread Adam Roberts
Fixed the below problem: I grepped for spark.version, noticed some 
instances of 1.5.2 being declared, and changed them to 2.0.0-preview in 
spark-tests/project/SparkTestsBuild.scala.
Next one to fix is:
16/06/14 12:52:44 INFO ContextCleaner: Cleaned shuffle 9
Exception in thread "main" java.lang.NoSuchMethodError: 
org/json4s/jackson/JsonMethods$.render$default$2(Lorg/json4s/JsonAST$JValue;)Lorg/json4s/Formats;

I'm going to log this and further progress under "Issues" for the project 
itself (I probably need to change the org.json4s version in 
SparkTestsBuild.scala, now that I know this file is super important), so 
the emails here will at least point people there.

Cheers,







From:   Adam Roberts/UK/IBM@IBMGB
To: dev <dev@spark.apache.org>
Date:   14/06/2016 12:18
Subject:Databricks SparkPerf with Spark 2.0



Hi, I'm working on having "SparkPerf" (
https://github.com/databricks/spark-perf) run with Spark 2.0, noticed a 
few pull requests not yet accepted so concerned this project's been 
abandoned - it's proven very useful in the past for quality assurance as 
we can easily exercise lots of Spark functions with a cluster (perhaps 
exposing problems that don't surface with the Spark unit tests). 

I want to use Scala 2.11.8 and Spark 2.0.0 so I'm making my way through 
various files, currently faced with a NoSuchMethod exception 

NoSuchMethodError: 
org/apache/spark/SparkContext.rddToPairRDDFunctions(Lorg/apache/spark/rdd/RDD;Lscala/reflect/ClassTag;Lscala/reflect/ClassTag;Lscala/math/Ordering;)Lorg/apache/spark/rdd/PairRDDFunctions;
 
at spark.perf.AggregateByKey.runTest(KVDataTest.scala:137) 

class AggregateByKey(sc: SparkContext) extends KVDataTest(sc) {
  override def runTest(rdd: RDD[_], reduceTasks: Int) {
rdd.asInstanceOf[RDD[(String, String)]]
  .map{case (k, v) => (k, v.toInt)}.reduceByKey(_ + _, 
reduceTasks).count()
  } 
}

Grepping shows
./spark-tests/target/streams/compile/incCompileSetup/$global/streams/inc_compile_2.10:/home/aroberts/Desktop/spark-perf/spark-tests/src/main/scala/spark/perf/KVDataTest.scala
 
-> rddToPairRDDFunctions 

The scheduling-throughput tests complete fine but the problem here is seen 
with agg-by-key (and likely other modules to fix owing to API changes 
between 1.x and 2.x which I guess is the cause of the above problem). 

Has anybody already made good progress here? Would like to work together 
and get this available for everyone, I'll be churning through it either 
way. Will be looking at HiBench also. 

Next step for me is to use sbt -Dspark.version=2.0.0 (2.0.0-preview?) and 
work from there, although I figured the prep tests stage would do this for 
me (how else is it going to build?). 

Cheers, 




Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number 
741598. 
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU



Databricks SparkPerf with Spark 2.0

2016-06-14 Thread Adam Roberts
Hi, I'm working on having "SparkPerf" (
https://github.com/databricks/spark-perf) run with Spark 2.0. I noticed a 
few pull requests have not yet been accepted, so I'm concerned this 
project's been abandoned - it's proven very useful in the past for quality 
assurance, as we can easily exercise lots of Spark functions with a 
cluster (perhaps exposing problems that don't surface with the Spark unit 
tests).

I want to use Scala 2.11.8 and Spark 2.0.0 so I'm making my way through 
various files, currently faced with a NoSuchMethod exception

NoSuchMethodError: 
org/apache/spark/SparkContext.rddToPairRDDFunctions(Lorg/apache/spark/rdd/RDD;Lscala/reflect/ClassTag;Lscala/reflect/ClassTag;Lscala/math/Ordering;)Lorg/apache/spark/rdd/PairRDDFunctions;
 
at spark.perf.AggregateByKey.runTest(KVDataTest.scala:137) 

class AggregateByKey(sc: SparkContext) extends KVDataTest(sc) {
  override def runTest(rdd: RDD[_], reduceTasks: Int) {
rdd.asInstanceOf[RDD[(String, String)]]
  .map{case (k, v) => (k, v.toInt)}.reduceByKey(_ + _, 
reduceTasks).count()
  }
}

Grepping shows
./spark-tests/target/streams/compile/incCompileSetup/$global/streams/inc_compile_2.10:/home/aroberts/Desktop/spark-perf/spark-tests/src/main/scala/spark/perf/KVDataTest.scala
 
-> rddToPairRDDFunctions 

The scheduling-throughput tests complete fine, but the problem here is 
seen with agg-by-key (and there are likely other modules to fix owing to 
API changes between 1.x and 2.x, which I guess is the cause of the above 
problem).
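
For reference, a minimal sketch (not the SparkPerf code) of the same 
pattern built against Spark 2.0: as far as I can tell the pair-RDD 
implicits now come from the org.apache.spark.rdd.RDD companion object 
rather than SparkContext, so rebuilding against 2.0 instead of running 
1.x-compiled jars is what avoids the NoSuchMethodError above.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object AggByKeySketch {
  // Same shape as AggregateByKey.runTest above; when compiled against 2.0
  // the implicit conversion to PairRDDFunctions is supplied by RDD's
  // companion object, so nothing references SparkContext.rddToPairRDDFunctions
  def runTest(rdd: RDD[(String, String)], reduceTasks: Int): Long =
    rdd.map { case (k, v) => (k, v.toInt) }
      .reduceByKey(_ + _, reduceTasks)
      .count()

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("agg-by-key sketch").setMaster("local[1]"))
    val data = sc.parallelize(Seq(("a", "1"), ("a", "2"), ("b", "3")))
    println(runTest(data, reduceTasks = 2)) // prints 2 (two distinct keys)
    sc.stop()
  }
}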

Has anybody already made good progress here? Would like to work together 
and get this available for everyone, I'll be churning through it either 
way. Will be looking at HiBench also.

Next step for me is to use sbt -Dspark.version=2.0.0 (2.0.0-preview?) and 
work from there, although I figured the prep tests stage would do this for 
me (how else is it going to build?).

Cheers,




Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number 
741598. 
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU


Caching behaviour and deserialized size

2016-05-04 Thread Adam Roberts
Hi, 

Given a very simple test that uses a bigger version of the pom.xml file in 
our Spark home directory (cat with a bash for loop into itself so it 
becomes 100 MB), I've noticed that with larger heap sizes it looks like we 
have more RDDs reported as being cached - is this intended behaviour? What 
exactly are we looking at: replicas perhaps (the resiliency in RDD), or 
partitions of the same RDD?

With a 512 MB heap (max and initial size), regardless of JDK vendor:

Looking for mybiggerpom.xml in the directory you're running this 
application from
Added broadcast_0_piece0 in memory on 10.0.2.15:35762 (size: 15.8 KB, 
free: 159.0 MB)
caching in memory
Added broadcast_1_piece0 in memory on 10.0.2.15:35762 (size: 1789.0 B, 
free: 159.0 MB)
Added rdd_1_0 in memory on 10.0.2.15:35762 (size: 110.7 MB, free: 48.3 MB)
lines.count(): 2790700

Yet if I increase it to 1024 MB (again max and initial size), I see this:

Looking for mybiggerpom.xml in the directory you're running this 
application from
Added broadcast_0_piece0 in memory on 10.0.2.15:39739 (size: 15.8 KB, 
free: 543.0 MB)
caching in memory
Added broadcast_1_piece0 in memory on 10.0.2.15:39739 (size: 1789.0 B, 
free: 543.0 MB)
Added rdd_1_0 in memory on 10.0.2.15:39739 (size: 110.7 MB, free: 432.3 
MB)
Added rdd_1_1 in memory on 10.0.2.15:39739 (size: 107.3 MB, free: 325.0 
MB)
Added rdd_1_2 in memory on 10.0.2.15:39739 (size: 107.0 MB, free: 218.1 
MB)
lines.count(): 2790700

My simple test case:
//scalastyle:off

import java.io.File
import org.apache.spark._
import org.apache.spark.rdd._

object Trimmed {

  def main(args: Array[String]) {
    val sc = new SparkContext(new SparkConf()
      .setAppName("Adam RDD cached size experiment")
      .setMaster("local[1]"))

    var fileName = "mybiggerpom.xml"
    if (args != null && args.length > 0) {
      fileName = args(0)
    }
    println("Looking for " + fileName + " in the directory you're running this application from")
    val lines = sc.textFile(fileName)
    println("caching in memory")
    lines.cache()
    println("lines.count(): " + lines.count())
  }
}
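
A small addition I could make to the test case above to see how much of 
the RDD actually ended up cached (sc.getRDDStorageInfo is a DeveloperApi; 
the field names below are taken from the RDDInfo class as I read the 2.0 
code, so treat this as a sketch):

// Report cached partition counts and sizes for every RDD the driver knows about
for (info <- sc.getRDDStorageInfo) {
  println(s"RDD ${info.id} (${info.name}): " +
    s"${info.numCachedPartitions}/${info.numPartitions} partitions cached, " +
    s"memSize=${info.memSize} bytes, diskSize=${info.diskSize} bytes")
}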

I also want to figure out where the cached RDD size value comes from; I 
noticed deserializedSize is used (in BlockManagerMasterEndpoint.scala), 
but where does this value come from? I understand SizeEstimator plays a 
big role, but despite my best efforts with IntelliJ and a lot of grepping 
it's unclear who's responsible for figuring out deserializedSize in the 
first place.

I'm using recent Spark 2.0 code; any guidance here will be appreciated, 
cheers




Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number 
741598. 
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU


Re: BytesToBytes and unaligned memory

2016-04-18 Thread Adam Roberts
Ted, yes, with the forced true value all tests pass; we use the unaligned 
check in 15 other suites.

Our java.nio.Bits.unaligned() function checks that the detected os.arch 
value matches a list of known implementations (not including s390x).

We could add it to the known architectures in the catch block, but this 
won't make a difference here: because we call unaligned() OK (no exception 
is thrown), we never reach the architecture-checking stage anyway.

I see in org.apache.spark.memory.MemoryManager that unaligned support is 
required for off-heap memory in Tungsten (perhaps incorrectly if no code 
ever exercises it in Spark?). Instead of having a hard requirement, should 
we instead log a warning once that this is likely to lead to slow 
performance? What's the rationale for supporting unaligned memory access? 
It's my understanding that it's typically very slow; are there any design 
docs or perhaps a JIRA where I can learn more? 

I will run a simple test case exercising unaligned memory access for Linux 
on Z (without using Spark) and can also run the tests claiming to require 
unaligned memory access on a platform where unaligned access is definitely 
not supported for shorts/ints/longs. 

If these tests continue to pass then I think the Spark tests don't 
exercise unaligned memory access, cheers
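
For reference, a rough sketch (not Spark code) of the kind of standalone 
probe I have in mind: read and write longs at deliberately odd (unaligned) 
off-heap addresses via sun.misc.Unsafe. The reflection trick to obtain 
theUnsafe is the usual workaround for Unsafe.getUnsafe() being 
caller-restricted:

object UnalignedProbe {
  def main(args: Array[String]): Unit = {
    // Obtain sun.misc.Unsafe via reflection; "theUnsafe" is the field name
    // used by OpenJDK-style class libraries
    val field = classOf[sun.misc.Unsafe].getDeclaredField("theUnsafe")
    field.setAccessible(true)
    val unsafe = field.get(null).asInstanceOf[sun.misc.Unsafe]

    val base = unsafe.allocateMemory(64)
    try {
      val unalignedAddress = base + 1 // deliberately not 8-byte aligned
      unsafe.putLong(unalignedAddress, 0x1122334455667788L)
      val readBack = unsafe.getLong(unalignedAddress)
      println(f"wrote 0x1122334455667788, read back 0x$readBack%016x")
    } finally {
      unsafe.freeMemory(base)
    }
  }
}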


 




From:   Ted Yu <yuzhih...@gmail.com>
To:     Adam Roberts/UK/IBM@IBMGB
Cc: "dev@spark.apache.org" <dev@spark.apache.org>
Date:   15/04/2016 17:35
Subject:Re: BytesToBytes and unaligned memory



I am curious if all Spark unit tests pass with the forced true value for 
unaligned.
If that is the case, it seems we can add s390x to the known architectures.

It would also give us some more background if you can describe 
how java.nio.Bits#unaligned() is implemented on s390x.

Josh / Andrew / Davies / Ryan are more familiar with related code. It 
would be good to hear what they think.

Thanks

On Fri, Apr 15, 2016 at 8:47 AM, Adam Roberts <arobe...@uk.ibm.com> wrote:
Ted, yeah with the forced true value the tests in that suite all pass and 
I know they're being executed thanks to prints I've added 

Cheers, 




From:Ted Yu <yuzhih...@gmail.com> 
To:Adam Roberts/UK/IBM@IBMGB 
Cc:"dev@spark.apache.org" <dev@spark.apache.org> 
Date:15/04/2016 16:43 
Subject:Re: BytesToBytes and unaligned memory 



Can you clarify whether BytesToBytesMapOffHeapSuite passed or failed with 
the forced true value for unaligned ? 

If the test failed, please pastebin the failure(s). 

Thanks 

On Fri, Apr 15, 2016 at 8:32 AM, Adam Roberts <arobe...@uk.ibm.com> wrote: 

Ted, yep I'm working from the latest code which includes that unaligned 
check, for experimenting I've modified that code to ignore the unaligned 
check (just go ahead and say we support it anyway, even though our JDK 
returns false: the return value of java.nio.Bits.unaligned()). 

My Platform.java for testing contains: 

private static final boolean unaligned; 

static { 
  boolean _unaligned; 
  // use reflection to access unaligned field 
  try { 
System.out.println("Checking unaligned support"); 
Class bitsClass = 
  Class.forName("java.nio.Bits", false, 
ClassLoader.getSystemClassLoader()); 
Method unalignedMethod = bitsClass.getDeclaredMethod("unaligned"); 
unalignedMethod.setAccessible(true); 
_unaligned = Boolean.TRUE.equals(unalignedMethod.invoke(null)); 
System.out.println("Used reflection and _unaligned is: " + 
_unaligned); 
System.out.println("Setting to true anyway for experimenting"); 
_unaligned = true; 
} catch (Throwable t) { 
  // We at least know x86 and x64 support unaligned access. 
  String arch = System.getProperty("os.arch", ""); 
  //noinspection DynamicRegexReplaceableByCompiledPattern 
  // We don't actually get here since we find the unaligned method OK 
and it returns false (I override with true anyway) 
  // but add s390x incase we somehow fail anyway. 
  System.out.println("Checking for s390x, os.arch is: " + arch); 
  _unaligned = arch.matches("^(i[3-6]86|x86(_64)?|x64|s390x|amd64)$"); 

} 
unaligned = _unaligned; 
System.out.println("returning: " + unaligned); 
  } 
} 

Output is, as you'd expect, "used reflection and _unaligned is false, 
setting to true anyway for experimenting", and the tests pass. 

No other problems on the platform (pending a different pull request). 

Cheers, 







From:Ted Yu <yuzhih...@gmail.com> 
To:Adam Roberts/UK/IBM@IBMGB 
Cc:"dev@spark.apache.org" <dev@spark.apache.org> 
Date:15/04/2016 15:32 
Subject:Re: BytesToBytes and unaligned memory 




I assume you tested 2.0 with SPARK-12181 . 

Related code from Platform.java if 

BytesToBytes and unaligned memory

2016-04-15 Thread Adam Roberts
Hi, I'm testing Spark 2.0.0 on various architectures and have a question: 
are we sure that 
core/src/test/java/org/apache/spark/unsafe/map/AbstractBytesToBytesMapSuite.java 
really is attempting to use unaligned memory access (for the 
BytesToBytesMapOffHeapSuite tests specifically)?

Our JDKs on zSystems for example return false for the 
java.nio.Bits.unaligned() method and yet if I skip this check and add 
s390x to the supported architectures (for zSystems), all thirteen tests 
here pass. 

The 13 tests here all fail as we do not pass the unaligned requirement 
(but perhaps incorrectly):
core/src/test/java/org/apache/spark/unsafe/map/BytesToBytesMapOffHeapSuite.java 
and I know the unaligned checking is at 
common/unsafe/src/main/java/org/apache/spark/unsafe/Platform.java

Either our JDK's method is returning false incorrectly or this test isn't 
using unaligned memory access (so the requirement is invalid); there's no 
mention of alignment in the test itself.

Any guidance would be very much appreciated, cheers


Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number 
741598. 
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU


Understanding PySpark Internals

2016-03-29 Thread Adam Roberts
Hi, I'm interested in figuring out how the Python API for Spark works. 
I've come to the following conclusions and want to share them with the 
community; they could be of use in the PySpark docs, specifically the 
"Execution and pipelining" part.

Any sanity checking would be much appreciated, here's the trivial Python 
example I've traced:
from pyspark import SparkContext
sc = SparkContext("local[1]", "Adam test")
sc.setCheckpointDir("foo checkpoint dir")

Added this JVM option:
export 
IBM_JAVA_OPTIONS="-Xtrace:methods={org/apache/spark/*,py4j/*},print=mt"

Prints added in py4j-java/src/py4j/commands/CallCommand.java - 
specifically in the execute method. Built and replaced existing class in 
the py4j 0.9 jar in my Spark assembly jar. Example output is:
In execute for CallCommand, commandName: c
target object id: o0
methodName: get

I'll launch the Spark application with:
$SPARK_HOME/bin/spark-submit --master local[1] Adam.py > checkme.txt 2>&1

I've quickly put together the following WIP diagram of what I think is 
happening:
http://postimg.org/image/nihylmset/

To summarise I think:
We're heavily using reflection (as evidenced by Py4j's ReflectionEngine 
and MethodInvoker classes) to invoke Spark's API in a JVM from Python
There's an agreed protocol (in Py4j's Protocol.java) for handling 
commands: said commands are exchanged using a local socket between Python 
and our JVM (the driver based on docs, not the master)
The Spark API is accessible by means of commands exchanged using said 
socket using the agreed protocol
Commands are read/written using BufferedReader/Writer
Type conversion is also performed from Python to Java (not looked at in 
detail yet)
We keep track of the objects with, for example, o0 representing the first 
object we know about

Does this sound correct?

I've only checked the trace output in local mode and am curious as to what 
happens when we're running in standalone mode (I didn't see a Python 
interpreter appearing on all workers in order to process partitions of 
data; I assume in standalone mode we use Python solely as an orchestrator 
- the driver - and not as an executor for distributed computing?).

Happy to provide the full trace output on request (omitted timestamps and 
logging info, added spacing); I expect there's an O*JDK method-tracing 
equivalent, so the above can easily be reproduced regardless of Java 
vendor.

Cheers,


Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number 
741598. 
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU


Tungsten in a mixed endian environment

2016-01-12 Thread Adam Roberts
Hi all, I've been experimenting with DataFrame operations in a mixed 
endian environment - a big endian master with little endian workers. With 
tungsten enabled I'm encountering data corruption issues.

For example, with this simple test code:

import org.apache.spark.SparkContext
import org.apache.spark._
import org.apache.spark.sql.SQLContext

object SimpleSQL {
  def main(args: Array[String]): Unit = {
if (args.length != 1) {
  println("Not enough args, you need to specify the master url")
  sys.exit(1) // exit early rather than falling through to args(0) below
}
val masterURL = args(0)
println("Setting up Spark context at: " + masterURL)
val sparkConf = new SparkConf
val sc = new SparkContext(masterURL, "Unsafe endian test", sparkConf)

println("Performing SQL tests")

val sqlContext = new SQLContext(sc)
println("SQL context set up")
val df = sqlContext.read.json("/tmp/people.json")
df.show()
println("Selecting everyone's age and adding one to it")
df.select(df("name"), df("age") + 1).show()
println("Showing all people over the age of 21")
df.filter(df("age") > 21).show()
println("Counting people by age")
df.groupBy("age").count().show()
  }
} 

Instead of getting

+----+-----+
| age|count|
+----+-----+
|null|    1|
|  19|    1|
|  30|    1|
+----+-----+

I get the following with my mixed endian set up:

+-------------------+-----------------+
|                age|            count|
+-------------------+-----------------+
|               null|                1|
|1369094286720630784|72057594037927936|
|                 30|                1|
+-------------------+-----------------+

and on another run:

+---+-----------------+
|age|            count|
+---+-----------------+
|  0|72057594037927936|
| 19|                1| 

Is Spark expected to work in such an environment? If I turn off tungsten 
(sparkConf.set("spark.sql.tungsten.enabled", "false")), in 20 runs I don't 
see any problems.

Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number 
741598. 
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU


Test workflow - blacklist entire suites and run any independently

2015-09-21 Thread Adam Roberts
Hi, is there an existing way to blacklist any test suite?

Ideally we'd have a text file with a series of names (let's say comma 
separated), and if a name matches the fully qualified class name of a 
suite, that suite would be skipped.

Perhaps we can achieve this via ScalaTest or Maven?

Currently, if a number of suites are failing, we're required to comment 
them out, commit and push this change, then kick off a Jenkins job 
(perhaps building a custom branch) - not ideal when working with Jenkins; 
it would be quicker to use a mechanism such as the one described above, as 
opposed to having a few branches that are each a little different from the 
others.

Also, how can we quickly only run any one suite within, say, sql/hive? -f 
sql/hive/pom.xml with -nsu results in compile failures each time.
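
One possible approach (a sketch, not an existing Spark mechanism) would be 
to tag the suites or tests we want to skip with a ScalaTest Tag and 
exclude that tag at run time, similar to how org.apache.spark.tags.DockerTest 
gets excluded with -Dtest.exclude.tags in the Maven test commands I've 
posted before. The tag name org.example.tags.Blacklisted below is made up 
for illustration:

import org.scalatest.{FunSuite, Tag}

object Blacklisted extends Tag("org.example.tags.Blacklisted")

class ExampleSuite extends FunSuite {
  test("fast test that always runs") {
    assert(1 + 1 == 2)
  }

  // Excludable at run time by excluding the org.example.tags.Blacklisted tag
  test("flaky test we want to be able to skip", Blacklisted) {
    assert(List(1, 2, 3).sum == 6)
  }
}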
Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number 
741598. 
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU


Re: Test workflow - blacklist entire suites and run any independently

2015-09-21 Thread Adam Roberts
Thanks Josh, I should have added that we've tried -DwildcardSuites with 
Maven and we use this helpful feature regularly (although it does result 
in building plenty of tests and running other tests in other modules too), 
so I'm wondering if there's a more "streamlined" way - e.g. with JUnit and 
Eclipse we'd just right-click one individual unit test and that would be 
run, without building again AFAIK.

Unfortunately using sbt causes a lot of pain, such as...

[error]
[error]   last tree to typer: 
Literal(Constant(org.apache.spark.sql.test.ExamplePoint))
[error]   symbol: null
[error]symbol definition: null
[error]  tpe: 
Class(classOf[org.apache.spark.sql.test.ExamplePoint])
[error]symbol owners:
[error]   context owners: class ExamplePointUDT -> package test
[error]

and then an awfully long stacktrace with plenty of errors. Must be an 
easier way...



From:   Josh Rosen <rosenvi...@gmail.com>
To: Adam Roberts/UK/IBM@IBMGB
Cc: dev <dev@spark.apache.org>
Date:   21/09/2015 19:19
Subject:Re: Test workflow - blacklist entire suites and run any 
independently



For quickly running individual suites: 
https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-RunningIndividualTests

On Mon, Sep 21, 2015 at 8:21 AM, Adam Roberts <arobe...@uk.ibm.com> wrote:
Hi, is there an existing way to blacklist any test suite? 

Ideally we'd have a text file with a series of names (let's say comma 
separated) and if a name matches with the fully qualified class name for a 
suite, this suite will be skipped. 

Perhaps we can achieve this via ScalaTest or Maven? 

Currently if a number of suites are failing we're required to comment 
these out, commit and push this change then kick off a Jenkins job 
(perhaps building a custom branch) - not ideal when working with Jenkins, 
would be quicker to use such a mechanism as described above as opposed to 
having a few branches that are a little different from others. 

Also, how can we quickly only run any one suite within, say, sql/hive? -f 
sql/hive/pom.xml with -nsu results in compile failures each time.
Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number 
741598. 
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU

