[GitHub] spark issue #14409: [SPARK-16796] [Web UI] Visible passwords on Spark enviro...

2016-08-03 Thread a-roberts
Github user a-roberts commented on the issue:

https://github.com/apache/spark/pull/14409
  
@Devian-ua 
I think that while we're in the web UI we should also look at masking the shared
secret (potentially another JIRA/contribution to follow; it's at least something
I'm interested in pursuing).

I'm also curious whether you have any plans to backport this small change to
1.5.x and 1.6.x.





[GitHub] spark issue #11956: [SPARK-14098][SQL] Generate Java code that gets a float/...

2016-08-22 Thread a-roberts
Github user a-roberts commented on the issue:

https://github.com/apache/spark/pull/11956
  
@hvanhovell @marmbrus @srowen I see this PR has been open since the 25th of
March. It provides substantial performance improvements, as mentioned above,
without introducing functional regressions. As leading SQL/community members,
what do you think?





[GitHub] spark pull request #15419: [SPARK-17828] [DOCS] Remove unused generate-chang...

2016-10-10 Thread a-roberts
GitHub user a-roberts opened a pull request:

https://github.com/apache/spark/pull/15419

[SPARK-17828] [DOCS] Remove unused generate-changelist.py

## What changes were proposed in this pull request?
We can remove this file based on the discussion at
https://issues.apache.org/jira/browse/SPARK-17828: it's evident the file has
been redundant for a while, and the JIRA release notes already serve this
purpose for us.

For ease of future reference you can find detailed release notes at, for 
example:

http://spark.apache.org/downloads.html -> 
http://spark.apache.org/releases/spark-release-2-0-1.html -> "Detailed changes" 
which links to 
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315420&version=12336857


## How was this patch tested?
Searched the codebase and saw nothing referencing this file; it hasn't been used
in a while (it was probably invoked manually a long time ago).

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/a-roberts/spark patch-7

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/15419.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #15419


commit 893c5d9ee5450e222cdfcf323c8c85b8c2b1d66b
Author: Adam Roberts 
Date:   2016-10-10T09:29:01Z

Remove unused generate-changelist.py

We can remove this based on the discussion at
https://issues.apache.org/jira/browse/SPARK-17828: it's evident the file has
been redundant for a while, and the JIRA release notes already serve this
purpose for us. For ease of future reference you can find detailed release notes
at, for example:

http://spark.apache.org/downloads.html -> 
http://spark.apache.org/releases/spark-release-2-0-1.html -> "Detailed changes" 
which links to 
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315420&version=12336857







[GitHub] spark pull request #16888: [SPARK-19552] [BUILD] Upgrade Netty version to 4....

2017-02-10 Thread a-roberts
GitHub user a-roberts opened a pull request:

https://github.com/apache/spark/pull/16888

[SPARK-19552] [BUILD] Upgrade Netty version to 4.1.8 final

## What changes were proposed in this pull request?
We should move to Netty 4.1.8 due to a security concern and other bug fixes.
We'll need to change Spark code as well as the usual miscellaneous files (such
as pom.xml and the deps files) to enable this.

## How was this patch tested?
Existing unit tests. This is a WIP: the upgrade introduces some new test
failures, and the required methods I've added currently have dummy
implementations.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/a-roberts/spark NettyUpgrade418Final

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16888.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16888


commit 5ace15b8c8abba562c0abbf598464014d09ad4d5
Author: Adam Roberts 
Date:   2017-02-09T17:07:59Z

Use netty-all 4.1.8 final

commit dc88508644102ba241d75416a56afaa695db0f45
Author: Adam Roberts 
Date:   2017-02-10T16:08:38Z

Tidy up

commit cc4fb3775eafeeb5071115e1bb988a565d94295d
Author: Adam Roberts 
Date:   2017-02-10T16:10:44Z

Upgrade deps files as well







[GitHub] spark issue #16888: [SPARK-19552] [BUILD] Upgrade Netty version to 4.1.8 fin...

2017-02-13 Thread a-roberts
Github user a-roberts commented on the issue:

https://github.com/apache/spark/pull/16888
  
Netty 4.1.1 and above have the fix (a change to OpenSslEngine); Netty 4.0.x does
not, and the implementations of OpenSslEngine differ substantially between the
releases, so simply backporting the fix would be non-trivial and could lead to
complications. Upgrading is the path of least resistance.

You can find more info on the CVE I'm concerned with at: 
https://www.versioneye.com/java/io.netty:netty-all/4.0.43.Final

To be impacted it sounds as though several things need to be true: we'd have to
use Spark with Netty 4.1.0 or below (including the 4.0.x releases), have
tcnative on our classpath, and explicitly opt in to the OpenSslEngine (there's a
useful overview of why you'd want to do this
[here](https://youtu.be/DKJ0w30M0vg?t=2969), which mentions it being a drop-in
replacement for the JDK classes with superior performance). Info on using
tcnative is [here](http://netty.io/wiki/forked-tomcat-native.html#wiki-h2-4).

We don't include any of the netty/tcnative native libraries with Spark, so I
don't think we're impacted. Moving up to take on the other fixes would still be
useful, and if the OpenSSL engine is that much faster we could see
shuffle-intensive workloads speed up by upgrading and including the natives
later on. I'll set this as a WIP and look into why the tests fail.
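
For reference, here is a minimal sketch of what explicitly opting in to the OpenSSL engine looks like. This is illustrative only: Spark does not do this by default (which is why we believe we're not exposed), and it just uses the standard Netty SslContextBuilder/SslProvider API rather than anything in this PR.
```
import io.netty.handler.ssl.{SslContextBuilder, SslProvider}

object OpenSslOptIn {
  def main(args: Array[String]): Unit = {
    // Requires netty-tcnative on the classpath; OPENSSL selects the OpenSslEngine
    // code path, which is the one fixed in Netty 4.1.1+.
    val ctx = SslContextBuilder.forClient()
      .sslProvider(SslProvider.OPENSSL)
      .build()
    println(s"SslContext implementation: ${ctx.getClass.getSimpleName}")
  }
}
```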





[GitHub] spark issue #16888: [WIP] [SPARK-19552] [BUILD] Upgrade Netty version to 4.1...

2017-02-14 Thread a-roberts
Github user a-roberts commented on the issue:

https://github.com/apache/spark/pull/16888
  
I'll close this one then until there's a need to upgrade, cheers.





[GitHub] spark pull request #16888: [WIP] [SPARK-19552] [BUILD] Upgrade Netty version...

2017-02-14 Thread a-roberts
Github user a-roberts closed the pull request at:

https://github.com/apache/spark/pull/16888





[GitHub] spark pull request #14958: [SPARK-17378] [BUILD] Upgrade snappy-java to 1.1....

2016-09-05 Thread a-roberts
GitHub user a-roberts opened a pull request:

https://github.com/apache/spark/pull/14958

[SPARK-17378] [BUILD] Upgrade snappy-java to 1.1.2.6

## What changes were proposed in this pull request?

Upgrades the Snappy version from 1.1.2.4 to 1.1.2.6. The release notes at
https://github.com/xerial/snappy-java/blob/master/Milestone.md mention "Fix a
bug in SnappyInputStream when reading compressed data that happened to have the
same first byte with the stream magic header (#142)".


## How was this patch tested?
Existing unit tests using the latest IBM Java 8 on Intel, Power and Z 
architectures (little and big-endian)

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/a-roberts/spark master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/14958.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #14958


commit 4b7ec9a9c7d136eee43adf5b6708944d76f7fe3e
Author: Adam Roberts 
Date:   2016-09-05T08:48:26Z

Update Snappy to 1.1.2.6

Pom change

commit 440b1df04f7f85a9b99b3c57e83a9ca9b19e8b76
Author: Adam Roberts 
Date:   2016-09-05T08:49:57Z

Update spark-deps-hadoop-2.2

commit 140f70acf1d1940ccbddb5408dbed9ee0feee459
Author: Adam Roberts 
Date:   2016-09-05T08:50:06Z

Update spark-deps-hadoop-2.3

commit cda5016f03f184c3bb9e35038eb00a24c1aaf970
Author: Adam Roberts 
Date:   2016-09-05T08:50:17Z

Update spark-deps-hadoop-2.4

commit 36253baf6abffa90c4480b98f38b78f2b1507447
Author: Adam Roberts 
Date:   2016-09-05T08:50:26Z

Update spark-deps-hadoop-2.6

commit 3795997a116c868343930ac9a91afb861c82705c
Author: Adam Roberts 
Date:   2016-09-05T08:50:35Z

Update spark-deps-hadoop-2.7







[GitHub] spark issue #14958: [SPARK-17378] [BUILD] Upgrade snappy-java to 1.1.2.6

2016-09-05 Thread a-roberts
Github user a-roberts commented on the issue:

https://github.com/apache/spark/pull/14958
  
Yep, builds are in progress on our farm against the latest 1.6 code using
1.1.2.6, so I'll update the PR if there are no problems.





[GitHub] spark issue #14958: [SPARK-17378] [BUILD] Upgrade snappy-java to 1.1.2.6

2016-09-05 Thread a-roberts
Github user a-roberts commented on the issue:

https://github.com/apache/spark/pull/14958
  
Good point.
https://github.com/xerial/snappy-java/commit/60cc0c2e1d1a76ae2981d0572a5164fcfdfba5f1
shows the changes that were made, so we could add a checkMagicHeader equivalent
to the existing Spark tests; I'll look into it.





[GitHub] spark pull request #14961: [SPARK-17379] [BUILD] Upgrade netty-all to 4.0.41...

2016-09-05 Thread a-roberts
GitHub user a-roberts opened a pull request:

https://github.com/apache/spark/pull/14961

[SPARK-17379] [BUILD] Upgrade netty-all to 4.0.41 final for bug fixes

## What changes were proposed in this pull request?
Upgrade netty-all to the latest in the 4.0.x line, which is 4.0.41. The release
announcement mentions several bug fixes and performance improvements we may find
useful; see netty.io/news/2016/08/29/4-0-41-Final-4-1-5-Final.html. I initially
tried 4.1.5 but noticed it's not backwards compatible.

## How was this patch tested?
Existing unit tests against branch-1.6 and branch-2.0 using IBM Java 8 on 
Intel, Power and Z architectures

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/a-roberts/spark netty

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/14961.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #14961


commit 38ca07b1f65d4d16d8ccf0a5cfbc570c2c7882e1
Author: Adam Roberts 
Date:   2016-09-05T13:08:05Z

Upgrade netty-all to 4.0.41 final for bug fixes







[GitHub] spark issue #14961: [SPARK-17379] [BUILD] Upgrade netty-all to 4.0.41 final ...

2016-09-06 Thread a-roberts
Github user a-roberts commented on the issue:

https://github.com/apache/spark/pull/14961
  
Thanks, so are we saying Netty 4.0.29 can't be upgraded to 4.0.41 without
breaking changes? That's not even a minor version change...

On branch-1.6 with the Netty change I see 8477 tests and two failures (the flaky
network events test and the DateTimeUtilsSuite "to UTC timestamp" test, both
unrelated to this).

Against master I see 11,148 tests, again with unrelated failures
(furtherRequestsDelay, hive metastore warehouse dir, executor allocation manager
basic functionality, ReplSuite clone and clean line object).

I'm using two maven commands to first build and then run
```
mvn -T 1C ${R_PROFILE} -Pyarn -Phadoop-${HADOOP_VERSION} -Phive 
-Phive-thriftserver -DskipTests -Dscala-$SCALA_VERSION clean package
```

```
mvn -Pyarn -Phadoop-${HADOOP_VERSION} -Phive -Phive-thriftserver 
-Dscala-$SCALA_VERSION -Dtest.exclude.tags=org.apache.spark.tags.DockerTest 
${TESTS_RUN_OPTIONS} -fn test
```

In this case the profiles used are **Hadoop 2.6** with Scala 2.10 on branch-1.6,
then **Hadoop 2.7** with Scala 2.11 on branch-2.0, with no additional test run
options. Therefore I think it comes down to the Hadoop version we use, because
in the community job I see:
```
**-Phadoop-2.3** -Phive -Pyarn -Pmesos -Phive-thriftserver -Pkinesis-asl 
-Dtest.exclude.tags=org.apache.spark.tags.ExtendedHiveTest,org.apache.spark.tags.ExtendedYarnTest
 test
```

If that's the case then we should not upgrade the Netty version until we either
drop support for Hadoop 2.3 and below (perhaps the problem appears with 2.4 too)
or make the necessary changes in the Spark codebase to address the issues seen
in the above jobs using Hadoop 2.3.






[GitHub] spark issue #14961: [SPARK-17379] [BUILD] Upgrade netty-all to 4.0.41 final ...

2016-09-06 Thread a-roberts
Github user a-roberts commented on the issue:

https://github.com/apache/spark/pull/14961
  
In the description I mentioned that for testing I used "Existing unit tests
against branch-1.6 and branch-2.0 using IBM Java 8 on Intel, Power and Z
architectures". To clarify, I only used Hadoop 2.6 and Hadoop 2.7, but I see the
pull request builder uses Hadoop 2.3 and a dozen failures happened there.





[GitHub] spark issue #14961: [SPARK-17379] [BUILD] Upgrade netty-all to 4.0.41 final ...

2016-09-06 Thread a-roberts
Github user a-roberts commented on the issue:

https://github.com/apache/spark/pull/14961
  
Thanks, I did a Ctrl-F for "** fail"; you'd have a better idea of what the known
flaky tests are on this farm, though. My quick check:

- using external shuffle service -> looks to be a timeout

These six mention a stopped Spark context (perhaps the previous test 
doesn't clean up properly and causes this):
- kryo objects are serialised consistently in different processes *** 
FAILED *** (1 minute) -> never seen
- cluster mode, FIFO scheduler *** FAILED *** (1 minute) -> never seen
- cluster mode, fair scheduler *** FAILED *** -> never seen
- verify that correct log urls get propagated from workers *** FAILED *** 
(1 minute) -> never seen
- verify that log urls reflect SPARK_PUBLIC_DNS (SPARK-6175) *** FAILED *** 
(1 minute) -> never seen
- task throws not serializable exception *** FAILED *** (1 minute) -> never 
seen

- simple groupByKey *** FAILED *** (1 minute)
[info]   java.lang.NullPointerException:
[info]   at org.apache.spark.SparkContext.<init>(SparkContext.scala:552)
-> different causes now, never seen

- onTaskGettingResult() called when result fetched remotely *** FAILED *** 
(135 milliseconds) -> never seen

- set spark.sql.warehouse.dir *** FAILED *** (5 minutes, 0 seconds) -> 
never seen
- set hive.metastore.warehouse.dir *** FAILED *** (10 seconds, 552 
milliseconds) -> never seen





[GitHub] spark issue #14958: [SPARK-17378] [BUILD] Upgrade snappy-java to 1.1.2.6

2016-09-06 Thread a-roberts
Github user a-roberts commented on the issue:

https://github.com/apache/spark/pull/14958
  
I'm almost done with adding the test case anyway; it involved a quick rewrite to
Scala code, and I had fun rewriting the test's for loop into a while. I _think_
the bug would manifest itself if your very first byte happened to be -126 and
you're using Snappy.

At the very least I brushed up on my Scala skills, and I think it's good
practice to add a test alongside an implementation change, so I'll update the PR
once it's done. If Snappy were to change (e.g. a 1.1.3 release that alters the
magic header logic), we'd at least be another project that picks up the problem.





[GitHub] spark issue #14958: [SPARK-17378] [BUILD] Upgrade snappy-java to 1.1.2.6

2016-09-06 Thread a-roberts
Github user a-roberts commented on the issue:

https://github.com/apache/spark/pull/14958
  
It's a straight rewrite, so it just tests that the Snappy version we include has
the magic header handling in; I kept it simple.





[GitHub] spark pull request #14958: [SPARK-17378] [BUILD] Upgrade snappy-java to 1.1....

2016-09-06 Thread a-roberts
Github user a-roberts commented on a diff in the pull request:

https://github.com/apache/spark/pull/14958#discussion_r77660745
  
--- Diff: core/src/test/scala/org/apache/spark/io/CompressionCodecSuite.scala ---
@@ -130,4 +130,58 @@ class CompressionCodecSuite extends SparkFunSuite {
     ByteStreams.readFully(concatenatedBytes, decompressed)
     assert(decompressed.toSeq === (0 to 127))
   }
+
+  // Based on https://github.com/xerial/snappy-java/blob/60cc0c2e1d1a76ae2981d0572a5164fcfdfba5f1/src/test/java/org/xerial/snappy/SnappyInputStreamTest.java
+  test("SPARK 17378: snappy-java should handle magic header when reading stream") {
+    val b = new ByteArrayOutputStream()
+    // Write uncompressed length beginning with -126 (the same with magicheader[0])
+    b.write(-126) // Can't access magic header[0] as it isn't public, so access this way
+    b.write(0x01)
+    // uncompressed data length = 130
+
+    var data = new ByteArrayOutputStream()
+
+    for (i <- 0 until 130) {
+      data.write('A')
+    }
+
+    var dataMoreThan8Len = data.toByteArray()
+
+    // write literal (lower 2-bit of the first tag byte is 00, upper 6-bits represents data size)
+    b.write(60<<2) // 1-byte data length follows
+    b.write(dataMoreThan8Len.length-1) // subsequent data length
+    b.write(dataMoreThan8Len)
+
+    var compressed = b.toByteArray()
+
+    // This should succeed
+    assert(dataMoreThan8Len === org.xerial.snappy.Snappy.uncompress(compressed))
+
+    // Reproduce error in #142
+    val in = new org.xerial.snappy.SnappyInputStream(new ByteArrayInputStream(b.toByteArray()))
+
+    var uncompressed = readFully(in)
+    assert(dataMoreThan8Len === uncompressed) // this fails as uncompressed is empty
+  }
+
+  private def readFully(input: InputStream): Array[Byte] = {
+    try {
+      val out = new ByteArrayOutputStream()
+      var buf = new Array[Byte](4096)
+
+      var readBytes = 0
+      while (readBytes != -1) {
--- End diff --

This loop makes me weep; it's an attempt at rewriting the following Java in
Scala (a more idiomatic alternative is sketched below):

 for (int readBytes = 0; (readBytes = input.read(buf)) != -1; ) {
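
For what it's worth, here is a minimal sketch of a more idiomatic Scala version of that read-until-EOF pattern. It is not the code in this PR, just an illustration of the same loop using Iterator.continually:
```
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, InputStream}

object ReadFullySketch {
  // Read the stream to EOF, copying chunk by chunk into an in-memory buffer.
  def readFully(input: InputStream): Array[Byte] = {
    val out = new ByteArrayOutputStream()
    val buf = new Array[Byte](4096)
    Iterator
      .continually(input.read(buf)) // keep reading chunks...
      .takeWhile(_ != -1)           // ...until read() signals EOF
      .foreach(n => out.write(buf, 0, n))
    out.toByteArray
  }

  def main(args: Array[String]): Unit = {
    val bytes = readFully(new ByteArrayInputStream("hello".getBytes("UTF-8")))
    println(new String(bytes, "UTF-8")) // prints "hello"
  }
}
```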





[GitHub] spark pull request #14958: [SPARK-17378] [BUILD] Upgrade snappy-java to 1.1....

2016-09-06 Thread a-roberts
Github user a-roberts commented on a diff in the pull request:

https://github.com/apache/spark/pull/14958#discussion_r77662239
  
--- Diff: core/src/test/scala/org/apache/spark/io/CompressionCodecSuite.scala ---
@@ -130,4 +130,58 @@ class CompressionCodecSuite extends SparkFunSuite {
     ByteStreams.readFully(concatenatedBytes, decompressed)
     assert(decompressed.toSeq === (0 to 127))
   }
+
+  // Based on https://github.com/xerial/snappy-java/blob/60cc0c2e1d1a76ae2981d0572a5164fcfdfba5f1/src/test/java/org/xerial/snappy/SnappyInputStreamTest.java
+  test("SPARK 17378: snappy-java should handle magic header when reading stream") {
+    val b = new ByteArrayOutputStream()
+    // Write uncompressed length beginning with -126 (the same with magicheader[0])
+    b.write(-126) // Can't access magic header[0] as it isn't public, so access this way
--- End diff --

Yeah, I agree. How about we just revert the test case commit here and merge the
1.1.2.6 change itself, since folks want it, and then add an extra robustness
test in a later PR if we want to?





[GitHub] spark issue #14961: [SPARK-17379] [BUILD] Upgrade netty-all to 4.0.41 final ...

2016-09-12 Thread a-roberts
Github user a-roberts commented on the issue:

https://github.com/apache/spark/pull/14961
  
Sean, yep, I've had trouble reproducing it too. I kicked off a bunch of builds
over the weekend, including one using Hadoop 2.3, which was my initial theory
(it's the only difference between our testing environments apart from the
options I mention below).

I'll add
```
  static {
System.setProperty("io.netty.recycler.maxCapacity", "0");
  }
```
in TransportConf then build and test locally before updating this.

FWIW I use these Java options for testing as our boxes have limited memory:

-Xss2048k **-Dspark.buffer.pageSize=1048576** -Xmx4g





[GitHub] spark issue #14961: [SPARK-17379] [BUILD] Upgrade netty-all to 4.0.41 final ...

2016-09-12 Thread a-roberts
Github user a-roberts commented on the issue:

https://github.com/apache/spark/pull/14961
  
No new test failures in my runs today, ranging from Hadoop 2.3 to Hadoop 2.7, so
I've pushed the commit above.





[GitHub] spark pull request #15079: [SPARK-17524] [TESTS] Use specified spark.buffer....

2016-09-13 Thread a-roberts
GitHub user a-roberts opened a pull request:

https://github.com/apache/spark/pull/15079

[SPARK-17524] [TESTS] Use specified spark.buffer.pageSize

## What changes were proposed in this pull request?

This PR has the appendRowUntilExceedingPageSize test in
RowBasedKeyValueBatchSuite use whatever spark.buffer.pageSize value a user has
specified, preventing a test failure for anyone testing Apache Spark on a box
with a reduced page size. The test is currently hardcoded to the default page
size of 64 MB, so this minor PR is a test improvement.

## How was this patch tested?
Existing unit tests with 1 MB page size and with 64 MB (the default) page 
size

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/a-roberts/spark patch-5

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/15079.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #15079


commit c793efe6e567ba3fc51ceaf9ff225ca1e5f35f10
Author: Adam Roberts 
Date:   2016-09-13T12:31:06Z

Use specified spark.buffer.pageSize

Users can change the spark.buffer.pageSize value to run Spark unit tests on
machines with less powerful hardware, e.g. with two cores.

To address this we can use whatever value they've specified, avoiding a failure
where the number of rows for the appendRowUntilExceedingPageSize test is not as
expected.







[GitHub] spark issue #14961: [SPARK-17379] [BUILD] Upgrade netty-all to 4.0.41 final ...

2016-09-13 Thread a-roberts
Github user a-roberts commented on the issue:

https://github.com/apache/spark/pull/14961
  
[info] - using external shuffle service *** FAILED *** (1 minute)
[info]   java.util.concurrent.TimeoutException: Can't find 2 executors before 60000 milliseconds elapsed

60 seconds really is an eternity. I can't reproduce this on my local set-up; I
suspect we've got a deadlock after the upgrade, and it would require some proper
debugging (again, if only I could reproduce it on my test systems, where I have
access to tools like gdb, Healthcenter and the servicing APIs we use here). My
systems have between two and eight cores and I know this farm has a lot more
available... it could be that having more cores increases the chance of thread
contention.

I had a look at other pull requests being tested and the test typically
completes in a few seconds on a good run:

using external shuffle service (3 seconds, 822 milliseconds)
https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3258/consoleText

using external shuffle service (4 seconds, 543 milliseconds)
https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3233/consoleText





[GitHub] spark pull request #15079: [SPARK-17524] [TESTS] Use specified spark.buffer....

2016-09-13 Thread a-roberts
Github user a-roberts commented on a diff in the pull request:

https://github.com/apache/spark/pull/15079#discussion_r78551292
  
--- Diff: sql/catalyst/src/test/java/org/apache/spark/sql/catalyst/expressions/RowBasedKeyValueBatchSuite.java ---
@@ -338,15 +338,18 @@ public void appendRowUntilExceedingCapacity() throws Exception {
 
   @Test
   public void appendRowUntilExceedingPageSize() throws Exception {
+    // Use of this property prevents issues when running on a two core machine
+    // A developer may override this size to 1 MB not 64 MB, so use that if so
+    int pageSizeToUse = Integer.parseInt(System.getProperty("spark.buffer.pageSize"));
--- End diff --

Yeah, that's right; I couldn't figure out how to get the SparkConf from that
test, though. I'll have another look now.





[GitHub] spark pull request #15079: [SPARK-17524] [TESTS] Use specified spark.buffer....

2016-09-13 Thread a-roberts
Github user a-roberts commented on a diff in the pull request:

https://github.com/apache/spark/pull/15079#discussion_r78563225
  
--- Diff: sql/catalyst/src/test/java/org/apache/spark/sql/catalyst/expressions/RowBasedKeyValueBatchSuite.java ---
@@ -338,15 +338,18 @@ public void appendRowUntilExceedingCapacity() throws Exception {
 
   @Test
   public void appendRowUntilExceedingPageSize() throws Exception {
+    // Use of this property prevents issues when running on a two core machine
+    // A developer may override this size to 1 MB not 64 MB, so use that if so
+    int pageSizeToUse = Integer.parseInt(System.getProperty("spark.buffer.pageSize"));
--- End diff --

Running the tests now with
```
int pageSizeToUse = (int) memoryManager.pageSizeBytes();
```
Looking at the logic in there, I see we either use the default page size or
whatever the user specifies.

An int is accepted for the allocate method in RowBasedKeyValueBatch; given that
the max int in Java is roughly 2,147 million (2,147,483,647), I doubt this will
be a problem.
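
A quick sanity check of that bound (a throwaway sketch, not part of this PR):
```
// Any realistic spark.buffer.pageSize fits comfortably in an Int, so the narrowing
// cast from pageSizeBytes() (a Long) is safe for this test.
object PageSizeCastCheck extends App {
  val defaultPageSize = 64L * 1024 * 1024   // 64 MB default
  val smallOverride   = 1L * 1024 * 1024    // e.g. -Dspark.buffer.pageSize=1048576
  println(s"Int.MaxValue = ${Int.MaxValue}") // 2147483647, about 2 GB
  Seq(defaultPageSize, smallOverride).foreach { size =>
    require(size <= Int.MaxValue, s"$size bytes would overflow an Int")
    println(s"$size bytes -> ${size.toInt} as an Int")
  }
}
```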





[GitHub] spark issue #14961: [SPARK-17379] [BUILD] Upgrade netty-all to 4.0.41 final ...

2016-09-13 Thread a-roberts
Github user a-roberts commented on the issue:

https://github.com/apache/spark/pull/14961
  
Yep that makes more sense, UnpooledByteBufAllocator usage coming up





[GitHub] spark issue #14961: [SPARK-17379] [BUILD] Upgrade netty-all to 4.0.41 final ...

2016-09-13 Thread a-roberts
Github user a-roberts commented on the issue:

https://github.com/apache/spark/pull/14961
  
Had a look at how to do this:

https://github.com/netty/netty/blob/a01519e4f86690323647b5db45d9ffcb184b1a84/buffer/src/main/java/io/netty/buffer/ByteBufUtil.java

so I'll set io.netty.allocator.type to unpooled, and will make this change in
TransportConf again (removing the recycler experiment).
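
Concretely, that's something along these lines (a sketch in Scala of the initialiser being described; not the final code in this PR, and TransportConf itself is Java):
```
object NettyAllocatorDefaults {
  // Default Netty to the unpooled allocator unless the user has explicitly chosen one;
  // io.netty.allocator.type is the system property read by Netty's ByteBufUtil.
  def applyDefaults(): Unit = {
    if (System.getProperty("io.netty.allocator.type") == null) {
      System.setProperty("io.netty.allocator.type", "unpooled")
    }
  }
}
```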





[GitHub] spark issue #15079: [SPARK-17524] [TESTS] Use specified spark.buffer.pageSiz...

2016-09-13 Thread a-roberts
Github user a-roberts commented on the issue:

https://github.com/apache/spark/pull/15079
  
The above failure (65319) shouldn't happen now, as I'm using
memoryManager.pageSizeBytes.





[GitHub] spark issue #14961: [SPARK-17379] [BUILD] Upgrade netty-all to 4.0.41 final ...

2016-09-14 Thread a-roberts
Github user a-roberts commented on the issue:

https://github.com/apache/spark/pull/14961
  
Thanks @zsxwing, I've removed our older experiments in favour of this one

For future reference here is the context of how that option is used: 
https://github.com/netty/netty/blob/e7449b1ef361c55457ed21d44d6ed8387ec1fa45/common/src/main/java/io/netty/util/internal/PlatformDependent.java





[GitHub] spark pull request #15094: [SPARK-17534] [TESTS] Increase timeouts for Direc...

2016-09-14 Thread a-roberts
GitHub user a-roberts opened a pull request:

https://github.com/apache/spark/pull/15094

[SPARK-17534] [TESTS] Increase timeouts for DirectKafkaStreamSuite tests

## What changes were proposed in this pull request?
There are two tests in this suite that are particularly flaky on the following
hardware:

2x Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz and 16 GB of RAM, 1 TB HDD

This simple PR increases the timeouts and batch duration so the tests can
reliably pass.

## How was this patch tested?
Existing unit tests on the two-core box where I was often seeing the failures.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/a-roberts/spark patch-6

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/15094.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #15094


commit 9ddecc6338f17b65c7b74d0e8b1854eb7f36b945
Author: Adam Roberts 
Date:   2016-09-14T13:15:18Z

[SPARK-17534] [TESTS] Increase timeouts for DirectKafkaStreamSuite tests

There are two tests in this suite that are particularly flaky on the 
following hardware:

2x Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz and 16 GB of RAM, 1 TB HDD

This simple PR increases the timeout times and batch duration so they can 
reliably pass







[GitHub] spark issue #16196: [SPARK-18231] Optimise SizeEstimator implementation

2017-01-02 Thread a-roberts
Github user a-roberts commented on the issue:

https://github.com/apache/spark/pull/16196
  
It's on my to-do list; I'll be working on this once I'm back from an end-of-year
break. Can I get a list of concerns here please? I'll run with them. I think one
concern is that we'll underestimate sizes, leading to insufficient-memory
problems at runtime.

Once we figure this out I can add the scenario(s) above as unit tests to ensure
any changes here are entirely correct.





[GitHub] spark issue #16196: [SPARK-18231] Optimise SizeEstimator implementation

2017-01-09 Thread a-roberts
Github user a-roberts commented on the issue:

https://github.com/apache/spark/pull/16196
  
We can close it for now and I'll reopen it once the changes are more
conservative and I've done plenty of testing/profiling. Alternatively, anybody
else could take it on by keeping only the changes we deem safe; unfortunately I
have too much other Spark work on at the moment to give this PR the attention it
deserves.





[GitHub] spark issue #16212: [SPARK-18782] [BUILD] Bump Hadoop 2.6 version to use Had...

2017-01-16 Thread a-roberts
Github user a-roberts commented on the issue:

https://github.com/apache/spark/pull/16212
  
@srowen There's a mention
[here](http://www.openwall.com/lists/oss-security/2017/01/10/3) that the YARN
NodeManager and CredentialProvider classes present a risk (we bundle and provide
the latter, org.apache.hadoop.security.alias.CredentialProvider). I see no
direct uses in the Spark code, but I think somebody could use the
CredentialProvider we bundle and be impacted.

Bumping up to Hadoop 2.6.5 now would shield us from the potentially relevant
CVEs that keep popping up and only affect classes in Hadoop 2.6.4 and below (and
would save us time investigating them).
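
To illustrate the exposure path, here is a hedged sketch only: the configuration key and getPassword call are standard Hadoop API, the credential store path is hypothetical, and Spark itself doesn't do this directly.
```
import org.apache.hadoop.conf.Configuration

// How an application running on Spark could end up exercising the bundled
// CredentialProvider machinery: resolving a secret via Configuration.getPassword,
// which walks the providers on hadoop.security.credential.provider.path.
object CredentialLookupSketch {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    // Hypothetical JCEKS credential store; resolving it uses the provider classes
    // flagged in the advisory for Hadoop 2.6.4 and below.
    conf.set("hadoop.security.credential.provider.path", "jceks://file/tmp/creds.jceks")
    val password = Option(conf.getPassword("my.jdbc.password")).map(new String(_))
    println(password.getOrElse("no credential found"))
  }
}
```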





[GitHub] spark pull request #16616: [SPARK-18782] [BUILD] Bump Hadoop 2.6 version to ...

2017-01-17 Thread a-roberts
GitHub user a-roberts opened a pull request:

https://github.com/apache/spark/pull/16616

[SPARK-18782] [BUILD] Bump Hadoop 2.6 version to use Hadoop 2.6.5

**What changes were proposed in this pull request?**

Use Hadoop 2.6.5 for the Hadoop 2.6 profile, I see a bunch of fixes 
including security ones in the release notes that we should pick up

**How was this patch tested?**

Running the unit tests now with IBM's SDK for Java; let's see what happens with
OpenJDK in the community builder. I'm expecting no trouble as it is only a
maintenance release.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/a-roberts/spark Hadoop265Bumper

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16616.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16616


commit d777246ea7c2b0c6acb31c8cb7e1caee4f3fc9fb
Author: Adam Roberts 
Date:   2016-12-08T10:32:16Z

Use Hadoop 2.6.5







[GitHub] spark issue #16212: [SPARK-18782] [BUILD] Bump Hadoop 2.6 version to use Had...

2017-01-17 Thread a-roberts
Github user a-roberts commented on the issue:

https://github.com/apache/spark/pull/16212
  
Created again [here](https://github.com/apache/spark/pull/16616) as I can't
reopen this myself or push to the branch without making changes.





[GitHub] spark issue #15713: [SPARK-18196] [CORE] Optimise CompactBuffer implementati...

2016-11-03 Thread a-roberts
Github user a-roberts commented on the issue:

https://github.com/apache/spark/pull/15713
  
The benchmark used is HiBench 5, which works with Spark 1.6; I expect HiBench 6
will be officially available soon with [Spark 2
support](https://github.com/intel-hadoop/HiBench/pull/337).

We can see
[here](https://github.com/intel-hadoop/HiBench/blob/d076a43995fc59c3edc00b6d7ceecdb1572df406/workloads/pagerank/spark/scala/bin/run.sh)
that the Spark PageRank
[example](https://github.com/intel-hadoop/HiBench/blob/d076a43995fc59c3edc00b6d7ceecdb1572df406/src/sparkbench/src/main/scala/org/apache/spark/examples/SparkPageRank.scala)
is used and that the main functions, in order of their occurrence, are textFile,
map, split, slice, **groupByKey**, mapValues, join, flatMap, reduceByKey.
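
For context, here is a condensed, self-contained sketch of what that PageRank example does (the shape is assumed from the standard Spark example rather than HiBench's copy verbatim), showing where groupByKey, and hence CompactBuffer, comes into play:
```
import org.apache.spark.{SparkConf, SparkContext}

object PageRankSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("PageRankSketch").setMaster("local[*]"))
    val iterations = 10
    // Stand-in for textFile(input): each line is "source destination".
    val lines = sc.parallelize(Seq("a b", "b c", "c a", "a c"))
    val links = lines.map { line =>
      val parts = line.split("\\s+"); (parts(0), parts(1))
    }.distinct().groupByKey().cache() // groupByKey builds a CompactBuffer of neighbours per key

    var ranks = links.mapValues(_ => 1.0)
    for (_ <- 1 to iterations) {
      val contribs = links.join(ranks).values.flatMap { case (urls, rank) =>
        urls.map(url => (url, rank / urls.size))
      }
      ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
    }
    ranks.collect().foreach { case (url, rank) => println(s"$url has rank $rank") }
    sc.stop()
  }
}
```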





[GitHub] spark issue #15736: [SPARK-18224] [CORE] Optimise PartitionedPairBuffer impl...

2016-11-09 Thread a-roberts
Github user a-roberts commented on the issue:

https://github.com/apache/spark/pull/15736
  
Sean, those are great suggestions, thanks -- I'll find the time (as with the
other outstanding pull requests) to get your feedback integrated, tested and
profiled; I'm currently caught up in packaging our own Apache Spark releases for
both 1.6.3 and 2.0.2. I also have a JIRA to create proposing regular performance
runs against the latest Spark snapshot builds to track regressions (I already
have this all set up with scripts).





[GitHub] spark issue #15736: [SPARK-18224] [CORE] Optimise PartitionedPairBuffer impl...

2016-11-19 Thread a-roberts
Github user a-roberts commented on the issue:

https://github.com/apache/spark/pull/15736
  
I'm resuming the work on all of these related PRs this week, after the London
Spark meetup on Wednesday. If you are keen to take it on I'm more than happy to
help out, and I'll share some information here that you and others should find
useful.

**Useful tools**
As well as simple microbenchmarking we use the [Linux perf
tools](https://perf.wiki.kernel.org/index.php/Main_Page),
[tprof](http://perfinsp.sourceforge.net/tprof.html) with [Visual Performance
Analyzer](http://www.scalabiliti.com/library/visual_performance_analyzer), and
also IBM's
[Healthcenter](https://www.ibm.com/developerworks/java/jdk/tools/healthcenter/)
for Java for method profiling (this is bundled with the JDK; you provide
-Xhealthcenter as a driver/executor option and then open the resulting files in
the tool).

**Benchmarks**
We'd want to run our improvement ideas with and without the changes using
[HiBench 6](https://github.com/intel-hadoop/HiBench) (large profile) and
[SparkSqlPerf](https://github.com/databricks/spark-sql-perf) against all 100
TPCDS queries.





[GitHub] spark issue #15736: [SPARK-18224] [CORE] Optimise PartitionedPairBuffer impl...

2016-11-25 Thread a-roberts
Github user a-roberts commented on the issue:

https://github.com/apache/spark/pull/15736
  
Back to working on the performance-related JIRAs now; based on the helpful
comments above, here's what I'll do.

Remove the .get.compare from the loop as suggested above - we'll do a single
.get up front to obtain the comparator, eliminating the .get calls later.

Move the duplicated code into the WritablePartitionedPairCollection object so
that the two methods optimised here call the new method (let's say it's called
getComparator) before returning accordingly; both methods are the same apart
from the final few lines, and a sketch follows the snippets below.

PartitionedAppendOnlyMap returns 
```
destructiveSortedIterator(comparator)
```

and PartitionedPairBuffer returns:

```
new Sorter(new KVArraySortDataFormat[(Int, K), AnyRef]).sort(data, 0, 
curSize, comparator)
iterator
```

I'll then build/test/profile this again
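
A minimal sketch of the shared helper being described, assuming the existing partitionComparator/partitionKeyComparator helpers in WritablePartitionedPairCollection (getComparator is just the name proposed in this thread, not committed code):
```
import java.util.Comparator

// Resolve the comparator once up front rather than calling keyComparator.get.compare
// inside the sort path of each collection.
object WritablePartitionedPairCollectionSketch {
  // Sort purely by partition ID.
  def partitionComparator[K]: Comparator[(Int, K)] = new Comparator[(Int, K)] {
    override def compare(a: (Int, K), b: (Int, K)): Int = a._1 - b._1
  }

  // Sort by partition ID, then by key within a partition.
  def partitionKeyComparator[K](keyComparator: Comparator[K]): Comparator[(Int, K)] =
    new Comparator[(Int, K)] {
      override def compare(a: (Int, K), b: (Int, K)): Int = {
        val partitionDiff = a._1 - b._1
        if (partitionDiff != 0) partitionDiff else keyComparator.compare(a._2, b._2)
      }
    }

  // The proposed getComparator: both PartitionedAppendOnlyMap and PartitionedPairBuffer
  // would call this once, then sort with the result as shown above.
  def getComparator[K](keyComparator: Option[Comparator[K]]): Comparator[(Int, K)] =
    keyComparator.map(partitionKeyComparator).getOrElse(partitionComparator)
}
```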





[GitHub] spark issue #15736: [SPARK-18224] [CORE] Optimise PartitionedPairBuffer impl...

2016-11-25 Thread a-roberts
Github user a-roberts commented on the issue:

https://github.com/apache/spark/pull/15736
  
I've conducted a lot of performance tests and gathered .hcd files so I can
investigate next week. It looks like either the first commit is the best for
performance, or my current configuration with this benchmark doesn't let us
infer whether the changes really make a difference.

Sharing some raw data; the columns are: benchmark name, date, time, data size in
bytes (the same each run), elapsed time in seconds, and throughput (bytes per
second).

**With the above suggestions for Partitioned*Buffer**
```
ScalaSparkPagerank 2016-11-25 18:49:23  259928115  49.577  5242917
ScalaSparkPagerank 2016-11-25 18:56:55  259928115  49.946  5204182
ScalaSparkPagerank 2016-11-25 19:00:04  259928115  46.510  5588650
ScalaSparkPagerank 2016-11-25 19:02:23  259928115  49.018  5302707
ScalaSparkPagerank 2016-11-25 19:05:25  259928115  49.270  5275585
```

**Vanilla, no changes at all**
```
ScalaSparkPagerank 2016-11-25 19:08:45  259928115  48.068  5407508
ScalaSparkPagerank 2016-11-25 19:11:20  259928115  47.712  5447856
ScalaSparkPagerank 2016-11-25 19:13:50  259928115  44.517  5838850
ScalaSparkPagerank 2016-11-25 19:16:07  259928115  49.942  5204599
ScalaSparkPagerank 2016-11-25 19:19:08  259928115  48.521  5357023
```

**Original commit**
```
ScalaSparkPagerank 2016-11-25 19:47:59  259928115  45.486  5714464
ScalaSparkPagerank 2016-11-25 19:50:48  259928115  48.507  5358569
ScalaSparkPagerank 2016-11-25 19:53:09  259928115  47.063  5522982
ScalaSparkPagerank 2016-11-25 19:56:58  259928115  46.154  5631757
ScalaSparkPagerank 2016-11-25 20:00:01  259928115  48.935  5311701
```

In Healthcenter I do see that these methods are still great candidates for
optimisation, as they are all very commonly used.

I'm open to more suggestions: I have exclusive access to lots of hardware, can
easily churn out more custom builds, and have lots of profiling software we can
use. I'll be committing code for the SizeEstimator soon, as that's a good
candidate for optimisation here as well.





[GitHub] spark issue #15713: [SPARK-18196] [CORE] Optimise CompactBuffer implementati...

2016-11-28 Thread a-roberts
Github user a-roberts commented on the issue:

https://github.com/apache/spark/pull/15713
  
I'll add a print to keep track of how big the CompactBuffer actually gets when
used with, say, SparkSqlPerf (scale factor 1, then scale factor 2) and HiBench.

Once this is complete and we have an idea of how big it gets for these
workloads, I'll run the unit tests with an ArrayBuffer(2) replacing all
CompactBuffer instances. I have the code ready, and I'm hoping to soon finish up
the Partitioned* class PR to speed up the PageRank implementation and to
contribute the first phase of improvements to the SizeEstimator.





[GitHub] spark issue #15736: [SPARK-18224] [CORE] Optimise PartitionedPairBuffer impl...

2016-11-28 Thread a-roberts
Github user a-roberts commented on the issue:

https://github.com/apache/spark/pull/15736
  
Before progressing I've discussed what I'm seeing with our JIT compiler 
team, with the refactoring to reduce code duplication, the following occurs 
which solves some of the mystery -- although it's bad news as, like you, I 
wanted to remove the duplicate method.

Summarising:

> By having both of these classes share the getComparator method one level 
up in the hierarchy, the JIT profiling won't function as expected. 
> 
> Let's assume we have A (calling method) -> getComparator -> B (returns 
the comparator) and then you have C (another calling method) -> getComparator 
-> D (returns the comparator).
> 
> A and C are the actual methods calling getComparator. B and D are the 
comparators that are passed in.
> 
> As a JIT compiler, if we profile getComparator on its own I will see it 
calling B **and** calling D, since profiling in most JITs is context insensitive, 
and we think that's happening here. 
> 
> When we inline getComparator into A and into C, I don't know whether I 
should inline B or D, and given that inlining is critical for performance we 
see the slight drop in performance. Inlining is critical for eliminating call 
overheads, improving code locality, and increasing the scope for optimisation.
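
A toy sketch of the call shape being described (illustrative names only, not 
Spark code): both callers funnel through one shared helper, so a 
context-insensitive profile of that helper sees both comparator implementations.

```scala
import java.util.Comparator

object SharedHelper {
  // One shared body means one JIT profile, even though the callers pass
  // different comparator implementations.
  def getComparator(cmp: Comparator[Int]): Comparator[Int] = cmp
}

object CallerA {
  // comparator "B" in the explanation above
  private val comparatorB = new Comparator[Int] {
    override def compare(x: Int, y: Int): Int = x - y
  }
  def run(): Int = SharedHelper.getComparator(comparatorB).compare(1, 2)
}

object CallerC {
  // comparator "D" in the explanation above
  private val comparatorD = new Comparator[Int] {
    override def compare(x: Int, y: Int): Int = y - x
  }
  def run(): Int = SharedHelper.getComparator(comparatorD).compare(1, 2)
}
```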


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15713: [SPARK-18196] [CORE] Optimise CompactBuffer imple...

2016-11-01 Thread a-roberts
GitHub user a-roberts opened a pull request:

https://github.com/apache/spark/pull/15713

[SPARK-18196] [CORE] Optimise CompactBuffer implementation

## What changes were proposed in this pull request?
See the JIRA for details - summary is slightly increased footprint in the 
class but 4% performance increase observed on PageRank with HiBench large

## How was this patch tested?
Existing unit tests and HiBench large

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/a-roberts/spark patch-8

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/15713.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #15713


commit cdf8b8da1c986df6907673682862f34157757a2f
Author: Adam Roberts 
Date:   2016-11-01T14:45:52Z

[SPARK-18196] [CORE] Optimise CompactBuffer implementation

See the JIRA for details - summary is slightly increased footprint in the 
class but 4% performance increase observed on PageRank with HiBench large




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15714: [SPARK-18197] [CORE] Optimise AppendOnlyMap imple...

2016-11-01 Thread a-roberts
GitHub user a-roberts opened a pull request:

https://github.com/apache/spark/pull/15714

[SPARK-18197] [CORE] Optimise AppendOnlyMap implementation

## What changes were proposed in this pull request?
More details on the JIRA, slight performance increase here by performing 
the cheapest comparison first

## How was this patch tested?
Existing unit tests and HiBench large


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/a-roberts/spark patch-9

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/15714.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #15714


commit 40e3e06155bf905e9638831b4f8cf91e305f7fab
Author: Adam Roberts 
Date:   2016-11-01T14:58:40Z

[SPARK-18197] [CORE] Optimise AppendOnlyMap implementation

More details on the JIRA, slight performance increase here by performing 
the cheapest comparison first




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15713: [SPARK-18196] [CORE] Optimise CompactBuffer implementati...

2016-11-02 Thread a-roberts
Github user a-roberts commented on the issue:

https://github.com/apache/spark/pull/15713
  
Thanks for the prompt feedback. We found this opportunity when profiling 
Spark 1.6.2 with HiBench large, where this showed up as a hot method in the 
PageRank benchmark. We can gather data to see whether it's still hot with Spark 
2 as well, and I'm planning to contribute lots of similar improvements.

Paraphrasing from a colleague:

> This data structure is the backing data structure used by RDDs that are 
doing group by operations (we saw it from a PairRDD doing a groupByKey in 
PageRank)
> 
> The downside of the existing implementation is that every method in this 
class has an if ... else ... if ... else ... which handles element 0, element 1 
and then everything else respectively
> 
> We found that on PageRank this change provides a throughput boost of 
around 5% and costs us about 1 MB of estimated RDD size (86.5 MB to just under 
88 MB)

Note that with our testing using OpenJDK 8 we didn't see a noticeable 
performance improvement (nor a regression) despite the very slight footprint 
increase (an increase of 2 MB instead of 1.5 MB). Ideally we'll improve 
performance for everybody, so there may be scope for optimisations here that 
will be of use to OpenJDK users too.
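
For readers who haven't looked at the class, a simplified sketch of the 
branching being described (loosely modelled on CompactBuffer's two inline fields 
plus a growable array; not the actual Spark source):

```scala
// Simplified sketch: the first two elements live in fields, everything else in a
// growable backing array, so each method branches on element 0, element 1, then the rest.
class MiniCompactBuffer[T] {
  private var element0: T = _
  private var element1: T = _
  private var otherElements: Array[AnyRef] = _
  private var curSize = 0

  def apply(position: Int): T = {
    require(position >= 0 && position < curSize, s"Invalid index $position")
    if (position == 0) element0
    else if (position == 1) element1
    else otherElements(position - 2).asInstanceOf[T]
  }

  def +=(value: T): this.type = {
    if (curSize == 0) element0 = value
    else if (curSize == 1) element1 = value
    else {
      if (otherElements == null || curSize - 2 >= otherElements.length) {
        // double the backing array when the spill-over region is full
        val newArray = new Array[AnyRef](math.max(4, 2 * (curSize - 2)))
        if (otherElements != null) {
          System.arraycopy(otherElements, 0, newArray, 0, otherElements.length)
        }
        otherElements = newArray
      }
      otherElements(curSize - 2) = value.asInstanceOf[AnyRef]
    }
    curSize += 1
    this
  }
}
```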


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15713: [SPARK-18196] [CORE] Optimise CompactBuffer implementati...

2016-11-02 Thread a-roberts
Github user a-roberts commented on the issue:

https://github.com/apache/spark/pull/15713
  
Good point, I'll first see if we can build and pass the unit tests by 
replacing CompactBuffer with ArrayBuffer(2), and we'd also repeat the profiling 
(this would verify whether groupByKey is still hot for PageRank). 

I'll keep the CompactBufferSuite for our experiment and replace the 
CompactBuffer references with ArrayBuffer to verify we can still perform all of 
the operations that the tests exercise.

My concern is that the comment for the CompactBuffer class already spells out 
the downsides of said ArrayBuffer (the 16-entry Object array we'd get), so we'd 
indeed want a default two-element ArrayBuffer that only grows if needed -- 
which sounds like what we currently have, just encapsulated within this class so 
we can override two operators and grow the backing array as required. A sketch 
of the experiment follows below.
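
For reference, a minimal sketch of that experiment (hypothetical snippet; the 
point is only that ArrayBuffer's initialSize constructor argument pre-sizes the 
backing array to two slots instead of the default sixteen):

```scala
import scala.collection.mutable.ArrayBuffer

// Start from a two-slot backing array rather than ArrayBuffer's default of 16.
val buf = new ArrayBuffer[AnyRef](2)
buf += "first"
buf += "second"
buf += "third"   // appending beyond the initial capacity triggers a grow + copy
```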


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15735: [SPARK-18223] [CORE] Optimise PartitionedAppendOn...

2016-11-02 Thread a-roberts
GitHub user a-roberts opened a pull request:

https://github.com/apache/spark/pull/15735

[SPARK-18223] [CORE] Optimise PartitionedAppendOnlyMap implementation

## What changes were proposed in this pull request?

This class and the PartitionedPairBuffer class are both core Spark data 
structures that allow us to spill data to disk. 

From the comment in ExternalSorter before instantiating said data 
structures:
// Data structures to store in-memory objects before we spill. Depending on 
whether we have an
// Aggregator set, we either put objects into an AppendOnlyMap where we 
combine them, or we
// store them in an array buffer.

All of our data within RDDs has a partition ID and the ordering operations 
will order by a partition before any other criteria. Such data structures share 
a partitionKeyComparator from WriteablePartitionedPairCollection.

While this change adds more code, it is the inefficient iterator wrapping we 
remove that had the negative performance impact; avoiding that wrapping helps 
the inliner. With it avoided we've observed a 3% PageRank performance increase 
on HiBench large for both IBM's SDK for Java and OpenJDK 8, as a result of the 
inliner being better able to figure out what's going on. This observation is 
seen when combined with an optimised PartitionedPairBuffer implementation I'll 
also contribute.
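
For reference, a sketch of the partition-first ordering described above 
(simplified from the partitionKeyComparator shape in 
WritablePartitionedPairCollection):

```scala
import java.util.Comparator

// Compare by partition ID first; only consult the key comparator when the two
// elements fall in the same partition.
def partitionKeyComparator[K](keyComparator: Comparator[K]): Comparator[(Int, K)] =
  new Comparator[(Int, K)] {
    override def compare(a: (Int, K), b: (Int, K)): Int = {
      val partitionDiff = a._1 - b._1
      if (partitionDiff != 0) partitionDiff else keyComparator.compare(a._2, b._2)
    }
  }
```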

## How was this patch tested?

Existing unit tests and HiBench large, PageRank benchmark specifically.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/a-roberts/spark patch-10

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/15735.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #15735


commit 7d2e53af6d9c903f89c2457ca4b256693849
Author: Adam Roberts 
Date:   2016-11-02T11:30:51Z

[SPARK-18223] [CORE] Optimise PartitionedAppendOnlyMap implementation

This class and the PartitionedPairBuffer class are both core Spark data 
structures that allow us to spill data to disk. 

From the comment in ExternalSorter before instantiating said data 
structures:
// Data structures to store in-memory objects before we spill. Depending on 
whether we have an
// Aggregator set, we either put objects into an AppendOnlyMap where we 
combine them, or we
// store them in an array buffer.

All of our data within RDDs has a partition ID and the ordering operations 
will order by a partition before any other criteria. Such data structures share 
a partitionKeyComparator from WriteablePartitionedPairCollection.

While this change adds more code, it is the inefficient iterator wrapping we 
remove that had the negative performance impact; avoiding that wrapping helps 
the inliner. With it avoided we've observed a 3% PageRank performance increase 
on HiBench large for both IBM's SDK for Java and OpenJDK 8, as a result of the 
inliner being better able to figure out what's going on. This observation is 
seen when combined with an optimised PartitionedPairBuffer implementation I'll 
also contribute.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15736: [SPARK-18224] [CORE] Optimise PartitionedPairBuff...

2016-11-02 Thread a-roberts
GitHub user a-roberts opened a pull request:

https://github.com/apache/spark/pull/15736

[SPARK-18224] [CORE] Optimise PartitionedPairBuffer implementation

## What changes were proposed in this pull request?
This change is very similar to my pull request for improving 
PartitionedAppendOnlyMap: https://github.com/apache/spark/pull/15735

Summarising (more detail above), we avoid the slow iterator wrapping in 
favour of helping the inliner. We observed that this, when combined with the 
above change, leads to a 3% performance increase on the HiBench large PageRank 
benchmark with both IBM's SDK for Java and with OpenJDK 8

## How was this patch tested?
Existing unit tests and HiBench large profile with both IBM's SDK for Java 
and OpenJDK 8, the PageRank benchmark specifically

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/a-roberts/spark patch-11

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/15736.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #15736


commit 0d1411c0f14c7679931124838f01e8f44618c15f
Author: Adam Roberts 
Date:   2016-11-02T11:36:28Z

[SPARK-18224] [CORE] Optimise PartitionedPairBuffer implementation

This change is very similar to my pull request for improving 
PartitionedAppendOnlyMap: https://github.com/apache/spark/pull/15735

Summarising (more detail above), we avoid the slow iterator wrapping in 
favour of helping the inliner. We observed that this, when combined with the 
above change, leads to a 3% performance increase on the HiBench large PageRank 
benchmark with both IBM's SDK for Java and with OpenJDK 8




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15735: [SPARK-18223] [CORE] Optimise PartitionedAppendOnlyMap i...

2016-11-02 Thread a-roberts
Github user a-roberts commented on the issue:

https://github.com/apache/spark/pull/15735
  
Good point, I'll contribute the changes here with #15736 and address your 
comments there for both changes


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15735: [SPARK-18223] [CORE] Optimise PartitionedAppendOn...

2016-11-02 Thread a-roberts
Github user a-roberts closed the pull request at:

https://github.com/apache/spark/pull/15735


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15736: [SPARK-18224] [CORE] Optimise PartitionedPairBuffer impl...

2016-11-02 Thread a-roberts
Github user a-roberts commented on the issue:

https://github.com/apache/spark/pull/15736
  
Will be adding the commit from #15735 here upon addressing the feedback


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15736: [SPARK-18224] [CORE] Optimise PartitionedPairBuffer impl...

2016-11-02 Thread a-roberts
Github user a-roberts commented on the issue:

https://github.com/apache/spark/pull/15736
  
Addressed the scalastyle comments and added the PartitionedAppendOnlyMap 
change here as per the above suggestions; I'll look at the review comments next.

Two unrelated asides:
1) I'm wary that I'm hogging the build machines, so it would be useful not to 
autotest every time
2) dev/scalastyle should accept a parameter so we can quickly check just 
one file; it typically takes a long time


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15736: [SPARK-18224] [CORE] Optimise PartitionedPairBuffer impl...

2016-11-30 Thread a-roberts
Github user a-roberts commented on the issue:

https://github.com/apache/spark/pull/15736
  
@srowen how about this for profiling?

```
private[spark] object WritablePartitionedPairCollection {
  /**
   * Takes an optional parameter (keyComparator), use if provided
   * and returns a comparator for the partitions
   */
  def getComparator[K](keyComparator: Option[Comparator[K]]): Comparator[(Int, K)] = {
if (keyComparator.isDefined) {
  val theKeyComp = keyComparator.get
  new Comparator[(Int, K)] {
// We know we have a non-empty comparator here
override def compare(a: (Int, K), b: (Int, K)): Int = {
  if (a._1 == b._1) {
theKeyComp.compare(a._2, b._2)
  } else {
a._1 - b._1
  }
}
  }
} else return new Comparator[(Int, K)] {
  override def compare(a: (Int, K), b: (Int, K)): Int = {
a._1 - b._1
  }
}
  }
}
```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15713: [SPARK-18196] [CORE] Optimise CompactBuffer implementati...

2016-11-30 Thread a-roberts
Github user a-roberts commented on the issue:

https://github.com/apache/spark/pull/15713
  
In response to @rxin's question, for HiBench CompactBuffers are **used only 
on PageRank** (none of the other 11 workloads), and these buffers mainly have 
between 3 and 40 elements, no more than 60, and never only two elements. The 
PageRank workload processes 500k pages (large profile); we have 500k 
CompactBuffer constructor calls and 500k prints in the += method when 
curSize <= 2, indicating they're always expanding.

I don't know of any cases where we're adding only a couple of elements. I 
also ran SparkSqlPerf, all 100 queries, and again we have no output indicating 
that we use this class (no prints from the constructor, growToSize or the += 
methods). 

Here's a breakdown of growToSize invocations (printing the curSize variable) 
with PageRank, so we have an idea of how big the CompactBuffers actually become.

I used the Spark WordCount example on the 677 MB stdout file containing my 
prints to generate this data; we have a total of 18,762,361 growth events.

```
(3,50), (4,50), (5,50), (6,50), (7,50), (8,50), 
(9,50), (10,50), (11,50), (12,50), (13,50), (14,50), 
(15,50), (16,50), (17,50), (18,50), (19,50), (20,50), 
(21,48), (22,45), (23,42), (24,499978), (25,499951), (26,499879), 
(27,499729), (28,499321), (29,498517), (30,496984), (31,494114), (32,488878), 
(33,480328), (34,467214), (35,447829), (36,421619), (37,387790), (38,346826), 
(39,300660), (40,251266), (41,201702), (42,155372), (43,114024), (44,79886), 
(45,53196), (46,33580), (47,20146), (48,11569), (49,6222), (50,3143), 
(51,1491), (52,684), (53,289), (54,126), (55,39), (56,15), (57,6), (58,1), 
(59,1), (60,1)
```
On the left we have the CompactBuffer size in elements and on the right we 
have a number representing how many times this appeared in the output file 
(therefore the CompactBuffer has grown to have this many elements that many 
times).

If there are better ways to figure this out, or other workloads to suggest, do 
let me know. I've got the code ready that replaces CompactBuffer with 
ArrayBuffer(2) for profiling and testing.
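
For completeness, a hedged sketch of the counting step (the file path and the 
exact print marker are assumptions for illustration; any WordCount-style 
aggregation over the instrumented output gives the same shape of result):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object GrowthHistogram {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("GrowthHistogram"))
    val counts = sc.textFile("stdout")                    // the large log, one print per growth event
      .filter(_.startsWith("growToSize:"))                // keep only the instrumentation lines (assumed marker)
      .map(line => (line.split(":")(1).trim.toInt, 1L))   // (buffer size, 1)
      .reduceByKey(_ + _)                                 // total growth events per size
      .sortByKey()
    counts.collect().foreach { case (size, n) => println(s"($size,$n)") }
    sc.stop()
  }
}
```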


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15736: [SPARK-18224] [CORE] Optimise PartitionedPairBuffer impl...

2016-11-30 Thread a-roberts
Github user a-roberts commented on the issue:

https://github.com/apache/spark/pull/15736
  
Good point, done. Shall I get profiling the code below, then? It builds fine 
with no scalastyle problems.

```
  def getComparator[K](keyComparator: Option[Comparator[K]]): Comparator[(Int, K)] = {
if (keyComparator.isDefined) {
  val theKeyComp = keyComparator.get
  new Comparator[(Int, K)] {
// We know we have a non-empty comparator here
override def compare(a: (Int, K), b: (Int, K)): Int = {
  if (a._1 == b._1) {
theKeyComp.compare(a._2, b._2)
  } else {
a._1 - b._1
  }
}
  }
} else {
  new Comparator[(Int, K)] {
override def compare(a: (Int, K), b: (Int, K)): Int = {
  a._1 - b._1
}
  }
}
  }
```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15736: [SPARK-18224] [CORE] Optimise PartitionedPairBuffer impl...

2016-11-30 Thread a-roberts
Github user a-roberts commented on the issue:

https://github.com/apache/spark/pull/15736
  
I see, so we do the != comparison first, which is likely to be true more of 
the time, so we're not consistently failing this check and then entering the else.

```
  def getComparator[K](keyComparator: Option[Comparator[K]]): Comparator[(Int, K)] = {
if (keyComparator.isDefined) {
  val theKeyComp = keyComparator.get
  new Comparator[(Int, K)] {
// We know we have a non-empty comparator here
override def compare(a: (Int, K), b: (Int, K)): Int = {
  if (a._1 != b._1) {
   a._1 - b._1
  } else {
   theKeyComp.compare(a._2, b._2)
  }
}
  }
} else {
  new Comparator[(Int, K)] {
override def compare(a: (Int, K), b: (Int, K)): Int = {
  a._1 - b._1
}
  }
}
  }
```

Again that builds fine


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15736: [SPARK-18224] [CORE] Optimise PartitionedPairBuffer impl...

2016-11-30 Thread a-roberts
Github user a-roberts commented on the issue:

https://github.com/apache/spark/pull/15736
  
Passed on your question to our JIT developers

> The sense* of the test can have an impact on the VM interpreter 
performance, but that is not usually much of a component of actual throughput, 
since important methods will be JIT'd very quickly regardless of which specific 
JVM you use. The J9 VM is capable of profiling the branch and flipping the 
sense when JITing the code.
> 
> *The sense refers to the way the code branches, so either down the equals 
branch or the not-equals branch.

Numbers for us

**Refactored further as above**
```
ScalaSparkPagerank 2016-11-30 14:27:49 259928115 49.841 5215146
ScalaSparkPagerank 2016-11-30 14:29:52 259928115 51.310 5065837
ScalaSparkPagerank 2016-11-30 14:31:59 259928115 52.086 4990364
ScalaSparkPagerank 2016-11-30 14:34:05 259928115 50.667 5130126
ScalaSparkPagerank 2016-11-30 14:36:04 259928115 47.096 5519112
ScalaSparkPagerank 2016-11-30 14:38:04 259928115 48.244 5387781
ScalaSparkPagerank 2016-11-30 14:40:10 259928115 48.734 5333609
ScalaSparkPagerank 2016-11-30 14:42:12 259928115 49.295 5272910
397.273 / 8 = 49.659 sec average
```
**initial commit**
```
ScalaSparkPagerank 2016-11-30 14:48:01 259928115 46.442 5596832
ScalaSparkPagerank 2016-11-30 14:50:06 259928115 50.016 5196899
ScalaSparkPagerank 2016-11-30 14:52:12 259928115 51.113 5085362
ScalaSparkPagerank 2016-11-30 14:54:12 259928115 46.424 5599002
ScalaSparkPagerank 2016-11-30 14:56:15 259928115 47.604 5460215
ScalaSparkPagerank 2016-11-30 14:58:14 259928115 46.802 5553782
ScalaSparkPagerank 2016-11-30 15:00:16 259928115 47.021 5527915
ScalaSparkPagerank 2016-11-30 15:02:16 259928115 47.072 5521926
382.494 / 8 = 47.811s average
```

The first commit performs better on average. Next I'd like to add the 
improved compare code as above and "push this down" into the subclasses to see 
how that performs.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15713: [SPARK-18196] [CORE] Optimise CompactBuffer implementati...

2016-12-01 Thread a-roberts
Github user a-roberts commented on the issue:

https://github.com/apache/spark/pull/15713
  
Performance results against the Spark master branch on a 48 core machine 
running PageRank with 500k pages follow

**Vanilla CompactBuffer, no changes, run time and throughput (bytes per 
second) provided**
```
ScalaSparkPagerank 2016-12-01 13:16:09 259928115 47.933 5422738
ScalaSparkPagerank 2016-12-01 13:22:41 259928115 45.551 5706309
ScalaSparkPagerank 2016-12-01 13:26:31 259928115 46.745 5560554
ScalaSparkPagerank 2016-12-01 13:28:58 259928115 51.699 5027720
ScalaSparkPagerank 2016-12-01 13:33:26 259928115 48.415 5368751
240.343s / 5 = 48.068s avg
```

**The commit here**
```
ScalaSparkPagerank 2016-12-01 10:26:12 259928115 48.706 5336675
ScalaSparkPagerank 2016-12-01 10:37:30 259928115 48.947 5310399
ScalaSparkPagerank 2016-12-01 10:40:16 259928115 49.768 5222796
ScalaSparkPagerank 2016-12-01 12:55:37 259928115 48.873 5318439
ScalaSparkPagerank 2016-12-01 12:58:12 259928115 47.535 5468141
243.829 / 5 = 48.7658s avg
```

These are way too similar, so I'm attributing this to benchmark noise; 
without the 51s run this would be a few percentage points worse, though.

**Use an ArrayBuffer (initial capacity of 16, default) instead of 
CompactBuffer**
```
ScalaSparkPagerank 2016-12-01 13:42:45 259928115 62.190 4179580
ScalaSparkPagerank 2016-12-01 13:55:20 259928115 54.112 4803520
ScalaSparkPagerank 2016-12-01 13:59:06 259928115 60.818 4273868
ScalaSparkPagerank 2016-12-01 14:06:26 259928115 57.428 4526156
ScalaSparkPagerank 2016-12-01 14:35:01 259928115 58.218 4464737
292.766 / 5 = 58.5532s avg
```

**Use an ArrayBuffer (initial capacity of 2) instead of CompactBuffer**
```
ScalaSparkPagerank 2016-12-01 15:31:16 259928115 53.544 4854476
ScalaSparkPagerank 2016-12-01 15:36:32 259928115 58.105 4473420
ScalaSparkPagerank 2016-12-01 15:38:45 259928115 53.976 4815623
ScalaSparkPagerank 2016-12-01 15:44:09 259928115 55.174 4711061
ScalaSparkPagerank 2016-12-01 15:50:01 259928115 55.084 4718758
275.883 / 5 = 55.1766s avg
```

With my tests I see that using an ArrayBuffer is noticeably worse, so I'll 
continue to look into what's going on to see if we can improve performance here 
as this is definitely a hot codepath for this particular algorithm


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15736: [SPARK-18224] [CORE] Optimise PartitionedPairBuffer impl...

2016-12-06 Thread a-roberts
Github user a-roberts commented on the issue:

https://github.com/apache/spark/pull/15736
  
New data for us, inlined comparator scores here (code provided below to 
check I've not profiled something useless!):
```
ScalaSparkPagerank 2016-12-05 13:44:41 259928115 48.149 5398411 5398411
ScalaSparkPagerank 2016-12-05 13:46:43 259928115 46.897 5542531 5542531
ScalaSparkPagerank 2016-12-05 13:48:46 259928115 49.130 5290619 5290619
ScalaSparkPagerank 2016-12-05 13:50:49 259928115 49.793 5220173 5220173
ScalaSparkPagerank 2016-12-05 13:52:50 259928115 48.061 5408296 5408296
ScalaSparkPagerank 2016-12-05 13:54:52 259928115 46.468 5593701 5593701
ScalaSparkPagerank 2016-12-05 13:56:56 259928115 51.385 5058443 5058443
ScalaSparkPagerank 2016-12-05 13:58:59 259928115 47.857 5431349 5431349
ScalaSparkPagerank 2016-12-05 14:00:59 259928115 46.515 5588049 5588049
ScalaSparkPagerank 2016-12-05 14:03:03 259928115 47.791 5438850 5438850
Avg 48.2046s
```

Remember our "vanilla" average time is 47.752s and our first commit 
averaged 47.229s (so not much of a difference really).

I think we're splitting hairs here, and I've got another PR I'm seeing good 
results on that I plan to focus on instead: the SizeEstimator. 

This is what I've benchmarked (PartitionedAppendOnlyMap first), so let me 
know if there are any further suggestions; otherwise I propose leaving this one 
for later, as against the Spark master codebase I'm not actually noticing 
anything exciting.

```
  def partitionedDestructiveSortedIterator(keyComparator: Option[Comparator[K]])
    : Iterator[((Int, K), V)] = {
val comparator = {
  if (keyComparator.isDefined) {
val theKeyComp = keyComparator.get
new Comparator[(Int, K)] {
  // We know we have a non-empty comparator here
  override def compare(a: (Int, K), b: (Int, K)): Int = {
if (a._1 != b._1) {
  a._1 - b._1
} else {
 theKeyComp.compare(a._2, b._2)
}
  }
}
  } else {
new Comparator[(Int, K)] {
  override def compare(a: (Int, K), b: (Int, K)): Int = {
a._1 - b._1
  }
}
  }
}
destructiveSortedIterator(comparator)
  }
```

In PartitionedPairBuffer

```

  /** Iterate through the data in a given order. For this class this is not 
really destructive. */
  override def partitionedDestructiveSortedIterator(keyComparator: Option[Comparator[K]])
    : Iterator[((Int, K), V)] = {
val comparator = {
  if (keyComparator.isDefined) {
val theKeyComp = keyComparator.get
new Comparator[(Int, K)] {
  // We know we have a non-empty comparator here
  override def compare(a: (Int, K), b: (Int, K)): Int = {
if (a._1 != b._1) {
  a._1 - b._1
} else {
 theKeyComp.compare(a._2, b._2)
}
  }
}
  } else {
new Comparator[(Int, K)] {
  override def compare(a: (Int, K), b: (Int, K)): Int = {
a._1 - b._1
  }
}
  }
}
    new Sorter(new KVArraySortDataFormat[(Int, K), AnyRef]).sort(data, 0, curSize, comparator)
iterator
  }
```

WritablePartitionedPairCollection remains unchanged.







---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16196: [SPARK-18231] Optimise SizeEstimator implementati...

2016-12-07 Thread a-roberts
GitHub user a-roberts opened a pull request:

https://github.com/apache/spark/pull/16196

[SPARK-18231] Optimise SizeEstimator implementation

## What changes were proposed in this pull request?

Several performance improvements to the SizeEstimator; most of the benefit 
comes from going from contending to not contending on multiple threads when 
estimating. There can be a small boost in uncontended scenarios from the removal 
of the synchronisation code, but the cost of that synchronisation when not truly 
contended is low. On the PageRank workload for HiBench we see 10-15% 
performance improvements (measuring elapsed times on average) with both IBM's 
SDK for Java and OpenJDK 8. I don't see any changes other than noise for the 
other workloads in this benchmark.

## How was this patch tested?

Existing unit tests but there are problems to resolve.

I see SizeEstimatorSuite and SizeTrackerSuite failing with at least IBM 
Java now due to smaller sizes being reported than the test expects (let's see 
what happens with OpenJDK on the community runs). 

In SizeTrackerSuite I think the failures are caused by using 
ThreadLocalRandom and not Random - because with Random we see these tests 
passing again. Not sure how robust SizeTrackerSuite is though.

For performance testing I've used HiBench, large profile, with one executor 
ranging from 10g to 25g, experimenting with fixed and dynamic heaps. The Spark 
code I've based my results on is from December the 1st (master branch, so 2.1.0 
snapshot).

More details on the optimisations (this being phase one and JDK agnostic) 
at www.spark.tc/improvements-to-the-sizeestimator-class

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/a-roberts/spark patch-12

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16196.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16196


commit 50af8fc224cb5acb19a7b55d31ee92b44c96f26f
Author: Adam Roberts 
Date:   2016-12-07T10:32:37Z

[SPARK-18231] Optimise SizeEstimator implementation

Several performance improvements to the SizeEstimator; most of the benefit 
comes from going from contending to not contending on multiple threads when 
estimating. There can be a small boost in uncontended scenarios from the removal 
of the synchronisation code, but the cost of that synchronisation when not truly 
contended is low. On the PageRank workload for HiBench we see ~49 second 
durations reduced to ~41 second durations. I don't see any changes for other 
workloads. Observed with both IBM's SDK for Java and OpenJDK.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16196: [SPARK-18231] Optimise SizeEstimator implementati...

2016-12-07 Thread a-roberts
Github user a-roberts commented on a diff in the pull request:

https://github.com/apache/spark/pull/16196#discussion_r91301705
  
--- Diff: core/src/main/scala/org/apache/spark/util/SizeEstimator.scala ---
@@ -89,7 +90,13 @@ object SizeEstimator extends Logging {
 
   // A cache of ClassInfo objects for each class
   // We use weakKeys to allow GC of dynamically created classes
-  private val classInfos = new MapMaker().weakKeys().makeMap[Class[_], 
ClassInfo]()
+  private val classInfos = new ThreadLocal[WeakHashMap[Class[_], 
ClassInfo]] {
+override def initialValue(): java.util.WeakHashMap[Class[_], 
ClassInfo] = {
+  val toReturn = new WeakHashMap[Class[_], ClassInfo]()
+  toReturn.put(classOf[Object], new ClassInfo(objectSize, new 
Array[Int](0)))
+  return toReturn
--- End diff --

Built and profiled: averaging 42-second run times with the initial commit and 
45-second run times with this. No changes = 48 seconds.

My code as a diff (using a ConcurrentHashMap, and var not val so we can 
initialise it later) is provided here:

```
 import java.lang.management.ManagementFactory
 import java.lang.reflect.{Field, Modifier}
 import java.util.{IdentityHashMap, WeakHashMap}
-import java.util.concurrent.ThreadLocalRandom
+import java.util.concurrent.{ThreadLocalRandom, ConcurrentMap}

 import scala.collection.mutable.ArrayBuffer
 import scala.concurrent.util.Unsafe
@@ -88,16 +88,6 @@ object SizeEstimator extends Logging {
   // TODO: Is this arch dependent ?
   private val ALIGN_SIZE = 8

-  // A cache of ClassInfo objects for each class
-  // We use weakKeys to allow GC of dynamically created classes
-  private val classInfos = new ThreadLocal[WeakHashMap[Class[_], 
ClassInfo]] {
-override def initialValue(): java.util.WeakHashMap[Class[_], 
ClassInfo] = {
-  val toReturn = new WeakHashMap[Class[_], ClassInfo]()
-  toReturn.put(classOf[Object], new ClassInfo(objectSize, new 
Array[Int](0)))
-  return toReturn
-}
-  }
-
   // Object and pointer sizes are arch dependent
   private var is64bit = false

@@ -109,6 +99,8 @@ object SizeEstimator extends Logging {
   // Minimum size of a java.lang.Object
   private var objectSize = 8

+  private var classInfos: ConcurrentMap[Class[_], ClassInfo] = null
+
   initialize()

   // Sets object size, pointer size based on architecture and 
CompressedOops settings
@@ -126,6 +118,9 @@ object SizeEstimator extends Logging {
   }
 }
 pointerSize = if (is64bit && !isCompressedOops) 8 else 4
+
+classInfos = new MapMaker().weakKeys().makeMap[Class[_], ClassInfo]()
+classInfos.put(classOf[Object], new ClassInfo(objectSize, new 
Array[Int](0)))
   }

   private def getIsCompressedOops: Boolean = {
@@ -338,7 +333,7 @@ object SizeEstimator extends Logging {
*/
   private def getClassInfo(cls: Class[_]): ClassInfo = {
 // Check whether we've already cached a ClassInfo for this class
-val info = classInfos.get().get(cls)
+val info = classInfos.get(cls)
 if (info != null) {
   return info
 }
@@ -371,7 +366,7 @@ object SizeEstimator extends Logging {

 // Create and cache a new ClassInfo
 val newInfo = new ClassInfo(shellSize, fieldOffsets.toArray)
-classInfos.get().put(cls, newInfo)
+classInfos.put(cls, newInfo)
 newInfo
   }
```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16196: [SPARK-18231] Optimise SizeEstimator implementati...

2016-12-07 Thread a-roberts
Github user a-roberts commented on a diff in the pull request:

https://github.com/apache/spark/pull/16196#discussion_r91305043
  
--- Diff: core/src/main/scala/org/apache/spark/util/SizeEstimator.scala ---
@@ -243,47 +253,59 @@ object SizeEstimator extends Logging {
   arrSize += alignSize(length.toLong * primitiveSize(elementClass))
   state.size += arrSize
 } else {
+  // We know that the array we are dealing with is an array of 
references
+  // so explicitly expose this type so we can directly manipulate the 
array
+  // without help from the Scala runtime for efficiency
   arrSize += alignSize(length.toLong * pointerSize)
   state.size += arrSize
 
+  val objArray = array.asInstanceOf[Array[AnyRef]]
+
   if (length <= ARRAY_SIZE_FOR_SAMPLING) {
 var arrayIndex = 0
 while (arrayIndex < length) {
-  state.enqueue(ScalaRunTime.array_apply(array, 
arrayIndex).asInstanceOf[AnyRef])
+  state.enqueue(objArray(arrayIndex))
   arrayIndex += 1
 }
   } else {
 // Estimate the size of a large array by sampling elements without 
replacement.
 // To exclude the shared objects that the array elements may link, 
sample twice
-// and use the min one to calculate array size.
-val rand = new Random(42)
+//  and use the min one to calculate array size.
+//  Use ThreadLocalRandom here since the random is only accessed 
from 1 thread
+// and we can save the overhead of the full thread-safe Random
+val rand = ThreadLocalRandom.current
 val drawn = new OpenHashSet[Int](2 * ARRAY_SAMPLE_SIZE)
-val s1 = sampleArray(array, state, rand, drawn, length)
-val s2 = sampleArray(array, state, rand, drawn, length)
+val s1 = sampleArray(objArray, state, rand, drawn, length)
+val s2 = sampleArray(objArray, state, rand, drawn, length)
 val size = math.min(s1, s2)
+
 state.size += math.max(s1, s2) +
   (size * ((length - ARRAY_SAMPLE_SIZE) / 
(ARRAY_SAMPLE_SIZE))).toLong
   }
 }
   }
 
   private def sampleArray(
-  array: AnyRef,
+  array: Array[AnyRef],
   state: SearchState,
-  rand: Random,
+  rand: ThreadLocalRandom,
   drawn: OpenHashSet[Int],
   length: Int): Long = {
 var size = 0L
-for (i <- 0 until ARRAY_SAMPLE_SIZE) {
+// avoid the use of an iterator derived from the range syntax here 
for performance
+var count = 0
+val end = ARRAY_SAMPLE_SIZE
+while (count <= end) {
--- End diff --

Ah yes, should be just < not <=, will add into the next commit
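
A trivial sketch of the difference for anyone skimming (the constant value here 
is just for illustration):

```scala
// `count <= end` would iterate ARRAY_SAMPLE_SIZE + 1 times; `count < end` iterates
// exactly ARRAY_SAMPLE_SIZE times, matching the old `for (i <- 0 until ARRAY_SAMPLE_SIZE)`.
val ARRAY_SAMPLE_SIZE = 100
var count = 0
var draws = 0
while (count < ARRAY_SAMPLE_SIZE) {
  draws += 1
  count += 1
}
assert(draws == ARRAY_SAMPLE_SIZE)
```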


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16196: [SPARK-18231] Optimise SizeEstimator implementati...

2016-12-07 Thread a-roberts
Github user a-roberts commented on a diff in the pull request:

https://github.com/apache/spark/pull/16196#discussion_r91309585
  
--- Diff: core/src/main/scala/org/apache/spark/util/SizeEstimator.scala ---
@@ -89,7 +90,13 @@ object SizeEstimator extends Logging {
 
   // A cache of ClassInfo objects for each class
   // We use weakKeys to allow GC of dynamically created classes
-  private val classInfos = new MapMaker().weakKeys().makeMap[Class[_], 
ClassInfo]()
+  private val classInfos = new ThreadLocal[WeakHashMap[Class[_], 
ClassInfo]] {
+override def initialValue(): java.util.WeakHashMap[Class[_], 
ClassInfo] = {
+  val toReturn = new WeakHashMap[Class[_], ClassInfo]()
+  toReturn.put(classOf[Object], new ClassInfo(objectSize, new 
Array[Int](0)))
+  return toReturn
--- End diff --

Got it, thanks will give this a try


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16212: [SPARK-18782] [BUILD] Bump Hadoop 2.6 version to ...

2016-12-08 Thread a-roberts
GitHub user a-roberts opened a pull request:

https://github.com/apache/spark/pull/16212

[SPARK-18782] [BUILD] Bump Hadoop 2.6 version to use Hadoop 2.6.5

## What changes were proposed in this pull request?
Use Hadoop 2.6.5 for the Hadoop 2.6 profile; I see a bunch of fixes, 
including security ones, in the [release 
notes](http://hadoop.apache.org/docs/r2.6.5/hadoop-project-dist/hadoop-common/releasenotes.html) 
that we should pick up

## How was this patch tested?
Running the unit tests now with IBM's SDK for Java and let's see what 
happens with OpenJDK in the community builder - expecting no trouble as it is 
only a minor release.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/a-roberts/spark Hadoop265Bumper

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16212.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16212


commit d777246ea7c2b0c6acb31c8cb7e1caee4f3fc9fb
Author: Adam Roberts 
Date:   2016-12-08T10:32:16Z

Use Hadoop 2.6.5




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16212: [SPARK-18782] [BUILD] Bump Hadoop 2.6 version to use Had...

2016-12-08 Thread a-roberts
Github user a-roberts commented on the issue:

https://github.com/apache/spark/pull/16212
  
http://www.openwall.com/lists/oss-security/2016/11/29/1 mentions that Hadoop 
2.7.x users should upgrade to 2.7.3 and Hadoop 2.6.x users should upgrade to 
2.6.5. So if our Hadoop users are moving up to 2.6.5 for 1.6.x, can we be 
certain Spark will work if we mix Hadoop 2.6.4 classes with Hadoop 2.6.5 ones? 
I'm thinking specifically of the auto-generated serialVersionUID mismatches 
that may occur.
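
To illustrate the concern (a generic Java-serialisation sketch, not Hadoop 
code): if a Serializable class relies on the auto-generated serialVersionUID, 
recompiling it between releases can change that UID, and deserialisation across 
the two versions then fails with an InvalidClassException. Declaring the UID 
explicitly pins it:

```scala
// Hypothetical class for illustration only; Hadoop's own classes declare (or omit)
// their own serialVersionUID values.
@SerialVersionUID(1L)
class ExampleRecord(val value: Int) extends Serializable
```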


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15620: [SPARK-18091] [SQL] Deep if expressions cause Gen...

2016-12-10 Thread a-roberts
Github user a-roberts commented on a diff in the pull request:

https://github.com/apache/spark/pull/15620#discussion_r91834301
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CodeGenerationSuite.scala
 ---
@@ -97,6 +97,27 @@ class CodeGenerationSuite extends SparkFunSuite with 
ExpressionEvalHelper {
 assert(actual(0) == cases)
   }
 
+  test("SPARK-18091: split large if expressions into blocks due to JVM 
code size limit") {
+val inStr = "StringForTesting"
+val row = create_row(inStr)
+val inputStrAttr = 'a.string.at(0)
+
+var strExpr: Expression = inputStrAttr
+for (_ <- 1 to 13) {
+  strExpr = If(EqualTo(Decode(Encode(strExpr, "utf-8"), "utf-8"), 
inputStrAttr),
--- End diff --

Very interested in this. We know that with 5 iterations instead of 13 we don't 
get the problem (which makes sense, as we'd only be generating so much code and 
not hitting the 64k constant pool size limit). I'm testing with IBM's SDK for 
Java, so I'm curious whether it manifests itself in a different way for us or 
whether we have a problem to fix on our end.

I have log files exceeding 2 GB from the test printing the generated code 
on failure. 

If we add prints for the strExpr we see the output shown further below.

The problem is:
```
CodeGenerationSuite:
- multithreaded eval
- metrics are recorded on compile
- SPARK-8443: split wide projections into blocks due to JVM code size limit
- SPARK-13242: case-when expression with large number of branches (or cases)
- SPARK-18091: split large if expressions into blocks due to JVM code size 
limit *** FAILED ***
  java.util.concurrent.ExecutionException: java.lang.Exception: failed to 
compile: org.codehaus.janino.JaninoRuntimeException: Constant pool for class 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection
 has grown past JVM limit of 0xFFFF
/* 001 */ public java.lang.Object generate(Object[] references) {
/* 002 */   return new SpecificUnsafeProjection(references);
/* 003 */ }
/* 004 */
/* 005 */ class SpecificUnsafeProjection extends 
org.apache.spark.sql.catalyst.expressions.UnsafeProjection {
/* 006 */
/* 007 */   private Object[] references;
/* 008 */   private boolean isNull42;
/* 009 */   private boolean value42;
/* 010 */   private boolean isNull43;
/* 011 */   private UTF8String value43;
/* 012 */   private boolean isNull44;
/* 013 */   private UTF8String value44;
/* 014 */   private boolean isNull58;
```

With my prints:
```
debug, row: [StringForTesting]
input string attr: input[0, string, true]

in the loop
strExpr is: if ((decode(encode(input[0, string, true], utf-8), utf-8) = 
input[0, string, true])) input[0, string, true] else input[0, string, true]

in the loop
strExpr is: if ((decode(encode(if ((decode(encode(input[0, string, true], 
utf-8), utf-8) = input[0, string, true])) input[0, string, true] else input[0, 
string, true], utf-8), utf-8) = input[0, string, true])) if 
((decode(encode(input[0, string, true], utf-8), utf-8) = input[0, string, 
true])) input[0, string, true] else input[0, string, true] else if 
((decode(encode(input[0, string, true], utf-8), utf-8) = input[0, string, 
true])) input[0, string, true] else input[0, string, true]

in the loop
strExpr is: if ((decode(encode(if ((decode(encode(if 
((decode(encode(input[0, string, true], utf-8), utf-8) = input[0, string, 
true])) input[0, string, true] else input[0, string, true], utf-8), utf-8) = 
input[0, string, true])) if ((decode(encode(input[0, string, true], utf-8), 
utf-8) = input[0, string, true])) input[0, string, true] else input[0, string, 
true] else if ((decode(encode(input[0, string, true], utf-8), utf-8) = input[0, 
string, true])) input[0, string, true] else input[0, string, true], utf-8), 
utf-8) = input[0, string, true])) if ((decode(encode(if 
((decode(encode(input[0, string, true], utf-8), utf-8) = input[0, string, 
true])) input[0, string, true] else input[0, string, true], utf-8), utf-8) = 
input[0, string, true])) if ((decode(encode(input[0, string, true], utf-8), 
utf-8) = input[0, string, true])) input[0, string, true] else input[0, string, 
true] else if ((decode(encode(input[0, string, true], utf-8), utf-8) = input[0, 
string, true])) input[0, string, true
 ] else input[0, string, true] else if ((decode(encode(if 
((decode(encode(input[0, string, true], utf-8), utf-8) = input[0, string, 
true])) input[0, string, true] else input[0, string, true], utf-8), utf-8) = 
input[0, string, true])) if ((decode(encode(input[0, string, true], utf-8), 
utf-8) = input[0, string, true])) input[0, string, true] else input[0, string, 
true] else if ((decode(encode(input[0, string, true], utf-8), utf-8) = input[0, 
string

[GitHub] spark issue #16196: [SPARK-18231] Optimise SizeEstimator implementation

2016-12-11 Thread a-roberts
Github user a-roberts commented on the issue:

https://github.com/apache/spark/pull/16196
  
Agreed, I'll be back working on this and answering the queries after the 
2.1.0 release vote passes, that's my current priority as we're nearing the 
Christmas break period


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14558][CORE] In ClosureCleaner, clean t...

2016-04-19 Thread a-roberts
Github user a-roberts commented on the pull request:

https://github.com/apache/spark/pull/12327#issuecomment-211999755
  
Hi, with this new test I'm seeing large deviations using IBM JDKs on 
different platforms, for example:

`java.lang.AssertionError: assertion failed: deviation too large: 
0.25359712230215825, first size: 26688, second size: 19920`

Is the 20% deviation particularly important? Why this number?

FYI here's the comparison between different Java vendors and architectures:
- OpenJDK on Intel (passes), cacheSize1: 180392 and cacheSize2: 187896 
(4.07%)
- IBM JDK with SUSE on zSystems (fails): cacheSize1: 26688 and cacheSize2: 
19920 (29%)
- IBM JDK with Ubuntu 14 04 on Power 8 LE (fails): cacheSize1: 26688 and 
cacheSize2: 19920 (29%)
- IBM JDK with Ubuntu 14 04 on Intel (fails): cacheSize1: 354692 and 
cacheSize2: 263800 (29.3%)

I'll look into this; it's interesting that with IBM Java it's always a very 
similar and much larger percentage.
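
For reference, the reported deviation is reproduced by taking the absolute 
difference relative to the first size (assuming that is the formula the test 
uses; the number matches the assertion message exactly):

```scala
// |first - second| / first for the failing IBM JDK case above
val first = 26688.0
val second = 19920.0
val deviation = math.abs(first - second) / first
println(deviation)  // 0.25359712230215825 -- above the 0.2 (20%) threshold
```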


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14558][CORE] In ClosureCleaner, clean t...

2016-04-21 Thread a-roberts
Github user a-roberts commented on the pull request:

https://github.com/apache/spark/pull/12327#issuecomment-212991789
  
@cloud-fan I've had a closer look at this and think a more robust method 
would be to use weak references to identify when an object is out of scope. With 
IBM Java we see the 29% reduction between cache size 1 and cache size 2, but 
with OpenJDK we see a 4% increase, suggesting that we can't rely on the sizes 
being similar across JDK vendors; I'm now thinking this is a test case issue 
rather than a problem in the ClosureCleaner or IBM Java code.
With IBM Java our second cache size (after repartitioning, which goes through 
the ContextCleaner) is much smaller, whereas with OpenJDK it grows. Either we 
have a bigger memory footprint or the cached size is being calculated 
incorrectly (it looks fine to me, and we actually have smaller object sizes). The 
problem on Z was due to using repl/pom.xml instead of pom.xml in the Spark home 
directory (we get the same result with the right pom.xml file), so it can be 
discarded for this discussion.

I'm going to figure out what's in the Scala REPL line objects between 
vendors. I think the intention of this commit is to test that the REPL line 
object is being cleaned, but the assertion in place at the moment doesn't look 
correct (the size is bigger after the cleaning, and cacheSize2 is the result of 
cleaning, if I'm understanding the code correctly) -- have I missed a trick?
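
A hedged sketch of the weak-reference idea (illustrative only; a real test would 
need to drive the REPL and retry around GC):

```scala
import java.lang.ref.WeakReference

var lineObject: AnyRef = new Object()             // stands in for the REPL line object
val ref = new WeakReference[AnyRef](lineObject)
lineObject = null                                  // drop the only strong reference
System.gc()                                        // a hint; may need retries in practice
val cleanedUp = ref.get() == null                  // true once the object is unreachable and collected
```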


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13745][SQL]Support columnar in memory r...

2016-04-25 Thread a-roberts
Github user a-roberts commented on the pull request:

https://github.com/apache/spark/pull/12397#issuecomment-214536410
  
@hvanhovell I'm working with Pete on this. With this patch around 220 
previously failing tests now pass on BE, leaving us with only a handful to 
investigate; this fix definitely takes away the majority of our problems on BE 
platforms.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #11560: [SPARK-13715] [MLLIB] Remove last usages of jblas in tes...

2016-06-23 Thread a-roberts
Github user a-roberts commented on the issue:

https://github.com/apache/spark/pull/11560
  
Sean, any plans to backport this and 
https://issues.apache.org/jira/browse/SPARK-5814 to the 1.6 branch? Happy to 
test and contribute myself in a new JIRA if need be


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #11956: [SPARK-14098][SQL] Generate Java code that gets a float/...

2016-07-07 Thread a-roberts
Github user a-roberts commented on the issue:

https://github.com/apache/spark/pull/11956
  
@robbinspg and I are evaluating this from a functional and performance 
perspective, full disclosure: we both work for IBM with @kiszk.

All unit tests pass including the new ones Ishizaki has added, we've tested 
this on a variety of platforms, both big and little-endian. This is with IBM 
Java 8 and tested on three different architectures.

We can run the benchmark with
```
bin/spark-submit --class org.apache.spark.sql.DataFrameCacheBenchmark 
sql/core/target/spark-sql_2.11-2.0.0-tests.jar
``` 

or can be run against branch-2.0 (Spark 2.0.1 snapshot) with 
```
bin/spark-submit --class org.apache.spark.sql.DataFrameCacheBenchmark 
sql/core/target/spark-sql_2.11-2.0.1-SNAPSHOT-tests.jar
```

Performance results on a few low powered testing systems are promising.

Linux on Intel: 5.3x increase
```
Stopped after 15 iterations, 2127 ms

IBM J9 VM pxa6480sr3-20160428_01 (SR3) on Linux 3.13.0-65-generic
Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz
Float Sum with PassThrough cache:    Best/Avg Time(ms)    Rate(M/s)    Per Row(ns)    Relative
----------------------------------------------------------------------------------------------
InternalRow codegen                         669 /  829         47.1          21.3        1.0X
ColumnVector codegen                        127 /  142        248.2           4.0        5.3X
```

Linux on Z: 2.7x increase
```
Stopped after 5 iterations, 2068 ms

IBM J9 VM pxz6480sr3-20160428_01 (SR3) on Linux 3.12.43-52.6-default
16/07/07 09:48:15 ERROR Utils: Process List(/usr/bin/grep, -m, 1, model name, /proc/cpuinfo) exited with code 1:
Unknown processor
Float Sum with PassThrough cache:    Best/Avg Time(ms)    Rate(M/s)    Per Row(ns)    Relative
----------------------------------------------------------------------------------------------
InternalRow codegen                         997 / 1134         31.5          31.7        1.0X
ColumnVector codegen                        371 /  414         84.7          11.8        2.7X
```

Linux on Power: 6.4x increase
```
Stopped after 7 iterations, 2099 ms

IBM J9 VM pxl6480sr3-20160428_01 (SR3) on Linux 3.13.0-61-generic
16/07/07 14:33:40 ERROR Utils: Process List(/bin/grep, -m, 1, model name, /proc/cpuinfo) exited with code 1:
Unknown processor
Float Sum with PassThrough cache:    Best/Avg Time(ms)    Rate(M/s)    Per Row(ns)    Relative
----------------------------------------------------------------------------------------------
InternalRow codegen                        1199 / 1212         26.2          38.1        1.0X
ColumnVector codegen                        186 /  300        168.8           5.9        6.4X
```

So the performance increase and the functionality are solid across platforms; Ishizaki has tested this with OpenJDK 8 as well.

One improvement would be to add a scale factor parameter so we can use more data than:
```
doubleSumBenchmark(1024 * 1024 * 15)
floatSumBenchmark(1024 * 1024 * 30)
```
and with no parameter we'd use the above as a standard/baseline. 
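
For illustration, the scale factor could look something like this (a sketch only; the argument handling is assumed and the two methods are the existing ones in DataFrameCacheBenchmark):

```scala
// Hypothetical main for DataFrameCacheBenchmark: an optional scale argument
// multiplies the current row counts, and no argument keeps today's baseline.
def main(args: Array[String]): Unit = {
  val scale = if (args.nonEmpty) args(0).toInt else 1
  doubleSumBenchmark(1024 * 1024 * 15 * scale)
  floatSumBenchmark(1024 * 1024 * 30 * scale)
}
```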

It would also be useful to have the master URL as a parameter so we can easily run this on many machines or with more cores to see the performance/functional impact when we scale (exercising various JIT levels, for example).




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14379: [SPARK-16751] Upgrade Derby, remove from package

2016-07-27 Thread a-roberts
GitHub user a-roberts opened a pull request:

https://github.com/apache/spark/pull/14379

[SPARK-16751] Upgrade Derby, remove from package

## What changes were proposed in this pull request?

Version of derby upgraded based on important security info at VersionEye. 
Test scope added so we don't include it in our final package anyway. NB: I 
think this should be backported to all previous releases as it is a security 
problem https://www.versioneye.com/java/org.apache.derby:derby/10.11.1.1

The CVE number is 2015-1832. I also suggest we add a SECURITY tag for JIRAs

## How was this patch tested?
Existing tests with the change, making sure that we see no new failures. I checked that derby 10.12.x, and not derby 10.11.x, is downloaded to our ~/.m2 folder.

I then used dev/make-distribution.sh and checked the dist/jars folder for 
Spark 2.0: no derby jar is present.

I don't know if this would also remove it from the assembly jar in our 1.x 
branches.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/a-roberts/spark patch-4

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/14379.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #14379


commit 910520efbd656ec960e1c4ec228b928fcee80be9
Author: Adam Roberts 
Date:   2016-07-27T12:53:46Z

[SPARK-16751] Upgrade Derby, remove from package

## What changes were proposed in this pull request?

Version of derby upgraded based on important security info at VersionEye. 
Test scope added so we don't include it in our final package anyway. NB: I 
think this should be backported to all previous releases as it is a security 
problem https://www.versioneye.com/java/org.apache.derby:derby/10.11.1.1

The CVE number is 2015-1832. I also suggest we add a SECURITY tag for JIRAs

## How was this patch tested?
Existing tests with the change, making sure that we see no new failures. I checked that derby 10.12.x, and not derby 10.11.x, is downloaded to our ~/.m2 folder.

I then used dev/make-distribution.sh and checked the dist/jars folder for 
Spark 2.0: no derby jar is present.

I don't know if this would also remove it from the assembly jar in our 1.x 
branches.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14379: [SPARK-16751] Upgrade Derby, remove from package

2016-07-27 Thread a-roberts
Github user a-roberts commented on the issue:

https://github.com/apache/spark/pull/14379
  
Noticed the /home/jenkins/workspace/SparkPullRequestBuilder/dev/test-dependencies.sh problem; I need to regenerate (I slipped up here when upgrading Hadoop last time too), on the case...


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14379: [SPARK-16751] Upgrade Derby, remove from package

2016-07-27 Thread a-roberts
Github user a-roberts commented on the issue:

https://github.com/apache/spark/pull/14379
  
Ran **dev/test-dependencies.sh --replace-manifest**, then git status to see the changed files, and updated the PR.





---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14379: [SPARK-16751] Upgrade Derby, remove from package

2016-07-27 Thread a-roberts
Github user a-roberts commented on the issue:

https://github.com/apache/spark/pull/14379
  
Jenkins retest this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14379: [SPARK-16751] Upgrade Derby, remove from package

2016-07-27 Thread a-roberts
Github user a-roberts commented on the issue:

https://github.com/apache/spark/pull/14379
  
Interesting point, and you'd think that requires derby classes; out of curiosity, what actually happens? I've noticed with bin/spark-shell we get a metastore_db folder and derby.log - without derby on our classpath do we get NoClassDefFound exceptions? I can give this a try tomorrow if there's no quick answer here.

If that is the case then we can simply remove the test I added and ship derby 10.12.1.1 instead of 10.11.1.1.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14379: [SPARK-16751] Upgrade derby to 10.12.1.1

2016-07-28 Thread a-roberts
Github user a-roberts commented on the issue:

https://github.com/apache/spark/pull/14379
  
Jenkins retest this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14379: [SPARK-16751] Upgrade derby to 10.12.1.1

2016-07-28 Thread a-roberts
Github user a-roberts commented on the issue:

https://github.com/apache/spark/pull/14379
  
One failure at 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62943/testReport/
which looks to be a timeout or a flaky test and not caused by this PR. Let's test once more to see if it is consistent, as the "age" on Jenkins doesn't help here (it usually gives us a history of this test failing).

[info] DirectKafkaStreamSuite:
[info] - basic stream receiving with multiple topics and smallest starting 
offset (6 seconds, 566 milliseconds)
[info] - pattern based subscription *** FAILED *** (21 seconds, 148 
milliseconds)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14558][CORE] In ClosureCleaner, clean t...

2016-05-27 Thread a-roberts
Github user a-roberts commented on the pull request:

https://github.com/apache/spark/pull/12327#issuecomment-222107940
  
> cacheSize1 and cacheSize2 are both the size after cleaning. The 
difference is that, cacheSize1 is the size after cleaned the data with line 
object reference, cacheSize2 is the size after cleaned the data without line 
object reference.

Looking for clarity here, is it true that clean "with the reference" should 
be bigger (cacheSize1) and clean "without the reference" should be smaller 
(cacheSize2)? 

OpenJDK, cacheSize1: 180392, cacheSize2: 187896 (bigger without the line 
object reference)

IBM JDK, cacheSize1: 354692, cacheSize2: 263800 (smaller without the line 
object reference)

What exactly does "without line object reference" mean and should 
cacheSize1 be smaller or bigger than cacheSize2?

I know the SizeEstimator overestimates for IBM Java so our cached footprint is much larger (we're handling this), and because of the larger difference we get this test failing; OpenJDK **fails** with Kryo and IBM **passes** with Kryo for this test.

A better check would be to run with and without the closure cleaner change and verify that the second result is smaller by the size of the line object. Based on our cacheSize2 being smaller (without the line object reference), I'm thinking that IBM Java functions as expected and OpenJDK doesn't - but this depends on my questions above; interested to hear what you think @cloud-fan
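
To make that check concrete, I'm picturing something like this (a sketch only: the sizes and the line object below are placeholders, and SizeEstimator is Spark's org.apache.spark.util.SizeEstimator):

```scala
import org.apache.spark.util.SizeEstimator

// Placeholder values standing in for the measured cache sizes; in the real
// test the line object would be the REPL-generated object the closure captures.
val replLineObject: AnyRef = new Object
val lineObjectSize = SizeEstimator.estimate(replLineObject)
val cacheSize1 = 354692L  // cleaned, line object reference kept
val cacheSize2 = 263800L  // cleaned, line object reference dropped
assert(cacheSize2 <= cacheSize1 - lineObjectSize,
  "cached size should shrink by at least the line object's size after cleaning")
```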


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13556: [SPARK-15818] Upgrade to Hadoop 2.7.2

2016-06-08 Thread a-roberts
GitHub user a-roberts opened a pull request:

https://github.com/apache/spark/pull/13556

[SPARK-15818] Upgrade to Hadoop 2.7.2

## What changes were proposed in this pull request?

Updating the Hadoop version from 2.7.0 to 2.7.2 if we use the Hadoop-2.7 
build profile


## How was this patch tested?

Existing tests



I'd like us to use Hadoop 2.7.2 owing to the Hadoop release notes stating 
Hadoop 2.7.0 is not ready for production use

https://hadoop.apache.org/docs/r2.7.0/ states

"Apache Hadoop 2.7.0 is a minor release in the 2.x.y release line, building 
upon the previous stable release 2.6.0.
This release is not yet ready for production use. Production users should 
use 2.7.1 release and beyond."

Hadoop 2.7.1 release notes:
"Apache Hadoop 2.7.1 is a minor release in the 2.x.y release line, building 
upon the previous release 2.7.0. This is the next stable release after Apache 
Hadoop 2.6.x."

And then Hadoop 2.7.2 release notes:
"Apache Hadoop 2.7.2 is a minor release in the 2.x.y release line, building 
upon the previous stable release 2.7.1."

I've tested this is OK with Intel hardware and IBM Java 8 so let's test it 
with OpenJDK, ideally this will be pushed to branch-2.0 and master.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/a-roberts/spark patch-2

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/13556.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #13556


commit 04b16494a0c484f90d7eea05b9807f8598b89903
Author: Adam Roberts 
Date:   2016-06-08T08:46:36Z

[SPARK-15818] Upgrade to Hadoop 2.7.2

I'd like us to use Hadoop 2.7.2 owing to the Hadoop release notes stating 
Hadoop 2.7.0 is not ready for production use

https://hadoop.apache.org/docs/r2.7.0/ states

"Apache Hadoop 2.7.0 is a minor release in the 2.x.y release line, building 
upon the previous stable release 2.6.0.
This release is not yet ready for production use. Production users should 
use 2.7.1 release and beyond."

Hadoop 2.7.1 release notes:
"Apache Hadoop 2.7.1 is a minor release in the 2.x.y release line, building 
upon the previous release 2.7.0. This is the next stable release after Apache 
Hadoop 2.6.x."

And then Hadoop 2.7.2 release notes:
"Apache Hadoop 2.7.2 is a minor release in the 2.x.y release line, building 
upon the previous stable release 2.7.1."

I've tested this is OK with Intel hardware and IBM Java 8 so let's test it 
with OpenJDK, ideally this will be pushed to branch-2.0 and master.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #13556: [SPARK-15818] [BUILD] Upgrade to Hadoop 2.7.2

2016-06-08 Thread a-roberts
Github user a-roberts commented on the issue:

https://github.com/apache/spark/pull/13556
  
Good point, I see the latest Hadoop 2.6 version is 2.6.4 (mentions fixing 
critical bugs) and for 2.5 it's 2.5.2 (same story), so ideally we'd ensure each 
"best for profile" version gets used.

Should I submit the version changes with this pull request so that instead it's an "Update Hadoop versions" change?

Does your build farm exercise all Hadoop permutations (so we can test the best available 2.3.x, 2.4.x, 2.5.x, 2.6.x, and 2.7.x)?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #13556: [SPARK-15818] [BUILD] Upgrade to Hadoop 2.7.2

2016-06-08 Thread a-roberts
Github user a-roberts commented on the issue:

https://github.com/apache/spark/pull/13556
  
Best versions for our profiles are as follows

2.7: 2.7.2
2.6: 2.6.4
2.5: 2.5.2
2.4: 2.4.1 (users immediately encouraged to move up for a security fix)
2.3: 2.3.0 (as is)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #13556: [SPARK-15818] [BUILD] Upgrade to Hadoop 2.7.2

2016-06-08 Thread a-roberts
Github user a-roberts commented on the issue:

https://github.com/apache/spark/pull/13556
  
Done; it didn't diff automatically so I ran git diff myself and see plenty of differences (as you'd expect, Hadoop transitive dependencies should now be, say, 2.4.1 not 2.4.0, 2.7.2 not 2.7.0, etc.).

Where do we update the expected values? Grepping for "hadoop-annotations" 
for example only shows dev/deps/spark-deps*jar name*.

Looking in the script I see pr-deps is created based on the Hadoop 
profiles, so perhaps we need to add the explicit Hadoop version when we define 
the Hadoop profiles...

Sample output for 2.7:

-hadoop-annotations-2.7.0.jar
-hadoop-auth-2.7.0.jar
-hadoop-client-2.7.0.jar
-hadoop-common-2.7.0.jar
-hadoop-hdfs-2.7.0.jar
-hadoop-mapreduce-client-app-2.7.0.jar
-hadoop-mapreduce-client-common-2.7.0.jar
-hadoop-mapreduce-client-core-2.7.0.jar
-hadoop-mapreduce-client-jobclient-2.7.0.jar
-hadoop-mapreduce-client-shuffle-2.7.0.jar
-hadoop-yarn-api-2.7.0.jar
-hadoop-yarn-client-2.7.0.jar
-hadoop-yarn-common-2.7.0.jar
-hadoop-yarn-server-common-2.7.0.jar
-hadoop-yarn-server-web-proxy-2.7.0.jar
+hadoop-annotations-2.7.2.jar
+hadoop-auth-2.7.2.jar
+hadoop-client-2.7.2.jar
+hadoop-common-2.7.2.jar
+hadoop-hdfs-2.7.2.jar
+hadoop-mapreduce-client-app-2.7.2.jar
+hadoop-mapreduce-client-common-2.7.2.jar
+hadoop-mapreduce-client-core-2.7.2.jar
+hadoop-mapreduce-client-jobclient-2.7.2.jar
+hadoop-mapreduce-client-shuffle-2.7.2.jar
+hadoop-yarn-api-2.7.2.jar
+hadoop-yarn-client-2.7.2.jar
+hadoop-yarn-common-2.7.2.jar
+hadoop-yarn-server-common-2.7.2.jar
+hadoop-yarn-server-web-proxy-2.7.2.jar


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13562: [SPARK-15821] [DOCS] Include parallel build info

2016-06-08 Thread a-roberts
GitHub user a-roberts opened a pull request:

https://github.com/apache/spark/pull/13562

[SPARK-15821] [DOCS] Include parallel build info

## What changes were proposed in this pull request?

Mention that users can build Spark using multiple threads to decrease build 
times

## How was this patch tested?

Built on machines with between one and 192 cores using mvn -T 1C and observed faster build times with no loss in stability.

In response to the question here 
https://issues.apache.org/jira/browse/SPARK-15821 I think we should suggest 
this option as we know it works for Spark and can result in faster builds

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/a-roberts/spark patch-3

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/13562.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #13562


commit ce864e9b6721252900bdb5b2f90083e3c8bb65a0
Author: Adam Roberts 
Date:   2016-06-08T14:09:59Z

[SPARK-15821] [DOCS] Include parallel build info

In response to the question here 
https://issues.apache.org/jira/browse/SPARK-15821 I think we should suggest 
this option as we know it works for Spark and can result in faster builds




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #13562: [SPARK-15821] [DOCS] Include parallel build info

2016-06-08 Thread a-roberts
Github user a-roberts commented on the issue:

https://github.com/apache/spark/pull/13562
  
Ah, yep, I did mean using this to speed up the Jenkins jobs (not something I'd be able to change, presumably). I see we can already provide our own Maven command and this could include -T 1C, for example; perhaps it's a case of mentioning this in the prints? Alternatively we'd add it as a parameter - what do you think? Whatever's easiest for folks to use, IMO.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #13589: [SPARK-15822][SPARK-15825][SQL] Fix SMJ Segfault/Invalid...

2016-06-10 Thread a-roberts
Github user a-roberts commented on the issue:

https://github.com/apache/spark/pull/13589
  
Still getting seg faults on Power and with Intel on OpenJDK with this 
change when we use spark-submit with two worker instances.

Git diff

```
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
@@ -490,6 +490,7 @@ class CodegenContext {
       addNewFunction(compareFunc, funcCode)
       s"this.$compareFunc($c1, $c2)"
     case schema: StructType =>
+      INPUT_ROW = "i"
       val comparisons = GenerateOrdering.genComparisons(this, schema)
       val compareFunc = freshName("compareStruct")
       val funcCode: String =
```

segv occurs four times

```
aroberts@jtcspark03:~/09JuneClone/Spark-DK$ cat work/app-20160610093117-/*/stderr | grep -C 2 "Segmentation error"
16/06/10 09:34:27 INFO CodeGenerator: Code generated in 115.554506 ms
Unhandled exception
Type=Segmentation error vmState=0x
J9Generic_Signal_Number=0004 Signal_Number=000b Error_Value= Signal_Code=0001
Handler1=3FFF7B487AA0 Handler2=3FFF7B312AC0
```

**offHeap is NOT ENABLED**

spark-env.sh has
```
export SPARK_WORKER_CORES=2
export SPARK_WORKER_INSTANCES=2
```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #13562: [SPARK-15821] [DOCS] Include parallel build info

2016-06-10 Thread a-roberts
Github user a-roberts commented on the issue:

https://github.com/apache/spark/pull/13562
  
Can we get this merged for the docs at least? 

@JoshRosen, perhaps we could add some logic for the build command in the 
Jenkins job to determine how many threads to run with (how many are available; 
OK let's use half of them and save the others for future work).

I've also added a note in dev/make-distribution.sh and fixed the missing 
Hadoop-2.7 profile mention


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13562: [SPARK-15821] [DOCS] Include parallel build info

2016-06-10 Thread a-roberts
Github user a-roberts commented on a diff in the pull request:

https://github.com/apache/spark/pull/13562#discussion_r66610434
  
--- Diff: dev/make-distribution.sh ---
@@ -42,6 +42,7 @@ function exit_with_usage {
   echo "usage:"
   cl_options="[--name] [--tgz] [--mvn ]"
   echo "make-distribution.sh $cl_options "
+  echo "Decrease build times with the -T option: -T 1C will result in 
Maven using one thread per core."
--- End diff --

I should have been clearer: the idea was to document that **if you specify a Maven command** you can pass -T 1C to that Maven command, but I like what you tried and think this is how it should be implemented, so I'm adding this feature and testing with:

dev/make-distribution.sh --name hadoop-2.7.0 -T 1C -Psparkr -Pyarn -Phadoop-2.7 -Phive -Phive-thriftserver -Dscala-2.11

Will update my pull request shortly, almost done


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13562: [SPARK-15821] [DOCS] Include parallel build info

2016-06-10 Thread a-roberts
Github user a-roberts commented on a diff in the pull request:

https://github.com/apache/spark/pull/13562#discussion_r66624671
  
--- Diff: dev/make-distribution.sh ---
@@ -42,6 +42,8 @@ function exit_with_usage {
   echo "usage:"
   cl_options="[--name] [--tgz] [--mvn ]"
   echo "make-distribution.sh $cl_options "
+  echo "Decrease build times by adding the -T option: -T 1C will result in 
Maven using one thread per core."
+  echo "For example, dev/make-distribution.sh -T 1C --name hadoop-2.7.2 
-Psparkr -Pyarn -Phadoop-2.7 -Phive -Phive-thriftserver -Dscala-2.11"
--- End diff --

I wanted to use --mvn -T 1C in the first place, but this gives

```
dev/make-distribution.sh: line 117: command: -T: invalid option
command: usage: command [-pVv] command [arg ...]
+ '[' '!' '' ']'
+ echo -e 'Could not locate Maven command: '\''-T'\''.'
Could not locate Maven command: '-T'.
+ echo -e 'Specify the Maven command with the --mvn flag'
Specify the Maven command with the --mvn flag
+ exit -1
```

when I try

```
dev/make-distribution.sh --name hadoop-2.7.2 --mvn -T 1C  -Pyarn 
-Phadoop-2.7 -Phive -Phive-thriftserver -Dscala-2.11
```

Given that MAVEN_OPTS holds JVM args, I think this is the best approach unless I'm using --mvn incorrectly.

I'll remove the profiles as you're right that this is really just to demonstrate multithreaded building. I see now that the community downloads only feature the two-digit Hadoop version so, agreed, let's go with that here too for consistency.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13562: [SPARK-15821] [DOCS] Include parallel build info

2016-06-10 Thread a-roberts
Github user a-roberts commented on a diff in the pull request:

https://github.com/apache/spark/pull/13562#discussion_r66686256
  
--- Diff: dev/make-distribution.sh ---
@@ -150,7 +156,13 @@ export MAVEN_OPTS="${MAVEN_OPTS:--Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCac
 # Store the command as an array because $MVN variable might have spaces in it.
 # Normal quoting tricks don't work.
 # See: http://mywiki.wooledge.org/BashFAQ/050
-BUILD_COMMAND=("$MVN" clean package -DskipTests $@)
+
+# If NUM_THREADS is set we actually want to add the -T in (removed with shift earlier)
+if [ -n "$NUM_THREADS" ]; then
+  MVN_T_OPTION="-T"
+fi
+
+BUILD_COMMAND=("$MVN" $MVN_T_OPTION $NUM_THREADS clean package -DskipTests $@)
--- End diff --

That's true and it's unlikely users will use anything but 1C as the value, 
so I agree let's add -T 1C and keep it simple

Changeset will therefore be

- Adding the comment in README.md that you can use it
- Hardcoding the above
- Adding the missing hadoop-2.7 example profile


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #13562: [SPARK-15821] [DOCS] Include parallel build info

2016-06-11 Thread a-roberts
Github user a-roberts commented on the issue:

https://github.com/apache/spark/pull/13562
  
Yep, I let the script complete and we get exactly the same artifacts as before but without waiting so long: on an 8 core box a 7m 48s build time becomes 4m 38s, and on a 2 core box I see 15 minutes flat becoming 13 minutes.

We could save even more time by taking out the clean (a 4 minute time instead of 13, but of course sometimes you want the clean target, especially when releasing); as you've said, this script rarely gets used except for releases so I wouldn't want to spend long here.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12785][SQL] Add ColumnarBatch, an in me...

2016-04-01 Thread a-roberts
Github user a-roberts commented on the pull request:

https://github.com/apache/spark/pull/10628#issuecomment-204397652
  
@nongli @davies 

Hi, by implementing the feature this way we've prevented every big-endian user from using this feature with Spark - Z systems, for example, or big-endian Power machines. That's critical if we want Spark to be used by large industries such as those that IBM et al support.

Is there any intention to have this not rely on the byte ordering of the platform? putLittleEndian is not going to cut it when we want to use Spark to handle large volumes of customer data that traditionally sits on big-endian systems.

Also, are there any design docs for this? I found the documentation for 
Tungsten (UnsafeRows) to be very useful for understanding and debugging 
purposes.
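
To spell out the concern, the byte order has to be made explicit when decoding rather than assuming the platform's native order; a minimal sketch (plain java.nio, nothing from ColumnarBatch itself):

```scala
import java.nio.{ByteBuffer, ByteOrder}

// The same four bytes decode to different ints depending on the order chosen,
// so a layout written assuming little-endian reads back wrong on big-endian
// JVMs unless the reader forces the order explicitly.
val bytes = Array[Byte](1, 0, 0, 0)
val asLittle = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN).getInt // 1
val asBig    = ByteBuffer.wrap(bytes).order(ByteOrder.BIG_ENDIAN).getInt    // 16777216
```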


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: Specify stack size for consistency with Java t...

2015-06-09 Thread a-roberts
GitHub user a-roberts opened a pull request:

https://github.com/apache/spark/pull/6727

Specify stack size for consistency with Java tests - resolves test failures

This change is a simple one and specifies a stack size of 4096k instead of 
the vendor default for Java tests (the defaults vary between Java vendors). 
This remedies test failures observed with JavaALSSuite with IBM and Oracle Java 
owing to a lower default size in comparison to the size with OpenJDK. 4096k is 
a suitable default where the tests pass with each Java vendor tested. The 
alternative is to reduce the number of iterations in the test (no observed 
failures with 5 iterations instead of 15).

-Xss works with Oracle's HotSpot VM, IBM's J9 VM and OpenJDK (IcedTea).

I have ensured this does not have any negative implications for other tests.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/a-roberts/spark IncJavaStackSize

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/6727.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #6727


commit 5032d8d3921149f8f55c11a01094c2165b52201d
Author: a-roberts 
Date:   2015-06-09T17:49:21Z

Update pom.xml

Increase stack size - defaults vary between Java vendors. This remedies 
test failures observed with JavaALSSuite with IBM & Oracle Java owing to lower 
default size in comparison to OpenJDK.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: Specify stack size for consistency with Java t...

2015-06-09 Thread a-roberts
Github user a-roberts commented on the pull request:

https://github.com/apache/spark/pull/6727#issuecomment-110451924
  
Sean, excellent point about the SBT build, we've been using Maven 
exclusively for testing - if there's also, say, an -Xmx3g (e.g. an argline 
equivalent as in the pom file for Maven) then indeed such an addition will need 
to be made too. I'll have a look, suspecting the answer is yes; it would make 
sense for such additional options to be mirrored


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: Specify stack size for consistency with Java t...

2015-06-09 Thread a-roberts
Github user a-roberts commented on the pull request:

https://github.com/apache/spark/pull/6727#issuecomment-110462157
  
You're right, at project/SparkBuild.scala it's clear we'll need to increase 
the stack size here too - should I create a new pull request or can we add to 
the code changed here?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-7756] CORE RDDOperationScope fix for IB...

2015-06-10 Thread a-roberts
GitHub user a-roberts opened a pull request:

https://github.com/apache/spark/pull/6740

[SPARK-7756] CORE RDDOperationScope fix for IBM Java

IBM Java has an extra method when we do getStackTrace(): this is 
"getStackTraceImpl", a native method. This causes two tests to fail within 
"DStreamScopeSuite" when running with IBM Java. Instead of "map" or "filter" 
being the method names found, "getStackTrace" is returned. This commit 
addresses such an issue by using dropWhile. Given that our current method is 
withScope, we look for the next method that isn't ours: we don't care about 
methods that come before us in the stack trace: e.g. getStackTrace (regardless 
of how many levels this might go).

IBM:
java.lang.Thread.getStackTraceImpl(Native Method)
java.lang.Thread.getStackTrace(Thread.java:1117)

org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:104)

Oracle:
PRINTING STACKTRACE!!!
java.lang.Thread.getStackTrace(Thread.java:1552)

org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:106)

I've tested this with Oracle and IBM Java, no side effects for other tests 
introduced.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/a-roberts/spark RDDScopeStackCrawlFix

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/6740.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #6740
    

commit a4fc0e025d41d197c55ce43bca14ef669aa9bc68
Author: a-roberts 
Date:   2015-06-09T18:01:43Z

Update RDDOperationScope.scala




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-7756] CORE RDDOperationScope fix for IB...

2015-06-10 Thread a-roberts
Github user a-roberts commented on a diff in the pull request:

https://github.com/apache/spark/pull/6740#discussion_r32104008
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/RDDOperationScope.scala ---
@@ -95,17 +95,17 @@ private[spark] object RDDOperationScope extends Logging {
   private[spark] def withScope[T](
       sc: SparkContext,
       allowNesting: Boolean = false)(body: => T): T = {
-    val stackTrace = Thread.currentThread.getStackTrace().tail // ignore "Thread#getStackTrace"
-    val ourMethodName = stackTrace(1).getMethodName // i.e. withScope
-    // Climb upwards to find the first method that's called something different
-    val callerMethodName = stackTrace
-      .find(_.getMethodName != ourMethodName)
+
+    val callerMethodName = Thread.currentThread.getStackTrace()
+      .dropWhile(! _.getMethodName().equals("withScope"))
--- End diff --

The original code captures the stack trace within this method - at line 99, 
this will always contain `withScope` as the code comment suggests (unless this 
method name changes). 
The new code is dropping elements in the stack trace until we see 
`withScope` and follows the same logic henceforth. 
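
For reference, the dropWhile approach reads roughly like this in isolation (a sketch, not the exact PR code):

```scala
// Skip any vendor-specific frames (such as IBM's getStackTraceImpl) that come
// before withScope, then take the first frame after the withScope frames as
// the caller, regardless of how many frames precede it.
val callerMethodName = Thread.currentThread.getStackTrace()
  .dropWhile(_.getMethodName != "withScope")
  .find(_.getMethodName != "withScope")
  .map(_.getMethodName)
  .getOrElse("unknown")
```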


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-7756] CORE RDDOperationScope fix for IB...

2015-06-10 Thread a-roberts
Github user a-roberts commented on the pull request:

https://github.com/apache/spark/pull/6740#issuecomment-110688022
  
Yep, cheers, updated.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-10949] Update Snappy version to 1.1.2

2015-10-28 Thread a-roberts
Github user a-roberts commented on the pull request:

https://github.com/apache/spark/pull/8995#issuecomment-151978162
  
Hi, I have been enjoying Spark Summit Europe; it's better for somebody else to take this over. It looks like the files I changed have moved around since the testing, so I imagine it won't take long anyway.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-10949] Update Snappy version to 1.1.2

2015-10-06 Thread a-roberts
GitHub user a-roberts opened a pull request:

https://github.com/apache/spark/pull/8995

[SPARK-10949] Update Snappy version to 1.1.2

Snappy now supports concatenation of serialized streams; this patch contains a version number change, and the "does not support" test is now a "supports" test.

Snappy 1.1.2 changelog mentions:
snappy-java-1.1.2 (22 September 2015)
This is a backward compatible release for 1.1.x.
Add AIX (32-bit) support.
There is no upgrade for the native libraries of the other platforms.

A major change since 1.1.1 is a support for reading concatenated results of 
SnappyOutputStream(s)
snappy-java-1.1.2-RC2 (18 May 2015)
Fix #107: SnappyOutputStream.close() is not idempotent
snappy-java-1.1.2-RC1 (13 May 2015)
SnappyInputStream now supports reading concatenated compressed results of 
SnappyOutputStream
There has been no compressed format change since 1.0.5.x. So You can read 
the compressed results interchangeablly between these versions.
Fixes a problem when java.io.tmpdir does not exist.

From https://github.com/xerial/snappy-java/blob/develop/Milestone.md and up 
to date at the time of this pull request

Also note https://github.com/xerial/snappy-java/issues/103
"@xerial not sure how feasible or likely it is for this to happen, but it'd 
help tremendously Spark's performance because we are experimenting with a new 
shuffle path that uses channel.transferTo to avoid user space copying. However, 
for that to work, we'd need the underlying format to support concatenation. As 
far we know, LZF has this property, and Snappy might also have it (but 
snappy-java implementation doesn't support it)."

Would be useful to have this in both the 1.5 and 1.6 branches
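
To illustrate what the concatenation support means in practice (a sketch assuming snappy-java's SnappyOutputStream/SnappyInputStream classes; this is the property the channel.transferTo shuffle path relies on):

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import org.xerial.snappy.{SnappyInputStream, SnappyOutputStream}

// Compress a string into its own self-contained Snappy stream.
def compress(s: String): Array[Byte] = {
  val bytes = new ByteArrayOutputStream()
  val snappy = new SnappyOutputStream(bytes)
  snappy.write(s.getBytes("UTF-8"))
  snappy.close()
  bytes.toByteArray
}

// With 1.1.2, two independently compressed blobs written back to back can be
// read as one stream - exactly what concatenating shuffle outputs needs.
val concatenated = compress("hello ") ++ compress("world")
val in = new SnappyInputStream(new ByteArrayInputStream(concatenated))
val roundTripped = scala.io.Source.fromInputStream(in).mkString // "hello world"
```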

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/a-roberts/spark master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/8995.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #8995






---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


