[GitHub] spark issue #18179: [SPARK-20894][SS] Resolve the checkpoint location in dri...

2017-08-07 Thread markgrover
Github user markgrover commented on the issue:

https://github.com/apache/spark/pull/18179
  
Sounds good to me. The JIRA was unresolved; I have resolved it with a fix 
version of 2.3.0.





[GitHub] spark issue #18802: [SPARK-18535][SPARK-19720][CORE][BACKPORT-2.1] Redact se...

2017-08-07 Thread markgrover
Github user markgrover commented on the issue:

https://github.com/apache/spark/pull/18802
  
Thanks @dmvieira and @vanzin 





[GitHub] spark pull request #18765: [SPARK-19720][CORE][BACKPORT-2.1] Redact sensitiv...

2017-08-01 Thread markgrover
Github user markgrover commented on a diff in the pull request:

https://github.com/apache/spark/pull/18765#discussion_r130668429
  
--- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
@@ -2571,6 +2572,23 @@ private[spark] object Utils extends Logging {
   sparkJars.map(_.split(",")).map(_.filter(_.nonEmpty)).toSeq.flatten
 }
   }
+
+  private[util] val REDACTION_REPLACEMENT_TEXT = "*(redacted)"
+  private[util] val SECRET_REDACTION_PATTERN = "(?i)secret|password".r
--- End diff --

I think what's really happening here is that we are backporting some 
changes introduced in SPARK-18535 while backporting this JIRA (SPARK-19720). 
SPARK-18535 is a dependency of this, so if we want to backport this, we should 
really be backporting SPARK-18535 as well.
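
For context, a minimal sketch of the key-based redaction that these two 
constants drive, mirroring the shape of `Utils.redact` quoted later in this 
thread; treat it as an illustration of the backported behavior, not the exact 
patch:

```
import scala.util.matching.Regex

object RedactionSketch {
  // Values taken from the diff above.
  private val REDACTION_REPLACEMENT_TEXT = "*(redacted)"
  private val SECRET_REDACTION_PATTERN: Regex = "(?i)secret|password".r

  // Replace the value of any key whose name matches the secret pattern.
  def redact(kvs: Seq[(String, String)]): Seq[(String, String)] =
    kvs.map { case (k, v) =>
      SECRET_REDACTION_PATTERN.findFirstIn(k)
        .map(_ => (k, REDACTION_REPLACEMENT_TEXT))
        .getOrElse((k, v))
    }
}
```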





[GitHub] spark issue #18179: [SPARK-20894][SS] Resolve the checkpoint location in dri...

2017-06-26 Thread markgrover
Github user markgrover commented on the issue:

https://github.com/apache/spark/pull/18179
  
Doesn't look like this made it to 2.2. Should we mark the JIRA resolved with 
2.3 as the target version? I am assuming it's too late for this to go into 
2.2.0. Maybe retarget this for 2.2.1?





[GitHub] spark issue #18234: [SPARK-19185][DSTREAM] Make Kafka consumer cache configu...

2017-06-08 Thread markgrover
Github user markgrover commented on the issue:

https://github.com/apache/spark/pull/18234
  
Thanks @vanzin 





[GitHub] spark issue #18234: [SPARK-19185][DSTREAM] Make Kafka consumer cache configu...

2017-06-07 Thread markgrover
Github user markgrover commented on the issue:

https://github.com/apache/spark/pull/18234
  
Thanks all.

On Jun 7, 2017 6:41 PM, "Cody Koeninger" <notificati...@github.com> wrote:

> LGTM, thanks Mark
>
> —
> You are receiving this because you authored the thread.
> Reply to this email directly, view it on GitHub
> <https://github.com/apache/spark/pull/18234#issuecomment-306973651>, or mute
> the thread
> <https://github.com/notifications/unsubscribe-auth/ABoVi-xu4UnJIgzldw0p3CMSFqDao7kKks5sB1FGgaJpZM4NzCo1>
> .
>






[GitHub] spark pull request #18234: [SPARK-19185][DSTREAM] Make Kafka consumer cache ...

2017-06-07 Thread markgrover
Github user markgrover commented on a diff in the pull request:

https://github.com/apache/spark/pull/18234#discussion_r120778946
  
--- Diff: docs/streaming-kafka-0-10-integration.md ---
@@ -91,7 +91,7 @@ The new Kafka consumer API will pre-fetch messages into 
buffers.  Therefore it i
 
 In most cases, you should use `LocationStrategies.PreferConsistent` as 
shown above.  This will distribute partitions evenly across available 
executors.  If your executors are on the same hosts as your Kafka brokers, use 
`PreferBrokers`, which will prefer to schedule partitions on the Kafka leader 
for that partition.  Finally, if you have a significant skew in load among 
partitions, use `PreferFixed`. This allows you to specify an explicit mapping 
of partitions to hosts (any unspecified partitions will use a consistent 
location).
 
-The cache for consumers has a default maximum size of 64.  If you expect 
to be handling more than (64 * number of executors) Kafka partitions, you can 
change this setting via `spark.streaming.kafka.consumer.cache.maxCapacity`
+The cache for consumers has a default maximum size of 64.  If you expect 
to be handling more than (64 * number of executors) Kafka partitions, you can 
change this setting via `spark.streaming.kafka.consumer.cache.maxCapacity`. If 
you would like to disable the caching for Kafka consumers, you can set 
`spark.streaming.kafka.consumer.cache.enabled` to `false`.
--- End diff --

OK, thanks. I have updated the doc to add a little more context. If you 
could review it and help this get committed, I'd really appreciate that. Thanks 
again!
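
For reference, a hedged sketch of how the settings documented above would be 
applied with the 0-10 direct stream. The two 
`spark.streaming.kafka.consumer.cache.*` keys come from the doc change itself; 
the broker address, group id, topic name, and batch interval are illustrative:

```
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

val conf = new SparkConf()
  .setAppName("kafka-cache-example")
  // Raise the consumer cache ceiling above the default of 64...
  .set("spark.streaming.kafka.consumer.cache.maxCapacity", "128")
  // ...or turn caching off entirely, the knob this PR adds.
  .set("spark.streaming.kafka.consumer.cache.enabled", "false")

val ssc = new StreamingContext(conf, Seconds(10))
val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "example-group")

val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](Seq("example-topic"), kafkaParams))
```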





[GitHub] spark pull request #18234: [SPARK-19185][DSTREAM] Make Kafka consumer cache ...

2017-06-07 Thread markgrover
Github user markgrover commented on a diff in the pull request:

https://github.com/apache/spark/pull/18234#discussion_r120703106
  
--- Diff: docs/streaming-kafka-0-10-integration.md ---
@@ -91,7 +91,7 @@ The new Kafka consumer API will pre-fetch messages into 
buffers.  Therefore it i
 
 In most cases, you should use `LocationStrategies.PreferConsistent` as 
shown above.  This will distribute partitions evenly across available 
executors.  If your executors are on the same hosts as your Kafka brokers, use 
`PreferBrokers`, which will prefer to schedule partitions on the Kafka leader 
for that partition.  Finally, if you have a significant skew in load among 
partitions, use `PreferFixed`. This allows you to specify an explicit mapping 
of partitions to hosts (any unspecified partitions will use a consistent 
location).
 
-The cache for consumers has a default maximum size of 64.  If you expect 
to be handling more than (64 * number of executors) Kafka partitions, you can 
change this setting via `spark.streaming.kafka.consumer.cache.maxCapacity`
+The cache for consumers has a default maximum size of 64.  If you expect 
to be handling more than (64 * number of executors) Kafka partitions, you can 
change this setting via `spark.streaming.kafka.consumer.cache.maxCapacity`. If 
you would like to disable the caching for Kafka consumers, you can set 
`spark.streaming.kafka.consumer.cache.enabled` to `false`.
--- End diff --

I am open to either option. I'd slightly prefer documenting that option and 
saying something about there being no guarantee.

Thanks for reviewing, Sean. @koeninger would appreciate your review as 
well, thanks.





[GitHub] spark issue #18234: [SPARK-19185][DSTREAM] Make Kafka consumer cache configu...

2017-06-07 Thread markgrover
Github user markgrover commented on the issue:

https://github.com/apache/spark/pull/18234
  
This is related to #16629 but is a stripped-down version of it.





[GitHub] spark pull request #18234: [SPARK-19185][DSTREAM] Make Kafka consumer cache ...

2017-06-07 Thread markgrover
GitHub user markgrover opened a pull request:

https://github.com/apache/spark/pull/18234

[SPARK-19185][DSTREAM] Make Kafka consumer cache configurable

## What changes were proposed in this pull request?

Add a new property `spark.streaming.kafka.consumer.cache.enabled` that 
allows users to enable or disable the cache for Kafka consumers. This property 
can be especially handy in cases where issues like SPARK-19185 get hit, for 
which there isn't a solution committed yet. By default, the cache is still on, 
so this change doesn't alter any out-of-the-box behavior.

## How was this patch tested?
Running unit tests

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/markgrover/spark spark-19185

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18234.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18234


commit 68ca3f3c1051e7b98f2dcb92b752bee43d810c07
Author: Mark Grover <m...@apache.org>
Date:   2017-06-07T16:03:37Z

[SPARK-19185][DSTREAM] Make Kafka consumer cache configurable








[GitHub] spark issue #17990: [YARN] [SPARK-20756] yarn-shuffle jar references unshade...

2017-05-22 Thread markgrover
Github user markgrover commented on the issue:

https://github.com/apache/spark/pull/17990
  
Thanks

On May 22, 2017 6:13 PM, "asfgit" <notificati...@github.com> wrote:

> Closed #17990 <https://github.com/apache/spark/pull/17990> via 3630911
> <https://github.com/apache/spark/commit/36309110046a89d749a7c9746eaa16997de26922>
> .
>
> —
> You are receiving this because you authored the thread.
> Reply to this email directly, view it on GitHub
> <https://github.com/apache/spark/pull/17990#event-1092212640>, or mute
> the thread
> <https://github.com/notifications/unsubscribe-auth/ABoViz_0-89g96TpV0vq3vrq33loLldDks5r8cJSgaJpZM4Nb4c1>
> .
>






[GitHub] spark issue #17990: [YARN] [SPARK-20756] yarn-shuffle jar references unshade...

2017-05-16 Thread markgrover
Github user markgrover commented on the issue:

https://github.com/apache/spark/pull/17990
  
@vanzin thanks for your suggestion. I have done some more testing and all 
looks good from my side. I'd appreciate it if you could take another look. Thanks!





[GitHub] spark issue #17990: [YARN] [SPARK-20756][WIP] yarn-shuffle jar references un...

2017-05-16 Thread markgrover
Github user markgrover commented on the issue:

https://github.com/apache/spark/pull/17990
  
Jenkins, re-test this please.





[GitHub] spark pull request #17990: [YARN] [SPARK-20756][WIP] yarn-shuffle jar refere...

2017-05-16 Thread markgrover
Github user markgrover commented on a diff in the pull request:

https://github.com/apache/spark/pull/17990#discussion_r116845002
  
--- Diff: common/network-yarn/pom.xml ---
@@ -113,6 +116,13 @@
 io.netty.**
   
 
+
+  com.google.common
--- End diff --

I have changed it to use inheritance. They looked equivalent, but on second 
thought inheritance is better, so any changes to the parent pom trickle down 
automatically.





[GitHub] spark issue #17990: [YARN] [SPARK-20756] yarn-shuffle jar references unshade...

2017-05-15 Thread markgrover
Github user markgrover commented on the issue:

https://github.com/apache/spark/pull/17990
  
Jenkins, test this please.





[GitHub] spark pull request #17990: [YARN] [SPARK-20756] yarn-shuffle jar references ...

2017-05-15 Thread markgrover
GitHub user markgrover opened a pull request:

https://github.com/apache/spark/pull/17990

[YARN] [SPARK-20756] yarn-shuffle jar references unshaded guava

and contains scala classes

## What changes were proposed in this pull request?
This change ensures that all references to guava from within the yarn 
shuffle jar point to the shaded guava class already provided in the jar.

Also, it explicitly excludes scala classes from being added to the jar.

## How was this patch tested?
Ran unit tests on the module and they passed. I am in the process of taking 
this jar and deploying it on a YARN cluster, and making sure the YARN node 
managers come up and that shuffle via the shuffle service works.
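
Not part of the patch, but a hypothetical check along the lines of the manual 
testing described above: scan the assembled jar and fail if any unshaded guava 
or scala classes leaked in. The jar path below is a made-up placeholder:

```
import java.util.jar.JarFile
import scala.collection.JavaConverters._

// Hypothetical verification sketch; the path is illustrative only.
val jarPath = "common/network-yarn/target/yarn-shuffle.jar"
val entries = new JarFile(jarPath).entries().asScala.map(_.getName).toList
val leaks = entries.filter(n => n.startsWith("com/google/common/") || n.startsWith("scala/"))
require(leaks.isEmpty, s"unshaded classes found: ${leaks.take(5).mkString(", ")}")
```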

Please review http://spark.apache.org/contributing.html before opening a 
pull request.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/markgrover/spark spark-20756

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17990.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17990


commit 461ee5aad13bbd35e710bed8b6943409ce90d457
Author: Mark Grover <m...@apache.org>
Date:   2017-05-16T01:17:05Z

[YARN] [SPARK-20756] yarn-shuffle jar has references to unshaded guava and 
contains scala classes







[GitHub] spark issue #17790: [SPARK-20514][CORE] Upgrade Jetty to 9.3.11.v20160721

2017-04-28 Thread markgrover
Github user markgrover commented on the issue:

https://github.com/apache/spark/pull/17790
  
Thanks for the review!





[GitHub] spark pull request #17790: [SPARK-20514][CORE] Upgrade Jetty to 9.3.11.v2016...

2017-04-27 Thread markgrover
Github user markgrover commented on a diff in the pull request:

https://github.com/apache/spark/pull/17790#discussion_r113817810
  
--- Diff: pom.xml ---
@@ -136,7 +136,7 @@
 10.12.1.1
 1.8.2
 1.6.0
-9.2.16.v20160414
+9.3.11.v20160721
--- End diff --

Ah, thanks, there was some confusion about this. I have fixed the JIRA 
title and the PR title to match up with the actual code - which uses the exact 
same version as Apache Hadoop (9.3.11.v20160721).





[GitHub] spark issue #17790: [SPARK-20514][CORE] Upgrade Jetty to 9.3.13.v20161014

2017-04-27 Thread markgrover
Github user markgrover commented on the issue:

https://github.com/apache/spark/pull/17790
  
I am running unit tests locally and, while the run hasn't finished, it's 
looking pretty good so far. I think it'd be good to get Jenkins to run the 
tests here anyway, so I am putting up a PR sooner rather than later.





[GitHub] spark pull request #17790: [SPARK-20514][CORE] Upgrade Jetty to 9.3.13.v2016...

2017-04-27 Thread markgrover
GitHub user markgrover opened a pull request:

https://github.com/apache/spark/pull/17790

[SPARK-20514][CORE] Upgrade Jetty to 9.3.13.v20161014

Upgrade Jetty so it can work with Hadoop 3 (alpha 2 release, in particular).
Without this change, because of incompatibility between Jetty versions,
Spark fails to compile when built against Hadoop 3.

## How was this patch tested?
Unit tests being run.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/markgrover/spark spark-20514

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17790.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17790


commit ecfb8e3f276eeb276ed0a3293a68ff93a6f9e88e
Author: Mark Grover <m...@apache.org>
Date:   2017-04-27T21:32:18Z

[SPARK-20514][CORE] Upgrade Jetty to 9.3.13.v20161014

Upgrade Jetty so it can work with Hadoop 3 (alpha 2 release, in particular).
Without this change, because of incompatibility between Jetty versions,
Spark fails to compile when built against Hadoop 3.







[GitHub] spark issue #17725: [SPARK-20435][CORE] More thorough redaction of sensitive...

2017-04-26 Thread markgrover
Github user markgrover commented on the issue:

https://github.com/apache/spark/pull/17725
  
Thanks Marcelo!

On Apr 26, 2017 5:06 PM, "Marcelo Vanzin" <notificati...@github.com> wrote:

> Merging to master / 2.2.
>
> —
> You are receiving this because you authored the thread.
> Reply to this email directly, view it on GitHub
> <https://github.com/apache/spark/pull/17725#issuecomment-297575053>, or mute
> the thread
> <https://github.com/notifications/unsubscribe-auth/ABoVi9s2uyakNm-81qWP3aBpRkFHtm3Jks5rz9vugaJpZM4NE66l>
> .
>






[GitHub] spark issue #17725: [SPARK-20435][CORE] More thorough redaction of sensitive...

2017-04-26 Thread markgrover
Github user markgrover commented on the issue:

https://github.com/apache/spark/pull/17725
  
Finally it's passing!





[GitHub] spark issue #17725: [SPARK-20435][CORE] More thorough redaction of sensitive...

2017-04-26 Thread markgrover
Github user markgrover commented on the issue:

https://github.com/apache/spark/pull/17725
  
Jenkins, retest this please.





[GitHub] spark issue #17725: [SPARK-20435][CORE] More thorough redaction of sensitive...

2017-04-26 Thread markgrover
Github user markgrover commented on the issue:

https://github.com/apache/spark/pull/17725
  
This time it finished in regular time (2h 29m) but failed a test. I ran 
that test locally (org.apache.spark.streaming.ReceiverSuite's 'receiver life 
cycle') and it passed for me, so, mostly out of laziness, I am going to issue 
another run here.





[GitHub] spark issue #17725: [SPARK-20435][CORE] More thorough redaction of sensitive...

2017-04-25 Thread markgrover
Github user markgrover commented on the issue:

https://github.com/apache/spark/pull/17725
  
Kicking off another run here while I am running a run locally.





[GitHub] spark issue #17725: [SPARK-20435][CORE] More thorough redaction of sensitive...

2017-04-25 Thread markgrover
Github user markgrover commented on the issue:

https://github.com/apache/spark/pull/17725
  
Jenkins, retest this please.





[GitHub] spark issue #17725: [SPARK-20435][CORE] More thorough redaction of sensitive...

2017-04-25 Thread markgrover
Github user markgrover commented on the issue:

https://github.com/apache/spark/pull/17725
  
Hmm, this is weird; it ran for 4h 10min and got killed due to a timeout. 
Looking...





[GitHub] spark issue #17725: [SPARK-20435][CORE] More thorough redaction of sensitive...

2017-04-24 Thread markgrover
Github user markgrover commented on the issue:

https://github.com/apache/spark/pull/17725
  
Jenkins, retest this please.





[GitHub] spark issue #17725: [SPARK-20435][CORE] More thorough redaction of sensitive...

2017-04-21 Thread markgrover
Github user markgrover commented on the issue:

https://github.com/apache/spark/pull/17725
  
A few decisions that I made here:
* I considered whether just the `sun.java.command` property should be checked 
and redacted, but that seemed too specific and likely a band-aid for the 
current problem rather than a long-term solution, so I decided against it.
* Redaction for the `SparkListenerEnvironmentUpdate` event was solely being 
done on `Spark Properties`, while `sun.java.command` is a part of `System 
Properties`. I considered doing redaction for `System Properties` in addition 
to `Spark Properties` (that would have gone somewhere around 
[here](https://github.com/apache/spark/pull/17725/files#diff-e4a5a68c15eed95d038acfed84b0b66aL258))
 but decided against it because that would have meant even more hardcoding, and 
I didn't see why these two kinds of properties would be special enough to be 
redacted but not the rest. So I decided to redact information from all kinds 
of properties.
* One way to redact the property value would have been to redact the 
minimum possible set from the value while keeping the rest of the value intact. 
For example, if the following were the unredacted case:
`"sun.java.command":"org.apache.spark.deploy.SparkSubmit ... --conf 
spark.executorEnv.HADOOP_CREDSTORE_PASSWORD=secret_password --conf 
spark.other.property=2"`
One option for the redacted output could have been:
`"sun.java.command":"org.apache.spark.deploy.SparkSubmit ... --conf 
spark.executorEnv.HADOOP_CREDSTORE_PASSWORD=*(redacted) --conf 
spark.other.property=2"`
However, such a redaction is very hard to maintain. For example, we would 
have had to take the current regex (which is `(?i)secret|password` by default) 
and add matchers to it, like `"("+SECRET_REDACTION_DEFAULT+"[^ ]*=)[^ ]*"`, 
which would allow us to squeeze out and replace just the matched portion. But 
this all seemed very fragile, and even worse when the user supplies a 
non-default regex, so I decided it was easiest to simply replace the entire 
value, even though only a small part of it contained `secret` or `password` 
(see the sketch after this list).
* One thing I didn't explicitly check was the performance implications of 
this change. The reason I bring this up is that previously we compared only 
keys with the regex; now, if the key doesn't match, we also match the value 
with the regex. So, in the worst case, we do twice as many regex matches as 
before. Also, before we were simply doing regex matching on `Spark Properties`; 
now we do it on all properties - `Spark Properties`, `System Properties`, `JVM 
Properties` and `Classpath Properties`. I don't think this should have a big 
performance impact, so I didn't invest time in it; I am mentioning it here in 
the interest of full disclosure.
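
A minimal sketch of the "replace the entire value" choice from the list above, 
assuming the default pattern; the helper name is hypothetical and this is not 
the code in the PR:

```
import scala.util.matching.Regex

val pattern: Regex = "(?i)secret|password".r  // default spark.redaction.regex
val replacement = "*(redacted)"

// If either the key or the value matches, blank out the whole value rather
// than trying to splice the matched fragment out of it.
def redactKv(kv: (String, String)): (String, String) = kv match {
  case (k, v) if pattern.findFirstIn(k).isDefined ||
    pattern.findFirstIn(v).isDefined => (k, replacement)
  case other => other
}

// e.g. a sun.java.command value containing
// "...HADOOP_CREDSTORE_PASSWORD=secret_password..." becomes "*(redacted)".
```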

Thanks in advance for reviewing.





[GitHub] spark pull request #17725: [SPARK-20435][CORE] More thorough redaction of se...

2017-04-21 Thread markgrover
GitHub user markgrover opened a pull request:

https://github.com/apache/spark/pull/17725

[SPARK-20435][CORE] More thorough redaction of sensitive information

This change does a more thorough redaction of sensitive information from 
the logs and UI, and adds unit tests that ensure no regressions happen that 
leak sensitive information to the logs.

Previously, the redaction logic only checked whether the key matched the 
secret regex pattern and, if so, redacted its value. That worked for most 
cases. However, in the case above, the key (`sun.java.command`) doesn't tell 
much, so the value needs to be searched too. This PR expands the check to 
cover values as well.


## How was this patch tested?

New unit tests added that ensure that no sensitive information is present 
in the event logs or the yarn logs.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/markgrover/spark spark-20435

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17725.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17725


commit 2f5148a2e37d8d36006fb297b28a9e8c21a0026b
Author: Mark Grover <m...@apache.org>
Date:   2017-04-22T00:24:30Z

[SPARK-20435][CORE] More thorough redaction of sensitive information from 
logs/UI, more unit tests







[GitHub] spark issue #17442: [SPARK-20107][SQL] Speed up HadoopMapReduceCommitProtoco...

2017-03-27 Thread markgrover
Github user markgrover commented on the issue:

https://github.com/apache/spark/pull/17442
  
Agreed.





[GitHub] spark pull request #17393: [SPARK-20066] [CORE] Add explicit SecurityManager...

2017-03-23 Thread markgrover
Github user markgrover closed the pull request at:

https://github.com/apache/spark/pull/17393





[GitHub] spark issue #17393: [SPARK-20066] [CORE] Add explicit SecurityManager(SparkC...

2017-03-23 Thread markgrover
Github user markgrover commented on the issue:

https://github.com/apache/spark/pull/17393
  
Thanks for posting. I will close this; I understand the hesitation to set a 
precedent here.





[GitHub] spark pull request #17393: [SPARK-20066] [CORE] Add explicit SecurityManager...

2017-03-22 Thread markgrover
GitHub user markgrover opened a pull request:

https://github.com/apache/spark/pull/17393

[SPARK-20066] [CORE] Add explicit SecurityManager(SparkConf) constructor

for backwards compatibility with Java.

## What changes were proposed in this pull request?
This adds an explicit SecurityManager(SparkConf) constructor in addition to 
the existing constructor that takes two arguments - SparkConf and 
ioEncryptionKey. The second argument has a default, but that's still not enough 
if this code is invoked from Java, because of [this 
issue](http://stackoverflow.com/questions/13059528/instantiate-a-scala-class-from-java-and-use-the-default-parameters-of-the-const)
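
A minimal sketch of the problem and the proposed fix, using a simplified 
stand-in class rather than the real SecurityManager:

```
import org.apache.spark.SparkConf

// Simplified stand-in for SecurityManager, for illustration only.
class SecurityManagerSketch(
    conf: SparkConf,
    ioEncryptionKey: Option[Array[Byte]] = None) {

  // Scala default arguments are invisible to Java callers. Without this
  // explicit auxiliary constructor, Java code would have to pass the Option
  // argument explicitly; with it, `new SecurityManagerSketch(conf)` compiles
  // from both languages.
  def this(conf: SparkConf) = this(conf, None)
}
```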

## How was this patch tested?
Before this PR:
mvn clean package -Dspark.version=2.1.0 fails.
mvn clean package -Dspark.version=2.0.0 passes.

After this PR:
mvn clean package -Dspark.version=2.2.0-SNAPSHOT passes.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/markgrover/spark spark-20066

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17393.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17393


commit 2a3c66f3f2ef89d1bbde61e1144487b5a99b70b1
Author: Mark Grover <m...@apache.org>
Date:   2017-03-23T03:41:27Z

[SPARK-20066] [CORE] Add explicit SecurityManager(SparkConf) constructor 
for backwards compatibility with Java







[GitHub] spark issue #17047: [SPARK-19720][CORE] Redact sensitive information from Sp...

2017-03-02 Thread markgrover
Github user markgrover commented on the issue:

https://github.com/apache/spark/pull/17047
  
Thanks @vanzin 





[GitHub] spark pull request #17127: [SPARK-19734][PYTHON][ML] Correct OneHotEncoder d...

2017-03-01 Thread markgrover
GitHub user markgrover opened a pull request:

https://github.com/apache/spark/pull/17127

[SPARK-19734][PYTHON][ML] Correct OneHotEncoder doc string to say dropLast

## What changes were proposed in this pull request?
Updates the doc string to match up with the code, i.e., say `dropLast` 
instead of `includeFirst`.

## How was this patch tested?
Not much, since it's a doc-like change. Will run unit tests via Jenkins job.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/markgrover/spark spark_19734

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17127.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17127


commit f3f5fd198a16e0128b37ac105d31d59cb9b1de8e
Author: Mark Grover <m...@apache.org>
Date:   2017-03-01T18:59:18Z

[SPARK-19734][PYTHON][ML] OneHotEncoder __init__ uses dropLast but doc 
strings all say includeFirst
Updates the doc string to match up with the code
i.e. list dropLast instead of includeFirst







[GitHub] spark issue #17047: [SPARK-19720][CORE] Redact sensitive information from Sp...

2017-02-28 Thread markgrover
Github user markgrover commented on the issue:

https://github.com/apache/spark/pull/17047
  
Seems unrelated.





[GitHub] spark issue #17047: [SPARK-19720][CORE] Redact sensitive information from Sp...

2017-02-28 Thread markgrover
Github user markgrover commented on the issue:

https://github.com/apache/spark/pull/17047
  
Jenkins, test this again, please.





[GitHub] spark pull request #17047: [SPARK-19720][CORE] Redact sensitive information ...

2017-02-28 Thread markgrover
Github user markgrover commented on a diff in the pull request:

https://github.com/apache/spark/pull/17047#discussion_r103587323
  
--- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
@@ -2574,13 +2575,31 @@ private[spark] object Utils extends Logging {
 
   def redact(conf: SparkConf, kvs: Seq[(String, String)]): Seq[(String, 
String)] = {
 val redactionPattern = conf.get(SECRET_REDACTION_PATTERN).r
+redact(redactionPattern, kvs)
+  }
+
+  private def redact(redactionPattern: Regex, kvs: Seq[(String, String)]): 
Seq[(String, String)] = {
 kvs.map { kv =>
   redactionPattern.findFirstIn(kv._1)
-.map { ignore => (kv._1, REDACTION_REPLACEMENT_TEXT) }
+.map {ignore => (kv._1, REDACTION_REPLACEMENT_TEXT) }
--- End diff --

Ah, right, I misunderstood that - I took your comment as a bash + Scala way 
of saying "eliminate the space" (partly because it's hard to see spacing in 
GitHub comments). My bad; let me fix that.





[GitHub] spark pull request #17047: [SPARK-19720][CORE] Redact sensitive information ...

2017-02-28 Thread markgrover
Github user markgrover commented on a diff in the pull request:

https://github.com/apache/spark/pull/17047#discussion_r103585591
  
--- Diff: 
core/src/main/scala/org/apache/spark/internal/config/package.scala ---
@@ -240,14 +240,16 @@ package object config {
 .longConf
 .createWithDefault(4 * 1024 * 1024)
 
+  private[spark] val SECRET_REDACTION_PROPERTY = "spark.redaction.regex"
--- End diff --

Thanks @vanzin Updated.





[GitHub] spark pull request #17047: [SPARK-19720][SPARK SUBMIT] Redact sensitive info...

2017-02-27 Thread markgrover
Github user markgrover commented on a diff in the pull request:

https://github.com/apache/spark/pull/17047#discussion_r103367049
  
--- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
@@ -2574,13 +2575,30 @@ private[spark] object Utils extends Logging {
 
   def redact(conf: SparkConf, kvs: Seq[(String, String)]): Seq[(String, 
String)] = {
 val redactionPattern = conf.get(SECRET_REDACTION_PATTERN).r
+redact(redactionPattern, kvs)
+  }
+
+  private def redact(redactionPattern: Regex, kvs: Seq[(String, String)]): 
Seq[(String, String)] = {
 kvs.map { kv =>
   redactionPattern.findFirstIn(kv._1)
 .map { ignore => (kv._1, REDACTION_REPLACEMENT_TEXT) }
 .getOrElse(kv)
 }
   }
 
+  /**
+   * Looks up the redaction regex from within the key value pairs and uses 
it to redact the rest
+   * of the key value pairs. No care is taken to make sure the redaction 
property itself is not
+   * redacted. So theoretically, the property itself could be configured 
to redact its own value
+   * when printing.
+   * @param kvs
+   * @return
+   */
+  def redact(kvs: Map[String, String]): Seq[(String, String)] = {
--- End diff --

Correct, that's exactly the use case - where there isn't a conf object 
available yet. I will update the Javadoc. Thanks for reviewing!
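
A minimal sketch of this Map-based overload's flow, following the diff above; 
the default pattern is an assumption based on the rest of this thread, and 
this is an illustration rather than the exact patch:

```
import scala.util.matching.Regex

def redact(redactionPattern: Regex, kvs: Seq[(String, String)]): Seq[(String, String)] =
  kvs.map { kv =>
    redactionPattern.findFirstIn(kv._1)
      .map(_ => (kv._1, "*(redacted)"))
      .getOrElse(kv)
  }

// Look the regex up from the pairs themselves, for call sites that run
// before a SparkConf object exists.
def redact(kvs: Map[String, String]): Seq[(String, String)] = {
  val regex = kvs.getOrElse("spark.redaction.regex", "(?i)secret|password").r
  redact(regex, kvs.toSeq)
}
```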





[GitHub] spark pull request #17047: [SPARK-19720][SPARK SUBMIT] Redact sensitive info...

2017-02-23 Thread markgrover
GitHub user markgrover opened a pull request:

https://github.com/apache/spark/pull/17047

[SPARK-19720][SPARK SUBMIT] Redact sensitive information from SparkSubmit 
console

## What changes were proposed in this pull request?
This change redacts sensitive information (based on the `spark.redaction.regex` 
property) from the SparkSubmit console logs. Such sensitive information is 
already being redacted from event logs, yarn logs, etc.

## How was this patch tested?
Testing was done manually to make sure that the console logs were not 
printing any
sensitive information.

Here's some output from the console:

```
Spark properties used, including those specified through
 --conf and those from the properties file 
/etc/spark2/conf/spark-defaults.conf:
  (spark.yarn.appMasterEnv.HADOOP_CREDSTORE_PASSWORD,*(redacted))
  (spark.authenticate,false)
  (spark.executorEnv.HADOOP_CREDSTORE_PASSWORD,*(redacted))
```

```
System properties:
(spark.yarn.appMasterEnv.HADOOP_CREDSTORE_PASSWORD,*(redacted))
(spark.authenticate,false)
(spark.executorEnv.HADOOP_CREDSTORE_PASSWORD,*(redacted))
```
There is a risk that if new print statements are added to the console down the 
road, sensitive information may still get leaked, since there is no test that 
asserts on the console log output. I considered it out of the scope of this 
JIRA to write an integration test to make sure new leaks don't happen in the 
future.

Running unit tests to make sure nothing else is broken by this change.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/markgrover/spark master_redaction

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17047.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17047


commit 000efb1e3152f837e01ce1f80ae108c596f9baa5
Author: Mark Grover <m...@apache.org>
Date:   2017-02-24T01:30:05Z

[SPARK-19720][SPARK SUBMIT] Redact sensitive information from SparkSubmit 
console output

This change redacts sensitive information (based on the spark.redaction.regex 
property)
from the Spark Submit console logs. Such sensitive information is already 
being
redacted from event logs and yarn logs, etc.

Testing was done manually to make sure that the console logs were not 
printing any
sensitive information.
Here's some output from the console:
Spark properties used, including those specified through
 --conf and those from the properties file 
/etc/spark2/conf/spark-defaults.conf:
  (spark.yarn.appMasterEnv.HADOOP_CREDSTORE_PASSWORD,*(redacted))
  (spark.authenticate,false)
  (spark.executorEnv.HADOOP_CREDSTORE_PASSWORD,*(redacted))







[GitHub] spark issue #15971: [SPARK-18535][UI][YARN] Redact sensitive information fro...

2016-11-23 Thread markgrover
Github user markgrover commented on the issue:

https://github.com/apache/spark/pull/15971
  
Ok, looks like all is good now!





[GitHub] spark issue #15971: [SPARK-18535][UI][YARN] Redact sensitive information fro...

2016-11-23 Thread markgrover
Github user markgrover commented on the issue:

https://github.com/apache/spark/pull/15971
  
Jenkins, retest this please.





[GitHub] spark issue #15971: [SPARK-18535][UI][YARN] Redact sensitive information fro...

2016-11-23 Thread markgrover
Github user markgrover commented on the issue:

https://github.com/apache/spark/pull/15971
  
Thanks for reviewing, Marcelo.

Hmm, the failures are still unrelated :-(
```
[info] Exception encountered when attempting to run a suite with class name: org.apache.spark.sql.jdbc.JDBCWriteSuite *** ABORTED *** (1 second, 791 milliseconds)
[info]   org.h2.jdbc.JdbcSQLException: Schema "TEST" already exists; SQL statement:
[info] create schema test [90078-183]
[info]   at org.h2.message.DbException.getJdbcSQLException(DbException.java:345)
[info]   at org.h2.message.DbException.get(DbException.java:179)
[info]   at org.h2.message.DbException.get(DbException.java:155)
[info]   at org.h2.command.ddl.CreateSchema.update(CreateSchema.java:48)
[info]   at org.h2.command.CommandContainer.update(CommandContainer.java:78)
[info]   at org.h2.command.Command.executeUpdate(Command.java:254)
[info]   at org.h2.jdbc.JdbcPreparedStatement.executeUpdateInternal(JdbcPreparedStatement.java:157)
[info]   at org.h2.jdbc.JdbcPreparedStatement.executeUpdate(JdbcPreparedStatement.java:143)
[info]   at org.apache.spark.sql.jdbc.JDBCWriteSuite$$anonfun$29.apply(JDBCWriteSuite.scala:56)
[info]   at org.apache.spark.sql.jdbc.JDBCWriteSuite$$anonfun$29.apply(JDBCWriteSuite.scala:53)
[info]   at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:195)
[info]   at org.apache.spark.sql.jdbc.JDBCWriteSuite.runTest(JDBCWriteSuite.scala:34)
[info]   at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
[info]   at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
[info]   at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
[info]   at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
[info]   at scala.collection.immutable.List.foreach(List.scala:381)
[info]   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
[info]   at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
[info]   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
[info]   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
[info]   at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
[info]   at org.scalatest.Suite$class.run(Suite.scala:1424)
[info]   at org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
[info]   at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
[info]   at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
[info]   at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
[info]   at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212)
[info]   at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:31)
[info]   at org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:257)
[info]   at org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:256)
[info]   at org.apache.spark.sql.jdbc.JDBCWriteSuite.org$scalatest$BeforeAndAfter$$super$run(JDBCWriteSuite.scala:34)
[info]   at org.scalatest.BeforeAndAfter$class.run(BeforeAndAfter.scala:241)
[info]   at org.apache.spark.sql.jdbc.JDBCWriteSuite.run(JDBCWriteSuite.scala:34)
[info]   at org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:357)
[info]   at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:502)
[info]   at sbt.ForkMain$Run$2.call(ForkMain.java:296)
[info]   at sbt.ForkMain$Run$2.call(ForkMain.java:286)
[info]   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
[info]   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
[info]   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
[info]   at java.lang.Thread.run(Thread.java:745)
```
I won't be available for the next few days. If you see fit, I'd appreciate it if you could commit this after the test runs stabilize. Thanks!





[GitHub] spark pull request #15971: [SPARK-18535][UI][YARN] Redact sensitive informat...

2016-11-23 Thread markgrover
Github user markgrover commented on a diff in the pull request:

https://github.com/apache/spark/pull/15971#discussion_r89416437
  
--- Diff: 
core/src/main/scala/org/apache/spark/scheduler/EventLoggingListener.scala ---
@@ -231,6 +233,15 @@ private[spark] class EventLoggingListener(
 }
   }
 
+  private[spark] def redactEvent(
+      event: SparkListenerEnvironmentUpdate): SparkListenerEnvironmentUpdate = {
--- End diff --

Thanks, I'll go through the Spark style guide so I don't cause as much trouble next time.

Thanks for reviewing.





[GitHub] spark issue #15971: [SPARK-18535][UI][YARN] Redact sensitive information fro...

2016-11-23 Thread markgrover
Github user markgrover commented on the issue:

https://github.com/apache/spark/pull/15971
  
Thanks for your review, @vanzin. I have incorporated all the suggestions; I'd appreciate it if you could take another look.

And, the test failures in the last run seem unrelated.





[GitHub] spark pull request #15971: [SPARK-18535][UI][YARN] Redact sensitive informat...

2016-11-23 Thread markgrover
Github user markgrover commented on a diff in the pull request:

https://github.com/apache/spark/pull/15971#discussion_r89388400
  
--- Diff: 
core/src/test/scala/org/apache/spark/scheduler/EventLoggingListenerSuite.scala 
---
@@ -95,6 +95,30 @@ class EventLoggingListenerSuite extends SparkFunSuite 
with LocalSparkContext wit
 }
   }
 
+  test("Event logging with password redaction") {
+val secretPassword = "secret_password"
+val conf = getLoggingConf(testDirPath, 
None).set("spark.executorEnv.HADOOP_CREDSTORE_PASSWORD",
+  secretPassword)
+sc = new SparkContext("local-cluster[2,2,1024]", "test", conf)
--- End diff --

Ok, I have updated this test to be less bloated. Thanks!





[GitHub] spark pull request #15971: [SPARK-18535][UI][YARN] Redact sensitive informat...

2016-11-23 Thread markgrover
Github user markgrover commented on a diff in the pull request:

https://github.com/apache/spark/pull/15971#discussion_r89383168
  
--- Diff: 
core/src/main/scala/org/apache/spark/scheduler/EventLoggingListener.scala ---
@@ -231,6 +233,19 @@ private[spark] class EventLoggingListener(
 }
   }
 
+
+  private def redactEvent(event: SparkListenerEnvironmentUpdate): SparkListenerEnvironmentUpdate = {
--- End diff --

>That's the only argument I can buy, and said user has different ways to 
get that information.

True, `sparkConf.getAll` is always available.

I still think it's better to put the redaction closer to the "sinks" for 
now. The good thing though is that if we see the number of "sinks" increasing, 
and everyone wanting redaction, we can take redaction more upstream. For now, 2 
sinks seem manageable and it's hard to guess if future sinks are going to want 
redaction or not. So, unless you strongly object, I'd like to keep the 
redaction closer to the "sinks" now.





[GitHub] spark pull request #15971: [SPARK-18535][UI][YARN] Redact sensitive informat...

2016-11-23 Thread markgrover
Github user markgrover commented on a diff in the pull request:

https://github.com/apache/spark/pull/15971#discussion_r89382000
  
--- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
@@ -2555,6 +2555,19 @@ private[spark] object Utils extends Logging {
   sparkJars.map(_.split(",")).map(_.filter(_.nonEmpty)).toSeq.flatten
 }
   }
+
+  private[util] val REDACTION_REPLACEMENT_TEXT = "*(redacted)"
+
+  def redact(conf: SparkConf, kvs: Seq[(String, String)]): Seq[(String, String)] = {
+    val redactionPattern = conf.get(SECRET_REDACTION_PATTERN).r
+    kvs.map { kv =>
+      if (redactionPattern.findFirstIn(kv._1).isDefined) {
+        (kv._1, REDACTION_REPLACEMENT_TEXT)
+      }
+      else kv
--- End diff --

That's much better, thanks.





[GitHub] spark pull request #15971: [SPARK-18535][UI][YARN] Redact sensitive informat...

2016-11-23 Thread markgrover
Github user markgrover commented on a diff in the pull request:

https://github.com/apache/spark/pull/15971#discussion_r89382120
  
--- Diff: 
core/src/main/scala/org/apache/spark/internal/config/package.scala ---
@@ -223,4 +223,13 @@ package object config {
   " bigger files.")
 .longConf
 .createWithDefault(4 * 1024 * 1024)
+
+  private[spark] val SECRET_REDACTION_PATTERN =
+    ConfigBuilder("spark.redaction.regex")
+      .doc("Regex to decide which Spark configuration properties and environment variables in " +
+        "driver and executor environments contain sensitive information. When this regex matches " +
+        "a property , its value is redacted from the environment UI and various logs like YARN " +
+        "and event logs")
--- End diff --

Fixed.





[GitHub] spark pull request #15971: [SPARK-18535][UI][YARN] Redact sensitive informat...

2016-11-23 Thread markgrover
Github user markgrover commented on a diff in the pull request:

https://github.com/apache/spark/pull/15971#discussion_r89382071
  
--- Diff: core/src/test/scala/org/apache/spark/util/UtilsSuite.scala ---
@@ -974,4 +974,28 @@ class UtilsSuite extends SparkFunSuite with 
ResetSystemProperties with Logging {
 
 assert(pValue > threshold)
   }
+
+  test("redact sensitive information") {
+val sparkConf = new SparkConf
+
+// Set some secret keys
+val secretKeys = Seq("" +
+  "spark.executorEnv.HADOOP_CREDSTORE_PASSWORD",
+  "spark.my.password",
+  "spark.my.sECreT")
+secretKeys.foreach { key =>
+  sparkConf.set(key, "secret_password")
+}
+// Set a non-secret key
+sparkConf.set("spark.regular.property", "not_a_secret")
+
+// Redact sensitive information
+val redactedConf = Utils.redact(sparkConf, sparkConf.getAll).toMap
+
+// Assert that secret information got redacted while the regular 
property remained the same
+secretKeys.foreach { key =>
+  assert(redactedConf.get(key).get == Utils.REDACTION_REPLACEMENT_TEXT)
+}
+assert(redactedConf.get("spark.regular.property").get == 
"not_a_secret")
--- End diff --

Fixed.





[GitHub] spark pull request #15971: [SPARK-18535][UI][YARN] Redact sensitive informat...

2016-11-23 Thread markgrover
Github user markgrover commented on a diff in the pull request:

https://github.com/apache/spark/pull/15971#discussion_r89382041
  
--- Diff: core/src/test/scala/org/apache/spark/util/UtilsSuite.scala ---
@@ -974,4 +974,28 @@ class UtilsSuite extends SparkFunSuite with 
ResetSystemProperties with Logging {
 
 assert(pValue > threshold)
   }
+
+  test("redact sensitive information") {
+val sparkConf = new SparkConf
+
+// Set some secret keys
+val secretKeys = Seq("" +
--- End diff --

Me trying to format things in my editor led to this, apologies. Fixed it.





[GitHub] spark pull request #15971: [SPARK-18535][UI][YARN] Redact sensitive informat...

2016-11-23 Thread markgrover
Github user markgrover commented on a diff in the pull request:

https://github.com/apache/spark/pull/15971#discussion_r89382129
  
--- Diff: 
core/src/main/scala/org/apache/spark/scheduler/EventLoggingListener.scala ---
@@ -231,6 +233,19 @@ private[spark] class EventLoggingListener(
 }
   }
 
+
--- End diff --

Fixed.





[GitHub] spark pull request #15971: [SPARK-18535][UI][YARN] Redact sensitive informat...

2016-11-23 Thread markgrover
Github user markgrover commented on a diff in the pull request:

https://github.com/apache/spark/pull/15971#discussion_r89382059
  
--- Diff: core/src/test/scala/org/apache/spark/util/UtilsSuite.scala ---
@@ -974,4 +974,28 @@ class UtilsSuite extends SparkFunSuite with 
ResetSystemProperties with Logging {
 
 assert(pValue > threshold)
   }
+
+  test("redact sensitive information") {
+val sparkConf = new SparkConf
+
+// Set some secret keys
+val secretKeys = Seq("" +
+  "spark.executorEnv.HADOOP_CREDSTORE_PASSWORD",
+  "spark.my.password",
+  "spark.my.sECreT")
+secretKeys.foreach { key =>
+  sparkConf.set(key, "secret_password")
+}
+// Set a non-secret key
+sparkConf.set("spark.regular.property", "not_a_secret")
+
+// Redact sensitive information
+val redactedConf = Utils.redact(sparkConf, sparkConf.getAll).toMap
+
+// Assert that secret information got redacted while the regular 
property remained the same
+secretKeys.foreach { key =>
+  assert(redactedConf.get(key).get == Utils.REDACTION_REPLACEMENT_TEXT)
--- End diff --

Fixed.





[GitHub] spark issue #15971: [SPARK-18535][UI][YARN] Redact sensitive information fro...

2016-11-22 Thread markgrover
Github user markgrover commented on the issue:

https://github.com/apache/spark/pull/15971
  
Thanks for your feedback, @vanzin and @ajbozarth.
I have updated the PR with your suggestions. The only one still pending is @vanzin's suggestion of improving the EventLoggingListenerSuite test. Let me think about that one for a bit before I update it.





[GitHub] spark pull request #15971: [SPARK-18535][UI][YARN] Redact sensitive informat...

2016-11-22 Thread markgrover
Github user markgrover commented on a diff in the pull request:

https://github.com/apache/spark/pull/15971#discussion_r89240891
  
--- Diff: core/src/test/scala/org/apache/spark/util/UtilsSuite.scala ---
@@ -974,4 +974,18 @@ class UtilsSuite extends SparkFunSuite with 
ResetSystemProperties with Logging {
 
 assert(pValue > threshold)
   }
+
+  test("redact sensitive information") {
+val sparkConf = new SparkConf
+sparkConf.set("spark.executorEnv.HADOOP_CREDSTORE_PASSWORD", 
"secret_password")
--- End diff --

Fixed.





[GitHub] spark pull request #15971: [SPARK-18535][UI][YARN] Redact sensitive informat...

2016-11-22 Thread markgrover
Github user markgrover commented on a diff in the pull request:

https://github.com/apache/spark/pull/15971#discussion_r89240880
  
--- Diff: 
core/src/test/scala/org/apache/spark/scheduler/EventLoggingListenerSuite.scala 
---
@@ -95,6 +95,30 @@ class EventLoggingListenerSuite extends SparkFunSuite 
with LocalSparkContext wit
 }
   }
 
+  test("Event logging with password redaction") {
+val secretPassword = "secret_password"
+val conf = getLoggingConf(testDirPath, 
None).set("spark.executorEnv.HADOOP_CREDSTORE_PASSWORD",
+  secretPassword)
+sc = new SparkContext("local-cluster[2,2,1024]", "test", conf)
+assert(sc.eventLogger.isDefined)
+val eventLogger = sc.eventLogger.get
+
+sc.parallelize(1 to 1).count()
+sc.stop()
+
+val logData = EventLoggingListener.openEventLog(new 
Path(eventLogger.logPath), fileSystem)
+val eventLog = Source.fromInputStream(logData).mkString
+// Make sure nothing secret shows up anywhere
+assert(!eventLog.contains(secretPassword), s"Secret password 
($secretPassword) not redacted " +
+  s"from event logs:\n $eventLog")
+val expected = 
""""spark.executorEnv.HADOOP_CREDSTORE_PASSWORD":"*(redacted)""""
+// Make sure every occurrence of the property is accompanied by a 
redaction text.
+val regex = 
""""spark.executorEnv.HADOOP_CREDSTORE_PASSWORD":"([^"]*)"""".r
+val matches = regex.findAllIn(eventLog)
+assert(matches.nonEmpty)
+matches.foreach(matched => assert(matched.equals(expected)))
--- End diff --

Fixed.





[GitHub] spark pull request #15971: [SPARK-18535][UI][YARN] Redact sensitive informat...

2016-11-22 Thread markgrover
Github user markgrover commented on a diff in the pull request:

https://github.com/apache/spark/pull/15971#discussion_r89240853
  
--- Diff: 
core/src/test/scala/org/apache/spark/scheduler/EventLoggingListenerSuite.scala 
---
@@ -95,6 +95,30 @@ class EventLoggingListenerSuite extends SparkFunSuite 
with LocalSparkContext wit
 }
   }
 
+  test("Event logging with password redaction") {
+val secretPassword = "secret_password"
+val conf = getLoggingConf(testDirPath, 
None).set("spark.executorEnv.HADOOP_CREDSTORE_PASSWORD",
+  secretPassword)
+sc = new SparkContext("local-cluster[2,2,1024]", "test", conf)
--- End diff --

Yeah, I wanted a little more than just a "unit" test. This one was broader and checked for actual redaction taking place in the event logs, so I have it here.

I think you have a valid point though: if this is too expensive, the method in UtilsSuite.scala does a pretty good job of "unit testing" `redact()`, so I'd rather take this out completely. Thoughts?





[GitHub] spark pull request #15971: [SPARK-18535][UI][YARN] Redact sensitive informat...

2016-11-22 Thread markgrover
Github user markgrover commented on a diff in the pull request:

https://github.com/apache/spark/pull/15971#discussion_r89240682
  
--- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
@@ -2555,6 +2555,15 @@ private[spark] object Utils extends Logging {
   sparkJars.map(_.split(",")).map(_.filter(_.nonEmpty)).toSeq.flatten
 }
   }
+
+  private[util] val REDACTION_REPLACEMENT_TEXT = "*(redacted)"
+  def redact(conf: SparkConf)(kv: (String, String)): (String, String) = {
+    val redactionPattern = conf.get(SECRET_REDACTION_PATTERN).r
--- End diff --

What part do you think is expensive? Going through all the configuration 
properties and matching them with the regex?
If so, I agree. However, that has to be done somewhere. All the callers of 
this function have a `SparkConf` that they want stuff redacted from. So, if 
this function accepts a list of tuples, they have to run the regex check to 
find that list first before sending it to `redact()`. So, overall, unless I am 
missing something, I don't think we can avoid the expense.
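
For reference, a caller-side sketch using the curried signature above (assuming `Utils` and a `sparkConf` in scope):

```scala
// Every call site already holds a SparkConf, so the regex pass over its
// entries has to happen at some call site regardless of where it lives.
val redactedEnv: Seq[(String, String)] =
  sparkConf.getAll.map(Utils.redact(sparkConf)).toSeq
```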





[GitHub] spark pull request #15971: [SPARK-18535][UI][YARN] Redact sensitive informat...

2016-11-22 Thread markgrover
Github user markgrover commented on a diff in the pull request:

https://github.com/apache/spark/pull/15971#discussion_r89240387
  
--- Diff: 
core/src/main/scala/org/apache/spark/scheduler/EventLoggingListener.scala ---
@@ -231,6 +233,19 @@ private[spark] class EventLoggingListener(
 }
   }
 
+
+  private def redactEvent(event: SparkListenerEnvironmentUpdate): 
SparkListenerEnvironmentUpdate = {
--- End diff --

Good question. I thought about this quite a bit when making this change. I 
should have posted that decision in the PR description, apologies for not 
providing that context. The way I see it - there are three places where 
redaction can be done:
1. Right at the source of SparkListenerEnvironmentUpdate ([here, in 
SparkContext.scala](https://github.com/apache/spark/blob/39a1d30636857715247c82d551b200e1c331ad69/core/src/main/scala/org/apache/spark/SparkContext.scala#L2214)).
2. [In 
JsonProtocol.scala](https://github.com/apache/spark/blob/39a1d30636857715247c82d551b200e1c331ad69/core/src/main/scala/org/apache/spark/util/JsonProtocol.scala#L167)
 when converting the event to JSON.
3. In [EventLoggingListener.scala](https://github.com/apache/spark/blob/39a1d30636857715247c82d551b200e1c331ad69/core/src/main/scala/org/apache/spark/util/JsonProtocol.scala#L167) (where it is right now), when being persisted to disk.

A user could write a custom listener that listened to the environment updates and did something useful with them, and I didn't want to impose redaction on them. They could be using it to create a clone of their environment, for example, and may need the same sensitive properties. So, I ruled out 1.

And, JsonProtocol seemed like a generic utility to convert events to JSON. 
While I could be selective about only redacting events of 
`SparkListenerEnvironmentUpdate` type, I didn't want to assume that everyone 
translating the environment update to JSON should only be seeing redacted 
configuration. So, that made me rule out 2.

I decided that it was best to redact "closest to disk", which made me put the redaction code where I did - in EventLoggingListener. Hope that makes sense; happy to hear your thoughts if you think otherwise.
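
To make option 3 concrete, the sink-side hook is roughly the following (a sketch, assuming the listener's `sparkConf` field and the `Utils.redact` helper from this PR):

```scala
private def redactEvent(
    event: SparkListenerEnvironmentUpdate): SparkListenerEnvironmentUpdate = {
  // environmentDetails maps a section name ("Spark Properties", ...) to its
  // key-value pairs; redact each section just before it is persisted.
  val redactedDetails = event.environmentDetails.map { case (section, kvs) =>
    section -> Utils.redact(sparkConf, kvs)
  }
  SparkListenerEnvironmentUpdate(redactedDetails)
}
```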





[GitHub] spark pull request #15971: [SPARK-18535][UI][YARN] Redact sensitive informat...

2016-11-22 Thread markgrover
Github user markgrover commented on a diff in the pull request:

https://github.com/apache/spark/pull/15971#discussion_r89238304
  
--- Diff: 
core/src/main/scala/org/apache/spark/internal/config/package.scala ---
@@ -223,4 +223,13 @@ package object config {
   " bigger files.")
 .longConf
 .createWithDefault(4 * 1024 * 1024)
+
+  private[spark] val SECRET_REDACTION_PATTERN =
+    ConfigBuilder("spark.secret.redactionPattern")
+      .doc("Scala regex(case-sensitive) to decide which Spark configuration properties and " +
+        "environment variables in driver and executor environments contain sensitive information." +
+        " When this regex matches the property or environment variable name, its value is " +
+        "redacted from the environment UI and various logs like YARN and event logs")
+      .stringConf
+      .createWithDefault("secret|password|SECRET|PASSWORD")
--- End diff --

Fixed.





[GitHub] spark pull request #15971: [SPARK-18535][UI][YARN] Redact sensitive informat...

2016-11-22 Thread markgrover
Github user markgrover commented on a diff in the pull request:

https://github.com/apache/spark/pull/15971#discussion_r89238293
  
--- Diff: 
core/src/main/scala/org/apache/spark/internal/config/package.scala ---
@@ -223,4 +223,13 @@ package object config {
   " bigger files.")
 .longConf
 .createWithDefault(4 * 1024 * 1024)
+
+  private[spark] val SECRET_REDACTION_PATTERN =
+    ConfigBuilder("spark.secret.redactionPattern")
+      .doc("Scala regex(case-sensitive) to decide which Spark configuration properties and " +
+        "environment variables in driver and executor environments contain sensitive information." +
+        " When this regex matches the property or environment variable name, its value is " +
--- End diff --

Fixed.





[GitHub] spark pull request #15971: [SPARK-18535][UI][YARN] Redact sensitive informat...

2016-11-22 Thread markgrover
Github user markgrover commented on a diff in the pull request:

https://github.com/apache/spark/pull/15971#discussion_r89238264
  
--- Diff: 
core/src/main/scala/org/apache/spark/internal/config/package.scala ---
@@ -223,4 +223,13 @@ package object config {
   " bigger files.")
 .longConf
 .createWithDefault(4 * 1024 * 1024)
+
+  private[spark] val SECRET_REDACTION_PATTERN =
+    ConfigBuilder("spark.secret.redactionPattern")
--- End diff --

Good point! I think it's good for the actual regex not to be redacted. So, I will rename the property to be clearer anyway (and not have "secret" in the name) to `spark.redaction.regex`.





[GitHub] spark issue #15971: [SPARK-18535][UI][YARN] Redact sensitive information fro...

2016-11-21 Thread markgrover
Github user markgrover commented on the issue:

https://github.com/apache/spark/pull/15971
  
> Does this protect against doing sc.getConf.getAll.foreach(println) in 
spark shell ?
No, it doesn't. The goal of the JIRA is the first step - to stop printing 
sensitive information all over the place.

If a user has access to spark-shell (say, in a cluster secured through 
kerberos), it may be reasonable for them to see the sensitive information.

This, however, prevents anyone from going to the unauthenticated Spark UI, 
and having all the creds to S3.





[GitHub] spark pull request #15971: [SPARK-18535][UI][YARN] Redact sensitive informat...

2016-11-21 Thread markgrover
GitHub user markgrover opened a pull request:

https://github.com/apache/spark/pull/15971

[SPARK-18535][UI][YARN] Redact sensitive information from Spark logs and UI

## What changes were proposed in this pull request?

This patch adds a new property called `spark.secret.redactionPattern` that
allows users to specify a scala regex to decide which Spark configuration
properties and environment variables in driver and executor environments
contain sensitive information. When this regex matches the property or
environment variable name, its value is redacted from the environment UI and
various logs like YARN and event logs.

This change uses this property to redact information from event logs and YARN
logs. It also updates the UI code to adhere to this property instead of
hardcoding the logic to decipher which properties are sensitive.

Here's an image of the UI post-redaction:

![image](https://cloud.githubusercontent.com/assets/1709451/20506215/4cc30654-b007-11e6-8aee-4cde253fba2f.png)

Here's the text in the YARN logs, post-redaction:
``HADOOP_CREDSTORE_PASSWORD -> *(redacted)``

Here's the text in the event logs, post-redaction:

``...,"spark.executorEnv.HADOOP_CREDSTORE_PASSWORD":"*(redacted)","spark.yarn.appMasterEnv.HADOOP_CREDSTORE_PASSWORD":"*(redacted)",...``

## How was this patch tested?
1. Unit tests are added to ensure that redaction works.
2. A YARN job reading data off of S3, with confidential information (the hadoop credential provider password) provided in the environment variables of the driver and executor. Afterwards, the logs were grepped to make sure that no mention of the secret password was present. It was also ensured that the job was able to read the data off of S3 correctly, thereby confirming that the sensitive information trickled down to the right places to read the data.
3. The event logs were checked to make sure no mention of the secret password was present.
4. The UI environment tab was checked to make sure no secret information was displayed.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/markgrover/spark master_redaction

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/15971.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #15971


commit 5dd3630d6a937ba8634054940543509d9186e68e
Author: Mark Grover <m...@apache.org>
Date:   2016-11-17T01:47:17Z

[SPARK-18535][UI][YARN] Redact sensitive information from Spark logs and UI

This commit adds a new property called `spark.secret.redactionPattern` that
allows users to specify a scala regex to decide which Spark configuration
properties and environment variables in driver and executor environments
contain sensitive information. When this regex matches the property or
environment variable name, its value is redacted from the environment UI and
various logs like YARN and event logs.

This change uses this property to redact information from event logs and YARN
logs. It also updates the UI code to adhere to this property instead of
hardcoding the logic to decipher which properties are sensitive.

For testing:
1. Unit tests are added to ensure that redaction works.
2. A YARN job reading data off of S3, with confidential information (the hadoop credential provider password) provided in the environment variables of the driver and executor. Afterwards, the logs were grepped to make sure that no mention of the secret password was present. It was also ensured that the job was able to read the data off of S3 correctly, thereby confirming that the sensitive information trickled down to the right places to read the data.
3. The event logs were checked to make sure no mention of the secret password was present.
4. The UI environment tab was checked to make sure no secret information was displayed.







[GitHub] spark issue #15623: SPARK-18093][SQL] Fix default value test in SQLConfSuite...

2016-10-25 Thread markgrover
Github user markgrover commented on the issue:

https://github.com/apache/spark/pull/15623
  
Thanks @srowen, indeed, it could be stripping the trailing slash too. I am ok either way. Let me know if you have a strong preference for the other approach; otherwise, this should be ok.

Thanks again!





[GitHub] spark pull request #15623: SPARK-18093][SQL] Fix default value test in SQLCo...

2016-10-25 Thread markgrover
GitHub user markgrover opened a pull request:

https://github.com/apache/spark/pull/15623

SPARK-18093][SQL] Fix default value test in SQLConfSuite to work regardless of warehouse dir's existence

## What changes were proposed in this pull request?
Appending a trailing slash, if there isn't one already, for the sake of comparing the two paths. It doesn't take away from the essence of the check, but removes any potential mismatch due to a missing trailing slash.
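
A sketch of the normalization (the helper and variable names here are hypothetical):

```scala
// Append a trailing slash only when one is missing, so the two paths
// compare equal no matter how the warehouse dir was spelled.
def withTrailingSlash(path: String): String =
  if (path.endsWith("/")) path else path + "/"

assert(withTrailingSlash(actualWarehouseDir) == withTrailingSlash(expectedWarehouseDir))
```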

## How was this patch tested?
Ran unit tests and they passed.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/markgrover/spark spark-18093

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/15623.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #15623


commit dea1f337debbd2d0bb2488117f80ec5ded4f76a5
Author: Mark Grover <m...@apache.org>
Date:   2016-10-25T13:31:47Z

SPARK-18093][SQL] Fix default value test in SQLConfSuite to work regardless 
of warehouse dir's existence

Appending a trailing slash, if there isn't one already, for the sake of comparing the two paths. It doesn't take away from the essence of the check, but removes any potential mismatch due to a missing trailing slash.







[GitHub] spark pull request #15572: [DOCS] Update docs to not suggest to package Spar...

2016-10-20 Thread markgrover
GitHub user markgrover opened a pull request:

https://github.com/apache/spark/pull/15572

[DOCS] Update docs to not suggest to package Spark before running tests.

## What changes were proposed in this pull request?

Update docs to not suggest to package Spark before running tests.

## How was this patch tested?

Not creating a JIRA since this is pretty small. We haven't needed to run mvn package before mvn test since 1.6 at least, or so I am told. So, updating the docs to not be misleading.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/markgrover/spark doc_update

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/15572.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #15572


commit 858e50d6199dcacd313c578904b4d3d45c193088
Author: Mark Grover <m...@apache.org>
Date:   2016-10-20T21:57:34Z

[DOCS] Update docs to not suggest to package Spark before running tests.







[GitHub] spark issue #14270: [SPARK-5847][CORE] Allow for configuring MetricsSystem's...

2016-08-25 Thread markgrover
Github user markgrover commented on the issue:

https://github.com/apache/spark/pull/14270
  
Sure.





[GitHub] spark issue #14270: [SPARK-5847][CORE] Allow for configuring MetricsSystem's...

2016-07-27 Thread markgrover
Github user markgrover commented on the issue:

https://github.com/apache/spark/pull/14270
  
Thanks a lot, @vanzin!





[GitHub] spark issue #14270: [SPARK-5847][CORE] Allow for configuring MetricsSystem's...

2016-07-26 Thread markgrover
Github user markgrover commented on the issue:

https://github.com/apache/spark/pull/14270
  
Fixed the nits, resolved the merge conflict. Shortened some test names.





[GitHub] spark pull request #14270: [SPARK-5847][CORE] Allow for configuring MetricsS...

2016-07-26 Thread markgrover
Github user markgrover commented on a diff in the pull request:

https://github.com/apache/spark/pull/14270#discussion_r72357797
  
--- Diff: core/src/main/scala/org/apache/spark/metrics/MetricsSystem.scala 
---
@@ -125,19 +126,26 @@ private[spark] class MetricsSystem private (
* application, executor/driver and metric source.
*/
   private[spark] def buildRegistryName(source: Source): String = {
-    val appId = conf.getOption("spark.app.id")
+    val metricsNamespace = conf.get(METRICS_NAMESPACE).map(Some(_))
--- End diff --

Good point, thanks.





[GitHub] spark pull request #14270: [SPARK-5847][CORE] Allow for configuring MetricsS...

2016-07-26 Thread markgrover
Github user markgrover commented on a diff in the pull request:

https://github.com/apache/spark/pull/14270#discussion_r72357828
  
--- Diff: 
core/src/test/scala/org/apache/spark/metrics/MetricsSystemSuite.scala ---
@@ -183,4 +184,89 @@ class MetricsSystemSuite extends SparkFunSuite with 
BeforeAndAfter with PrivateM
 assert(metricName != s"$appId.$executorId.${source.sourceName}")
 assert(metricName === source.sourceName)
   }
+
+  test("MetricsSystem with Executor instance, with custom namespace") {
+val source = new Source {
+  override val sourceName = "dummySource"
+  override val metricRegistry = new MetricRegistry()
+}
+
+val appId = "testId"
+val appName = "testName"
+val executorId = "1"
+conf.set("spark.app.id", appId)
+conf.set("spark.app.name", appName)
+conf.set("spark.executor.id", executorId)
+conf.set(METRICS_NAMESPACE, "${spark.app.name}")
+
+val instanceName = "executor"
+val driverMetricsSystem = 
MetricsSystem.createMetricsSystem(instanceName, conf, securityMgr)
+
+val metricName = driverMetricsSystem.buildRegistryName(source)
+assert(metricName === s"$appName.$executorId.${source.sourceName}")
+  }
+
+  test("MetricsSystem with Executor instance and custom namespace which is 
not set") {
+val source = new Source {
+  override val sourceName = "dummySource"
+  override val metricRegistry = new MetricRegistry()
+}
+
+val executorId = "1"
+val namespaceToResolve = "${spark.doesnotexist}"
+conf.set("spark.executor.id", executorId)
+conf.set(METRICS_NAMESPACE, namespaceToResolve)
+
+val instanceName = "executor"
+val driverMetricsSystem = 
MetricsSystem.createMetricsSystem(instanceName, conf, securityMgr)
+
+val metricName = driverMetricsSystem.buildRegistryName(source)
+// If the user set the spark.metrics.namespace property to an 
expansion of another property
+// (say ${spark.doesnotexist}, the unresolved name (i.e. litterally 
${spark.doesnot})
--- End diff --

Appreciate your thoroughness!





[GitHub] spark pull request #14270: [SPARK-5847][CORE] Allow for configuring MetricsS...

2016-07-26 Thread markgrover
Github user markgrover commented on a diff in the pull request:

https://github.com/apache/spark/pull/14270#discussion_r72307684
  
--- Diff: 
core/src/main/scala/org/apache/spark/internal/config/package.scala ---
@@ -103,4 +103,9 @@ package object config {
 .stringConf
 .checkValues(Set("hive", "in-memory"))
 .createWithDefault("in-memory")
+
+  // This property sets the root namespace for metrics reporting
+  private[spark] val METRICS_NAMESPACE = ConfigBuilder("spark.metrics.namespace")
+    .stringConf
+    .createOptional
--- End diff --

Thanks, but I think that changes the semantics in cases where `spark.app.id` is not defined. Currently, the code [in this PR](https://github.com/apache/spark/pull/14270/files#diff-7ea2624e832b166ca27cd4baca8691d9R129) creates an Option with value `None` if `spark.app.id` is not defined. With the proposed change, it would create an Option with the literal value `${spark.app.id}` if `spark.app.id` is not defined, and that changes the existing behavior. Even though in practice, I believe, `spark.app.id` is always defined, there are some tests in MetricsSystemSuite.scala that exercise the scenario where `spark.app.id` is not defined, and I am hesitant to change that behavior.

I suppose I could use `createWithDefault()` in package.scala for `METRICS_NAMESPACE`, but then I'd have to check for any unresolved `spark.app.id` references in the value, which would be too big of a pain for not much benefit. So, I'd prefer to keep this the way it is.
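
To spell out the difference in a sketch (names as above):

```scala
// createOptional (this PR): resolves to None when neither
// spark.metrics.namespace nor spark.app.id is set.
val namespace: Option[String] =
  conf.get(METRICS_NAMESPACE).orElse(conf.getOption("spark.app.id"))

// createWithDefault("${spark.app.id}") would instead surface the
// unresolved literal "${spark.app.id}" whenever spark.app.id is unset,
// changing the behavior the existing tests exercise.
```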





[GitHub] spark issue #14270: [SPARK-5847][CORE] Allow for configuring MetricsSystem's...

2016-07-22 Thread markgrover
Github user markgrover commented on the issue:

https://github.com/apache/spark/pull/14270
  
Ok, I have pushed changes to use the expansion capabilities brought in by 
SPARK-16272. Overall, I think it was a very good call to use that, so thanks 
for the suggestions! Would appreciate a review.





[GitHub] spark issue #14270: [SPARK-5847][CORE] Allow for configuring MetricsSystem's...

2016-07-21 Thread markgrover
Github user markgrover commented on the issue:

https://github.com/apache/spark/pull/14270
  
ok, thanks, I will rely on changes in #14022 (SPARK-16272), now that it's 
been committed.





[GitHub] spark issue #14270: [SPARK-5847][CORE] Allow for configuring MetricsSystem's...

2016-07-19 Thread markgrover
Github user markgrover commented on the issue:

https://github.com/apache/spark/pull/14270
  
Tests passed! I'd really appreciate a review!





[GitHub] spark issue #14270: [SPARK-5847][CORE] Allow for configuring MetricsSystem's...

2016-07-19 Thread markgrover
Github user markgrover commented on the issue:

https://github.com/apache/spark/pull/14270
  
A few design choices I made along the way are:
1. Use a `SparkConf` property to control the namespace instead of a `MetricsConfig` property. The reason is that metrics config properties mean something particular and follow a particular pattern (see [here](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/metrics/MetricsSystem.scala#L54), for example). And, regardless of what we name this property, it doesn't follow that paradigm - the first part of the property name before the first dot doesn't represent an instance (like master, worker, executor, etc.) like it does for legit metrics properties. Also, the properties defined in `MetricsConfig` are obtained using the `getInstance()` call, where the instance is passed, and again, such a retrieval doesn't really apply to this configuration property since it doesn't belong to an instance. I understand it's one more property in `SparkConf`, but I really feel it belongs there, right by `spark.metrics.conf`.
2. Based on the previous point, this property shouldn't fall under the `spark.metrics.conf.` prefix. This is because if it did, it would be mistakenly understood as a legit metrics property (see the code [here](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/metrics/MetricsConfig.scala#L54), for example).
3. The proposed `spark.metrics.namespace` property allows one to specify an arbitrary property to be used as the namespace. I did consider whitelisting it so users could only use `spark.app.name` or `spark.app.id`, but then we'd have to deal with magic strings (app.name|app.id), and I didn't really feel inclined to do that. And, @ryan-williams, who took a stab at the same issue in #4632, made the same call, and I agree with him.





[GitHub] spark issue #14223: [SPARK-10614][CORE] Change SystemClock to derive time fr...

2016-07-19 Thread markgrover
Github user markgrover commented on the issue:

https://github.com/apache/spark/pull/14223
  
And, thanks for your feedback, @rxin, I will take a look.





[GitHub] spark pull request #14270: [SPARK-5847][CORE] Allow for configuring MetricsS...

2016-07-19 Thread markgrover
GitHub user markgrover opened a pull request:

https://github.com/apache/spark/pull/14270

[SPARK-5847][CORE] Allow for configuring MetricsSystem's use of app ID to 
namespace all metrics

## What changes were proposed in this pull request?
Adding a new property to SparkConf called spark.metrics.namespace that allows users to
set a custom namespace for executor and driver metrics in the metrics system.

By default, the root namespace used for driver or executor metrics is
the value of `spark.app.id`. However, often, users want to be able to track the metrics
across apps for driver and executor metrics, which is hard to do with the application ID
(i.e. `spark.app.id`) since it changes with every invocation of the app. For such use cases,
users can set the `spark.metrics.namespace` property to another spark configuration key like
`spark.app.name`, which is then used to populate the root namespace of the metrics system
(with the app name in our example). The `spark.metrics.namespace` property can be set to any
arbitrary spark property key, whose value would be used to set the root namespace of the
metrics system. Non-driver and non-executor metrics are never prefixed with `spark.app.id`, nor
does the `spark.metrics.namespace` property have any such effect on such metrics.
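
For example, to namespace driver and executor metrics by the app name instead of the app ID (a sketch mirroring the new unit tests):

```scala
import org.apache.spark.SparkConf

// "${spark.app.name}" is expanded by the conf machinery, so all driver and
// executor metrics are rooted at the (stable) app name across invocations.
val conf = new SparkConf()
  .set("spark.app.name", "myApp")
  .set("spark.metrics.namespace", "${spark.app.name}")
```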


## How was this patch tested?
Added new unit tests, modified existing unit tests.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/markgrover/spark spark-5847

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/14270.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #14270


commit 7d023c820cae3b970e11b2278f6c2b3f3751cc3b
Author: Mark Grover <m...@apache.org>
Date:   2016-07-19T05:38:31Z

[SPARK-5847][CORE] Allow for configuring MetricsSystem's use of app ID to 
namespace all metrics

Adding a new property to SparkConf called spark.metrics.namespace that allows users to
set a custom namespace for executor and driver metrics in the metrics system.

By default, the root namespace used for driver or executor metrics is
the value of `spark.app.id`. However, often, users want to be able to track the metrics
across apps for driver and executor metrics, which is hard to do with the application ID
(i.e. `spark.app.id`) since it changes with every invocation of the app. For such use cases,
users can set the `spark.metrics.namespace` property to another spark configuration key like
`spark.app.name`, which is then used to populate the root namespace of the metrics system
(with the app name in our example). The `spark.metrics.namespace` property can be set to any
arbitrary spark property key, whose value would be used to set the root namespace of the
metrics system. Non-driver and non-executor metrics are never prefixed with `spark.app.id`, nor
does the `spark.metrics.namespace` property have any such effect on such metrics.

commit 605e690eefe905fa52979dde16f04bd208c8b219
Author: Mark Grover <m...@apache.org>
Date:   2016-07-19T19:44:55Z

Minor fixes based on self-reviewing







[GitHub] spark issue #14223: [SPARK-10614][CORE] Change SystemClock to derive time fr...

2016-07-18 Thread markgrover
Github user markgrover commented on the issue:

https://github.com/apache/spark/pull/14223
  
Thanks @vanzin Yeah, I noticed that too. Will take a look.





[GitHub] spark issue #14223: [SPARK-10614][CORE] Change SystemClock to derive time fr...

2016-07-15 Thread markgrover
Github user markgrover commented on the issue:

https://github.com/apache/spark/pull/14223
  
The unit test failures from the previous run looked unrelated, fyi.





[GitHub] spark pull request #14223: [SPARK-10614][CORE] SystemClock uses non-monotoni...

2016-07-15 Thread markgrover
Github user markgrover commented on a diff in the pull request:

https://github.com/apache/spark/pull/14223#discussion_r71030596
  
--- Diff: core/src/main/scala/org/apache/spark/util/Clock.scala ---
@@ -33,18 +40,18 @@ private[spark] class SystemClock extends Clock {
   val minPollTime = 25L
 
   /**
-   * @return the same time (milliseconds since the epoch)
-   * as is reported by `System.currentTimeMillis()`
+   * @return the same time in milliseconds as is derived from `System.nanoTime()`
*/
-  def getTimeMillis(): Long = System.currentTimeMillis()
+  def getTimeMillis(): Long = System.nanoTime() / (1000 * 1000)
--- End diff --

Will do





[GitHub] spark pull request #14223: [SPARK-10614][CORE] SystemClock uses non-monotoni...

2016-07-15 Thread markgrover
Github user markgrover commented on a diff in the pull request:

https://github.com/apache/spark/pull/14223#discussion_r71030586
  
--- Diff: core/src/main/scala/org/apache/spark/util/Clock.scala ---
@@ -21,7 +21,14 @@ package org.apache.spark.util
  * An interface to represent clocks, so that they can be mocked out in 
unit tests.
  */
 private[spark] trait Clock {
+  /** @return the time in milliseconds */
   def getTimeMillis(): Long
+
+  /**
+   * Wait until a clock's view of current time reaches the specified target time.
+   * @param targetTime block until the current time is at least this value
--- End diff --

Will do





[GitHub] spark pull request #14223: [SPARK-10614][CORE] SystemClock uses non-monotoni...

2016-07-15 Thread markgrover
GitHub user markgrover opened a pull request:

https://github.com/apache/spark/pull/14223

[SPARK-10614][CORE] SystemClock uses non-monotonic time

## What changes were proposed in this pull request?

Change SystemClock to derive time from System.nanoTime() instead of
System.currentTimeMillis(). This is because System.currentTimeMillis() makes
the waitTillTime() routine brittle on systems like VMs, where wall-clock time
can jump forwards and backwards.
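
For illustration, a minimal sketch of the idea (not the actual Spark code):
milliseconds derived from the monotonic System.nanoTime() cannot move
backwards within a JVM, so a wait loop built on them is immune to wall-clock
adjustments. The 25 ms poll interval mirrors SystemClock's minPollTime in
the diff quoted earlier in this thread:

    // A sketch of a monotonic clock; it follows the direction of this PR but
    // is not the actual Spark implementation.
    class MonotonicClock {
      // nanoTime() has no absolute epoch, but it is monotonic within a JVM,
      // unlike currentTimeMillis(), which can jump with NTP or VM clock changes.
      def getTimeMillis(): Long = System.nanoTime() / (1000L * 1000L)

      // Wait until this clock reaches targetTime (millis on the clock's own
      // timeline); unaffected by the wall clock moving backwards.
      def waitTillTime(targetTime: Long): Long = {
        var now = getTimeMillis()
        while (now < targetTime) {
          Thread.sleep(math.min(targetTime - now, 25L)) // poll at most every 25 ms
          now = getTimeMillis()
        }
        now
      }
    }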

Shout out to @steveloughran for doing a good chunk of research and a 
previous PR (#8766) on this in 2015.


## How was this patch tested?
The build is green, and the unit tests I am running locally look green but
are taking a while, so I am filing this PR a little proactively.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/markgrover/spark SPARK-10614

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/14223.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #14223


commit a475cd82455e892b23efe2e8fad0dade2ec8b043
Author: Mark Grover <m...@apache.org>
Date:   2016-07-14T23:42:39Z

[SPARK-10614][CORE] SystemClock uses non-monotonic time in its wait logic







[GitHub] spark pull request #10953: [SPARK-12177] [STREAMING] Update KafkaDStreams to...

2016-06-13 Thread markgrover
Github user markgrover closed the pull request at:

https://github.com/apache/spark/pull/10953





[GitHub] spark issue #10953: [SPARK-12177] [STREAMING] Update KafkaDStreams to new Ka...

2016-06-13 Thread markgrover
Github user markgrover commented on the issue:

https://github.com/apache/spark/pull/10953
  
Yeah, I agree with @koeninger. This PR is pretty out of date; it makes sense
to turn focus to Cody's PR #11863.





[GitHub] spark pull request: [SPARK-13382][DOCS][PYSPARK] Update pyspark te...

2016-05-10 Thread markgrover
Github user markgrover commented on the pull request:

https://github.com/apache/spark/pull/11278#issuecomment-218229224
  
Would love to help, but I am not a committer :-)





[GitHub] spark pull request: [SPARK-13670][launcher] Propagate error from l...

2016-05-09 Thread markgrover
Github user markgrover commented on the pull request:

https://github.com/apache/spark/pull/12910#issuecomment-218020176
  
LGTM. I had tested this on Mac a while back (on 03/22) and it worked for me.





[GitHub] spark pull request: [SPARK-9384] [core] Easier setting of executor...

2016-05-06 Thread markgrover
Github user markgrover commented on the pull request:

https://github.com/apache/spark/pull/7739#issuecomment-217517817
  
Thanks for bringing this up, Sean. Yeah, I'd like to see this go in too. I
realize there are some merge conflicts here. I will make the semantics clear
in the documentation (they are already documented; I will make sure they are
clear enough).

So, if anyone is strongly against this, let me know. Otherwise, I will be
updating the code early next week.





[GitHub] spark pull request: [SPARK-14731][shuffle]Revert SPARK-12130 to ma...

2016-04-22 Thread markgrover
Github user markgrover commented on the pull request:

https://github.com/apache/spark/pull/12568#issuecomment-213518615
  
LGTM. +1 (non-binding)





[GitHub] spark pull request: [SPARK-14731][shuffle]Revert SPARK-12130 to ma...

2016-04-22 Thread markgrover
Github user markgrover commented on a diff in the pull request:

https://github.com/apache/spark/pull/12568#discussion_r60772476
  
--- Diff: 
common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleBlockResolver.java
 ---
@@ -183,12 +190,7 @@ public ManagedBuffer getBlockData(String appId, String execId, String blockId) {
        String.format("Executor is not registered (appId=%s, execId=%s)", appId, execId));
    }

-   if ("sort".equals(executor.shuffleManager) || "tungsten-sort".equals(executor.shuffleManager)) {
-     return getSortBasedShuffleBlockData(executor, shuffleId, mapId, reduceId);
-   } else {
-     throw new UnsupportedOperationException(
-       "Unsupported shuffle manager: " + executor.shuffleManager);
-   }
+   return getSortBasedShuffleBlockData(executor, shuffleId, mapId, reduceId);
--- End diff --

Makes sense.





[GitHub] spark pull request: [SPARK-14731][shuffle]Revert SPARK-12130 to ma...

2016-04-21 Thread markgrover
Github user markgrover commented on a diff in the pull request:

https://github.com/apache/spark/pull/12568#discussion_r60624962
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala 
---
@@ -182,7 +182,7 @@ private[spark] class BlockManager(
 val shuffleConfig = new ExecutorShuffleInfo(
   diskBlockManager.localDirs.map(_.toString),
   diskBlockManager.subDirsPerLocalDir,
-  shuffleManager.shortName)
--- End diff --

Yeah, I'm pretty sure it's not.





[GitHub] spark pull request: [SPARK-14731][shuffle]Revert SPARK-12130 to ma...

2016-04-21 Thread markgrover
Github user markgrover commented on a diff in the pull request:

https://github.com/apache/spark/pull/12568#discussion_r60613929
  
--- Diff: 
common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleBlockResolver.java
 ---
@@ -183,12 +190,7 @@ public ManagedBuffer getBlockData(String appId, String execId, String blockId) {
        String.format("Executor is not registered (appId=%s, execId=%s)", appId, execId));
    }

-   if ("sort".equals(executor.shuffleManager) || "tungsten-sort".equals(executor.shuffleManager)) {
-     return getSortBasedShuffleBlockData(executor, shuffleId, mapId, reduceId);
-   } else {
-     throw new UnsupportedOperationException(
-       "Unsupported shuffle manager: " + executor.shuffleManager);
-   }
+   return getSortBasedShuffleBlockData(executor, shuffleId, mapId, reduceId);
--- End diff --

Personally, I'd prefer an assert or similar on the shuffleManager here. What
do you think?
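
For illustration, a rough sketch of the kind of check being suggested; the
real resolver is Java, and the names here are stand-ins rather than the
actual Spark code:

    // Hypothetical stand-in for the resolver logic discussed above: fail fast
    // on an unrecognized shuffle manager instead of silently assuming
    // sort-based shuffle.
    object ShuffleManagerCheck {
      private val supportedManagers = Set("sort", "tungsten-sort")

      def getBlockData(shuffleManager: String): String = {
        require(supportedManagers.contains(shuffleManager),
          s"Unsupported shuffle manager: $shuffleManager")
        s"sort-based shuffle block data (manager=$shuffleManager)"
      }
    }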





[GitHub] spark pull request: [SPARK-14731][shuffle]Revert SPARK-12130 to ma...

2016-04-21 Thread markgrover
Github user markgrover commented on the pull request:

https://github.com/apache/spark/pull/12568#issuecomment-213000250
  
Thanks for picking this up, @lianhuiwang. I didn't get a chance to work on
this yesterday, so I am very glad you picked it up, thanks! Left some
comments.





[GitHub] spark pull request: [SPARK-14731][shuffle]Revert SPARK-12130 to ma...

2016-04-21 Thread markgrover
Github user markgrover commented on a diff in the pull request:

https://github.com/apache/spark/pull/12568#discussion_r60613362
  
--- Diff: 
common/network-shuffle/src/test/java/org/apache/spark/network/shuffle/ExternalShuffleIntegrationSuite.java
 ---
@@ -186,7 +186,12 @@ public void testFetchThreeSort() throws Exception {
 
   @Test
   public void testFetchInvalidShuffle() throws Exception {
-registerExecutor("exec-1", dataContext0.createExecutorInfo("unknown 
sort manager"));
+try {
+  registerExecutor("exec-1", dataContext0.createExecutorInfo("unknown 
sort manager"));
+  fail("unknown sort manager ");
--- End diff --

nit: trailing whitespace





[GitHub] spark pull request: [SPARK-14731][shuffle]Revert SPARK-12130 to ma...

2016-04-21 Thread markgrover
Github user markgrover commented on a diff in the pull request:

https://github.com/apache/spark/pull/12568#discussion_r60613340
  
--- Diff: 
common/network-shuffle/src/test/java/org/apache/spark/network/shuffle/ExternalShuffleIntegrationSuite.java
 ---
@@ -186,7 +186,12 @@ public void testFetchThreeSort() throws Exception {
 
   @Test
   public void testFetchInvalidShuffle() throws Exception {
-registerExecutor("exec-1", dataContext0.createExecutorInfo("unknown 
sort manager"));
+try {
--- End diff --

I think this test, the way it's written, doesn't make much sense any more
since registerExecutor throws an exception. Personally, I'd prefer if we
stopped checking for blocks being empty and instead asserted that the correct
type of exception, with the correct message, is thrown.
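
A self-contained sketch of that pattern, using ScalaTest for brevity (the
real suite is Java/JUnit, and registerExecutor here is a hypothetical
stand-in):

    import org.scalatest.FunSuite

    class RegisterExecutorSuite extends FunSuite {
      // Hypothetical stand-in for the real registerExecutor, which now
      // rejects unknown shuffle managers at registration time.
      private def registerExecutor(shuffleManager: String): Unit = {
        if (shuffleManager != "sort" && shuffleManager != "tungsten-sort") {
          throw new UnsupportedOperationException(
            "Unsupported shuffle manager: " + shuffleManager)
        }
      }

      test("registering an unknown shuffle manager fails with a clear error") {
        // Assert on the exception type and message rather than on empty blocks.
        val e = intercept[UnsupportedOperationException] {
          registerExecutor("unknown sort manager")
        }
        assert(e.getMessage.contains("Unsupported shuffle manager"))
      }
    }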





[GitHub] spark pull request: [SPARK-12177][Streaming][Kafka] Update KafkaDS...

2016-04-19 Thread markgrover
Github user markgrover commented on the pull request:

https://github.com/apache/spark/pull/11863#issuecomment-212170020
  
Overall, I agree with the approach. I'll take a look at the
kafka-exactly-once repo; I don't think it will hurt to add more tests.

As far as KafkaUtils goes, I am OK with getting rid of it for Scala/Java
users, but there are a bunch of Python utility functions whose equivalents
will likely be needed if/when we add Python support. We could just add it
back later if users ask for it, or if we find it's the best place for Python
utility functions to live.




