[GitHub] spark issue #18179: [SPARK-20894][SS] Resolve the checkpoint location in dri...
Github user markgrover commented on the issue: https://github.com/apache/spark/pull/18179

Sounds good to me. The JIRA was unresolved. I have resolved it with a fix version of 2.3.0.
[GitHub] spark issue #18802: [SPARK-18535][SPARK-19720][CORE][BACKPORT-2.1] Redact se...
Github user markgrover commented on the issue: https://github.com/apache/spark/pull/18802

Thanks @dmvieira and @vanzin
[GitHub] spark pull request #18765: [SPARK-19720][CORE][BACKPORT-2.1] Redact sensitiv...
Github user markgrover commented on a diff in the pull request: https://github.com/apache/spark/pull/18765#discussion_r130668429

--- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
@@ -2571,6 +2572,23 @@ private[spark] object Utils extends Logging {
       sparkJars.map(_.split(",")).map(_.filter(_.nonEmpty)).toSeq.flatten
     }
   }
+
+  private[util] val REDACTION_REPLACEMENT_TEXT = "*(redacted)"
+  private[util] val SECRET_REDACTION_PATTERN = "(?i)secret|password".r
--- End diff --

I think what's really happening here is that we are backporting some changes introduced in SPARK-18535 while backporting this JIRA (SPARK-19720). SPARK-18535 is a dependency of this, so if we want to backport this, we should really be backporting SPARK-18535 as well.
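For readers following the backport discussion, a minimal standalone sketch (illustrative names, not the exact SPARK-18535 code) of how a key-matching redaction pattern like the one above gets applied to configuration pairs:

```scala
import scala.util.matching.Regex

object RedactionSketch {
  val ReplacementText = "*(redacted)"
  val SecretPattern: Regex = "(?i)secret|password".r

  // Replace the value of any pair whose key matches the secret pattern.
  def redact(kvs: Seq[(String, String)]): Seq[(String, String)] =
    kvs.map { case (k, v) =>
      if (SecretPattern.findFirstIn(k).isDefined) (k, ReplacementText) else (k, v)
    }

  def main(args: Array[String]): Unit = {
    val props = Seq(
      "spark.executorEnv.HADOOP_CREDSTORE_PASSWORD" -> "hunter2",
      "spark.app.name" -> "demo")
    redact(props).foreach(println)
    // Prints:
    // (spark.executorEnv.HADOOP_CREDSTORE_PASSWORD,*(redacted))
    // (spark.app.name,demo)
  }
}
```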
[GitHub] spark issue #18179: [SPARK-20894][SS] Resolve the checkpoint location in dri...
Github user markgrover commented on the issue: https://github.com/apache/spark/pull/18179

Doesn't look like this made it to 2.2. Should we mark the JIRA resolved with 2.3 as the target version? I am assuming it's too late for this to go into 2.2.0. Maybe retarget this for 2.2.1?
[GitHub] spark issue #18234: [SPARK-19185][DSTREAM] Make Kafka consumer cache configu...
Github user markgrover commented on the issue: https://github.com/apache/spark/pull/18234

Thanks @vanzin
[GitHub] spark issue #18234: [SPARK-19185][DSTREAM] Make Kafka consumer cache configu...
Github user markgrover commented on the issue: https://github.com/apache/spark/pull/18234

Thanks all.

On Jun 7, 2017 6:41 PM, "Cody Koeninger" wrote:

> LGTM, thanks Mark
[GitHub] spark pull request #18234: [SPARK-19185][DSTREAM] Make Kafka consumer cache ...
Github user markgrover commented on a diff in the pull request: https://github.com/apache/spark/pull/18234#discussion_r120778946

--- Diff: docs/streaming-kafka-0-10-integration.md ---
@@ -91,7 +91,7 @@ The new Kafka consumer API will pre-fetch messages into buffers. Therefore it i

 In most cases, you should use `LocationStrategies.PreferConsistent` as shown above. This will distribute partitions evenly across available executors. If your executors are on the same hosts as your Kafka brokers, use `PreferBrokers`, which will prefer to schedule partitions on the Kafka leader for that partition. Finally, if you have a significant skew in load among partitions, use `PreferFixed`. This allows you to specify an explicit mapping of partitions to hosts (any unspecified partitions will use a consistent location).

-The cache for consumers has a default maximum size of 64. If you expect to be handling more than (64 * number of executors) Kafka partitions, you can change this setting via `spark.streaming.kafka.consumer.cache.maxCapacity`
+The cache for consumers has a default maximum size of 64. If you expect to be handling more than (64 * number of executors) Kafka partitions, you can change this setting via `spark.streaming.kafka.consumer.cache.maxCapacity`. If you would like to disable the caching for Kafka consumers, you can set `spark.streaming.kafka.consumer.cache.enabled` to `false`.
--- End diff --

OK, thanks. I have updated the doc to add a little more context. If you could review and help this get committed, I'd really appreciate that. Thanks again!
[GitHub] spark pull request #18234: [SPARK-19185][DSTREAM] Make Kafka consumer cache ...
Github user markgrover commented on a diff in the pull request: https://github.com/apache/spark/pull/18234#discussion_r120703106

--- Diff: docs/streaming-kafka-0-10-integration.md ---
@@ -91,7 +91,7 @@ The new Kafka consumer API will pre-fetch messages into buffers. Therefore it i

 In most cases, you should use `LocationStrategies.PreferConsistent` as shown above. This will distribute partitions evenly across available executors. If your executors are on the same hosts as your Kafka brokers, use `PreferBrokers`, which will prefer to schedule partitions on the Kafka leader for that partition. Finally, if you have a significant skew in load among partitions, use `PreferFixed`. This allows you to specify an explicit mapping of partitions to hosts (any unspecified partitions will use a consistent location).

-The cache for consumers has a default maximum size of 64. If you expect to be handling more than (64 * number of executors) Kafka partitions, you can change this setting via `spark.streaming.kafka.consumer.cache.maxCapacity`
+The cache for consumers has a default maximum size of 64. If you expect to be handling more than (64 * number of executors) Kafka partitions, you can change this setting via `spark.streaming.kafka.consumer.cache.maxCapacity`. If you would like to disable the caching for Kafka consumers, you can set `spark.streaming.kafka.consumer.cache.enabled` to `false`.
--- End diff --

I am open to either option. I'd slightly prefer documenting that option and saying something about there being no guarantee. Thanks for reviewing, Sean. @koeninger, I would appreciate your review as well, thanks.
[GitHub] spark issue #18234: [SPARK-19185][DSTREAM] Make Kafka consumer cache configu...
Github user markgrover commented on the issue: https://github.com/apache/spark/pull/18234

This is related to but is a stripped down version of #16629.
[GitHub] spark pull request #18234: [SPARK-19185][DSTREAM] Make Kafka consumer cache ...
GitHub user markgrover opened a pull request: https://github.com/apache/spark/pull/18234

[SPARK-19185][DSTREAM] Make Kafka consumer cache configurable

## What changes were proposed in this pull request?

Add a new property `spark.streaming.kafka.consumer.cache.enabled` that allows users to enable or disable the cache for Kafka consumers. This property can be especially handy in cases where issues like SPARK-19185 get hit, for which there isn't a solution committed yet. By default, the cache is still on, so this change doesn't alter any out-of-the-box behavior.

## How was this patch tested?

Running unit tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/markgrover/spark spark-19185

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/18234.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #18234

commit 68ca3f3c1051e7b98f2dcb92b752bee43d810c07
Author: Mark Grover
Date: 2017-06-07T16:03:37Z

    [SPARK-19185][DSTREAM] Make Kafka consumer cache configurable
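As a usage note, a minimal sketch of how the properties discussed in this PR would be set; only the two `spark.streaming.kafka.consumer.cache.*` keys come from the PR and its doc change, the rest of the SparkConf setup is illustrative:

```scala
import org.apache.spark.SparkConf

// Sketch: configuring the Kafka consumer cache discussed in this PR.
val conf = new SparkConf()
  .setAppName("kafka-cache-demo")
  // Disable the executor-side consumer cache entirely (the new property).
  .set("spark.streaming.kafka.consumer.cache.enabled", "false")
  // Alternatively, keep the cache on but raise its capacity above the
  // default of 64 cached consumers per executor.
  .set("spark.streaming.kafka.consumer.cache.maxCapacity", "128")
```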
[GitHub] spark issue #17990: [YARN] [SPARK-20756] yarn-shuffle jar references unshade...
Github user markgrover commented on the issue: https://github.com/apache/spark/pull/17990

Thanks.

On May 22, 2017 6:13 PM, "asfgit" wrote:

> Closed #17990 <https://github.com/apache/spark/pull/17990> via 3630911 <https://github.com/apache/spark/commit/36309110046a89d749a7c9746eaa16997de26922>.
[GitHub] spark issue #17990: [YARN] [SPARK-20756] yarn-shuffle jar references unshade...
Github user markgrover commented on the issue: https://github.com/apache/spark/pull/17990

@vanzin thanks for your suggestion. I have done some more testing and all looks good from my side. I'd appreciate it if you could take another look. Thanks!
[GitHub] spark issue #17990: [YARN] [SPARK-20756][WIP] yarn-shuffle jar references un...
Github user markgrover commented on the issue: https://github.com/apache/spark/pull/17990

Jenkins, re-test this please.
[GitHub] spark pull request #17990: [YARN] [SPARK-20756][WIP] yarn-shuffle jar refere...
Github user markgrover commented on a diff in the pull request: https://github.com/apache/spark/pull/17990#discussion_r116845002

--- Diff: common/network-yarn/pom.xml ---
@@ -113,6 +116,13 @@
       io.netty.**
+
+      com.google.common
--- End diff --

I have changed it to use inheritance. They looked equivalent, but on second thought inheritance is better so any changes to the parent pom trickle down automatically.
[GitHub] spark issue #17990: [YARN] [SPARK-20756] yarn-shuffle jar references unshade...
Github user markgrover commented on the issue: https://github.com/apache/spark/pull/17990

Jenkins, test this please.
[GitHub] spark pull request #17990: [YARN] [SPARK-20756] yarn-shuffle jar references ...
GitHub user markgrover opened a pull request: https://github.com/apache/spark/pull/17990

[YARN] [SPARK-20756] yarn-shuffle jar references unshaded guava and contains scala classes

## What changes were proposed in this pull request?

This change ensures that all references to guava from within the yarn shuffle jar point to the shaded guava class already provided in the jar. It also explicitly excludes scala classes from being added to the jar.

## How was this patch tested?

Ran unit tests on the module and they passed. I am in the process of taking this jar and deploying it on a yarn cluster, making sure the YARN node managers come up and that the shuffle via the shuffle service works.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/markgrover/spark spark-20756

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/17990.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #17990

commit 461ee5aad13bbd35e710bed8b6943409ce90d457
Author: Mark Grover
Date: 2017-05-16T01:17:05Z

    [YARN] [SPARK-20756] yarn-shuffle jar has references to unshaded guava and contains scala classes
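A hedged sketch of the kind of verification described under "How was this patch tested": scanning the assembled yarn-shuffle jar for unshaded Guava or bundled Scala classes. The jar path comes in as an argument; nothing here is taken from the PR itself beyond the two package prefixes being checked.

```scala
import java.util.jar.JarFile
import scala.collection.JavaConverters._

object ShadeCheck {
  def main(args: Array[String]): Unit = {
    // args(0): path to the built yarn-shuffle jar (path is an assumption).
    val jar = new JarFile(args(0))
    val entries = jar.entries.asScala.map(_.getName).toList
    jar.close()
    // After this change, neither unshaded Guava classes nor Scala classes
    // should appear among the jar's entries.
    val offenders = entries.filter { name =>
      name.startsWith("com/google/common/") || name.startsWith("scala/")
    }
    offenders.foreach(println)
    println(s"${offenders.size} offending entries found")
  }
}
```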
[GitHub] spark issue #17790: [SPARK-20514][CORE] Upgrade Jetty to 9.3.11.v20160721
Github user markgrover commented on the issue: https://github.com/apache/spark/pull/17790

Thanks for the review!
[GitHub] spark pull request #17790: [SPARK-20514][CORE] Upgrade Jetty to 9.3.11.v2016...
Github user markgrover commented on a diff in the pull request: https://github.com/apache/spark/pull/17790#discussion_r113817810

--- Diff: pom.xml ---
@@ -136,7 +136,7 @@
     10.12.1.1
     1.8.2
     1.6.0
-    9.2.16.v20160414
+    9.3.11.v20160721
--- End diff --

Ah, thanks, there was some confusion about this. I have fixed the JIRA title and the PR title to match up with the actual code - which uses the exact same version as Apache Hadoop (9.3.11.v20160721).
[GitHub] spark issue #17790: [SPARK-20514][CORE] Upgrade Jetty to 9.3.13.v20161014
Github user markgrover commented on the issue: https://github.com/apache/spark/pull/17790

I am running unit tests locally, and while the run hasn't finished, it's looking pretty good so far. I think it'd be good to get Jenkins to run the tests here anyway, so I'm putting up a PR sooner rather than later.
[GitHub] spark pull request #17790: [SPARK-20514][CORE] Upgrade Jetty to 9.3.13.v2016...
GitHub user markgrover opened a pull request: https://github.com/apache/spark/pull/17790

[SPARK-20514][CORE] Upgrade Jetty to 9.3.13.v20161014

Upgrade Jetty so it can work with Hadoop 3 (the alpha 2 release, in particular). Without this change, because of incompatibility between Jetty versions, Spark fails to compile when built against Hadoop 3.

## How was this patch tested?

Unit tests being run.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/markgrover/spark spark-20514

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/17790.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #17790

commit ecfb8e3f276eeb276ed0a3293a68ff93a6f9e88e
Author: Mark Grover
Date: 2017-04-27T21:32:18Z

    [SPARK-20514][CORE] Upgrade Jetty to 9.3.13.v20161014

    Upgrade Jetty so it can work with Hadoop 3 (the alpha 2 release, in particular). Without this change, because of incompatibility between Jetty versions, Spark fails to compile when built against Hadoop 3.
[GitHub] spark issue #17725: [SPARK-20435][CORE] More thorough redaction of sensitive...
Github user markgrover commented on the issue: https://github.com/apache/spark/pull/17725

Thanks Marcelo!

On Apr 26, 2017 5:06 PM, "Marcelo Vanzin" wrote:

> Merging to master / 2.2.
[GitHub] spark issue #17725: [SPARK-20435][CORE] More thorough redaction of sensitive...
Github user markgrover commented on the issue: https://github.com/apache/spark/pull/17725

Finally it's passing!
[GitHub] spark issue #17725: [SPARK-20435][CORE] More thorough redaction of sensitive...
Github user markgrover commented on the issue: https://github.com/apache/spark/pull/17725

Jenkins, retest this please.
[GitHub] spark issue #17725: [SPARK-20435][CORE] More thorough redaction of sensitive...
Github user markgrover commented on the issue: https://github.com/apache/spark/pull/17725

This time it finished in regular time (2h 29m) but failed a test. I ran that test locally (org.apache.spark.streaming.ReceiverSuite's 'receiver life cycle') and it passed for me, so, mostly out of laziness, I am going to issue another run here.
[GitHub] spark issue #17725: [SPARK-20435][CORE] More thorough redaction of sensitive...
Github user markgrover commented on the issue: https://github.com/apache/spark/pull/17725

Kicking off another run here while I also run the tests locally.
[GitHub] spark issue #17725: [SPARK-20435][CORE] More thorough redaction of sensitive...
Github user markgrover commented on the issue: https://github.com/apache/spark/pull/17725

Jenkins, retest this please.
[GitHub] spark issue #17725: [SPARK-20435][CORE] More thorough redaction of sensitive...
Github user markgrover commented on the issue: https://github.com/apache/spark/pull/17725

Hmm, this is weird - it ran for 4h 10m and got killed due to a timeout. Looking...
[GitHub] spark issue #17725: [SPARK-20435][CORE] More thorough redaction of sensitive...
Github user markgrover commented on the issue: https://github.com/apache/spark/pull/17725

Jenkins, retest this please.
[GitHub] spark issue #17725: [SPARK-20435][CORE] More thorough redaction of sensitive...
Github user markgrover commented on the issue: https://github.com/apache/spark/pull/17725

Few decisions that I made here:

* I considered whether just the `sun.java.command` property should be checked and redacted, but that seemed too specific and likely a band-aid for the current problem, not a long-term solution, so I decided against doing it.
* Redaction for the `SparkListenerEnvironmentUpdate` event was solely being done on `Spark Properties`, while `sun.java.command` is a part of `System Properties`. I considered doing redaction for `System Properties` in addition to `Spark Properties` (that would have gone somewhere around [here](https://github.com/apache/spark/pull/17725/files#diff-e4a5a68c15eed95d038acfed84b0b66aL258)) but decided against it because that would have meant even more hardcoding, and I didn't see why these 2 kinds of properties are special enough to be redacted but not the rest. So, I decided to redact information from all kinds of properties.
* One way to redact the property value would have been to redact the minimum possible portion of the value while keeping the rest intact. For example, if the following were the unredacted case:
  `"sun.java.command":"org.apache.spark.deploy.SparkSubmit ... --conf spark.executorEnv.HADOOP_CREDSTORE_PASSWORD=secret_password --conf spark.other.property=2"`
  one option for the redacted output could have been:
  `"sun.java.command":"org.apache.spark.deploy.SparkSubmit ... --conf spark.executorEnv.HADOOP_CREDSTORE_PASSWORD=*(redacted) --conf spark.other.property=2"`
  However, such a redaction is very hard to maintain. We would have had to take the current regex (which is `(?i)secret|password` by default) and wrap it in extra matchers, something like `"(" + SECRET_REDACTION_DEFAULT + "[^ ]*=)[^ ]*"`, so that just the matched portion could be squeezed out and replaced. This all seemed very fragile (and even worse when the user supplies a non-default regex), so I decided it was easiest to simply replace the entire value, even though only a small part of it contains `secret` or `password`.
* One thing which I didn't explicitly check was the performance implication of this change. I bring this up because previously we compared only keys against the regex; now, if the key doesn't match, we also match the value against the regex. So, in the worst case, we do twice as many regex matches as before. Also, before, we were doing regex matching only on `Spark Properties`; now we do it on all properties - `Spark Properties`, `System Properties`, `JVM Properties`, and `Classpath Properties`. I don't think this should have a big performance impact, so I didn't invest time in it - mentioning it here in the interest of full disclosure.

Thanks in advance for reviewing.
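A minimal sketch of the key-or-value matching decision described above (illustrative names; this is not the exact patch):

```scala
import scala.util.matching.Regex

object ValueRedactionSketch {
  val ReplacementText = "*(redacted)"

  // Redact a pair if either its key or its value matches the pattern,
  // replacing the entire value rather than only the matched substring.
  def redact(pattern: Regex, kvs: Seq[(String, String)]): Seq[(String, String)] =
    kvs.map { case (k, v) =>
      val matched = pattern.findFirstIn(k).isDefined || pattern.findFirstIn(v).isDefined
      if (matched) (k, ReplacementText) else (k, v)
    }

  def main(args: Array[String]): Unit = {
    val props = Seq(
      // The key alone gives nothing away, but the value embeds a password.
      "sun.java.command" -> ("org.apache.spark.deploy.SparkSubmit --conf " +
        "spark.executorEnv.HADOOP_CREDSTORE_PASSWORD=secret_password"),
      "spark.other.property" -> "2")
    redact("(?i)secret|password".r, props).foreach(println)
    // Prints:
    // (sun.java.command,*(redacted))
    // (spark.other.property,2)
  }
}
```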
[GitHub] spark pull request #17725: [SPARK-20435][CORE] More thorough redaction of se...
GitHub user markgrover opened a pull request: https://github.com/apache/spark/pull/17725

[SPARK-20435][CORE] More thorough redaction of sensitive information

This change does a more thorough redaction of sensitive information from logs and UI, and adds unit tests that ensure no regressions leak sensitive information into the logs.

Previously, the redaction logic only checked whether the key matched the secret regex pattern and, if so, redacted its value. That worked for most cases. However, for a key like `sun.java.command`, the key itself doesn't tell much, so the value needs to be searched as well. This PR expands the check to cover values too.

## How was this patch tested?

New unit tests were added that ensure no sensitive information is present in the event logs or the yarn logs.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/markgrover/spark spark-20435

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/17725.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #17725

commit 2f5148a2e37d8d36006fb297b28a9e8c21a0026b
Author: Mark Grover
Date: 2017-04-22T00:24:30Z

    [SPARK-20435][CORE] More thorough redaction of sensitive information from logs/UI, more unit tests
[GitHub] spark issue #17442: [SPARK-20107][SQL] Speed up HadoopMapReduceCommitProtoco...
Github user markgrover commented on the issue: https://github.com/apache/spark/pull/17442

Agreed.
[GitHub] spark pull request #17393: [SPARK-20066] [CORE] Add explicit SecurityManager...
Github user markgrover closed the pull request at: https://github.com/apache/spark/pull/17393
[GitHub] spark issue #17393: [SPARK-20066] [CORE] Add explicit SecurityManager(SparkC...
Github user markgrover commented on the issue: https://github.com/apache/spark/pull/17393

Thanks for posting. I will close this; I understand the hesitation to set a precedent here.
[GitHub] spark pull request #17393: [SPARK-20066] [CORE] Add explicit SecurityManager...
GitHub user markgrover opened a pull request: https://github.com/apache/spark/pull/17393

[SPARK-20066] [CORE] Add explicit SecurityManager(SparkConf) constructor for backwards compatibility with Java.

## What changes were proposed in this pull request?

This adds an explicit SecurityManager(SparkConf) constructor in addition to the existing constructor that takes 2 arguments - SparkConf and ioEncryptionKey. The second argument has a default, but that's still not enough if this code is invoked from Java because of [this issue](http://stackoverflow.com/questions/13059528/instantiate-a-scala-class-from-java-and-use-the-default-parameters-of-the-const).

## How was this patch tested?

Before this PR:
- `mvn clean package -Dspark.version=2.1.0` fails.
- `mvn clean package -Dspark.version=2.0.0` passes.

After this PR:
- `mvn clean package -Dspark.version=2.2.0-SNAPSHOT` passes.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/markgrover/spark spark-20066

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/17393.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #17393

commit 2a3c66f3f2ef89d1bbde61e1144487b5a99b70b1
Author: Mark Grover
Date: 2017-03-23T03:41:27Z

    [SPARK-20066] [CORE] Add explicit SecurityManager(SparkConf) constructor for backwards compatibility with Java
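For context, a simplified sketch (stand-in classes, not the real org.apache.spark sources) of why the explicit overload helps: Scala default arguments are not visible to Java callers, so Java code lost the one-argument form when the `ioEncryptionKey` parameter was added.

```scala
class SparkConf // stand-in for org.apache.spark.SparkConf

class SecurityManager(
    val sparkConf: SparkConf,
    val ioEncryptionKey: Option[Array[Byte]] = None) {

  // Explicit auxiliary constructor, as proposed by this PR: Java callers
  // can once again write `new SecurityManager(conf)` without supplying
  // the second argument, which the Scala default alone does not allow
  // from Java.
  def this(sparkConf: SparkConf) = this(sparkConf, None)
}
```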
[GitHub] spark issue #17047: [SPARK-19720][CORE] Redact sensitive information from Sp...
Github user markgrover commented on the issue: https://github.com/apache/spark/pull/17047

Thanks @vanzin
[GitHub] spark pull request #17127: [SPARK-19734][PYTHON][ML] Correct OneHotEncoder d...
GitHub user markgrover opened a pull request: https://github.com/apache/spark/pull/17127

[SPARK-19734][PYTHON][ML] Correct OneHotEncoder doc string to say dropLast

## What changes were proposed in this pull request?

Updates the doc string to match up with the code, i.e. say `dropLast` instead of `includeFirst`.

## How was this patch tested?

Not much, since it's a doc-only change. Will run unit tests via the Jenkins job.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/markgrover/spark spark_19734

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/17127.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #17127

commit f3f5fd198a16e0128b37ac105d31d59cb9b1de8e
Author: Mark Grover
Date: 2017-03-01T18:59:18Z

    [SPARK-19734][PYTHON][ML] OneHotEncoder __init__ uses dropLast but doc strings all say includeFirst

    Updates the doc string to match up with the code, i.e. list dropLast instead of includeFirst.
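For reference, a small hedged example of what `dropLast` controls, using the Scala `spark.ml` counterpart (the PR itself only fixes the Python docstring; the setup below is illustrative). With `dropLast = true`, the default, the last category is represented by the all-zeros vector instead of getting its own component.

```scala
import org.apache.spark.ml.feature.OneHotEncoder
import org.apache.spark.sql.SparkSession

object DropLastDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("dropLast").getOrCreate()
    import spark.implicits._

    // Category indices 0.0, 1.0, 2.0 in a single column.
    val df = Seq(0.0, 1.0, 2.0).map(Tuple1.apply).toDF("category")
    val encoded = new OneHotEncoder()
      .setInputCol("category")
      .setOutputCol("categoryVec")
      .setDropLast(true) // category 2.0 encodes as the zero vector
      .transform(df)
    encoded.show(false)
    spark.stop()
  }
}
```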
[GitHub] spark issue #17047: [SPARK-19720][CORE] Redact sensitive information from Sp...
Github user markgrover commented on the issue: https://github.com/apache/spark/pull/17047

Seems unrelated.
[GitHub] spark issue #17047: [SPARK-19720][CORE] Redact sensitive information from Sp...
Github user markgrover commented on the issue: https://github.com/apache/spark/pull/17047

Jenkins, test this again, please.
[GitHub] spark pull request #17047: [SPARK-19720][CORE] Redact sensitive information ...
Github user markgrover commented on a diff in the pull request: https://github.com/apache/spark/pull/17047#discussion_r103587323

--- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
@@ -2574,13 +2575,31 @@ private[spark] object Utils extends Logging {
   def redact(conf: SparkConf, kvs: Seq[(String, String)]): Seq[(String, String)] = {
     val redactionPattern = conf.get(SECRET_REDACTION_PATTERN).r
+    redact(redactionPattern, kvs)
+  }
+
+  private def redact(redactionPattern: Regex, kvs: Seq[(String, String)]): Seq[(String, String)] = {
     kvs.map { kv =>
       redactionPattern.findFirstIn(kv._1)
-        .map { ignore => (kv._1, REDACTION_REPLACEMENT_TEXT) }
+        .map {ignore => (kv._1, REDACTION_REPLACEMENT_TEXT) }
--- End diff --

Ah, right, I misunderstood that - I took your comment as a bash + scala way to say eliminate the space (partly because it's hard to understand spacing in github comments). My bad, let me fix that.
[GitHub] spark pull request #17047: [SPARK-19720][CORE] Redact sensitive information ...
Github user markgrover commented on a diff in the pull request: https://github.com/apache/spark/pull/17047#discussion_r103585591

--- Diff: core/src/main/scala/org/apache/spark/internal/config/package.scala ---
@@ -240,14 +240,16 @@ package object config {
     .longConf
     .createWithDefault(4 * 1024 * 1024)

+  private[spark] val SECRET_REDACTION_PROPERTY = "spark.redaction.regex"
--- End diff --

Thanks @vanzin. Updated.
[GitHub] spark pull request #17047: [SPARK-19720][SPARK SUBMIT] Redact sensitive info...
Github user markgrover commented on a diff in the pull request: https://github.com/apache/spark/pull/17047#discussion_r103367049

--- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
@@ -2574,13 +2575,30 @@ private[spark] object Utils extends Logging {
   def redact(conf: SparkConf, kvs: Seq[(String, String)]): Seq[(String, String)] = {
     val redactionPattern = conf.get(SECRET_REDACTION_PATTERN).r
+    redact(redactionPattern, kvs)
+  }
+
+  private def redact(redactionPattern: Regex, kvs: Seq[(String, String)]): Seq[(String, String)] = {
     kvs.map { kv =>
       redactionPattern.findFirstIn(kv._1)
         .map { ignore => (kv._1, REDACTION_REPLACEMENT_TEXT) }
         .getOrElse(kv)
     }
   }

+  /**
+   * Looks up the redaction regex from within the key value pairs and uses it to redact the rest
+   * of the key value pairs. No care is taken to make sure the redaction property itself is not
+   * redacted. So theoretically, the property itself could be configured to redact its own value
+   * when printing.
+   * @param kvs
+   * @return
+   */
+  def redact(kvs: Map[String, String]): Seq[(String, String)] = {
--- End diff --

Correct, that's exactly the use case - where there isn't a conf object available yet. I will update the Javadoc. Thanks for reviewing!
[GitHub] spark pull request #17047: [SPARK-19720][SPARK SUBMIT] Redact sensitive info...
GitHub user markgrover opened a pull request: https://github.com/apache/spark/pull/17047

[SPARK-19720][SPARK SUBMIT] Redact sensitive information from SparkSubmit console

## What changes were proposed in this pull request?

This change redacts sensitive information (based on the `spark.redaction.regex` property) from the SparkSubmit console logs. Such sensitive information is already being redacted from event logs, yarn logs, etc.

## How was this patch tested?

Testing was done manually to make sure that the console logs were not printing any sensitive information. Here's some output from the console:

```
Spark properties used, including those specified through --conf and those from the properties file /etc/spark2/conf/spark-defaults.conf:
(spark.yarn.appMasterEnv.HADOOP_CREDSTORE_PASSWORD,*(redacted))
(spark.authenticate,false)
(spark.executorEnv.HADOOP_CREDSTORE_PASSWORD,*(redacted))
```

```
System properties:
(spark.yarn.appMasterEnv.HADOOP_CREDSTORE_PASSWORD,*(redacted))
(spark.authenticate,false)
(spark.executorEnv.HADOOP_CREDSTORE_PASSWORD,*(redacted))
```

There is a risk that if new print statements are added to the console down the road, sensitive information may still get leaked, since there is no test that asserts on the console log output. I considered it out of the scope of this JIRA to write an integration test to make sure new leaks don't happen in the future.

Running unit tests to make sure nothing else is broken by this change.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/markgrover/spark master_redaction

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/17047.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #17047

commit 000efb1e3152f837e01ce1f80ae108c596f9baa5
Author: Mark Grover
Date: 2017-02-24T01:30:05Z

    [SPARK-19720][SPARK SUBMIT] Redact sensitive information from SparkSubmit console output

    This change redacts sensitive information (based on the spark.redaction.regex property) from the SparkSubmit console logs. Such sensitive information is already being redacted from event logs, yarn logs, etc.

    Testing was done manually to make sure that the console logs were not printing any sensitive information. Here's some output from the console:

    Spark properties used, including those specified through --conf and those from the properties file /etc/spark2/conf/spark-defaults.conf:
    (spark.yarn.appMasterEnv.HADOOP_CREDSTORE_PASSWORD,*(redacted))
    (spark.authenticate,false)
    (spark.executorEnv.HADOOP_CREDSTORE_PASSWORD,*(redacted))
[GitHub] spark issue #15971: [SPARK-18535][UI][YARN] Redact sensitive information fro...
Github user markgrover commented on the issue: https://github.com/apache/spark/pull/15971

Ok, looks like all is good now!
[GitHub] spark issue #15971: [SPARK-18535][UI][YARN] Redact sensitive information fro...
Github user markgrover commented on the issue: https://github.com/apache/spark/pull/15971

Jenkins, retest this please.
[GitHub] spark issue #15971: [SPARK-18535][UI][YARN] Redact sensitive information fro...
Github user markgrover commented on the issue: https://github.com/apache/spark/pull/15971

Thanks for reviewing, Marcelo. Hmm, the failures are still unrelated :-(

```
[info] Exception encountered when attempting to run a suite with class name: org.apache.spark.sql.jdbc.JDBCWriteSuite *** ABORTED *** (1 second, 791 milliseconds)
[info]   org.h2.jdbc.JdbcSQLException: Schema "TEST" already exists; SQL statement:
[info]   create schema test [90078-183]
[info]   at org.h2.message.DbException.getJdbcSQLException(DbException.java:345)
[info]   at org.h2.message.DbException.get(DbException.java:179)
[info]   at org.h2.message.DbException.get(DbException.java:155)
[info]   at org.h2.command.ddl.CreateSchema.update(CreateSchema.java:48)
[info]   at org.h2.command.CommandContainer.update(CommandContainer.java:78)
[info]   at org.h2.command.Command.executeUpdate(Command.java:254)
[info]   at org.h2.jdbc.JdbcPreparedStatement.executeUpdateInternal(JdbcPreparedStatement.java:157)
[info]   at org.h2.jdbc.JdbcPreparedStatement.executeUpdate(JdbcPreparedStatement.java:143)
[info]   at org.apache.spark.sql.jdbc.JDBCWriteSuite$$anonfun$29.apply(JDBCWriteSuite.scala:56)
[info]   at org.apache.spark.sql.jdbc.JDBCWriteSuite$$anonfun$29.apply(JDBCWriteSuite.scala:53)
[info]   at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:195)
[info]   at org.apache.spark.sql.jdbc.JDBCWriteSuite.runTest(JDBCWriteSuite.scala:34)
[info]   at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
[info]   at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
[info]   at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
[info]   at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
[info]   at scala.collection.immutable.List.foreach(List.scala:381)
[info]   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
[info]   at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
[info]   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
[info]   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
[info]   at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
[info]   at org.scalatest.Suite$class.run(Suite.scala:1424)
[info]   at org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
[info]   at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
[info]   at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
[info]   at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
[info]   at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212)
[info]   at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:31)
[info]   at org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:257)
[info]   at org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:256)
[info]   at org.apache.spark.sql.jdbc.JDBCWriteSuite.org$scalatest$BeforeAndAfter$$super$run(JDBCWriteSuite.scala:34)
[info]   at org.scalatest.BeforeAndAfter$class.run(BeforeAndAfter.scala:241)
[info]   at org.apache.spark.sql.jdbc.JDBCWriteSuite.run(JDBCWriteSuite.scala:34)
[info]   at org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:357)
[info]   at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:502)
[info]   at sbt.ForkMain$Run$2.call(ForkMain.java:296)
[info]   at sbt.ForkMain$Run$2.call(ForkMain.java:286)
[info]   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
[info]   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
[info]   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
[info]   at java.lang.Thread.run(Thread.java:745)
```

I won't be available for the next few days. If it seems fit to you, I'd appreciate it if you could commit this after the test runs stabilize. Thanks!
[GitHub] spark pull request #15971: [SPARK-18535][UI][YARN] Redact sensitive informat...
Github user markgrover commented on a diff in the pull request: https://github.com/apache/spark/pull/15971#discussion_r89416437

--- Diff: core/src/main/scala/org/apache/spark/scheduler/EventLoggingListener.scala ---
@@ -231,6 +233,15 @@ private[spark] class EventLoggingListener(
     }
   }

+  private[spark] def redactEvent(
+      event: SparkListenerEnvironmentUpdate): SparkListenerEnvironmentUpdate = {
--- End diff --

Thanks, I'll go through the Spark style guide so I don't cause as much trouble next time. Thanks for reviewing.
[GitHub] spark issue #15971: [SPARK-18535][UI][YARN] Redact sensitive information fro...
Github user markgrover commented on the issue: https://github.com/apache/spark/pull/15971

Thanks for your review, @vanzin. I have incorporated all suggestions; I'd appreciate it if you could take another look. And the test failures in the last run seem unrelated.
[GitHub] spark pull request #15971: [SPARK-18535][UI][YARN] Redact sensitive informat...
Github user markgrover commented on a diff in the pull request: https://github.com/apache/spark/pull/15971#discussion_r89388400

--- Diff: core/src/test/scala/org/apache/spark/scheduler/EventLoggingListenerSuite.scala ---
@@ -95,6 +95,30 @@ class EventLoggingListenerSuite extends SparkFunSuite with LocalSparkContext wit
     }
   }

+  test("Event logging with password redaction") {
+    val secretPassword = "secret_password"
+    val conf = getLoggingConf(testDirPath, None).set("spark.executorEnv.HADOOP_CREDSTORE_PASSWORD",
+      secretPassword)
+    sc = new SparkContext("local-cluster[2,2,1024]", "test", conf)
--- End diff --

Ok, I have updated this test to be less bloated. Thanks!
[GitHub] spark pull request #15971: [SPARK-18535][UI][YARN] Redact sensitive informat...
Github user markgrover commented on a diff in the pull request: https://github.com/apache/spark/pull/15971#discussion_r89383168 --- Diff: core/src/main/scala/org/apache/spark/scheduler/EventLoggingListener.scala --- @@ -231,6 +233,19 @@ private[spark] class EventLoggingListener( } } + + private def redactEvent(event: SparkListenerEnvironmentUpdate): SparkListenerEnvironmentUpdate = { --- End diff -- > That's the only argument I can buy, and said user has different ways to get that information. True, `sparkConf.getAll` is always available. I still think it's better to put the redaction closer to the "sinks" for now. The good thing, though, is that if we see the number of "sinks" increasing, with everyone wanting redaction, we can move redaction further upstream. For now, two sinks seem manageable, and it's hard to guess whether future sinks are going to want redaction or not. So, unless you strongly object, I'd like to keep the redaction closer to the "sinks" for now. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15971: [SPARK-18535][UI][YARN] Redact sensitive informat...
Github user markgrover commented on a diff in the pull request: https://github.com/apache/spark/pull/15971#discussion_r89382000 --- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala --- @@ -2555,6 +2555,19 @@ private[spark] object Utils extends Logging { sparkJars.map(_.split(",")).map(_.filter(_.nonEmpty)).toSeq.flatten } } + + private[util] val REDACTION_REPLACEMENT_TEXT = "*(redacted)" + + def redact(conf: SparkConf, kvs: Seq[(String, String)]): Seq[(String, String)] = { +val redactionPattern = conf.get(SECRET_REDACTION_PATTERN).r +kvs.map { kv => + if (redactionPattern.findFirstIn(kv._1).isDefined) { +(kv._1, REDACTION_REPLACEMENT_TEXT) + } + else kv --- End diff -- That's much better, thanks. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
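For context, a minimal sketch of the shape `redact()` takes after this suggestion — a single pattern match over the key/value pairs, with the user-configurable regex compiled once per call. The object wrapper and the default regex string here are illustrative, not the committed code.
```scala
import org.apache.spark.SparkConf

object RedactionSketch {
  val REDACTION_REPLACEMENT_TEXT = "*(redacted)"

  def redact(conf: SparkConf, kvs: Seq[(String, String)]): Seq[(String, String)] = {
    // Compile the user-configurable pattern once, not once per key.
    val redactionPattern = conf.get("spark.redaction.regex", "(?i)secret|password").r
    kvs.map {
      // Any key the pattern matches gets its value replaced wholesale.
      case (key, _) if redactionPattern.findFirstIn(key).isDefined =>
        (key, REDACTION_REPLACEMENT_TEXT)
      case kv => kv
    }
  }
}
```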
[GitHub] spark pull request #15971: [SPARK-18535][UI][YARN] Redact sensitive informat...
Github user markgrover commented on a diff in the pull request: https://github.com/apache/spark/pull/15971#discussion_r89382120 --- Diff: core/src/main/scala/org/apache/spark/internal/config/package.scala --- @@ -223,4 +223,13 @@ package object config { " bigger files.") .longConf .createWithDefault(4 * 1024 * 1024) + + private[spark] val SECRET_REDACTION_PATTERN = +ConfigBuilder("spark.redaction.regex") + .doc("Regex to decide which Spark configuration properties and environment variables in " + +"driver and executor environments contain sensitive information. When this regex matches " + +"a property , its value is redacted from the environment UI and various logs like YARN " + +"and event logs") --- End diff -- Fixed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15971: [SPARK-18535][UI][YARN] Redact sensitive informat...
Github user markgrover commented on a diff in the pull request: https://github.com/apache/spark/pull/15971#discussion_r89382071 --- Diff: core/src/test/scala/org/apache/spark/util/UtilsSuite.scala --- @@ -974,4 +974,28 @@ class UtilsSuite extends SparkFunSuite with ResetSystemProperties with Logging { assert(pValue > threshold) } + + test("redact sensitive information") { +val sparkConf = new SparkConf + +// Set some secret keys +val secretKeys = Seq("" + + "spark.executorEnv.HADOOP_CREDSTORE_PASSWORD", + "spark.my.password", + "spark.my.sECreT") +secretKeys.foreach { key => + sparkConf.set(key, "secret_password") +} +// Set a non-secret key +sparkConf.set("spark.regular.property", "not_a_secret") + +// Redact sensitive information +val redactedConf = Utils.redact(sparkConf, sparkConf.getAll).toMap + +// Assert that secret information got redacted while the regular property remained the same +secretKeys.foreach { key => + assert(redactedConf.get(key).get == Utils.REDACTION_REPLACEMENT_TEXT) +} +assert(redactedConf.get("spark.regular.property").get == "not_a_secret") --- End diff -- Fixed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15971: [SPARK-18535][UI][YARN] Redact sensitive informat...
Github user markgrover commented on a diff in the pull request: https://github.com/apache/spark/pull/15971#discussion_r89382041 --- Diff: core/src/test/scala/org/apache/spark/util/UtilsSuite.scala --- @@ -974,4 +974,28 @@ class UtilsSuite extends SparkFunSuite with ResetSystemProperties with Logging { assert(pValue > threshold) } + + test("redact sensitive information") { +val sparkConf = new SparkConf + +// Set some secret keys +val secretKeys = Seq("" + --- End diff -- Me trying to format things in my editor led to this, apologies. Fixed it. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15971: [SPARK-18535][UI][YARN] Redact sensitive informat...
Github user markgrover commented on a diff in the pull request: https://github.com/apache/spark/pull/15971#discussion_r89382129 --- Diff: core/src/main/scala/org/apache/spark/scheduler/EventLoggingListener.scala --- @@ -231,6 +233,19 @@ private[spark] class EventLoggingListener( } } + --- End diff -- Fixed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15971: [SPARK-18535][UI][YARN] Redact sensitive informat...
Github user markgrover commented on a diff in the pull request: https://github.com/apache/spark/pull/15971#discussion_r89382059 --- Diff: core/src/test/scala/org/apache/spark/util/UtilsSuite.scala --- @@ -974,4 +974,28 @@ class UtilsSuite extends SparkFunSuite with ResetSystemProperties with Logging { assert(pValue > threshold) } + + test("redact sensitive information") { +val sparkConf = new SparkConf + +// Set some secret keys +val secretKeys = Seq("" + + "spark.executorEnv.HADOOP_CREDSTORE_PASSWORD", + "spark.my.password", + "spark.my.sECreT") +secretKeys.foreach { key => + sparkConf.set(key, "secret_password") +} +// Set a non-secret key +sparkConf.set("spark.regular.property", "not_a_secret") + +// Redact sensitive information +val redactedConf = Utils.redact(sparkConf, sparkConf.getAll).toMap + +// Assert that secret information got redacted while the regular property remained the same +secretKeys.foreach { key => + assert(redactedConf.get(key).get == Utils.REDACTION_REPLACEMENT_TEXT) --- End diff -- Fixed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15971: [SPARK-18535][UI][YARN] Redact sensitive information fro...
Github user markgrover commented on the issue: https://github.com/apache/spark/pull/15971 Thanks for your feedback @vanzin and @ajbozarth. I have updated the PR with your suggestions. The only one that's pending is @vanzin's suggestion of improving the EventLoggingListenerSuite test. Let me think about that one for a bit before I update it. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15971: [SPARK-18535][UI][YARN] Redact sensitive informat...
Github user markgrover commented on a diff in the pull request: https://github.com/apache/spark/pull/15971#discussion_r89240891 --- Diff: core/src/test/scala/org/apache/spark/util/UtilsSuite.scala --- @@ -974,4 +974,18 @@ class UtilsSuite extends SparkFunSuite with ResetSystemProperties with Logging { assert(pValue > threshold) } + + test("redact sensitive information") { +val sparkConf = new SparkConf +sparkConf.set("spark.executorEnv.HADOOP_CREDSTORE_PASSWORD", "secret_password") --- End diff -- Fixed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15971: [SPARK-18535][UI][YARN] Redact sensitive informat...
Github user markgrover commented on a diff in the pull request: https://github.com/apache/spark/pull/15971#discussion_r89240880 --- Diff: core/src/test/scala/org/apache/spark/scheduler/EventLoggingListenerSuite.scala --- @@ -95,6 +95,30 @@ class EventLoggingListenerSuite extends SparkFunSuite with LocalSparkContext wit } } + test("Event logging with password redaction") { +val secretPassword = "secret_password" +val conf = getLoggingConf(testDirPath, None).set("spark.executorEnv.HADOOP_CREDSTORE_PASSWORD", + secretPassword) +sc = new SparkContext("local-cluster[2,2,1024]", "test", conf) +assert(sc.eventLogger.isDefined) +val eventLogger = sc.eventLogger.get + +sc.parallelize(1 to 1).count() +sc.stop() + +val logData = EventLoggingListener.openEventLog(new Path(eventLogger.logPath), fileSystem) +val eventLog = Source.fromInputStream(logData).mkString +// Make sure nothing secret shows up anywhere +assert(!eventLog.contains(secretPassword), s"Secret password ($secretPassword) not redacted " + + s"from event logs:\n $eventLog") +val expected = """"spark.executorEnv.HADOOP_CREDSTORE_PASSWORD":"*(redacted)"""" +// Make sure every occurrence of the property is accompanied by a redaction text. +val regex = """"spark.executorEnv.HADOOP_CREDSTORE_PASSWORD":"([^"]*)"""".r +val matches = regex.findAllIn(eventLog) +assert(matches.nonEmpty) +matches.foreach(matched => assert(matched.equals(expected))) --- End diff -- Fixed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15971: [SPARK-18535][UI][YARN] Redact sensitive informat...
Github user markgrover commented on a diff in the pull request: https://github.com/apache/spark/pull/15971#discussion_r89240853 --- Diff: core/src/test/scala/org/apache/spark/scheduler/EventLoggingListenerSuite.scala --- @@ -95,6 +95,30 @@ class EventLoggingListenerSuite extends SparkFunSuite with LocalSparkContext wit } } + test("Event logging with password redaction") { +val secretPassword = "secret_password" +val conf = getLoggingConf(testDirPath, None).set("spark.executorEnv.HADOOP_CREDSTORE_PASSWORD", + secretPassword) +sc = new SparkContext("local-cluster[2,2,1024]", "test", conf) --- End diff -- Yeah, I wanted a little more than just a "unit" test. This was broader and checked for actual redaction taking place in the event logs, so I have it here. You have a valid point, though: if you think this is too expensive, the method in UtilsSuite.scala does a pretty good job of "unit testing" `redact()`, so I'd rather take this out completely. Thoughts? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
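For reference, the cheaper UtilsSuite-style check being alluded to is roughly the following — a sketch that assumes it runs inside the `org.apache.spark.util` test package (so the `private[util]` members are visible); it is not the committed test.
```scala
import org.apache.spark.SparkConf
import org.apache.spark.util.Utils

val sparkConf = new SparkConf()
  .set("spark.executorEnv.HADOOP_CREDSTORE_PASSWORD", "secret_password")
  .set("spark.regular.property", "not_a_secret")

val redacted = Utils.redact(sparkConf, sparkConf.getAll).toMap
// The secret is replaced; the ordinary property passes through untouched.
assert(redacted("spark.executorEnv.HADOOP_CREDSTORE_PASSWORD") == Utils.REDACTION_REPLACEMENT_TEXT)
assert(redacted("spark.regular.property") == "not_a_secret")
```
No SparkContext, no local-cluster startup, so it exercises `redact()` itself without the cost of the end-to-end event-log test above.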
[GitHub] spark pull request #15971: [SPARK-18535][UI][YARN] Redact sensitive informat...
Github user markgrover commented on a diff in the pull request: https://github.com/apache/spark/pull/15971#discussion_r89240682 --- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala --- @@ -2555,6 +2555,15 @@ private[spark] object Utils extends Logging { sparkJars.map(_.split(",")).map(_.filter(_.nonEmpty)).toSeq.flatten } } + + private[util] val REDACTION_REPLACEMENT_TEXT = "*(redacted)" + def redact(conf: SparkConf)(kv: (String, String)): (String, String) = { +val redactionPattern = conf.get(SECRET_REDACTION_PATTERN).r --- End diff -- What part do you think is expensive? Going through all the configuration properties and matching them with the regex? If so, I agree. However, that has to be done somewhere. All the callers of this function have a `SparkConf` that they want stuff redacted from. So, if this function accepts a list of tuples, they have to run the regex check to find that list first before sending it to `redact()`. So, overall, unless I am missing something, I don't think we can avoid the expense. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15971: [SPARK-18535][UI][YARN] Redact sensitive informat...
Github user markgrover commented on a diff in the pull request: https://github.com/apache/spark/pull/15971#discussion_r89240387 --- Diff: core/src/main/scala/org/apache/spark/scheduler/EventLoggingListener.scala --- @@ -231,6 +233,19 @@ private[spark] class EventLoggingListener( } } + + private def redactEvent(event: SparkListenerEnvironmentUpdate): SparkListenerEnvironmentUpdate = { --- End diff -- Good question. I thought about this quite a bit when making this change. I should have posted that decision in the PR description; apologies for not providing that context. The way I see it, there are three places where redaction can be done: 1. Right at the source of SparkListenerEnvironmentUpdate ([here, in SparkContext.scala](https://github.com/apache/spark/blob/39a1d30636857715247c82d551b200e1c331ad69/core/src/main/scala/org/apache/spark/SparkContext.scala#L2214)). 2. [In JsonProtocol.scala](https://github.com/apache/spark/blob/39a1d30636857715247c82d551b200e1c331ad69/core/src/main/scala/org/apache/spark/util/JsonProtocol.scala#L167) when converting the event to JSON. 3. In [EventLoggingListener.scala](https://github.com/apache/spark/blob/39a1d30636857715247c82d551b200e1c331ad69/core/src/main/scala/org/apache/spark/util/JsonProtocol.scala#L167) (where it is right now), when being persisted to disk. A user could write a custom listener that listened to the environment updates and did something useful with them. And, I didn't want to impose redaction on them. They could be using it to create a clone of their environment, for example, and may need the same sensitive properties. So, I ruled out 1. And, JsonProtocol seemed like a generic utility to convert events to JSON. While I could be selective about only redacting events of `SparkListenerEnvironmentUpdate` type, I didn't want to assume that everyone translating the environment update to JSON should only be seeing redacted configuration. So, that made me rule out 2. I decided that it was best to redact "closest to disk", which made me put the redaction code where I did - in EventLoggingListener. Hope that makes sense; happy to hear your thoughts if you think otherwise. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
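In code, option 3 comes down to something like the following method inside EventLoggingListener — a sketch of the sink-side approach, assuming the listener's `sparkConf` field and that only the "Spark Properties" section of the event carries configuration values; this is not the verbatim committed diff.
```scala
import org.apache.spark.scheduler.SparkListenerEnvironmentUpdate
import org.apache.spark.util.Utils

private def redactEvent(
    event: SparkListenerEnvironmentUpdate): SparkListenerEnvironmentUpdate = {
  // environmentDetails is a Map[String, Seq[(String, String)]]; only the
  // "Spark Properties" section carries configuration values, so redact there
  // and pass the other sections (JVM info, classpath entries) through.
  val redactedDetails = event.environmentDetails.map {
    case ("Spark Properties", props) => ("Spark Properties", Utils.redact(sparkConf, props))
    case other => other
  }
  SparkListenerEnvironmentUpdate(redactedDetails)
}
```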
[GitHub] spark pull request #15971: [SPARK-18535][UI][YARN] Redact sensitive informat...
Github user markgrover commented on a diff in the pull request: https://github.com/apache/spark/pull/15971#discussion_r89238304 --- Diff: core/src/main/scala/org/apache/spark/internal/config/package.scala --- @@ -223,4 +223,13 @@ package object config { " bigger files.") .longConf .createWithDefault(4 * 1024 * 1024) + + private[spark] val SECRET_REDACTION_PATTERN = +ConfigBuilder("spark.secret.redactionPattern") + .doc("Scala regex(case-sensitive) to decide which Spark configuration properties and " + +"environment variables in driver and executor environments contain sensitive information." + +" When this regex matches the property or environment variable name, its value is " + +"redacted from the environment UI and various logs like YARN and event logs") + .stringConf + .createWithDefault("secret|password|SECRET|PASSWORD") --- End diff -- Fixed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15971: [SPARK-18535][UI][YARN] Redact sensitive informat...
Github user markgrover commented on a diff in the pull request: https://github.com/apache/spark/pull/15971#discussion_r89238293 --- Diff: core/src/main/scala/org/apache/spark/internal/config/package.scala --- @@ -223,4 +223,13 @@ package object config { " bigger files.") .longConf .createWithDefault(4 * 1024 * 1024) + + private[spark] val SECRET_REDACTION_PATTERN = +ConfigBuilder("spark.secret.redactionPattern") + .doc("Scala regex(case-sensitive) to decide which Spark configuration properties and " + +"environment variables in driver and executor environments contain sensitive information." + +" When this regex matches the property or environment variable name, its value is " + --- End diff -- Fixed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15971: [SPARK-18535][UI][YARN] Redact sensitive informat...
Github user markgrover commented on a diff in the pull request: https://github.com/apache/spark/pull/15971#discussion_r89238264 --- Diff: core/src/main/scala/org/apache/spark/internal/config/package.scala --- @@ -223,4 +223,13 @@ package object config { " bigger files.") .longConf .createWithDefault(4 * 1024 * 1024) + + private[spark] val SECRET_REDACTION_PATTERN = +ConfigBuilder("spark.secret.redactionPattern") --- End diff -- Good point! I think it's good for the actual regex not to be redacted. So, I will rename the property to be clearer anyway (and not have secret in the name) to `spark.redaction.regex`. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
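With the rename, opting extra key patterns into redaction is just a matter of widening the regex — for example (the extra terms here are illustrative, not defaults):
```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  // Also treat keys mentioning tokens or credentials as sensitive,
  // on top of the usual secret/password matching.
  .set("spark.redaction.regex", "(?i)secret|password|token|credential")
```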
[GitHub] spark issue #15971: [SPARK-18535][UI][YARN] Redact sensitive information fro...
Github user markgrover commented on the issue: https://github.com/apache/spark/pull/15971 > Does this protect against doing sc.getConf.getAll.foreach(println) in spark shell ? No, it doesn't. The goal of the JIRA is the first step - to stop printing sensitive information all over the place. If a user has access to spark-shell (say, in a cluster secured through Kerberos), it may be reasonable for them to see the sensitive information. This, however, prevents anyone from going to the unauthenticated Spark UI and walking away with all the S3 credentials. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15971: [SPARK-18535][UI][YARN] Redact sensitive informat...
GitHub user markgrover opened a pull request: https://github.com/apache/spark/pull/15971 [SPARK-18535][UI][YARN] Redact sensitive information from Spark logs and UI ## What changes were proposed in this pull request? This patch adds a new property called `spark.secret.redactionPattern` that allows users to specify a Scala regex to decide which Spark configuration properties and environment variables in driver and executor environments contain sensitive information. When this regex matches the property or environment variable name, its value is redacted from the environment UI and various logs like YARN and event logs. This change uses this property to redact information from event logs and YARN logs. It also updates the UI code to adhere to this property instead of hardcoding the logic for deciding which properties are sensitive. Here's an image of the UI post-redaction: ![image](https://cloud.githubusercontent.com/assets/1709451/20506215/4cc30654-b007-11e6-8aee-4cde253fba2f.png) Here's the text in the YARN logs, post-redaction: ``HADOOP_CREDSTORE_PASSWORD -> *(redacted)`` Here's the text in the event logs, post-redaction: ``...,"spark.executorEnv.HADOOP_CREDSTORE_PASSWORD":"*(redacted)","spark.yarn.appMasterEnv.HADOOP_CREDSTORE_PASSWORD":"*(redacted)",...`` ## How was this patch tested? 1. Unit tests were added to ensure that redaction works. 2. A YARN job was run that read data off of S3, with confidential information (the Hadoop credential provider password) provided in the environment variables of the driver and executors. Afterwards, the logs were grepped to make sure that no mention of the secret password was present. It was also verified that the job was able to read the data off of S3 correctly, ensuring that the sensitive information still trickled down to the right places. 3. The event logs were checked to make sure no mention of the secret password was present. 4. The UI environment tab was checked to make sure no secret information was being displayed. You can merge this pull request into a Git repository by running: $ git pull https://github.com/markgrover/spark master_redaction Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/15971.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #15971 commit 5dd3630d6a937ba8634054940543509d9186e68e Author: Mark Grover Date: 2016-11-17T01:47:17Z [SPARK-18535][UI][YARN] Redact sensitive information from Spark logs and UI This commit adds a new property called `spark.secret.redactionPattern` that allows users to specify a Scala regex to decide which Spark configuration properties and environment variables in driver and executor environments contain sensitive information. When this regex matches the property or environment variable name, its value is redacted from the environment UI and various logs like YARN and event logs. This change uses this property to redact information from event logs and YARN logs. It also updates the UI code to adhere to this property instead of hardcoding the logic for deciding which properties are sensitive. For testing: 1. Unit tests were added to ensure that redaction works. 2. A YARN job was run that read data off of S3, with confidential information (the Hadoop credential provider password) provided in the environment variables of the driver and executors. Afterwards, the logs were grepped to make sure that no mention of the secret password was present.
It was also verified that the job was able to read the data off of S3 correctly, ensuring that the sensitive information still trickled down to the right places. 3. The event logs were checked to make sure no mention of the secret password was present. 4. The UI environment tab was checked to make sure no secret information was being displayed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15623: [SPARK-18093][SQL] Fix default value test in SQLConfSuite...
Github user markgrover commented on the issue: https://github.com/apache/spark/pull/15623 Thanks @srowen, indeed, it could be stripping the trailing slash too. I am ok either way too. Let me know if you have a strong preference for the other one; otherwise, this should be ok. Thanks again! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15623: SPARK-18093][SQL] Fix default value test in SQLCo...
GitHub user markgrover opened a pull request: https://github.com/apache/spark/pull/15623 [SPARK-18093][SQL] Fix default value test in SQLConfSuite to work regardless of warehouse dir's existence ## What changes were proposed in this pull request? Append a trailing slash, if there isn't one already, for the sake of comparing the two paths. It doesn't take away from the essence of the check, but removes any potential mismatch due to a missing trailing slash. ## How was this patch tested? Ran unit tests and they passed. You can merge this pull request into a Git repository by running: $ git pull https://github.com/markgrover/spark spark-18093 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/15623.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #15623 commit dea1f337debbd2d0bb2488117f80ec5ded4f76a5 Author: Mark Grover Date: 2016-10-25T13:31:47Z [SPARK-18093][SQL] Fix default value test in SQLConfSuite to work regardless of warehouse dir's existence Append a trailing slash, if there isn't one already, for the sake of comparing the two paths. It doesn't take away from the essence of the check, but removes any potential mismatch due to a missing trailing slash. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
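The normalization being described is tiny; conceptually it is just the following (a sketch with a hypothetical helper name, not the committed test code):
```scala
// Append a trailing slash if one is missing, so "/tmp/warehouse" and
// "/tmp/warehouse/" compare equal regardless of how each side was produced.
def withTrailingSlash(path: String): String =
  if (path.endsWith("/")) path else path + "/"

assert(withTrailingSlash("/tmp/warehouse") == withTrailingSlash("/tmp/warehouse/"))
```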
[GitHub] spark pull request #15572: [DOCS] Update docs to not suggest to package Spar...
GitHub user markgrover opened a pull request: https://github.com/apache/spark/pull/15572 [DOCS] Update docs to not suggest to package Spark before running tests. ## What changes were proposed in this pull request? Update the docs to not suggest packaging Spark before running tests. ## How was this patch tested? Not creating a JIRA since this is pretty small. We haven't had the need to run mvn package before mvn test since 1.6 at least, or so I am told. So, updating the docs to not be misleading. You can merge this pull request into a Git repository by running: $ git pull https://github.com/markgrover/spark doc_update Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/15572.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #15572 commit 858e50d6199dcacd313c578904b4d3d45c193088 Author: Mark Grover Date: 2016-10-20T21:57:34Z [DOCS] Update docs to not suggest to package Spark before running tests. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14270: [SPARK-5847][CORE] Allow for configuring MetricsSystem's...
Github user markgrover commented on the issue: https://github.com/apache/spark/pull/14270 Sure. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14270: [SPARK-5847][CORE] Allow for configuring MetricsSystem's...
Github user markgrover commented on the issue: https://github.com/apache/spark/pull/14270 Thanks a lot, @vanzin! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14270: [SPARK-5847][CORE] Allow for configuring MetricsSystem's...
Github user markgrover commented on the issue: https://github.com/apache/spark/pull/14270 Fixed the nits, resolved the merge conflict. Shortened some test names. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14270: [SPARK-5847][CORE] Allow for configuring MetricsS...
Github user markgrover commented on a diff in the pull request: https://github.com/apache/spark/pull/14270#discussion_r72357797 --- Diff: core/src/main/scala/org/apache/spark/metrics/MetricsSystem.scala --- @@ -125,19 +126,26 @@ private[spark] class MetricsSystem private ( * application, executor/driver and metric source. */ private[spark] def buildRegistryName(source: Source): String = { -val appId = conf.getOption("spark.app.id") +val metricsNamespace = conf.get(METRICS_NAMESPACE).map(Some(_)) --- End diff -- Good point, thanks. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
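Presumably the point here is that `conf.get` on an optional config entry already returns an `Option[String]`, making the `.map(Some(_))` wrapping redundant; in the context of the diff above, the cleaner shape would be something like this (a sketch under that assumption, not the committed code):
```scala
// conf.get(METRICS_NAMESPACE) is already an Option[String] for an optional
// entry, so fall back to spark.app.id without any extra wrapping.
val metricsNamespace: Option[String] =
  conf.get(METRICS_NAMESPACE).orElse(conf.getOption("spark.app.id"))
```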
[GitHub] spark pull request #14270: [SPARK-5847][CORE] Allow for configuring MetricsS...
Github user markgrover commented on a diff in the pull request: https://github.com/apache/spark/pull/14270#discussion_r72357828 --- Diff: core/src/test/scala/org/apache/spark/metrics/MetricsSystemSuite.scala --- @@ -183,4 +184,89 @@ class MetricsSystemSuite extends SparkFunSuite with BeforeAndAfter with PrivateM assert(metricName != s"$appId.$executorId.${source.sourceName}") assert(metricName === source.sourceName) } + + test("MetricsSystem with Executor instance, with custom namespace") { +val source = new Source { + override val sourceName = "dummySource" + override val metricRegistry = new MetricRegistry() +} + +val appId = "testId" +val appName = "testName" +val executorId = "1" +conf.set("spark.app.id", appId) +conf.set("spark.app.name", appName) +conf.set("spark.executor.id", executorId) +conf.set(METRICS_NAMESPACE, "${spark.app.name}") + +val instanceName = "executor" +val driverMetricsSystem = MetricsSystem.createMetricsSystem(instanceName, conf, securityMgr) + +val metricName = driverMetricsSystem.buildRegistryName(source) +assert(metricName === s"$appName.$executorId.${source.sourceName}") + } + + test("MetricsSystem with Executor instance and custom namespace which is not set") { +val source = new Source { + override val sourceName = "dummySource" + override val metricRegistry = new MetricRegistry() +} + +val executorId = "1" +val namespaceToResolve = "${spark.doesnotexist}" +conf.set("spark.executor.id", executorId) +conf.set(METRICS_NAMESPACE, namespaceToResolve) + +val instanceName = "executor" +val driverMetricsSystem = MetricsSystem.createMetricsSystem(instanceName, conf, securityMgr) + +val metricName = driverMetricsSystem.buildRegistryName(source) +// If the user set the spark.metrics.namespace property to an expansion of another property +// (say ${spark.doesnotexist}, the unresolved name (i.e. litterally ${spark.doesnot}) --- End diff -- Appreciate your thoroughness! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14270: [SPARK-5847][CORE] Allow for configuring MetricsS...
Github user markgrover commented on a diff in the pull request: https://github.com/apache/spark/pull/14270#discussion_r72307684 --- Diff: core/src/main/scala/org/apache/spark/internal/config/package.scala --- @@ -103,4 +103,9 @@ package object config { .stringConf .checkValues(Set("hive", "in-memory")) .createWithDefault("in-memory") + + // This property sets the root namespace for metrics reporting + private[spark] val METRICS_NAMESPACE = ConfigBuilder("spark.metrics.namespace") +.stringConf +.createOptional --- End diff -- Thanks, but I think that changes the semantics in cases where `spark.app.id` is not defined. Currently, the code [in this PR](https://github.com/apache/spark/pull/14270/files#diff-7ea2624e832b166ca27cd4baca8691d9R129) creates an Option with value `None` if `spark.app.id` is not defined. With the proposed change, it will create an Option with the value literal `${spark.app.id}` if `spark.app.id` is not defined. And, that changes the existing behavior. Even though in practice, I believe, `spark.app.id` is always defined, there are some tests in MetricsSystemSuite.scala that test the scenario when `spark.app.id` is not defined. And, I am hesitant to change that behavior. I suppose I could use `createWithDefault()` in package.scala for `METRICS_NAMESPACE`, but then I'd have to check if there are any unresolved `spark.app.id` references in the value, and that'd be too big of a pain for not much benefit. So, I'd prefer to keep this the way it is. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
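For contrast, the two ConfigBuilder shapes under discussion differ exactly in what an unset key yields — a sketch of the semantics as described in the comment above, not committed code (the `_ALT` name is illustrative):
```scala
// Shape kept in this PR: an unset key stays absent.
private[spark] val METRICS_NAMESPACE = ConfigBuilder("spark.metrics.namespace")
  .stringConf
  .createOptional  // conf.get(METRICS_NAMESPACE) == None when unset

// Proposed alternative: an unset key yields the default string, and if
// spark.app.id is itself undefined, the literal "${spark.app.id}" leaks through.
private[spark] val METRICS_NAMESPACE_ALT = ConfigBuilder("spark.metrics.namespace")
  .stringConf
  .createWithDefault("${spark.app.id}")
```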
[GitHub] spark issue #14270: [SPARK-5847][CORE] Allow for configuring MetricsSystem's...
Github user markgrover commented on the issue: https://github.com/apache/spark/pull/14270 Ok, I have pushed changes to use the expansion capabilities brought in by SPARK-16272. Overall, I think it was a very good call to use that, so thanks for the suggestions! Would appreciate a review. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14270: [SPARK-5847][CORE] Allow for configuring MetricsSystem's...
Github user markgrover commented on the issue: https://github.com/apache/spark/pull/14270 ok, thanks, I will rely on changes in #14022 (SPARK-16272), now that it's been committed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14270: [SPARK-5847][CORE] Allow for configuring MetricsSystem's...
Github user markgrover commented on the issue: https://github.com/apache/spark/pull/14270 Tests passed! I'd really appreciate a review! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14270: [SPARK-5847][CORE] Allow for configuring MetricsSystem's...
Github user markgrover commented on the issue: https://github.com/apache/spark/pull/14270 A few design choices I made along the way are: 1. Use a `SparkConf` property to control the namespace instead of a `MetricsConfig` property. The reason is that metrics config properties mean something particular and follow a particular pattern (see [here](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/metrics/MetricsSystem.scala#L54), for example). And, regardless of what we name this property, it doesn't follow that paradigm - the first part of the property name before the first dot doesn't represent an instance (like master, worker, executor, etc.) like it does for legit metrics properties. Also, the properties defined in `MetricsConfig` are obtained using the `getInstance()` call, where the instance is passed; again, such a retrieval doesn't really apply to this configuration property since it doesn't belong to an instance. I understand it's one more property in `SparkConf`, but I really feel it belongs there, right by `spark.metrics.conf`. 2. Based on the previous point, this property shouldn't fall under the `spark.metrics.conf.` prefix. This is because if it does, it's mistakenly understood as a legit metrics property (see code [here](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/metrics/MetricsConfig.scala#L54), for example). 3. The proposed `spark.metrics.namespace` property allows one to specify an arbitrary property to be used as the namespace. I did consider whitelisting it so users can use `spark.app.name` or `spark.app.id` only, but then we'd have to deal with magic strings (app.name|app.id), and I didn't really feel inclined to do that. And, @ryan-williams, who took a stab at the same issue in #4632, made the same call, and I agree with him. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
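Concretely, with the expansion support from SPARK-16272 the common use case becomes a one-liner — keying metrics on the stable app name rather than the per-run app ID (the app name here is illustrative):
```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("nightly-etl")
  // "${spark.app.name}" expands to "nightly-etl", so driver and executor
  // metrics share one namespace across every run of this application.
  .set("spark.metrics.namespace", "${spark.app.name}")
```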
[GitHub] spark issue #14223: [SPARK-10614][CORE] Change SystemClock to derive time fr...
Github user markgrover commented on the issue: https://github.com/apache/spark/pull/14223 And, thanks for your feedback, @rxin, I will take a look. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14270: [SPARK-5847][CORE] Allow for configuring MetricsS...
GitHub user markgrover opened a pull request: https://github.com/apache/spark/pull/14270 [SPARK-5847][CORE] Allow for configuring MetricsSystem's use of app ID to namespace all metrics ## What changes were proposed in this pull request? Adding a new property to SparkConf called spark.metrics.namespace that allows users to set a custom namespace for executor and driver metrics in the metrics systems. By default, the root namespace used for driver or executor metrics is the value of `spark.app.id`. However, oftentimes, users want to be able to track the metrics across apps for driver and executor metrics, which is hard to do with the application ID (i.e. `spark.app.id`) since it changes with every invocation of the app. For such use cases, users can set the `spark.metrics.namespace` property to another Spark configuration key like `spark.app.name`, which is then used to populate the root namespace of the metrics system (with the app name in our example). The `spark.metrics.namespace` property can be set to any arbitrary Spark property key, whose value would be used to set the root namespace of the metrics system. Non-driver and non-executor metrics are never prefixed with `spark.app.id`, nor does the `spark.metrics.namespace` property have any such effect on such metrics. ## How was this patch tested? Added new unit tests, modified existing unit tests. You can merge this pull request into a Git repository by running: $ git pull https://github.com/markgrover/spark spark-5847 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/14270.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #14270 commit 7d023c820cae3b970e11b2278f6c2b3f3751cc3b Author: Mark Grover Date: 2016-07-19T05:38:31Z [SPARK-5847][CORE] Allow for configuring MetricsSystem's use of app ID to namespace all metrics Adding a new property to SparkConf called spark.metrics.namespace that allows users to set a custom namespace for executor and driver metrics in the metrics systems. By default, the root namespace used for driver or executor metrics is the value of `spark.app.id`. However, oftentimes, users want to be able to track the metrics across apps for driver and executor metrics, which is hard to do with the application ID (i.e. `spark.app.id`) since it changes with every invocation of the app. For such use cases, users can set the `spark.metrics.namespace` property to another Spark configuration key like `spark.app.name`, which is then used to populate the root namespace of the metrics system (with the app name in our example). The `spark.metrics.namespace` property can be set to any arbitrary Spark property key, whose value would be used to set the root namespace of the metrics system. Non-driver and non-executor metrics are never prefixed with `spark.app.id`, nor does the `spark.metrics.namespace` property have any such effect on such metrics. commit 605e690eefe905fa52979dde16f04bd208c8b219 Author: Mark Grover Date: 2016-07-19T19:44:55Z Minor fixes based on self-reviewing --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. 
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14223: [SPARK-10614][CORE] Change SystemClock to derive time fr...
Github user markgrover commented on the issue: https://github.com/apache/spark/pull/14223 Thanks, @vanzin. Yeah, I noticed that too. Will take a look. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14223: [SPARK-10614][CORE] Change SystemClock to derive time fr...
Github user markgrover commented on the issue: https://github.com/apache/spark/pull/14223 The unit test failures from the previous run looked unrelated, fyi. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14223: [SPARK-10614][CORE] SystemClock uses non-monotoni...
Github user markgrover commented on a diff in the pull request: https://github.com/apache/spark/pull/14223#discussion_r71030596 --- Diff: core/src/main/scala/org/apache/spark/util/Clock.scala --- @@ -33,18 +40,18 @@ private[spark] class SystemClock extends Clock { val minPollTime = 25L /** - * @return the same time (milliseconds since the epoch) - * as is reported by `System.currentTimeMillis()` + * @return the same time in milliseconds as is derived from `System.nanoTime()` */ - def getTimeMillis(): Long = System.currentTimeMillis() + def getTimeMillis(): Long = System.nanoTime() / (1000 * 1000) --- End diff -- Will do --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14223: [SPARK-10614][CORE] SystemClock uses non-monotoni...
Github user markgrover commented on a diff in the pull request: https://github.com/apache/spark/pull/14223#discussion_r71030586 --- Diff: core/src/main/scala/org/apache/spark/util/Clock.scala --- @@ -21,7 +21,14 @@ package org.apache.spark.util * An interface to represent clocks, so that they can be mocked out in unit tests. */ private[spark] trait Clock { + /** @return the time in milliseconds */ def getTimeMillis(): Long + + /** + * Wait until a clock's view of current time reaches the specified target time. + * @param targetTime block until the current time is at least this value --- End diff -- Will do --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14223: [SPARK-10614][CORE] SystemClock uses non-monotoni...
GitHub user markgrover opened a pull request: https://github.com/apache/spark/pull/14223 [SPARK-10614][CORE] SystemClock uses non-monotonic time ## What changes were proposed in this pull request? Change SystemClock to derive time from System.nanoTime() instead of System.currentTimeMillis(). This is because using System.currentTimeMillis() makes the waitTillTime() routine brittle on systems like VMs, where time can go forwards and backwards. Shout out to @steveloughran for doing a good chunk of research and a previous PR (#8766) on this in 2015. ## How was this patch tested? The build is green; I am running the unit tests and they are looking green, but they are also taking a while, so I am filing this PR a little proactively. You can merge this pull request into a Git repository by running: $ git pull https://github.com/markgrover/spark SPARK-10614 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/14223.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #14223 commit a475cd82455e892b23efe2e8fad0dade2ec8b043 Author: Mark Grover Date: 2016-07-14T23:42:39Z [SPARK-10614][CORE] SystemClock uses non-monotonic time in its wait logic --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
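A compact sketch of the core idea — deriving both reads and waits from the monotonic System.nanoTime() source. The class name is illustrative and the polling logic is simplified; see the PR diff for the real Clock trait and SystemClock.
```scala
import java.util.concurrent.TimeUnit

class MonotonicSystemClock {
  // Derived from System.nanoTime(), which never jumps backwards the way
  // System.currentTimeMillis() can when a VM's wall clock is adjusted.
  def getTimeMillis(): Long = TimeUnit.NANOSECONDS.toMillis(System.nanoTime())

  // Block until this clock's view of the current time reaches targetTime (ms),
  // sleeping up to 25 ms at a time (the 25L floor borrowed from the
  // original SystemClock's minPollTime; its polling logic is more elaborate).
  def waitTillTime(targetTime: Long): Long = {
    var currentTime = getTimeMillis()
    while (currentTime < targetTime) {
      Thread.sleep(math.min(targetTime - currentTime, 25L))
      currentTime = getTimeMillis()
    }
    currentTime
  }
}
```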
[GitHub] spark pull request #10953: [SPARK-12177] [STREAMING] Update KafkaDStreams to...
Github user markgrover closed the pull request at: https://github.com/apache/spark/pull/10953 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #10953: [SPARK-12177] [STREAMING] Update KafkaDStreams to new Ka...
Github user markgrover commented on the issue: https://github.com/apache/spark/pull/10953 Yeah, I agree with @koeninger. This PR is pretty out of date; it makes sense to shift focus to Cody's PR #11863 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13382][DOCS][PYSPARK] Update pyspark te...
Github user markgrover commented on the pull request: https://github.com/apache/spark/pull/11278#issuecomment-218229224 Would love to help, but I am not a committer:-) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13670][launcher] Propagate error from l...
Github user markgrover commented on the pull request: https://github.com/apache/spark/pull/12910#issuecomment-218020176 LGTM. I had tested this on Mac a while back (on 03/22) and it worked for me. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-9384] [core] Easier setting of executor...
Github user markgrover commented on the pull request: https://github.com/apache/spark/pull/7739#issuecomment-217517817

Thanks for bringing this up, Sean. Yeah, I'd like to see this go in too. I realize there are some merge conflicts here. I will make the semantics clear in the documentation (they are already there; I will make sure they are clear enough). So, if someone is strongly against this, let me know. Otherwise, I will update the code early next week.
[GitHub] spark pull request: [SPARK-14731][shuffle]Revert SPARK-12130 to ma...
Github user markgrover commented on the pull request: https://github.com/apache/spark/pull/12568#issuecomment-213518615

LGTM. +1 (non-binding)
[GitHub] spark pull request: [SPARK-14731][shuffle]Revert SPARK-12130 to ma...
Github user markgrover commented on a diff in the pull request: https://github.com/apache/spark/pull/12568#discussion_r60772476

--- Diff: common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleBlockResolver.java ---
@@ -183,12 +190,7 @@ public ManagedBuffer getBlockData(String appId, String execId, String blockId) {
        String.format("Executor is not registered (appId=%s, execId=%s)", appId, execId));
    }
-    if ("sort".equals(executor.shuffleManager) || "tungsten-sort".equals(executor.shuffleManager)) {
-      return getSortBasedShuffleBlockData(executor, shuffleId, mapId, reduceId);
-    } else {
-      throw new UnsupportedOperationException(
-        "Unsupported shuffle manager: " + executor.shuffleManager);
-    }
+    return getSortBasedShuffleBlockData(executor, shuffleId, mapId, reduceId);
--- End diff --

Makes sense.
[GitHub] spark pull request: [SPARK-14731][shuffle]Revert SPARK-12130 to ma...
Github user markgrover commented on a diff in the pull request: https://github.com/apache/spark/pull/12568#discussion_r60624962

--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala ---
@@ -182,7 +182,7 @@ private[spark] class BlockManager(
    val shuffleConfig = new ExecutorShuffleInfo(
      diskBlockManager.localDirs.map(_.toString),
      diskBlockManager.subDirsPerLocalDir,
-     shuffleManager.shortName)
--- End diff --

Yeah, I'm pretty sure it's not.
[GitHub] spark pull request: [SPARK-14731][shuffle]Revert SPARK-12130 to ma...
Github user markgrover commented on a diff in the pull request: https://github.com/apache/spark/pull/12568#discussion_r60613929

--- Diff: common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleBlockResolver.java ---
@@ -183,12 +190,7 @@ public ManagedBuffer getBlockData(String appId, String execId, String blockId) {
        String.format("Executor is not registered (appId=%s, execId=%s)", appId, execId));
    }
-    if ("sort".equals(executor.shuffleManager) || "tungsten-sort".equals(executor.shuffleManager)) {
-      return getSortBasedShuffleBlockData(executor, shuffleId, mapId, reduceId);
-    } else {
-      throw new UnsupportedOperationException(
-        "Unsupported shuffle manager: " + executor.shuffleManager);
-    }
+    return getSortBasedShuffleBlockData(executor, shuffleId, mapId, reduceId);
--- End diff --

Personally, I'd prefer an assert or similar on the shuffleManager here. What do you think?
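For illustration, a self-contained, hypothetical sketch of what "an assert or similar" could look like; validateShuffleManager is a made-up helper, not code from this PR:

```java
// Hypothetical illustration of the assert suggested above; not the code
// that was eventually merged. Note that asserts only fire when the JVM
// runs with -ea, so a hard IllegalArgumentException may be safer here.
final class ShuffleManagerCheck {

  static void validateShuffleManager(String shuffleManager) {
    assert "sort".equals(shuffleManager) || "tungsten-sort".equals(shuffleManager)
        : "Unsupported shuffle manager: " + shuffleManager;
  }

  public static void main(String[] args) {
    validateShuffleManager("sort");          // passes
    validateShuffleManager("tungsten-sort"); // passes
    validateShuffleManager("hash");          // AssertionError under -ea
  }
}
```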
[GitHub] spark pull request: [SPARK-14731][shuffle]Revert SPARK-12130 to ma...
Github user markgrover commented on the pull request: https://github.com/apache/spark/pull/12568#issuecomment-213000250

Thanks for picking this up, @lianhuiwang. I didn't get a chance to work on this yesterday, so I am very glad you picked it up. Left some comments.
[GitHub] spark pull request: [SPARK-14731][shuffle]Revert SPARK-12130 to ma...
Github user markgrover commented on a diff in the pull request: https://github.com/apache/spark/pull/12568#discussion_r60613362

--- Diff: common/network-shuffle/src/test/java/org/apache/spark/network/shuffle/ExternalShuffleIntegrationSuite.java ---
@@ -186,7 +186,12 @@ public void testFetchThreeSort() throws Exception {
  @Test
  public void testFetchInvalidShuffle() throws Exception {
-    registerExecutor("exec-1", dataContext0.createExecutorInfo("unknown sort manager"));
+    try {
+      registerExecutor("exec-1", dataContext0.createExecutorInfo("unknown sort manager"));
+      fail("unknown sort manager ");
--- End diff --

nit: trailing whitespace
[GitHub] spark pull request: [SPARK-14731][shuffle]Revert SPARK-12130 to ma...
Github user markgrover commented on a diff in the pull request: https://github.com/apache/spark/pull/12568#discussion_r60613340

--- Diff: common/network-shuffle/src/test/java/org/apache/spark/network/shuffle/ExternalShuffleIntegrationSuite.java ---
@@ -186,7 +186,12 @@ public void testFetchThreeSort() throws Exception {
  @Test
  public void testFetchInvalidShuffle() throws Exception {
-    registerExecutor("exec-1", dataContext0.createExecutorInfo("unknown sort manager"));
+    try {
--- End diff --

I don't think this test makes much sense as written any more, since registerExecutor throws an exception. Personally, I'd prefer if we stopped checking for blocks being empty and instead asserted that the correct type of exception, with the correct message, is being thrown.
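For illustration, a hypothetical JUnit 4 sketch of that suggestion, assuming the suite's existing registerExecutor and dataContext0 helpers from the diff; the expected exception type and message are assumptions, not the code that was merged:

```java
// Hypothetical rewrite of testFetchInvalidShuffle along the lines suggested
// above: assert that registration itself fails with the expected exception,
// rather than fetching blocks and checking that they come back empty.
// Assumes JUnit 4 static imports (org.junit.Assert.assertTrue, fail).
@Test
public void testRegisterInvalidShuffleManager() throws Exception {
  try {
    registerExecutor("exec-1", dataContext0.createExecutorInfo("unknown sort manager"));
    fail("Expected registration with an unknown shuffle manager to fail");
  } catch (UnsupportedOperationException e) {
    assertTrue(e.getMessage().contains("Unsupported shuffle manager"));
  }
}
```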
[GitHub] spark pull request: [SPARK-12177][Streaming][Kafka] Update KafkaDS...
Github user markgrover commented on the pull request: https://github.com/apache/spark/pull/11863#issuecomment-212170020

Overall, I agree with the approach. I'll take a look at the kafka-exactly-once repo; I don't think it will hurt to add more tests. As far as KafkaUtils goes, I am ok with getting rid of it for Scala/Java users, but there are a bunch of Python utility functions, the equivalent of which will likely be needed if/when we add Python support. But we could just add it later if users ask for it, or if we find it's the best place for Python utility functions to go.