[GitHub] spark pull request: [SPARK-3967] don’t redundantly overwrite exe...
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/2848#issuecomment-67601524 This looks good to me, so I'm ready to merge it. One final little thing, though: since we have two separate PRs that address SPARK-3967, I'd like to create a new JIRA for this fix just in case this doesn't fix the YARN issue in the original ticket. Therefore, I've opened https://issues.apache.org/jira/browse/SPARK-4896. Do you mind editing this PR's title to reference that JIRA instead? Once you do that, I'll merge this into all of the maintenance branches and resolve the original JIRA. Thanks for your patience during this long review and with the build difficulties. I think that the end result here is a significant improvement over the old `fetchFile` code, so I'm looking forward to merging this. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3967] don’t redundantly overwrite exe...
Github user ryan-williams commented on the pull request: https://github.com/apache/spark/pull/2848#issuecomment-67253690 Thanks a lot for the build-process brain-dump, @JoshRosen; some of that is way better than processes I've wasted a lot of time on. I'm interested in getting all of these tips and tricks pushed to Spark docs, but even after all of the emailing your response includes tricks I've not seen documented anywhere (e.g. `SPARK_PREPEND_CLASSES`), so I'm leery of spending time trying to document what I've learned for fear that I'm still not at the bottom of the rabbit hole. In any case, I was able to debug and (I believe) fix the test failure, which was a result of having inadvertently changed a file copy to a move, so it's a good thing the test caught it. This should be ready to re-test. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3967] don’t redundantly overwrite exe...
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/2848#issuecomment-67257189 Jenkins, retest this please. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3967] don’t redundantly overwrite exe...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2848#issuecomment-67257529 [Test build #24518 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24518/consoleFull) for PR 2848 at commit [`c14daff`](https://github.com/apache/spark/commit/c14daffb3051ec53921cc9e5532d95c89a433c3b). * This patch merges cleanly. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3967] don’t redundantly overwrite exe...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2848#issuecomment-67264665 [Test build #24518 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24518/consoleFull) for PR 2848 at commit [`c14daff`](https://github.com/apache/spark/commit/c14daffb3051ec53921cc9e5532d95c89a433c3b). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3967] don’t redundantly overwrite exe...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2848#issuecomment-67264671 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24518/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3967] don’t redundantly overwrite exe...
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/2848#issuecomment-67106447 Disclaimer: for iterative debugging, I use `sbt` to build Spark, not Maven. Spark 1.2.0 has docs on building with SBT. If possible, I'd switch to using that workflow. The issue here is probably that you're running a full `mvn clean` and starting over from scratch after each change. I'd like to help move this PR along, so the following is going to be some interactive logs to see whether I can quickly iterate on this using Maven. Let's say that I'm starting from a completely cold build (but with Maven dependencies already downloaded): ```bash # Since I have zinc installed, I'll use it: zinc -start git checkout /a/branch/with/your/pr/code # Here, the -T C1 says build in parallel with one thread per core: time mvn -T C1 clean package -DskipTests ``` This didn't take _super_ long, but it was a few minutes: ``` real4m19.537s user3m14.634s sys 0m16.882s ``` Let's run just the test suite that we're interested in ([instructions from here](https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools): ``` time mvn -T C1 test -DwildcardSuites=org.apache.spark.FileServerSuite ``` This took a little while because it had to build a bunch of test sources, but it was only a few seconds before the tests started running (and failing): ``` real0m33.968s user0m36.544s sys 0m3.032s FileServerSuite: - Distributing files locally - Distributing files locally security On *** FAILED *** java.io.FileNotFoundException: /var/folders/0k/2qp2p2vs7bv033vljnb8nk1cgn/T/spark-7206dcdd-9db6-4d78-ac9e-2de69619dc67/test/FileServerSuite.txt (No such file or directory) at java.io.FileInputStream.open(Native Method) at java.io.FileInputStream.init(FileInputStream.java:146) at com.google.common.io.Files$FileByteSource.openStream(Files.java:124) at com.google.common.io.Files$FileByteSource.openStream(Files.java:114) at com.google.common.io.ByteSource.copyTo(ByteSource.java:202) at com.google.common.io.Files.copy(Files.java:436) at org.apache.spark.HttpFileServer.addFileToDir(HttpFileServer.scala:72) at org.apache.spark.HttpFileServer.addFile(HttpFileServer.scala:55) at org.apache.spark.SparkContext.addFile(SparkContext.scala:965) at org.apache.spark.FileServerSuite$$anonfun$3.apply$mcV$sp(FileServerSuite.scala:96) ... - Distributing files locally using URL as input *** FAILED *** java.io.FileNotFoundException: /var/folders/0k/2qp2p2vs7bv033vljnb8nk1cgn/T/spark-7206dcdd-9db6-4d78-ac9e-2de69619dc67/test/FileServerSuite.txt (No such file or directory) at java.io.FileInputStream.open(Native Method) at java.io.FileInputStream.init(FileInputStream.java:146) at com.google.common.io.Files$FileByteSource.openStream(Files.java:124) at com.google.common.io.Files$FileByteSource.openStream(Files.java:114) at com.google.common.io.ByteSource.copyTo(ByteSource.java:202) at com.google.common.io.Files.copy(Files.java:436) at org.apache.spark.HttpFileServer.addFileToDir(HttpFileServer.scala:72) at org.apache.spark.HttpFileServer.addFile(HttpFileServer.scala:55) at org.apache.spark.SparkContext.addFile(SparkContext.scala:965) at org.apache.spark.FileServerSuite$$anonfun$5.apply$mcV$sp(FileServerSuite.scala:112) ... - Dynamically adding JARS locally - Distributing files on a standalone cluster *** FAILED *** java.io.FileNotFoundException: /var/folders/0k/2qp2p2vs7bv033vljnb8nk1cgn/T/spark-7206dcdd-9db6-4d78-ac9e-2de69619dc67/test/FileServerSuite.txt (No such file or directory) at java.io.FileInputStream.open(Native Method) at java.io.FileInputStream.init(FileInputStream.java:146) at com.google.common.io.Files$FileByteSource.openStream(Files.java:124) at com.google.common.io.Files$FileByteSource.openStream(Files.java:114) at com.google.common.io.ByteSource.copyTo(ByteSource.java:202) at com.google.common.io.Files.copy(Files.java:436) at org.apache.spark.HttpFileServer.addFileToDir(HttpFileServer.scala:72) at org.apache.spark.HttpFileServer.addFile(HttpFileServer.scala:55) at org.apache.spark.SparkContext.addFile(SparkContext.scala:965) at org.apache.spark.FileServerSuite$$anonfun$8.apply$mcV$sp(FileServerSuite.scala:137) ... - Dynamically adding JARS on a standalone cluster - Dynamically adding JARS on a standalone cluster using local: URL ``` Let's try adding a print statement to `SparkContext.addFile`, then re-running the tests. We could do this by re-packaging: ``` mvn -T C1 package -DskipTests ``` ``` [INFO] BUILD SUCCESS [INFO]
[GitHub] spark pull request: [SPARK-3967] don’t redundantly overwrite exe...
Github user ryan-williams commented on the pull request: https://github.com/apache/spark/pull/2848#issuecomment-66904556 interesting. this may be more appropriate on my [gigantic thread about testing/development best practices](http://apache-spark-developers-list.1001551.n3.nabble.com/Spurious-test-failures-testing-best-practices-td9560.html), but do you have any recommendations for debugging this? AFAICT, this seems to be a test that requires a full assembly-build to run; I've tried running `mvn -Dsuites=FileServerSuite test` on a branch cut from the same `master` that this PR is cut from (b004150adb503ddbb54d5cd544e39ad974497c41) and I get strange failure modes, usually just that the test seems to hang for hours. I ran a `mvn package -DskipTests` to build a fresh assembly, and tried again, and it still hung. I'm currently waiting on a `mvn clean package -DskipTests`, at which point I will try again. Debugging / iterating on a particular (integration) test, like I need to do here, remains a prohibitively slow process for me. I'm guessing this failure must have something to do with having moved / removed (temp?) files that this test was relying on; my naive approach will be to comment out some of the move/removal lines and see if that fixes the test, but on pain of running a `mvn clean package -DskipTests` every time. Is there a better way? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3967] don’t redundantly overwrite exe...
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/2848#issuecomment-66865120 Jenkins, retest this please. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3967] don’t redundantly overwrite exe...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2848#issuecomment-66865189 [Test build #24431 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24431/consoleFull) for PR 2848 at commit [`bcf2543`](https://github.com/apache/spark/commit/bcf2543de693ce6bf8a9294e0af028d2e1be3ccc). * This patch merges cleanly. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3967] don’t redundantly overwrite exe...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2848#issuecomment-66866426 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24431/ Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3967] don’t redundantly overwrite exe...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2848#issuecomment-66866422 [Test build #24431 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24431/consoleFull) for PR 2848 at commit [`bcf2543`](https://github.com/apache/spark/commit/bcf2543de693ce6bf8a9294e0af028d2e1be3ccc). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3967] don’t redundantly overwrite exe...
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/2848#issuecomment-66868021 Hmm, maybe a legitimate test failure? ``` [info] - Distributing files locally security On *** FAILED *** (82 milliseconds) [info] java.io.FileNotFoundException: /tmp/spark-27c67e3e-d5f5-4d86-b9fa-948319ea6b7a/test/FileServerSuite.txt (No such file or directory) [info] at java.io.FileInputStream.open(Native Method) [info] at java.io.FileInputStream.init(FileInputStream.java:146) [info] at com.google.common.io.Files$FileByteSource.openStream(Files.java:127) [info] at com.google.common.io.Files$FileByteSource.openStream(Files.java:117) [info] at com.google.common.io.ByteSource.copyTo(ByteSource.java:247) [info] at com.google.common.io.Files.copy(Files.java:458) [info] at org.apache.spark.HttpFileServer.addFileToDir(HttpFileServer.scala:72) [info] at org.apache.spark.HttpFileServer.addFile(HttpFileServer.scala:55) [info] at org.apache.spark.SparkContext.addFile(SparkContext.scala:965) [info] at org.apache.spark.FileServerSuite$$anonfun$3.apply$mcV$sp(FileServerSuite.scala:96) [info] at org.apache.spark.FileServerSuite$$anonfun$3.apply(FileServerSuite.scala:90) [info] at org.apache.spark.FileServerSuite$$anonfun$3.apply(FileServerSuite.scala:90) [info] at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) [info] at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) [info] at org.scalatest.Transformer.apply(Transformer.scala:22) [info] at org.scalatest.Transformer.apply(Transformer.scala:20) [info] at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) [info] at org.scalatest.Suite$class.withFixture(Suite.scala:1122) [info] at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555) [info] at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) [info] at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) [info] at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) [info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) [...] ``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3967] don’t redundantly overwrite exe...
Github user ryan-williams commented on a diff in the pull request: https://github.com/apache/spark/pull/2848#discussion_r21710708 --- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala --- @@ -412,6 +408,85 @@ private[spark] object Utils extends Logging { } /** + * Download a file from `in` to `tempFile`, then move it to `destFile`, checking whether + * `destFile` already exists, has the same contents as the downloaded file, and can be + * overwritten. + * + * @param url URL that `sourceFile` originated from, for logging purposes. + * @param in InputStream to download. + * @param tempFile File path to download `in` to. + * @param destFile File path to move `tempFile` to. + * @param fileOverwrite Whether to delete/overwrite an existing `destFile` that does not match + * `sourceFile` + */ + private def downloadStreamAndMove( +url: String, +in: InputStream, +tempFile: File, +destFile: File, +fileOverwrite: Boolean): Unit = { + +val out = new FileOutputStream(tempFile) +Utils.copyStream(in, out, closeStreams = true) +copyFile(url, tempFile, destFile, fileOverwrite, removeSourceFile = true) + --- End diff -- done --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3967] don’t redundantly overwrite exe...
Github user ryan-williams commented on a diff in the pull request: https://github.com/apache/spark/pull/2848#discussion_r21710662 --- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala --- @@ -412,6 +408,85 @@ private[spark] object Utils extends Logging { } /** + * Download a file from `in` to `tempFile`, then move it to `destFile`, checking whether + * `destFile` already exists, has the same contents as the downloaded file, and can be + * overwritten. + * + * @param url URL that `sourceFile` originated from, for logging purposes. + * @param in InputStream to download. + * @param tempFile File path to download `in` to. + * @param destFile File path to move `tempFile` to. + * @param fileOverwrite Whether to delete/overwrite an existing `destFile` that does not match + * `sourceFile` + */ + private def downloadStreamAndMove( +url: String, --- End diff -- done. I can't get intellij to do this kind of spacing by default, have you guys succeeded in making it do so? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3967] don’t redundantly overwrite exe...
Github user ryan-williams commented on a diff in the pull request: https://github.com/apache/spark/pull/2848#discussion_r21710743 --- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala --- @@ -412,6 +408,85 @@ private[spark] object Utils extends Logging { } /** + * Download a file from `in` to `tempFile`, then move it to `destFile`, checking whether + * `destFile` already exists, has the same contents as the downloaded file, and can be + * overwritten. + * + * @param url URL that `sourceFile` originated from, for logging purposes. + * @param in InputStream to download. + * @param tempFile File path to download `in` to. + * @param destFile File path to move `tempFile` to. + * @param fileOverwrite Whether to delete/overwrite an existing `destFile` that does not match + * `sourceFile` + */ + private def downloadStreamAndMove( +url: String, +in: InputStream, +tempFile: File, +destFile: File, +fileOverwrite: Boolean): Unit = { + +val out = new FileOutputStream(tempFile) +Utils.copyStream(in, out, closeStreams = true) +copyFile(url, tempFile, destFile, fileOverwrite, removeSourceFile = true) + + } + + /** + * Copy file from `sourceFile` to `destFile`, checking whether `destFile` already exists, has + * the same contents as the downloaded file, and can be overwritten. Optionally removes + * `sourceFile` by moving instead of copying. + * + * @param url URL that `sourceFile` originated from, for logging purposes. + * @param sourceFile File path to copy/move from. + * @param destFile File path to copy/move to. + * @param fileOverwrite Whether to delete/overwrite an existing `destFile` that does not match + * `sourceFile` + * @param removeSourceFile Whether to remove `sourceFile` after / as part of moving/copying it to + * `destFile`. + */ + private def copyFile( +url: String, +sourceFile: File, +destFile: File, +fileOverwrite: Boolean, +removeSourceFile: Boolean = false): Unit = { --- End diff -- done --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3967] don’t redundantly overwrite exe...
Github user ryan-williams commented on a diff in the pull request: https://github.com/apache/spark/pull/2848#discussion_r21710876 --- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala --- @@ -412,6 +408,85 @@ private[spark] object Utils extends Logging { } /** + * Download a file from `in` to `tempFile`, then move it to `destFile`, checking whether + * `destFile` already exists, has the same contents as the downloaded file, and can be + * overwritten. + * + * @param url URL that `sourceFile` originated from, for logging purposes. + * @param in InputStream to download. + * @param tempFile File path to download `in` to. + * @param destFile File path to move `tempFile` to. + * @param fileOverwrite Whether to delete/overwrite an existing `destFile` that does not match + * `sourceFile` + */ + private def downloadStreamAndMove( +url: String, +in: InputStream, +tempFile: File, +destFile: File, +fileOverwrite: Boolean): Unit = { + +val out = new FileOutputStream(tempFile) +Utils.copyStream(in, out, closeStreams = true) +copyFile(url, tempFile, destFile, fileOverwrite, removeSourceFile = true) + + } + + /** + * Copy file from `sourceFile` to `destFile`, checking whether `destFile` already exists, has + * the same contents as the downloaded file, and can be overwritten. Optionally removes + * `sourceFile` by moving instead of copying. + * + * @param url URL that `sourceFile` originated from, for logging purposes. + * @param sourceFile File path to copy/move from. + * @param destFile File path to copy/move to. + * @param fileOverwrite Whether to delete/overwrite an existing `destFile` that does not match + * `sourceFile` + * @param removeSourceFile Whether to remove `sourceFile` after / as part of moving/copying it to + * `destFile`. + */ + private def copyFile( +url: String, +sourceFile: File, +destFile: File, +fileOverwrite: Boolean, +removeSourceFile: Boolean = false): Unit = { + +var shouldCopy = true +if (destFile.exists) { + if (!Files.equal(sourceFile, destFile)) { +if (fileOverwrite) { + destFile.delete() + logInfo( +sFile $destFile exists and does not match contents of $url, replacing it with $url + ) +} else { + throw new SparkException( +sFile $destFile exists and does not match contents of $url) +} + } else { +// Do nothing if the file contents are the same, i.e. this file has been copied +// previously. +logInfo( + s${sourceFile.getAbsolutePath} has been previously copied to + +destFile.getAbsolutePath --- End diff -- ` s${sourceFile.getAbsolutePath} has been previously copied to ${destFile.getAbsolutePath}` is 101 chars long. I don't see a way to use interpolation and not concatenation while staying = 100 chars. please advise @andrewor14 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3967] don’t redundantly overwrite exe...
Github user ryan-williams commented on a diff in the pull request: https://github.com/apache/spark/pull/2848#discussion_r21711105 --- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala --- @@ -412,6 +408,85 @@ private[spark] object Utils extends Logging { } /** + * Download a file from `in` to `tempFile`, then move it to `destFile`, checking whether + * `destFile` already exists, has the same contents as the downloaded file, and can be + * overwritten. + * + * @param url URL that `sourceFile` originated from, for logging purposes. + * @param in InputStream to download. + * @param tempFile File path to download `in` to. + * @param destFile File path to move `tempFile` to. + * @param fileOverwrite Whether to delete/overwrite an existing `destFile` that does not match + * `sourceFile` + */ + private def downloadStreamAndMove( +url: String, +in: InputStream, +tempFile: File, +destFile: File, +fileOverwrite: Boolean): Unit = { + +val out = new FileOutputStream(tempFile) +Utils.copyStream(in, out, closeStreams = true) +copyFile(url, tempFile, destFile, fileOverwrite, removeSourceFile = true) + + } + + /** + * Copy file from `sourceFile` to `destFile`, checking whether `destFile` already exists, has + * the same contents as the downloaded file, and can be overwritten. Optionally removes + * `sourceFile` by moving instead of copying. + * + * @param url URL that `sourceFile` originated from, for logging purposes. + * @param sourceFile File path to copy/move from. + * @param destFile File path to copy/move to. + * @param fileOverwrite Whether to delete/overwrite an existing `destFile` that does not match + * `sourceFile` + * @param removeSourceFile Whether to remove `sourceFile` after / as part of moving/copying it to + * `destFile`. + */ + private def copyFile( +url: String, +sourceFile: File, +destFile: File, +fileOverwrite: Boolean, +removeSourceFile: Boolean = false): Unit = { + +var shouldCopy = true +if (destFile.exists) { + if (!Files.equal(sourceFile, destFile)) { +if (fileOverwrite) { + destFile.delete() + logInfo( +sFile $destFile exists and does not match contents of $url, replacing it with $url + ) +} else { + throw new SparkException( +sFile $destFile exists and does not match contents of $url) +} + } else { +// Do nothing if the file contents are the same, i.e. this file has been copied +// previously. +logInfo( + s${sourceFile.getAbsolutePath} has been previously copied to + +destFile.getAbsolutePath --- End diff -- see also: [previous diff that used only concatenation](https://github.com/apache/spark/pull/2848#discussion-diff-21350437R429) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3967] don’t redundantly overwrite exe...
Github user ryan-williams commented on a diff in the pull request: https://github.com/apache/spark/pull/2848#discussion_r21711212 --- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala --- @@ -412,6 +408,85 @@ private[spark] object Utils extends Logging { } /** + * Download a file from `in` to `tempFile`, then move it to `destFile`, checking whether + * `destFile` already exists, has the same contents as the downloaded file, and can be + * overwritten. + * + * @param url URL that `sourceFile` originated from, for logging purposes. + * @param in InputStream to download. + * @param tempFile File path to download `in` to. + * @param destFile File path to move `tempFile` to. + * @param fileOverwrite Whether to delete/overwrite an existing `destFile` that does not match + * `sourceFile` + */ + private def downloadStreamAndMove( +url: String, +in: InputStream, +tempFile: File, +destFile: File, +fileOverwrite: Boolean): Unit = { + +val out = new FileOutputStream(tempFile) +Utils.copyStream(in, out, closeStreams = true) +copyFile(url, tempFile, destFile, fileOverwrite, removeSourceFile = true) + + } + + /** + * Copy file from `sourceFile` to `destFile`, checking whether `destFile` already exists, has + * the same contents as the downloaded file, and can be overwritten. Optionally removes + * `sourceFile` by moving instead of copying. + * + * @param url URL that `sourceFile` originated from, for logging purposes. + * @param sourceFile File path to copy/move from. + * @param destFile File path to copy/move to. + * @param fileOverwrite Whether to delete/overwrite an existing `destFile` that does not match + * `sourceFile` + * @param removeSourceFile Whether to remove `sourceFile` after / as part of moving/copying it to + * `destFile`. + */ + private def copyFile( +url: String, +sourceFile: File, +destFile: File, +fileOverwrite: Boolean, +removeSourceFile: Boolean = false): Unit = { + +var shouldCopy = true +if (destFile.exists) { + if (!Files.equal(sourceFile, destFile)) { +if (fileOverwrite) { + destFile.delete() + logInfo( +sFile $destFile exists and does not match contents of $url, replacing it with $url + ) +} else { + throw new SparkException( +sFile $destFile exists and does not match contents of $url) +} + } else { +// Do nothing if the file contents are the same, i.e. this file has been copied +// previously. +logInfo( + s${sourceFile.getAbsolutePath} has been previously copied to + +destFile.getAbsolutePath --- End diff -- I've changed it to a version that uses `String.format`: ``` logInfo( %s has been previously copied to %s.format( sourceFile.getAbsolutePath, destFile.getAbsolutePath ) ) ``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3967] don’t redundantly overwrite exe...
Github user ryan-williams commented on the pull request: https://github.com/apache/spark/pull/2848#issuecomment-66693008 still adding the requested changes, stay tuned --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3967] don’t redundantly overwrite exe...
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/2848#discussion_r21712018 --- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala --- @@ -412,6 +408,85 @@ private[spark] object Utils extends Logging { } /** + * Download a file from `in` to `tempFile`, then move it to `destFile`, checking whether + * `destFile` already exists, has the same contents as the downloaded file, and can be + * overwritten. + * + * @param url URL that `sourceFile` originated from, for logging purposes. + * @param in InputStream to download. + * @param tempFile File path to download `in` to. + * @param destFile File path to move `tempFile` to. + * @param fileOverwrite Whether to delete/overwrite an existing `destFile` that does not match + * `sourceFile` + */ + private def downloadStreamAndMove( +url: String, +in: InputStream, +tempFile: File, +destFile: File, +fileOverwrite: Boolean): Unit = { + +val out = new FileOutputStream(tempFile) +Utils.copyStream(in, out, closeStreams = true) +copyFile(url, tempFile, destFile, fileOverwrite, removeSourceFile = true) + + } + + /** + * Copy file from `sourceFile` to `destFile`, checking whether `destFile` already exists, has + * the same contents as the downloaded file, and can be overwritten. Optionally removes + * `sourceFile` by moving instead of copying. + * + * @param url URL that `sourceFile` originated from, for logging purposes. + * @param sourceFile File path to copy/move from. + * @param destFile File path to copy/move to. + * @param fileOverwrite Whether to delete/overwrite an existing `destFile` that does not match + * `sourceFile` + * @param removeSourceFile Whether to remove `sourceFile` after / as part of moving/copying it to + * `destFile`. + */ + private def copyFile( +url: String, +sourceFile: File, +destFile: File, +fileOverwrite: Boolean, +removeSourceFile: Boolean = false): Unit = { + +var shouldCopy = true +if (destFile.exists) { + if (!Files.equal(sourceFile, destFile)) { +if (fileOverwrite) { + destFile.delete() + logInfo( +sFile $destFile exists and does not match contents of $url, replacing it with $url + ) +} else { + throw new SparkException( +sFile $destFile exists and does not match contents of $url) +} + } else { +// Do nothing if the file contents are the same, i.e. this file has been copied +// previously. +logInfo( + s${sourceFile.getAbsolutePath} has been previously copied to + +destFile.getAbsolutePath --- End diff -- ok by the way this is not super important so it's ok if you don't fix it --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3967] don’t redundantly overwrite exe...
Github user ryan-williams commented on a diff in the pull request: https://github.com/apache/spark/pull/2848#discussion_r21720826 --- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala --- @@ -412,6 +408,85 @@ private[spark] object Utils extends Logging { } /** + * Download a file from `in` to `tempFile`, then move it to `destFile`, checking whether + * `destFile` already exists, has the same contents as the downloaded file, and can be + * overwritten. + * + * @param url URL that `sourceFile` originated from, for logging purposes. + * @param in InputStream to download. + * @param tempFile File path to download `in` to. + * @param destFile File path to move `tempFile` to. + * @param fileOverwrite Whether to delete/overwrite an existing `destFile` that does not match + * `sourceFile` + */ + private def downloadStreamAndMove( +url: String, +in: InputStream, +tempFile: File, +destFile: File, +fileOverwrite: Boolean): Unit = { + +val out = new FileOutputStream(tempFile) +Utils.copyStream(in, out, closeStreams = true) +copyFile(url, tempFile, destFile, fileOverwrite, removeSourceFile = true) + + } + + /** + * Copy file from `sourceFile` to `destFile`, checking whether `destFile` already exists, has + * the same contents as the downloaded file, and can be overwritten. Optionally removes + * `sourceFile` by moving instead of copying. + * + * @param url URL that `sourceFile` originated from, for logging purposes. + * @param sourceFile File path to copy/move from. + * @param destFile File path to copy/move to. + * @param fileOverwrite Whether to delete/overwrite an existing `destFile` that does not match + * `sourceFile` + * @param removeSourceFile Whether to remove `sourceFile` after / as part of moving/copying it to + * `destFile`. + */ + private def copyFile( +url: String, +sourceFile: File, +destFile: File, +fileOverwrite: Boolean, +removeSourceFile: Boolean = false): Unit = { + +var shouldCopy = true --- End diff -- cool, I've moved to that. we rearranged the logic deck-chairs of this block during an earlier review pass and settled on the `var shouldCopy` approach, but I can accept the `return` as being marginally less aesthetically unappealing :) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3967] don’t redundantly overwrite exe...
Github user ryan-williams commented on the pull request: https://github.com/apache/spark/pull/2848#issuecomment-66716295 OK, I added logic for making sure we always delete `tempFile` and throw an exception if deletion of an existing `sourceFile` fails; I think I hit all the other feedback as well. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3967] don’t redundantly overwrite exe...
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/2848#discussion_r21556089 --- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala --- @@ -412,6 +408,85 @@ private[spark] object Utils extends Logging { } /** + * Download a file from `in` to `tempFile`, then move it to `destFile`, checking whether + * `destFile` already exists, has the same contents as the downloaded file, and can be + * overwritten. + * + * @param url URL that `sourceFile` originated from, for logging purposes. + * @param in InputStream to download. + * @param tempFile File path to download `in` to. + * @param destFile File path to move `tempFile` to. + * @param fileOverwrite Whether to delete/overwrite an existing `destFile` that does not match + * `sourceFile` + */ + private def downloadStreamAndMove( +url: String, --- End diff -- (but keep the Unit return type!) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3967] don’t redundantly overwrite exe...
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/2848#discussion_r21556631 --- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala --- @@ -412,6 +408,85 @@ private[spark] object Utils extends Logging { } /** + * Download a file from `in` to `tempFile`, then move it to `destFile`, checking whether + * `destFile` already exists, has the same contents as the downloaded file, and can be + * overwritten. + * + * @param url URL that `sourceFile` originated from, for logging purposes. + * @param in InputStream to download. + * @param tempFile File path to download `in` to. + * @param destFile File path to move `tempFile` to. + * @param fileOverwrite Whether to delete/overwrite an existing `destFile` that does not match + * `sourceFile` + */ + private def downloadStreamAndMove( +url: String, +in: InputStream, +tempFile: File, +destFile: File, +fileOverwrite: Boolean): Unit = { + +val out = new FileOutputStream(tempFile) +Utils.copyStream(in, out, closeStreams = true) +copyFile(url, tempFile, destFile, fileOverwrite, removeSourceFile = true) + + } + + /** + * Copy file from `sourceFile` to `destFile`, checking whether `destFile` already exists, has + * the same contents as the downloaded file, and can be overwritten. Optionally removes + * `sourceFile` by moving instead of copying. + * + * @param url URL that `sourceFile` originated from, for logging purposes. + * @param sourceFile File path to copy/move from. + * @param destFile File path to copy/move to. + * @param fileOverwrite Whether to delete/overwrite an existing `destFile` that does not match + * `sourceFile` + * @param removeSourceFile Whether to remove `sourceFile` after / as part of moving/copying it to + * `destFile`. + */ + private def copyFile( +url: String, +sourceFile: File, +destFile: File, +fileOverwrite: Boolean, +removeSourceFile: Boolean = false): Unit = { --- End diff -- indent these by 2 spaces --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3967] don’t redundantly overwrite exe...
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/2848#discussion_r21556581 --- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala --- @@ -412,6 +408,85 @@ private[spark] object Utils extends Logging { } /** + * Download a file from `in` to `tempFile`, then move it to `destFile`, checking whether + * `destFile` already exists, has the same contents as the downloaded file, and can be + * overwritten. + * + * @param url URL that `sourceFile` originated from, for logging purposes. + * @param in InputStream to download. + * @param tempFile File path to download `in` to. + * @param destFile File path to move `tempFile` to. + * @param fileOverwrite Whether to delete/overwrite an existing `destFile` that does not match + * `sourceFile` + */ + private def downloadStreamAndMove( +url: String, +in: InputStream, +tempFile: File, +destFile: File, +fileOverwrite: Boolean): Unit = { + +val out = new FileOutputStream(tempFile) +Utils.copyStream(in, out, closeStreams = true) +copyFile(url, tempFile, destFile, fileOverwrite, removeSourceFile = true) + + } + + /** + * Copy file from `sourceFile` to `destFile`, checking whether `destFile` already exists, has + * the same contents as the downloaded file, and can be overwritten. Optionally removes + * `sourceFile` by moving instead of copying. + * + * @param url URL that `sourceFile` originated from, for logging purposes. + * @param sourceFile File path to copy/move from. + * @param destFile File path to copy/move to. + * @param fileOverwrite Whether to delete/overwrite an existing `destFile` that does not match + * `sourceFile` + * @param removeSourceFile Whether to remove `sourceFile` after / as part of moving/copying it to + * `destFile`. + */ + private def copyFile( +url: String, +sourceFile: File, +destFile: File, +fileOverwrite: Boolean, +removeSourceFile: Boolean = false): Unit = { + +var shouldCopy = true +if (destFile.exists) { + if (!Files.equal(sourceFile, destFile)) { +if (fileOverwrite) { + destFile.delete() --- End diff -- what if `delete` fails? We'll still log the message it already exists, replacing. We should throw an exception or something if `delete` returns false. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3967] don’t redundantly overwrite exe...
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/2848#discussion_r21556716 --- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala --- @@ -412,6 +408,85 @@ private[spark] object Utils extends Logging { } /** + * Download a file from `in` to `tempFile`, then move it to `destFile`, checking whether + * `destFile` already exists, has the same contents as the downloaded file, and can be + * overwritten. + * + * @param url URL that `sourceFile` originated from, for logging purposes. + * @param in InputStream to download. + * @param tempFile File path to download `in` to. + * @param destFile File path to move `tempFile` to. + * @param fileOverwrite Whether to delete/overwrite an existing `destFile` that does not match + * `sourceFile` + */ + private def downloadStreamAndMove( +url: String, +in: InputStream, +tempFile: File, +destFile: File, +fileOverwrite: Boolean): Unit = { + +val out = new FileOutputStream(tempFile) +Utils.copyStream(in, out, closeStreams = true) +copyFile(url, tempFile, destFile, fileOverwrite, removeSourceFile = true) + + } + + /** + * Copy file from `sourceFile` to `destFile`, checking whether `destFile` already exists, has + * the same contents as the downloaded file, and can be overwritten. Optionally removes + * `sourceFile` by moving instead of copying. --- End diff -- Can you elaborate on what happens when `destFile` already exists, when it has the same contents as the downloaded file, and when it cannot be overwritten? Also, if you followed my suggestion to throw an exception if delete fails, we should describe that behavior here as well. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3967] don’t redundantly overwrite exe...
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/2848#discussion_r21556751 --- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala --- @@ -412,6 +408,85 @@ private[spark] object Utils extends Logging { } /** + * Download a file from `in` to `tempFile`, then move it to `destFile`, checking whether + * `destFile` already exists, has the same contents as the downloaded file, and can be + * overwritten. --- End diff -- Same here, this java doc should be more elaborate. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3967] don’t redundantly overwrite exe...
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/2848#discussion_r21556861 --- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala --- @@ -412,6 +408,85 @@ private[spark] object Utils extends Logging { } /** + * Download a file from `in` to `tempFile`, then move it to `destFile`, checking whether + * `destFile` already exists, has the same contents as the downloaded file, and can be + * overwritten. + * + * @param url URL that `sourceFile` originated from, for logging purposes. + * @param in InputStream to download. + * @param tempFile File path to download `in` to. + * @param destFile File path to move `tempFile` to. + * @param fileOverwrite Whether to delete/overwrite an existing `destFile` that does not match + * `sourceFile` + */ + private def downloadStreamAndMove( --- End diff -- Hm, but you don't always move the file. Also, from the name `downloadStreamAndMove` it's not clear what you're moving. I think it actually just makes sense to call this `downloadFile`, and describe in great detail when the file will actually be moved. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3967] don’t redundantly overwrite exe...
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/2848#discussion_r21557057 --- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala --- @@ -412,6 +408,85 @@ private[spark] object Utils extends Logging { } /** + * Download a file from `in` to `tempFile`, then move it to `destFile`, checking whether + * `destFile` already exists, has the same contents as the downloaded file, and can be + * overwritten. + * + * @param url URL that `sourceFile` originated from, for logging purposes. + * @param in InputStream to download. + * @param tempFile File path to download `in` to. + * @param destFile File path to move `tempFile` to. + * @param fileOverwrite Whether to delete/overwrite an existing `destFile` that does not match + * `sourceFile` + */ + private def downloadStreamAndMove( +url: String, +in: InputStream, +tempFile: File, +destFile: File, +fileOverwrite: Boolean): Unit = { + +val out = new FileOutputStream(tempFile) +Utils.copyStream(in, out, closeStreams = true) +copyFile(url, tempFile, destFile, fileOverwrite, removeSourceFile = true) + + } + + /** + * Copy file from `sourceFile` to `destFile`, checking whether `destFile` already exists, has + * the same contents as the downloaded file, and can be overwritten. Optionally removes + * `sourceFile` by moving instead of copying. + * + * @param url URL that `sourceFile` originated from, for logging purposes. + * @param sourceFile File path to copy/move from. + * @param destFile File path to copy/move to. + * @param fileOverwrite Whether to delete/overwrite an existing `destFile` that does not match + * `sourceFile` + * @param removeSourceFile Whether to remove `sourceFile` after / as part of moving/copying it to + * `destFile`. + */ + private def copyFile( +url: String, +sourceFile: File, +destFile: File, +fileOverwrite: Boolean, +removeSourceFile: Boolean = false): Unit = { + +var shouldCopy = true +if (destFile.exists) { + if (!Files.equal(sourceFile, destFile)) { +if (fileOverwrite) { + destFile.delete() + logInfo( +sFile $destFile exists and does not match contents of $url, replacing it with $url + ) +} else { + throw new SparkException( +sFile $destFile exists and does not match contents of $url) +} + } else { +// Do nothing if the file contents are the same, i.e. this file has been copied +// previously. +logInfo( + s${sourceFile.getAbsolutePath} has been previously copied to + +destFile.getAbsolutePath --- End diff -- weird mix of interpolation + concatenation here. Should just use interpolation. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3967] don’t redundantly overwrite exe...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/2848#discussion_r21557323 --- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala --- @@ -412,6 +408,85 @@ private[spark] object Utils extends Logging { } /** + * Download a file from `in` to `tempFile`, then move it to `destFile`, checking whether + * `destFile` already exists, has the same contents as the downloaded file, and can be + * overwritten. + * + * @param url URL that `sourceFile` originated from, for logging purposes. + * @param in InputStream to download. + * @param tempFile File path to download `in` to. + * @param destFile File path to move `tempFile` to. + * @param fileOverwrite Whether to delete/overwrite an existing `destFile` that does not match + * `sourceFile` + */ + private def downloadStreamAndMove( +url: String, +in: InputStream, +tempFile: File, +destFile: File, +fileOverwrite: Boolean): Unit = { + +val out = new FileOutputStream(tempFile) +Utils.copyStream(in, out, closeStreams = true) +copyFile(url, tempFile, destFile, fileOverwrite, removeSourceFile = true) + + } + + /** + * Copy file from `sourceFile` to `destFile`, checking whether `destFile` already exists, has + * the same contents as the downloaded file, and can be overwritten. Optionally removes + * `sourceFile` by moving instead of copying. + * + * @param url URL that `sourceFile` originated from, for logging purposes. + * @param sourceFile File path to copy/move from. + * @param destFile File path to copy/move to. + * @param fileOverwrite Whether to delete/overwrite an existing `destFile` that does not match + * `sourceFile` + * @param removeSourceFile Whether to remove `sourceFile` after / as part of moving/copying it to + * `destFile`. + */ + private def copyFile( +url: String, +sourceFile: File, +destFile: File, +fileOverwrite: Boolean, +removeSourceFile: Boolean = false): Unit = { + +var shouldCopy = true +if (destFile.exists) { + if (!Files.equal(sourceFile, destFile)) { +if (fileOverwrite) { + destFile.delete() --- End diff -- This is a good point; the old code (which was a mess) didn't handle this error case, but we might as well fix it here. I'm in favor of throwing an exception. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3967] don’t redundantly overwrite exe...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/2848#discussion_r21501389 --- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala --- @@ -412,6 +408,85 @@ private[spark] object Utils extends Logging { } /** + * Download a file from `in` to `tempFile`, then move it to `destFile`, checking whether + * `destFile` already exists, has the same contents as the downloaded file, and can be + * overwritten. + * + * @param url URL that `sourceFile` originated from, for logging purposes. + * @param in InputStream to download. + * @param tempFile File path to download `in` to. + * @param destFile File path to move `tempFile` to. + * @param fileOverwrite Whether to delete/overwrite an existing `destFile` that does not match + * `sourceFile` + */ + private def downloadStreamAndMove( +url: String, +in: InputStream, +tempFile: File, +destFile: File, +fileOverwrite: Boolean): Unit = { + +val out = new FileOutputStream(tempFile) +Utils.copyStream(in, out, closeStreams = true) +copyFile(url, tempFile, destFile, fileOverwrite, removeSourceFile = true) + --- End diff -- Mind removing this extra blank line? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3967] don’t redundantly overwrite exe...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/2848#discussion_r21501367 --- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala --- @@ -412,6 +408,85 @@ private[spark] object Utils extends Logging { } /** + * Download a file from `in` to `tempFile`, then move it to `destFile`, checking whether + * `destFile` already exists, has the same contents as the downloaded file, and can be + * overwritten. + * + * @param url URL that `sourceFile` originated from, for logging purposes. + * @param in InputStream to download. + * @param tempFile File path to download `in` to. + * @param destFile File path to move `tempFile` to. + * @param fileOverwrite Whether to delete/overwrite an existing `destFile` that does not match + * `sourceFile` + */ + private def downloadStreamAndMove( +url: String, --- End diff -- Fairly minor style nit, but do you mind indenting these method parameters by two more spaces, like `fetchFile` above: ``` def fetchFile( url: String, targetDir: File, conf: SparkConf, securityMgr: SecurityManager, hadoopConf: Configuration, timestamp: Long, useCache: Boolean) { val fileName = url.split(/).last ``` This is a really minor style point that I've been guilty of overlooking myself: https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3967] don’t redundantly overwrite exe...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/2848#discussion_r21501514 --- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala --- @@ -412,6 +408,85 @@ private[spark] object Utils extends Logging { } /** + * Download a file from `in` to `tempFile`, then move it to `destFile`, checking whether + * `destFile` already exists, has the same contents as the downloaded file, and can be + * overwritten. + * + * @param url URL that `sourceFile` originated from, for logging purposes. + * @param in InputStream to download. + * @param tempFile File path to download `in` to. + * @param destFile File path to move `tempFile` to. + * @param fileOverwrite Whether to delete/overwrite an existing `destFile` that does not match + * `sourceFile` + */ + private def downloadStreamAndMove( +url: String, +in: InputStream, +tempFile: File, +destFile: File, +fileOverwrite: Boolean): Unit = { + +val out = new FileOutputStream(tempFile) +Utils.copyStream(in, out, closeStreams = true) +copyFile(url, tempFile, destFile, fileOverwrite, removeSourceFile = true) + + } + + /** + * Copy file from `sourceFile` to `destFile`, checking whether `destFile` already exists, has + * the same contents as the downloaded file, and can be overwritten. Optionally removes + * `sourceFile` by moving instead of copying. + * + * @param url URL that `sourceFile` originated from, for logging purposes. + * @param sourceFile File path to copy/move from. + * @param destFile File path to copy/move to. + * @param fileOverwrite Whether to delete/overwrite an existing `destFile` that does not match + * `sourceFile` + * @param removeSourceFile Whether to remove `sourceFile` after / as part of moving/copying it to + * `destFile`. + */ + private def copyFile( +url: String, +sourceFile: File, +destFile: File, +fileOverwrite: Boolean, +removeSourceFile: Boolean = false): Unit = { + +var shouldCopy = true --- End diff -- This is super-nitpicky of me, but I love to avoid mutability whenever possible, so it would be nice to see if there was a clean way to remove this variable. It looks like `shouldCopy=false` only in the case where the file contents are the same, so maybe we could just add a `return` on that branch and can remove the `shouldCopy` variable entirely. This would let us remove the `if` on line 478, --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3967] don’t redundantly overwrite exe...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/2848#discussion_r21501964 --- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala --- @@ -412,6 +408,85 @@ private[spark] object Utils extends Logging { } /** + * Download a file from `in` to `tempFile`, then move it to `destFile`, checking whether + * `destFile` already exists, has the same contents as the downloaded file, and can be + * overwritten. + * + * @param url URL that `sourceFile` originated from, for logging purposes. + * @param in InputStream to download. + * @param tempFile File path to download `in` to. + * @param destFile File path to move `tempFile` to. + * @param fileOverwrite Whether to delete/overwrite an existing `destFile` that does not match + * `sourceFile` + */ + private def downloadStreamAndMove( +url: String, +in: InputStream, +tempFile: File, +destFile: File, +fileOverwrite: Boolean): Unit = { + +val out = new FileOutputStream(tempFile) +Utils.copyStream(in, out, closeStreams = true) +copyFile(url, tempFile, destFile, fileOverwrite, removeSourceFile = true) --- End diff -- This should probably be surrounded in a `try-catch` block so that we delete `tempFile` if `copyFile` throws an exception. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3967] don’t redundantly overwrite exe...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/2848#discussion_r21502123 --- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala --- @@ -412,6 +408,85 @@ private[spark] object Utils extends Logging { } /** + * Download a file from `in` to `tempFile`, then move it to `destFile`, checking whether + * `destFile` already exists, has the same contents as the downloaded file, and can be + * overwritten. + * + * @param url URL that `sourceFile` originated from, for logging purposes. + * @param in InputStream to download. + * @param tempFile File path to download `in` to. + * @param destFile File path to move `tempFile` to. + * @param fileOverwrite Whether to delete/overwrite an existing `destFile` that does not match + * `sourceFile` + */ + private def downloadStreamAndMove( +url: String, +in: InputStream, +tempFile: File, +destFile: File, +fileOverwrite: Boolean): Unit = { + +val out = new FileOutputStream(tempFile) +Utils.copyStream(in, out, closeStreams = true) +copyFile(url, tempFile, destFile, fileOverwrite, removeSourceFile = true) --- End diff -- I noticed this because there was a `tempFile.delete()` call in the old code that wasn't preserved here. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3967] don’t redundantly overwrite exe...
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/2848#issuecomment-66222053 Left a couple more comments. This is looking _much_ better than the code that was there before; thanks for your patience with the really late review. I have to run now, but I'll loop back later to take a more detailed look later. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3967] don’t redundantly overwrite exe...
Github user ryan-williams commented on a diff in the pull request: https://github.com/apache/spark/pull/2848#discussion_r21386174 --- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala --- @@ -412,6 +403,48 @@ private[spark] object Utils extends Logging { } /** + * Download a file from @in to @tempFile, then move it to @destFile, checking whether @destFile --- End diff -- sure, will do. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3967] don’t redundantly overwrite exe...
Github user ryan-williams commented on a diff in the pull request: https://github.com/apache/spark/pull/2848#discussion_r21386186 --- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala --- @@ -412,6 +403,48 @@ private[spark] object Utils extends Logging { } /** + * Download a file from @in to @tempFile, then move it to @destFile, checking whether @destFile + * already exists, is equal to the downloaded file, and can be overwritten. --- End diff -- ditto --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3967] don’t redundantly overwrite exe...
Github user ryan-williams commented on a diff in the pull request: https://github.com/apache/spark/pull/2848#discussion_r21386332 --- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala --- @@ -412,6 +403,48 @@ private[spark] object Utils extends Logging { } /** + * Download a file from @in to @tempFile, then move it to @destFile, checking whether @destFile + * already exists, is equal to the downloaded file, and can be overwritten. + */ + private def maybeMoveFile( +url: String, --- End diff -- Yes. The idea was to abstract out this duplicated code from ~4 other places in this file, all of which has a log message that includes the url, without changing any functionality. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3967] don’t redundantly overwrite exe...
Github user ryan-williams commented on a diff in the pull request: https://github.com/apache/spark/pull/2848#discussion_r21386398 --- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala --- @@ -412,6 +403,48 @@ private[spark] object Utils extends Logging { } /** + * Download a file from @in to @tempFile, then move it to @destFile, checking whether @destFile + * already exists, is equal to the downloaded file, and can be overwritten. + */ + private def maybeMoveFile( +url: String, +sourceFile: File, +destFile: File, +fileOverwrite: Boolean): Unit = { + +var shouldCopy = true --- End diff -- geez, seems like I had a bad merge at some point and dropped an important code path in this function; e.g. the `Files.move` call is gone. will audit and fix --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3967] don’t redundantly overwrite exe...
Github user ryan-williams commented on a diff in the pull request: https://github.com/apache/spark/pull/2848#discussion_r21386447 --- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala --- @@ -412,6 +403,48 @@ private[spark] object Utils extends Logging { } /** + * Download a file from @in to @tempFile, then move it to @destFile, checking whether @destFile + * already exists, is equal to the downloaded file, and can be overwritten. + */ + private def maybeMoveFile( +url: String, +sourceFile: File, +destFile: File, +fileOverwrite: Boolean): Unit = { + +var shouldCopy = true +if (destFile.exists) { --- End diff -- yea, merge bug I think; looking --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3967] don’t redundantly overwrite exe...
Github user ryan-williams commented on a diff in the pull request: https://github.com/apache/spark/pull/2848#discussion_r21386501 --- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala --- @@ -412,6 +403,48 @@ private[spark] object Utils extends Logging { } /** + * Download a file from @in to @tempFile, then move it to @destFile, checking whether @destFile + * already exists, is equal to the downloaded file, and can be overwritten. + */ + private def maybeMoveFile( +url: String, +sourceFile: File, +destFile: File, +fileOverwrite: Boolean): Unit = { + +var shouldCopy = true +if (destFile.exists) { + if (!Files.equal(sourceFile, destFile)) { +if (fileOverwrite) { + destFile.delete() + logInfo((File %s exists and does not match contents of %s, + --- End diff -- sure, done --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3967] don’t redundantly overwrite exe...
Github user ryan-williams commented on a diff in the pull request: https://github.com/apache/spark/pull/2848#discussion_r21386652 --- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala --- @@ -412,6 +403,48 @@ private[spark] object Utils extends Logging { } /** + * Download a file from @in to @tempFile, then move it to @destFile, checking whether @destFile + * already exists, is equal to the downloaded file, and can be overwritten. + */ + private def maybeMoveFile( +url: String, +sourceFile: File, +destFile: File, +fileOverwrite: Boolean): Unit = { + +var shouldCopy = true +if (destFile.exists) { + if (!Files.equal(sourceFile, destFile)) { +if (fileOverwrite) { + destFile.delete() + logInfo((File %s exists and does not match contents of %s, + +replacing it with %s).format(destFile, url, url)) --- End diff -- one would think so; this is how it was at [L459](https://github.com/apache/spark/pull/2848/files#diff-d239aee594001f8391676e1047a0381eL459), [L477](https://github.com/apache/spark/pull/2848/files#diff-d239aee594001f8391676e1047a0381eL477), and [L506](https://github.com/apache/spark/pull/2848/files#diff-d239aee594001f8391676e1047a0381eL506). Not obvious to me what the original intent was but I'll try to figure it out and do something more sensible. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3967] don’t redundantly overwrite exe...
Github user ryan-williams commented on a diff in the pull request: https://github.com/apache/spark/pull/2848#discussion_r21386945 --- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala --- @@ -412,6 +403,48 @@ private[spark] object Utils extends Logging { } /** + * Download a file from @in to @tempFile, then move it to @destFile, checking whether @destFile + * already exists, is equal to the downloaded file, and can be overwritten. + */ + private def maybeMoveFile( +url: String, +sourceFile: File, +destFile: File, +fileOverwrite: Boolean): Unit = { + +var shouldCopy = true +if (destFile.exists) { + if (!Files.equal(sourceFile, destFile)) { +if (fileOverwrite) { + destFile.delete() + logInfo((File %s exists and does not match contents of %s, + +replacing it with %s).format(destFile, url, url)) +} else { + throw new SparkException( +File + destFile + exists and does not match contents of + + url) +} + } else { +// Do nothing if the file contents are the same, i.e. this file has been copied +// previously. +logInfo(sourceFile.getAbsolutePath + has been previously copied to --- End diff -- done, though it went to 101 chars and I had to wrap it, which is not the clearest, but whatevs. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3967] don’t redundantly overwrite exe...
Github user ryan-williams commented on the pull request: https://github.com/apache/spark/pull/2848#issuecomment-65827581 Thanks for the review pass, @JoshRosen. As I mentioned in some of the comments, this was attempting to shoehorn 4 basically-identical blocks of code from elsewhere in this file into one common code path (it started with 1 or 2 and then grew to encompass the others as previous reviewers suggested that that was desirable). The PR has been mostly sitting around for many weeks and apparently one of the maintenance merges I did removed some key bits; I'll fix those and bump this again. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3967] don’t redundantly overwrite exe...
Github user ryan-williams commented on a diff in the pull request: https://github.com/apache/spark/pull/2848#discussion_r21389032 --- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala --- @@ -412,6 +403,48 @@ private[spark] object Utils extends Logging { } /** + * Download a file from @in to @tempFile, then move it to @destFile, checking whether @destFile + * already exists, is equal to the downloaded file, and can be overwritten. + */ + private def maybeMoveFile( +url: String, +sourceFile: File, +destFile: File, +fileOverwrite: Boolean): Unit = { + +var shouldCopy = true +if (destFile.exists) { + if (!Files.equal(sourceFile, destFile)) { +if (fileOverwrite) { + destFile.delete() + logInfo((File %s exists and does not match contents of %s, + +replacing it with %s).format(destFile, url, url)) --- End diff -- on reading it again, I think the original code (that predates me) made sense: it asserting that the file exists and does not match `url`, therefore we are replacing the thing that already exists with what is at `url`, because that is what we want it to match. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3967] don’t redundantly overwrite exe...
Github user ryan-williams commented on the pull request: https://github.com/apache/spark/pull/2848#issuecomment-65834877 OK @JoshRosen I fixed and cleaned things up. * the two overloaded `maybeMoveFile` signatures are more distinctly named (`downloadStreamAndMove` and `copyFile`) * each is called twice, and the former actually uses the latter, so I think there are some overall code-reuse wins, in addition to the don't move/copy if the thing is already there bug fix happening in one place. I'm open to writing some tests but it's not that obvious how to do so in a way that is meaningful / doesn't just duplicate the logic in the file itself. The fetch file code path is not tested in the existing `UtilsSuite` afaict, and it doesn't feel worth adding an integration test around, imho. lmk what else you'd like. Thanks. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3967] don’t redundantly overwrite exe...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/2848#discussion_r21350361 --- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala --- @@ -412,6 +403,48 @@ private[spark] object Utils extends Logging { } /** + * Download a file from @in to @tempFile, then move it to @destFile, checking whether @destFile --- End diff -- Style issue: I think it's clearer to use backticks, e.g. ``` download a file from `in` to `tempFile` ... ``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3967] don’t redundantly overwrite exe...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/2848#discussion_r21350382 --- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala --- @@ -412,6 +403,48 @@ private[spark] object Utils extends Logging { } /** + * Download a file from @in to @tempFile, then move it to @destFile, checking whether @destFile + * already exists, is equal to the downloaded file, and can be overwritten. --- End diff -- is equal to is potentially ambiguous; do you think it would be clearer to say has the same contents as? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3967] don’t redundantly overwrite exe...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/2848#discussion_r21350418 --- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala --- @@ -412,6 +403,48 @@ private[spark] object Utils extends Logging { } /** + * Download a file from @in to @tempFile, then move it to @destFile, checking whether @destFile + * already exists, is equal to the downloaded file, and can be overwritten. + */ + private def maybeMoveFile( +url: String, +sourceFile: File, +destFile: File, +fileOverwrite: Boolean): Unit = { + +var shouldCopy = true +if (destFile.exists) { + if (!Files.equal(sourceFile, destFile)) { +if (fileOverwrite) { + destFile.delete() + logInfo((File %s exists and does not match contents of %s, + +replacing it with %s).format(destFile, url, url)) --- End diff -- `url, url` here is a bug? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3967] don’t redundantly overwrite exe...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/2848#discussion_r21350409 --- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala --- @@ -412,6 +403,48 @@ private[spark] object Utils extends Logging { } /** + * Download a file from @in to @tempFile, then move it to @destFile, checking whether @destFile + * already exists, is equal to the downloaded file, and can be overwritten. + */ + private def maybeMoveFile( +url: String, +sourceFile: File, +destFile: File, +fileOverwrite: Boolean): Unit = { + +var shouldCopy = true +if (destFile.exists) { + if (!Files.equal(sourceFile, destFile)) { +if (fileOverwrite) { + destFile.delete() + logInfo((File %s exists and does not match contents of %s, + --- End diff -- As long as you're modifying this code, it might be nice to use string formatters, e.g. `sFile $destFile exists and does not match the contents of $url ...` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3967] don’t redundantly overwrite exe...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/2848#discussion_r21350464 --- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala --- @@ -412,6 +403,48 @@ private[spark] object Utils extends Logging { } /** + * Download a file from @in to @tempFile, then move it to @destFile, checking whether @destFile + * already exists, is equal to the downloaded file, and can be overwritten. + */ + private def maybeMoveFile( +url: String, +sourceFile: File, +destFile: File, +fileOverwrite: Boolean): Unit = { + +var shouldCopy = true --- End diff -- This variable isn't read anywhere; I'd remove it. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3967] don’t redundantly overwrite exe...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/2848#discussion_r21350437 --- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala --- @@ -412,6 +403,48 @@ private[spark] object Utils extends Logging { } /** + * Download a file from @in to @tempFile, then move it to @destFile, checking whether @destFile + * already exists, is equal to the downloaded file, and can be overwritten. + */ + private def maybeMoveFile( +url: String, +sourceFile: File, +destFile: File, +fileOverwrite: Boolean): Unit = { + +var shouldCopy = true +if (destFile.exists) { + if (!Files.equal(sourceFile, destFile)) { +if (fileOverwrite) { + destFile.delete() + logInfo((File %s exists and does not match contents of %s, + +replacing it with %s).format(destFile, url, url)) +} else { + throw new SparkException( +File + destFile + exists and does not match contents of + + url) +} + } else { +// Do nothing if the file contents are the same, i.e. this file has been copied +// previously. +logInfo(sourceFile.getAbsolutePath + has been previously copied to --- End diff -- Same here; might be nice to use string formatters. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3967] don’t redundantly overwrite exe...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/2848#discussion_r21350546 --- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala --- @@ -412,6 +403,48 @@ private[spark] object Utils extends Logging { } /** + * Download a file from @in to @tempFile, then move it to @destFile, checking whether @destFile + * already exists, is equal to the downloaded file, and can be overwritten. + */ + private def maybeMoveFile( +url: String, +sourceFile: File, +destFile: File, +fileOverwrite: Boolean): Unit = { + +var shouldCopy = true +if (destFile.exists) { --- End diff -- There's no `else` branch for this `if`, which seems like an error. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3967] don’t redundantly overwrite exe...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/2848#discussion_r21350677 --- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala --- @@ -412,6 +403,48 @@ private[spark] object Utils extends Logging { } /** + * Download a file from @in to @tempFile, then move it to @destFile, checking whether @destFile + * already exists, is equal to the downloaded file, and can be overwritten. + */ + private def maybeMoveFile( +url: String, --- End diff -- I think that `url` is only used in log messages; is this intentional? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3967] don’t redundantly overwrite exe...
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/2848#issuecomment-65737090 Hi @ryan-williams, I took a quick glance through the code and find the new `maybeMoveFile `method to be a bit confusing: - It takes a `url` parameter that it doesn't use for anything other than logging. - The name `maybe` implies an operation that can succeed or fail; it would be nice if the method's documentation clarified how success / failure is reported, maybe as a Scaladoc `@throws` annotation of a `Try` or something. This is important since the caller may need to know that the source file wasn't moved so that it can perform proper cleanup (e.g. delete the file). - The different cases don't seem to be exhaustively handled; what happens if the destination file _doesn't_ already exist? - One of the overloaded versions seems like it has two responsibilities: download a file then move it; it would be nice to separate these concerns to keep the individual methods simple and easy to understand. Do you know if we have a test suite for this file copying code? If not, we should create one. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3967] don’t redundantly overwrite exe...
Github user ryan-williams commented on the pull request: https://github.com/apache/spark/pull/2848#issuecomment-65020585 @pwendell @andrewor14 want to give this another look? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3967] don’t redundantly overwrite exe...
Github user ryan-williams commented on the pull request: https://github.com/apache/spark/pull/2848#issuecomment-62165292 OK, I refactored a little further. Updates: * renamed the helper function `maybeMoveFile` (instead of `moveFile`) * introduced a second signature for `maybeMoveFile` that just takes two `File`s * this allowed me to bring the 3rd instance of this repeated logic in `Utils.doFetchFile` into the fold, which helps the overall consistency / cleanliness a lot, I think. * incidentally, that last code path handled the `exists` vs. `delete()` trickery differently than I was doing before; it used a boolean `var` that recorded explicitly whether we `shouldCopy` (`true` to start, set to `false` iff we found an identical file to exist). I decided that this way was cleaner, per @andrewor14's and @pwendell's (earlier in this thread) suggestions, and structured `maybeMoveFile` that way. * folded the code path around L397 into `maybeMoveFile` as well, per @andrewor14's last suggestion. lmk how it looks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3967] don’t redundantly overwrite exe...
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/2848#discussion_r19985577 --- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala --- @@ -412,6 +412,36 @@ private[spark] object Utils extends Logging { } /** + * Download a file from @in to @tempFile, then move it to @destFile, checking whether @destFile + * already exists, is equal to the downloaded file, and can be overwritten. + */ + private def moveFile(url: String, + in: InputStream, + tempFile: File, + destFile: File, + fileOverwrite: Boolean): Unit = { --- End diff -- style should be ``` private def moveFile( url: String, ... fileOverwrite: Boolean): Unit = { ... } ``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3967] don’t redundantly overwrite exe...
Github user andrewor14 commented on the pull request: https://github.com/apache/spark/pull/2848#issuecomment-62073167 retest this please, just because Jenkins hasn't tested this one in a while. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3967] don’t redundantly overwrite exe...
Github user andrewor14 commented on the pull request: https://github.com/apache/spark/pull/2848#issuecomment-62073442 Hey @ryan-williams +1 on abstracting the duplicate code. Just one clarification: the only change in functionality here is it saves overwriting an existing jar unnecessarily, is that correct? How does this solve SPARK-3967? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3967] don’t redundantly overwrite exe...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2848#issuecomment-62073441 [Test build #23029 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23029/consoleFull) for PR 2848 at commit [`80a20bd`](https://github.com/apache/spark/commit/80a20bd8422ebd78b89bca6f1655959383b0c701). * This patch merges cleanly. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3967] don’t redundantly overwrite exe...
Github user ryan-williams commented on the pull request: https://github.com/apache/spark/pull/2848#issuecomment-62074970 Thanks for having a look, I'll get to that style nit momentarily. The tie-in to SPARK-3967 is that, on my cluster, when attempting to redundantly copy the JAR deps, I was getting Permission Denied errors because not only were the JARs already there, they were read-only and could not be overwritten. I did not get to the bottom of who was putting them there; it seemed like every executor would try to copy its JARs, find them to be already present, try to copy anyway, and fail due to permission denied. Simply skipping the redundant copy attempt (as this PR does) resulted in my jobs passing, and it seemed like the original authors intended to not attempt to redundantly copy but had a bug, so I settled for this fix and didn't get fully to the bottom of this. For example, suppose an executor E hit the code path where a given JAR is present but its contents are not identical to the one E wants. In this case, E would attempt to overwrite the existing JAR and fail due to the same Permission denied, so there could be some larger way in which it needs to be made possible for executors to overwrite their JAR deps (maybe that is just a cluster config issue I and @preaudc are hitting?), but if so I think that's orthogonal to this change. Let me know if I should just take the SPARK-3967 label out of this PR, since it might not fully solve the issue. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3967] don’t redundantly overwrite exe...
Github user ryan-williams commented on the pull request: https://github.com/apache/spark/pull/2848#issuecomment-62075551 fixed the style nit, also added a comment nit I found in CoGroupedRDD, lmk if you want that in a separate PR or something --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3967] don’t redundantly overwrite exe...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2848#issuecomment-62081923 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23029/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3967] don’t redundantly overwrite exe...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2848#issuecomment-62081916 [Test build #23029 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23029/consoleFull) for PR 2848 at commit [`80a20bd`](https://github.com/apache/spark/commit/80a20bd8422ebd78b89bca6f1655959383b0c701). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3967] don’t redundantly overwrite exe...
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/2848#discussion_r19990853 --- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala --- @@ -412,6 +412,37 @@ private[spark] object Utils extends Logging { } /** + * Download a file from @in to @tempFile, then move it to @destFile, checking whether @destFile + * already exists, is equal to the downloaded file, and can be overwritten. + */ + private def moveFile( + url: String, + in: InputStream, + tempFile: File, + destFile: File, + fileOverwrite: Boolean): Unit = { +val out = new FileOutputStream(tempFile) +Utils.copyStream(in, out, closeStreams = true) +if (destFile.exists !Files.equal(tempFile, destFile)) { + if (fileOverwrite) { +destFile.delete() +logInfo((File %s exists and does not match contents of %s, + + replacing it with %s).format(destFile, url, url)) --- End diff -- Hey @ryan-williams my concern is that if this fails (e.g. because of some permissions thing), then we will always have an old version of the file because `destFile.exists` will always be true down there in L440. It might make sense to throw an exception if delete is unsuccessful instead of logging an incorrect message here (I realize you just moved this code, but it would still be good to fix this). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3967] don’t redundantly overwrite exe...
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/2848#discussion_r19990952 --- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala --- @@ -412,6 +412,37 @@ private[spark] object Utils extends Logging { } /** + * Download a file from @in to @tempFile, then move it to @destFile, checking whether @destFile + * already exists, is equal to the downloaded file, and can be overwritten. + */ + private def moveFile( + url: String, + in: InputStream, + tempFile: File, + destFile: File, + fileOverwrite: Boolean): Unit = { +val out = new FileOutputStream(tempFile) +Utils.copyStream(in, out, closeStreams = true) +if (destFile.exists !Files.equal(tempFile, destFile)) { + if (fileOverwrite) { +destFile.delete() +logInfo((File %s exists and does not match contents of %s, + + replacing it with %s).format(destFile, url, url)) + } else { +tempFile.delete() +throw new SparkException( + File + destFile + exists and does not match contents of + + url) + } +} +// If the destFile exists at this point, it is equal to tempFile and we can skip moving +// it; the code above deletes it or throws an Exception, as appropriate, if it exists and +// is not equal to tempFile. --- End diff -- Hey @ryan-williams my concern is that if the delete fails (e.g. because of some permissions thing), then we will always have an old version of the file because `destFile.exists` will always be true. It might make sense to throw an exception if delete is unsuccessful instead of logging a message up there in L429 that is technically not true (I realize you just moved the code, but it would still be good to fix this). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3967] don’t redundantly overwrite exe...
Github user andrewor14 commented on the pull request: https://github.com/apache/spark/pull/2848#issuecomment-62087106 Hey @ryan-williams thanks for your explanation. Looks like there could be another potential source of the same issue in L397, when we use the executor cache. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3967] don’t redundantly overwrite exe...
Github user ryan-williams commented on the pull request: https://github.com/apache/spark/pull/2848#issuecomment-60550074 bump. @JoshRosen @pwendell lmk what the process is please? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3967] don’t redundantly overwrite exe...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/2848#issuecomment-60550655 At this point its waiting on review from us. No action needed on your side. I need to look this one over again since it was refactored but I'm sure we can get it in soon. --- Sent from my phone On Oct 26, 2014 9:49 PM, Ryan Williams notificati...@github.com wrote: bump. @JoshRosen https://github.com/JoshRosen @pwendell https://github.com/pwendell lmk what the process is please? â Reply to this email directly or view it on GitHub https://github.com/apache/spark/pull/2848#issuecomment-60550074. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3967] don’t redundantly overwrite exe...
Github user ryan-williams commented on the pull request: https://github.com/apache/spark/pull/2848#issuecomment-60550725 cool, thanks @pwendell. I am new here so don't know if there are other things I'm supposed to be doing! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3967] don’t redundantly overwrite exe...
Github user ryan-williams commented on the pull request: https://github.com/apache/spark/pull/2848#issuecomment-60172174 bump; what's the process for getting things merged in? any more comments here? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3967] don’t redundantly overwrite exe...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2848#issuecomment-60014039 Can one of the admins verify this patch? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3967] don’t redundantly overwrite exe...
Github user ryan-williams commented on the pull request: https://github.com/apache/spark/pull/2848#issuecomment-59785816 @preaudc see last commit, I applied this change to the `case _` as well, per your suggestion! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3967] don’t redundantly overwrite exe...
Github user ryan-williams commented on the pull request: https://github.com/apache/spark/pull/2848#issuecomment-59807140 Jenkins, test this please. Does that work if I am not an admin? @pwendell agreed, the logic is a little tricky but I couldn't find a simpler way to express it; in the meantime, I factored it out since it was repeated in two `case`s --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3967] don’t redundantly overwrite exe...
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/2848#issuecomment-59807368 Jenkins, retest this please. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3967] don’t redundantly overwrite exe...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2848#issuecomment-59808502 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21924/consoleFull) for PR 2848 at commit [`5f1a6f1`](https://github.com/apache/spark/commit/5f1a6f184dae3950f9b5348f8d1d997be1b6e7d4). * This patch merges cleanly. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3967] don’t redundantly overwrite exe...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2848#issuecomment-59820947 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21924/consoleFull) for PR 2848 at commit [`5f1a6f1`](https://github.com/apache/spark/commit/5f1a6f184dae3950f9b5348f8d1d997be1b6e7d4). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3967] don’t redundantly overwrite exe...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2848#issuecomment-59820962 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21924/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3967] don’t redundantly overwrite exe...
GitHub user ryan-williams opened a pull request: https://github.com/apache/spark/pull/2848 [SPARK-3967] donât redundantly overwrite executor JAR deps You can merge this pull request into a Git repository by running: $ git pull https://github.com/ryan-williams/spark fetch-file Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/2848.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2848 commit f3c80ae80be474ed9928b319f2b0d7808b028915 Author: Ryan Williams ryan.blake.willi...@gmail.com Date: 2014-10-17T22:21:23Z donât redundantly overwrite executor JAR deps see SPARK-3967 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3967] don’t redundantly overwrite exe...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2848#issuecomment-59669916 Can one of the admins verify this patch? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3967] don’t redundantly overwrite exe...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/2848#issuecomment-59677498 Jenkins, test this please. This LGTM pending tests - I do wonder whether there is a simple way to structure the logic such that we don't need a big comment explaining the safety there, but I couldn't think of one that didn't involve a lot of nesting. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3967] don’t redundantly overwrite exe...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2848#issuecomment-59677821 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21901/consoleFull) for PR 2848 at commit [`f3c80ae`](https://github.com/apache/spark/commit/f3c80ae80be474ed9928b319f2b0d7808b028915). * This patch merges cleanly. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3967] don’t redundantly overwrite exe...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2848#issuecomment-59681083 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21901/consoleFull) for PR 2848 at commit [`f3c80ae`](https://github.com/apache/spark/commit/f3c80ae80be474ed9928b319f2b0d7808b028915). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3967] don’t redundantly overwrite exe...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2848#issuecomment-59681088 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21901/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org