[GitHub] spark issue #19497: [SPARK-21549][CORE] Respect OutputFormats with no/invali...
Github user steveloughran commented on the issue: https://github.com/apache/spark/pull/19497 I guess one aspect of `saveAsNewAPIHadoopFile` is that it calls ` jobConfiguration.set("mapreduce.output.fileoutputformat.outputdir", path)`, and `Configuration.set(String key, String value)` has a check for null key or value. If handling of paths is to be done in the committer, `saveAsNewAPIHadoopFile` should really be looking @ path and calling jobConfiguration.unset("mapreduce.output.fileoutputformat.outputdir) if path==null. Looking at how Hadoop's FileOutputFormat implementations work, they can handle a null/undefined output dir property, *but not an empty one*. ```java public static Path getOutputPath(JobContext job) { String name = job.getConfiguration().get(FileOutputFormat.OUTDIR); return name == null ? null: new Path(name); ``` Which implies that `saveAsNewHadoopFile("")` might want to unset the config option too, so offloading the problem of what happens on an empty path to the committer. Though I'd recommend checking to see what meaningful exceptions actually get raised in this situation when the committer is the normal FileOutputFormat/FileOutputCommitter setup --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19497: [SPARK-21549][CORE] Respect OutputFormats with no/invali...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19497 Thank you @mridulm. I regret that I raised this here, causing confusion. Let's talk more in another place. I will cc you (and @jiangxb1987) when I happened to file up a JIRA or see similar issue related with this. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19497: [SPARK-21549][CORE] Respect OutputFormats with no/invali...
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/19497 @HyukjinKwon Thanks for clarifying. The way I look at it is: `saveAsHadoopFile` is explicitly referring to `Output the RDD to any Hadoop-supported file system` in its description (and name) - and so valid `Path` is a reasonable requirement. Additionally, in createPathFromString for `path == null` we are explicitly throwing `IllegalArgumentException` (`new Path` will do the same now, but I think this changed in past where it used to result in NPE ?). The subsequent `val outputPath = new Path(path)` will do that for other invalid input paths as well. In contrast `saveAsHadoopDataset` is not related to file system but `Output the RDD to any Hadoop-supported storage system` : where output being a valid `Path` is not a requirement. Having said that, we can always iterate in a jira if you feel there is some confusion - it is always better to be explicitly clear about the interfaces we expose and support ! Thanks. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19497: [SPARK-21549][CORE] Respect OutputFormats with no/invali...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19497 I agree this was focussed more on the regression introduced and it should be good enough already, and I am talking about a different thing for behaviour change. Let me organise my idea and and try to file a JIRA later. I think strictly this is separate anyway if I haven't missed something or simply I was wrong. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19497: [SPARK-21549][CORE] Respect OutputFormats with no/invali...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19497 Currently, I meant `saveAsNewAPIHadoopFile` comparing to `saveAsHadoopFile`. ``` saveAsNewAPIHadoopFile[...]("") // succeeds ``` ``` saveAsHadoopFile[...]("") // fails Can not create a Path from an empty string java.lang.IllegalArgumentException: Can not create a Path from an empty string at org.apache.hadoop.fs.Path.checkPathArg(Path.java:127) at org.apache.hadoop.fs.Path.(Path.java:135) at org.apache.spark.internal.io.SparkHadoopWriterUtils$.createPathFromString(SparkHadoopWriterUtils.scala:54) ``` I wanted to talk about this. `saveAsHadoopFile` is being failed within Spark side. So, I suspect `saveAsNewAPIHadoopFile` should also fail fast in this way. `saveAsHadoopFile` validates the path so I thought `saveAsNewAPIHadoopFile` should also validate. https://github.com/apache/spark/blob/3f958a99921d149fb9fdf7ba7e78957afdad1405/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L1004-L1008 https://github.com/apache/spark/blob/3f958a99921d149fb9fdf7ba7e78957afdad1405/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L983-L987 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19497: [SPARK-21549][CORE] Respect OutputFormats with no/invali...
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/19497 `saveAsNewAPIHadoopFile ` simply delegates to `saveAsNewAPIHadoopDataset` (with some options set), right ? The behavior would be similar ? Do you mean `saveAsHadoopDataset` instead ? I did not change behavior there - since the exception was getting raised from within hadoop code and not from our code (when we pass invalid values), and it is preserving behavior from earlier code. I was focussed more on the regression introduced. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19497: [SPARK-21549][CORE] Respect OutputFormats with no/invali...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19497 I support this PR itself of course. I have no problem with this. I meant a separate (soft) question about `saveAsNewAPIHadoopFile` (not `saveAsNewAPIHadoopDataset`) to validate `path` parameter which we take in `saveAsNewAPIHadoopFile` explicitly. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19497: [SPARK-21549][CORE] Respect OutputFormats with no/invali...
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/19497 To clarify, we can look at changing the behavior (if required) in future - but that should be an explicit design choice informed by hadoop committer design. Until then, we should look to interoperate. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19497: [SPARK-21549][CORE] Respect OutputFormats with no/invali...
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/19497 @HyukjinKwon My intention was to preserve earlier behavior. Particularly for non-path based committer's, the `path` variable and its use/processing is not relevant, it makes more sense to ignore that codepath entirely. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19497: [SPARK-21549][CORE] Respect OutputFormats with no/invali...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19497 @mridulm, BTW, WDYT about disallowing: ``` .saveAsNewAPIHadoopFile[...]("") .saveAsNewAPIHadoopFile[...]("::invalid:::") ``` within the APIs? If i tested this correctly, this PR also allows both cases but I think we should disallow as it requires `path` and overrides it and `saveAsHadoopFile` disallows it. Seems this is also allowed in branch-2.1 as I recall correctly but looks disallowing it sounds more making sense in any event. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19497: [SPARK-21549][CORE] Respect OutputFormats with no/invali...
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/19497 Thanks for the reviews everyone ! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19497: [SPARK-21549][CORE] Respect OutputFormats with no/invali...
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/19497 Thx for taking a deeper look @HyukjinKwon, much appreciated ! I will wait for @jiangxb1987 to also opine before committing - I want to make sure we are not adding incorrect behavior; given that this is a followup to an earlier PR (some excellent work by @szhem btw) --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19497: [SPARK-21549][CORE] Respect OutputFormats with no/invali...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19497 @mridulm, I just checked thought the related changes and checked the tests pass on branch-2.1. Seems this PR will actually also allow the cases below: ```scala .saveAsNewAPIHadoopFile[...]("") .saveAsNewAPIHadoopFile[...]("::invalid:::") ``` Currently both are failed but seems this PR allows those cases: ``` Can not create a Path from an empty string java.lang.IllegalArgumentException: Can not create a Path from an empty string at org.apache.hadoop.fs.Path.checkPathArg(Path.java:127) at org.apache.hadoop.fs.Path.(Path.java:135) at org.apache.hadoop.fs.Path.(Path.java:89) at org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.absPathStagingDir(HadoopMapReduceCommitProtocol.scala:61) ... ``` ``` java.net.URISyntaxException: Relative path in absolute URI: ::invalid::: java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: ::invalid::: at org.apache.hadoop.fs.Path.initialize(Path.java:206) at org.apache.hadoop.fs.Path.(Path.java:172) at org.apache.hadoop.fs.Path.(Path.java:89) at org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.absPathStagingDir(HadoopMapReduceCommitProtocol.scala:61) ... ``` I think we should protect these cases as this. For the cases for old one: ```scala .saveAsHadoopFile[...]("") .saveAsHadoopFile[...]("::invalid:::") ``` these looks failed fast (whether it was initially intended or not) and I guess this PR does not affect these: ``` Can not create a Path from an empty string java.lang.IllegalArgumentException: Can not create a Path from an empty string at org.apache.hadoop.fs.Path.checkPathArg(Path.java:127) at org.apache.hadoop.fs.Path.(Path.java:135) at org.apache.spark.internal.io.SparkHadoopWriterUtils$.createPathFromString(SparkHadoopWriterUtils.scala:54) ``` ``` java.net.URISyntaxException: Relative path in absolute URI: ::invalid::: java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: ::invalid::: at org.apache.hadoop.fs.Path.initialize(Path.java:206) at org.apache.hadoop.fs.Path.(Path.java:172) at org.apache.spark.internal.io.SparkHadoopWriterUtils$.createPathFromString(SparkHadoopWriterUtils.scala:54) ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19497: [SPARK-21549][CORE] Respect OutputFormats with no/invali...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19497 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19497: [SPARK-21549][CORE] Respect OutputFormats with no/invali...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19497 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82754/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19497: [SPARK-21549][CORE] Respect OutputFormats with no/invali...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19497 **[Test build #82754 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82754/testReport)** for PR 19497 at commit [`a319df3`](https://github.com/apache/spark/commit/a319df36db5bd202a14b44a09e9d1887f1633aec). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19497: [SPARK-21549][CORE] Respect OutputFormats with no/invali...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19497 Let me take a look with few tests and be back. Also I think I should cc @jiangxb1987 too. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19497: [SPARK-21549][CORE] Respect OutputFormats with no/invali...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19497 **[Test build #82754 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82754/testReport)** for PR 19497 at commit [`a319df3`](https://github.com/apache/spark/commit/a319df36db5bd202a14b44a09e9d1887f1633aec). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19497: [SPARK-21549][CORE] Respect OutputFormats with no/invali...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19497 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19497: [SPARK-21549][CORE] Respect OutputFormats with no/invali...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19497 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19497: [SPARK-21549][CORE] Respect OutputFormats with no/invali...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19497 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82753/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19497: [SPARK-21549][CORE] Respect OutputFormats with no/invali...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19497 **[Test build #82753 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82753/testReport)** for PR 19497 at commit [`a319df3`](https://github.com/apache/spark/commit/a319df36db5bd202a14b44a09e9d1887f1633aec). * This patch **fails due to an unknown error code, -9**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19497: [SPARK-21549][CORE] Respect OutputFormats with no/invali...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19497 **[Test build #82753 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82753/testReport)** for PR 19497 at commit [`a319df3`](https://github.com/apache/spark/commit/a319df36db5bd202a14b44a09e9d1887f1633aec). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19497: [SPARK-21549][CORE] Respect OutputFormats with no/invali...
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/19497 +CC @HyukjinKwon, @steveloughran Sorry for messing up PR #19487 The only change in this PR is to use `::invalid::` instead of `test:` in the test to address @steveloughran's comment. Thanks. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org