[GitHub] spark issue #19497: [SPARK-21549][CORE] Respect OutputFormats with no/invali...

2017-10-17 Thread steveloughran
Github user steveloughran commented on the issue:

https://github.com/apache/spark/pull/19497
  
I guess one aspect of `saveAsNewAPIHadoopFile` is that it calls ` 
jobConfiguration.set("mapreduce.output.fileoutputformat.outputdir", path)`, and 
`Configuration.set(String key, String value)` has a check for null key or value.

If handling of paths is to be done in the committer, 
`saveAsNewAPIHadoopFile` should really be looking @ path and calling 
jobConfiguration.unset("mapreduce.output.fileoutputformat.outputdir) if 
path==null. 

Looking at how Hadoop's FileOutputFormat implementations work, they can 
handle a null/undefined output dir property, *but not an empty one*.

```java
public static Path getOutputPath(JobContext job) {
   String name = job.getConfiguration().get(FileOutputFormat.OUTDIR);
return name == null ? null: new Path(name);
```  

Which implies that `saveAsNewHadoopFile("")` might want to unset the config 
option too, so offloading the problem of what happens on an empty path to the 
committer. Though I'd recommend checking to see what meaningful exceptions 
actually get raised in this situation when the committer is the normal 
FileOutputFormat/FileOutputCommitter setup


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19497: [SPARK-21549][CORE] Respect OutputFormats with no/invali...

2017-10-16 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/19497
  
Thank you @mridulm. I regret that I raised this here, causing confusion. 
Let's talk more in another place. I will cc you (and @jiangxb1987) when I 
happened to file up a JIRA or see similar issue related with this.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19497: [SPARK-21549][CORE] Respect OutputFormats with no/invali...

2017-10-16 Thread mridulm
Github user mridulm commented on the issue:

https://github.com/apache/spark/pull/19497
  
@HyukjinKwon Thanks for clarifying.

The way I look at it is:
`saveAsHadoopFile` is explicitly referring to `Output the RDD to any 
Hadoop-supported file system` in its description (and name) - and so valid 
`Path` is a reasonable requirement.

Additionally, in createPathFromString for `path == null` we are explicitly 
throwing `IllegalArgumentException` (`new Path` will do the same now, but I 
think this changed in past where it used to result in NPE ?).
The subsequent `val outputPath = new Path(path)` will do that for other 
invalid input paths as well.

In contrast `saveAsHadoopDataset` is not related to file system but `Output 
the RDD to any Hadoop-supported storage system` : where output being a valid 
`Path` is not a requirement.

Having said that, we can always iterate in a jira if you feel there is some 
confusion - it is always better to be explicitly clear about the interfaces we 
expose and support !
Thanks.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19497: [SPARK-21549][CORE] Respect OutputFormats with no/invali...

2017-10-16 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/19497
  
I agree this was focussed more on the regression introduced and it should 
be good enough already, and I am talking about a different thing for behaviour 
change.

Let me organise my idea and and try to file a JIRA later. I think strictly 
this is separate anyway if I haven't missed something or simply I was wrong.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19497: [SPARK-21549][CORE] Respect OutputFormats with no/invali...

2017-10-16 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/19497
  
Currently, I meant `saveAsNewAPIHadoopFile` comparing to `saveAsHadoopFile`.

```
saveAsNewAPIHadoopFile[...]("") // succeeds
```

```
saveAsHadoopFile[...]("") // fails

Can not create a Path from an empty string
java.lang.IllegalArgumentException: Can not create a Path from an empty 
string
at org.apache.hadoop.fs.Path.checkPathArg(Path.java:127)
at org.apache.hadoop.fs.Path.(Path.java:135)
at 
org.apache.spark.internal.io.SparkHadoopWriterUtils$.createPathFromString(SparkHadoopWriterUtils.scala:54)
```

I wanted to talk about this. `saveAsHadoopFile` is being failed within 
Spark side. So, I suspect `saveAsNewAPIHadoopFile` should also fail fast in 
this way.

`saveAsHadoopFile` validates the path so I thought `saveAsNewAPIHadoopFile` 
should also validate.


https://github.com/apache/spark/blob/3f958a99921d149fb9fdf7ba7e78957afdad1405/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L1004-L1008


https://github.com/apache/spark/blob/3f958a99921d149fb9fdf7ba7e78957afdad1405/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L983-L987



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19497: [SPARK-21549][CORE] Respect OutputFormats with no/invali...

2017-10-16 Thread mridulm
Github user mridulm commented on the issue:

https://github.com/apache/spark/pull/19497
  
`saveAsNewAPIHadoopFile ` simply delegates to `saveAsNewAPIHadoopDataset` 
(with some options set), right ? The behavior would be similar ?

Do you mean `saveAsHadoopDataset` instead ?
I did not change behavior there - since the exception was getting raised 
from within hadoop code and not from our code (when we pass invalid values), 
and it is preserving behavior from earlier code.
I was focussed more on the regression introduced.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19497: [SPARK-21549][CORE] Respect OutputFormats with no/invali...

2017-10-16 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/19497
  
I support this PR itself of course. I have no problem with this.

I meant a separate (soft) question about `saveAsNewAPIHadoopFile` (not 
`saveAsNewAPIHadoopDataset`) to validate `path` parameter which we take in 
`saveAsNewAPIHadoopFile` explicitly.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19497: [SPARK-21549][CORE] Respect OutputFormats with no/invali...

2017-10-15 Thread mridulm
Github user mridulm commented on the issue:

https://github.com/apache/spark/pull/19497
  
To clarify, we can look at changing the behavior (if required) in future - 
but that should be an explicit design choice informed by hadoop committer 
design. Until then, we should look to interoperate.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19497: [SPARK-21549][CORE] Respect OutputFormats with no/invali...

2017-10-15 Thread mridulm
Github user mridulm commented on the issue:

https://github.com/apache/spark/pull/19497
  
@HyukjinKwon My intention was to preserve earlier behavior.
Particularly for non-path based committer's, the `path` variable and its 
use/processing is not relevant, it makes more sense to ignore that codepath 
entirely.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19497: [SPARK-21549][CORE] Respect OutputFormats with no/invali...

2017-10-15 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/19497
  
@mridulm, BTW, WDYT about disallowing:

```
.saveAsNewAPIHadoopFile[...]("")
.saveAsNewAPIHadoopFile[...]("::invalid:::")
```

within the APIs? If i tested this correctly, this PR also allows both cases 
but I think we should disallow as it requires `path` and overrides it and 
`saveAsHadoopFile` disallows it. Seems this is also allowed in branch-2.1 as I 
recall correctly but looks disallowing it sounds more making sense in any event.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19497: [SPARK-21549][CORE] Respect OutputFormats with no/invali...

2017-10-15 Thread mridulm
Github user mridulm commented on the issue:

https://github.com/apache/spark/pull/19497
  
Thanks for the reviews everyone !


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19497: [SPARK-21549][CORE] Respect OutputFormats with no/invali...

2017-10-15 Thread mridulm
Github user mridulm commented on the issue:

https://github.com/apache/spark/pull/19497
  
Thx for taking a deeper look @HyukjinKwon, much appreciated !
I will wait for @jiangxb1987 to also opine before committing - I want to 
make sure we are not adding incorrect behavior; given that this is a followup 
to an earlier PR (some excellent work by @szhem btw)


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19497: [SPARK-21549][CORE] Respect OutputFormats with no/invali...

2017-10-14 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/19497
  
@mridulm, I just checked thought the related changes and checked the tests 
pass on branch-2.1.

Seems this PR will actually also allow the cases below:

```scala
.saveAsNewAPIHadoopFile[...]("")
.saveAsNewAPIHadoopFile[...]("::invalid:::")
```

Currently both are failed but seems this PR allows those cases:

```
Can not create a Path from an empty string
java.lang.IllegalArgumentException: Can not create a Path from an empty 
string
at org.apache.hadoop.fs.Path.checkPathArg(Path.java:127)
at org.apache.hadoop.fs.Path.(Path.java:135)
at org.apache.hadoop.fs.Path.(Path.java:89)
at 
org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.absPathStagingDir(HadoopMapReduceCommitProtocol.scala:61)
...
```

```
java.net.URISyntaxException: Relative path in absolute URI: ::invalid:::
java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative 
path in absolute URI: ::invalid:::
at org.apache.hadoop.fs.Path.initialize(Path.java:206)
at org.apache.hadoop.fs.Path.(Path.java:172)
at org.apache.hadoop.fs.Path.(Path.java:89)
at 
org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.absPathStagingDir(HadoopMapReduceCommitProtocol.scala:61)
...
```

I think we should protect these cases as this.

For the cases for old one:

```scala
.saveAsHadoopFile[...]("")
.saveAsHadoopFile[...]("::invalid:::")
```

these looks failed fast (whether it was initially intended or not) and I 
guess this PR does not affect these:

```
Can not create a Path from an empty string
java.lang.IllegalArgumentException: Can not create a Path from an empty 
string
at org.apache.hadoop.fs.Path.checkPathArg(Path.java:127)
at org.apache.hadoop.fs.Path.(Path.java:135)
at 
org.apache.spark.internal.io.SparkHadoopWriterUtils$.createPathFromString(SparkHadoopWriterUtils.scala:54)
```

```
java.net.URISyntaxException: Relative path in absolute URI: ::invalid:::
java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative 
path in absolute URI: ::invalid:::
at org.apache.hadoop.fs.Path.initialize(Path.java:206)
at org.apache.hadoop.fs.Path.(Path.java:172)
at 
org.apache.spark.internal.io.SparkHadoopWriterUtils$.createPathFromString(SparkHadoopWriterUtils.scala:54)
```


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19497: [SPARK-21549][CORE] Respect OutputFormats with no/invali...

2017-10-14 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19497
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19497: [SPARK-21549][CORE] Respect OutputFormats with no/invali...

2017-10-14 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19497
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82754/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19497: [SPARK-21549][CORE] Respect OutputFormats with no/invali...

2017-10-14 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19497
  
**[Test build #82754 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82754/testReport)**
 for PR 19497 at commit 
[`a319df3`](https://github.com/apache/spark/commit/a319df36db5bd202a14b44a09e9d1887f1633aec).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19497: [SPARK-21549][CORE] Respect OutputFormats with no/invali...

2017-10-14 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/19497
  
Let me take a look with few tests and be back. Also I think I should cc 
@jiangxb1987 too.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19497: [SPARK-21549][CORE] Respect OutputFormats with no/invali...

2017-10-14 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19497
  
**[Test build #82754 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82754/testReport)**
 for PR 19497 at commit 
[`a319df3`](https://github.com/apache/spark/commit/a319df36db5bd202a14b44a09e9d1887f1633aec).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19497: [SPARK-21549][CORE] Respect OutputFormats with no/invali...

2017-10-14 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/19497
  
retest this please


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19497: [SPARK-21549][CORE] Respect OutputFormats with no/invali...

2017-10-14 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19497
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19497: [SPARK-21549][CORE] Respect OutputFormats with no/invali...

2017-10-14 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19497
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82753/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19497: [SPARK-21549][CORE] Respect OutputFormats with no/invali...

2017-10-14 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19497
  
**[Test build #82753 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82753/testReport)**
 for PR 19497 at commit 
[`a319df3`](https://github.com/apache/spark/commit/a319df36db5bd202a14b44a09e9d1887f1633aec).
 * This patch **fails due to an unknown error code, -9**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19497: [SPARK-21549][CORE] Respect OutputFormats with no/invali...

2017-10-14 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19497
  
**[Test build #82753 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82753/testReport)**
 for PR 19497 at commit 
[`a319df3`](https://github.com/apache/spark/commit/a319df36db5bd202a14b44a09e9d1887f1633aec).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19497: [SPARK-21549][CORE] Respect OutputFormats with no/invali...

2017-10-14 Thread mridulm
Github user mridulm commented on the issue:

https://github.com/apache/spark/pull/19497
  
+CC @HyukjinKwon, @steveloughran 

Sorry for messing up PR #19487
The only change in this PR is to use `::invalid::` instead of `test:` in 
the test to address @steveloughran's comment.

Thanks.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org