[GitHub] spark issue #19404: [SPARK-21760] [Streaming] Fix for Structured streaming t...

2018-04-26 Thread steveloughran
Github user steveloughran commented on the issue:

https://github.com/apache/spark/pull/19404
  
I think the sync is important, but you just need to handle the case of 
"fs doesn't support it".

Thinking about this a bit more, I didn't like my proposed patch. Better to 
have:

* probe for the feature after open(), by checking whether the stream 
implements Syncable and then calling `hflush()`. It's the lower cost call, and 
if you implement one, you have to implement the other.
* if hflush fails, don't use sync, so set `syncable: Optional` to 
None
* when checkpointing, go `syncable.map(_.hsync())`, which is the core of 
your current patch

You will take a perf hit on the sync, as on HDFS the call won't return 
until the data has been written down the entire replication chain. But after 
that, you've got a guarantee of durability, which is what checkpoints tend to 
expect...
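
The probe-then-sync steps above can be sketched roughly as follows. This is a minimal illustration in plain Java, not the patch itself: the nested `Syncable` interface is a stand-in that mirrors the `hflush()`/`hsync()` pair of Hadoop's `Syncable`, and `SyncProbeDemo`, `probe`, `checkpoint`, and the two stub streams are hypothetical names invented here so the example runs without Hadoop on the classpath.

```java
import java.io.IOException;
import java.util.Optional;

class SyncProbeDemo {
    /** Stand-in (assumption) mirroring the two methods of Hadoop's Syncable. */
    interface Syncable {
        void hflush() throws IOException;
        void hsync() throws IOException;
    }

    /** Stub stream whose sync calls succeed; counts hsync() calls. */
    static final class WorkingStream implements Syncable {
        int hsyncCalls = 0;
        public void hflush() { }
        public void hsync() { hsyncCalls++; }
    }

    /** Stub for a store that exposes Syncable but rejects the calls. */
    static final class UnsupportedStream implements Syncable {
        public void hflush() { throw new UnsupportedOperationException("hflush"); }
        public void hsync() { throw new UnsupportedOperationException("hsync"); }
    }

    /** Probe once after open(): keep the stream only if hflush() succeeds. */
    static Optional<Syncable> probe(Object out) {
        if (!(out instanceof Syncable)) {
            return Optional.empty();
        }
        Syncable s = (Syncable) out;
        try {
            s.hflush(); // the cheaper of the two calls
            return Optional.of(s);
        } catch (IOException | RuntimeException e) {
            // The store does not support it: remember that and never sync.
            return Optional.empty();
        }
    }

    /** At checkpoint time: hsync() only if the probe succeeded. */
    static boolean checkpoint(Optional<Syncable> syncable) throws IOException {
        if (syncable.isPresent()) {
            syncable.get().hsync(); // blocks until the data is durable
            return true;
        }
        return false;
    }

    public static void main(String[] args) throws IOException {
        System.out.println("working stream synced: "
            + checkpoint(probe(new WorkingStream())));
        System.out.println("unsupported stream synced: "
            + checkpoint(probe(new UnsupportedStream())));
    }
}
```

In real code the probe would run against the stream returned by `open()`/`create()`, and a failure of `hsync()` after a successful probe is allowed to propagate, since at that point it signals a genuine I/O problem rather than a missing feature.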

(side topic: some JIRAs on Flink checkpointing to other stores, especially 
[FLINK-9061](https://issues.apache.org/jira/browse/FLINK-9061))




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19404: [SPARK-21760] [Streaming] Fix for Structured streaming t...

2018-04-25 Thread rekhajoshm
Github user rekhajoshm commented on the issue:

https://github.com/apache/spark/pull/19404
  
Thanks for the good inputs. Closing this PR.


---




[GitHub] spark issue #19404: [SPARK-21760] [Streaming] Fix for Structured streaming t...

2018-04-23 Thread steveloughran
Github user steveloughran commented on the issue:

https://github.com/apache/spark/pull/19404
  
BTW, perf wise: hflush() is required to block until the flush has got to 
the store (visible to others), and hsync() until the data is actually saved to 
the durable store. So it will take time, but if you want durability, that's the 
price. Without that hsync call, though, there's no guarantee anything will be 
written to the store. If you need this log to recover from failures, hsync is 
what you are going to need, even though it currently only works on HDFS and 
WASB.


---




[GitHub] spark issue #19404: [SPARK-21760] [Streaming] Fix for Structured streaming t...

2018-04-23 Thread steveloughran
Github user steveloughran commented on the issue:

https://github.com/apache/spark/pull/19404
  
Problem here is that a stream which doesn't implement hflush/hsync is 
required to throw an exception; it's a way of guaranteeing that if hsync/hflush 
does complete, the action has done what you want - HBase &c utterly depend on 
this.

The fact that FSDataOutputStream implements Syncable, and yet the streams it 
relays to may not, is the whole reason for 
[HDFS-11644](https://issues.apache.org/jira/browse/HDFS-11644) and the 
`StreamCapabilities` method. With Erasure Coding, even HDFS streams may not 
support hflush/hsync.

This patch is at risk of raising an exception whenever it tries to call 
`hflush()` on a non-HDFS store or on HDFS with Erasure Coding enabled. If you 
were targeting Hadoop 2.9+ you could just check `hasCapability("hsync")` and 
use it if present. For Hadoop 2.6+ you'll have to call `out.hflush()` on the 
first attempt and, if any exception (IOE, UnsupportedOperationException, RTE) 
is raised, catch it, swallow it and never try to hflush again. 
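
A rough sketch of that two-path guard, under stated assumptions: the nested `Syncable` and `StreamCapabilities` interfaces are local stand-ins mirroring the Hadoop interfaces of the same names so the example is self-contained, and `FlushGuardDemo`, `GuardedFlusher`, and the stub streams are hypothetical names. When the stream can be asked about its capabilities the guard trusts the answer; otherwise it tries `hflush()` once and disables itself if that first attempt fails.

```java
import java.io.IOException;

class FlushGuardDemo {
    /** Stand-in (assumption) for Hadoop's Syncable. */
    interface Syncable {
        void hflush() throws IOException;
        void hsync() throws IOException;
    }

    /** Stand-in (assumption) for the Hadoop 2.9+ StreamCapabilities probe. */
    interface StreamCapabilities {
        boolean hasCapability(String capability);
    }

    /** Hypothetical wrapper that flushes only where the stream supports it. */
    static final class GuardedFlusher {
        private final Syncable out;
        private boolean enabled = true;
        private boolean probed = false;

        GuardedFlusher(Syncable out) {
            this.out = out;
            if (out instanceof StreamCapabilities) {
                // Capability path: ask up front, no trial call needed.
                enabled = ((StreamCapabilities) out).hasCapability("hsync");
                probed = true;
            }
        }

        /** Returns true if hflush() was actually performed. */
        boolean flush() throws IOException {
            if (!enabled) {
                return false;
            }
            if (probed) {
                out.hflush(); // known to be supported: let real failures surface
                return true;
            }
            probed = true;
            try {
                out.hflush(); // first attempt doubles as the probe
                return true;
            } catch (IOException | RuntimeException e) {
                enabled = false; // swallow and never try again
                return false;
            }
        }
    }

    /** Stub: no capability probe, and the sync calls are rejected. */
    static final class LegacyUnsupported implements Syncable {
        int hflushCalls = 0;
        public void hflush() {
            hflushCalls++;
            throw new UnsupportedOperationException("hflush");
        }
        public void hsync() { throw new UnsupportedOperationException("hsync"); }
    }

    /** Stub: declares the capability and supports the calls. */
    static final class CapableStream implements Syncable, StreamCapabilities {
        int hflushCalls = 0;
        public boolean hasCapability(String c) { return "hsync".equals(c); }
        public void hflush() { hflushCalls++; }
        public void hsync() { }
    }

    public static void main(String[] args) throws IOException {
        GuardedFlusher legacy = new GuardedFlusher(new LegacyUnsupported());
        System.out.println("legacy, first flush: " + legacy.flush());
        System.out.println("legacy, second flush: " + legacy.flush());
        System.out.println("capable flush: "
            + new GuardedFlusher(new CapableStream()).flush());
    }
}
```

One design choice to note: only the first, probing call swallows exceptions. Once a flush has succeeded, or the capability check has passed, later failures propagate, so a genuine I/O error is not mistaken for an unsupported feature.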

Sorry, it's messy: it's why I'd like that `hasCapability()` probe up for all 
features which are only intermittently available. It can complicate caller 
code if you want to know these things, but it stops you getting caught out when 
you really want to know the durability semantics of the FS.

see also WiP 
[OutputStream](https://github.com/steveloughran/hadoop/blob/s3/HADOOP-13327-outputstream-trunk/hadoop-common-project/hadoop-common/src/site/markdown/filesystem/outputstream.md)

(thanks for mentioning me BTW; this is one of those things that would 
probably work well in local tests but blow up in production somewhere)


---




[GitHub] spark issue #19404: [SPARK-21760] [Streaming] Fix for Structured streaming t...

2018-04-23 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/19404
  
@steveloughran what do you think of this? 

Flushing sounds safe, but is there a performance impact here if it's done on 
every `serialize`?


---




[GitHub] spark issue #19404: [SPARK-21760] [Streaming] Fix for Structured streaming t...

2017-10-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19404
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19404: [SPARK-21760] [Streaming] Fix for Structured streaming t...

2017-10-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19404
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82439/
Test PASSed.


---




[GitHub] spark issue #19404: [SPARK-21760] [Streaming] Fix for Structured streaming t...

2017-10-03 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19404
  
**[Test build #82439 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82439/testReport)**
 for PR 19404 at commit 
[`a2d5bc7`](https://github.com/apache/spark/commit/a2d5bc706987accaeb6c9516e7d4a07b5bb3104f).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19404: [SPARK-21760] [Streaming] Fix for Structured streaming t...

2017-10-03 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19404
  
**[Test build #82439 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82439/testReport)**
 for PR 19404 at commit 
[`a2d5bc7`](https://github.com/apache/spark/commit/a2d5bc706987accaeb6c9516e7d4a07b5bb3104f).


---




[GitHub] spark issue #19404: [SPARK-21760] [Streaming] Fix for Structured streaming t...

2017-10-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19404
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #19404: [SPARK-21760] [Streaming] Fix for Structured streaming t...

2017-10-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19404
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82434/
Test FAILed.


---




[GitHub] spark issue #19404: [SPARK-21760] [Streaming] Fix for Structured streaming t...

2017-10-03 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19404
  
**[Test build #82434 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82434/testReport)**
 for PR 19404 at commit 
[`f945f39`](https://github.com/apache/spark/commit/f945f39438c6a1cabff3f18489dd2082c4cba24c).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19404: [SPARK-21760] [Streaming] Fix for Structured streaming t...

2017-10-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19404
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #19404: [SPARK-21760] [Streaming] Fix for Structured streaming t...

2017-10-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19404
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82433/
Test FAILed.


---




[GitHub] spark issue #19404: [SPARK-21760] [Streaming] Fix for Structured streaming t...

2017-10-03 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19404
  
**[Test build #82433 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82433/testReport)**
 for PR 19404 at commit 
[`9cd3ee6`](https://github.com/apache/spark/commit/9cd3ee69b9ddd678b77d8b78feec51ea8c55e377).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19404: [SPARK-21760] [Streaming] Fix for Structured streaming t...

2017-10-03 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19404
  
**[Test build #82434 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82434/testReport)**
 for PR 19404 at commit 
[`f945f39`](https://github.com/apache/spark/commit/f945f39438c6a1cabff3f18489dd2082c4cba24c).


---




[GitHub] spark issue #19404: [SPARK-21760] [Streaming] Fix for Structured streaming t...

2017-10-03 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19404
  
**[Test build #82433 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82433/testReport)**
 for PR 19404 at commit 
[`9cd3ee6`](https://github.com/apache/spark/commit/9cd3ee69b9ddd678b77d8b78feec51ea8c55e377).


---




[GitHub] spark issue #19404: [SPARK-21760] [Streaming] Fix for Structured streaming t...

2017-10-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19404
  
**[Test build #3940 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3940/testReport)**
 for PR 19404 at commit 
[`89cdb3b`](https://github.com/apache/spark/commit/89cdb3bdf70ec39a09b4e598935bb20a8f64f0cb).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19404: [SPARK-21760] [Streaming] Fix for Structured streaming t...

2017-10-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19404
  
**[Test build #3940 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3940/testReport)**
 for PR 19404 at commit 
[`89cdb3b`](https://github.com/apache/spark/commit/89cdb3bdf70ec39a09b4e598935bb20a8f64f0cb).


---




[GitHub] spark issue #19404: [SPARK-21760] [Streaming] Fix for Structured streaming t...

2017-09-30 Thread rekhajoshm
Github user rekhajoshm commented on the issue:

https://github.com/apache/spark/pull/19404
  
This seems to be an Apache Spark git/Jenkins issue. Please retest after a 
while. Thanks.


---




[GitHub] spark issue #19404: [SPARK-21760] [Streaming] Fix for Structured streaming t...

2017-09-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19404
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82359/
Test FAILed.


---




[GitHub] spark issue #19404: [SPARK-21760] [Streaming] Fix for Structured streaming t...

2017-09-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19404
  
Merged build finished. Test FAILed.


---
