[GitHub] spark issue #19404: [SPARK-21760] [Streaming] Fix for Structured streaming t...
Github user steveloughran commented on the issue: https://github.com/apache/spark/pull/19404

I think the sync is important, but you just need to handle the case of "fs doesn't support it". Thinking about this a bit more, I didn't like my proposed patch. Better to:

* probe for the feature after open(), by checking that the stream implements Syncable and then calling `hflush()`. It's the lower-cost call, and if you implement one, you have to implement the other.
* if hflush fails, don't use sync, so set `syncable: Optional` to None
* when checkpointing, go `syncable.map(_.hsync())`, which is the core of your current patch

You will take a perf hit on the sync, as on HDFS it won't return until the data has been written down the entire replication chain. But after that, you've got a guarantee of durability, which is what checkpoints tend to expect...

(side topic: some JIRAs on Flink checkpointing to other stores, especially [FLINK-9061](https://issues.apache.org/jira/browse/FLINK-9061))

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
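The probe-then-sync steps above can be sketched roughly as follows. This is a minimal, self-contained illustration, not the patch itself: `SyncableLike`, `SyncingStream`, `NonSyncingStream`, `SyncProbe`, and `CheckpointWriter` are hypothetical names, and `SyncableLike` is a local stand-in for Hadoop's `org.apache.hadoop.fs.Syncable` so that the example runs without Hadoop on the classpath. It uses `foreach` rather than `map` because the result of `hsync()` is discarded.

```scala
import java.io.{ByteArrayOutputStream, OutputStream}
import scala.util.control.NonFatal

// Hypothetical stand-in for org.apache.hadoop.fs.Syncable (assumption: same two methods).
trait SyncableLike {
  def hflush(): Unit
  def hsync(): Unit
}

// Test double: a stream that really supports sync and counts hsync() calls.
class SyncingStream extends ByteArrayOutputStream with SyncableLike {
  var hsyncCalls = 0
  override def hflush(): Unit = ()
  override def hsync(): Unit = hsyncCalls += 1
}

// Test double: declares the interface but rejects the calls, like a
// wrapper stream relaying to a store without hflush/hsync support.
class NonSyncingStream extends ByteArrayOutputStream with SyncableLike {
  override def hflush(): Unit = throw new UnsupportedOperationException("hflush")
  override def hsync(): Unit = throw new UnsupportedOperationException("hsync")
}

object SyncProbe {
  // Probe once, right after open(): interface check first, then one hflush().
  // Any non-fatal failure means "never try to sync on this stream".
  def probe(out: OutputStream): Option[SyncableLike] = out match {
    case s: SyncableLike =>
      try { s.hflush(); Some(s) }
      catch { case NonFatal(_) => None }
    case _ => None
  }
}

class CheckpointWriter(out: OutputStream) {
  private val syncable: Option[SyncableLike] = SyncProbe.probe(out)

  def canSync: Boolean = syncable.isDefined

  // Write the checkpoint bytes, then hsync() only if the probe succeeded.
  def checkpoint(bytes: Array[Byte]): Unit = {
    out.write(bytes)
    out.flush()
    syncable.foreach(_.hsync())
  }
}
```

With this shape, a store that cannot sync costs one failed `hflush()` at open time and nothing afterwards, while a store that can sync pays the `hsync()` latency on every checkpoint.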
Github user rekhajoshm commented on the issue: https://github.com/apache/spark/pull/19404 Thanks for the good inputs. Closing this PR.
Github user steveloughran commented on the issue: https://github.com/apache/spark/pull/19404 BTW, perf-wise: `hflush()` is required to block until the data has reached the store and is visible to others; `hsync()` additionally blocks until the data has actually been saved to durable storage. So it will take time, but if you want durability, that's the price. Without that hsync call, though, there's no guarantee anything will be written to the store. If you need this log to recover from failures, hsync is what you are going to need, even though it currently only works on HDFS and WASB.
Github user steveloughran commented on the issue: https://github.com/apache/spark/pull/19404

The problem here is that a stream which doesn't implement hflush/hsync is required to throw an exception; it's a way of guaranteeing that if hsync/hflush does complete, the action has done what you want. HBase &c utterly depend on this. The fact that FSDataOutputStream implements Syncable, and yet the streams it relays to may not, is the whole reason for [HDFS-11644](https://issues.apache.org/jira/browse/HDFS-11644) and the `StreamCapabilities` method. As with Erasure Coding, even HDFS streams may not support hflush/hsync.

This patch is at risk of raising an exception whenever it tries to call `hflush()` on a non-HDFS store, or on HDFS with Erasure Coding enabled.

If you were targeting Hadoop 2.9+, you could just check `hasCapability("hsync")` and use it if present. For Hadoop 2.6+ you'll have to call `out.hflush()` on the first attempt and, if any exception (IOE, UnsupportedOperationException, RTE) is raised, catch it, swallow it, and never try to hflush again.

Sorry, it's messy: it's why I'd like that `hasCapability()` probe up for all features which are only intermittently available. It can complicate caller code if you want to know these things, but it stops you getting caught out when you really want to know the durability semantics of the FS.

See also the WiP [OutputStream](https://github.com/steveloughran/hadoop/blob/s3/HADOOP-13327-outputstream-trunk/hadoop-common-project/hadoop-common/src/site/markdown/filesystem/outputstream.md) document.

(thanks for mentioning me BTW; this is one of those things that would probably work well in local tests but blow up in production somewhere)
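The two strategies described above (ask the stream when it can answer, otherwise try once and remember a failure) could be combined along these lines. This is a hedged sketch only: `SyncableStub` and `CapabilitiesStub` are local, hypothetical stand-ins for Hadoop's `Syncable` and `StreamCapabilities` interfaces, and `FlushGuard` is an illustrative name, not something in Spark or Hadoop. The sketch guards `hflush()` and therefore probes the `"hflush"` capability; a guard for `hsync()` would probe `"hsync"`, the string named in the comment, in the same way.

```scala
import scala.util.control.NonFatal

// Hypothetical stand-ins so the sketch runs without Hadoop on the classpath.
trait SyncableStub { def hflush(): Unit }
trait CapabilitiesStub { def hasCapability(capability: String): Boolean }

// Calls hflush() only while the stream is believed to support it.
final class FlushGuard(stream: SyncableStub) {
  // Hadoop 2.9+ style: ask up front when the stream can answer.
  // Otherwise (2.6-style stream) assume support until the first call proves otherwise.
  private var enabled: Boolean = stream match {
    case c: CapabilitiesStub => c.hasCapability("hflush")
    case _ => true
  }

  def isEnabled: Boolean = enabled

  // Returns true if hflush() completed. On any non-fatal exception the
  // failure is swallowed and hflush() is never attempted again.
  def flush(): Boolean =
    if (!enabled) false
    else
      try { stream.hflush(); true }
      catch {
        case NonFatal(_) =>
          enabled = false
          false
      }
}
```

A `false` result tells the caller that no visibility or durability guarantee was obtained for that write, which is the information the comment argues callers need.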
Github user srowen commented on the issue: https://github.com/apache/spark/pull/19404 @steveloughran what do you think of this? flushing sounds safe but is there a performance impact here if done on every `serialize`?
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19404 Merged build finished. Test PASSed.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19404 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82439/ Test PASSed.
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19404 **[Test build #82439 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82439/testReport)** for PR 19404 at commit [`a2d5bc7`](https://github.com/apache/spark/commit/a2d5bc706987accaeb6c9516e7d4a07b5bb3104f).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19404 **[Test build #82439 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82439/testReport)** for PR 19404 at commit [`a2d5bc7`](https://github.com/apache/spark/commit/a2d5bc706987accaeb6c9516e7d4a07b5bb3104f).
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19404 Merged build finished. Test FAILed.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19404 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82434/ Test FAILed.
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19404 **[Test build #82434 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82434/testReport)** for PR 19404 at commit [`f945f39`](https://github.com/apache/spark/commit/f945f39438c6a1cabff3f18489dd2082c4cba24c).

* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19404 Merged build finished. Test FAILed.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19404 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82433/ Test FAILed.
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19404 **[Test build #82433 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82433/testReport)** for PR 19404 at commit [`9cd3ee6`](https://github.com/apache/spark/commit/9cd3ee69b9ddd678b77d8b78feec51ea8c55e377).

* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19404 **[Test build #82434 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82434/testReport)** for PR 19404 at commit [`f945f39`](https://github.com/apache/spark/commit/f945f39438c6a1cabff3f18489dd2082c4cba24c).
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19404 **[Test build #82433 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82433/testReport)** for PR 19404 at commit [`9cd3ee6`](https://github.com/apache/spark/commit/9cd3ee69b9ddd678b77d8b78feec51ea8c55e377).
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19404 **[Test build #3940 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3940/testReport)** for PR 19404 at commit [`89cdb3b`](https://github.com/apache/spark/commit/89cdb3bdf70ec39a09b4e598935bb20a8f64f0cb).

* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19404 **[Test build #3940 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3940/testReport)** for PR 19404 at commit [`89cdb3b`](https://github.com/apache/spark/commit/89cdb3bdf70ec39a09b4e598935bb20a8f64f0cb).
Github user rekhajoshm commented on the issue: https://github.com/apache/spark/pull/19404 Seems to be an Apache Spark git/Jenkins issue. Please retest after a while. Thanks.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19404 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82359/ Test FAILed.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19404 Merged build finished. Test FAILed.