[GitHub] spark pull request #15027: [SPARK-17475] [STREAMING] Delete CRC files if the...

2016-11-02 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/15027


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15027: [SPARK-17475] [STREAMING] Delete CRC files if the...

2016-09-13 Thread jodersky
Github user jodersky commented on a diff in the pull request:

https://github.com/apache/spark/pull/15027#discussion_r78643449
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLog.scala
 ---
@@ -146,6 +146,11 @@ class HDFSMetadataLog[T: ClassTag](sparkSession: 
SparkSession, path: String)
   // It will fail if there is an existing file (someone has 
committed the batch)
   logDebug(s"Attempting to write log #${batchIdToPath(batchId)}")
   fileManager.rename(tempPath, batchIdToPath(batchId))
+
+  // SPARK-17475: HDFSMetadataLog should not leak CRC files
+  // If the underlying filesystem didn't rename the CRC file, 
delete it.
--- End diff --

ok, looks good then


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15027: [SPARK-17475] [STREAMING] Delete CRC files if the...

2016-09-12 Thread frreiss
Github user frreiss commented on a diff in the pull request:

https://github.com/apache/spark/pull/15027#discussion_r78469982
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLog.scala
 ---
@@ -146,6 +146,11 @@ class HDFSMetadataLog[T: ClassTag](sparkSession: 
SparkSession, path: String)
   // It will fail if there is an existing file (someone has 
committed the batch)
   logDebug(s"Attempting to write log #${batchIdToPath(batchId)}")
   fileManager.rename(tempPath, batchIdToPath(batchId))
+
+  // SPARK-17475: HDFSMetadataLog should not leak CRC files
+  // If the underlying filesystem didn't rename the CRC file, 
delete it.
--- End diff --

I believe HDFSMetadataLog is only called from Structured Streaming classes 
currently.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15027: [SPARK-17475] [STREAMING] Delete CRC files if the...

2016-09-12 Thread jodersky
Github user jodersky commented on a diff in the pull request:

https://github.com/apache/spark/pull/15027#discussion_r78469460
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLog.scala
 ---
@@ -146,6 +146,11 @@ class HDFSMetadataLog[T: ClassTag](sparkSession: 
SparkSession, path: String)
   // It will fail if there is an existing file (someone has 
committed the batch)
   logDebug(s"Attempting to write log #${batchIdToPath(batchId)}")
   fileManager.rename(tempPath, batchIdToPath(batchId))
+
+  // SPARK-17475: HDFSMetadataLog should not leak CRC files
+  // If the underlying filesystem didn't rename the CRC file, 
delete it.
--- End diff --

Is this specific to streaming, or would other parts of spark benefit if 
this behavior were changed in the file manager?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15027: [SPARK-17475] [STREAMING] Delete CRC files if the...

2016-09-09 Thread frreiss
GitHub user frreiss opened a pull request:

https://github.com/apache/spark/pull/15027

[SPARK-17475] [STREAMING] Delete CRC files if the filesystem doesn't use 
checksum files

## What changes were proposed in this pull request?

When the metadata logs for various parts of Structured Streaming are stored 
on non-HDFS filesystems such as NFS or ext4, the HDFSMetadataLog class leaves 
hidden HDFS-style checksum (CRC) files in the log directory, one file per 
batch. This PR modifies HDFSMetadataLog so that it detects the use of a 
filesystem that doesn't use CRC files and removes the CRC files.

## How was this patch tested?

Modified an existing test case in HDFSMetadataLogSuite to check whether 
HDFSMetadataLog correctly removes CRC files on the local POSIX filesystem.  Ran 
the entire regression suite.



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/frreiss/spark-fred fred-17475

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/15027.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #15027


commit 3a2a9c116659f526189de6b8b98fb6c92024a7a6
Author: frreiss 
Date:   2016-09-09T17:30:08Z

Delete CRC files when the filesystem doesn't support checksums.

commit 9ff89c0228c09764fa6444528050a35e823db0e6
Author: frreiss 
Date:   2016-09-09T17:30:52Z

Merge branch 'master' of https://github.com/apache/spark into fred-17475




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org