[GitHub] [spark] zsxwing commented on issue #26590: [SPARK-29953][SS] Don't clean up source files for FileStreamSource if the files belong to the output of FileStreamSink
zsxwing commented on issue #26590: [SPARK-29953][SS] Don't clean up source files for FileStreamSource if the files belong to the output of FileStreamSink URL: https://github.com/apache/spark/pull/26590#issuecomment-562441349 Thanks! Merging to master. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zsxwing commented on issue #26590: [SPARK-29953][SS] Don't clean up source files for FileStreamSource if the files belong to the output of FileStreamSink
zsxwing commented on issue #26590: [SPARK-29953][SS] Don't clean up source files for FileStreamSource if the files belong to the output of FileStreamSink URL: https://github.com/apache/spark/pull/26590#issuecomment-562346871 LGTM. retest this please. Triggering another test since the last run was 3 days ago.
[GitHub] [spark] zsxwing commented on issue #26590: [SPARK-29953][SS] Don't clean up source files for FileStreamSource if the files belong to the output of FileStreamSink
zsxwing commented on issue #26590: [SPARK-29953][SS] Don't clean up source files for FileStreamSource if the files belong to the output of FileStreamSink URL: https://github.com/apache/spark/pull/26590#issuecomment-557216576

> Checking all the files in all the directories in each micro-batch is definitely an overkill.

+1. I think the fundamental issue is that the FileIndex interface doesn't work for complicated cases. There are multiple issues here. Another example: if a user is using a glob path in `FileStreamSource`, we always go to `InMemoryFileIndex`, even if some of the matched paths were created by `FileStreamSink`. `InMemoryFileIndex` knows nothing about `MetadataLogFileIndex` and uses its own logic to list files.

Ideally, the defensive code should be added to the file listing itself if we would like to prevent such cases, because that would also prevent reading incorrect files. However, I think that's a pretty large change and probably not worth it (I have not yet figured out how to make Hadoop's glob pattern code understand `MetadataLogFileIndex`; it may be impossible).

Hence I suggest we just block the `cleanSource` option when listing files using `MetadataLogFileIndex`.
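
The glob-path gap described in this comment can be illustrated with a small toy model. This is only a sketch and not Spark code: `choose_index`, `clean_source_blocked`, and `glob_reaches_sink_output` are hypothetical helpers, the index names are plain strings standing in for the Spark classes, and the set of sink output directories is supplied by hand rather than discovered from a file system.

```python
from fnmatch import fnmatchcase

# Characters that mark a path as a glob pattern in this toy model (assumption).
GLOB_CHARS = set("*?[{")


def is_glob(path: str) -> bool:
    """Return True if the path contains any glob metacharacter."""
    return any(ch in GLOB_CHARS for ch in path)


def choose_index(path: str, sink_output_dirs: set) -> str:
    """Pick the index a toy source would use for `path`.

    Mirrors the behaviour described in the comment: a glob path always
    falls back to the in-memory index, even if it matches sink output.
    """
    if not is_glob(path) and path in sink_output_dirs:
        return "MetadataLogFileIndex"
    return "InMemoryFileIndex"


def clean_source_blocked(path: str, sink_output_dirs: set) -> bool:
    """The suggested rule: block clean-up only when the metadata-log index is used."""
    return choose_index(path, sink_output_dirs) == "MetadataLogFileIndex"


def glob_reaches_sink_output(path: str, sink_output_dirs: set) -> bool:
    """True if a glob path matches a sink output directory anyway."""
    return is_glob(path) and any(fnmatchcase(d, path) for d in sink_output_dirs)


if __name__ == "__main__":
    sink_dirs = {"/data/out"}  # hypothetical directory written by a sink
    for p in ("/data/out", "/data/*"):
        print(p, choose_index(p, sink_dirs),
              clean_source_blocked(p, sink_dirs),
              glob_reaches_sink_output(p, sink_dirs))
```

In this model the plain path is caught by the block, while the glob path still reaches the sink's output through the in-memory index. That is the case the comment considers too large a change to defend against.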
[GitHub] [spark] zsxwing commented on issue #26590: [SPARK-29953][SS] Don't clean up source files for FileStreamSource if the files belong to the output of FileStreamSink
zsxwing commented on issue #26590: [SPARK-29953][SS] Don't clean up source files for FileStreamSource if the files belong to the output of FileStreamSink URL: https://github.com/apache/spark/pull/26590#issuecomment-556950146 @HeartSaVioR I think we can simply detect whether we are using `MetadataLogFileIndex` here: https://github.com/apache/spark/blob/ba2bc4b0e0eea0c1b6732a18cb20e61e4f693156/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala#L205 We don't need such a complicated check: in the cases you are checking, we won't go through `MetadataLogFileIndex`, so the result is not correct anyway and the user should not use such a path.
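
A minimal sketch of the simple check proposed here, written as a standalone Python model rather than as the actual Scala change. The two dataclasses are empty stand-ins named after the Spark classes, and `validate_clean_source` and `CleanSourceNotSupported` are hypothetical names. Treating `"off"` as the disabled value of `cleanSource` is an assumption of the sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InMemoryFileIndex:
    """Stand-in for an index that lists files directly."""
    root: str


@dataclass(frozen=True)
class MetadataLogFileIndex:
    """Stand-in for an index that reads the sink's metadata log."""
    root: str


class CleanSourceNotSupported(Exception):
    """Raised when clean-up is requested on sink output (hypothetical)."""


def validate_clean_source(index, clean_source: str = "off") -> None:
    """Reject clean-up when files are listed through the metadata-log index.

    Assumption: any value other than "off" means clean-up is enabled.
    """
    if clean_source != "off" and isinstance(index, MetadataLogFileIndex):
        raise CleanSourceNotSupported(
            f"cleanSource={clean_source!r} is not allowed for {index.root}: "
            "the files belong to the output of a file sink"
        )


if __name__ == "__main__":
    # Allowed: ordinary directory listed in memory.
    validate_clean_source(InMemoryFileIndex("/data/in"), "delete")
    # Rejected: directory read through the sink's metadata log.
    try:
        validate_clean_source(MetadataLogFileIndex("/data/out"), "delete")
    except CleanSourceNotSupported as err:
        print("blocked:", err)
```

The check depends only on which index type is in use, which is what keeps it simple: paths that never go through the metadata-log index are not examined at all.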