[ https://issues.apache.org/jira/browse/SPARK-29217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16936386#comment-16936386 ]
Jungtaek Lim commented on SPARK-29217:
--------------------------------------

The metadata is leveraged to provide end-to-end exactly-once semantics, and this is the expected behavior for a Spark batch or streaming query reading from a Spark data source sink's output. Unfortunately, there is no official way to remove files from the output. Workarounds are described in the comments on SPARK-24295, though they require the user to manually modify the metadata, which is error-prone.

I proposed SPARK-27188 to deal with this issue in general, but even the proposed approach relies on retention, since Spark cannot know which files or directories end users have deleted. (Technically, Spark could check each file's status and remove deleted files from the metadata, but this cannot be done immediately, so there would still be a window in which the error can occur. And the cost of checking every output file written so far is too high to be worthwhile.)

> How to read streaming output path by ignoring metadata log files
> -----------------------------------------------------------------
>
>                 Key: SPARK-29217
>                 URL: https://issues.apache.org/jira/browse/SPARK-29217
>             Project: Spark
>          Issue Type: Question
>          Components: Spark Core
>    Affects Versions: 2.4.3
>            Reporter: Thanida
>            Priority: Minor
>
> Since the output path of a Spark streaming query contains a `_spark_metadata` directory, reading it with
> {code:java}
> spark.read.format("parquet").load(filepath)
> {code}
> always depends on the file listing in the metadata log.
> Moving some files out of the output directory while the stream is running causes the read to fail.
> So, how can the data in the streaming output path be read while ignoring the metadata log files?
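For illustration, here is a minimal sketch of one way to read the output while bypassing the metadata log: load the part files through a glob pattern instead of the sink's root directory, so that Spark does not detect a `_spark_metadata` directory under the load path. The output path and file pattern below are assumptions, not values from this issue, and reading this way forfeits the exactly-once guarantee the metadata log provides.

{code:scala}
// A minimal sketch, assuming a hypothetical sink output path /data/stream-output
// whose data files follow Spark's default part-* naming.
// Because the load path is a glob, Spark looks for
// /data/stream-output/part-*/_spark_metadata, finds nothing, and falls back to
// a plain file listing instead of the streaming sink's metadata log.
val df = spark.read
  .format("parquet")
  .load("/data/stream-output/part-*")

// Caveat: bypassing the metadata log also bypasses the exactly-once guarantee,
// so uncommitted or partially written files may appear in the result.
df.show()
{code}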