[jira] [Commented] (SPARK-29217) How to read streaming output path by ignoring metadata log files
[ https://issues.apache.org/jira/browse/SPARK-29217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16936388#comment-16936388 ]

Jungtaek Lim commented on SPARK-29217:
--------------------------------------

Btw, ideally I'd encourage asking questions on either the users mailing list or Stack Overflow, as that makes it easier to search for answers to the same question. (As a side effect, it may also give some credit to whoever answers the question.)

> How to read streaming output path by ignoring metadata log files
>
>                 Key: SPARK-29217
>                 URL: https://issues.apache.org/jira/browse/SPARK-29217
>             Project: Spark
>          Issue Type: Question
>          Components: Spark Core
>    Affects Versions: 2.4.3
>            Reporter: Thanida
>            Priority: Minor
>
> As the output path of a Spark streaming query contains a `_spark_metadata` directory, reading it with
> {code:java}
> spark.read.format("parquet").load(filepath)
> {code}
> always depends on the file listing in the metadata log. Moving some files out of the output path while streaming caused reads to fail. So, how can we read data in the streaming output path while ignoring the metadata log files?

--
This message was sent by Atlassian Jira (v8.3.4#803005)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29217) How to read streaming output path by ignoring metadata log files
[ https://issues.apache.org/jira/browse/SPARK-29217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16936386#comment-16936386 ]

Jungtaek Lim commented on SPARK-29217:
--------------------------------------

The metadata log is leveraged to provide end-to-end exactly-once semantics, and this is the expected behavior for a Spark batch/streaming query reading the output of a Spark datasource sink. Unfortunately, there's no official way to remove files from the output. Workarounds are described in the comments on SPARK-24295, though they require nasty modification of the metadata on the user side.

I proposed SPARK-27188 to deal with the issue generally, but even the proposed approach relies on retention, as Spark cannot know which files/directories end users have deleted. (Technically, Spark could check the files' status and drop deleted files from the metadata, but that cannot be done immediately, so you would still have a chance to hit the error. And the cost of checking all output files written so far is too high for us to want to pay it.)
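The point above is that the sink's metadata log, not the filesystem, defines the set of valid files. As a minimal sketch of what "ignoring the metadata log" would mean (plain Python, a hypothetical helper on a local filesystem, not Spark's actual read path), one can enumerate the data files directly while pruning `_spark_metadata`, then hand the resulting paths or partition subdirectories to an ordinary parquet read:

```python
import os
import tempfile

def list_data_files(output_dir, metadata_dir="_spark_metadata"):
    """Walk a streaming sink's output directory and collect data files,
    skipping the metadata log directory entirely."""
    files = []
    for root, dirs, names in os.walk(output_dir):
        # Prune the metadata directory so the walk never descends into it.
        dirs[:] = [d for d in dirs if d != metadata_dir]
        for name in names:
            files.append(os.path.relpath(os.path.join(root, name), output_dir))
    return sorted(files)

# Build a layout like the one in this issue (empty placeholder files).
out = tempfile.mkdtemp()
for p in ["_spark_metadata/0",
          "dt=20190923/part-0.parquet",
          "dt=20190924/part-0.parquet"]:
    path = os.path.join(out, p)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    open(path, "w").close()

print(list_data_files(out))
# -> ['dt=20190923/part-0.parquet', 'dt=20190924/part-0.parquet']
```

Reading files gathered this way bypasses the exactly-once guarantees the log exists to provide, which is why it is a workaround rather than an official approach.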
[jira] [Commented] (SPARK-29217) How to read streaming output path by ignoring metadata log files
[ https://issues.apache.org/jira/browse/SPARK-29217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16936309#comment-16936309 ]

Thanida commented on SPARK-29217:
---------------------------------

I use the streaming writer
{code:java}
df.writeStream
  .trigger(Trigger.ProcessingTime(3))
  .outputMode("append")
  .format("parquet")
  .option("path", "path/destination")
  .partitionBy("dt")
  .start();
{code}
which produces output such as
{code:java}
- _spark_metadata/..
- dt=20190923/part-0-parquet
- dt=20190923/part-1-parquet
- dt=20190923/part-2-parquet
- dt=20190924/part-0-parquet
{code}
Then I delete one partition,
{code:java}
dt=20190923
{code}
After that, reading the data with
{code:java}
spark.read.format("parquet").load("path/destination")
{code}
fails with
{code:java}
java.io.FileNotFoundException: File file:path/destination/dt=20190923/part-0-parquet
{code}
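The failure mode in this reproduction can be sketched without Spark: the read plans its files from the metadata log, so any logged file deleted by hand becomes a dangling reference that surfaces as a FileNotFoundException. A minimal illustration (plain Python, hypothetical file names, not Spark's actual log format):

```python
import os
import tempfile

def stale_entries(logged_files, output_dir):
    """Return entries the metadata log still lists but that no longer
    exist on disk -- the files a post-deletion read would trip over."""
    return [f for f in logged_files
            if not os.path.exists(os.path.join(output_dir, f))]

out = tempfile.mkdtemp()
# The (hypothetical) contents of the sink's metadata log.
logged = ["dt=20190923/part-0.parquet", "dt=20190924/part-0.parquet"]
for p in logged:
    os.makedirs(os.path.join(out, os.path.dirname(p)), exist_ok=True)
    open(os.path.join(out, p), "w").close()

# Delete one partition by hand, as in the reproduction above.
os.remove(os.path.join(out, "dt=20190923/part-0.parquet"))
os.rmdir(os.path.join(out, "dt=20190923"))

print(stale_entries(logged, out))
# -> ['dt=20190923/part-0.parquet']
```

This also shows why reconciling the log against the filesystem is possible in principle but costly: it requires an existence check per logged file.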
[jira] [Commented] (SPARK-29217) How to read streaming output path by ignoring metadata log files
[ https://issues.apache.org/jira/browse/SPARK-29217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16936186#comment-16936186 ]

holdenk commented on SPARK-29217:
---------------------------------

Can you clarify what you mean by "Moving some files in the output while streaming"?