[jira] [Commented] (SPARK-29217) How to read streaming output path by ignoring metadata log files

2019-09-23 Thread Jungtaek Lim (Jira)


[ https://issues.apache.org/jira/browse/SPARK-29217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16936388#comment-16936388 ]

Jungtaek Lim commented on SPARK-29217:
--

Btw, I'd ideally encourage asking questions on either the user mailing list or 
Stack Overflow, as it makes it easier to search for answers to the same question. 
(As a side effect, it may also give some credit to whoever answers the question.)

> How to read streaming output path by ignoring metadata log files
> 
>
> Key: SPARK-29217
> URL: https://issues.apache.org/jira/browse/SPARK-29217
> Project: Spark
>  Issue Type: Question
>  Components: Spark Core
>Affects Versions: 2.4.3
>Reporter: Thanida
>Priority: Minor
>
> As the output path of a Spark streaming query contains a `_spark_metadata` 
> directory, reading it with
> {code:java}
> spark.read.format("parquet").load(filepath)
> {code}
> always depends on the file listing in the metadata log.
> Moving some files in the output directory while streaming caused reading the data to fail. 
> So, how can we read the data in the streaming output path while ignoring the metadata log 
> files?





[jira] [Commented] (SPARK-29217) How to read streaming output path by ignoring metadata log files

2019-09-23 Thread Jungtaek Lim (Jira)


[ https://issues.apache.org/jira/browse/SPARK-29217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16936386#comment-16936386 ]

Jungtaek Lim commented on SPARK-29217:
--

The metadata is leveraged to provide end-to-end exactly-once semantics, and this is 
the expected behavior for a Spark batch/streaming query when reading the output of a 
Spark data source sink.

Unfortunately, there's no official way to remove files from the output. Workarounds 
are described in the comments on SPARK-24295, though they require nasty modification 
of the metadata on the user's side.

I proposed SPARK-27188 to deal with the issue more generally, but even the proposed 
approach relies on retention, as Spark cannot know which files/directories end users 
deleted. (Technically, Spark could check the files' status and drop deleted files 
from the metadata, but that cannot be done immediately, so you would still have a 
chance to hit the error. And the cost of checking all output files written so far is 
too high, which we may not want to pay.)
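
A commonly mentioned workaround, sketched here under the assumption that the layout 
matches the example in this ticket (it is not an official API, and it is not the 
workaround from SPARK-24295): load the partition directories through a glob instead 
of the sink root. As far as I can tell, the metadata log is only consulted when 
`_spark_metadata` sits directly under the load path, so a glob falls back to a plain 
file listing. The trade-off is losing the exactly-once guarantee, since files from 
in-progress or failed batches may be picked up.
{code:scala}
// Sketch of the glob workaround; "path/destination" and "dt" reuse the example
// from this ticket. Loading the partition directories bypasses the metadata log,
// at the cost of the exactly-once guarantee.
val df = spark.read
  .format("parquet")
  .option("basePath", "path/destination") // keep "dt" as a partition column
  .load("path/destination/dt=*")
df.show()
{code}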




[jira] [Commented] (SPARK-29217) How to read streaming output path by ignoring metadata log files

2019-09-23 Thread Thanida (Jira)


[ https://issues.apache.org/jira/browse/SPARK-29217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16936309#comment-16936309 ]

Thanida commented on SPARK-29217:
-

I use the Spark streaming writer:
{code:java}
df.writeStream
  .trigger(Trigger.ProcessingTime(3))
  .outputMode("append")
  .format("parquet")
  .option("path", "path/destination")
  .partitionBy("dt")
  .start();
{code}
I got data in the output like this:
{code:java}
- _spark_metadata/..
- dt=20190923/part-0-parquet
- dt=20190923/part-1-parquet
- dt=20190923/part-2-parquet
- dt=20190924/part-0-parquet{code}
Then I deleted one partition:
{code:java}
dt=20190923{code}
After that, I read the data with
{code:java}
spark.read.format("parquet").load("path/destination"){code}
and got a java.io.FileNotFoundException:
{code:java}
java.io.FileNotFoundException: File 
file:path/destination/dt=20190923/part-0-parquet{code}
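
For reference, an illustrative sketch (not part of the original report; it reuses the 
paths from the example above) of why the read fails: the batch manifests under 
`_spark_metadata` are plain text, a `v1` header followed by one JSON entry per 
committed file, and they still list the files of the deleted partition.
{code:scala}
// Illustrative sketch: list metadata-log entries that still point at the
// deleted dt=20190923 partition. The glob resolves to the individual
// manifest files inside _spark_metadata.
spark.read.text("path/destination/_spark_metadata/*")
  .filter("value LIKE '%dt=20190923%'")
  .show(false)
{code}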
 




[jira] [Commented] (SPARK-29217) How to read streaming output path by ignoring metadata log files

2019-09-23 Thread holdenk (Jira)


[ https://issues.apache.org/jira/browse/SPARK-29217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16936186#comment-16936186 ]

holdenk commented on SPARK-29217:
-

Can you clarify what you mean by "Moving some files in the output while 
streaming"?
