[jira] [Commented] (SPARK-29995) Structured Streaming file-sink log grow indefinitely

2020-02-25 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17045009#comment-17045009
 ] 

Jungtaek Lim commented on SPARK-29995:
--

[~zhangliming]

Hi, if you're open to trying something out in your environment, could you please 
try SPARK-30946 and see how much it helps? You will need to back up your 
checkpoint and the "_spark_metadata" directory in the output directory first, 
because SPARK-30946 converts them to a V2 format that is still only a proposal 
(there is no guarantee that it will be accepted, or when).
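
As a rough illustration of the backup step above, here is a minimal sketch. It assumes the checkpoint and output directories live on a local (or locally mounted) filesystem; for HDFS or an object store you would need that system's own copy tooling instead. The function name backup_dir and the timestamped naming scheme are hypothetical, not part of Spark or SPARK-30946. Stop the streaming query before copying so the directories do not change mid-copy.

```python
import shutil
import time
from pathlib import Path


def backup_dir(src: str, backup_root: str) -> Path:
    """Recursively copy `src` into `backup_root` under a timestamped name.

    Hypothetical helper: assumes `src` is a directory on a local filesystem.
    """
    src_path = Path(src)
    if not src_path.is_dir():
        raise FileNotFoundError(f"not a directory: {src}")
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest = Path(backup_root) / f"{src_path.name}-{stamp}"
    # copytree creates `dest` (and missing parents) and fails if it already exists,
    # so an earlier backup is never silently overwritten.
    shutil.copytree(src_path, dest)
    return dest
```

You would call it once for the checkpoint directory and once for the "_spark_metadata" directory under the output path, and restore by copying the backups back if the V2 conversion needs to be undone.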

If you'd rather not try it out but are willing to share your metadata 
files, please upload them somewhere and let me know. The latest compact file 
alone would be OK, but it would be better if you could provide the files for one 
compact interval (9.compact to XXX(X+1)8, 9 files). If you would like to do it 
privately, please contact me via mail, kabhwan-opensource AT gmail.com
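
To collect the files for one compact interval, something like the following sketch could help. It assumes the naming implied above: each file in "_spark_metadata" is named after its batch ID, and compacted batches carry a ".compact" suffix. The function name compact_interval_files is hypothetical. It picks the second-latest compact file and everything up to (but not including) the latest one; if only one compact file is present, it returns that file and whatever follows it.

```python
import re
from pathlib import Path
from typing import List

# Matches "18" or "19.compact"; hidden files such as ".19.compact.crc" are skipped.
_BATCH_FILE = re.compile(r"^(\d+)(\.compact)?$")


def compact_interval_files(metadata_dir: str) -> List[str]:
    """Return file names covering one compact interval, ordered by batch ID."""
    entries = []
    for p in Path(metadata_dir).iterdir():
        m = _BATCH_FILE.match(p.name)
        if m and p.is_file():
            entries.append((int(m.group(1)), m.group(2) is not None, p.name))

    compact_ids = sorted(batch_id for batch_id, is_compact, _ in entries if is_compact)
    if not compact_ids:
        return []
    if len(compact_ids) >= 2:
        start, end = compact_ids[-2], compact_ids[-1]  # one complete interval
    else:
        start, end = compact_ids[-1], None  # only a partial interval is available

    selected = [e for e in entries if e[0] >= start and (end is None or e[0] < end)]
    return [name for _, _, name in sorted(selected)]
```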

Thanks!

> Structured Streaming file-sink log grow indefinitely
> 
>
> Key: SPARK-29995
> URL: https://issues.apache.org/jira/browse/SPARK-29995
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: zhang liming
>Priority: Major
> Attachments: file.png, task.png
>
>
> When I use the structured streaming parquet sink, I've noticed that the 
> File-Sink-Log files keep getting bigger; they are in 
> {$checkpoint/_spark_metadata/}. I don't think this is reasonable.
> And when the files are merged, task batches take longer to run, as shown in 
> the screenshot below



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29995) Structured Streaming file-sink log grow indefinitely

2019-11-25 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16981406#comment-16981406
 ] 

Jungtaek Lim commented on SPARK-29995:
--

The thing is, "exactly-once" on the file stream sink is achieved only when the 
downstream query reads the metadata in the output directory. In other words, if 
you delete some of the metadata, the query that writes to the output directory 
may crash if it's still running, and any downstream query will miss a 
considerable number of files from the output directory. That would be OK if 
you're not reading the output from another Spark query.
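
To make the point above concrete, here is a small sketch of how a reader that trusts the metadata differs from one that just lists the directory. It is an illustration under assumptions, not Spark's implementation: it assumes each log file starts with a version line (such as "v1") followed by one JSON object per line with "path" and "action" fields, and that the output directory is flat (no partition subdirectories). The function names are hypothetical.

```python
import json
from pathlib import Path
from typing import Set
from urllib.parse import unquote, urlparse


def committed_file_names(log_file: str) -> Set[str]:
    """Names of data files that the given metadata log file marks as committed."""
    lines = Path(log_file).read_text(encoding="utf-8").splitlines()
    if not lines or not lines[0].startswith("v"):
        raise ValueError(f"missing version line in {log_file}")
    names: Set[str] = set()
    for line in lines[1:]:
        if not line.strip():
            continue
        entry = json.loads(line)
        name = Path(unquote(urlparse(entry["path"]).path)).name
        if entry.get("action", "add") == "delete":
            names.discard(name)  # assumed: a later "delete" cancels an earlier "add"
        else:
            names.add(name)
    return names


def uncommitted_files(output_dir: str, log_file: str) -> Set[str]:
    """Data files present in the directory but absent from the metadata log."""
    committed = committed_file_names(log_file)
    present = {
        p.name
        for p in Path(output_dir).iterdir()
        if p.is_file() and not p.name.startswith(("_", "."))
    }
    return present - committed
```

A metadata-aware reader would ignore whatever uncommitted_files returns (for example, leftovers from a failed attempt), while a plain directory listing would read them. Conversely, if log entries are deleted, files that were committed look uncommitted to the metadata-aware reader and are skipped.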

> Structured Streaming file-sink log grow indefinitely
> 
>
> Key: SPARK-29995
> URL: https://issues.apache.org/jira/browse/SPARK-29995
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: zhang liming
>Priority: Major
> Attachments: file.png, task.png
>
>
> When I use the structured streaming parquet sink, I've noticed that the 
> File-Sink-Log files keep getting bigger; they are in 
> {$checkpoint/_spark_metadata/}. I don't think this is reasonable.
> And when the files are merged, task batches take longer to run, as shown in 
> the screenshot below






[jira] [Commented] (SPARK-29995) Structured Streaming file-sink log grow indefinitely

2019-11-22 Thread zhang liming (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16979997#comment-16979997
 ] 

zhang liming commented on SPARK-29995:
--

May I delete the older entries in the latest .compact file to control the 
size of the subsequent merged files?




[jira] [Commented] (SPARK-29995) Structured Streaming file-sink log grow indefinitely

2019-11-21 Thread zhang liming (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16979888#comment-16979888
 ] 

zhang liming commented on SPARK-29995:
--

OK, thanks for your reply.




[jira] [Commented] (SPARK-29995) Structured Streaming file-sink log grow indefinitely

2019-11-21 Thread zhang liming (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16979886#comment-16979886
 ] 

zhang liming commented on SPARK-29995:
--

OK, thank you for your reply.




[jira] [Commented] (SPARK-29995) Structured Streaming file-sink log grow indefinitely

2019-11-21 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16979870#comment-16979870
 ] 

Jungtaek Lim commented on SPARK-29995:
--

Btw, SPARK-27188 is the approach I have proposed so far to solve this issue.




[jira] [Commented] (SPARK-29995) Structured Streaming file-sink log grow indefinitely

2019-11-21 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16979869#comment-16979869
 ] 

Jungtaek Lim commented on SPARK-29995:
--

Removing myself as Shepherd - that field is mostly used for SPIP issues and 
denotes a "supporter" who helps drive the issue forward, a role that only a 
committer/PMC member can take (I'm one of the contributors).

Also, there is an existing issue, SPARK-24295, so this is technically a 
"duplicate" issue, though I think end users keep hitting the problem and we 
should provide a solution (or at least a workaround).
