[ 
https://issues.apache.org/jira/browse/SPARK-24295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17045011#comment-17045011
 ] 

Jungtaek Lim commented on SPARK-24295:
--------------------------------------

[~iqbal_khattra] [~alfredo-gimenez-bv]

Hi, if you're open to trying something out in your environment, could you 
please try SPARK-30946 and see how much it helps? You will need to back up your 
checkpoint and the "_spark_metadata" directory in the output directory, because 
SPARK-30946 converts them to a V2 format that is still only a proposal (there 
is no guarantee that it will be accepted, or when).

If you'd rather not try it out but are willing to share your metadata files, 
please upload them somewhere and let me know. The latest compact file alone 
would be OK, but a full set covering one compact interval would be better 
(XXXX9.compact to XXX(X+1)8, 9 files). If you would like to share them 
privately, please contact me by mail: kabhwan-opensource AT gmail.com
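
To make it easier to pick the right files, here is a rough sketch of which 
names belong to one compact interval. It assumes a compact interval of 10 and 
that a batch is a compaction batch when (batchId + 1) % interval == 0, so 
please double check both against your own configuration. The helper names are 
made up for illustration and are not Spark APIs.

```python
from typing import List


def is_compaction_batch(batch_id: int, compact_interval: int) -> bool:
    # Assumed rule: with interval 10, batches 9, 19, 29, ... are compacted.
    return (batch_id + 1) % compact_interval == 0


def compact_interval_files(compact_batch_id: int,
                           compact_interval: int = 10) -> List[str]:
    """List the metadata file names for one compact interval.

    Hypothetical helper: returns the compact file followed by the plain
    batch files written before the next compaction.
    """
    if not is_compaction_batch(compact_batch_id, compact_interval):
        raise ValueError(
            f"batch {compact_batch_id} is not a compaction batch "
            f"for interval {compact_interval}"
        )
    # The compact file itself, e.g. "19.compact".
    names = [f"{compact_batch_id}.compact"]
    # The plain batch files that follow it, e.g. "20" through "28".
    next_compact = compact_batch_id + compact_interval
    names += [str(b) for b in range(compact_batch_id + 1, next_compact)]
    return names


if __name__ == "__main__":
    for name in compact_interval_files(19):
        print(name)
```

With these assumptions, starting from batch 19 the list is 19.compact followed 
by the nine plain files 20 through 28.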

Thanks!

> Purge Structured streaming FileStreamSinkLog metadata compact file data.
> ------------------------------------------------------------------------
>
>                 Key: SPARK-24295
>                 URL: https://issues.apache.org/jira/browse/SPARK-24295
>             Project: Spark
>          Issue Type: Bug
>          Components: Structured Streaming
>    Affects Versions: 2.3.0
>            Reporter: Iqbal Singh
>            Priority: Major
>         Attachments: spark_metadatalog_compaction_perfbug_repro.tar.gz
>
>
> FileStreamSinkLog metadata logs are concatenated into a single compact file 
> after each defined compact interval.
> For long-running jobs, the compact file can grow to tens of GB, causing 
> slowness when reading data from the FileStreamSinkLog dir, because Spark 
> defaults to the "_spark_metadata" dir for the read.
> We need functionality to purge data from the compact file.
>  


