Github user szhem commented on the issue:

    https://github.com/apache/spark/pull/19410
  
    Hi @asolimando,
    
    I believe that the solution with weak references will work and probably 
with `ContextCleaner` too, but there are some points I'd like to discuss if you 
don't mind
    
    - Let's take `ContextCleaner` first. In that case we should have a pretty 
simple solution, but the backward compatibility of `PeriodicRDDCheckpointer` 
and `PeriodicGraphCheckpointer` will be lost, because 
      - it will be necessary to update these classes to prevent deleting 
checkpoint files 
      - user will always have to provide 
`spark.cleaner.referenceTracking.cleanCheckpoints` property in order to clean 
unnecessary checkpoints. 
      - the users who haven't specified 
`spark.cleaner.referenceTracking.cleanCheckpoints` previously (and I believe 
there will be most of them) will be affected by this new unexpected behaviour 
    
    - In case of custom solution based on weak references 
      - it will be necessary to poll a reference queue at some place and moment 
to remove unnecessary checkpoint files. 
      - polling the reference queue in the separate thread will complicate the 
code 
      - polling the reference queue synchronously does not guarantee deleting 
all the unnecessary checkpoint files.
    
    In case of reference queue, could you please recommend the convenient place 
in the source code to do it? 
    
    As for me 
    - setting `spark.cleaner.referenceTracking.cleanCheckpoints` to `true` by 
default should work 
    - setting `spark.cleaner.referenceTracking.cleanCheckpoints` to `true` by 
default will allow us to simplify `PeriodicRDDCheckpointer` and 
`PeriodicGraphCheckpointer` too by deleting unnecessary code that cleans up 
checkpoint files
    - setting `spark.cleaner.referenceTracking.cleanCheckpoints` to `true` by 
default sounds reasonable and should not break the code of those users who 
didn't do it previously
    - but setting `spark.cleaner.referenceTracking.cleanCheckpoints` to `true` 
will probably break the backward compatibility (although this PR tries to 
preserve it)
    
    What do you think? Will the community accept setting 
`spark.cleaner.referenceTracking.cleanCheckpoints` to `true`by default?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Reply via email to