[ https://issues.apache.org/jira/browse/SPARK-6334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14390771#comment-14390771 ]

Antony Mayi commented on SPARK-6334:
------------------------------------

bq. btw. I see based on the source code that checkpointing should be happening every 
3 iterations - how come I don't see any drop in the disk usage at least once 
every three iterations? it just seems to be growing constantly... which worries 
me that even more frequent checkpointing won't help...

ok, I am now sure that tuning the checkpointing interval is likely not going to 
help, just as the current setting (every 3 iterations) is not helping - the disk 
usage just keeps growing straight through the checkpoints. I then tried a dirty 
hack - running a parallel thread that forces a JVM GC on the driver every x 
minutes - and suddenly I can see the disk space getting cleared on every third 
iteration, when the GC runs.

see this pattern - first a run without forcing GC, then another one where 
there are noticeable disk usage drops every three steps (ALS iterations):
!gc.png!

so what's really needed to get the shuffle files cleaned up upon checkpointing is 
forcing a GC on the driver.
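
(as far as I can tell this is because Spark's ContextCleaner only removes shuffle 
files once the corresponding driver-side objects have been garbage collected, so 
the cleanup has to wait for a GC; a one-off trigger from PySpark is just a single 
py4j call - the thread below merely repeats it on a timer:)

{code}
# one-off trigger, assuming an existing SparkContext named sc: ask the
# driver JVM to garbage-collect so that Spark's ContextCleaner can
# release the shuffle files of RDDs that are no longer reachable
sc._jvm.System.gc()
{code}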

this was my dirty hack:

{code}
from threading import Thread, Event

from pyspark import StorageLevel
from pyspark.mllib.recommendation import ALS


class GC(Thread):
    """Background thread that periodically forces a GC in the driver JVM."""
    def __init__(self, context, period=600):
        Thread.__init__(self)
        self.context = context    # SparkContext whose JVM we want to GC
        self.period = period      # seconds between forced GCs
        self.daemon = True
        self.stopped = Event()

    def stop(self):
        self.stopped.set()

    def run(self):
        self.stopped.clear()
        while not self.stopped.is_set():
            self.stopped.wait(self.period)
            # force a GC in the driver JVM via py4j so that Spark's
            # ContextCleaner can release shuffle files of unreachable RDDs
            self.context._jvm.System.gc()


sc.setCheckpointDir('/tmp')

# keep the driver JVM GC-ing every 10 minutes while ALS runs
gc = GC(sc)
gc.start()

training = sc.pickleFile('/tmp/dataset').repartition(768).persist(StorageLevel.MEMORY_AND_DISK)
model = ALS.trainImplicit(training, 50, 15, lambda_=0.1, blocks=-1, alpha=40)

gc.stop()
{code}
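
(a slightly tidier variant of the same hack, just as a sketch: wrapping the thread 
in a context manager so it always gets stopped, even when the training blows up - 
{{periodic_jvm_gc}} is just a made-up helper name and {{GC}} is the class from above)

{code}
from contextlib import contextmanager

@contextmanager
def periodic_jvm_gc(context, period=600):
    # run the background GC thread only for the duration of the with-block
    gc_thread = GC(context, period)
    gc_thread.start()
    try:
        yield
    finally:
        gc_thread.stop()
        gc_thread.join()

# usage:
# with periodic_jvm_gc(sc, period=600):
#     model = ALS.trainImplicit(training, 50, 15, lambda_=0.1, blocks=-1, alpha=40)
{code}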

> spark-local dir not getting cleared during ALS
> ----------------------------------------------
>
>                 Key: SPARK-6334
>                 URL: https://issues.apache.org/jira/browse/SPARK-6334
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 1.2.0
>            Reporter: Antony Mayi
>         Attachments: als-diskusage.png, gc.png
>
>
> when running a bigger ALS training, Spark spills loads of temp data into the 
> local dir (in my case yarn/local/usercache/antony.mayi/appcache/... - running 
> on YARN from CDH 5.3.2), eventually causing all the disks of all nodes to run 
> out of space (in my case I have 12TB of available disk capacity before 
> kicking off the ALS, but it all gets used and YARN kills the containers when 
> they reach 90%).
> even with all the recommended options (configuring checkpointing and forcing GC 
> when possible) the space still doesn't get cleared.
> here is my (pseudo)code (pyspark):
> {code}
> sc.setCheckpointDir('/tmp')
> training = sc.pickleFile('/tmp/dataset').repartition(768).persist(StorageLevel.MEMORY_AND_DISK)
> model = ALS.trainImplicit(training, 50, 15, lambda_=0.1, blocks=-1, alpha=40)
> sc._jvm.System.gc()
> {code}
> the training RDD has about 3.5 billion items (~60GB on disk). after about 
> 6 hours the ALS consumes all 12TB of disk space in the local dirs and 
> gets killed. my cluster has 192 cores and 1.5TB RAM; for this task I am using 
> 37 executors with 4 cores / 28+4GB RAM each.
> this is the graph of the disk consumption pattern showing the space being 
> eaten up from 7% to 90% during the ALS run (90% is when YARN kills the container):
> !als-diskusage.png!


