[ 
https://issues.apache.org/jira/browse/MAHOUT-666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13019048#comment-13019048
 ] 

Jonathan Traupman commented on MAHOUT-666:
------------------------------------------

So I have a new patch that reverses the default and discards the temp files 
unless an option is specified to keep them around. I also removed the two 
deleteOnExit() calls in TimesSquaredJob since they are redundant if we're 
removing the root temp directory. This has the added benefit of keeping the 
actual file contents around if the "keep temp files" conf parameter is 
specified, which I imagine might be useful for debugging. 
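
To make the behavior described above concrete, here is a minimal sketch of the "delete the root temp directory unless told to keep it" pattern. It is not the code from the patch: the class name, the method names, and the keepTempFiles flag are hypothetical, and it uses the local filesystem via java.nio.file (Java 7+) as a stand-in for the Hadoop filesystem calls the real job would make.

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;

// Hypothetical sketch, not the actual Mahout patch. The local filesystem
// stands in for HDFS, and all names here are illustrative assumptions.
class TempDirCleanup {

    // Stand-in for one times()/timesSquared() call: writes intermediate
    // output under a fresh root temp directory, then removes the whole
    // directory unless keepTempFiles is set.
    static Path runIteration(boolean keepTempFiles) throws IOException {
        Path root = Files.createTempDirectory("dsm-times-");
        try {
            Path output = Files.createDirectory(root.resolve("output"));
            Files.write(output.resolve("part-00000"), new byte[] {1, 2, 3});
            // ... the result vector would be read back once here ...
        } finally {
            if (!keepTempFiles) {
                // Removing the root makes per-file deleteOnExit-style
                // cleanup redundant, since every file lives under it.
                deleteRecursively(root);
            }
        }
        return root;
    }

    // Deletes files first, then each directory once it is empty.
    static void deleteRecursively(Path root) throws IOException {
        Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs)
                    throws IOException {
                Files.delete(file);
                return FileVisitResult.CONTINUE;
            }

            @Override
            public FileVisitResult postVisitDirectory(Path dir, IOException exc)
                    throws IOException {
                if (exc != null) {
                    throw exc;
                }
                Files.delete(dir);
                return FileVisitResult.CONTINUE;
            }
        });
    }

    public static void main(String[] args) throws IOException {
        Path removed = runIteration(false); // default: clean up immediately
        Path kept = runIteration(true);     // debugging: keep the contents
        System.out.println("default run left files behind: " + Files.exists(removed));
        System.out.println("keep run left files behind: " + Files.exists(kept));
        deleteRecursively(kept); // tidy up after the demonstration
    }
}
```

Cleaning up in a finally block at the end of each call, rather than at JVM exit, is what keeps the file count flat across repeated multiplies in this sketch.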

What is the best way to submit this patch? Shall I upload it to this issue or 
just create a new one?


> DistributedSparseMatrix should clean up after itself when doing times(Vector) 
> and timesSquared(Vector)
> ------------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-666
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-666
>             Project: Mahout
>          Issue Type: Bug
>          Components: Math
>    Affects Versions: 0.5
>         Environment: Linux x86_64 2.6.18, Mac OS 10.6 64-bit, Hadoop 0.20.2, 
> Java 1.6
>            Reporter: Jonathan Traupman
>            Assignee: Jake Mannix
>            Priority: Minor
>             Fix For: 0.5
>
>         Attachments: mahout-666.patch
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> The directories created during the times() and timesSquared() methods in 
> DistributedSparseMatrix leave behind a lot of cruft. The individual 
> files are tagged with deleteOnExit, but the directories are not. Also, by 
> not deleting them until JVM exit, a job that does repeated matrix/vector 
> multiplies, like DistributedLanczosSolver, creates a lot of temp files that 
> stick around for the whole run, even though the results they contain are read 
> once and then never again. 
> Our cluster admins enforce both file count and size quotas, so since 5 temp 
> files/directories are created on each iteration of DistributedLanczosSolver, 
> we're constantly bumping into the quota with large SVDs. 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
