[ 
https://issues.apache.org/jira/browse/MAHOUT-834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13133981#comment-13133981
 ] 

Sean Owen commented on MAHOUT-834:
----------------------------------

Out of interest, I tried making all jobs delete their temp output. Many then 
fail, it seems because they are chained together and rely on previous jobs' 
intermediate output. Certainly the tests inspect this output.

This is just my own preference or habit speaking here, but I'm accustomed to 
these big jobs leaving all their output, logs and such around for inspection 
and possible debugging, with the caller left to decide when and how to clean 
up.

Certainly I think the current behavior is reasonable, and needs to stay the 
default behavior. Could there be a flag? Sure. It seems almost harder to write 
(into all jobs), document, debug and support a flag that's just replacing one 
"delete" line of code or script somewhere.

I think it's a defensible idea; I would lean towards leaving it as-is though. 
Making a mistake here just gives you an exception that reasonably clearly 
states the problem. Making a mistake the other way might be surprising or 
undesirable as it's deleting stuff.
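
To illustrate how small the caller-side cleanup can be, here is a minimal 
sketch. It assumes the temp directory is on the local filesystem; the class 
and method names are hypothetical and not part of Mahout. On HDFS the 
equivalent would go through Hadoop's own file API, such as the 
HadoopUtil.delete(getConf(), tempDirPath) call suggested in the issue.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper (not a Mahout class): caller-side cleanup of a
// local temp directory before re-running a job.
final class TempDirCleanup {

    private TempDirCleanup() {}

    /** Deletes dir and everything under it; does nothing if dir is absent. */
    static void deleteRecursively(Path dir) throws IOException {
        if (Files.notExists(dir)) {
            return;
        }
        List<Path> deepestFirst;
        try (Stream<Path> paths = Files.walk(dir)) {
            // Reverse order so children come before their parent directories.
            deepestFirst = paths.sorted(Comparator.reverseOrder())
                                .collect(Collectors.toList());
        }
        for (Path p : deepestFirst) {
            Files.delete(p);
        }
    }
}
```

A caller would invoke TempDirCleanup.deleteRecursively(Paths.get("temp")) 
before relaunching the job, which keeps the decision about when to delete 
with the caller rather than inside every job.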
                
> rowsimilarityjob doesn't clean its temp dir, and fails when seeing it again
> ---------------------------------------------------------------------------
>
>                 Key: MAHOUT-834
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-834
>             Project: Mahout
>          Issue Type: Bug
>          Components: Integration
>            Reporter: Dan Brickley
>            Priority: Minor
>
> If I do this:
> mahout rowsimilarity --input matrixified/matrix --output sims/ 
> --numberOfColumns 27684 --similarityClassname SIMILARITY_LOGLIKELIHOOD 
> --excludeSelfSimilarity
> then clean my output and rerun,
> rm -rf sims/ # (though this step doesn't even seem needed)
> then try again:
> mahout rowsimilarity --input matrixified/matrix --output sims/ 
> --numberOfColumns 27684 --similarityClassname SIMILARITY_LOGLIKELIHOOD 
> --excludeSelfSimilarity
> The temp files left from the first run make a re-run impossible - we get: 
> "Exception in thread "main" 
> org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory 
> temp/weights already exists".
> Manually deleting the temp directory fixes this.
> I get the same behaviour if I explicitly pass in a --tempDir path, e.g.:
> mahout rowsimilarity --input matrixified/matrix --output sims/ 
> --numberOfColumns 27684 --similarityClassname SIMILARITY_LOGLIKELIHOOD 
> --excludeSelfSimilarity --tempDir tmp2/
> Presumably something like HadoopUtil.delete(getConf(),tempDirPath) is needed 
> somewhere?  (and maybe --overwrite too ?)


        
