[ https://issues.apache.org/jira/browse/MAHOUT-834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13134120#comment-13134120 ]
Sean Owen commented on MAHOUT-834:
----------------------------------
I'm reluctant to mix final output and intermediate output. You make a good
point that cluster iterations before the final one could be considered
intermediate results, yet they are written to the output folder. I don't know
how much it changes things to move everything under output; there's still the
question of whether it's overwritten or not. It also ties together more
strongly whether the output and temp directories both stay or are both deleted.
I'd rather avoid change unless there's a clear deficiency. I'm also trying to
avoid surprise: Hadoop has historically been built on HDFS, a write-once sort
of storage system, so things there tend not to be deleted. For example, Hadoop
by default refuses to overwrite an existing output directory, as you see. I
think not-overwriting is a fine default to follow.
Well, you proposed a flag. And there's apparently already an --overwrite flag
in circulation in some jobs to control this behavior. How about we port and
extend that to cover both temp and output? Would that substantially answer the
issue here? It would avoid changing current behavior that people might be used
to.
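As a rough illustration of the semantics proposed above, here is a minimal
sketch. It is not existing Mahout code: the class OverwritePolicy and its
prepare method are hypothetical names, and it uses plain java.nio on a local
filesystem instead of Hadoop's FileSystem API. The default refuses to touch
existing directories; with the flag set, output and temp are removed together.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;

/** Hypothetical helper for illustration only; not part of Mahout or Hadoop. */
final class OverwritePolicy {

  private OverwritePolicy() {}

  /**
   * Prepares the output and temp directories before a job runs.
   * Without overwrite, any existing directory is an error and nothing is deleted.
   * With overwrite, both directories are deleted if present.
   */
  static void prepare(Path output, Path temp, boolean overwrite) throws IOException {
    Path[] dirs = {output, temp};
    if (!overwrite) {
      // Default: behave like Hadoop and refuse to reuse an existing directory.
      for (Path dir : dirs) {
        if (Files.exists(dir)) {
          throw new FileAlreadyExistsException(dir.toString());
        }
      }
      return;
    }
    // Overwrite requested: output and temp are cleaned as a pair.
    for (Path dir : dirs) {
      if (Files.exists(dir)) {
        deleteRecursively(dir);
      }
    }
  }

  /** Deletes a directory tree, files first and then the directories containing them. */
  private static void deleteRecursively(Path root) throws IOException {
    Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
      @Override
      public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) throws IOException {
        Files.delete(file);
        return FileVisitResult.CONTINUE;
      }

      @Override
      public FileVisitResult postVisitDirectory(Path dir, IOException exc) throws IOException {
        if (exc != null) {
          throw exc;
        }
        Files.delete(dir);
        return FileVisitResult.CONTINUE;
      }
    });
  }
}
```

A job would call something like OverwritePolicy.prepare(outputPath, tempPath,
overwriteRequested) before launching its first phase. Checking both
directories before deleting either keeps the default path free of side effects.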
> rowsimilarityjob doesn't clean its temp dir, and fails when seeing it again
> ---------------------------------------------------------------------------
>
> Key: MAHOUT-834
> URL: https://issues.apache.org/jira/browse/MAHOUT-834
> Project: Mahout
> Issue Type: Bug
> Components: Integration
> Reporter: Dan Brickley
> Priority: Minor
>
> If I do this:
>   mahout rowsimilarity --input matrixified/matrix --output sims/ \
>     --numberOfColumns 27684 --similarityClassname SIMILARITY_LOGLIKELIHOOD \
>     --excludeSelfSimilarity
> then clean my output,
>   rm -rf sims/ # (though this step doesn't even seem needed)
> then try again:
>   mahout rowsimilarity --input matrixified/matrix --output sims/ \
>     --numberOfColumns 27684 --similarityClassname SIMILARITY_LOGLIKELIHOOD \
>     --excludeSelfSimilarity
> The temp files left from the first run make a re-run impossible - we get:
> "Exception in thread "main"
> org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory
> temp/weights already exists".
> Manually deleting the temp directory fixes this.
> I get the same behaviour if I explicitly pass in a --tempDir path, e.g.:
>   mahout rowsimilarity --input matrixified/matrix --output sims/ \
>     --numberOfColumns 27684 --similarityClassname SIMILARITY_LOGLIKELIHOOD \
>     --excludeSelfSimilarity --tempDir tmp2/
> Presumably something like HadoopUtil.delete(getConf(), tempDirPath) is needed
> somewhere? (and maybe --overwrite too?)