I'm catching up on some mail and I came across this patch - this looks OK to me (though I'm not too familiar with the nuances of running on EMR).
I'm unit testing it now, but I wanted to ask: what is the policy on committing patches delivered via a link? Should I request a resubmit as a JIRA attachment before applying this? If there are no objections (based on that or otherwise), I'll probably take this patch as my first commit.

-tom

On Wed, Feb 22, 2012 at 12:18 AM, Matteo Riondato (Created) (JIRA) <[email protected]> wrote:
> Patch to make PFPGrowth run on Amazon MapReduce (also shows patterns for
> making other algorithms work in Amazon MapReduce)
> ----------------------------------------------------------------------
>
>                 Key: MAHOUT-980
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-980
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Frequent Itemset/Association Rule Mining
>    Affects Versions: 0.6, 0.5, 0.7
>         Environment: Amazon MapReduce
>            Reporter: Matteo Riondato
>             Fix For: 0.7
>
>
> The patch at http://www.cs.brown.edu/~matteo/PFPGrowth.java.diff (against
> trunk as of Wed Feb 22 00:07:35 EST 2012, revision 1292127) makes it possible
> to run PFPGrowth on Elastic MapReduce.
>
> The problem was in the way the fList stored in the DistributedCache was
> accessed. DistributedCache.getCacheFiles(conf) should be reserved for
> internal use according to the Hadoop API Documentation. The suggested way to
> access the files in the DistributedCache is through
> DistributedCache.getLocalCacheFiles(conf) and then through a LocalFilesystem.
> This is what the patch does. Note that there is a fallback case if we are
> running PFPGrowth with "-method mapreduce" but locally (e.g. when HADOOP_HOME
> is not set or MAHOUT_LOCAL is set). In this case, we use
> DistributedCache.getCacheFiles() as it is done in the unpatched version.
>
> A quick grep in the source tree shows that there are other places where
> DistributedCache.getCacheFiles(conf) is used.
> It may be worth checking whether the corresponding algorithms can be made
> to work in Amazon MapReduce by fixing them in a similar fashion.
>
> The patch was tested also outside Amazon MapReduce and does not change any
> other functionality.
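
For anyone reading the thread without the diff open, the lookup order described in the report (task-local copies first, cache URIs as the fallback for local runs) can be sketched roughly as below. This is only an illustration, not the code from the patch: it avoids the Hadoop classes entirely, so the two arrays are stand-ins for what DistributedCache.getLocalCacheFiles(conf) and DistributedCache.getCacheFiles(conf) would return, and the class and method names (FListLocator, chooseFListLocation) are hypothetical.

```java
import java.net.URI;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of Mahout or of the MAHOUT-980 patch.
final class FListLocator {

    /**
     * Chooses where to read the fList from.
     *
     * @param localCacheFiles stand-in for the node-local paths that
     *                        DistributedCache.getLocalCacheFiles(conf) would return (may be null)
     * @param cacheFiles      stand-in for the URIs that
     *                        DistributedCache.getCacheFiles(conf) would return (may be null)
     */
    static URI chooseFListLocation(Path[] localCacheFiles, URI[] cacheFiles) {
        // Preferred case: the framework has already copied the file to the task's local disk.
        if (localCacheFiles != null && localCacheFiles.length > 0) {
            return localCacheFiles[0].toUri();
        }
        // Fallback case: a local run where nothing was localized, so use the registered URI.
        if (cacheFiles != null && cacheFiles.length > 0) {
            return cacheFiles[0];
        }
        throw new IllegalStateException("fList was not found in the distributed cache");
    }

    public static void main(String[] args) {
        // Made-up locations, for illustration only.
        Path[] local = { Paths.get("/tmp/cache/fList.seq") };
        URI[] registered = { URI.create("hdfs://namenode/tmp/fList.seq") };

        System.out.println("cluster run: " + chooseFListLocation(local, registered));
        System.out.println("local run:   " + chooseFListLocation(null, registered));
    }
}
```

In the real job, the first branch would go on to open the path through a local file system, as the report describes. The sketch covers only the choice between the two sources.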
