[ https://issues.apache.org/jira/browse/MAHOUT-980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Matteo Riondato updated MAHOUT-980:
-----------------------------------
Status: Patch Available (was: Open)
Index: core/src/main/java/org/apache/mahout/fpm/pfpgrowth/PFPGrowth.java
===================================================================
--- core/src/main/java/org/apache/mahout/fpm/pfpgrowth/PFPGrowth.java (revision 1292113)
+++ core/src/main/java/org/apache/mahout/fpm/pfpgrowth/PFPGrowth.java (working copy)
@@ -96,15 +96,28 @@
*/
public static List<Pair<String,Long>> readFList(Configuration conf) throws IOException {
List<Pair<String,Long>> list = new ArrayList<Pair<String,Long>>();
- URI[] files = DistributedCache.getCacheFiles(conf);
+ Path[] files = DistributedCache.getLocalCacheFiles(conf);
if (files == null) {
throw new IOException("Cannot read Frequency list from Distributed Cache");
}
if (files.length != 1) {
throw new IOException("Cannot read Frequency list from Distributed Cache ("+files.length+")");
}
+ FileSystem fs = FileSystem.getLocal(conf);
+ Path fListLocalPath = fs.makeQualified(files[0]);
+ // Fallback if we are running locally.
+ if (! fs.exists(fListLocalPath)) {
+ URI[] filesURIs = DistributedCache.getCacheFiles(conf);
+ if (filesURIs == null) {
+ throw new IOException("Cannot read Frequency list from Distributed Cache");
+ }
+ if (filesURIs.length != 1) {
+ throw new IOException("Cannot read Frequency list from Distributed Cache ("+filesURIs.length+")");
+ }
+ fListLocalPath = new Path(filesURIs[0].getPath());
+ }
for (Pair<Text,LongWritable> record :
- new SequenceFileIterable<Text,LongWritable>(new Path(files[0].getPath()), true, conf)) {
+ new SequenceFileIterable<Text,LongWritable>(fListLocalPath, true, conf)) {
list.add(new Pair<String,Long>(record.getFirst().toString(), record.getSecond().get()));
}
return list;
> Patch to make PFPGrowth run on Amazon MapReduce (also shows patterns for
> making other algorithms work in Amazon MapReduce)
> --------------------------------------------------------------------------------------------------------------------------
>
> Key: MAHOUT-980
> URL: https://issues.apache.org/jira/browse/MAHOUT-980
> Project: Mahout
> Issue Type: Improvement
> Components: Frequent Itemset/Association Rule Mining
> Affects Versions: 0.6, 0.5, 0.7
> Environment: Amazon MapReduce
> Reporter: Matteo Riondato
> Labels: hadoop, patch
> Fix For: 0.7
>
>
> The patch at http://www.cs.brown.edu/~matteo/PFPGrowth.java.diff (against
> trunk as of Wed Feb 22 00:07:35 EST 2012, revision 1292127) makes it possible
> to run PFPGrowth on Elastic MapReduce.
> The problem was in the way the fList stored in the DistributedCache was
> accessed. DistributedCache.getCacheFiles(conf) should be reserved for
> internal use according to the Hadoop API Documentation. The suggested way to
> access the files in the DistributedCache is through
> DistributedCache.getLocalCacheFiles(conf) and then through a LocalFileSystem.
> This is what the patch does. Note that there is a fallback case if we are
> running PFPGrowth with "-method mapreduce" but locally (e.g. when HADOOP_HOME
> is not set or MAHOUT_LOCAL is set). In this case, we use
> DistributedCache.getCacheFiles() as it is done in the unpatched version.
> A quick grep in the source tree shows that there are other places where
> DistributedCache.getCacheFiles(conf) is used. It may be worth checking
> whether the corresponding algorithms can be made to work in Amazon MapReduce
> by fixing them in a similar fashion.
> The patch was also tested outside Amazon MapReduce and does not change any
> other functionality.
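
The lookup order described in the issue (prefer the localized copy, fall back to the original cache URI when that copy does not exist) can be sketched without Hadoop on the classpath. This is only an illustration under stated assumptions, not Mahout or Hadoop code: the class CacheFileResolver and the method resolveSingle are hypothetical names, and plain java.nio paths stand in for the arrays that DistributedCache.getLocalCacheFiles(conf) and DistributedCache.getCacheFiles(conf) return in the patch.

```java
import java.io.IOException;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper mirroring the control flow of the patched readFList(). */
public final class CacheFileResolver {

  private CacheFileResolver() {}

  /**
   * Picks the path to read a single cached file from.
   *
   * @param localized stand-in for the result of getLocalCacheFiles(conf)
   * @param original  stand-in for the result of getCacheFiles(conf)
   */
  public static Path resolveSingle(Path[] localized, URI[] original) throws IOException {
    // Same checks as the patch: exactly one localized entry is expected.
    if (localized == null) {
      throw new IOException("Cannot read file from cache");
    }
    if (localized.length != 1) {
      throw new IOException("Cannot read file from cache (" + localized.length + ")");
    }
    // Preferred case: the localized copy exists on the local file system.
    if (Files.exists(localized[0])) {
      return localized[0].toAbsolutePath();
    }
    // Fallback (e.g. a purely local run): use the path component of the original URI.
    if (original == null) {
      throw new IOException("Cannot read file from cache");
    }
    if (original.length != 1) {
      throw new IOException("Cannot read file from cache (" + original.length + ")");
    }
    return Paths.get(original[0].getPath());
  }
}
```

Other call sites of DistributedCache.getCacheFiles(conf) might be adapted along the same lines, though each would need to be checked against its own assumptions about how many files are cached.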