Andrey Davydov created MAHOUT-1128:
--------------------------------------

             Summary:  MAHOUT-999 issue still actual
                 Key: MAHOUT-1128
                 URL: https://issues.apache.org/jira/browse/MAHOUT-1128
             Project: Mahout
          Issue Type: Bug
          Components: Clustering
    Affects Versions: 0.7
         Environment: I work on a Hadoop 1.0.3 cluster deployed on Amazon EC2 
virtual machines running Ubuntu 11, with mahout-core.jar 0.7 from Maven Central.
I run my application from a separate "client" machine, which submits tasks to 
the cluster.


            Reporter: Andrey Davydov


I'm sorry, my English is not good and I'm new to Mahout, but it seems that 
the MAHOUT-999 issue is still present.

I use mahout-core 0.7 from Maven Central and I get the same failure.

I looked through the sources and found the following in the 
org.apache.mahout.clustering.classify.ClusterClassifier class:

  public void writeToSeqFiles(Path path) throws IOException {
    writePolicy(policy, path);
    Configuration config = new Configuration();
    FileSystem fs = FileSystem.get(path.toUri(), config);
    SequenceFile.Writer writer = null;
    ClusterWritable cw = new ClusterWritable();
    for (int i = 0; i < models.size(); i++) {
...
      } finally {
        Closeables.closeQuietly(writer);
      }
    }
  }
  
  public void readFromSeqFiles(Configuration conf, Path path) throws IOException {
    Configuration config = new Configuration();
    List<Cluster> clusters = Lists.newArrayList();
    for (ClusterWritable cw : new SequenceFileDirValueIterable<ClusterWritable>(path, PathType.LIST,
        PathFilters.logsCRCFilter(), config)) {
...
    }
    this.models = clusters;
    modelClass = models.get(0).getClass().getName();
    this.policy = readPolicy(path);
  }

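Both methods create a new default Configuration, so they end up working with 
the local file system. That is, KMeansDriver writes the initial clusters to the 
local file system of the "client" machine, and CIMapper then tries to read them 
from the local file system of a cluster node. Note that readFromSeqFiles even 
receives a Configuration argument (conf) but ignores it.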
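
To illustrate why the default matters, here is a minimal sketch that uses only 
the JDK, not Hadoop. It assumes that a path without a scheme is resolved 
against whatever default file system the configuration names, which is my 
understanding of how FileSystem.get(path.toUri(), config) picks a file system. 
The class DefaultFsSketch, the helper qualify, the host "namenode:8020" and the 
path "/tmp/clusters-0" are all hypothetical names made up for this example.

```java
import java.net.URI;

final class DefaultFsSketch {

    // Hypothetical helper: resolves a path against a default file system URI,
    // the way a scheme-less path falls back to the configured default.
    static URI qualify(String path, URI defaultFs) {
        URI uri = URI.create(path);
        if (uri.getScheme() != null) {
            return uri; // already names its file system explicitly
        }
        return defaultFs.resolve(uri);
    }

    public static void main(String[] args) {
        URI localDefault = URI.create("file:///");              // bare defaults
        URI hdfsDefault = URI.create("hdfs://namenode:8020/");  // cluster config (assumed)

        // Same scheme-less path, two different file systems.
        System.out.println(qualify("/tmp/clusters-0", localDefault).getScheme());
        System.out.println(qualify("/tmp/clusters-0", hdfsDefault).getScheme());
    }
}
```

With the local default the path lands on whichever machine runs the code, so 
the writer on the client and the reader on a cluster node see different files.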

It seems that the current implementation can work only on a pseudo-distributed 
Hadoop setup. I think ClusterClassifier should store intermediate results in 
HDFS, using the Configuration that the user passes in through the API.
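
The difference between the two patterns can be sketched as follows. This is 
only a JDK-based model, not a patch: java.util.Properties stands in for 
Hadoop's Configuration, and the class and method names are hypothetical. The 
key "fs.default.name" and its default value "file:///" are the Hadoop 1.x 
settings as far as I know; "hdfs://namenode:8020" is an assumed cluster address.

```java
import java.util.Properties;

final class ConfPropagationSketch {

    static final String DEFAULT_FS_KEY = "fs.default.name"; // Hadoop 1.x key
    static final String LOCAL_FS = "file:///";              // built-in default

    // Current pattern: the caller's settings are ignored and fresh
    // defaults are created, so the cluster address is lost.
    static String defaultFsIgnoringCaller(Properties callerConf) {
        Properties fresh = new Properties();
        return fresh.getProperty(DEFAULT_FS_KEY, LOCAL_FS);
    }

    // Suggested pattern: the settings passed in by the caller are used.
    static String defaultFsFromCaller(Properties callerConf) {
        return callerConf.getProperty(DEFAULT_FS_KEY, LOCAL_FS);
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty(DEFAULT_FS_KEY, "hdfs://namenode:8020");

        System.out.println(defaultFsIgnoringCaller(conf)); // prints file:///
        System.out.println(defaultFsFromCaller(conf));     // prints hdfs://namenode:8020
    }
}
```

In ClusterClassifier terms this would mean giving writeToSeqFiles a 
Configuration parameter and using the conf argument that readFromSeqFiles 
already has, instead of calling new Configuration() in both methods.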


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
