[jira] [Commented] (MAHOUT-1247) cluster-reuters doesn't work on Hadoop

2013-06-09 Thread Grant Ingersoll (JIRA)

[ https://issues.apache.org/jira/browse/MAHOUT-1247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13679074#comment-13679074 ]

Grant Ingersoll commented on MAHOUT-1247:
-

Here's the first error I'm getting: https://paste.apache.org/cik6
{quote}
java.lang.IllegalStateException: /tmp/hadoop-grantingersoll/mapred/local/taskTracker/distcache/4475940891381251304_1262960862_693852121/localhostdicVec/dictionary.file-0
    at org.apache.mahout.common.iterator.sequencefile.SequenceFileIterable.iterator(SequenceFileIterable.java:63)
    at org.apache.mahout.vectorizer.term.TFPartialVectorReducer.setup(TFPartialVectorReducer.java:146)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:174)
    at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:650)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:418)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1149)
    at org.apache.hadoop.mapred.Child.main(Child.java:249)
Caused by: java.io.FileNotFoundException: File does not exist: hdfs://localhost:9000/tmp/hadoop-grantingersoll/mapred/local/taskTracker/distcache/4475940891381251304_1262960862_693852121/localhostdicVec/dictionary.file-0
    at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:528)
    at org.apache.hadoop.fs.FileSystem.getLength(FileSystem.java:796)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1479)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1474)
    at org.apache.mahout.common.iterator.sequencefile.SequenceFileIterator.<init>(SequenceFileIterator.java:58)
    at org.apache.mahout.common.iterator.sequencefile.SequenceFileIterable.iterator(SequenceFileIterable.java:61)
    ... 9 more
{quote}

This might be related to MAHOUT-992, but I'm not sure.  I added a main method to 
DictionaryVectorizer so you can reproduce the error from the output of a prior 
cluster-reuters run without having to re-run everything.

 cluster-reuters doesn't work on Hadoop
 --

 Key: MAHOUT-1247
 URL: https://issues.apache.org/jira/browse/MAHOUT-1247
 Project: Mahout
  Issue Type: Bug
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
 Fix For: 0.8


 At least two issues:
 1. MAHOUT-992 messed up the Distributed Cache stuff somehow
 2. The ExtractReuters data is not being moved to HDFS.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (MAHOUT-1247) cluster-reuters doesn't work on Hadoop

2013-06-09 Thread Grant Ingersoll (JIRA)

[ https://issues.apache.org/jira/browse/MAHOUT-1247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13679076#comment-13679076 ]

Grant Ingersoll commented on MAHOUT-1247:
-

After you run cluster-reuters.sh, you can run:
{code}
bin/mahout org.apache.mahout.vectorizer.DictionaryVectorizer -i /tmp/mahout-work-grantingersoll/reuters-out-seqdir-sparse-kmeans/tokenized-documents -o ./dicVec
{code}

Make sure HADOOP_HOME is set, and substitute your own work directory for the 
one in the input path.



[jira] [Commented] (MAHOUT-1247) cluster-reuters doesn't work on Hadoop

2013-06-09 Thread Hudson (JIRA)

[ https://issues.apache.org/jira/browse/MAHOUT-1247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13679085#comment-13679085 ]

Hudson commented on MAHOUT-1247:


Integrated in Mahout-Quality #2065 (See 
[https://builds.apache.org/job/Mahout-Quality/2065/])
add some helpers to AbstractJob, add a main to DictionaryVectorizer to try 
and isolate some issues in testing DicVec on Hadoop for MAHOUT-1247 (Revision 
1491225)

 Result = SUCCESS
gsingers : 
Files : 
* /mahout/trunk/core/src/main/java/org/apache/mahout/common/AbstractJob.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/vectorizer/DictionaryVectorizer.java




[jira] [Commented] (MAHOUT-1247) cluster-reuters doesn't work on Hadoop

2013-06-09 Thread Grant Ingersoll (JIRA)

[ https://issues.apache.org/jira/browse/MAHOUT-1247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13679090#comment-13679090 ]

Grant Ingersoll commented on MAHOUT-1247:
-

I think I see the issue.  The cache file is on the local file system, but the 
Iterator has a Hadoop conf that expects an HDFS file, so it can't find it.
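
To illustrate that diagnosis, here is a minimal, self-contained Java sketch of 
how a path with no scheme ends up being looked up on HDFS. It is not Mahout or 
Hadoop code: the class name, the qualify helper and the shortened 
/tmp/local-cache path are hypothetical, and it only models path qualification 
with java.net.URI. The hdfs://localhost:9000 default is the one that appears in 
the stack trace.

```java
import java.net.URI;

class CachePathDemo {

    // Hypothetical stand-in for path qualification: a path that carries no
    // scheme is resolved against the default file system URI.
    static URI qualify(String path, URI defaultFs) {
        URI raw = URI.create(path);
        if (raw.getScheme() != null) {
            return raw; // the path already names its own file system
        }
        return defaultFs.resolve(raw);
    }

    public static void main(String[] args) {
        // Default file system taken from the stack trace.
        URI defaultFs = URI.create("hdfs://localhost:9000/");
        // Hypothetical, shortened location of a localized cache file.
        String cached = "/tmp/local-cache/dictionary.file-0";

        // No scheme: the local path is silently treated as an HDFS path.
        System.out.println(qualify(cached, defaultFs));
        // prints hdfs://localhost:9000/tmp/local-cache/dictionary.file-0

        // Explicit file: scheme: the path stays on the local file system.
        System.out.println(qualify("file:" + cached, defaultFs));
        // prints file:/tmp/local-cache/dictionary.file-0
    }
}
```

In Hadoop terms, the analogous move would presumably be to open the cached file 
through the local file system (for example via FileSystem.getLocal(conf)) 
rather than the job's default one. That is an assumption about a possible fix, 
not a description of what was committed.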
