Hello,
I am trying to get TF-IDF vectors from a corpus of 100k documents. I
noticed that the tfidf sequence file is effectively empty (part-r-00000
is only 90 bytes), while the tf vectors are not.
Here is the log from SparseVectorsFromSequenceFiles:
INFO org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles: Maximum n-gram size is: 1
INFO org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles: Minimum LLR value: 1.0
INFO org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles: Number of reduce tasks: 1
INFO org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles: Tokenizing documents in /opt/seq
INFO org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles: Creating Term Frequency Vectors
INFO org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles: Calculating IDF
INFO org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles: Pruning
Here is the tfidf output dir:
root@test:[/opt/sparse/tfidf-vectors] # ll
total 20K
drwxr-xr-x 2 tomcat7 tomcat7 4.0K Apr 21 12:27 .
drwxr-xr-x 9 tomcat7 tomcat7 4.0K Apr 21 12:27 ..
-rw-r--r-- 1 tomcat7 tomcat7 90 Apr 21 12:27 part-r-00000
-rw-r--r-- 1 tomcat7 tomcat7 12 Apr 21 12:27 .part-r-00000.crc
-rw-r--r-- 1 tomcat7 tomcat7 0 Apr 21 12:27 _SUCCESS
-rw-r--r-- 1 tomcat7 tomcat7 8 Apr 21 12:27 ._SUCCESS.crc
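To sanity-check that the 90-byte part-r-00000 holds no records at all, here is a minimal sketch that adds up the size of a bare Hadoop SequenceFile header. It assumes an uncompressed file with no metadata entries, Text keys and VectorWritable values (my understanding of what the tfidf step writes, not something I have confirmed); the function names are only for illustration.

```python
# Rough size of a SequenceFile header that is followed by zero records.
# Assumptions: uncompressed file, no metadata entries, class names
# shorter than 128 bytes (so the length prefix fits in one byte).

SYNC_MARKER_BYTES = 16  # random sync marker written at the end of the header


def text_string_size(value: str) -> int:
    """Bytes used by a length-prefixed UTF-8 string with a 1-byte prefix."""
    encoded = value.encode("utf-8")
    if len(encoded) > 127:
        raise ValueError("length prefix would need more than one byte")
    return 1 + len(encoded)


def header_size(key_class: str, value_class: str) -> int:
    """Total header bytes for an uncompressed, metadata-free SequenceFile."""
    magic_and_version = 3 + 1   # the bytes "SEQ" plus one version byte
    class_names = text_string_size(key_class) + text_string_size(value_class)
    compression_flags = 2       # "compressed" and "block compressed" booleans
    metadata_count = 4          # 4-byte int holding the number of entries (0)
    return (magic_and_version + class_names + compression_flags
            + metadata_count + SYNC_MARKER_BYTES)


if __name__ == "__main__":
    print(header_size("org.apache.hadoop.io.Text",
                      "org.apache.mahout.math.VectorWritable"))
```

Under those assumptions the result is 90, the same as the size of part-r-00000 in the listing above, so the file appears to contain a header and nothing else.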
Here is the tf output dir:
root@test:[/opt/sparse/tf-vectors] # ll
total 3.7M
drwxr-xr-x 2 tomcat7 tomcat7 4.0K Apr 21 12:27 .
drwxr-xr-x 9 tomcat7 tomcat7 4.0K Apr 21 12:27 ..
-rw-r--r-- 1 tomcat7 tomcat7 3.6M Apr 21 12:27 part-r-00000
-rw-r--r-- 1 tomcat7 tomcat7 29K Apr 21 12:27 .part-r-00000.crc
-rw-r--r-- 1 tomcat7 tomcat7 0 Apr 21 12:27 _SUCCESS
-rw-r--r-- 1 tomcat7 tomcat7 8 Apr 21 12:27 ._SUCCESS.crc
Here is the input dir:
root@test:[/opt/seq] # ll
total 81M
drwxr-xr-x 2 tomcat7 tomcat7 4.0K Apr 21 12:25 .
drwxrwxrwx 9 tomcat7 root 4.0K Apr 21 12:25 ..
-rw-r--r-- 1 tomcat7 tomcat7 31M Apr 21 12:25 part-m-00000
-rw-r--r-- 1 tomcat7 tomcat7 242K Apr 21 12:25 .part-m-00000.crc
-rw-r--r-- 1 tomcat7 tomcat7 31M Apr 21 12:25 part-m-00001
-rw-r--r-- 1 tomcat7 tomcat7 242K Apr 21 12:25 .part-m-00001.crc
-rw-r--r-- 1 tomcat7 tomcat7 20M Apr 21 12:25 part-m-00002
-rw-r--r-- 1 tomcat7 tomcat7 155K Apr 21 12:25 .part-m-00002.crc
-rw-r--r-- 1 tomcat7 tomcat7 0 Apr 21 12:25 _SUCCESS
-rw-r--r-- 1 tomcat7 tomcat7 8 Apr 21 12:25 ._SUCCESS.crc
I am running it through ToolRunner with the following parameters:
-i /opt/seq -o /opt/sparse/ -nv --maxDFSigma 2.0 --weight tfidf
Any hints as to why the TF-IDF output might be empty?
Best,
Max