Re: TFIDFConverter generates empty tfidf-vectors

2013-09-22 Thread Taner Diler
Gökhan, thank you.

Your advice solved my problem.


On Sun, Sep 8, 2013 at 3:11 PM, Gokhan Capan gkhn...@gmail.com wrote:

 Taner,

 It seems that, to get tf-idf vectors later, you need to create the tf vectors
 (DictionaryVectorizer.createTermFrequencyVectors) with the logNormalize option
 set to false and the normPower option set to -1.0f. This applies to
 HighDFWordsPruner.pruneVectors, too.

 I believe that solves your problem.

 Best

 Gokhan


 On Wed, Sep 4, 2013 at 4:54 PM, Taner Diler taner.di...@gmail.com wrote:

  Actually, my real motivation was to visualize the Reuters vectors like the
  DisplayKMeans example, then apply the same approach to the web content I've
  collected, and additionally to discover what I can do with the generated tfidf
  vectors. But the TFIDF step doesn't work, and why?
 
  There is one main class that doesn't extend AbstractJob. Yes, it has a main
  method that executes all the steps. And I'm trying to implement a sample that
  appears in the Mahout wiki and elsewhere on the net. In Eclipse, I've just
  added mahout-0.8-job.jar (you know, it includes all dependent packages) and
  hadoop-core-1.2.0.jar.
 
  import java.io.IOException;
  import java.util.ArrayList;
  import java.util.Collections;
  import java.util.Comparator;
  import java.util.HashMap;
  import java.util.List;
  import java.util.Map;
  import java.util.Set;
 
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.SequenceFile;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.SequenceFileAsBinaryInputFormat;
  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.util.Version;
  import org.apache.mahout.clustering.Cluster;
  import org.apache.mahout.clustering.canopy.CanopyDriver;
  import org.apache.mahout.clustering.classify.WeightedVectorWritable;
  import org.apache.mahout.clustering.display.DisplayKMeans;
  import org.apache.mahout.clustering.kmeans.KMeansDriver;
  import org.apache.mahout.common.HadoopUtil;
  import org.apache.mahout.common.Pair;
  import org.apache.mahout.common.StringTuple;
  import org.apache.mahout.common.distance.EuclideanDistanceMeasure;
  import org.apache.mahout.common.distance.TanimotoDistanceMeasure;
  import org.apache.mahout.math.Vector.Element;
  import org.apache.mahout.math.VectorWritable;
  import org.apache.mahout.vectorizer.DictionaryVectorizer;
  import org.apache.mahout.vectorizer.DocumentProcessor;
  import org.apache.mahout.vectorizer.tfidf.TFIDFConverter;
 
 
  public class MahoutReutersKMeans {
 
  private static int minSupport = 2;
  private static int maxNGramSize = 2;
  private static float minLLRValue = 1;
  private static float normPower = 2;
  private static boolean logNormalize = true;
  private static int numReducers = 1;
  private static int chunkSizeInMegabytes = 200;
  private static boolean sequentialAccess = true;
  private static boolean namedVectors = false;
 
  private static int minDf = 5;
 
  private static long maxDF = 95;
 
  /**
   * @param args
   * @throws IOException
   * @throws InterruptedException
   * @throws ClassNotFoundException
   */
  public static void main(String[] args) throws IOException,
  ClassNotFoundException, InterruptedException {
 
  Configuration conf = new Configuration();
  String HADOOP_HOME = System.getenv("HADOOP_PREFIX");
 
  conf.addResource(new Path(HADOOP_HOME, "conf/core-site.xml"));
  conf.addResource(new Path(HADOOP_HOME, "conf/hdfs-site.xml"));
  conf.addResource(new Path(HADOOP_HOME, "conf/mapred-site.xml"));
 
  FileSystem fs  = FileSystem.get(conf);
 
  Path inputDir = new Path("reuters-seqfiles");
  String outputDir = "reuters-kmeans-try";
  HadoopUtil.delete(conf, new Path(outputDir));
  StandardAnalyzer analyzer = new
  StandardAnalyzer(Version.LUCENE_43);
  Path tokenizedPath = new
  Path(DocumentProcessor.TOKENIZED_DOCUMENT_OUTPUT_FOLDER);
  DocumentProcessor.tokenizeDocuments(inputDir,
  analyzer.getClass().asSubclass(Analyzer.class), tokenizedPath, conf);
 
 
  DictionaryVectorizer.createTermFrequencyVectors(tokenizedPath,
 new
  Path(outputDir),
  DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER, conf,
  minSupport , maxNGramSize, minLLRValue, normPower , logNormalize,
  numReducers , chunkSizeInMegabytes , sequentialAccess, namedVectors);
 
 
  Pair<Long[], List<Path>> features =
 TFIDFConverter.calculateDF(new
  Path(outputDir,
  DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER), new
  Path(outputDir), conf, chunkSizeInMegabytes);
  TFIDFConverter.processTfIdf(new Path(outputDir,
  DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER), new
  

Re: TFIDFConverter generates empty tfidf-vectors

2013-09-08 Thread Gokhan Capan
Taner,

It seems that, to get tf-idf vectors later, you need to create the tf vectors
(DictionaryVectorizer.createTermFrequencyVectors) with the logNormalize option
set to false and the normPower option set to -1.0f. This applies to
HighDFWordsPruner.pruneVectors, too.
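
In code, this is what I mean (a minimal, untested sketch that reuses the
variables from your program below; as far as I remember,
PartialVectorMerger.NO_NORMALIZING is the -1.0f constant in Mahout 0.8):

import org.apache.mahout.vectorizer.common.PartialVectorMerger;

// Create *unnormalized* tf vectors; the norm is applied later, in processTfIdf.
DictionaryVectorizer.createTermFrequencyVectors(tokenizedPath, new Path(outputDir),
        DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER, conf,
        minSupport, maxNGramSize, minLLRValue,
        PartialVectorMerger.NO_NORMALIZING, // normPower = -1.0f
        false,                              // logNormalize = false
        numReducers, chunkSizeInMegabytes, sequentialAccess, namedVectors);

// Apply normPower and logNormalize only at the tf-idf stage.
TFIDFConverter.processTfIdf(new Path(outputDir,
        DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER),
        new Path(outputDir), conf, features, minDf, maxDF,
        normPower, logNormalize, sequentialAccess, false, numReducers);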

I believe that solves your problem.

Best

Gokhan


On Wed, Sep 4, 2013 at 4:54 PM, Taner Diler taner.di...@gmail.com wrote:

 Actually, my real motivation was to visualize the Reuters vectors like the
 DisplayKMeans example, then apply the same approach to the web content I've
 collected, and additionally to discover what I can do with the generated tfidf
 vectors. But the TFIDF step doesn't work, and why?

 There is one main class that doesn't extend AbstractJob. Yes, it has a main
 method that executes all the steps. And I'm trying to implement a sample that
 appears in the Mahout wiki and elsewhere on the net. In Eclipse, I've just
 added mahout-0.8-job.jar (you know, it includes all dependent packages) and
 hadoop-core-1.2.0.jar.

 import java.io.IOException;
 import java.util.ArrayList;
 import java.util.Collections;
 import java.util.Comparator;
 import java.util.HashMap;
 import java.util.List;
 import java.util.Map;
 import java.util.Set;

 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.fs.FileSystem;
 import org.apache.hadoop.fs.Path;
 import org.apache.hadoop.io.IntWritable;
 import org.apache.hadoop.io.LongWritable;
 import org.apache.hadoop.io.SequenceFile;
 import org.apache.hadoop.io.Text;
 import org.apache.hadoop.mapred.SequenceFileAsBinaryInputFormat;
 import org.apache.lucene.analysis.Analyzer;
 import org.apache.lucene.analysis.standard.StandardAnalyzer;
 import org.apache.lucene.util.Version;
 import org.apache.mahout.clustering.Cluster;
 import org.apache.mahout.clustering.canopy.CanopyDriver;
 import org.apache.mahout.clustering.classify.WeightedVectorWritable;
 import org.apache.mahout.clustering.display.DisplayKMeans;
 import org.apache.mahout.clustering.kmeans.KMeansDriver;
 import org.apache.mahout.common.HadoopUtil;
 import org.apache.mahout.common.Pair;
 import org.apache.mahout.common.StringTuple;
 import org.apache.mahout.common.distance.EuclideanDistanceMeasure;
 import org.apache.mahout.common.distance.TanimotoDistanceMeasure;
 import org.apache.mahout.math.Vector.Element;
 import org.apache.mahout.math.VectorWritable;
 import org.apache.mahout.vectorizer.DictionaryVectorizer;
 import org.apache.mahout.vectorizer.DocumentProcessor;
 import org.apache.mahout.vectorizer.tfidf.TFIDFConverter;


 public class MahoutReutersKMeans {

 private static int minSupport = 2;
 private static int maxNGramSize = 2;
 private static float minLLRValue = 1;
 private static float normPower = 2;
 private static boolean logNormalize = true;
 private static int numReducers = 1;
 private static int chunkSizeInMegabytes = 200;
 private static boolean sequentialAccess = true;
 private static boolean namedVectors = false;

 private static int minDf = 5;

 private static long maxDF = 95;

 /**
  * @param args
  * @throws IOException
  * @throws InterruptedException
  * @throws ClassNotFoundException
  */
 public static void main(String[] args) throws IOException,
 ClassNotFoundException, InterruptedException {

 Configuration conf = new Configuration();
 String HADOOP_HOME = System.getenv("HADOOP_PREFIX");

 conf.addResource(new Path(HADOOP_HOME, "conf/core-site.xml"));
 conf.addResource(new Path(HADOOP_HOME, "conf/hdfs-site.xml"));
 conf.addResource(new Path(HADOOP_HOME, "conf/mapred-site.xml"));

 FileSystem fs  = FileSystem.get(conf);

 Path inputDir = new Path("reuters-seqfiles");
 String outputDir = "reuters-kmeans-try";
 HadoopUtil.delete(conf, new Path(outputDir));
 StandardAnalyzer analyzer = new
 StandardAnalyzer(Version.LUCENE_43);
 Path tokenizedPath = new
 Path(DocumentProcessor.TOKENIZED_DOCUMENT_OUTPUT_FOLDER);
 DocumentProcessor.tokenizeDocuments(inputDir,
 analyzer.getClass().asSubclass(Analyzer.class), tokenizedPath, conf);


 DictionaryVectorizer.createTermFrequencyVectors(tokenizedPath, new
 Path(outputDir),
 DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER, conf,
 minSupport , maxNGramSize, minLLRValue, normPower , logNormalize,
 numReducers , chunkSizeInMegabytes , sequentialAccess, namedVectors);


 Pair<Long[], List<Path>> features = TFIDFConverter.calculateDF(new
 Path(outputDir,
 DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER), new
 Path(outputDir), conf, chunkSizeInMegabytes);
 TFIDFConverter.processTfIdf(new Path(outputDir,
 DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER), new
 Path(outputDir), conf, features, minDf , maxDF , normPower, logNormalize,
 sequentialAccess, false, numReducers);


 //Path tfidfVectorsPath = new Path(outputDir, "tfidf-vectors");
 //Path canopyCentroidsPath = new Path(outputDir,
 

Re: TFIDFConverter generates empty tfidf-vectors

2013-09-04 Thread Taner Diler
mahout seq2sparse -i reuters-seqfiles/ -o reuters-kmeans-try -chunk 200 -wt
tfidf -s 2 -md 5 -x 95 -ng 2 -ml 50 -n 2 -seq

this command works well.
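
For comparison, here is how those flags appear to line up with the Java
parameters (my reading of SparseVectorsFromSequenceFiles in Mahout 0.8; treat
it as an assumption, not a reference):

// Assumed seq2sparse flag -> Java parameter mapping (Mahout 0.8):
int minSupport = 2;              // -s 2
int minDf = 5;                   // -md 5
int maxDFPercent = 95;           // -x 95
int maxNGramSize = 2;            // -ng 2
float minLLRValue = 50;          // -ml 50
float normPower = 2;             // -n 2
int chunkSizeInMegabytes = 200;  // -chunk 200
boolean sequentialAccess = true; // -seq
// With -wt tfidf and no -lnorm, seq2sparse appears to create the tf vectors
// unnormalized (normPower = -1.0f, logNormalize = false) and to apply the
// norm only in TFIDFConverter.processTfIdf -- unlike the Java code in this
// thread, which normalizes and log-normalizes the tf vectors as well.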

Gokhan, I changed the minLLR value to 1.0 in Java, but the result is the same:
empty tfidf-vectors.


On Tue, Sep 3, 2013 at 10:47 AM, Taner Diler taner.di...@gmail.com wrote:

 Gokhan, I tried it from the command line and it works. I will send the command
 so we can compare the command-line parameters to the TFIDFConverter params.

 Suneel, I had checked the seqfiles. I didn't see any problem in the other
 generated seqfiles, but I will check again and send samples from each seqfile.


 On Sun, Sep 1, 2013 at 11:02 PM, Gokhan Capan gkhn...@gmail.com wrote:

 Suneel is right indeed. I assumed that everything performed prior to
 vector
 generation is done correctly.

 By the way, if the suggestions do not work, could you try running
 seq2sparse from commandline with the same arguments and see if that works
 well?

 On Sun, Sep 1, 2013 at 7:23 PM, Suneel Marthi suneel_mar...@yahoo.com
 wrote:

  I would first check to see if the input 'seqfiles' for TFIDFGenerator
 have
  any meat in them.
  This could also happen if the input seqfiles are empty.


 
 
  
   From: Taner Diler taner.di...@gmail.com
  To: user@mahout.apache.org
  Sent: Sunday, September 1, 2013 2:24 AM
  Subject: TFIDFConverter generates empty tfidf-vectors
 
 
  Hi all,
 
  I'm trying to run the Reuters KMeans example in Java, but TFIDFConverter
  generates tfidf-vectors as empty. How can I fix that?
 
  private static int minSupport = 2;
  private static int maxNGramSize = 2;
  private static float minLLRValue = 50;
  private static float normPower = 2;
  private static boolean logNormalize = true;
  private static int numReducers = 1;
  private static int chunkSizeInMegabytes = 200;
  private static boolean sequentialAccess = true;
  private static boolean namedVectors = false;
  private static int minDf = 5;
  private static long maxDF = 95;
 
  Path inputDir = new Path("reuters-seqfiles");
  String outputDir = "reuters-kmeans-try";
  HadoopUtil.delete(conf, new Path(outputDir));
  StandardAnalyzer analyzer = new
  StandardAnalyzer(Version.LUCENE_43);
  Path tokenizedPath = new
  Path(DocumentProcessor.TOKENIZED_DOCUMENT_OUTPUT_FOLDER);
  DocumentProcessor.tokenizeDocuments(inputDir,
  analyzer.getClass().asSubclass(Analyzer.class), tokenizedPath, conf);
 
 
  DictionaryVectorizer.createTermFrequencyVectors(tokenizedPath,
 new
  Path(outputDir),
  DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER,
 conf,
  minSupport , maxNGramSize, minLLRValue, normPower , logNormalize,
  numReducers , chunkSizeInMegabytes , sequentialAccess, namedVectors);
 
 
  Pair<Long[], List<Path>> features =
 TFIDFConverter.calculateDF(new
  Path(outputDir,
  DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER), new
  Path(outputDir), conf, chunkSizeInMegabytes);
  TFIDFConverter.processTfIdf(new Path(outputDir,
  DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER), new
  Path(outputDir), conf, features, minDf , maxDF , normPower,
 logNormalize,
  sequentialAccess, false, numReducers);
 





Re: TFIDFConverter generates empty tfidf-vectors

2013-09-04 Thread Taner Diler
Suneel, samples from generated seqfiles:

df-count

Key: -1: Value: 21578
Key: 0: Value: 43
Key: 1: Value: 2
Key: 2: Value: 2
Key: 3: Value: 2
...

tf-vectors

Key class: class org.apache.hadoop.io.Text Value Class: class
org.apache.mahout.math.VectorWritable
Key: /reut2-000.sgm-0.txt: Value:
{62:0.024521886354905213,222:0.024521886354905213,291:0.024521886354905213,1411:0.024521886354905213,1421:0.024521886354905213,1451:0.024521886
354905213,1456:0.024521886354905213

wordcount/ngrams

Key class: class org.apache.hadoop.io.Text Value Class: class
org.apache.hadoop.io.DoubleWritable
Key: 0: Value: 166.0
Key: 0.003: Value: 2.0
Key: 0.006913: Value: 2.0
Key: 0.007050: Value: 2.0

wordcount/subgrams

Key class: class org.apache.mahout.vectorizer.collocations.llr.Gram Value
Class: class org.apache.mahout.vectorizer.collocations.llr.Gram
Key: '0 0'[n]:12: Value: '0'[h]:166
Key: '0 25'[n]:2: Value: '0'[h]:166
Key: '0 92'[n]:107: Value: '0'[h]:166

frequency.file-0

Key class: class org.apache.hadoop.io.IntWritable Value Class: class
org.apache.hadoop.io.LongWritable
Key: 0: Value: 43
Key: 1: Value: 2
Key: 2: Value: 2
Key: 3: Value: 2
Key: 4: Value: 9
Key: 5: Value: 4


dictionary.file-0

Key class: class org.apache.hadoop.io.Text Value Class: class
org.apache.hadoop.io.IntWritable
Key: 0: Value: 0
Key: 0.003: Value: 1
Key: 0.006913: Value: 2
Key: 0.007050: Value: 3
Key: 0.01: Value: 4
Key: 0.02: Value: 5
Key: 0.025: Value: 6





On Wed, Sep 4, 2013 at 12:45 PM, Taner Diler taner.di...@gmail.com wrote:

 mahout seq2sparse -i reuters-seqfiles/ -o reuters-kmeans-try -chunk 200
 -wt tfidf -s 2 -md 5 -x 95 -ng 2 -ml 50 -n 2 -seq

 this command works well.

 Gokhan, I changed the minLLR value to 1.0 in Java, but the result is the same:
 empty tfidf-vectors.


 On Tue, Sep 3, 2013 at 10:47 AM, Taner Diler taner.di...@gmail.com wrote:

 Gokhan, I tried it from the command line and it works. I will send the command
 so we can compare the command-line parameters to the TFIDFConverter params.

 Suneel, I had checked the seqfiles. I didn't see any problem in the other
 generated seqfiles, but I will check again and send samples from each seqfile.


 On Sun, Sep 1, 2013 at 11:02 PM, Gokhan Capan gkhn...@gmail.com wrote:

 Suneel is right indeed. I assumed that everything performed prior to
 vector
 generation is done correctly.

 By the way, if the suggestions do not work, could you try running
 seq2sparse from commandline with the same arguments and see if that works
 well?

 On Sun, Sep 1, 2013 at 7:23 PM, Suneel Marthi suneel_mar...@yahoo.com
 wrote:

  I would first check to see if the input 'seqfiles' for TFIDFGenerator
 have
  any meat in them.
  This could also happen if the input seqfiles are empty.


 
 
  
   From: Taner Diler taner.di...@gmail.com
  To: user@mahout.apache.org
  Sent: Sunday, September 1, 2013 2:24 AM
  Subject: TFIDFConverter generates empty tfidf-vectors
 
 
  Hi all,
 
  I'm trying to run the Reuters KMeans example in Java, but TFIDFConverter
  generates tfidf-vectors as empty. How can I fix that?
 
  private static int minSupport = 2;
  private static int maxNGramSize = 2;
  private static float minLLRValue = 50;
  private static float normPower = 2;
  private static boolean logNormalize = true;
  private static int numReducers = 1;
  private static int chunkSizeInMegabytes = 200;
  private static boolean sequentialAccess = true;
  private static boolean namedVectors = false;
  private static int minDf = 5;
  private static long maxDF = 95;
 
  Path inputDir = new Path("reuters-seqfiles");
  String outputDir = "reuters-kmeans-try";
  HadoopUtil.delete(conf, new Path(outputDir));
  StandardAnalyzer analyzer = new
  StandardAnalyzer(Version.LUCENE_43);
  Path tokenizedPath = new
  Path(DocumentProcessor.TOKENIZED_DOCUMENT_OUTPUT_FOLDER);
  DocumentProcessor.tokenizeDocuments(inputDir,
  analyzer.getClass().asSubclass(Analyzer.class), tokenizedPath, conf);
 
 
  DictionaryVectorizer.createTermFrequencyVectors(tokenizedPath,
 new
  Path(outputDir),
  DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER,
 conf,
  minSupport , maxNGramSize, minLLRValue, normPower , logNormalize,
  numReducers , chunkSizeInMegabytes , sequentialAccess, namedVectors);
 
 
  Pair<Long[], List<Path>> features =
 TFIDFConverter.calculateDF(new
  Path(outputDir,
  DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER),
 new
  Path(outputDir), conf, chunkSizeInMegabytes);
  TFIDFConverter.processTfIdf(new Path(outputDir,
  DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER),
 new
  Path(outputDir), conf, features, minDf , maxDF , normPower,
 logNormalize,
  sequentialAccess, false, numReducers);
 






Re: TFIDFConverter generates empty tfidf-vectors

2013-09-04 Thread Gokhan Capan
Taner,

A few questions:

Is there a specific reason not to consider using seq2sparse directly? (You
can edit seq2sparse.props to avoid passing command-line arguments every time
you run it, if that is the case.)

The Java code you attached seems to do the same thing as
SparseVectorsFromSequenceFiles#run(String[]), which is also the method
called when you run seq2sparse. I'm going to debug it anyway.

And I would like to know how you run the Java code. Does your main class
extend AbstractJob to make it runnable using bin/mahout? And does it have
a main method that submits your job to your Hadoop cluster? Are you using
the hadoop jar command to run it?
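
For reference, the usual shape is something like this (a minimal, untested
sketch of the AbstractJob pattern; the class name and body are illustrative,
not your actual code):

import org.apache.hadoop.util.ToolRunner;
import org.apache.mahout.common.AbstractJob;

// AbstractJob implements Tool, so the same class can be launched from
// bin/mahout or via 'hadoop jar', with ToolRunner injecting the Hadoop
// configuration parsed from the command line.
public class MyVectorizerJob extends AbstractJob {

  @Override
  public int run(String[] args) throws Exception {
    addInputOption();   // standard -i / --input option
    addOutputOption();  // standard -o / --output option
    if (parseArguments(args) == null) {
      return -1;        // bad arguments; usage was printed
    }
    // ... call DocumentProcessor / DictionaryVectorizer / TFIDFConverter
    // here, using getConf(), getInputPath() and getOutputPath() ...
    return 0;
  }

  public static void main(String[] args) throws Exception {
    ToolRunner.run(new MyVectorizerJob(), args);
  }
}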

Best

Gokhan


On Wed, Sep 4, 2013 at 1:15 PM, Taner Diler taner.di...@gmail.com wrote:

 Suneel, samples from generated seqfiles:

 df-count

 Key: -1: Value: 21578
 Key: 0: Value: 43
 Key: 1: Value: 2
 Key: 2: Value: 2
 Key: 3: Value: 2
 ...

 tf-vectors

 Key class: class org.apache.hadoop.io.Text Value Class: class
 org.apache.mahout.math.VectorWritable
 Key: /reut2-000.sgm-0.txt: Value:

 {62:0.024521886354905213,222:0.024521886354905213,291:0.024521886354905213,1411:0.024521886354905213,1421:0.024521886354905213,1451:0.024521886
 354905213,1456:0.024521886354905213

 wordcount/ngrams

 Key class: class org.apache.hadoop.io.Text Value Class: class
 org.apache.hadoop.io.DoubleWritable
 Key: 0: Value: 166.0
 Key: 0.003: Value: 2.0
 Key: 0.006913: Value: 2.0
 Key: 0.007050: Value: 2.0

 wordcount/subgrams

 Key class: class org.apache.mahout.vectorizer.collocations.llr.Gram Value
 Class: class org.apache.mahout.vectorizer.collocations.llr.Gram
 Key: '0 0'[n]:12: Value: '0'[h]:166
 Key: '0 25'[n]:2: Value: '0'[h]:166
 Key: '0 92'[n]:107: Value: '0'[h]:166

 frequency.file-0

 Key class: class org.apache.hadoop.io.IntWritable Value Class: class
 org.apache.hadoop.io.LongWritable
 Key: 0: Value: 43
 Key: 1: Value: 2
 Key: 2: Value: 2
 Key: 3: Value: 2
 Key: 4: Value: 9
 Key: 5: Value: 4


 dictionary.file-0

 Key class: class org.apache.hadoop.io.Text Value Class: class
 org.apache.hadoop.io.IntWritable
 Key: 0: Value: 0
 Key: 0.003: Value: 1
 Key: 0.006913: Value: 2
 Key: 0.007050: Value: 3
 Key: 0.01: Value: 4
 Key: 0.02: Value: 5
 Key: 0.025: Value: 6





 On Wed, Sep 4, 2013 at 12:45 PM, Taner Diler taner.di...@gmail.com
 wrote:

  mahout seq2sparse -i reuters-seqfiles/ -o reuters-kmeans-try -chunk 200
  -wt tfidf -s 2 -md 5 -x 95 -ng 2 -ml 50 -n 2 -seq
 
  this command works well.
 
  Gokhan, I changed the minLLR value to 1.0 in Java, but the result is the same:
  empty tfidf-vectors.
 
 
  On Tue, Sep 3, 2013 at 10:47 AM, Taner Diler taner.di...@gmail.com
 wrote:
 
  Gokhan, I tried it from the command line and it works. I will send the command
  so we can compare the command-line parameters to the TFIDFConverter params.

  Suneel, I had checked the seqfiles. I didn't see any problem in the other
  generated seqfiles, but I will check again and send samples from each
  seqfile.
 
 
  On Sun, Sep 1, 2013 at 11:02 PM, Gokhan Capan gkhn...@gmail.com
 wrote:
 
  Suneel is right indeed. I assumed that everything performed prior to
  vector
  generation is done correctly.
 
  By the way, if the suggestions do not work, could you try running
  seq2sparse from commandline with the same arguments and see if that
 works
  well?
 
  On Sun, Sep 1, 2013 at 7:23 PM, Suneel Marthi suneel_mar...@yahoo.com
  wrote:
 
   I would first check to see if the input 'seqfiles' for TFIDFGenerator
  have
   any meat in them.
   This could also happen if the input seqfiles are empty.
 
 
  
  
   
From: Taner Diler taner.di...@gmail.com
   To: user@mahout.apache.org
   Sent: Sunday, September 1, 2013 2:24 AM
   Subject: TFIDFConverter generates empty tfidf-vectors
  
  
   Hi all,
  
   I'm trying to run the Reuters KMeans example in Java, but TFIDFConverter
   generates tfidf-vectors as empty. How can I fix that?
  
   private static int minSupport = 2;
   private static int maxNGramSize = 2;
   private static float minLLRValue = 50;
   private static float normPower = 2;
   private static boolean logNormalize = true;
   private static int numReducers = 1;
   private static int chunkSizeInMegabytes = 200;
   private static boolean sequentialAccess = true;
   private static boolean namedVectors = false;
   private static int minDf = 5;
   private static long maxDF = 95;
  
   Path inputDir = new Path("reuters-seqfiles");
   String outputDir = "reuters-kmeans-try";
   HadoopUtil.delete(conf, new Path(outputDir));
   StandardAnalyzer analyzer = new
   StandardAnalyzer(Version.LUCENE_43);
   Path tokenizedPath = new
   Path(DocumentProcessor.TOKENIZED_DOCUMENT_OUTPUT_FOLDER);
   DocumentProcessor.tokenizeDocuments(inputDir,
   analyzer.getClass().asSubclass(Analyzer.class), tokenizedPath, conf);
  
  
  
 DictionaryVectorizer.createTermFrequencyVectors(tokenizedPath,
  new
   Path(outputDir),
   

Re: TFIDFConverter generates empty tfidf-vectors

2013-09-04 Thread Taner Diler
Actually, my real motivation was to visualize the Reuters vectors like the
DisplayKMeans example, then apply the same approach to the web content I've
collected, and additionally to discover what I can do with the generated tfidf
vectors. But the TFIDF step doesn't work, and why?

There is one main class that doesn't extend AbstractJob. Yes, it has a main
method that executes all the steps. And I'm trying to implement a sample that
appears in the Mahout wiki and elsewhere on the net. In Eclipse, I've just
added mahout-0.8-job.jar (you know, it includes all dependent packages) and
hadoop-core-1.2.0.jar.

import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.SequenceFileAsBinaryInputFormat;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.util.Version;
import org.apache.mahout.clustering.Cluster;
import org.apache.mahout.clustering.canopy.CanopyDriver;
import org.apache.mahout.clustering.classify.WeightedVectorWritable;
import org.apache.mahout.clustering.display.DisplayKMeans;
import org.apache.mahout.clustering.kmeans.KMeansDriver;
import org.apache.mahout.common.HadoopUtil;
import org.apache.mahout.common.Pair;
import org.apache.mahout.common.StringTuple;
import org.apache.mahout.common.distance.EuclideanDistanceMeasure;
import org.apache.mahout.common.distance.TanimotoDistanceMeasure;
import org.apache.mahout.math.Vector.Element;
import org.apache.mahout.math.VectorWritable;
import org.apache.mahout.vectorizer.DictionaryVectorizer;
import org.apache.mahout.vectorizer.DocumentProcessor;
import org.apache.mahout.vectorizer.tfidf.TFIDFConverter;


public class MahoutReutersKMeans {

private static int minSupport = 2;
private static int maxNGramSize = 2;
private static float minLLRValue = 1;
private static float normPower = 2;
private static boolean logNormalize = true;
private static int numReducers = 1;
private static int chunkSizeInMegabytes = 200;
private static boolean sequentialAccess = true;
private static boolean namedVectors = false;

private static int minDf = 5;

private static long maxDF = 95;

/**
 * @param args
 * @throws IOException
 * @throws InterruptedException
 * @throws ClassNotFoundException
 */
public static void main(String[] args) throws IOException,
ClassNotFoundException, InterruptedException {

Configuration conf = new Configuration();
String HADOOP_HOME = System.getenv("HADOOP_PREFIX");

conf.addResource(new Path(HADOOP_HOME, "conf/core-site.xml"));
conf.addResource(new Path(HADOOP_HOME, "conf/hdfs-site.xml"));
conf.addResource(new Path(HADOOP_HOME, "conf/mapred-site.xml"));

FileSystem fs  = FileSystem.get(conf);

Path inputDir = new Path("reuters-seqfiles");
String outputDir = "reuters-kmeans-try";
HadoopUtil.delete(conf, new Path(outputDir));
StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_43);
Path tokenizedPath = new
Path(DocumentProcessor.TOKENIZED_DOCUMENT_OUTPUT_FOLDER);
DocumentProcessor.tokenizeDocuments(inputDir,
analyzer.getClass().asSubclass(Analyzer.class), tokenizedPath, conf);


DictionaryVectorizer.createTermFrequencyVectors(tokenizedPath, new
Path(outputDir),
DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER, conf,
minSupport , maxNGramSize, minLLRValue, normPower , logNormalize,
numReducers , chunkSizeInMegabytes , sequentialAccess, namedVectors);


Pair<Long[], List<Path>> features = TFIDFConverter.calculateDF(new
Path(outputDir,
DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER), new
Path(outputDir), conf, chunkSizeInMegabytes);
TFIDFConverter.processTfIdf(new Path(outputDir,
DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER), new
Path(outputDir), conf, features, minDf , maxDF , normPower, logNormalize,
sequentialAccess, false, numReducers);


//Path tfidfVectorsPath = new Path(outputDir, "tfidf-vectors");
//Path canopyCentroidsPath = new Path(outputDir,
"canopy-centroids");
//Path clustersPath = new Path(outputDir, "clusters");
//
//CanopyDriver.run(conf, tfidfVectorsPath, canopyCentroidsPath, new
EuclideanDistanceMeasure(), 250, 120, false, 0.01, false);
//KMeansDriver.run(conf, tfidfVectorsPath, new
Path(canopyCentroidsPath, "clusters-0-final"), clustersPath, new
TanimotoDistanceMeasure(), 0.01, 20, true, 0.01, false);
//
//SequenceFile.Reader reader = new SequenceFile.Reader(fs, new

Re: TFIDFConverter generates empty tfidf-vectors

2013-09-03 Thread Taner Diler
Gokhan, I tried it from the command line and it works. I will send the command
so we can compare the command-line parameters to the TFIDFConverter params.

Suneel, I had checked the seqfiles. I didn't see any problem in the other
generated seqfiles, but I will check again and send samples from each seqfile.


On Sun, Sep 1, 2013 at 11:02 PM, Gokhan Capan gkhn...@gmail.com wrote:

 Suneel is right indeed. I assumed that everything performed prior to vector
 generation is done correctly.

 By the way, if the suggestions do not work, could you try running
 seq2sparse from commandline with the same arguments and see if that works
 well?

 On Sun, Sep 1, 2013 at 7:23 PM, Suneel Marthi suneel_mar...@yahoo.com
 wrote:

  I would first check to see if the input 'seqfiles' for TFIDFGenerator
 have
  any meat in them.
  This could also happen if the input seqfiles are empty.


 
 
  
   From: Taner Diler taner.di...@gmail.com
  To: user@mahout.apache.org
  Sent: Sunday, September 1, 2013 2:24 AM
  Subject: TFIDFConverter generates empty tfidf-vectors
 
 
  Hi all,
 
  I'm trying to run the Reuters KMeans example in Java, but TFIDFConverter
  generates tfidf-vectors as empty. How can I fix that?
 
  private static int minSupport = 2;
  private static int maxNGramSize = 2;
  private static float minLLRValue = 50;
  private static float normPower = 2;
  private static boolean logNormalize = true;
  private static int numReducers = 1;
  private static int chunkSizeInMegabytes = 200;
  private static boolean sequentialAccess = true;
  private static boolean namedVectors = false;
  private static int minDf = 5;
  private static long maxDF = 95;
 
  Path inputDir = new Path("reuters-seqfiles");
  String outputDir = "reuters-kmeans-try";
  HadoopUtil.delete(conf, new Path(outputDir));
  StandardAnalyzer analyzer = new
  StandardAnalyzer(Version.LUCENE_43);
  Path tokenizedPath = new
  Path(DocumentProcessor.TOKENIZED_DOCUMENT_OUTPUT_FOLDER);
  DocumentProcessor.tokenizeDocuments(inputDir,
  analyzer.getClass().asSubclass(Analyzer.class), tokenizedPath, conf);
 
 
  DictionaryVectorizer.createTermFrequencyVectors(tokenizedPath,
 new
  Path(outputDir),
  DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER, conf,
  minSupport , maxNGramSize, minLLRValue, normPower , logNormalize,
  numReducers , chunkSizeInMegabytes , sequentialAccess, namedVectors);
 
 
  Pair<Long[], List<Path>> features =
 TFIDFConverter.calculateDF(new
  Path(outputDir,
  DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER), new
  Path(outputDir), conf, chunkSizeInMegabytes);
  TFIDFConverter.processTfIdf(new Path(outputDir,
  DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER), new
  Path(outputDir), conf, features, minDf , maxDF , normPower, logNormalize,
  sequentialAccess, false, numReducers);
 



Re: TFIDFConverter generates empty tfidf-vectors

2013-09-01 Thread Gokhan Capan
Taner,

Could you try reducing the minLLR value? (It is not a normalized measure, but
its default value is 1.0.)
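
In the code from your message, that would be something like:

    private static float minLLRValue = 1.0f;  // was 50; 1.0 is the default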

Best,
Gokhan


On Sun, Sep 1, 2013 at 9:24 AM, Taner Diler taner.di...@gmail.com wrote:

 Hi all,

 I'm trying to run the Reuters KMeans example in Java, but TFIDFConverter
 generates tfidf-vectors as empty. How can I fix that?

 private static int minSupport = 2;
 private static int maxNGramSize = 2;
 private static float minLLRValue = 50;
 private static float normPower = 2;
 private static boolean logNormalize = true;
 private static int numReducers = 1;
 private static int chunkSizeInMegabytes = 200;
 private static boolean sequentialAccess = true;
 private static boolean namedVectors = false;
 private static int minDf = 5;
 private static long maxDF = 95;

 Path inputDir = new Path("reuters-seqfiles");
 String outputDir = "reuters-kmeans-try";
 HadoopUtil.delete(conf, new Path(outputDir));
 StandardAnalyzer analyzer = new
 StandardAnalyzer(Version.LUCENE_43);
 Path tokenizedPath = new
 Path(DocumentProcessor.TOKENIZED_DOCUMENT_OUTPUT_FOLDER);
 DocumentProcessor.tokenizeDocuments(inputDir,
 analyzer.getClass().asSubclass(Analyzer.class), tokenizedPath, conf);


 DictionaryVectorizer.createTermFrequencyVectors(tokenizedPath, new
 Path(outputDir),
 DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER, conf,
 minSupport , maxNGramSize, minLLRValue, normPower , logNormalize,
 numReducers , chunkSizeInMegabytes , sequentialAccess, namedVectors);


 Pair<Long[], List<Path>> features = TFIDFConverter.calculateDF(new
 Path(outputDir,
 DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER), new
 Path(outputDir), conf, chunkSizeInMegabytes);
 TFIDFConverter.processTfIdf(new Path(outputDir,
 DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER), new
 Path(outputDir), conf, features, minDf , maxDF , normPower, logNormalize,
 sequentialAccess, false, numReducers);



Re: TFIDFConverter generates empty tfidf-vectors

2013-09-01 Thread Suneel Marthi
I would first check to see if the input 'seqfiles' for TFIDFGenerator have any 
meat in them. 
This could also happen if the input seqfiles are empty.
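
A quick way to eyeball them (a minimal sketch against the Hadoop 1.x API; the
part-file name is hypothetical, and 'mahout seqdumper -i <path>' does the same
from the shell):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

public class SeqFilePeek {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Hypothetical part file under the input directory:
    Path path = new Path("reuters-seqfiles/chunk-0");
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
    Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
    Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
    int shown = 0;
    // Print the first few records; an immediate EOF means the file has no meat.
    while (reader.next(key, value) && shown++ < 5) {
      System.out.println("Key: " + key + " Value: " + value);
    }
    reader.close();
  }
}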




 From: Taner Diler taner.di...@gmail.com
To: user@mahout.apache.org 
Sent: Sunday, September 1, 2013 2:24 AM
Subject: TFIDFConverter generates empty tfidf-vectors
 

Hi all,

I'm trying to run the Reuters KMeans example in Java, but TFIDFConverter
generates tfidf-vectors as empty. How can I fix that?

    private static int minSupport = 2;
    private static int maxNGramSize = 2;
    private static float minLLRValue = 50;
    private static float normPower = 2;
    private static boolean logNormalize = true;
    private static int numReducers = 1;
    private static int chunkSizeInMegabytes = 200;
    private static boolean sequentialAccess = true;
    private static boolean namedVectors = false;
    private static int minDf = 5;
    private static long maxDF = 95;

        Path inputDir = new Path("reuters-seqfiles");
        String outputDir = "reuters-kmeans-try";
        HadoopUtil.delete(conf, new Path(outputDir));
        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_43);
        Path tokenizedPath = new
Path(DocumentProcessor.TOKENIZED_DOCUMENT_OUTPUT_FOLDER);
        DocumentProcessor.tokenizeDocuments(inputDir,
analyzer.getClass().asSubclass(Analyzer.class), tokenizedPath, conf);


        DictionaryVectorizer.createTermFrequencyVectors(tokenizedPath, new
Path(outputDir),
                DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER, conf,
minSupport , maxNGramSize, minLLRValue, normPower , logNormalize,
numReducers , chunkSizeInMegabytes , sequentialAccess, namedVectors);


        Pair<Long[], List<Path>> features = TFIDFConverter.calculateDF(new
Path(outputDir,
                DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER), new
Path(outputDir), conf, chunkSizeInMegabytes);
        TFIDFConverter.processTfIdf(new Path(outputDir,
                DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER), new
Path(outputDir), conf, features, minDf , maxDF , normPower, logNormalize,
sequentialAccess, false, numReducers);

Re: TFIDFConverter generates empty tfidf-vectors

2013-09-01 Thread Gokhan Capan
Suneel is right indeed. I assumed that everything performed prior to vector
generation is done correctly.

By the way, if the suggestions do not work, could you try running
seq2sparse from commandline with the same arguments and see if that works
well?

On Sun, Sep 1, 2013 at 7:23 PM, Suneel Marthi suneel_mar...@yahoo.com wrote:

 I would first check to see if the input 'seqfiles' for TFIDFGenerator have
 any meat in them.
 This could also happen if the input seqfiles are empty.




 
  From: Taner Diler taner.di...@gmail.com
 To: user@mahout.apache.org
 Sent: Sunday, September 1, 2013 2:24 AM
 Subject: TFIDFConverter generates empty tfidf-vectors


 Hi all,

 I'm trying to run the Reuters KMeans example in Java, but TFIDFConverter
 generates tfidf-vectors as empty. How can I fix that?

 private static int minSupport = 2;
 private static int maxNGramSize = 2;
 private static float minLLRValue = 50;
 private static float normPower = 2;
 private static boolean logNormalize = true;
 private static int numReducers = 1;
 private static int chunkSizeInMegabytes = 200;
 private static boolean sequentialAccess = true;
 private static boolean namedVectors = false;
 private static int minDf = 5;
 private static long maxDF = 95;

 Path inputDir = new Path("reuters-seqfiles");
 String outputDir = "reuters-kmeans-try";
 HadoopUtil.delete(conf, new Path(outputDir));
 StandardAnalyzer analyzer = new
 StandardAnalyzer(Version.LUCENE_43);
 Path tokenizedPath = new
 Path(DocumentProcessor.TOKENIZED_DOCUMENT_OUTPUT_FOLDER);
 DocumentProcessor.tokenizeDocuments(inputDir,
 analyzer.getClass().asSubclass(Analyzer.class), tokenizedPath, conf);


 DictionaryVectorizer.createTermFrequencyVectors(tokenizedPath, new
 Path(outputDir),
 DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER, conf,
 minSupport , maxNGramSize, minLLRValue, normPower , logNormalize,
 numReducers , chunkSizeInMegabytes , sequentialAccess, namedVectors);


 Pair<Long[], List<Path>> features = TFIDFConverter.calculateDF(new
 Path(outputDir,
 DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER), new
 Path(outputDir), conf, chunkSizeInMegabytes);
 TFIDFConverter.processTfIdf(new Path(outputDir,
 DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER), new
 Path(outputDir), conf, features, minDf , maxDF , normPower, logNormalize,
 sequentialAccess, false, numReducers);