TFIDFConverter generates empty tfidf-vectors

2013-09-01 Thread Taner Diler
Hi all,

I'm trying to run the Reuters KMeans example in Java, but TFIDFConverter generates
empty tfidf-vectors. How can I fix that?

    private static int minSupport = 2;
    private static int maxNGramSize = 2;
    private static float minLLRValue = 50;
    private static float normPower = 2;
    private static boolean logNormalize = true;
    private static int numReducers = 1;
    private static int chunkSizeInMegabytes = 200;
    private static boolean sequentialAccess = true;
    private static boolean namedVectors = false;
    private static int minDf = 5;
    private static long maxDF = 95;

        Path inputDir = new Path("reuters-seqfiles");
        String outputDir = "reuters-kmeans-try";
        HadoopUtil.delete(conf, new Path(outputDir));
        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_43);
        Path tokenizedPath =
                new Path(DocumentProcessor.TOKENIZED_DOCUMENT_OUTPUT_FOLDER);
        DocumentProcessor.tokenizeDocuments(inputDir,
                analyzer.getClass().asSubclass(Analyzer.class), tokenizedPath, conf);

        DictionaryVectorizer.createTermFrequencyVectors(tokenizedPath,
                new Path(outputDir), DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER,
                conf, minSupport, maxNGramSize, minLLRValue, normPower, logNormalize,
                numReducers, chunkSizeInMegabytes, sequentialAccess, namedVectors);

        Pair<Long[], List<Path>> features = TFIDFConverter.calculateDF(
                new Path(outputDir, DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER),
                new Path(outputDir), conf, chunkSizeInMegabytes);
        TFIDFConverter.processTfIdf(
                new Path(outputDir, DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER),
                new Path(outputDir), conf, features, minDf, maxDF, normPower,
                logNormalize, sequentialAccess, false, numReducers);


Visualizing Reuters KMeans Clustering

2013-09-01 Thread Taner Diler
Hi all,

How can I visualize Reuters KMeans Clustering as in DisplayKMeans.java?

Thanks.


Re: TFIDFConverter generates empty tfidf-vectors

2013-09-01 Thread Gokhan Capan
Taner,

Could you try reducing the minLLR value? (It is not a normalized measure; its
default value is 1.0.)
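
Concretely (a hypothetical tweak, using the field names from your snippet), that
would be:

```java
// Lower the LLR threshold for n-gram selection from 50 toward the Mahout
// default of 1.0; a high threshold can prune away most bigram features.
private static float minLLRValue = 1.0f;
```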

Best,
Gokhan


On Sun, Sep 1, 2013 at 9:24 AM, Taner Diler taner.di...@gmail.com wrote:



Re: TFIDFConverter generates empty tfidf-vectors

2013-09-01 Thread Suneel Marthi
I would first check to see if the input 'seqfiles' for TFIDFGenerator have any
meat in them. Empty tfidf-vectors are exactly what you get when the input
seqfiles are empty.
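
One quick way to check that (assuming the Mahout CLI is on your PATH; the input
path here matches the original post, and the flag names are from the Mahout
0.8-era seqdumper tool, so verify them with `mahout seqdumper --help`):

```shell
# Print only the record count of the input sequence files; a count of 0
# means the vectorization steps had nothing to work with.
mahout seqdumper -i reuters-seqfiles -c

# Or dump a few records to eyeball the actual keys and values.
mahout seqdumper -i reuters-seqfiles | head -n 20
```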




Re: TFIDFConverter generates empty tfidf-vectors

2013-09-01 Thread Gokhan Capan
Suneel is right, indeed; I had assumed that everything performed prior to vector
generation was done correctly.

By the way, if these suggestions do not work, could you try running seq2sparse
from the command line with the same arguments and see whether that works?
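
For reference, a seq2sparse invocation mirroring the parameters in the original
post might look like the following (flag names are from the Mahout 0.8-era
seq2sparse CLI; double-check them against `mahout seq2sparse --help` for your
version):

```shell
# Flags mirror the Java fields: -s=minSupport, -ng=maxNGramSize, -ml=minLLR,
# -n=norm power, -lnorm=log-normalize, -nr=numReducers, -chunk=chunk size (MB),
# -md=minDF, -x=maxDFPercent, -seq=sequential-access vectors, -wt=weighting.
mahout seq2sparse -i reuters-seqfiles -o reuters-kmeans-try -wt tfidf \
  -s 2 -ng 2 -ml 50 -n 2 -lnorm -nr 1 -chunk 200 -md 5 -x 95 -seq
```

If the command-line run produces non-empty tfidf-vectors, the difference between
its arguments and the Java calls should point at the misconfigured parameter.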

On Sun, Sep 1, 2013 at 7:23 PM, Suneel Marthi suneel_mar...@yahoo.com wrote:
