TFIDFConverter generates empty tfidf-vectors
Hi all,

I am trying to run the Reuters KMeans example in Java, but TFIDFConverter generates the tfidf-vectors as empty. How can I fix that?

    private static int minSupport = 2;
    private static int maxNGramSize = 2;
    private static float minLLRValue = 50;
    private static float normPower = 2;
    private static boolean logNormalize = true;
    private static int numReducers = 1;
    private static int chunkSizeInMegabytes = 200;
    private static boolean sequentialAccess = true;
    private static boolean namedVectors = false;
    private static int minDf = 5;
    private static long maxDF = 95;

    Path inputDir = new Path("reuters-seqfiles");
    String outputDir = "reuters-kmeans-try";
    HadoopUtil.delete(conf, new Path(outputDir));

    StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_43);
    Path tokenizedPath = new Path(DocumentProcessor.TOKENIZED_DOCUMENT_OUTPUT_FOLDER);
    DocumentProcessor.tokenizeDocuments(inputDir,
        analyzer.getClass().asSubclass(Analyzer.class), tokenizedPath, conf);

    DictionaryVectorizer.createTermFrequencyVectors(tokenizedPath,
        new Path(outputDir), DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER,
        conf, minSupport, maxNGramSize, minLLRValue, normPower, logNormalize,
        numReducers, chunkSizeInMegabytes, sequentialAccess, namedVectors);

    Pair<Long[], List<Path>> features = TFIDFConverter.calculateDF(
        new Path(outputDir, DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER),
        new Path(outputDir), conf, chunkSizeInMegabytes);
    TFIDFConverter.processTfIdf(
        new Path(outputDir, DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER),
        new Path(outputDir), conf, features, minDf, maxDF, normPower,
        logNormalize, sequentialAccess, false, numReducers);
Visualizing Reuters KMeans Clustering
Hi all,

How can I visualize the Reuters KMeans clustering results as in DisplayKMeans.java? Thanks.
Re: TFIDFConverter generates empty tfidf-vectors
Taner,

Could you try reducing the minLLR value? (It is not a normalized measure, but its default value is 1.0.)

Best,
Gokhan

On Sun, Sep 1, 2013 at 9:24 AM, Taner Diler taner.di...@gmail.com wrote:
Re: TFIDFConverter generates empty tfidf-vectors
I would first check to see if the input 'seqfiles' for TFIDFConverter have any meat in them. This could also happen if the input seqfiles are empty.

From: Taner Diler taner.di...@gmail.com
To: user@mahout.apache.org
Sent: Sunday, September 1, 2013 2:24 AM
Subject: TFIDFConverter generates empty tfidf-vectors
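[Editor's note: one quick way to make that check is Mahout's stock seqdumper utility, which prints the key/value pairs of a sequence file as text. This is a sketch; the paths are the ones used in the original snippet, and it assumes `bin/mahout` is on the path.]

```shell
# Dump the first records of the input sequence files. Empty output here
# means the problem is upstream of TFIDFConverter.
bin/mahout seqdumper -i reuters-seqfiles | head -n 40

# The intermediate outputs of the pipeline can be inspected the same way:
bin/mahout seqdumper -i tokenized-documents | head -n 40
bin/mahout seqdumper -i reuters-kmeans-try/tf-vectors | head -n 40
```

If the tf-vectors already come out empty, the tf-idf step cannot recover, so checking each stage in order narrows down where the documents are lost.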
Re: TFIDFConverter generates empty tfidf-vectors
Suneel is right indeed. I assumed that everything performed prior to vector generation was done correctly. By the way, if the suggestions do not work, could you try running seq2sparse from the command line with the same arguments and see if that works well?

On Sun, Sep 1, 2013 at 7:23 PM, Suneel Marthi suneel_mar...@yahoo.com wrote:
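[Editor's note: the suggested command-line comparison might look roughly like the following. This is a sketch mapping the Java parameters onto seq2sparse flags as named in Mahout 0.8; the flag names are worth double-checking with `bin/mahout seq2sparse --help`. Note `-ml` is set to 1.0 per the earlier suggestion rather than the original 50.]

```shell
# Approximate seq2sparse equivalent of the Java pipeline above.
bin/mahout seq2sparse \
  -i reuters-seqfiles \
  -o reuters-kmeans-try \
  -a org.apache.lucene.analysis.standard.StandardAnalyzer \
  -s 2 \
  -ng 2 \
  -ml 1.0 \
  -n 2 \
  -lnorm \
  -md 5 \
  -x 95 \
  -chunk 200 \
  -seq
```

If this run produces non-empty tfidf-vectors, the discrepancy is in the Java driver's arguments rather than in the data.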