mahout seq2sparse -i reuters-seqfiles/ -o reuters-kmeans-try -chunk 200 -wt tfidf -s 2 -md 5 -x 95 -ng 2 -ml 50 -n 2 -seq
this command works well. Gokhan, I changed the minLLR value to 1.0 in Java, but the result is the same: empty tfidf-vectors.

On Tue, Sep 3, 2013 at 10:47 AM, Taner Diler <taner.di...@gmail.com> wrote:

> Gokhan, I tried it from the command line and it works. I will send the
> command so we can compare the command-line parameters against the
> TFIDFConverter params.
>
> Suneel, I had checked the seqfiles. I didn't see any problem with the
> generated seqfiles, but I will check again and send samples from each.
>
> On Sun, Sep 1, 2013 at 11:02 PM, Gokhan Capan <gkhn...@gmail.com> wrote:
>
>> Suneel is right indeed. I assumed that everything performed prior to
>> vector generation was done correctly.
>>
>> By the way, if the suggestions do not work, could you try running
>> seq2sparse from the command line with the same arguments and see if
>> that works well?
>>
>> On Sun, Sep 1, 2013 at 7:23 PM, Suneel Marthi <suneel_mar...@yahoo.com> wrote:
>>
>>> I would first check to see if the input 'seqfiles' for TFIDFGenerator
>>> have any meat in them.
>>> This could also happen if the input seqfiles are empty.
>>>
>>> ________________________________
>>> From: Taner Diler <taner.di...@gmail.com>
>>> To: user@mahout.apache.org
>>> Sent: Sunday, September 1, 2013 2:24 AM
>>> Subject: TFIDFConverter generates empty tfidf-vectors
>>>
>>> Hi all,
>>>
>>> I try to run the Reuters KMeans example in Java, but TFIDFConverter
>>> generates tfidf-vectors as empty. How can I fix that?
>>>
>>> private static int minSupport = 2;
>>> private static int maxNGramSize = 2;
>>> private static float minLLRValue = 50;
>>> private static float normPower = 2;
>>> private static boolean logNormalize = true;
>>> private static int numReducers = 1;
>>> private static int chunkSizeInMegabytes = 200;
>>> private static boolean sequentialAccess = true;
>>> private static boolean namedVectors = false;
>>> private static int minDf = 5;
>>> private static long maxDF = 95;
>>>
>>> Path inputDir = new Path("reuters-seqfiles");
>>> String outputDir = "reuters-kmeans-try";
>>> HadoopUtil.delete(conf, new Path(outputDir));
>>> StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_43);
>>> Path tokenizedPath = new Path(DocumentProcessor.TOKENIZED_DOCUMENT_OUTPUT_FOLDER);
>>> DocumentProcessor.tokenizeDocuments(inputDir,
>>>     analyzer.getClass().asSubclass(Analyzer.class), tokenizedPath, conf);
>>>
>>> DictionaryVectorizer.createTermFrequencyVectors(tokenizedPath,
>>>     new Path(outputDir), DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER,
>>>     conf, minSupport, maxNGramSize, minLLRValue, normPower, logNormalize,
>>>     numReducers, chunkSizeInMegabytes, sequentialAccess, namedVectors);
>>>
>>> Pair<Long[], List<Path>> features = TFIDFConverter.calculateDF(
>>>     new Path(outputDir, DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER),
>>>     new Path(outputDir), conf, chunkSizeInMegabytes);
>>> TFIDFConverter.processTfIdf(
>>>     new Path(outputDir, DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER),
>>>     new Path(outputDir), conf, features, minDf, maxDF, normPower,
>>>     logNormalize, sequentialAccess, false, numReducers);
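For what it's worth, one discrepancy worth checking (this is an assumption on my part based on the parameter names, not something confirmed in this thread): the seq2sparse `-x 95` flag is a *percentage* of the corpus, while the `maxDF` argument to TFIDFConverter.processTfIdf looks like an *absolute* document count. If that's right, passing `maxDF = 95` directly would prune every term appearing in more than 95 documents, which on a corpus the size of Reuters could leave the tfidf-vectors nearly empty. The class and method below (`MaxDfConversion`, `maxDfCount`) are hypothetical helper names just to illustrate the conversion:

```java
// Hypothetical sketch: convert a max-document-frequency percentage
// (as seq2sparse's -x flag is interpreted) into the absolute document
// count that processTfIdf would expect, assuming maxDF is absolute.
public class MaxDfConversion {

    // numDocs would come from the corpus (e.g. features.getFirst()[0]
    // returned by TFIDFConverter.calculateDF -- an assumption here).
    static long maxDfCount(long numDocs, int maxDfPercent) {
        return (long) (maxDfPercent / 100.0 * numDocs);
    }

    public static void main(String[] args) {
        // Reuters-21578 has roughly 21,578 documents; with -x 95 the
        // effective cutoff would be about 20,499 documents, not 95.
        System.out.println(maxDfCount(21578, 95));
    }
}
```

So instead of `maxDF = 95`, something like `maxDfCount(numDocs, 95)` might reproduce what the command line does. If I'm wrong about the units, ignore this.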