Suneel, here are samples from the generated seqfiles:

df-count

Key: -1: Value: 21578
Key: 0: Value: 43
Key: 1: Value: 2
Key: 2: Value: 2
Key: 3: Value: 2
...
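
(If I read the df-count convention right, the -1 key holds the total vector count, so the 21578 here matches the full Reuters-21578 collection; the remaining keys are per-term document frequencies.)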

tf-vectors

Key class: class org.apache.hadoop.io.Text Value Class: class org.apache.mahout.math.VectorWritable
Key: /reut2-000.sgm-0.txt: Value: {62:0.024521886354905213,222:0.024521886354905213,291:0.024521886354905213,1411:0.024521886354905213,1421:0.024521886354905213,1451:0.024521886354905213,1456:0.024521886354905213....

wordcount/ngrams

Key class: class org.apache.hadoop.io.Text Value Class: class org.apache.hadoop.io.DoubleWritable
Key: 0: Value: 166.0
Key: 0.003: Value: 2.0
Key: 0.006913: Value: 2.0
Key: 0.007050: Value: 2.0

wordcount/subgrams

Key class: class org.apache.mahout.vectorizer.collocations.llr.Gram Value Class: class org.apache.mahout.vectorizer.collocations.llr.Gram
Key: '0 0'[n]:12: Value: '0'[h]:166
Key: '0 25'[n]:2: Value: '0'[h]:166
Key: '0 92'[n]:107: Value: '0'[h]:166
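
(As far as I can tell from Gram's toString output, [n] marks an ngram and [h] its head unigram, each followed by its count; so '0 92'[n]:107 is the bigram "0 92" seen 107 times, paired with the head "0" seen 166 times.)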

frequency.file-0

Key class: class org.apache.hadoop.io.IntWritable Value Class: class org.apache.hadoop.io.LongWritable
Key: 0: Value: 43
Key: 1: Value: 2
Key: 2: Value: 2
Key: 3: Value: 2
Key: 4: Value: 9
Key: 5: Value: 4

dictionary.file-0

Key class: class org.apache.hadoop.io.Text Value Class: class org.apache.hadoop.io.IntWritable
Key: 0: Value: 0
Key: 0.003: Value: 1
Key: 0.006913: Value: 2
Key: 0.007050: Value: 3
Key: 0.01: Value: 4
Key: 0.02: Value: 5
Key: 0.025: Value: 6
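
(For anyone who wants to reproduce these dumps: they are in the format Mahout's seqdumper utility prints, but a few lines of plain Hadoop API give the same output. A minimal sketch, assuming the Hadoop 1.x-era SequenceFile.Reader constructor that Mahout 0.8 builds against; the class name and path are just placeholders:)

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.util.ReflectionUtils;

    public class SeqDump {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            // Example path; point this at whichever seqfile you want to inspect.
            Path path = new Path("reuters-kmeans-try/dictionary.file-0");
            SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
            // Same header line as in the samples above.
            System.out.println("Key class: " + reader.getKeyClass()
                    + " Value Class: " + reader.getValueClass());
            Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
            Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
            while (reader.next(key, value)) {
                System.out.println("Key: " + key + ": Value: " + value);
            }
            reader.close();
        }
    }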

On Wed, Sep 4, 2013 at 12:45 PM, Taner Diler <taner.di...@gmail.com> wrote:

> mahout seq2sparse -i reuters-seqfiles/ -o reuters-kmeans-try -chunk 200
> -wt tfidf -s 2 -md 5 -x 95 -ng 2 -ml 50 -n 2 -seq
>
> this command works well.
>
> Gokhan, I changed the minLLR value to 1.0 in Java, but the result is the
> same: empty tfidf-vectors.
>
>
> On Tue, Sep 3, 2013 at 10:47 AM, Taner Diler <taner.di...@gmail.com> wrote:
>
>> Gokhan, I tried it from the command line and it works. I will send the
>> command so we can compare the command-line parameters to the TFIDFConverter
>> params.
>>
>> Suneel, I had checked the seqfiles. I didn't see any problem with the
>> generated seqfiles, but I will check again and send samples from each one.
>>
>>
>> On Sun, Sep 1, 2013 at 11:02 PM, Gokhan Capan <gkhn...@gmail.com> wrote:
>>
>>> Suneel is right, indeed. I had assumed that everything performed prior to
>>> vector generation was done correctly.
>>>
>>> By the way, if the suggestions do not work, could you try running
>>> seq2sparse from the command line with the same arguments and see if that
>>> works?
>>>
>>> On Sun, Sep 1, 2013 at 7:23 PM, Suneel Marthi <suneel_mar...@yahoo.com> wrote:
>>>
>>> > I would first check to see if the input seqfiles for TFIDFConverter
>>> > have any meat in them.
>>> > This could also happen if the input seqfiles are empty.
>>> >
>>> >
>>> > ________________________________
>>> >  From: Taner Diler <taner.di...@gmail.com>
>>> > To: user@mahout.apache.org
>>> > Sent: Sunday, September 1, 2013 2:24 AM
>>> > Subject: TFIDFConverter generates empty tfidf-vectors
>>> >
>>> >
>>> > Hi all,
>>> >
>>> > I am trying to run the Reuters KMeans example in Java, but TFIDFConverter
>>> > generates empty tfidf-vectors. How can I fix that?
>>> >
>>> >     private static int minSupport = 2;
>>> >     private static int maxNGramSize = 2;
>>> >     private static float minLLRValue = 50;
>>> >     private static float normPower = 2;
>>> >     private static boolean logNormalize = true;
>>> >     private static int numReducers = 1;
>>> >     private static int chunkSizeInMegabytes = 200;
>>> >     private static boolean sequentialAccess = true;
>>> >     private static boolean namedVectors = false;
>>> >     private static int minDf = 5;
>>> >     private static long maxDF = 95;
>>> >
>>> >         Path inputDir = new Path("reuters-seqfiles");
>>> >         String outputDir = "reuters-kmeans-try";
>>> >         HadoopUtil.delete(conf, new Path(outputDir));
>>> >         StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_43);
>>> >         Path tokenizedPath =
>>> >                 new Path(DocumentProcessor.TOKENIZED_DOCUMENT_OUTPUT_FOLDER);
>>> >         DocumentProcessor.tokenizeDocuments(inputDir,
>>> >                 analyzer.getClass().asSubclass(Analyzer.class), tokenizedPath, conf);
>>> >
>>> >         DictionaryVectorizer.createTermFrequencyVectors(tokenizedPath,
>>> >                 new Path(outputDir),
>>> >                 DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER, conf,
>>> >                 minSupport, maxNGramSize, minLLRValue, normPower, logNormalize,
>>> >                 numReducers, chunkSizeInMegabytes, sequentialAccess, namedVectors);
>>> >
>>> >         Pair<Long[], List<Path>> features = TFIDFConverter.calculateDF(
>>> >                 new Path(outputDir, DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER),
>>> >                 new Path(outputDir), conf, chunkSizeInMegabytes);
>>> >         TFIDFConverter.processTfIdf(
>>> >                 new Path(outputDir, DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER),
>>> >                 new Path(outputDir), conf, features, minDf, maxDF, normPower,
>>> >                 logNormalize, sequentialAccess, false, numReducers);
>>> >
>>>
>>
>>
>
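
One more thing worth comparing, since the command line works and the Java code does
not (an observation, not a verified fix): the command above never passes -lnorm, so
logNormalize is false on the CLI run, and as far as I can tell from the seq2sparse
driver, with -wt tfidf the raw tf vectors are built with
PartialVectorMerger.NO_NORMALIZING and normalization happens only in the tf-idf pass.
A sketch of the Java equivalent, reusing the variable names from the quoted snippet
(PartialVectorMerger is org.apache.mahout.vectorizer.common.PartialVectorMerger):

    // Sketch only: build un-normalized tf vectors, as seq2sparse appears to do
    // for -wt tfidf, and normalize only in the tf-idf pass.
    DictionaryVectorizer.createTermFrequencyVectors(tokenizedPath, new Path(outputDir),
            DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER, conf,
            minSupport, maxNGramSize, minLLRValue,
            PartialVectorMerger.NO_NORMALIZING,  // no norm at the tf stage
            false,                               // no log-normalization at the tf stage
            numReducers, chunkSizeInMegabytes, sequentialAccess, namedVectors);

    TFIDFConverter.processTfIdf(
            new Path(outputDir, DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER),
            new Path(outputDir), conf, features, minDf, maxDF,
            normPower, false,  // logNormalize off, matching the CLI run (-n 2, no -lnorm)
            sequentialAccess, false, numReducers);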
