I'm not sure I understand your question correctly. If you know the keys, you could put them into a file and write a map-only job that loads the keys from that file and filters the data, retaining only the key-value pairs whose key is in your list.
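
A minimal sketch of that filtering logic, in plain Java with no Hadoop or Mahout classes so it can be run on its own. The class name KeyFilterSketch, the method names, and the one-key-per-line file format are assumptions made for illustration, not anything from Mahout. In an actual map-only job, the key loading would typically go into Mapper.setup(), the membership check into map(), and the number of reduce tasks would be set to 0; reading and writing with SequenceFileInputFormat and SequenceFileOutputFormat should then keep the Text keys and VectorWritable values as they are.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Plain-Java model of the filter; hypothetical names, no Hadoop dependencies.
class KeyFilterSketch {

    // Reads one key per line and skips blank lines. In a Hadoop job this
    // would usually run once per task, inside Mapper.setup().
    static Set<String> loadKeys(Path keyFile) throws IOException {
        Set<String> keys = new HashSet<>();
        for (String line : Files.readAllLines(keyFile, StandardCharsets.UTF_8)) {
            String key = line.trim();
            if (!key.isEmpty()) {
                keys.add(key);
            }
        }
        return keys;
    }

    // Keeps only the records whose key is wanted; values pass through
    // untouched. This corresponds to the body of map(): emit or drop.
    static <V> Map<String, V> retainWanted(Map<String, V> records, Set<String> wanted) {
        Map<String, V> kept = new LinkedHashMap<>();
        for (Map.Entry<String, V> entry : records.entrySet()) {
            if (wanted.contains(entry.getKey())) {
                kept.put(entry.getKey(), entry.getValue());
            }
        }
        return kept;
    }

    public static void main(String[] args) throws IOException {
        // Made-up keys and vectors, only to exercise the two methods.
        Path keyFile = Files.createTempFile("keys", ".txt");
        Files.write(keyFile, Arrays.asList("/doc1", "/doc3"), StandardCharsets.UTF_8);

        Map<String, double[]> vectors = new LinkedHashMap<>();
        vectors.put("/doc1", new double[] {0.1, 0.0});
        vectors.put("/doc2", new double[] {0.0, 0.7});
        vectors.put("/doc3", new double[] {0.4, 0.2});

        Map<String, double[]> kept = retainWanted(vectors, loadKeys(keyFile));
        System.out.println(kept.keySet());
        Files.delete(keyFile);
    }
}
```

With 500 keys the set easily fits in memory in every map task, which is why no reduce phase or join is needed here.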

Does that make sense?

--sebastian


On 05/07/2014 09:46 AM, Richard Scharrer (JIRA) wrote:
Richard Scharrer created MAHOUT-1549:
----------------------------------------

              Summary: Extracting tfidf-vectors by key
                  Key: MAHOUT-1549
                  URL: https://issues.apache.org/jira/browse/MAHOUT-1549
              Project: Mahout
           Issue Type: Question
           Components: Classification
     Affects Versions: 0.9, 0.8, 0.7
             Reporter: Richard Scharrer


Hi,
I have about 200000 tfidf-vectors and I need to extract 500 of them, for which I 
have the keys. Is there some kind of magical option that lets me do something 
like take the output of mahout seqdumper and transform it back into a 
sequencefile that I can use for trainnb/testnb? The sequencefiles of tfidf use 
the Text class for the keys and the VectorWritable class for the values. I tried
https://github.com/kevinweil/elephant-bird/blob/master/pig/src/main/java/com/twitter/elephantbird/pig/store/SequenceFileStorage.java
with different settings, but the output always gives me the Text class for both 
key and value, which can't be used in trainnb and testnb.

I posted this question on:

http://stackoverflow.com/questions/23502362/extracting-tfidf-vectors-by-key-without-destroying-the-fileformat

I am asking this question here because I've seen similar questions on 
stackoverflow that were asked last year and still haven't received an answer.

I really need this information so in case you know anything please tell me.

Regards,
Richard




