Hi Robin,

I'm seeing some strangeness with this. I've got a directory with 100k documents. I build a sequence file using SequenceFilesFromDirectory, which emits 4 chunks for this particular dataset. I then dump each of the chunks using SequenceFileDumper, but I only see 75,964 documents in the resulting dump. I've tried with 10k files and it seems to work fine as long as all of the documents fit into a single chunk, but once I get beyond a single chunk it seems to lose documents. In this particular case I can fit about 24k files per chunk using the default chunk size.
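To double-check the counts, I'm also tallying records directly with a plain SequenceFile.Reader. Here's a minimal sketch (hypothetical code, not from Mahout; it assumes the chunks live on the local filesystem):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    // Counts the (key, value) records in every chunk file so the total can
    // be compared against the number of input documents.
    public class ChunkRecordCounter {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.getLocal(conf);
        long total = 0;
        for (FileStatus chunk : fs.listStatus(new Path("/u01/test0-10k-seq"))) {
          SequenceFile.Reader reader = new SequenceFile.Reader(fs, chunk.getPath(), conf);
          Text key = new Text();
          Text value = new Text();
          long count = 0;
          while (reader.next(key, value)) {
            count++;
          }
          reader.close();
          System.out.println(chunk.getPath().getName() + ": " + count);
          total += count;
        }
        System.out.println("total: " + total);
      }
    }

If the totals from this disagree with what SequenceFileDumper reports, that would point at the dumper rather than the writer.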
The commands I'm using are as follows. To create the sequence file:

    mvn -e exec:java -Dexec.mainClass=org.apache.mahout.text.SequenceFilesFromDirectory \
        -Dexec.args="--parent /u01/test0-10k --outputDir /u01/test0-10k-seq --keyPrefix test-10k --charset UTF-8"

Then for each chunk:

    mvn exec:java -Dexec.mainClass=org.apache.mahout.utils.SequenceFileDumper \
        -Dexec.args="-s /u01/test0-10k-seq/chunk-0 -o /u01/test0-10k-dump/chunk-0.dump"

Any ideas? If I find anything in particular I'll follow up.

Drew

(Thanks for the commit, Sean.)

On Tue, Jan 12, 2010 at 8:14 PM, Sean Owen (JIRA) <j...@apache.org> wrote:
>
> [ https://issues.apache.org/jira/browse/MAHOUT-237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
>
> Sean Owen resolved MAHOUT-237.
> ------------------------------
>
>     Resolution: Fixed
>
> > Map/Reduce Implementation of Document Vectorizer
> > ------------------------------------------------
> >
> >                 Key: MAHOUT-237
> >                 URL: https://issues.apache.org/jira/browse/MAHOUT-237
> >             Project: Mahout
> >          Issue Type: New Feature
> >    Affects Versions: 0.3
> >            Reporter: Robin Anil
> >            Assignee: Robin Anil
> >             Fix For: 0.3
> >
> >         Attachments: DictionaryVectorizer.patch, DictionaryVectorizer.patch,
> > DictionaryVectorizer.patch, DictionaryVectorizer.patch,
> > DictionaryVectorizer.patch, SparseVector-VIntWritable.patch
> >
> >
> > The current Vectorizer uses a Lucene index to convert documents into
> > SparseVectors. Ted is working on a hash-based Vectorizer which can map
> > features into vectors of a fixed size and sum them up to get the document
> > vector. This is a pure bag-of-words Vectorizer written in Map/Reduce.
> > The input documents are in a SequenceFile<Text, Text>, with key = docid
> > and value = content.
> > First: Map/Reduce over the document collection and generate the feature
> > counts.
> > Second: a sequential pass reads the output of the Map/Reduce and converts
> > it to a SequenceFile<Text, LongWritable>, where key = feature and
> > value = unique id. This stage should create shards of features of a
> > given split size.
> > Third: Map/Reduce over the document collection, using each shard, to
> > create partial SparseVectors (containing the features of the given shard).
> > Fourth: Map/Reduce over the partial shards, grouping by docid, to create
> > the full document Vector.
>
> --
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
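P.S. On the pipeline quoted above, here's roughly how I picture the first pass (a hypothetical mapper sketch using the old Hadoop API, not the code from the actual patch):

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    // First pass: map over the SequenceFile<Text, Text> collection
    // (key = docid, value = content) and emit (feature, 1) for every token.
    public class FeatureCountMapper extends MapReduceBase
        implements Mapper<Text, Text, Text, LongWritable> {

      private static final LongWritable ONE = new LongWritable(1);
      private final Text feature = new Text();

      public void map(Text docId, Text content,
                      OutputCollector<Text, LongWritable> output,
                      Reporter reporter) throws IOException {
        StringTokenizer tokens = new StringTokenizer(content.toString());
        while (tokens.hasMoreTokens()) {
          feature.set(tokens.nextToken());
          output.collect(feature, ONE);
        }
      }
    }

Paired with the stock LongSumReducer, that would give the global feature counts that the later passes shard and join by docid.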