[
https://issues.apache.org/jira/browse/MAHOUT-833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13149441#comment-13149441
]
Joe Prasanna Kumar commented on MAHOUT-833:
-------------------------------------------
Josh,
For SequenceFilesFromDirectory, the doc comment says "Converts a directory
of text documents into SequenceFiles of Specified chunkSize", so we are already
expecting text documents, and the output format says docid => content. I am
thinking that:
1) We should use a custom InputFormat that parses the data according to the
specified options. For example, we can extend FileInputFormat and override
isSplitable() to return false, so that each file is consumed by the Mapper as
one whole file. The map function will process the file according to the options
and emit key/value pairs.
2) I guess we won't really need a Reducer.
3) The driver will use setOutputFormatClass(SequenceFileOutputFormat.class) to
write the key/value pairs from the Mapper as a SequenceFile.
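To make the docid => content record shape concrete, here is a minimal
plain-Java sketch of what the whole-file step would produce. It deliberately
does not use Hadoop: in a real job this logic would sit behind a
FileInputFormat subclass whose isSplitable() returns false. The class name
WholeFileRecords, the method read, the UTF-8 charset, and the choice of
"/" + relative path as the docid are all assumptions made for illustration,
not existing Mahout code.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper for illustration only; not part of Mahout or Hadoop.
final class WholeFileRecords {

    // Returns one docid => content entry per regular file under inputDir.
    // Each file is read whole, mirroring a non-splittable input.
    static Map<String, String> read(Path inputDir) {
        try {
            List<Path> files;
            try (Stream<Path> walk = Files.walk(inputDir)) {
                files = walk.filter(Files::isRegularFile)
                            .sorted()
                            .collect(Collectors.toList());
            }
            Map<String, String> records = new LinkedHashMap<>();
            for (Path file : files) {
                // Assumed key scheme: "/" plus the path relative to inputDir.
                String docId = "/" + inputDir.relativize(file).toString().replace('\\', '/');
                // Assumes the documents are UTF-8 text.
                String content = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
                records.put(docId, content);
            }
            return records;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In the map-only job described above, each entry of this map would correspond
to one key/value pair written by the Mapper.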
The same approach would work for SequenceFilesFromMailArchives, where we can
have:
1) A separate InputFormat class with a RecordReader that splits the archive so
that each mail message becomes a separate key/value pair for consumption by the
Mapper. The Mapper will further parse the message according to the options and
emit the proper key/value pairs.
2) I guess we won't really need a Reducer.
3) The driver will use setOutputFormatClass(SequenceFileOutputFormat.class) to
write the key/value pairs from the Mapper as a SequenceFile.
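The per-message splitting that such a RecordReader would do can be sketched
in plain Java as follows. This is only an illustration under stated
assumptions: the archive is in mbox format, every message starts with a
"From " separator line, and body lines that begin with "From " have already
been escaped as ">From ". The class name MboxSplitter and the method split
are hypothetical, and no Hadoop API is involved.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Mahout or Hadoop.
final class MboxSplitter {

    // Splits mbox text into one string per message.
    // The "From " separator line itself is dropped; text before the first
    // separator is ignored.
    static List<String> split(String mbox) {
        List<String> messages = new ArrayList<>();
        StringBuilder current = null;
        for (String line : mbox.split("\r?\n")) {
            if (line.startsWith("From ")) {
                // A separator closes the previous message and opens a new one.
                if (current != null) {
                    messages.add(current.toString());
                }
                current = new StringBuilder();
            } else if (current != null) {
                current.append(line).append('\n');
            }
        }
        if (current != null) {
            messages.add(current.toString());
        }
        return messages;
    }
}
```

Each returned string would be the value of one record handed to the Mapper,
which then parses headers and body according to the options.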
Team,
If this approach looks right, I can submit a patch for it. Please let me know.
I would appreciate any feedback,
Joe.
> Make conversion to sequence files map-reduce
> --------------------------------------------
>
> Key: MAHOUT-833
> URL: https://issues.apache.org/jira/browse/MAHOUT-833
> Project: Mahout
> Issue Type: Improvement
> Components: Integration
> Affects Versions: 0.5
> Reporter: Grant Ingersoll
> Labels: MAHOUT_INTRO_CONTRIBUTE
>
> Given input that is on HDFS, the SequenceFilesFrom****.java classes should be
> able to do their work in parallel.