[ 
https://issues.apache.org/jira/browse/MAHOUT-833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13149441#comment-13149441
 ] 

Joe Prasanna Kumar commented on MAHOUT-833:
-------------------------------------------

Josh,
For SequenceFilesFromDirectory, the doc comment says "Converts a directory 
of text documents into SequenceFiles of Specified chunkSize", so we are 
already expecting text documents, and the output format is docid => content. 
I am thinking that:
1) We should use a custom InputFormat that parses the data according to the 
specified options. For example, we can extend FileInputFormat and override 
isSplitable() to return false, so that each file is consumed by the Mapper as 
one whole file. The map function will process the file according to the 
options and emit key-value pairs.
2) I guess we won't really need a Reducer.
3) The driver will use setOutputFormatClass(SequenceFileOutputFormat.class) to 
write the key-value pairs from the Mapper as a SequenceFile.
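
To make the map step in 1) concrete, here is a rough sketch in plain Java. It 
deliberately uses no Hadoop types so that it compiles on its own; in the real 
job the file name and bytes would be handed over by the non-splittable 
InputFormat. The class name, the keyPrefix and charset parameters, and the 
"prefix/path" key layout are assumptions for illustration only, not the 
existing Mahout behaviour.

```java
import java.nio.charset.Charset;
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.Map;

/** Standalone sketch of the per-file map step; no Hadoop types are used. */
final class WholeFileMapSketch {

    /**
     * Turns one whole file into a single (docid, content) pair.
     * keyPrefix and charset are hypothetical stand-ins for the job options.
     */
    static Map.Entry<String, String> map(String relativePath, byte[] fileBytes,
                                         String keyPrefix, Charset charset) {
        // Assumed key layout: the prefix followed by the path of the file
        // relative to the input directory.
        String docId = keyPrefix + "/" + relativePath;
        // The whole file arrives in one piece, which is what returning false
        // from isSplitable() is meant to guarantee to the mapper.
        String content = new String(fileBytes, charset);
        return new SimpleImmutableEntry<>(docId, content);
    }
}
```

Since every file yields exactly one pair and no pairs need to be combined, 
there is nothing left for a Reducer to do in this sketch.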

The same approach would work for SequenceFilesFromMailArchives, where we can 
have:
1) A separate InputFormat class with a RecordReader that splits the archive 
so that each mail message becomes a separate key-value pair for consumption 
by the Mapper. The Mapper will further parse the message according to the 
options and emit the proper key-value pairs.
2) Again, I guess we won't really need a Reducer.
3) The driver will use setOutputFormatClass(SequenceFileOutputFormat.class) to 
write the key-value pairs from the Mapper as a SequenceFile.
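
The splitting that such a RecordReader would do can be sketched in plain Java 
as well, again without Hadoop types. The sketch assumes an mbox-style archive 
in which every message starts with a line beginning with "From " (body lines 
that start the same way are assumed to be escaped already, e.g. as ">From "). 
The class and method names are hypothetical, and a real RecordReader would 
stream the file instead of holding the whole archive in one String.

```java
import java.util.ArrayList;
import java.util.List;

/** Standalone sketch of splitting a mail archive into single messages. */
final class MboxSplitSketch {

    /** Returns one String per message, each starting with its "From " line. */
    static List<String> splitMessages(String archive) {
        List<String> messages = new ArrayList<>();
        StringBuilder current = null;
        for (String line : archive.split("\r?\n")) {
            if (line.startsWith("From ")) {
                // A separator line closes the previous message, if any,
                // and opens a new one.
                if (current != null) {
                    messages.add(current.toString());
                }
                current = new StringBuilder();
            }
            // Lines before the first separator are ignored.
            if (current != null) {
                current.append(line).append('\n');
            }
        }
        if (current != null) {
            messages.add(current.toString());
        }
        return messages;
    }
}
```

Each returned message would then become the value of one record, with the key 
(for example a message id) left for the Mapper to derive from the headers.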

Team,
If this approach looks right, I can submit a patch for it. Please let me know.

I would appreciate any feedback,
Joe.


                
> Make conversion to sequence files map-reduce
> --------------------------------------------
>
>                 Key: MAHOUT-833
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-833
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Integration
>    Affects Versions: 0.5
>            Reporter: Grant Ingersoll
>              Labels: MAHOUT_INTRO_CONTRIBUTE
>
> Given input that is on HDFS, the SequenceFilesFrom****.java classes should be 
> able to do their work in parallel.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
