[ 
https://issues.apache.org/jira/browse/MAHOUT-833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13211675#comment-13211675
 ] 

Josh Patterson commented on MAHOUT-833:
---------------------------------------

Joe,
Quick overview of the patch as is:

- only does the SequenceFilesFromDirectory.java codepath, does not address the 
mail part (yet).
- old codepath recurses through the dirs with fs.listStatus() and writes into a 
single ChunkedWriter for the sequence file
- since the dirs can have subdirs, had to include a function that built a 
recursive list of subdirs based on the input path
- since we had lots of small file paths, I ended up subclassing 
CombineFileInputFormat for MultiTextFileInputFormat
-- for a great explanation of how CombineFileInputFormat works and how it's 
used: 
http://lucene.472066.n3.nabble.com/help-on-CombineFileInputFormat-td781357.html
-- basically: each split is a bunch of small file input paths so each mapper 
gets fed a lot of files (we don't want each mapper looking at a single file like 
we'd normally see with TextInputFormat)
- the chunkSize param was a bit tricky in MR; I thought I was going to have 
to handle it by hand, but ended up going with "mapred.max.split.size" 
- tested on the reuters extracted files that are used in some of the demos 
since it has around 21k smallish text files to work from
- the JobSplitWriter started complaining about "max block locations exceeded 
for split", which caused me to set "mapreduce.job.max.split.locations" to a 
very large number in the job conf
- all of the changes are localized in the integration module in o.a.m.text
- new vs old MR API
-- looking at AbstractJob.prepareJob(), I can see that most of Mahout's MR Jobs 
are using the newer MR api. I tried to accommodate that same pattern here.
-- unfortunately, Hadoop 0.20.205 does not currently have a new-API version of 
CombineFileInputFormat
-- currently the code works with the old API specifically because of this 
issue, I'm looking at filing a JIRA with Hadoop for this
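
To illustrate the "recursive list of subdirs" step from the list above, here is a minimal sketch. It is not the code from the patch: it walks the local filesystem with java.io.File as a stand-in for fs.listStatus(), and the names SubdirLister and listDirsRecursively are made up for this example.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration only; not taken from MAHOUT-833.patch. */
final class SubdirLister {

  private SubdirLister() {}

  /** Returns the root directory plus every directory nested beneath it. */
  static List<File> listDirsRecursively(File root) {
    List<File> dirs = new ArrayList<>();
    collect(root, dirs);
    return dirs;
  }

  private static void collect(File dir, List<File> out) {
    if (!dir.isDirectory()) {
      return; // plain files are skipped; only directories become input paths
    }
    out.add(dir);
    File[] children = dir.listFiles();
    if (children == null) {
      return; // listFiles() returns null on an I/O error or unreadable directory
    }
    for (File child : children) {
      collect(child, out);
    }
  }
}
```

Each returned directory could then be registered as an input path on the job; against HDFS the same walk would presumably go through FileSystem.listStatus() and check each FileStatus for being a directory.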
                
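
The idea behind combined splits and a maximum split size, as described in the list above, can be sketched as a simple greedy packing of small files. This is a simplified illustration only, not Hadoop's actual algorithm (CombineFileInputFormat also groups blocks by node and rack); the class SmallFilePacker and its pack method are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of packing many small files into a few splits. */
final class SmallFilePacker {

  private SmallFilePacker() {}

  /**
   * Groups file sizes (in bytes) into splits in input order. A new split is
   * started whenever adding the next file would push the current split past
   * maxSplitBytes. A single file larger than the limit gets a split of its own.
   */
  static List<List<Long>> pack(List<Long> fileSizes, long maxSplitBytes) {
    List<List<Long>> splits = new ArrayList<>();
    List<Long> current = new ArrayList<>();
    long currentBytes = 0;
    for (long size : fileSizes) {
      if (!current.isEmpty() && currentBytes + size > maxSplitBytes) {
        splits.add(current); // close the split that is already full enough
        current = new ArrayList<>();
        currentBytes = 0;
      }
      current.add(size);
      currentBytes += size;
    }
    if (!current.isEmpty()) {
      splits.add(current); // flush the last, possibly partial, split
    }
    return splits;
  }
}
```

With around 21k smallish files, a limit on the order of the chunk size means each mapper is handed many files at once rather than one file per mapper.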
> Make conversion to sequence files map-reduce
> --------------------------------------------
>
>                 Key: MAHOUT-833
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-833
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Integration
>    Affects Versions: 0.7
>            Reporter: Grant Ingersoll
>              Labels: MAHOUT_INTRO_CONTRIBUTE
>         Attachments: MAHOUT-833.patch
>
>
> Given input that is on HDFS, the SequenceFilesFrom****.java classes should be 
> able to do their work in parallel.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
