[
https://issues.apache.org/jira/browse/MAHOUT-833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13211675#comment-13211675
]
Josh Patterson commented on MAHOUT-833:
---------------------------------------
Joe,
Quick overview of the patch as is:
- only does the SequenceFilesFromDirectory.java codepath, does not address the
mail part (yet).
- old codepath recurses through the dirs with fs.listStatus() and writes into a
single ChunkedWriter for the sequence file
- since the dirs can have subdirs, had to include a function that built a
recursive list of subdirs based on the input path
- since we had lots of small file paths, I ended up subclassing
CombineFileInputFormat for MultiTextFileInputFormat
-- for a great explanation of how CombineFileInputFormat works and how it's
used:
http://lucene.472066.n3.nabble.com/help-on-CombineFileInputFormat-td781357.html
-- basically: each split is a bunch of small file input paths, so each mapper
gets fed a lot of files (we don't want each mapper looking at a single file like
we'd normally see with TextInputFormat)
- the chunkSize param was a bit tricky in MR; I thought I was going to have
to handle it by hand, but ended up going with "mapred.max.split.size"
- tested on the reuters extracted files that are used in some of the demos
since it has around 21k smallish text files to work from
- the JobSplitWriter started complaining about "max block locations exceeded
for split", which caused me to set "mapreduce.job.max.split.locations" to a
very large number in the job conf
- all of the changes are localized in the integration module in o.a.m.text
- new vs old MR API
-- looking at AbstractJob.prepareJob(), I can see that most of Mahout's MR Jobs
are using the newer MR api. I tried to accommodate that same pattern here.
-- unfortunately, Hadoop 0.20.205 does not currently have a new-API class for
CombineFileInputFormat
-- the code currently works with the old API specifically because of this
issue; I'm looking at filing a JIRA with Hadoop for this
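
To make the "lots of small files per split" idea above concrete, here is a minimal sketch in plain Java, with no Hadoop dependency. It is not the code from the patch and not how CombineFileInputFormat is actually implemented (the real class also takes node and rack locality into account); the class name SplitPackingSketch, the method pack(), and the maxSplitBytes parameter are hypothetical names invented for this illustration. It only shows the general effect of a max split size: small files are grouped until adding the next one would push the group over the limit.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of the MAHOUT-833 patch or of Hadoop.
class SplitPackingSketch {

    /**
     * Greedily groups files (in the map's iteration order) so that each
     * group's total size stays at or below maxSplitBytes. A file that is
     * larger than the limit ends up alone in its own group.
     */
    static List<List<String>> pack(Map<String, Long> fileSizes, long maxSplitBytes) {
        List<List<String>> splits = new ArrayList<>();
        List<String> current = new ArrayList<>();
        long currentBytes = 0;

        for (Map.Entry<String, Long> file : fileSizes.entrySet()) {
            long size = file.getValue();
            // Close the current group if adding this file would exceed the limit.
            if (!current.isEmpty() && currentBytes + size > maxSplitBytes) {
                splits.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(file.getKey());
            currentBytes += size;
        }

        // Flush the last, possibly partial, group.
        if (!current.isEmpty()) {
            splits.add(current);
        }
        return splits;
    }
}
```

Passing a LinkedHashMap with files of 40, 40, 40, 100 and 10 bytes and a limit of 100 gives four groups: the first two files together, then one group for each remaining file. In the job itself each such group would go to one mapper, which is why a larger split size means fewer mappers, each handling more files.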
> Make conversion to sequence files map-reduce
> --------------------------------------------
>
> Key: MAHOUT-833
> URL: https://issues.apache.org/jira/browse/MAHOUT-833
> Project: Mahout
> Issue Type: Improvement
> Components: Integration
> Affects Versions: 0.7
> Reporter: Grant Ingersoll
> Labels: MAHOUT_INTRO_CONTRIBUTE
> Attachments: MAHOUT-833.patch
>
>
> Given input that is on HDFS, the SequenceFilesFrom****.java classes should be
> able to do their work in parallel.