[ https://issues.apache.org/jira/browse/MAHOUT-833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13211675#comment-13211675 ]
Josh Patterson commented on MAHOUT-833:
---------------------------------------

Joe,

Quick overview of the patch as is:

- It only covers the SequenceFilesFromDirectory.java code path; it does not address the mail part (yet).
- The old code path recurses through the dirs with fs.listStatus() and writes into a single ChunkedWriter for the sequence file.
- Since the dirs can have subdirs, I had to include a function that builds a recursive list of subdirs based on the input path.
- Since we had lots of small file paths, I ended up subclassing CombineFileInputFormat to create MultiTextFileInputFormat.
-- For a great explanation of how CombineFileInputFormat works and how it's used: http://lucene.472066.n3.nabble.com/help-on-CombineFileInputFormat-td781357.html
-- Basically: each split is a bunch of small file input paths, so each mapper gets fed a lot of files (we don't want each mapper looking at a single file, as we'd normally see with TextInputFormat).
- The chunkSize param was a bit of a trick in MR. I thought I was going to have to do it by hand, but ended up going with "mapred.max.split.size".
- Tested on the Reuters extracted files that are used in some of the demos, since that set has around 21k smallish text files to work from.
- The JobSplitWriter started complaining about "max block locations exceeded for split", which caused me to set "mapreduce.job.max.split.locations" to a very large number in the job conf.
- All of the changes are localized in the integration module in o.a.m.text.
- New vs. old MR API:
-- Looking at AbstractJob.prepareJob(), I can see that most of Mahout's MR jobs are using the newer MR API. I tried to accommodate that same pattern here.
-- Unfortunately, Hadoop 0.20.205 does not currently have a new-API class for CombineFileInputFormat.
-- Currently the code works with the old API specifically because of this issue; I'm looking at filing a JIRA with Hadoop for this.

> Make conversion to sequence files map-reduce
> --------------------------------------------
>
>                 Key: MAHOUT-833
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-833
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Integration
>    Affects Versions: 0.7
>            Reporter: Grant Ingersoll
>              Labels: MAHOUT_INTRO_CONTRIBUTE
>         Attachments: MAHOUT-833.patch
>
>
> Given input that is on HDFS, the SequenceFilesFrom****.java classes should be
> able to do their work in parallel.
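
For anyone following along, here is a rough sketch of the idea behind the combined splits described in the comment: many small files are grouped into one split, and a size cap (the role "mapred.max.split.size" plays in the patch) decides where a split ends. This is a simplified model in plain Java with no Hadoop dependency. The class name SplitPackingSketch and the method pack() are hypothetical, and the greedy rule is an assumption for illustration only; Hadoop's CombineFileInputFormat works on blocks and also takes node and rack locality into account.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified model only (hypothetical names): NOT Hadoop's implementation,
// which operates on blocks and also groups them by node and rack.
class SplitPackingSketch {

    /**
     * Greedily packs file sizes (in bytes) into splits, in input order.
     * The current split is closed when adding the next file would push it
     * past maxSplitSize. A file larger than maxSplitSize gets its own split.
     */
    static List<List<Long>> pack(List<Long> fileSizes, long maxSplitSize) {
        if (maxSplitSize <= 0) {
            throw new IllegalArgumentException("maxSplitSize must be positive");
        }
        List<List<Long>> splits = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentBytes = 0;
        for (long size : fileSizes) {
            // Close the current split if this file would overflow the cap.
            if (!current.isEmpty() && currentBytes + size > maxSplitSize) {
                splits.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(size);
            currentBytes += size;
        }
        if (!current.isEmpty()) {
            splits.add(current);
        }
        return splits;
    }

    public static void main(String[] args) {
        List<Long> sizes = Arrays.asList(40L, 40L, 40L, 10L, 250L, 5L);
        // Prints: [[40, 40], [40, 10], [250], [5]]
        System.out.println(pack(sizes, 100L));
    }
}
```

With a cap of 100 bytes, the six example files end up in four splits, so four mappers would each read a group of files instead of six mappers reading one file each. Raising the cap yields fewer, larger splits, which is roughly how a chunk size setting maps onto the number of map tasks.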