[ https://issues.apache.org/jira/browse/MAHOUT-833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13151019#comment-13151019 ]
Josh Patterson commented on MAHOUT-833:
---------------------------------------

I think for "SequenceFilesFromDirectory" with FileInputFormat you would run into the issue where each file in the directory generates its own map task. With no reducer, each file would then end up in a separate output sequence file, which would create lots of relatively small files. This also has the downside of not amortizing task setup/teardown time across multiple files. Although the reduce side could generate the sequence files, ideally we'd like to see each mapper process more files per task.

An alternative approach:
- On the client side (pre-MR), list files recursively using the HDFS API. Output the listing to a file.
- Use the NLineInputFormat against that file to split the listing among multiple mappers.

JP

> Make conversion to sequence files map-reduce
> --------------------------------------------
>
>                 Key: MAHOUT-833
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-833
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Integration
>    Affects Versions: 0.5
>            Reporter: Grant Ingersoll
>              Labels: MAHOUT_INTRO_CONTRIBUTE
>
> Given input that is on HDFS, the SequenceFilesFrom****.java classes should be
> able to do their work in parallel.
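
The alternative approach in the comment above (list the files on the client, write the listing to a file, then hand each mapper a fixed number of lines from it) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the Mahout implementation: it uses the local filesystem via java.nio as a stand-in for HDFS, and the class and method names (ManifestSplitSketch, listFilesRecursively, writeManifest, splitByLines) are hypothetical. In an actual job the listing would presumably come from the Hadoop FileSystem API, and the grouping done here by splitByLines is what NLineInputFormat would do for you, with the lines-per-split value set in the job configuration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch: local filesystem stands in for HDFS.
class ManifestSplitSketch {

    // Step 1 (client side, before the job): recursively list all regular files.
    static List<String> listFilesRecursively(Path root) {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths.filter(Files::isRegularFile)
                        .map(Path::toString)
                        .sorted()
                        .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Step 1, continued: write the listing as a manifest, one path per line.
    static void writeManifest(List<String> files, Path manifest) {
        try {
            Files.write(manifest, files);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Step 2: group manifest lines into chunks of at most linesPerSplit.
    // Each chunk models the input one mapper would receive, so one task
    // converts several files instead of one.
    static List<List<String>> splitByLines(List<String> lines, int linesPerSplit) {
        if (linesPerSplit < 1) {
            throw new IllegalArgumentException("linesPerSplit must be >= 1");
        }
        List<List<String>> splits = new ArrayList<>();
        for (int i = 0; i < lines.size(); i += linesPerSplit) {
            int end = Math.min(i + linesPerSplit, lines.size());
            splits.add(new ArrayList<>(lines.subList(i, end)));
        }
        return splits;
    }

    public static void main(String[] args) throws IOException {
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        int linesPerSplit = args.length > 1 ? Integer.parseInt(args[1]) : 100;

        // Write the manifest outside the listed tree so it does not list itself.
        Path manifest = Files.createTempFile("file-listing", ".txt");
        writeManifest(listFilesRecursively(root), manifest);

        List<List<String>> splits = splitByLines(Files.readAllLines(manifest), linesPerSplit);
        for (int i = 0; i < splits.size(); i++) {
            System.out.println("mapper " + i + " would process " + splits.get(i).size() + " files");
        }
    }
}
```

With this layout the number of map tasks is governed by the lines-per-split setting rather than by the number of input files, which is the point of the suggestion: fewer, larger output sequence files and less per-task setup/teardown overhead.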