GitHub user yinxusen opened a pull request: https://github.com/apache/spark/pull/164
[SPARK-1133] add small files input in MLlib

This pull request addresses the JIRA issue [SPARK-1133](https://spark-project.atlassian.net/browse/SPARK-1133) by adding a new file reader API to MLlib. The interface exposed to end users is `smallTextFiles()` in MLUtils.scala, which is similar to `textFile()` in SparkContext.scala. It reads a directory of text files from HDFS or the local disk and creates an `RDD[(String, String)]`, where the first string is the file name and the second is the file content.

This interface is useful when an application needs to read a large number of files, each as a whole document. Latent Dirichlet Allocation (LDA) is one example that needs it. The typical scenario is to read a set of files from the local disk, process them, and save the result as a `sequenceFile`. It can also read a directory of files in HDFS, though that is not good practice. The similar implementation in Mahout is named `sequenceFileFromDirectory`, but it is not as easy to use.

`newAPIHadoopFile()` is used instead of `hadoopFile()` because `CombineFileInputFormat` in the new Hadoop API allows `isSplitable()` to be set to false. Otherwise, a shuffle would be needed to merge each file's content after reading its blocks from HDFS.
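To illustrate the `(file name, file content)` record shape described above, here is a minimal local sketch in plain Scala. It is not the code in this pull request and does not use Spark or Hadoop: `SmallFilesSketch` and `readSmallTextFiles` are hypothetical names chosen for illustration, and UTF-8 encoding is an assumption. It only shows that each file becomes a single record, in contrast to `textFile()`, which produces one record per line.

```scala
import java.io.File
import java.nio.charset.StandardCharsets
import java.nio.file.Files

object SmallFilesSketch {
  // Hypothetical helper (not the PR's API): returns one (fileName, content)
  // pair per regular file found directly under `dir`, sorted by file name.
  def readSmallTextFiles(dir: File): Seq[(String, String)] = {
    // listFiles() returns null if `dir` is not a readable directory.
    val entries = Option(dir.listFiles()).getOrElse(Array.empty[File])
    entries.toSeq
      .filter(_.isFile)
      .sortBy(_.getName)
      .map { f =>
        // Assumption: files are UTF-8 text and small enough to fit in memory.
        val content = new String(Files.readAllBytes(f.toPath), StandardCharsets.UTF_8)
        (f.getName, content)
      }
  }

  def main(args: Array[String]): Unit = {
    // Build a throwaway directory with two small documents.
    val dir = Files.createTempDirectory("small-files").toFile
    Files.write(new File(dir, "a.txt").toPath,
      "first document\nsecond line".getBytes(StandardCharsets.UTF_8))
    Files.write(new File(dir, "b.txt").toPath,
      "second document".getBytes(StandardCharsets.UTF_8))

    // Each file is one record, even when it spans several lines.
    readSmallTextFiles(dir).foreach { case (name, content) =>
      println(s"$name -> ${content.length} chars")
    }
  }
}
```

In the distributed setting the same pairs would be the elements of the `RDD[(String, String)]`, and keeping each file unsplit is what lets a whole file arrive as one record without a later merge step.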
You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/yinxusen/spark small-files-input

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/164.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #164

----
commit fd93e5915bf3679c1795881f2079c2db506fe22c
Author: Xusen Yin <yinxu...@gmail.com>
Date:   2014-03-18T00:19:42Z

    add small text files input API

commit 9bf87d443841c9db4bb9a1c595eecc04fe27d049
Author: Xusen Yin <yinxu...@gmail.com>
Date:   2014-03-18T00:21:37Z

    Merge branch 'master' into small-files-input

----