GitHub user yinxusen opened a pull request:

    https://github.com/apache/spark/pull/164

    [SPARK-1133] add small files input in MLlib

    This pull request addresses the JIRA issue 
[SPARK-1133](https://spark-project.atlassian.net/browse/SPARK-1133), which 
adds a new file reader API to MLlib.
    
    The interface exposed to end users is `smallTextFiles()` in MLUtils.scala, 
which is similar to `textFile()` in SparkContext.scala. It reads a directory of 
text files from HDFS or the local disk and creates an `RDD[(String, String)]`, 
where the first string is the file name and the second is the file content.
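
    To make the record shape concrete, here is a minimal local sketch that 
does not use Spark at all. `SmallFilesSketch` and `readSmallTextFiles` are 
hypothetical names chosen for illustration, not the API in this pull request; 
the sketch only shows what one `(file name, file content)` pair per file looks 
like, assuming UTF-8 text files in a single flat directory.

```scala
import java.io.File
import java.nio.charset.StandardCharsets
import java.nio.file.Files

object SmallFilesSketch {
  // Hypothetical helper (not part of the PR): returns one
  // (fileName, fileContent) pair per regular file directly under `dir`.
  def readSmallTextFiles(dir: File): Seq[(String, String)] = {
    // listFiles() returns null when `dir` is not a readable directory.
    val files = Option(dir.listFiles()).getOrElse(Array.empty[File])
    files.toSeq
      .filter(_.isFile)     // skip sub-directories
      .sortBy(_.getName)    // deterministic order for the example
      .map { f =>
        // Each file is read as a whole, so one file yields exactly one record.
        val content = new String(Files.readAllBytes(f.toPath), StandardCharsets.UTF_8)
        (f.getName, content)
      }
  }
}
```

    Unlike a line-oriented reader, each element here carries an entire file, 
which is the property the `RDD[(String, String)]` described above provides 
at cluster scale.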
    
    This interface is useful when an application needs to read a large number 
of small files. Latent Dirichlet Allocation (LDA) is one example of an 
algorithm that needs this interface.
    
    The typical scenario is reading a bunch of files from the local disk, 
processing them, and saving the result as a `sequenceFile`. It can also read a 
directory of files in HDFS, though that is not good practice.
    
    A similar implementation in Mahout is named `sequenceFileFromDirectory`, 
but it is not that easy to use. Here `newAPIHadoopFile()` is used instead of 
`hadoopFile()` because the `CombineFileInputFormat` in the new Hadoop API 
allows `isSplitable()` to be set to false. Otherwise we would need a shuffle 
to merge file content after reading blocks from HDFS.
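
    The cost of splittable input can be illustrated with a small in-memory 
simulation. This is only a sketch under stated assumptions: `SplitMergeSketch`, 
`Block`, `splitIntoBlocks`, and `mergeBlocks` are hypothetical names, plain 
Scala collections stand in for HDFS blocks, and the `groupBy` stands in for the 
shuffle that would be needed on a cluster. It is not the Hadoop or Spark code 
path itself.

```scala
object SplitMergeSketch {
  // One record per block: file name, character offset of the block, block text.
  final case class Block(fileName: String, offset: Int, text: String)

  // Simulates a splittable input format: every file is cut into blocks of at
  // most `blockSize` characters, so one file may yield several records.
  // Assumes blockSize > 0. An empty file yields no blocks at all.
  def splitIntoBlocks(files: Seq[(String, String)], blockSize: Int): Seq[Block] =
    for {
      (name, content) <- files
      offset <- 0 until content.length by blockSize
    } yield Block(name, offset,
      content.substring(offset, math.min(offset + blockSize, content.length)))

  // The extra merge step: group blocks by file name (the stand-in for a
  // shuffle), restore their order by offset, and concatenate the pieces.
  def mergeBlocks(blocks: Seq[Block]): Map[String, String] =
    blocks.groupBy(_.fileName).map { case (name, bs) =>
      name -> bs.sortBy(_.offset).map(_.text).mkString
    }
}
```

    When files are never split, each file already arrives as a single record 
and the grouping step disappears, which is the motivation given above for 
setting `isSplitable()` to false.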

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/yinxusen/spark small-files-input

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/164.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #164
    
----
commit fd93e5915bf3679c1795881f2079c2db506fe22c
Author: Xusen Yin <yinxu...@gmail.com>
Date:   2014-03-18T00:19:42Z

    add small text files input API

commit 9bf87d443841c9db4bb9a1c595eecc04fe27d049
Author: Xusen Yin <yinxu...@gmail.com>
Date:   2014-03-18T00:21:37Z

    Merge branch 'master' into small-files-input

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---
