[ https://issues.apache.org/jira/browse/SPARK-1133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Matei Zaharia updated SPARK-1133:
---------------------------------

    Assignee: Xusen Yin

> Add a new small files input for MLlib, which will return an RDD[(fileName, content)]
> ------------------------------------------------------------------------------------
>
>                 Key: SPARK-1133
>                 URL: https://issues.apache.org/jira/browse/SPARK-1133
>             Project: Spark
>          Issue Type: Improvement
>          Components: Input/Output
>    Affects Versions: 1.0.0
>            Reporter: Xusen Yin
>            Assignee: Xusen Yin
>            Priority: Minor
>              Labels: IO, MLLib, hadoop
>
> While writing an LDA (Latent Dirichlet Allocation) implementation for Spark
> MLlib, I found that an input API for small files would be useful, so I wrote
> smallTextFiles() to support it.
> smallTextFiles() reads a directory of text files and returns an
> RDD[(String, String)], where the first String is the file name and the
> second is the contents of that text file.
> smallTextFiles() can read from local disk or from HDFS, just like
> textFile() in SparkContext. For LDA, there are 2 common uses:
> 1. Preprocessing local disk files: combine the small files into one large
> file, then transfer it onto HDFS for further processing, such as LDA
> clustering.
> 2. Transferring the raw directory of small files onto HDFS and then
> clustering it directly with LDA. This is not recommended, because it costs
> too many namenode entries.
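To make the (fileName, content) pairing and the first use case concrete, here is a minimal sketch in plain Python, using only the standard library and no Spark. It is not the proposed implementation: small_text_files() and combine_to_one_file() are hypothetical names, a list of tuples stands in for the RDD[(String, String)], and the tab-separated output format is an assumption chosen only for illustration.

```python
import os
import tempfile


def small_text_files(directory):
    """Hypothetical stand-in for the proposed smallTextFiles().

    Returns a list of (file_name, content) pairs, one per regular file
    directly inside `directory`, sorted by file name. Subdirectories are
    skipped. A real implementation would return an RDD instead of a list.
    """
    pairs = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            with open(path, encoding="utf-8") as f:
                pairs.append((name, f.read()))
    return pairs


def combine_to_one_file(pairs, out_path):
    """Use case 1: merge many small files into one large file.

    Assumed format: one line per document, "file_name<TAB>content", with
    all whitespace runs inside the content (including newlines) collapsed
    to single spaces so that each document stays on one line.
    """
    with open(out_path, "w", encoding="utf-8") as out:
        for name, content in pairs:
            out.write(name + "\t" + " ".join(content.split()) + "\n")


if __name__ == "__main__":
    # Build a throwaway directory with two small documents and combine them.
    with tempfile.TemporaryDirectory() as src:
        for name, text in [("doc1.txt", "spark mllib"), ("doc2.txt", "lda\ntopics")]:
            with open(os.path.join(src, name), "w", encoding="utf-8") as f:
                f.write(text)
        pairs = small_text_files(src)
        with tempfile.TemporaryDirectory() as dst:
            combined = os.path.join(dst, "combined.tsv")
            combine_to_one_file(pairs, combined)
            with open(combined, encoding="utf-8") as f:
                print(f.read(), end="")
```

The combined file could then be copied onto HDFS as a single object, which avoids the one-namenode-entry-per-file cost mentioned in the second use case.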