You currently can't use SparkContext inside a Spark task, so in this case you'd have to call some kind of local K-means library. One example you could try is Weka (http://www.cs.waikato.ac.nz/ml/weka/). You can then load your text files as an RDD of strings with SparkContext.wholeTextFiles and call Weka on each file's contents.
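To make that concrete, here is a rough sketch in Scala of what that could look like, assuming each row of a file is a comma-separated list of numeric features and using Weka's SimpleKMeans as the local clusterer. The input/output paths, k = 5, and the output format are placeholders, not anything prescribed by Spark or Weka:

    import org.apache.spark.{SparkConf, SparkContext}
    import weka.clusterers.SimpleKMeans
    import weka.core.{Attribute, DenseInstance, Instances}

    object PerFileKMeans {
      // Cluster the rows of one file locally with Weka and return
      // "<original row>,<cluster id>" lines.
      def clusterOneFile(contents: String, k: Int): Seq[String] = {
        val rows = contents.split("\n").map(_.trim).filter(_.nonEmpty)
        val features = rows.map(_.split(",").map(_.toDouble))
        val numAttrs = features.head.length

        // Build a Weka dataset from the parsed rows of this one file.
        val attrs = new java.util.ArrayList[Attribute]()
        (0 until numAttrs).foreach(i => attrs.add(new Attribute("f" + i)))
        val data = new Instances("file", attrs, rows.length)
        features.foreach(f => data.add(new DenseInstance(1.0, f)))

        // Run local (single-machine) k-means on this file only.
        val km = new SimpleKMeans()
        km.setNumClusters(k)
        km.buildClusterer(data)

        rows.indices.map(i => rows(i) + "," + km.clusterInstance(data.get(i)))
      }

      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("per-file-kmeans"))
        // wholeTextFiles gives one (path, fileContents) record per input file,
        // so each file is clustered independently inside a single task.
        sc.wholeTextFiles("hdfs:///input")
          .map { case (path, contents) =>
            (path, clusterOneFile(contents, 5).mkString("\n"))
          }
          .saveAsTextFile("hdfs:///output")
      }
    }

You'd need to ship the Weka jar to the executors (e.g. with spark-submit --jars), and if you really need one output file per input file you'd have to write them out yourself in the mapper or from the driver rather than relying on saveAsTextFile.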
Matei

On Jul 14, 2014, at 11:30 AM, Rahul Bhojwani <rahulbhojwani2...@gmail.com> wrote:

> I understand that the question is very unprofessional, but I am a newbie. If
> you could share some link where I can ask such questions, if not here.
>
> But please answer.
>
>
> On Mon, Jul 14, 2014 at 6:52 PM, Rahul Bhojwani <rahulbhojwani2...@gmail.com> wrote:
>
> Hey, my question is for this situation:
>
> Suppose we have 100,000 files, each containing a list of features in each row.
>
> The task is: for each file, cluster the features in that file and write each
> row's cluster along with it to a new file. So we have to generate 100,000 more
> files by applying clustering to each file individually.
>
> Can I do it this way: get an RDD of the list of files and apply map. Inside
> the mapper function, which will be handling each file, get another SparkContext
> and use MLlib KMeans to produce the clustered output file?
>
> Please suggest the appropriate method to tackle this problem.
>
> Thanks,
> Rahul Kumar Bhojwani
> 3rd year, B.Tech
> Computer Science Engineering
> National Institute Of Technology, Karnataka
> 9945197359
>
>
> --
> Rahul K Bhojwani
> 3rd Year B.Tech
> Computer Science and Engineering
> National Institute of Technology, Karnataka