You currently can't use SparkContext inside a Spark task, so in this case you'd 
have to call some kind of local K-means library. One example you can try is 
Weka (http://www.cs.waikato.ac.nz/ml/weka/). You can then load your text 
files as an RDD of strings with SparkContext.wholeTextFiles and call Weka on 
each one.
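
If it helps, here's a rough sketch in Scala of what that pattern could look like. 
Treat it as an illustration only: the paths, the number of clusters, the object and 
method names, and the whitespace-separated feature parsing are all my assumptions, 
and the local clustering uses Weka's SimpleKMeans (assuming Weka 3.7+'s API) just as 
one possible choice.

import java.util.ArrayList

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._  // pair RDD operations on older Spark versions

import weka.clusterers.SimpleKMeans
import weka.core.{Attribute, DenseInstance, Instances}

// Purely local clustering step; no Spark classes are used in here,
// so it can run inside a task on an executor.
object LocalKMeans extends Serializable {
  // Cluster one file's contents and return its rows annotated with cluster ids.
  // Assumes whitespace-separated numeric features, one feature vector per row.
  def clusterFile(content: String, k: Int): String = {
    val rows = content.split("\n").map(_.trim).filter(_.nonEmpty)
      .map(_.split("\\s+").map(_.toDouble))
    val numFeatures = rows.head.length

    // Build a Weka dataset from the parsed rows
    val atts = new ArrayList[Attribute]()
    (0 until numFeatures).foreach(i => atts.add(new Attribute("f" + i)))
    val data = new Instances("file", atts, rows.length)
    rows.foreach(r => data.add(new DenseInstance(1.0, r)))

    // Plain local k-means -- no SparkContext involved inside the task
    val km = new SimpleKMeans()
    km.setNumClusters(k)
    km.buildClusterer(data)

    // Emit "originalRow<TAB>clusterId", one line per input row
    rows.indices.map { i =>
      rows(i).mkString(" ") + "\t" + km.clusterInstance(data.instance(i))
    }.mkString("\n")
  }
}

object PerFileKMeans {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("per-file-kmeans"))

    // (filePath, fileContent) pairs -- wholeTextFiles keeps each file together
    // as a single record, so each file can be clustered independently in a map
    val files = sc.wholeTextFiles("hdfs:///path/to/input")   // path is a placeholder

    val clustered = files.mapValues(LocalKMeans.clusterFile(_, 3))

    clustered.saveAsTextFile("hdfs:///path/to/output")       // path is a placeholder
    sc.stop()
  }
}

The key point is that wholeTextFiles gives you one (path, content) record per file, 
so each file is clustered entirely inside a single task by an ordinary local library, 
and no SparkContext is needed on the executors. You'd also have to ship the Weka jar 
to the executors (for example with spark-submit's --jars option).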

Matei

On Jul 14, 2014, at 11:30 AM, Rahul Bhojwani <rahulbhojwani2...@gmail.com> 
wrote:

> I understand that this question may not be appropriate here, but I am a newbie. If 
> this is not the right place, could you share a link to where I can ask such questions? 
> 
> But please do answer.
> 
> 
> On Mon, Jul 14, 2014 at 6:52 PM, Rahul Bhojwani <rahulbhojwani2...@gmail.com> 
> wrote:
> Hey, my question is about this situation:
> Suppose we have 100,000 files, where each row of a file is a list of features.
> 
> The task is: for each file, cluster the features in that file and write each 
> row along with its cluster assignment to a new file. So we have to generate 
> 100,000 more files by applying clustering to each file individually.
> 
> Can I do it this way: get an RDD of the list of files and apply a map, and 
> inside the mapper function handling each file, create another SparkContext 
> and use MLlib's KMeans to produce the clustered output file?
> 
> Please suggest the appropriate method to tackle this problem.
> 
> Thanks, 
> Rahul Kumar Bhojwani
> 3rd year, B.Tech
> Computer Science Engineering
> National Institute Of Technology, Karnataka
> 9945197359
> 
> 
> 
> 
> -- 
> Rahul K Bhojwani
> 3rd Year B.Tech
> Computer Science and Engineering
> National Institute of Technology, Karnataka
