Try this (replace ... with the appropriate values for your environment):

import org.apache.spark.rdd.RDD
import org.apache.spark.SparkContext
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.linalg.Vector

val sc = new SparkContext(...)

// wholeTextFiles reads each file as a single (path, contents) record,
// so a document is never split across partitions.
val documents = sc.wholeTextFiles(...)

// Tokenize each document on whitespace.
val tokenized = documents.map { case (path, document) =>
  (path, document.split("\\s+"))
}

// Hash each token sequence into a fixed-size term-frequency vector.
val numFeatures = 100000
val hashingTF = new HashingTF(numFeatures)
val featurized = tokenized.map { case (path, words) =>
  (path, hashingTF.transform(words))
}
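
For a quick sanity check, something like this should print a couple of the results (a minimal sketch; it assumes only the featurized RDD above):

featurized.take(2).foreach { case (path, vector) =>
  println(s"$path -> ${vector.size}-dimensional vector")
}

Note that this sticks to the org.apache.spark.mllib.feature package, which has been in MLlib since 1.1, so it should work on 1.2; the spark.ml pipeline API covered in the guide is newer.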


Mohammed

From: rishikesh thakur [mailto:rishikeshtha...@hotmail.com]
Sent: Friday, July 17, 2015 12:33 AM
To: Mohammed Guller
Subject: Re: Feature Generation On Spark


Thanks, I did look at the example. I am using Spark 1.2, and I guess the modules 
mentioned there are not in 1.2; the import is failing.


Rishi

________________________________
From: Mohammed Guller <moham...@glassbeam.com>
Sent: Friday, July 10, 2015 2:31 AM
To: rishikesh thakur; ayan guha; Michal Čizmazia
Cc: user
Subject: RE: Feature Generation On Spark


Take a look at the examples here:

https://spark.apache.org/docs/latest/ml-guide.html
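
For instance, the guide shows pipeline-style feature extraction along these lines (a rough sketch from memory, not verbatim from the guide; the column names are illustrative):

import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// Split the "text" column into words, then hash the words into vectors.
val tokenizer = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("words")
val hashingTF = new HashingTF()
  .setInputCol("words")
  .setOutputCol("features")
  .setNumFeatures(100000)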



Mohammed



From: rishikesh thakur [mailto:rishikeshtha...@hotmail.com]
Sent: Saturday, July 4, 2015 10:49 PM
To: ayan guha; Michal Čizmazia
Cc: user
Subject: RE: Feature Generation On Spark



I have one document per file and each file is to be converted to a feature 
vector. Pretty much like standard feature construction for document 
classification.



Thanks

Rishi

________________________________

Date: Sun, 5 Jul 2015 01:44:04 +1000
Subject: Re: Feature Generation On Spark
From: guha.a...@gmail.com
To: mici...@gmail.com
CC: rishikeshtha...@hotmail.com; user@spark.apache.org

Do you have one document per file, or multiple documents in one file?

On 4 Jul 2015 23:38, "Michal Čizmazia" <mici...@gmail.com> wrote:

SparkContext has a method wholeTextFiles. Is that what you need?
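
For example (assuming an existing SparkContext named sc; the path is a placeholder):

// Each element is (filePath, fileContents), so one file stays one record
// and you never have to merge partial feature vectors.
val docs = sc.wholeTextFiles("hdfs:///path/to/docs")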

On 4 July 2015 at 07:04, rishikesh <rishikeshtha...@hotmail.com> wrote:
> Hi
>
> I am new to Spark and am working on document classification. Before model
> fitting I need to do feature generation: each document is to be converted
> to a feature vector. However, I am not sure how to do that. While testing
> locally, I have a static list of tokens, and when I parse a file I look
> each token up and increment counters.
>
> In the case of Spark, I can create an RDD that loads all the documents;
> however, I am not sure whether one file goes to one executor or is split
> across several. If a file is split, the partial feature vectors need to be
> merged, but I am not able to figure out how to do that.
>
> Thanks
> Rishi
