Re: TF-IDF from spark-1.1.0 not working on cluster mode
This is the worker log, not the executor log. The executor log can be found in folders like /newdisk2/rta/rtauser/workerdir/app-20150109182514-0001/0/.

-Xiangrui

On Fri, Jan 9, 2015 at 5:03 AM, Priya Ch learnings.chitt...@gmail.com wrote:
> Please find the attached worker log. I could see a stream closed exception.

On Wed, Jan 7, 2015 at 10:51 AM, Xiangrui Meng men...@gmail.com wrote:
> Could you attach the executor log? That may help identify the root cause. -Xiangrui

On Mon, Jan 5, 2015 at 11:12 PM, Priya Ch learnings.chitt...@gmail.com wrote:
> [original question, code, and exception trimmed; see the first post of this thread below]
Re: TF-IDF from spark-1.1.0 not working on cluster mode
Please find the attached worker log. I could see a stream closed exception.

On Wed, Jan 7, 2015 at 10:51 AM, Xiangrui Meng men...@gmail.com wrote:
> Could you attach the executor log? That may help identify the root cause. -Xiangrui

On Mon, Jan 5, 2015 at 11:12 PM, Priya Ch learnings.chitt...@gmail.com wrote:
> [original question, code, and exception trimmed; see the first post of this thread below]

Attachment: spark-rtauser-org.apache.spark.deploy.worker.Worker-1-IMPETUS-DSRV02.out (binary data)
Re: TF-IDF from spark-1.1.0 not working on cluster mode
Could you attach the executor log? That may help identify the root cause.

-Xiangrui

On Mon, Jan 5, 2015 at 11:12 PM, Priya Ch learnings.chitt...@gmail.com wrote:
> [original question, code, and exception trimmed; see the first post of this thread below]
TF-IDF from spark-1.1.0 not working on cluster mode
Hi All,

The Word2Vec and TF-IDF algorithms in Spark MLlib 1.1.0 work only in local mode, not in distributed mode: a NullPointerException is thrown. Is this a bug in spark-1.1.0?

Following is the code:

def main(args: Array[String]) {
  val conf = new SparkConf
  val sc = new SparkContext(conf)
  val documents = sc.textFile("hdfs://IMPETUS-DSRV02:9000/nlp/sampletext").map(_.split(" ").toSeq)
  val hashingTF = new HashingTF()
  val tf = hashingTF.transform(documents)
  tf.cache()
  val idf = new IDF().fit(tf)
  val tfidf = idf.transform(tf)
  val rdd = tfidf.map { vec =>
    println("vector is " + vec)
    (10)
  }
  rdd.saveAsTextFile("/home/padma/usecase")
}

Exception thrown:

15/01/06 12:36:09 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
15/01/06 12:36:10 INFO cluster.SparkDeploySchedulerBackend: Registered executor: Actor[akka.tcp://sparkexecu...@impetus-dsrv05.impetus.co.in:33898/user/Executor#-1525890167] with ID 0
15/01/06 12:36:10 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, IMPETUS-DSRV05.impetus.co.in, NODE_LOCAL, 1408 bytes)
15/01/06 12:36:10 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, IMPETUS-DSRV05.impetus.co.in, NODE_LOCAL, 1408 bytes)
15/01/06 12:36:10 INFO storage.BlockManagerMasterActor: Registering block manager IMPETUS-DSRV05.impetus.co.in:35130 with 2.1 GB RAM
15/01/06 12:36:12 INFO network.ConnectionManager: Accepted connection from [IMPETUS-DSRV05.impetus.co.in/192.168.145.195:46888]
15/01/06 12:36:12 INFO network.SendingConnection: Initiating connection to [IMPETUS-DSRV05.impetus.co.in/192.168.145.195:35130]
15/01/06 12:36:12 INFO network.SendingConnection: Connected to [IMPETUS-DSRV05.impetus.co.in/192.168.145.195:35130], 1 messages pending
15/01/06 12:36:12 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on IMPETUS-DSRV05.impetus.co.in:35130 (size: 2.1 KB, free: 2.1 GB)
15/01/06 12:36:12 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on IMPETUS-DSRV05.impetus.co.in:35130 (size: 10.1 KB, free: 2.1 GB)
15/01/06 12:36:13 INFO storage.BlockManagerInfo: Added rdd_3_1 in memory on IMPETUS-DSRV05.impetus.co.in:35130 (size: 280.0 B, free: 2.1 GB)
15/01/06 12:36:13 INFO storage.BlockManagerInfo: Added rdd_3_0 in memory on IMPETUS-DSRV05.impetus.co.in:35130 (size: 416.0 B, free: 2.1 GB)
15/01/06 12:36:13 WARN scheduler.TaskSetManager: Lost task 1.0 in stage 0.0 (TID 1, IMPETUS-DSRV05.impetus.co.in): java.lang.NullPointerException:
    org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
    org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
    org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
    org.apache.spark.scheduler.Task.run(Task.scala:54)
    org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
    java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
    java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
    java.lang.Thread.run(Thread.java:722)

Thanks,
Padma Ch
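Two aspects of this job behave differently in cluster mode and are worth ruling out while chasing the NullPointerException: println inside map writes to each executor's stdout rather than the driver console, and an unqualified output path like /home/padma/usecase is resolved against the default filesystem, which on a cluster may end up on each worker's local disk. A minimal reworked sketch of the same job under those assumptions (the object name, app name, and output path are placeholders, not from the original post):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.feature.{HashingTF, IDF}

object TfIdfJob {
  def main(args: Array[String]) {
    val sc = new SparkContext(new SparkConf().setAppName("TfIdfJob"))
    val documents = sc.textFile("hdfs://IMPETUS-DSRV02:9000/nlp/sampletext")
      .map(_.split(" ").toSeq)
    val tf = new HashingTF().transform(documents)
    tf.cache() // IDF.fit and IDF.transform each traverse the TF vectors
    val tfidf = new IDF().fit(tf).transform(tf)
    // take(5) pulls a few vectors back to the driver, so the println output
    // appears in the driver console instead of executor stdout.
    tfidf.take(5).foreach(v => println("vector is " + v))
    // Use a fully qualified distributed path so all workers write to the
    // same filesystem.
    tfidf.saveAsTextFile("hdfs://IMPETUS-DSRV02:9000/nlp/tfidf-output")
  }
}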
Re: TF-IDF in Spark 1.1.0
Can you show how to do the IDF transform on tfWithId? Thanks.
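One way to do this, sketched under the assumption that IDFModel in Spark 1.1.0 accepts only an RDD[Vector] (no per-vector transform): fit and transform the values of tfWithId, then zip the document IDs back. The zip is valid because keys and values are element-wise projections of the same parent RDD, so partition counts and per-partition sizes line up. The helper name is illustrative:

import org.apache.spark.SparkContext._ // pair-RDD implicits (keys, values) in 1.1.x
import org.apache.spark.mllib.feature.IDF
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// tfWithId: RDD[(String, Vector)], e.g. from docs.mapValues(tf.transform)
def tfidfWithIds(tfWithId: RDD[(String, Vector)]): RDD[(String, Vector)] = {
  val tfVectors = tfWithId.values
  tfVectors.cache() // fit() and transform() each traverse the data
  val idfModel = new IDF().fit(tfVectors)
  // keys and tfVectors come from the same parent RDD, so zip pairs each
  // document ID with the TF-IDF vector computed from its TF vector.
  tfWithId.keys.zip(idfModel.transform(tfVectors))
}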
Re: TF-IDF in Spark 1.1.0
Thanks for the response. Appreciate the help!

Burke

On Tue, Oct 14, 2014 at 3:00 PM, Xiangrui Meng men...@gmail.com wrote:
> [reply and quoted question trimmed; see Xiangrui's answer below in this thread]
TF-IDF in Spark 1.1.0
I'm following the MLlib example for TF-IDF and ran into a problem due to my lack of knowledge of Scala and Spark. Any help would be greatly appreciated.

Following the MLlib example, I could do something like this:

import org.apache.spark.rdd.RDD
import org.apache.spark.SparkContext
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.feature.IDF

val sc: SparkContext = ...
val documents: RDD[Seq[String]] = sc.textFile(...).map(_.split(" ").toSeq)
val hashingTF = new HashingTF()
val tf: RDD[Vector] = hashingTF.transform(documents)
tf.cache()
val idf = new IDF().fit(tf)
val tfidf: RDD[Vector] = idf.transform(tf)

As a result I would have an RDD containing the TF-IDF vectors for the input documents. My question is: how do I map a vector back to the original input document? My end goal is to compute document similarity using cosine similarity. From what I can tell, I can compute TF-IDF, apply the L2 norm, and then compute the dot product. Has anybody done this?

Currently, my example looks more like this:

import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.feature.IDF
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD
import org.apache.spark.SparkContext

val sc: SparkContext = ...
// input is a sequence file of the form (docid: Text, content: Text)
val data: RDD[(String, String)] = sc.sequenceFile[String, String]("corpus")
val docs: RDD[(String, Seq[String])] = data.mapValues(v => v.split(" ").toSeq)
val hashingTF = new HashingTF()
val tf: RDD[(String, Vector)] = hashingTF.??

I'm trying to maintain the link from each document identifier to its eventual vector representation. Am I going about this incorrectly?

Thanks
Re: TF-IDF in Spark 1.1.0
You cannot recover the document from the TF-IDF vector, because HashingTF is not reversible. You can assign each document a unique ID and join the result back after training. HashingTF can transform an individual record:

val docs: RDD[(String, Seq[String])] = ...
val tf = new HashingTF()
val tfWithId: RDD[(String, Vector)] = docs.mapValues(tf.transform)
...

Best,
Xiangrui

On Tue, Oct 14, 2014 at 9:15 AM, Burke Webster burke.webs...@gmail.com wrote:
> [original question trimmed; see the first post of this thread above]
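For Burke's stated end goal, cosine similarity via L2 normalization plus dot products, a minimal sketch along the lines he describes. It assumes the sparse vectors produced by HashingTF carry indices in sorted order, and it leaves open how document pairs are generated (cartesian, broadcasting a query vector, etc.); tfidfWithId stands for a keyed TF-IDF RDD built as in the thread above:

import org.apache.spark.SparkContext._ // pair-RDD implicits in 1.1.x
import org.apache.spark.mllib.feature.Normalizer
import org.apache.spark.mllib.linalg.{SparseVector, Vector}
import org.apache.spark.rdd.RDD

val tfidfWithId: RDD[(String, Vector)] = ... // keyed TF-IDF vectors, as above

// L2-normalize (p = 2 is the Normalizer default) so that cosine similarity
// between two documents reduces to a plain dot product.
val normalizer = new Normalizer()
val normalized: RDD[(String, Vector)] = tfidfWithId.mapValues(normalizer.transform)

// Dot product of two sparse vectors with sorted indices (merge-style scan).
def dot(a: SparseVector, b: SparseVector): Double = {
  var i = 0
  var j = 0
  var s = 0.0
  while (i < a.indices.length && j < b.indices.length) {
    if (a.indices(i) == b.indices(j)) {
      s += a.values(i) * b.values(j); i += 1; j += 1
    } else if (a.indices(i) < b.indices(j)) {
      i += 1
    } else {
      j += 1
    }
  }
  s
}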