Re: TF-IDF from spark-1.1.0 not working on cluster mode

2015-01-09 Thread Xiangrui Meng
This is the worker log, not the executor log. The executor log can be found in
folders like /newdisk2/rta/rtauser/workerdir/app-20150109182514-0001/0/
. -Xiangrui

On Fri, Jan 9, 2015 at 5:03 AM, Priya Ch learnings.chitt...@gmail.com wrote:
 Please find the attached worker log.
 I could see a stream-closed exception in it.

 On Wed, Jan 7, 2015 at 10:51 AM, Xiangrui Meng men...@gmail.com wrote:

 Could you attach the executor log? That may help identify the root
 cause. -Xiangrui

 On Mon, Jan 5, 2015 at 11:12 PM, Priya Ch learnings.chitt...@gmail.com
 wrote:
  Hi All,
 
  The Word2Vec and TF-IDF algorithms in Spark MLlib 1.1.0 work only in
  local mode, not in distributed mode. A null pointer exception is thrown.
  Is this a bug in Spark 1.1.0?
 
  Following is the code:
   def main(args: Array[String]) {
     val conf = new SparkConf
     val sc = new SparkContext(conf)
     val documents = sc.textFile("hdfs://IMPETUS-DSRV02:9000/nlp/sampletext")
       .map(_.split(" ").toSeq)
     val hashingTF = new HashingTF()
     val tf = hashingTF.transform(documents)
     tf.cache()
     val idf = new IDF().fit(tf)
     val tfidf = idf.transform(tf)
     val rdd = tfidf.map { vec =>
       println("vector is" + vec)
       10
     }
     rdd.saveAsTextFile("/home/padma/usecase")
   }
 
 
 
 
  Exception thrown:
 
  15/01/06 12:36:09 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
  15/01/06 12:36:10 INFO cluster.SparkDeploySchedulerBackend: Registered executor: Actor[akka.tcp://sparkexecu...@impetus-dsrv05.impetus.co.in:33898/user/Executor#-1525890167] with ID 0
  15/01/06 12:36:10 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, IMPETUS-DSRV05.impetus.co.in, NODE_LOCAL, 1408 bytes)
  15/01/06 12:36:10 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, IMPETUS-DSRV05.impetus.co.in, NODE_LOCAL, 1408 bytes)
  15/01/06 12:36:10 INFO storage.BlockManagerMasterActor: Registering block manager IMPETUS-DSRV05.impetus.co.in:35130 with 2.1 GB RAM
  15/01/06 12:36:12 INFO network.ConnectionManager: Accepted connection from [IMPETUS-DSRV05.impetus.co.in/192.168.145.195:46888]
  15/01/06 12:36:12 INFO network.SendingConnection: Initiating connection to [IMPETUS-DSRV05.impetus.co.in/192.168.145.195:35130]
  15/01/06 12:36:12 INFO network.SendingConnection: Connected to [IMPETUS-DSRV05.impetus.co.in/192.168.145.195:35130], 1 messages pending
  15/01/06 12:36:12 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on IMPETUS-DSRV05.impetus.co.in:35130 (size: 2.1 KB, free: 2.1 GB)
  15/01/06 12:36:12 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on IMPETUS-DSRV05.impetus.co.in:35130 (size: 10.1 KB, free: 2.1 GB)
  15/01/06 12:36:13 INFO storage.BlockManagerInfo: Added rdd_3_1 in memory on IMPETUS-DSRV05.impetus.co.in:35130 (size: 280.0 B, free: 2.1 GB)
  15/01/06 12:36:13 INFO storage.BlockManagerInfo: Added rdd_3_0 in memory on IMPETUS-DSRV05.impetus.co.in:35130 (size: 416.0 B, free: 2.1 GB)
  15/01/06 12:36:13 WARN scheduler.TaskSetManager: Lost task 1.0 in stage 0.0 (TID 1, IMPETUS-DSRV05.impetus.co.in): java.lang.NullPointerException:
      org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
      org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
      org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
      org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
      org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
      org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
      org.apache.spark.scheduler.Task.run(Task.scala:54)
      org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
      java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
      java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
      java.lang.Thread.run(Thread.java:722)
 
 
  Thanks,
  Padma Ch



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: TF-IDF from spark-1.1.0 not working on cluster mode

2015-01-09 Thread Priya Ch
Please find the attached worker log.
I could see a stream-closed exception in it.

On Wed, Jan 7, 2015 at 10:51 AM, Xiangrui Meng men...@gmail.com wrote:

 Could you attach the executor log? That may help identify the root
 cause. -Xiangrui

 On Mon, Jan 5, 2015 at 11:12 PM, Priya Ch learnings.chitt...@gmail.com
 wrote:
  Hi All,
 
  The Word2Vec and TF-IDF algorithms in Spark MLlib 1.1.0 work only in
  local mode, not in distributed mode. A null pointer exception is thrown.
  Is this a bug in Spark 1.1.0?
 
  Following is the code:
   def main(args: Array[String]) {
     val conf = new SparkConf
     val sc = new SparkContext(conf)
     val documents = sc.textFile("hdfs://IMPETUS-DSRV02:9000/nlp/sampletext")
       .map(_.split(" ").toSeq)
     val hashingTF = new HashingTF()
     val tf = hashingTF.transform(documents)
     tf.cache()
     val idf = new IDF().fit(tf)
     val tfidf = idf.transform(tf)
     val rdd = tfidf.map { vec =>
       println("vector is" + vec)
       10
     }
     rdd.saveAsTextFile("/home/padma/usecase")
   }
 
 
 
 
  Exception thrown:
 
  15/01/06 12:36:09 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
  15/01/06 12:36:10 INFO cluster.SparkDeploySchedulerBackend: Registered executor: Actor[akka.tcp://sparkexecu...@impetus-dsrv05.impetus.co.in:33898/user/Executor#-1525890167] with ID 0
  15/01/06 12:36:10 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, IMPETUS-DSRV05.impetus.co.in, NODE_LOCAL, 1408 bytes)
  15/01/06 12:36:10 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, IMPETUS-DSRV05.impetus.co.in, NODE_LOCAL, 1408 bytes)
  15/01/06 12:36:10 INFO storage.BlockManagerMasterActor: Registering block manager IMPETUS-DSRV05.impetus.co.in:35130 with 2.1 GB RAM
  15/01/06 12:36:12 INFO network.ConnectionManager: Accepted connection from [IMPETUS-DSRV05.impetus.co.in/192.168.145.195:46888]
  15/01/06 12:36:12 INFO network.SendingConnection: Initiating connection to [IMPETUS-DSRV05.impetus.co.in/192.168.145.195:35130]
  15/01/06 12:36:12 INFO network.SendingConnection: Connected to [IMPETUS-DSRV05.impetus.co.in/192.168.145.195:35130], 1 messages pending
  15/01/06 12:36:12 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on IMPETUS-DSRV05.impetus.co.in:35130 (size: 2.1 KB, free: 2.1 GB)
  15/01/06 12:36:12 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on IMPETUS-DSRV05.impetus.co.in:35130 (size: 10.1 KB, free: 2.1 GB)
  15/01/06 12:36:13 INFO storage.BlockManagerInfo: Added rdd_3_1 in memory on IMPETUS-DSRV05.impetus.co.in:35130 (size: 280.0 B, free: 2.1 GB)
  15/01/06 12:36:13 INFO storage.BlockManagerInfo: Added rdd_3_0 in memory on IMPETUS-DSRV05.impetus.co.in:35130 (size: 416.0 B, free: 2.1 GB)
  15/01/06 12:36:13 WARN scheduler.TaskSetManager: Lost task 1.0 in stage 0.0 (TID 1, IMPETUS-DSRV05.impetus.co.in): java.lang.NullPointerException:
      org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
      org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
      org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
      org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
      org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
      org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
      org.apache.spark.scheduler.Task.run(Task.scala:54)
      org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
      java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
      java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
      java.lang.Thread.run(Thread.java:722)
 
 
  Thanks,
  Padma Ch



spark-rtauser-org.apache.spark.deploy.worker.Worker-1-IMPETUS-DSRV02.out
Description: Binary data

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Re: TF-IDF from spark-1.1.0 not working on cluster mode

2015-01-06 Thread Xiangrui Meng
Could you attach the executor log? That may help identify the root
cause. -Xiangrui

On Mon, Jan 5, 2015 at 11:12 PM, Priya Ch learnings.chitt...@gmail.com wrote:
 Hi All,

 The Word2Vec and TF-IDF algorithms in Spark MLlib 1.1.0 work only in
 local mode, not in distributed mode. A null pointer exception is thrown.
 Is this a bug in Spark 1.1.0?

 Following is the code:
   def main(args: Array[String]) {
     val conf = new SparkConf
     val sc = new SparkContext(conf)
     val documents = sc.textFile("hdfs://IMPETUS-DSRV02:9000/nlp/sampletext")
       .map(_.split(" ").toSeq)
     val hashingTF = new HashingTF()
     val tf = hashingTF.transform(documents)
     tf.cache()
     val idf = new IDF().fit(tf)
     val tfidf = idf.transform(tf)
     val rdd = tfidf.map { vec =>
       println("vector is" + vec)
       10
     }
     rdd.saveAsTextFile("/home/padma/usecase")
   }




 Exception thrown:

  15/01/06 12:36:09 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
  15/01/06 12:36:10 INFO cluster.SparkDeploySchedulerBackend: Registered executor: Actor[akka.tcp://sparkexecu...@impetus-dsrv05.impetus.co.in:33898/user/Executor#-1525890167] with ID 0
  15/01/06 12:36:10 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, IMPETUS-DSRV05.impetus.co.in, NODE_LOCAL, 1408 bytes)
  15/01/06 12:36:10 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, IMPETUS-DSRV05.impetus.co.in, NODE_LOCAL, 1408 bytes)
  15/01/06 12:36:10 INFO storage.BlockManagerMasterActor: Registering block manager IMPETUS-DSRV05.impetus.co.in:35130 with 2.1 GB RAM
  15/01/06 12:36:12 INFO network.ConnectionManager: Accepted connection from [IMPETUS-DSRV05.impetus.co.in/192.168.145.195:46888]
  15/01/06 12:36:12 INFO network.SendingConnection: Initiating connection to [IMPETUS-DSRV05.impetus.co.in/192.168.145.195:35130]
  15/01/06 12:36:12 INFO network.SendingConnection: Connected to [IMPETUS-DSRV05.impetus.co.in/192.168.145.195:35130], 1 messages pending
  15/01/06 12:36:12 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on IMPETUS-DSRV05.impetus.co.in:35130 (size: 2.1 KB, free: 2.1 GB)
  15/01/06 12:36:12 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on IMPETUS-DSRV05.impetus.co.in:35130 (size: 10.1 KB, free: 2.1 GB)
  15/01/06 12:36:13 INFO storage.BlockManagerInfo: Added rdd_3_1 in memory on IMPETUS-DSRV05.impetus.co.in:35130 (size: 280.0 B, free: 2.1 GB)
  15/01/06 12:36:13 INFO storage.BlockManagerInfo: Added rdd_3_0 in memory on IMPETUS-DSRV05.impetus.co.in:35130 (size: 416.0 B, free: 2.1 GB)
  15/01/06 12:36:13 WARN scheduler.TaskSetManager: Lost task 1.0 in stage 0.0 (TID 1, IMPETUS-DSRV05.impetus.co.in): java.lang.NullPointerException:
      org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
      org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
      org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
      org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
      org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
      org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
      org.apache.spark.scheduler.Task.run(Task.scala:54)
      org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
      java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
      java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
      java.lang.Thread.run(Thread.java:722)


 Thanks,
 Padma Ch

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



TF-IDF from spark-1.1.0 not working on cluster mode

2015-01-05 Thread Priya Ch
Hi All,

The Word2Vec and TF-IDF algorithms in Spark MLlib 1.1.0 work only in
local mode, not in distributed mode. A null pointer exception is thrown.
Is this a bug in Spark 1.1.0?

*Following is the code:*
  def main(args: Array[String]) {
    val conf = new SparkConf
    val sc = new SparkContext(conf)
    val documents = sc.textFile("hdfs://IMPETUS-DSRV02:9000/nlp/sampletext")
      .map(_.split(" ").toSeq)
    val hashingTF = new HashingTF()
    val tf = hashingTF.transform(documents)
    tf.cache()
    val idf = new IDF().fit(tf)
    val tfidf = idf.transform(tf)
    val rdd = tfidf.map { vec =>
      println("vector is" + vec)
      10
    }
    rdd.saveAsTextFile("/home/padma/usecase")
  }




*Exception thrown:*

15/01/06 12:36:09 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
15/01/06 12:36:10 INFO cluster.SparkDeploySchedulerBackend: Registered executor: Actor[akka.tcp://sparkexecu...@impetus-dsrv05.impetus.co.in:33898/user/Executor#-1525890167] with ID 0
15/01/06 12:36:10 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, IMPETUS-DSRV05.impetus.co.in, NODE_LOCAL, 1408 bytes)
15/01/06 12:36:10 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, IMPETUS-DSRV05.impetus.co.in, NODE_LOCAL, 1408 bytes)
15/01/06 12:36:10 INFO storage.BlockManagerMasterActor: Registering block manager IMPETUS-DSRV05.impetus.co.in:35130 with 2.1 GB RAM
15/01/06 12:36:12 INFO network.ConnectionManager: Accepted connection from [IMPETUS-DSRV05.impetus.co.in/192.168.145.195:46888]
15/01/06 12:36:12 INFO network.SendingConnection: Initiating connection to [IMPETUS-DSRV05.impetus.co.in/192.168.145.195:35130]
15/01/06 12:36:12 INFO network.SendingConnection: Connected to [IMPETUS-DSRV05.impetus.co.in/192.168.145.195:35130], 1 messages pending
15/01/06 12:36:12 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on IMPETUS-DSRV05.impetus.co.in:35130 (size: 2.1 KB, free: 2.1 GB)
15/01/06 12:36:12 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on IMPETUS-DSRV05.impetus.co.in:35130 (size: 10.1 KB, free: 2.1 GB)
15/01/06 12:36:13 INFO storage.BlockManagerInfo: Added rdd_3_1 in memory on IMPETUS-DSRV05.impetus.co.in:35130 (size: 280.0 B, free: 2.1 GB)
15/01/06 12:36:13 INFO storage.BlockManagerInfo: Added rdd_3_0 in memory on IMPETUS-DSRV05.impetus.co.in:35130 (size: 416.0 B, free: 2.1 GB)
15/01/06 12:36:13 WARN scheduler.TaskSetManager: Lost task 1.0 in stage 0.0 (TID 1, IMPETUS-DSRV05.impetus.co.in): java.lang.NullPointerException:
    org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
    org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
    org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
    org.apache.spark.scheduler.Task.run(Task.scala:54)
    org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
    java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
    java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
    java.lang.Thread.run(Thread.java:722)


Thanks,
Padma Ch
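
For reference, a compilable, self-contained sketch of the program above (the
object name is arbitrary; the paths are the ones from the post, and the
debugging map that printed each vector and returned 10 is dropped):

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.mllib.feature.{HashingTF, IDF}

  object TfIdfExample {
    def main(args: Array[String]): Unit = {
      val conf = new SparkConf().setAppName("TfIdfExample")
      val sc = new SparkContext(conf)

      // Tokenize each input line into a sequence of terms.
      val documents = sc.textFile("hdfs://IMPETUS-DSRV02:9000/nlp/sampletext")
        .map(_.split(" ").toSeq)

      val hashingTF = new HashingTF()
      val tf = hashingTF.transform(documents)
      tf.cache() // IDF().fit and idf.transform each traverse tf

      val idf = new IDF().fit(tf)
      val tfidf = idf.transform(tf)

      tfidf.saveAsTextFile("/home/padma/usecase")
      sc.stop()
    }
  }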


Re: TF-IDF in Spark 1.1.0

2014-12-28 Thread Yao
Can you show how to do IDF transform on tfWithId? Thanks.
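
One way to do that, as a sketch: fit the IDF on the vectors alone, then zip
the document IDs back on. This assumes tfWithId is an RDD[(String, Vector)]
as in Xiangrui's snippet below, and the zip is safe only because keys,
values, and IDFModel.transform are all partition-preserving maps over the
same parent RDD:

  import org.apache.spark.SparkContext._ // implicit PairRDDFunctions (keys/values)
  import org.apache.spark.mllib.feature.IDF
  import org.apache.spark.mllib.linalg.Vector
  import org.apache.spark.rdd.RDD

  val tf: RDD[Vector] = tfWithId.values
  tf.cache() // fit() and transform() each traverse tf
  val idf = new IDF().fit(tf)
  // Both sides have the same partitioning and per-partition counts, so zip aligns.
  val tfidfWithId: RDD[(String, Vector)] = tfWithId.keys.zip(idf.transform(tf))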



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/TF-IDF-in-Spark-1-1-0-tp16389p20877.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: TF-IDF in Spark 1.1.0

2014-10-16 Thread Burke Webster
Thanks for the response.  Appreciate the help!

Burke

On Tue, Oct 14, 2014 at 3:00 PM, Xiangrui Meng men...@gmail.com wrote:

 You cannot recover the document from the TF-IDF vector, because
 HashingTF is not reversible. You can assign each document a unique ID,
 and join back the result after training. HashingTF can transform
 individual records:

 val docs: RDD[(String, Seq[String])] = ...

 val tf = new HashingTF()
 val tfWithId: RDD[(String, Vector)] = docs.mapValues(tf.transform)

 ...

 Best,
 Xiangrui

 On Tue, Oct 14, 2014 at 9:15 AM, Burke Webster burke.webs...@gmail.com
 wrote:
  I'm following the MLlib example for TF-IDF and ran into a problem due to my
  lack of knowledge of Scala and Spark.  Any help would be greatly
  appreciated.
 
  Following the Mllib example I could do something like this:
 
  import org.apache.spark.rdd.RDD
  import org.apache.spark.SparkContext
  import org.apache.spark.mllib.feature.HashingTF
  import org.apache.spark.mllib.linalg.Vector
  import org.apache.spark.mllib.feature.IDF
 
  val sc: SparkContext = ...
  val documents: RDD[Seq[String]] = sc.textFile(...).map(_.split(" ").toSeq)
 
  val hashingTF = new HashingTF()
  val tf: RDD[Vector] = hashingTF.transform(documents)
  tf.cache()
 
  val idf = new IDF().fit(tf)
  val tfidf: RDD[Vector] = idf.transform(tf)
 
  As a result I would have an RDD containing the TF-IDF vectors for the
 input
  documents.  My question is how do I map the vector back to the original
  input document?
 
  My end goal is to compute document similarity using cosine similarity.
 From
  what I can tell, I can compute TF-IDF, apply the L2 norm, and then
 compute
  the dot-product.  Has anybody done this?
 
  Currently, my example looks more like this:
 
  import org.apache.spark.SparkContext._
  import org.apache.spark.SparkConf
  import org.apache.spark.mllib.feature.HashingTF
  import org.apache.spark.mllib.feature.IDF
  import org.apache.spark.mllib.linalg.Vector
  import org.apache.spark.rdd.RDD
  import org.apache.spark.SparkContext
 
  val sc: SparkContext = ...
 
  // input is sequence file of the form (docid: Text, content: Text)
  val data: RDD[(String, String)] = sc.sequenceFile[String, String]("corpus")
 
  val docs: RDD[(String, Seq[String])] = data.mapValues(v => v.split(" ").toSeq)
 
  val hashingTF = new HashingTF()
  val tf: RDD[(String, Vector)] = hashingTF.??
 
  I'm trying to maintain some linking from the document identifier to its
  eventual vector representation. Am I going about this incorrectly?
 
  Thanks



TF-IDF in Spark 1.1.0

2014-10-14 Thread Burke Webster
I'm following the MLlib example for TF-IDF and ran into a problem due to my
lack of knowledge of Scala and Spark.  Any help would be greatly
appreciated.

Following the Mllib example I could do something like this:

import org.apache.spark.rdd.RDD
import org.apache.spark.SparkContext
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.feature.IDF

val sc: SparkContext = ...
val documents: RDD[Seq[String]] = sc.textFile(...).map(_.split(" ").toSeq)

val hashingTF = new HashingTF()
val tf: RDD[Vector] = hashingTF.transform(documents)
tf.cache()

val idf = new IDF().fit(tf)
val tfidf: RDD[Vector] = idf.transform(tf)

As a result I would have an RDD containing the TF-IDF vectors for the input
documents.  My question is how do I map the vector back to the original
input document?

My end goal is to compute document similarity using cosine similarity.
From what I can tell, I can compute TF-IDF, apply the L2 norm, and then
compute the dot-product.  Has anybody done this?
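
A sketch of that approach (assuming tfidfWithId: RDD[(String, Vector)] holds
document IDs with their TF-IDF vectors, and that MLlib's Normalizer is
available; the cartesian pairing is O(n^2) and only feasible for small
corpora):

  import org.apache.spark.SparkContext._ // implicit PairRDDFunctions
  import org.apache.spark.mllib.feature.Normalizer
  import org.apache.spark.mllib.linalg.Vector
  import org.apache.spark.rdd.RDD

  // L2-normalize so a plain dot product equals cosine similarity.
  val normalizer = new Normalizer() // p = 2 by default
  val normed: RDD[(String, Vector)] =
    tfidfWithId.mapValues(v => normalizer.transform(v))

  // Dense dot product; fine for a sketch, wasteful for very sparse vectors.
  def dot(a: Vector, b: Vector): Double = {
    val xs = a.toArray
    val ys = b.toArray
    var s = 0.0
    var i = 0
    while (i < xs.length) { s += xs(i) * ys(i); i += 1 }
    s
  }

  // All unordered pairs of distinct documents with their cosine similarity.
  val sims = normed.cartesian(normed)
    .filter { case ((id1, _), (id2, _)) => id1 < id2 }
    .map { case ((id1, v1), (id2, v2)) => (id1, id2, dot(v1, v2)) }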

Currently, my example looks more like this:

import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.feature.IDF
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD
import org.apache.spark.SparkContext

val sc: SparkContext = ...

// input is sequence file of the form (docid: Text, content: Text)
val data: RDD[(String, String)] = sc.sequenceFile[String, String]("corpus")

val docs: RDD[(String, Seq[String])] = data.mapValues(v => v.split(" ").toSeq)

val hashingTF = new HashingTF()
val tf: RDD[(String, Vector)] = hashingTF.??

I'm trying to maintain some linking from the document identifier to its
eventual vector representation. Am I going about this incorrectly?

Thanks


Re: TF-IDF in Spark 1.1.0

2014-10-14 Thread Xiangrui Meng
You cannot recover the document from the TF-IDF vector, because
HashingTF is not reversible. You can assign each document a unique ID,
and join back the result after training. HashingTF can transform
individual records:

val docs: RDD[(String, Seq[String])] = ...

val tf = new HashingTF()
val tfWithId: RDD[(String, Vector)] = docs.mapValues(tf.transform)

...

Best,
Xiangrui

On Tue, Oct 14, 2014 at 9:15 AM, Burke Webster burke.webs...@gmail.com wrote:
 I'm following the MLlib example for TF-IDF and ran into a problem due to my
 lack of knowledge of Scala and Spark.  Any help would be greatly
 appreciated.

 Following the Mllib example I could do something like this:

 import org.apache.spark.rdd.RDD
 import org.apache.spark.SparkContext
 import org.apache.spark.mllib.feature.HashingTF
 import org.apache.spark.mllib.linalg.Vector
 import org.apache.spark.mllib.feature.IDF

 val sc: SparkContext = ...
 val documents: RDD[Seq[String]] = sc.textFile(...).map(_.split(" ").toSeq)

 val hashingTF = new HashingTF()
 val tf: RDD[Vector] = hashingTF.transform(documents)
 tf.cache()

 val idf = new IDF().fit(tf)
 val tfidf: RDD[Vector] = idf.transform(tf)

 As a result I would have an RDD containing the TF-IDF vectors for the input
 documents.  My question is how do I map the vector back to the original
 input document?

 My end goal is to compute document similarity using cosine similarity.  From
 what I can tell, I can compute TF-IDF, apply the L2 norm, and then compute
 the dot-product.  Has anybody done this?

 Currently, my example looks more like this:

 import org.apache.spark.SparkContext._
 import org.apache.spark.SparkConf
 import org.apache.spark.mllib.feature.HashingTF
 import org.apache.spark.mllib.feature.IDF
 import org.apache.spark.mllib.linalg.Vector
 import org.apache.spark.rdd.RDD
 import org.apache.spark.SparkContext

 val sc: SparkContext = ...

 // input is sequence file of the form (docid: Text, content: Text)
 val data: RDD[(String, String)] = sc.sequenceFile[String, String]("corpus")

 val docs: RDD[(String, Seq[String])] = data.mapValues(v => v.split(" ").toSeq)

 val hashingTF = new HashingTF()
 val tf: RDD[(String, Vector)] = hashingTF.??

 I'm trying to maintain some linking from the document identifier to its
 eventual vector representation. Am I going about this incorrectly?

 Thanks

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org