Hi Andy,

Actually, the output of the ML IDF model is the TF-IDF vector of each instance, not the IDF vector, so there is no need to do a member-wise multiplication to calculate the TF-IDF values. You can refer to the code here:
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/IDF.scala#L121

I found that the documentation of IDF is not very clear; we need to update it.
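To make the arithmetic concrete, here is a minimal plain-Java sketch (no Spark dependency) that recomputes the first row of the quoted sample output by hand. It assumes the smoothed formula from the MLlib feature-extraction docs, idf = log((m + 1) / (df + 1)), where m is the number of documents and df is the number of documents containing the term. The class and method names (TfIdfSketch, idf, tfIdf) are hypothetical and exist only for this illustration, and the document frequencies are read off the quoted table (bucket 1 occurs in all 4 rows, bucket 2 only in row 0).

```java
import java.util.Arrays;

// Plain-Java illustration only; this is not Spark code.
class TfIdfSketch {

    // Smoothed IDF as described in the MLlib docs: log((m + 1) / (df + 1)).
    static double idf(long numDocs, long docFreq) {
        return Math.log((numDocs + 1.0) / (docFreq + 1.0));
    }

    // Member-wise product tf[i] * idf[i] over the same bucket indices.
    static double[] tfIdf(double[] tf, double[] idf) {
        double[] out = new double[tf.length];
        for (int i = 0; i < tf.length; i++) {
            out[i] = tf[i] * idf[i];
        }
        return out;
    }

    public static void main(String[] args) {
        long numDocs = 4; // four training rows in the quoted example

        // Row 0 of the quoted table: buckets [1, 2] with term counts [2.0, 1.0].
        double[] tf = {2.0, 1.0};

        // Assumed document frequencies, read off the quoted table:
        // bucket 1 appears in all 4 rows, bucket 2 only in row 0.
        double[] idfValues = {idf(numDocs, 4), idf(numDocs, 1)};

        System.out.println(Arrays.toString(tfIdf(tf, idfValues)));
    }
}
```

The product for row 0 agrees with the values already shown in the "idf" column for that row, (7,[1,2],[0.0,0.9162907318741551]). Note that in this small sample every term with a non-zero IDF has a term count of 1.0, so the table by itself cannot distinguish an IDF vector from a TF-IDF vector, which may be why the column looked like plain IDF.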
Thanks
Yanbo

2016-01-16 6:10 GMT+08:00 Andy Davidson <a...@santacruzintegration.com>:

> I wonder if I am missing something? TF-IDF is very popular. Spark ML has a
> lot of transformers; however, TF-IDF is not supported directly.
>
> Spark provides a HashingTF and an IDF transformer. The Java doc
> http://spark.apache.org/docs/latest/mllib-feature-extraction.html#tf-idf
> mentions you can implement TF-IDF as follows:
>
> TFIDF(t,d,D) = TF(t,d) · IDF(t,D)
>
> The problem I am running into is that both HashingTF and IDF return a
> sparse vector.
>
> *Ideally the Spark code to implement TF-IDF would be one line.*
>
> DataFrame ret = tmp.withColumn("features", tmp.col("tf").multiply(tmp.col("idf")));
>
> org.apache.spark.sql.AnalysisException: cannot resolve '(tf * idf)' due to
> data type mismatch: '(tf * idf)' requires numeric type, not vector;
>
> I could implement my own UDF to do member-wise multiplication; however,
> given how common TF-IDF is, I wonder if this code already exists somewhere.
>
> I found org.apache.spark.util.Vector.Multiplier. There is no
> documentation; however, given that the argument is a double, my guess is
> that it just does scalar multiplication.
>
> I guess I could do something like
>
> double[] v = mySparkVector.toArray();
>
> and then use JBlas to do the member-wise multiplication.
>
> I assume sparse vectors are not distributed, so there would not be any
> additional communication cost.
>
> If this code is truly missing, I would be happy to write it and donate it.
>
> Andy
>
> From: Andrew Davidson <a...@santacruzintegration.com>
> Date: Wednesday, January 13, 2016 at 2:52 PM
> To: "user @spark" <user@spark.apache.org>
> Subject: trouble calculating TF-IDF data type mismatch: '(tf * idf)'
> requires numeric type, not vector;
>
> Below is a little snippet of my Java test code. Any idea how I implement
> member-wise vector multiplication?
>
> Kind regards
>
> Andy
>
> transformed df printSchema()
>
> root
>  |-- id: integer (nullable = false)
>  |-- label: double (nullable = false)
>  |-- words: array (nullable = false)
>  |    |-- element: string (containsNull = true)
>  |-- tf: vector (nullable = true)
>  |-- idf: vector (nullable = true)
>
> +---+-----+----------------------------+-------------------------+-------------------------------------------------------+
> |id |label|words                       |tf                       |idf                                                    |
> +---+-----+----------------------------+-------------------------+-------------------------------------------------------+
> |0  |0.0  |[Chinese, Beijing, Chinese] |(7,[1,2],[2.0,1.0])      |(7,[1,2],[0.0,0.9162907318741551])                     |
> |1  |0.0  |[Chinese, Chinese, Shanghai]|(7,[1,4],[2.0,1.0])      |(7,[1,4],[0.0,0.9162907318741551])                     |
> |2  |0.0  |[Chinese, Macao]            |(7,[1,6],[1.0,1.0])      |(7,[1,6],[0.0,0.9162907318741551])                     |
> |3  |1.0  |[Tokyo, Japan, Chinese]     |(7,[1,3,5],[1.0,1.0,1.0])|(7,[1,3,5],[0.0,0.9162907318741551,0.9162907318741551])|
> +---+-----+----------------------------+-------------------------+-------------------------------------------------------+
>
> @Test
> public void test() {
>     DataFrame rawTrainingDF = createTrainingData();
>     DataFrame trainingDF = runPipleLineTF_IDF(rawTrainingDF);
>     . . .
> }
>
> private DataFrame runPipleLineTF_IDF(DataFrame rawDF) {
>     HashingTF hashingTF = new HashingTF()
>             .setInputCol("words")
>             .setOutputCol("tf")
>             .setNumFeatures(dictionarySize);
>
>     DataFrame termFrequenceDF = hashingTF.transform(rawDF);
>
>     termFrequenceDF.cache(); // idf needs to make 2 passes over data set
>     IDFModel idf = new IDF()
>             //.setMinDocFreq(1) // our vocabulary has 6 words we hash into 7
>             .setInputCol(hashingTF.getOutputCol())
>             .setOutputCol("idf")
>             .fit(termFrequenceDF);
>
>     DataFrame tmp = idf.transform(termFrequenceDF);
>
>     DataFrame ret = tmp.withColumn("features", tmp.col("tf").multiply(tmp.col("idf")));
>     logger.warn("\ntransformed df printSchema()");
>     ret.printSchema();
>     ret.show(false);
>
>     return ret;
> }
>
> org.apache.spark.sql.AnalysisException: cannot resolve '(tf * idf)' due to
> data type mismatch: '(tf * idf)' requires numeric type, not vector;
>
> private DataFrame createTrainingData() {
>     // make sure we only use dictionarySize words
>     JavaRDD<Row> rdd = javaSparkContext.parallelize(Arrays.asList(
>             // 0 is Chinese
>             // 1 is notChinese
>             RowFactory.create(0, 0.0, Arrays.asList("Chinese", "Beijing", "Chinese")),
>             RowFactory.create(1, 0.0, Arrays.asList("Chinese", "Chinese", "Shanghai")),
>             RowFactory.create(2, 0.0, Arrays.asList("Chinese", "Macao")),
>             RowFactory.create(3, 1.0, Arrays.asList("Tokyo", "Japan", "Chinese"))));
>
>     return createData(rdd);
> }
>
> private DataFrame createTestData() {
>     JavaRDD<Row> rdd = javaSparkContext.parallelize(Arrays.asList(
>             // 0 is Chinese
>             // 1 is notChinese
>             // "bernoulli" requires label to be IntegerType
>             RowFactory.create(4, 1.0, Arrays.asList("Chinese", "Chinese", "Chinese", "Tokyo", "Japan"))));
>     return createData(rdd);
> }