Hi Andy,

Actually, the output of the ML IDF model is the TF-IDF vector of each
instance rather than the IDF vector, so it's unnecessary to do member-wise
multiplication to calculate the TF-IDF value. You can refer to the code
here:
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/IDF.scala#L121
I found the documentation of IDF is not very clear; we need to update it.
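
For example, here is a minimal sketch (based on the pipeline code in your
message below, reusing your column names and dictionarySize) that writes
the TF-IDF result straight into a "features" column:

    HashingTF hashingTF = new HashingTF()
            .setInputCol("words")
            .setOutputCol("tf")
            .setNumFeatures(dictionarySize);
    DataFrame termFrequencyDF = hashingTF.transform(rawDF);

    IDFModel idfModel = new IDF()
            .setInputCol(hashingTF.getOutputCol())
            .setOutputCol("features") // transform() outputs TF * IDF directly
            .fit(termFrequencyDF);
    DataFrame tfidfDF = idfModel.transform(termFrequencyDF);

No extra multiply step is needed; IDFModel.transform already scales each
term frequency by its IDF weight.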

Thanks
Yanbo

2016-01-16 6:10 GMT+08:00 Andy Davidson <a...@santacruzintegration.com>:

> I wonder if I am missing something. TF-IDF is very popular. Spark ML has a
> lot of transformers; however, TF-IDF is not supported directly.
>
> Spark provides HashingTF and IDF transformers. The documentation at
> http://spark.apache.org/docs/latest/mllib-feature-extraction.html#tf-idf
> mentions you can implement TF-IDF as follows:
>
> TFIDF(t, d, D) = TF(t, d) * IDF(t, D)
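>
> (If I read the docs right, Spark computes IDF(t, D) = log((|D| + 1) / (DF(t, D) + 1)).
> With my 4 training documents, a term that appears in every document gets
> log(5/5) = 0.0 and a term that appears in only one document gets
> log(5/2) ≈ 0.9163, which matches the idf column in the output below.)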
>
> The problem I am running into is that both HashingTF and IDF return a
> sparse vector.
>
> *Ideally the Spark code to implement TF-IDF would be one line:*
>
>
> DataFrame ret = tmp.withColumn("features",
>         tmp.col("tf").multiply(tmp.col("idf")));
>
> org.apache.spark.sql.AnalysisException: cannot resolve '(tf * idf)' due to
> data type mismatch: '(tf * idf)' requires numeric type, not vector;
>
> I could implement my own UDF to do member-wise multiplication; however,
> given how common TF-IDF is, I wonder if this code already exists somewhere.
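>
> Something like this untested sketch is what I have in mind (the UDF name
> "vectorMultiply" is my own; converting to dense arrays keeps it simple but
> is wasteful for large sparse vectors):
>
> import org.apache.spark.mllib.linalg.Vector;
> import org.apache.spark.mllib.linalg.VectorUDT;
> import org.apache.spark.mllib.linalg.Vectors;
> import org.apache.spark.sql.api.java.UDF2;
> import static org.apache.spark.sql.functions.callUDF;
>
> // "vectorMultiply" is a placeholder UDF name
> sqlContext.udf().register("vectorMultiply",
>         new UDF2<Vector, Vector, Vector>() {
>             @Override
>             public Vector call(Vector tf, Vector idf) {
>                 // member-wise multiply via dense copies
>                 double[] a = tf.toArray();
>                 double[] b = idf.toArray();
>                 double[] out = new double[a.length];
>                 for (int i = 0; i < a.length; i++) {
>                     out[i] = a[i] * b[i];
>                 }
>                 return Vectors.dense(out);
>             }
>         }, new VectorUDT());
>
> DataFrame ret = tmp.withColumn("features",
>         callUDF("vectorMultiply", tmp.col("tf"), tmp.col("idf")));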
>
> I found org.apache.spark.util.Vector.Multiplier. There is no
> documentation; however, given that the argument is a double, my guess is it
> just does scalar multiplication.
>
> I guess I could do something like
>
> double[] v = mySparkVector.toArray();
>
> and then use JBlas to do member-wise multiplication.
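>
> For example (untested; tfVector and idfVector are placeholder names for
> the two vectors from a single row):
>
> import org.jblas.DoubleMatrix;
> import org.apache.spark.mllib.linalg.Vector;
> import org.apache.spark.mllib.linalg.Vectors;
>
> // DoubleMatrix.mul() is element-wise; mmul() would be matrix multiplication
> DoubleMatrix product = new DoubleMatrix(tfVector.toArray())
>         .mul(new DoubleMatrix(idfVector.toArray()));
> Vector tfidf = Vectors.dense(product.toArray());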
>
> I assume sparse vectors are not distributed, so there would not be any
> additional communication cost.
>
>
> If this code is truly missing, I would be happy to write it and donate it.
>
> Andy
>
>
> From: Andrew Davidson <a...@santacruzintegration.com>
> Date: Wednesday, January 13, 2016 at 2:52 PM
> To: "user @spark" <user@spark.apache.org>
> Subject: trouble calculating TF-IDF data type mismatch: '(tf * idf)'
> requires numeric type, not vector;
>
> Below is a little snippet of my Java test code. Any idea how I implement
> member-wise vector multiplication?
>
> Kind regards
>
> Andy
>
> transformed df printSchema()
>
> root
>  |-- id: integer (nullable = false)
>  |-- label: double (nullable = false)
>  |-- words: array (nullable = false)
>  |    |-- element: string (containsNull = true)
>  |-- tf: vector (nullable = true)
>  |-- idf: vector (nullable = true)
>
>
> +---+-----+----------------------------+-------------------------+-------------------------------------------------------+
> |id |label|words                       |tf                       |idf                                                    |
> +---+-----+----------------------------+-------------------------+-------------------------------------------------------+
> |0  |0.0  |[Chinese, Beijing, Chinese] |(7,[1,2],[2.0,1.0])      |(7,[1,2],[0.0,0.9162907318741551])                     |
> |1  |0.0  |[Chinese, Chinese, Shanghai]|(7,[1,4],[2.0,1.0])      |(7,[1,4],[0.0,0.9162907318741551])                     |
> |2  |0.0  |[Chinese, Macao]            |(7,[1,6],[1.0,1.0])      |(7,[1,6],[0.0,0.9162907318741551])                     |
> |3  |1.0  |[Tokyo, Japan, Chinese]     |(7,[1,3,5],[1.0,1.0,1.0])|(7,[1,3,5],[0.0,0.9162907318741551,0.9162907318741551])|
> +---+-----+----------------------------+-------------------------+-------------------------------------------------------+
>
>     @Test
>     public void test() {
>         DataFrame rawTrainingDF = createTrainingData();
>         DataFrame trainingDF = runPipelineTF_IDF(rawTrainingDF);
>         . . .
>     }
>
>     private DataFrame runPipelineTF_IDF(DataFrame rawDF) {
>         HashingTF hashingTF = new HashingTF()
>                 .setInputCol("words")
>                 .setOutputCol("tf")
>                 .setNumFeatures(dictionarySize);
>
>         DataFrame termFrequencyDF = hashingTF.transform(rawDF);
>         termFrequencyDF.cache(); // IDF needs to make 2 passes over the data set
>
>         IDFModel idf = new IDF()
>                 //.setMinDocFreq(1) // our vocabulary has 6 words we hash into 7
>                 .setInputCol(hashingTF.getOutputCol())
>                 .setOutputCol("idf")
>                 .fit(termFrequencyDF);
>
>         DataFrame tmp = idf.transform(termFrequencyDF);
>
>         DataFrame ret = tmp.withColumn("features",
>                 tmp.col("tf").multiply(tmp.col("idf")));
>         logger.warn("\ntransformed df printSchema()");
>         ret.printSchema();
>         ret.show(false);
>
>         return ret;
>     }
>
>
> org.apache.spark.sql.AnalysisException: cannot resolve '(tf * idf)' due to
> data type mismatch: '(tf * idf)' requires numeric type, not vector;
>
>
>
>     private DataFrame createTrainingData() {
>         // make sure we only use dictionarySize words
>         JavaRDD<Row> rdd = javaSparkContext.parallelize(Arrays.asList(
>                 // 0 is Chinese
>                 // 1 is notChinese
>                 RowFactory.create(0, 0.0, Arrays.asList("Chinese", "Beijing", "Chinese")),
>                 RowFactory.create(1, 0.0, Arrays.asList("Chinese", "Chinese", "Shanghai")),
>                 RowFactory.create(2, 0.0, Arrays.asList("Chinese", "Macao")),
>                 RowFactory.create(3, 1.0, Arrays.asList("Tokyo", "Japan", "Chinese"))));
>
>         return createData(rdd);
>     }
>
>
>
>     private DataFrame createTestData() {
>         JavaRDD<Row> rdd = javaSparkContext.parallelize(Arrays.asList(
>                 // 0 is Chinese
>                 // 1 is notChinese
>                 // "bernoulli" requires label to be IntegerType
>                 RowFactory.create(4, 1.0, Arrays.asList("Chinese", "Chinese", "Chinese", "Tokyo", "Japan"))));
>
>         return createData(rdd);
>     }
>
>
