Latent Dirichlet Allocation in Spark
Hi I am trying to do topic modeling in Spark using Spark's LDA package. Using Spark 2.0.2 and pyspark API. I ran the code as below: *from pyspark.ml.clustering import LDA* *lda = LDA(featuresCol="tf_features",k=10, seed=1, optimizer="online")* *ldaModel=lda.fit(tf_df)* *lda_df=ldaModel.transform(tf_df)* I went through the docs to understand the output (the form of data) Spark generates for LDA. I understand the ldaModel.describeTopics() method. Gives topics with list of terms and weights. But I am not sure I understand the method ldamodel.topicsMatrix(). It gives me this: if the doc says it is the distribution of words for each topic (1184 words as rows, 10 topics as columns and the values of these cells. But then these values are not probabilities which is what one would expect for topic-word distribution. These have random values more than 1 (132.76, 3.00 and so on). Any jdea on this? Thanks ᐧ
Cosine Similarity Implementation in Spark
I have a data frame which has two columns (id, vector (tf-idf)). The first column signifies the Id of the document while the second column is a Vector(tf-idf) values. I want to use DIMSUM for cosine similarity but unfortunately I have Spark 1.x and looks like these methods are implemented only in Spark 2.x onwards and hence the corresponding cosineSimilarity method for RowMatrix is not there. So I thought maybe I can use the cosineSimilarity method of IndexedRowMatrix object as I see a corresponding cosine similarity method for IndexedRowMatrix docs. So here the couple of questions on the same. 1). So how do I first convert my spark data frame to IndexedRowMatrix format? 2) Does cosine similarity method in IndexedRowMatrix also uses DIMSUM as cosineSimilarity method of RowMatrix? 3). In RowMatrix, if I use Scala then I do have access to cosine similarity method there. However , it gives a matrix of similarities with no row indices (since RowMatrix is a index less matrix). So how do I infer the cosine similarity of each doc id with other from the output of RowMatrix? Please advise. Link to docs on IndexedRowMatrix. http://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.linalg.distributed.RowMatrix.columnSimilarities ᐧ
Re: Cosine Similarity of Word2Vec algo more than 1?
ok got that. I understand that the ordering won't change. I just wanted to make sure I am getting the right thing or I understand what I am getting since it didn't make sense going by the cosine calculation. One last confirmation and I appreciate all the time you are spending to reply: In the github link issue , it's mentioned fVector is not normalised. By fVector it's meant the word vector we want to find synonyms for?. So in my example, it would be vector for 'science' word which I passed to the method?. if yes, then I guess, solution should be simple. Just divide the current cosine output by the norm of this vector. And this vector we can get by doing model.transform('science') if I am right? Lastly, I would be very happy to update to docs if it is editable for all the things I encounter as not mentioned or not very clear. ᐧ On Thu, Dec 29, 2016 at 2:28 PM, Sean Owen wrote: > Yes, the vectors are not otherwise normalized. > You are basically getting the cosine similarity, but times the norm of the > word vector you supplied, because it's not divided through. You could just > divide the results yourself. > I don't think it will be back-ported because the the behavior was intended > in 1.x, just wrongly documented, and we don't want to change the behavior > in 1.x. The results are still correctly ordered anyway. > > On Thu, Dec 29, 2016 at 10:11 PM Manish Tripathi > wrote: > >> Sean, >> >> Thanks for answer. I am using Spark 1.6 so are you saying the output I am >> getting is cos(A,B)=dot(A,B)/norm(A) ? >> >> My point with respect to normalization was that if you normalise or don't >> normalize both vectors A,B, the output would be same. Since if I normalize >> A and B, then >> >> Cos(A,B)= dot(A,B)/norm(A)*norm(B). since norm=1 it is just dot(A,B). If >> we don't normalize it would have a norm in the denominator so output is >> same. >> >> But I understand you are saying in Spark 1.x, one vector was not >> normalized. If that is the case then it makes sense. >> >> Any idea how to fix this (get the right cosine similarity) in Spark 1.x? >> . If the input word in findSynonyms is not normalized while calculating >> cosine, then doing w2vmodel.transform(input_word) to get a vector >> representation and then diving the current result by the norm of this >> vector should be correct? >> >> Also, I am very open to editing the docs on things I find not properly >> documented or wrong, but I need to know if that is allowed (is it like a >> Wiki)?. >> ᐧ >> >> On Thu, Dec 29, 2016 at 1:59 PM, Sean Owen wrote: >> >> It should be the cosine similarity, yes. I think this is what was fixed >> in https://issues.apache.org/jira/browse/SPARK-7617 ; previously it was >> really just outputting the 'unnormalized' similarity (dot / norm(a) only) >> but the docs said cosine similarity. Now it's cosine similarity in Spark 2. >> The normalization most certainly matters here, and it's the opposite: >> dividing the dot by vec norms gives you the cosine. >> >> Although docs can always be better (and here was a case where it was >> wrong) all of this comes with javadoc and examples. Right now at least, >> .transform() describes the operation as you do, so it is documented. I'd >> propose you invest in improving the docs rather than saying 'this isn't >> what I expected'. >> >> (No, our book isn't a reference for MLlib, more like worked examples) >> >> On Thu, Dec 29, 2016 at 9:49 PM Manish Tripathi >> wrote: >> >> I used a word2vec algorithm of spark to compute documents vector of a >> text. >> >> I then used the findSynonyms function of the model object to get >> synonyms of few words. >> >> I see something like this: >> >> >> >> >> I do not understand why the cosine similarity is being calculated as more >> than 1. Cosine similarity should be between 0 and 1 or max -1 and +1 >> (taking negative angles). >> >> Why it is more than 1 here? What's going wrong here?. >> >> Please note, normalization of the vectors should not be changing the >> cosine similarity values since the formula remains the same. If you >> normalise it's just a dot product then, if you don't it's dot product/ >> (normA)*(normB). >> >> I am facing lot of issues with respect to understanding or interpreting >> the output of Spark's ml algos. The documentation is not very clear and >> there is hardly anything mentioned with respect to how and what is being >&
Re: Cosine Similarity of Word2Vec algo more than 1?
Sean, Thanks for answer. I am using Spark 1.6 so are you saying the output I am getting is cos(A,B)=dot(A,B)/norm(A) ? My point with respect to normalization was that if you normalise or don't normalize both vectors A,B, the output would be same. Since if I normalize A and B, then Cos(A,B)= dot(A,B)/norm(A)*norm(B). since norm=1 it is just dot(A,B). If we don't normalize it would have a norm in the denominator so output is same. But I understand you are saying in Spark 1.x, one vector was not normalized. If that is the case then it makes sense. Any idea how to fix this (get the right cosine similarity) in Spark 1.x? . If the input word in findSynonyms is not normalized while calculating cosine, then doing w2vmodel.transform(input_word) to get a vector representation and then diving the current result by the norm of this vector should be correct? Also, I am very open to editing the docs on things I find not properly documented or wrong, but I need to know if that is allowed (is it like a Wiki)?. ᐧ On Thu, Dec 29, 2016 at 1:59 PM, Sean Owen wrote: > It should be the cosine similarity, yes. I think this is what was fixed in > https://issues.apache.org/jira/browse/SPARK-7617 ; previously it was > really just outputting the 'unnormalized' similarity (dot / norm(a) only) > but the docs said cosine similarity. Now it's cosine similarity in Spark 2. > The normalization most certainly matters here, and it's the opposite: > dividing the dot by vec norms gives you the cosine. > > Although docs can always be better (and here was a case where it was > wrong) all of this comes with javadoc and examples. Right now at least, > .transform() describes the operation as you do, so it is documented. I'd > propose you invest in improving the docs rather than saying 'this isn't > what I expected'. > > (No, our book isn't a reference for MLlib, more like worked examples) > > On Thu, Dec 29, 2016 at 9:49 PM Manish Tripathi > wrote: > >> I used a word2vec algorithm of spark to compute documents vector of a >> text. >> >> I then used the findSynonyms function of the model object to get >> synonyms of few words. >> >> I see something like this: >> >> >> >> >> I do not understand why the cosine similarity is being calculated as more >> than 1. Cosine similarity should be between 0 and 1 or max -1 and +1 >> (taking negative angles). >> >> Why it is more than 1 here? What's going wrong here?. >> >> Please note, normalization of the vectors should not be changing the >> cosine similarity values since the formula remains the same. If you >> normalise it's just a dot product then, if you don't it's dot product/ >> (normA)*(normB). >> >> I am facing lot of issues with respect to understanding or interpreting >> the output of Spark's ml algos. The documentation is not very clear and >> there is hardly anything mentioned with respect to how and what is being >> returned. >> >> For ex. word2vec algorithm is to convert word to vector form. So I would >> expect .transform method would give me vector of each word in the text. >> >> However .transform basically returns doc2vec (averages all word vectors >> of a text). This is confusing since nothing of this is mentioned in the >> docs and I keep thinking why I have only one word vector instead of word >> vectors for all words. >> >> I do understand by returning doc2vec it is helpful since now one doesn't >> have to average out each word vector for the whole text. But the docs don't >> help or explicitly say that. >> >> This ends up wasting lot of time in just figuring out what is being >> returned from an algorithm from Spark. >> >> Does someone have a better solution for this? >> >> I have read the Spark book. That is not about Mllib. I am not sure if >> Sean's book would cover all the documentation aspect better than what we >> have currently on the docs page. >> >> Thanks >> >> >> >> ᐧ >> >
Cosine Similarity of Word2Vec algo more than 1?
I used a word2vec algorithm of spark to compute documents vector of a text. I then used the findSynonyms function of the model object to get synonyms of few words. I see something like this: I do not understand why the cosine similarity is being calculated as more than 1. Cosine similarity should be between 0 and 1 or max -1 and +1 (taking negative angles). Why it is more than 1 here? What's going wrong here?. Please note, normalization of the vectors should not be changing the cosine similarity values since the formula remains the same. If you normalise it's just a dot product then, if you don't it's dot product/ (normA)*(normB). I am facing lot of issues with respect to understanding or interpreting the output of Spark's ml algos. The documentation is not very clear and there is hardly anything mentioned with respect to how and what is being returned. For ex. word2vec algorithm is to convert word to vector form. So I would expect .transform method would give me vector of each word in the text. However .transform basically returns doc2vec (averages all word vectors of a text). This is confusing since nothing of this is mentioned in the docs and I keep thinking why I have only one word vector instead of word vectors for all words. I do understand by returning doc2vec it is helpful since now one doesn't have to average out each word vector for the whole text. But the docs don't help or explicitly say that. This ends up wasting lot of time in just figuring out what is being returned from an algorithm from Spark. Does someone have a better solution for this? I have read the Spark book. That is not about Mllib. I am not sure if Sean's book would cover all the documentation aspect better than what we have currently on the docs page. Thanks ᐧ
Re: Negative values of predictions in ALS.tranform
Thanks a bunch. That's very helpful. On Friday, December 16, 2016, Sean Owen wrote: > That all looks correct. > > On Thu, Dec 15, 2016 at 11:54 PM Manish Tripathi > wrote: > >> ok. Thanks. So here is what I understood. >> >> Input data to Als.fit(implicitPrefs=True) is the actual strengths (count >> data). So if I have a matrix of (user,item,views/purchases) I pass that as >> the input and not the binarized one (preference). This signifies the >> strength. >> >> 2) Since we also pass the alpha parameter to this Als.fit() method, Spark >> internally creates the confidence matrix +1+alpha*input_data or some other >> alpha factor. >> >> 3). The output which it gives is basically a factorization of 0/1 matrix >> (binarized matrix from initial input data), hence the output also resembles >> the preference matrix (0/1) suggesting the interaction. So typically it >> should be between 0-1but if it is negative it means very less >> preference/interaction >> >> *Does all the above sound correct?.* >> >> If yes, then one last question- >> >> 1). *For explicit dataset where we don't use implicitPref=True,* the >> predicted ratings would be actual ratings like it can be 2.3,4.5 etc and >> not the interaction measure. That is because in explicit we are not using >> the confidence matrix and preference matrix concept and use the actual >> rating data. So any output from Spark ALS for explicit data would be a >> rating prediction. >> ᐧ >> >> On Thu, Dec 15, 2016 at 3:46 PM, Sean Owen > > wrote: >> >>> No, input are weights or strengths. The output is a factorization of the >>> binarization of that to 0/1, not probs or a factorization of the input. >>> This explains the range of the output. >>> >>> >>> On Thu, Dec 15, 2016, 23:43 Manish Tripathi >> > wrote: >>> >>>> when you say *implicit ALS *is* factoring the 0/1 matrix. , are you >>>> saying for implicit feedback algorithm we need to pass the input data as >>>> the preference matrix i.e a matrix of 0 and 1?. * >>>> >>>> Then how will they calculate the confidence matrix which is basically >>>> =1+alpha*count matrix. If we don't pass the actual count of values (views >>>> etc) then how does Spark calculates the confidence matrix?. >>>> >>>> I was of the understanding that input data for >>>> als.fit(implicitPref=True) is the actual count matrix of the >>>> views/purchases?. Am I going wrong here if yes, then how is Spark >>>> calculating the confidence matrix if it doesn't have the actual count data. >>>> >>>> The original paper on which Spark algo is based needs the actual count >>>> data to create a confidence matrix and also needs the 0/1 matrix since the >>>> objective functions uses both the confidence matrix and 0/1 matrix to find >>>> the user and item factors. >>>> ᐧ >>>> >>>> On Thu, Dec 15, 2016 at 3:38 PM, Sean Owen >>> > wrote: >>>> >>>>> No, you can't interpret the output as probabilities at all. In >>>>> particular they may be negative. It is not predicting rating but >>>>> interaction. Negative means very strongly not predicted to interact. No, >>>>> implicit ALS *is* factoring the 0/1 matrix. >>>>> >>>>> On Thu, Dec 15, 2016, 23:31 Manish Tripathi >>>> > wrote: >>>>> >>>>>> Ok. So we can kind of interpret the output as probabilities even >>>>>> though it is not modeling probabilities. This is to be able to use it for >>>>>> binaryclassification evaluator. >>>>>> >>>>>> So the way I understand is and as per the algo, the predicted matrix >>>>>> is basically a dot product of user factor and item factor matrix. >>>>>> >>>>>> but in what circumstances the ratings predicted can be negative. I >>>>>> can understand if the individual user factor vector and item factor >>>>>> vector >>>>>> is having negative factor terms, then it can be negative. But practically >>>>>> does negative make any sense? AS per algorithm the dot product is the >>>>>> predicted rating. So rating shouldnt be negative for it to make any >>>>>> sense. >>>>>> Also rating just between 0-1 is normalised rating? Typically rating we >>>&g
Re: Negative values of predictions in ALS.tranform
ok. Thanks. So here is what I understood. Input data to Als.fit(implicitPrefs=True) is the actual strengths (count data). So if I have a matrix of (user,item,views/purchases) I pass that as the input and not the binarized one (preference). This signifies the strength. 2) Since we also pass the alpha parameter to this Als.fit() method, Spark internally creates the confidence matrix +1+alpha*input_data or some other alpha factor. 3). The output which it gives is basically a factorization of 0/1 matrix (binarized matrix from initial input data), hence the output also resembles the preference matrix (0/1) suggesting the interaction. So typically it should be between 0-1but if it is negative it means very less preference/interaction *Does all the above sound correct?.* If yes, then one last question- 1). *For explicit dataset where we don't use implicitPref=True,* the predicted ratings would be actual ratings like it can be 2.3,4.5 etc and not the interaction measure. That is because in explicit we are not using the confidence matrix and preference matrix concept and use the actual rating data. So any output from Spark ALS for explicit data would be a rating prediction. ᐧ On Thu, Dec 15, 2016 at 3:46 PM, Sean Owen wrote: > No, input are weights or strengths. The output is a factorization of the > binarization of that to 0/1, not probs or a factorization of the input. > This explains the range of the output. > > > On Thu, Dec 15, 2016, 23:43 Manish Tripathi wrote: > >> when you say *implicit ALS *is* factoring the 0/1 matrix. , are you >> saying for implicit feedback algorithm we need to pass the input data as >> the preference matrix i.e a matrix of 0 and 1?. * >> >> Then how will they calculate the confidence matrix which is basically >> =1+alpha*count matrix. If we don't pass the actual count of values (views >> etc) then how does Spark calculates the confidence matrix?. >> >> I was of the understanding that input data for als.fit(implicitPref=True) >> is the actual count matrix of the views/purchases?. Am I going wrong here >> if yes, then how is Spark calculating the confidence matrix if it doesn't >> have the actual count data. >> >> The original paper on which Spark algo is based needs the actual count >> data to create a confidence matrix and also needs the 0/1 matrix since the >> objective functions uses both the confidence matrix and 0/1 matrix to find >> the user and item factors. >> ᐧ >> >> On Thu, Dec 15, 2016 at 3:38 PM, Sean Owen wrote: >> >> No, you can't interpret the output as probabilities at all. In particular >> they may be negative. It is not predicting rating but interaction. Negative >> means very strongly not predicted to interact. No, implicit ALS *is* >> factoring the 0/1 matrix. >> >> On Thu, Dec 15, 2016, 23:31 Manish Tripathi wrote: >> >> Ok. So we can kind of interpret the output as probabilities even though >> it is not modeling probabilities. This is to be able to use it for >> binaryclassification evaluator. >> >> So the way I understand is and as per the algo, the predicted matrix is >> basically a dot product of user factor and item factor matrix. >> >> but in what circumstances the ratings predicted can be negative. I can >> understand if the individual user factor vector and item factor vector is >> having negative factor terms, then it can be negative. But practically does >> negative make any sense? AS per algorithm the dot product is the predicted >> rating. So rating shouldnt be negative for it to make any sense. Also >> rating just between 0-1 is normalised rating? Typically rating we expect to >> be like any real value 2.3,4.5 etc. >> >> Also please note, for implicit feedback ALS, we don't feed 0/1 matrix. We >> feed the count matrix (discrete count values) and am assuming spark >> internally converts it into a preference matrix (1/0) and a confidence >> matrix =1+alpha*count_matrix >> >> >> >> >> ᐧ >> >> On Thu, Dec 15, 2016 at 2:56 PM, Sean Owen wrote: >> >> No, ALS is not modeling probabilities. The outputs are reconstructions of >> a 0/1 matrix. Most values will be in [0,1], but, it's possible to get >> values outside that range. >> >> On Thu, Dec 15, 2016 at 10:21 PM Manish Tripathi >> wrote: >> >> Hi >> >> ran the ALS model for implicit feedback thing. Then I used the .transform >> method of the model to predict the ratings for the original dataset. My >> dataset is of the form (user,item,rating) >> >> I see something like below: >> >> predictions.show(5,truncate=Fal
Re: Negative values of predictions in ALS.tranform
when you say *implicit ALS *is* factoring the 0/1 matrix. , are you saying for implicit feedback algorithm we need to pass the input data as the preference matrix i.e a matrix of 0 and 1?. * Then how will they calculate the confidence matrix which is basically =1+alpha*count matrix. If we don't pass the actual count of values (views etc) then how does Spark calculates the confidence matrix?. I was of the understanding that input data for als.fit(implicitPref=True) is the actual count matrix of the views/purchases?. Am I going wrong here if yes, then how is Spark calculating the confidence matrix if it doesn't have the actual count data. The original paper on which Spark algo is based needs the actual count data to create a confidence matrix and also needs the 0/1 matrix since the objective functions uses both the confidence matrix and 0/1 matrix to find the user and item factors. ᐧ On Thu, Dec 15, 2016 at 3:38 PM, Sean Owen wrote: > No, you can't interpret the output as probabilities at all. In particular > they may be negative. It is not predicting rating but interaction. Negative > means very strongly not predicted to interact. No, implicit ALS *is* > factoring the 0/1 matrix. > > On Thu, Dec 15, 2016, 23:31 Manish Tripathi wrote: > >> Ok. So we can kind of interpret the output as probabilities even though >> it is not modeling probabilities. This is to be able to use it for >> binaryclassification evaluator. >> >> So the way I understand is and as per the algo, the predicted matrix is >> basically a dot product of user factor and item factor matrix. >> >> but in what circumstances the ratings predicted can be negative. I can >> understand if the individual user factor vector and item factor vector is >> having negative factor terms, then it can be negative. But practically does >> negative make any sense? AS per algorithm the dot product is the predicted >> rating. So rating shouldnt be negative for it to make any sense. Also >> rating just between 0-1 is normalised rating? Typically rating we expect to >> be like any real value 2.3,4.5 etc. >> >> Also please note, for implicit feedback ALS, we don't feed 0/1 matrix. We >> feed the count matrix (discrete count values) and am assuming spark >> internally converts it into a preference matrix (1/0) and a confidence >> matrix =1+alpha*count_matrix >> >> >> >> >> ᐧ >> >> On Thu, Dec 15, 2016 at 2:56 PM, Sean Owen wrote: >> >> No, ALS is not modeling probabilities. The outputs are reconstructions of >> a 0/1 matrix. Most values will be in [0,1], but, it's possible to get >> values outside that range. >> >> On Thu, Dec 15, 2016 at 10:21 PM Manish Tripathi >> wrote: >> >> Hi >> >> ran the ALS model for implicit feedback thing. Then I used the .transform >> method of the model to predict the ratings for the original dataset. My >> dataset is of the form (user,item,rating) >> >> I see something like below: >> >> predictions.show(5,truncate=False) >> >> >> Why is the last prediction value negative ?. Isn't the transform method >> giving the prediction(probability) of seeing the rating as 1?. I had counts >> data for rating (implicit feedback) and for validation dataset I binarized >> the rating (1 if >0 else 0). My training data has rating positive (it's >> basically the count of views to a video). >> >> I used following to train: >> >> * als = ALS(rank=x, maxIter=15, regParam=y, >> implicitPrefs=True,alpha=40.0)* >> >> *model=als.fit(self.train)* >> >> What does negative prediction mean here and is it ok to have that? >> ᐧ >> >> >>
Re: Negative values of predictions in ALS.tranform
Ok. So we can kind of interpret the output as probabilities even though it is not modeling probabilities. This is to be able to use it for binaryclassification evaluator. So the way I understand is and as per the algo, the predicted matrix is basically a dot product of user factor and item factor matrix. but in what circumstances the ratings predicted can be negative. I can understand if the individual user factor vector and item factor vector is having negative factor terms, then it can be negative. But practically does negative make any sense? AS per algorithm the dot product is the predicted rating. So rating shouldnt be negative for it to make any sense. Also rating just between 0-1 is normalised rating? Typically rating we expect to be like any real value 2.3,4.5 etc. Also please note, for implicit feedback ALS, we don't feed 0/1 matrix. We feed the count matrix (discrete count values) and am assuming spark internally converts it into a preference matrix (1/0) and a confidence matrix =1+alpha*count_matrix ᐧ On Thu, Dec 15, 2016 at 2:56 PM, Sean Owen wrote: > No, ALS is not modeling probabilities. The outputs are reconstructions of > a 0/1 matrix. Most values will be in [0,1], but, it's possible to get > values outside that range. > > On Thu, Dec 15, 2016 at 10:21 PM Manish Tripathi > wrote: > >> Hi >> >> ran the ALS model for implicit feedback thing. Then I used the .transform >> method of the model to predict the ratings for the original dataset. My >> dataset is of the form (user,item,rating) >> >> I see something like below: >> >> predictions.show(5,truncate=False) >> >> >> Why is the last prediction value negative ?. Isn't the transform method >> giving the prediction(probability) of seeing the rating as 1?. I had counts >> data for rating (implicit feedback) and for validation dataset I binarized >> the rating (1 if >0 else 0). My training data has rating positive (it's >> basically the count of views to a video). >> >> I used following to train: >> >> * als = ALS(rank=x, maxIter=15, regParam=y, >> implicitPrefs=True,alpha=40.0)* >> >> *model=als.fit(self.train)* >> >> What does negative prediction mean here and is it ok to have that? >> ᐧ >> >
Negative values of predictions in ALS.tranform
Hi ran the ALS model for implicit feedback thing. Then I used the .transform method of the model to predict the ratings for the original dataset. My dataset is of the form (user,item,rating) I see something like below: predictions.show(5,truncate=False) Why is the last prediction value negative ?. Isn't the transform method giving the prediction(probability) of seeing the rating as 1?. I had counts data for rating (implicit feedback) and for validation dataset I binarized the rating (1 if >0 else 0). My training data has rating positive (it's basically the count of views to a video). I used following to train: * als = ALS(rank=x, maxIter=15, regParam=y, implicitPrefs=True,alpha=40.0)* *model=als.fit(self.train)* What does negative prediction mean here and is it ok to have that? ᐧ
Spark Float to VectorUDT for ML evaluator lib
Hi I am trying to run the ML Binary Evaluation Classifier metrics to compare the rating with predicted values and get the AreaROC. My dataframe has two columns with rating as int (I have binarized it) and predicitions which is a float. When I pass it to the ML evaluator method I get an error as shown below: Can someone help me with gettng this sorted out?. Appreciate all the help Stackoverflow post: http://stackoverflow.com/questions/40408898/converting-the-float-column-in-spark-dataframe-to-vectorudt I was trying to use the pyspark.ml.evaluation Binary classification metric like below evaluator = BinaryClassificationEvaluator(rawPredictionCol="prediction") print evaluator.evaluate(predictions) My Predictions data frame looks like this: predictions.select('rating','prediction') predictions.show() +--++ |rating| prediction| +--++ | 1| 0.14829934| | 1|-0.017862909| | 1| 0.4951505| | 1|0.0074382657| | 1|-0.002562912| | 1| 0.0208337| | 1| 0.049362548| | 1| 0.0969| | 1| 0.17998546| | 1| 0.019649783| | 1| 0.031353004| | 1| 0.03657037| | 1| 0.23280995| | 1| 0.033190556| | 1| 0.35569906| | 1| 0.030974165| | 1| 0.1422375| | 1| 0.19786166| | 1| 0.07740938| | 1| 0.33970386| +--++ only showing top 20 rows The datatype of each column is as follows: predictions.printSchema() root |-- rating: integer (nullable = true) |-- prediction: float (nullable = true) Now I get an error with above Ml code saying prediction column is Float and expected a VectorUDT. /Users/i854319/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py in __call__(self, *args) 811 answer = self.gateway_client.send_command(command) 812 return_value = get_return_value( --> 813 answer, self.gateway_client, self.target_id, self.name) 814 815 for temp_arg in temp_args: /Users/i854319/spark/python/pyspark/sql/utils.pyc in deco(*a, **kw) 51 raise AnalysisException(s.split(': ', 1)[1], stackTrace) 52 if s.startswith('java.lang.IllegalArgumentException: '): ---> 53 raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace) 54 raise 55 return deco IllegalArgumentException: u'requirement failed: Column prediction must be of type org.apache.spark.mllib.linalg.VectorUDT@f71b0bce but was actually FloatType.' So I thought of converting the predictions column from float to VectorUDT as below: *Applying the schema to the dataframe to convert the float column type to VectorUDT* from pyspark.sql.types import IntegerType, StructType,StructField schema = StructType([ StructField("rating", IntegerType, True), StructField("prediction", VectorUDT(), True) ]) predictions_dtype=sqlContext.createDataFrame(prediction,schema) But Now I get this error. --- AssertionErrorTraceback (most recent call last) in () 4 5 schema = StructType([ > 6 StructField("rating", IntegerType, True), 7 StructField("prediction", VectorUDT(), True) 8 ]) /Users/i854319/spark/python/pyspark/sql/types.pyc in __init__(self, name, dataType, nullable, metadata) 401 False 402 """ --> 403 assert isinstance(dataType, DataType), "dataType should be DataType" 404 if not isinstance(name, str): 405 name = name.encode('utf-8') AssertionError: dataType should be DataType It takes so much time to run an ml algo in spark libraries with so many weird errors. Even I tried Mllib with RDD data. That is giving the ValueError: Null pointer exception. ᐧ