How do I specify the data types in a DF?
Hello; I have a simple file of mobile IDFAs. They look like:

    gregconv = ['00013FEE-7561-47F3-95BC-CA18D20BCF78',
                '000D9B97-2B54-4B80-AAA1-C1CB42CFBF3A',
                '000F9E1F-BC7E-47E1-BF68-C68F6D987B96']

I am trying to make this RDD into a DataFrame:

    ConvRecord = Row("IDFA")
    gregconvdf = gregconv.map(lambda x: ConvRecord(*x)).toDF()

I get the following error:

    Traceback (most recent call last):
      File "", line 1, in
      File "/homes/afarahat/aofspark/share/spark/python/pyspark/sql/context.py", line 60, in toDF
        return sqlContext.createDataFrame(self, schema, sampleRatio)
      File "/homes/afarahat/aofspark/share/spark/python/pyspark/sql/context.py", line 351, in createDataFrame
        _verify_type(row, schema)
      File "/homes/afarahat/aofspark/share/spark/python/pyspark/sql/types.py", line 1027, in _verify_type
        "length of fields (%d)" % (len(obj), len(dataType.fields)))
    ValueError: Length of object (36) does not match with length of fields (1)

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-do-i-specify-the-data-types-in-a-DF-tp24090.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
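[Editor's note: a minimal stand-in sketch, in plain Python rather than pyspark.sql.Row itself, of why this error appears. Row("IDFA") defines a single-field record, but ConvRecord(*x) unpacks the 36-character IDFA string into 36 positional arguments, one per character -- hence "Length of object (36) does not match with length of fields (1)". Passing the whole string as one argument matches the one-field schema.]

```python
def conv_record(*fields):
    # mimics the schema check on a one-field Row("IDFA")
    if len(fields) != 1:
        raise ValueError("Length of object (%d) does not match with "
                         "length of fields (1)" % len(fields))
    return {"IDFA": fields[0]}

idfa = '00013FEE-7561-47F3-95BC-CA18D20BCF78'

try:
    conv_record(*idfa)        # the failing call: one argument per character
except ValueError as err:
    print(err)

rec = conv_record(idfa)       # the fix: the whole string as a single field
print(rec["IDFA"])
```

In PySpark terms the likely fix is `gregconv.map(lambda x: ConvRecord(x)).toDF()` -- pass `x` as a single argument instead of unpacking it with `*x`.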
How can the RegressionMetrics produce negative R2 and explained variance?
Hello; I am using the ALS recommender in MLlib. To select the optimal rank, I hold out a number of users who rated multiple items as my test set. I then get the predictions for these users, compare them to the observed ratings, and use RegressionMetrics to estimate the R^2. I keep getting a negative value:

    r2 = -1.18966999676   explained var = -1.18955347415   count = 11620309

Here is my PySpark code:

    train1.cache()
    test1.cache()
    numIterations = 10
    for i in range(10):
        rank = int(40 + i * 10)
        als = ALS(rank=rank, maxIter=numIterations, implicitPrefs=False)
        model = als.fit(train1)
        predobs = model.transform(test1).select("prediction", "rating") \
                       .map(lambda p: (p.prediction, p.rating)) \
                       .filter(lambda p: not math.isnan(p[0]))
        metrics = RegressionMetrics(predobs)
        mycount = predobs.count()
        myr2 = metrics.r2
        myvar = metrics.explainedVariance
        print "hooo", rank, " r2 = ", myr2, "explained var = ", myvar, "count = ", mycount

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-can-the-RegressionMetrics-produce-negative-R2-and-explained-variance-tp23779.html
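[Editor's note: R^2 in the 1 - SS_res/SS_tot sense is negative whenever the predictions fit the test data worse than simply predicting the mean observation -- which can genuinely happen when test users or items are poorly covered at training time. A small self-contained check, in plain Python rather than Spark, of that arithmetic:]

```python
def r2(pairs):
    """R^2 = 1 - SS_res/SS_tot over (prediction, observation) pairs."""
    obs = [o for _, o in pairs]
    mean = sum(obs) / len(obs)
    ss_res = sum((p - o) ** 2 for p, o in pairs)
    ss_tot = sum((o - mean) ** 2 for o in obs)
    return 1.0 - ss_res / ss_tot

good = [(1.1, 1.0), (2.0, 2.1), (2.9, 3.0)]   # predictions close to observed
bad  = [(3.0, 1.0), (1.0, 2.1), (5.0, 3.0)]   # worse than predicting the mean

print(r2(good))   # near 1
print(r2(bad))    # negative
```

So a negative r2 from RegressionMetrics is not necessarily a bug: it indicates the chosen rank generalizes worse on the held-out users than a constant-mean predictor.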
ALS: how to set numUserBlocks and numItemBlocks
Any guidance on how to set these two? I have far more users (100s of millions) than items. Thanks, Ayman

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/ALS-how-to-set-numUserBlocks-and-numItemBlocks-tp23503.html
HiveContext /Spark much slower than Hive
I have a simple HQL query (below). In Hive it takes maybe 10 minutes to complete; when I run it through Spark it seems to take forever. The table is partitioned by "datestamp". I am using Spark 1.3.1. How can I tune/optimize this? Here is the query:

    tumblruser = hiveCtx.sql("select s_mobile_id, receive_time "
                             "from mx3.post_tp_annotated_mb_impr "
                             "where ad_id = 30590918987 and datestamp >= '20150623'")

Thanks, Ayman

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/HiveContext-Spark-much-slower-than-Hive-tp23480.html
How to get the ALS reconstruction error
Hello; I am fitting ALS models and would like to get an initial idea of the number of factors. I want to use the reconstruction error on the training data as a measure. Does the API expose the reconstruction error? Thanks, Ayman

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-get-the-ALS-reconstruction-error-tp23416.html
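[Editor's note: the MLlib API does not expose a reconstruction-error field on the fitted model; one workaround is to recompute it from the learned factors. A plain-Python sketch of that sum -- toy factor vectors here; with a real MatrixFactorizationModel the vectors would come from model.userFeatures() and model.productFeatures(), and the sum would run as a map/reduce over the training ratings:]

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# toy learned factors (rank 2) and observed training ratings
user_factors = {0: [1.0, 0.0], 1: [0.5, 0.5]}
item_factors = {10: [2.0, 1.0], 11: [0.0, 2.0]}
ratings = [(0, 10, 2.5), (0, 11, 0.0), (1, 10, 1.5), (1, 11, 1.0)]

# reconstruction error = sum of squared (rating - u . v) over observed cells
recon_error = sum((r - dot(user_factors[u], item_factors[i])) ** 2
                  for u, i, r in ratings)
print(recon_error)
```

Note this only sums over the observed entries, which matches how explicit-feedback ALS defines its training loss (ignoring the regularization term).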
Un-persist RDD in a loop
Hello; I am trying to find the optimal number of factors in ALS. To that end, I am scanning various values of the rank and evaluating the MSE. Do I need to un-persist the RDDs between loop iterations, or will the resources (memory) get automatically released and re-assigned between iterations?

    for i in range(5):
        rank = 5 + int(i)
        # imodel = ALS.trainImplicit(smallratings, rank, numIterations)
        imodel = ALS.train(smallratings, rank, numIterations)
        predictions = imodel.predictAll(testdata).map(lambda r: ((r[0], r[1]), r[2]))
        ratesAndPreds = smallratings.map(lambda r: ((r[0], r[1]), r[2])).join(predictions)
        MSE = ratesAndPreds.map(lambda r: (r[1][0] - r[1][1]) ** 2) \
                           .reduce(lambda x, y: x + y) / ratesAndPreds.count()
        print "ho ho ", rank, " ", MSE
        predictions.unpersist()
        ratesAndPreds.unpersist()

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Un-persist-RDD-in-a-loop-tp23414.html
Matrix Multiplication and mllib.recommendation
Hello; I am trying to get predictions after running the ALS model. The model trains fine. For the prediction/recommendation step I have about 30,000 products and 90 million users. When I try predictAll it fails. I have been trying to formulate the problem as a matrix multiplication, where I first get the product features, broadcast them, and then do a dot product. It is still very slow. Any idea why? Here is a sample of the code:

    def doMultiply(x):
        a = []
        # multiply by each broadcast product-feature vector
        mylen = len(pf.value)
        for i in range(mylen):
            myprod = numpy.dot(x, pf.value[i][1])
            a.append(myprod)
        return a

    myModel = MatrixFactorizationModel.load(sc, "FlurryModelPath")
    # I need to select which products to broadcast, but let's try all
    m1 = myModel.productFeatures().sample(False, 0.001)
    pf = sc.broadcast(m1.collect())
    uf = myModel.userFeatures()
    f1 = uf.map(lambda x: (x[0], doMultiply(x[1])))

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Matrix-Multiplication-and-mllib-recommendation-tp23384.html
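[Editor's note: one likely reason for the slowness is the per-product Python loop inside doMultiply, which runs 30,000 numpy.dot calls per user. Stacking the broadcast product factors into a single NumPy matrix once turns each user's scoring into one vectorized matrix-vector product. A sketch with toy data -- in the original code the matrix would be built from pf.value before broadcasting:]

```python
import numpy

# toy stand-in for the broadcast product features: list of (id, factor vector)
product_features = [(101, [1.0, 0.0]), (102, [0.0, 1.0]), (103, [1.0, 1.0])]

# build the (num_products x rank) matrix ONCE, not inside the per-user loop
P = numpy.array([vec for _, vec in product_features])

def do_multiply(user_vec):
    # one matrix-vector product replaces the 30,000-iteration Python loop
    return P.dot(numpy.array(user_vec))

scores = do_multiply([2.0, 3.0])
print(scores)   # one score per product, in product_features order
```

The per-user work drops from Python-level iteration to a single BLAS call, which is typically orders of magnitude faster for this shape of problem.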
PySpark Dense Matrix Multiply: One of them can fit in Memory
Hello; I would like to multiply two matrices, C = A * B, where A is m x k and B is k x l, with k and l small enough relative to m that B can easily fit in memory. Any ideas or suggestions on how to do that in PySpark? Thanks, Ayman

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Pyspark-Dense-Matrix-Multiply-One-of-them-can-fit-in-Memory-tp23344.html
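[Editor's note: since B fits in memory, the standard approach is to broadcast B and multiply row-blocks of A against it in parallel. A plain-NumPy sketch of the per-partition work -- in PySpark, B would be an sc.broadcast value and the chunks would be the partitions of an RDD holding A's rows:]

```python
import numpy

m, k, l = 6, 3, 2
A = numpy.arange(m * k, dtype=float).reshape(m, k)
B = numpy.arange(k * l, dtype=float).reshape(k, l)   # small: fits in memory

def multiply_partition(rows, B_local):
    # each worker multiplies its block of A's rows by the broadcast B
    return numpy.array(rows).dot(B_local)

# emulate two partitions of A's rows, then stack the partial results
chunks = [A[:3], A[3:]]
C = numpy.vstack([multiply_partition(c, B) for c in chunks])

print(C.shape)   # (m, l): same result as the full product A.dot(B)
```

Because each row-block of C depends only on that block of A plus the broadcast B, no shuffle is needed; in PySpark this would be a mapPartitions over A's rows.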
ALS predictAll not completing
Hello; I have a data set of about 80 million users and 12,000 items (very sparse). I can get the training part working with no problem (the model has 20 factors). However, when I try predictAll for 80 million users x 10 items, the job does not complete. When I use a smaller data set, say 500k or a million users, it completes. Any ideas or suggestions? Thanks, Ayman

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/ALS-predictALL-not-completing-tp23327.html
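[Editor's note: one workaround worth trying -- an assumption on my part, not a suggestion from this thread -- is to split the 80M-user x 10-item candidate set into batches and call predictAll once per batch, so no single job has to materialize all ~800M predictions at once. A plain-Python sketch of the batching helper; with Spark, each batch would be parallelized into an RDD and fed to model.predictAll:]

```python
def batches(pairs, size):
    """Yield successive fixed-size batches of (user, item) pairs."""
    for start in range(0, len(pairs), size):
        yield pairs[start:start + size]

# toy pairs standing in for the 80M-user x 10-item candidate set
pairs = [(u, i) for u in range(7) for i in range(3)]   # 21 pairs
out = list(batches(pairs, 5))
print(len(out), len(out[-1]))   # 5 batches; the last holds the 1 leftover pair
```

Batching by user also keeps each intermediate result small enough to write out or checkpoint before moving on, which helps isolate whether the failure is memory pressure or a single pathological partition.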