How do I specify the data types in a DF

2015-07-30 Thread afarahat
: ConvRecord(*x)).toDF() I get the following error: Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/homes/afarahat/aofspark/share/spark/python/pyspark/sql/context.py", line 60, in toDF: return sqlContext.createDataFrame(self, schema, sampleRatio) File /homes/afarahat

How can the RegressionMetrics produce negative R2 and explained variance?

2015-07-12 Thread afarahat
Hello; I am using the ALS recommendation in MLlib. To select the optimal rank, I use a number of users who used multiple items as my test set. I then get the predictions for these users and compare them to the observed values. I use RegressionMetrics to estimate the R^2, but I keep getting a negative value.
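A negative R^2 is not a bug: R^2 = 1 - SS_res/SS_tot, so it goes below zero whenever the model's predictions are worse than simply predicting the mean of the observed values, which can easily happen when scoring held-out users. A pure-Python illustration of the same formula (not the MLlib internals):

```python
# R^2 = 1 - SS_res/SS_tot; it is negative whenever SS_res exceeds SS_tot,
# i.e. whenever the predictions are worse than the constant-mean baseline.
def r_squared(observed, predicted):
    mean = sum(observed) / len(observed)
    ss_tot = sum((y - mean) ** 2 for y in observed)
    ss_res = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    return 1.0 - ss_res / ss_tot

observed  = [1.0, 2.0, 3.0, 4.0]
good_pred = [1.1, 1.9, 3.2, 3.8]   # close to observed -> R^2 near 1
bad_pred  = [4.0, 4.0, 1.0, 1.0]   # anti-correlated -> R^2 below 0

print(r_squared(observed, good_pred))  # 0.98
print(r_squared(observed, bad_pred))   # -4.2
```

So a persistently negative R^2 on the test users suggests the ALS predictions for them are worse than the mean baseline, not that RegressionMetrics is miscomputing.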

ALS :how to set numUserBlocks and numItemBlocks

2015-06-25 Thread afarahat
Any guidance on how to set these 2? I have way more users (100s of millions) than items. Thanks Ayman -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/ALS-how-to-set-numUserBlocks-and-numItemBlocks-tp23503.html Sent from the Apache Spark User List mailing list
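By default ALS picks the block counts automatically (blocks = -1 in the mllib API), which is a reasonable starting point. When tuning by hand, one hypothetical heuristic (my assumption, not an official Spark rule) is to size blocks so each holds a bounded number of entities, which naturally gives many user blocks and few item blocks for this skewed case:

```python
import math

# Hypothetical heuristic, not an official Spark rule: choose enough blocks
# that each block covers at most `target_per_block` entities. Spark's
# default (-1) lets ALS choose automatically.
def suggest_blocks(n_entities, target_per_block=5_000_000):
    return max(1, math.ceil(n_entities / target_per_block))

num_user_blocks = suggest_blocks(300_000_000)  # 100s of millions of users
num_item_blocks = suggest_blocks(30_000)       # far fewer items

print(num_user_blocks, num_item_blocks)  # 60 1
```

The resulting values would then be passed as numUserBlocks/numItemBlocks (ml API) or blocks (mllib API).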

HiveContext /Spark much slower than Hive

2015-06-24 Thread afarahat
I have a simple HQL query (below). In Hive it takes maybe 10 minutes to complete; when I run it with Spark it seems to take forever. The table is partitioned by datestamp. I am using Spark 1.3.1. How can I tune/optimize this? Here is the query: tumblruser=hiveCtx.sql( select s_mobile_id, receive_time
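One common cause of this slowdown is scanning every partition. Since the table is partitioned by datestamp, putting the partition column in the WHERE clause lets Spark prune partitions instead of reading the whole table. A sketch (the table name and date value below are hypothetical; the column names are from the question):

```python
# Sketch: filter on the partition column (datestamp) inside the SQL so
# Spark can prune partitions. Table name and date are hypothetical.
query = """
    SELECT s_mobile_id, receive_time
    FROM some_partitioned_table
    WHERE datestamp = '2015-06-24'
"""

# With a live HiveContext:
# tumblruser = hiveCtx.sql(query)
print("datestamp" in query)
```

Without the partition predicate, a 10-minute Hive job can turn into a full-table scan in Spark.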

How to get the ALS reconstruction error

2015-06-20 Thread afarahat
Hello; I am fitting ALS models and would like to get an initial idea of the number of factors. I want to use the reconstruction error on the training data as a measure. Does the API expose the reconstruction error? Thanks Ayman
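The model does not expose a reconstruction-error attribute directly, but it can be computed from the learned factors by re-predicting the observed ratings. A NumPy sketch of the computation, with random factor matrices as stand-ins for model.userFeatures() / model.productFeatures():

```python
import numpy as np

# Sketch: reconstruction error from ALS factors. U and V here are NumPy
# stand-ins for the fitted user/item factor matrices; R plays the role of
# the observed training ratings.
rng = np.random.default_rng(0)
n_users, n_items, rank = 50, 20, 5
U = rng.normal(size=(n_users, rank))   # user factors
V = rng.normal(size=(n_items, rank))   # item factors
R = U @ V.T + 0.01 * rng.normal(size=(n_users, n_items))  # noisy "ratings"

pred = U @ V.T                         # reconstruct from the factors
rmse = np.sqrt(np.mean((R - pred) ** 2))
print(rmse)  # small here, since R was generated from the same factors
```

In actual MLlib code the equivalent is to run the model's predictions over the training ratings and feed (prediction, observed) pairs into an RMSE computation.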

Un-persist RDD in a loop

2015-06-19 Thread afarahat
Hello; I am trying to find the optimal number of factors in ALS. To that end, I am scanning various values and evaluating the RSE. Do I need to un-persist the RDD between loop iterations, or will the resources (memory) get automatically released and re-assigned between iterations? for i in range(5):
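Spark will eventually evict cached RDDs under memory pressure and clean up unreferenced ones, but that is lazy and unpredictable; calling unpersist() at the end of each iteration keeps memory use bounded. A pure-Python stand-in for the pattern (a real loop would call rdd.persist()/rdd.unpersist() on actual RDDs):

```python
# Pure-Python stand-in illustrating the unpersist pattern: cached datasets
# stay resident until explicitly released (or evicted), so release each
# one at the end of its loop iteration rather than letting them pile up.
class FakeRDD:
    cached = set()                       # stand-in for Spark's block manager
    def __init__(self, name): self.name = name
    def persist(self):   FakeRDD.cached.add(self.name);     return self
    def unpersist(self): FakeRDD.cached.discard(self.name); return self

for rank in range(5):
    predictions = FakeRDD("preds-rank-%d" % rank).persist()
    # ... evaluate the error for this rank ...
    predictions.unpersist()              # free memory before the next rank

print(len(FakeRDD.cached))  # 0 -> nothing left cached after the loop
```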

Matrix Multiplication and mllib.recommendation

2015-06-17 Thread afarahat
Hello; I am trying to get predictions after running the ALS model. The model works fine. In the prediction/recommendation step, I have about 30,000 products and 90 million users. When I try to predict all, it fails. I have been trying to formulate the problem as a matrix multiplication where I
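One way the matrix-multiplication formulation can be made to fit in memory is to score users against all items in chunks of users, keeping only the top-k items per user, so the full users x items score matrix is never materialized. A NumPy sketch with toy sizes standing in for 90M users x 30k items:

```python
import numpy as np

# Sketch: blocked scoring. For each chunk of users, multiply against all
# item factors, keep only the top-k item indices, and discard the scores.
rng = np.random.default_rng(1)
n_users, n_items, rank, k = 1000, 50, 8, 3
U = rng.normal(size=(n_users, rank))   # stand-in user factors
V = rng.normal(size=(n_items, rank))   # stand-in item factors

top_k = np.empty((n_users, k), dtype=np.int64)
chunk = 200
for start in range(0, n_users, chunk):
    scores = U[start:start + chunk] @ V.T          # (chunk, n_items)
    # argpartition finds the k best items per user without a full sort
    top_k[start:start + chunk] = np.argpartition(-scores, k, axis=1)[:, :k]

print(top_k.shape)
```

In a distributed setting the same shape of computation applies: partition the user factors, broadcast the (small) item-factor matrix, and emit only top-k per user.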

Pyspark Dense Matrix Multiply : One of them can fit in Memory

2015-06-16 Thread afarahat
Hello; I would like to multiply two matrices, C = A * B, where A is m x k and B is k x l, with k, l << m, so that B can easily fit in memory. Any ideas or suggestions how to do that in PySpark? Thanks Ayman
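Since B fits in memory, the standard approach is to broadcast B to every worker and multiply each partition of A by it locally, so no shuffle is needed. A NumPy sketch of the idea, with partitions simulated as a list of row-blocks of A:

```python
import numpy as np

# Sketch: C = A @ B when B is small. Broadcast B; multiply each row-block
# of A locally. Partitions of A are simulated as a Python list here.
rng = np.random.default_rng(2)
m, k, l = 1000, 30, 4                     # k, l << m: B is small
A_partitions = [rng.normal(size=(250, k)) for _ in range(4)]
B = rng.normal(size=(k, l))               # in PySpark: B_bc = sc.broadcast(B)

# In PySpark this multiply would happen inside mapPartitions, each worker
# reading B from the broadcast variable instead of shuffling it.
C_partitions = [part @ B for part in A_partitions]
C = np.vstack(C_partitions)
print(C.shape)
```

The result is identical to the monolithic product, since each row of C depends only on one row of A and all of B.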

ALS predictAll not completing

2015-06-15 Thread afarahat
Hello; I have a data set of about 80 million users and 12,000 items (very sparse). I can get the training part working, no problem (the model has 20 factors). However, when I try using predictAll for 80 million x 10 items, the job does not complete. When I use a smaller data set, say 500k or a
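Since smaller data sets complete, one workaround is to split the user set into batches and run the prediction per batch instead of one giant predictAll. A NumPy sketch of the batching shape, with random factors standing in for a fitted MatrixFactorizationModel:

```python
import numpy as np

# Sketch: batch the users and score each batch separately, so no single
# prediction job covers all 80M users at once. U and V are NumPy stand-ins
# for the fitted user/item factors.
def batches(ids, size):
    for start in range(0, len(ids), size):
        yield ids[start:start + size]

rng = np.random.default_rng(3)
n_users, n_items, rank = 800, 40, 6
U = rng.normal(size=(n_users, rank))
V = rng.normal(size=(n_items, rank))

best_item = np.empty(n_users, dtype=np.int64)
for batch in batches(np.arange(n_users), 100):
    # in PySpark: one model.predictAll(...) call per batch of (user, item) pairs
    scores = U[batch] @ V.T
    best_item[batch] = scores.argmax(axis=1)

print(best_item.shape)
```

Each batch's output can be written out (or reduced to top-k) before the next batch starts, keeping peak memory per job bounded.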