Hi Jorge,

Thanks for the example. I managed to get the job to run, but the results are appalling.
The best I could get:

Test Mean Squared Error: 684.3709679595169
Learned regression tree model:
DecisionTreeModel regressor of depth 30 with 6905 nodes

I tried tweaking maxDepth and maxBins, but I couldn't get any better results.

Do you know how I could improve the results?

On 5 February 2016 at 08:34, Jorge Machado <jom...@me.com> wrote:

> Hi,
>
> For example, as an array:
>
> 3 categories: Nov, Dec, Jan.
>
> Nov = 1,0,0
> Dec = 0,1,0
> Jan = 0,0,1
>
> For the complete year you would have 12 categories, like
> Jan = 1,0,0,0,0,0,0,0,0,0,0,0
>
> Pages:
> PageA: 0,0,0,1
> PageB: 0,0,1,0
> PageC: 0,1,0,0
> PageD: 1,0,0,0
>
> If you are using a decision tree, I think you do not need to normalize the
> other values.
>
> At the end, for January and PageA, you should have something like:
>
> LabeledPoint (label , (0,0,1,0,0,0,1,1.0,2.0,3.0))
>
> Pass the LabeledPoint to the ML model.
>
> Test it.
>
> PS: label is what you want to predict.
>
> On 02/02/2016, at 20:44, diplomatic Guru <diplomaticg...@gmail.com> wrote:
>
> Hi Jorge,
>
> Unfortunately, I couldn't transform the data as you suggested.
>
> This is what I get:
>
> +---+---------+-------------+
> | id|pageIndex|      pageVec|
> +---+---------+-------------+
> |0.0|      3.0|    (3,[],[])|
> |1.0|      0.0|(3,[0],[1.0])|
> |2.0|      2.0|(3,[2],[1.0])|
> |3.0|      1.0|(3,[1],[1.0])|
> +---+---------+-------------+
>
>
> This is the snippet:
>
> JavaRDD<Row> jrdd = jsc.parallelize(Arrays.asList(
>     RowFactory.create(0.0, "PageA", 1.0, 2.0, 3.0),
>     RowFactory.create(1.0, "PageB", 4.0, 5.0, 6.0),
>     RowFactory.create(2.0, "PageC", 7.0, 8.0, 9.0),
>     RowFactory.create(3.0, "PageD", 10.0, 11.0, 12.0)
> ));
>
> StructType schema = new StructType(new StructField[] {
>     new StructField("id", DataTypes.DoubleType, false, Metadata.empty()),
>     new StructField("page", DataTypes.StringType, false, Metadata.empty()),
>     new StructField("Nov", DataTypes.DoubleType, false, Metadata.empty()),
>     new StructField("Dec", DataTypes.DoubleType, false, Metadata.empty()),
>     new StructField("Jan", DataTypes.DoubleType, false, Metadata.empty()) });
>
> DataFrame df = sqlContext.createDataFrame(jrdd, schema);
>
> StringIndexerModel indexer = new StringIndexer()
>     .setInputCol("page").setInputCol("Nov")
>     .setInputCol("Dec").setInputCol("Jan")
>     .setOutputCol("pageIndex").fit(df);
>
> OneHotEncoder encoder = new OneHotEncoder()
>     .setInputCol("pageIndex").setOutputCol("pageVec");
>
> DataFrame indexed = indexer.transform(df);
>
> DataFrame encoded = encoder.transform(indexed);
> encoded.select("id", "pageIndex", "pageVec").show();
>
>
> Could you please let me know what I'm doing wrong?
>
>
> PS: My cluster is running Spark 1.3.0, which doesn't support
> StringIndexer or OneHotEncoder, but for testing this I've installed 1.6.0
> on my local machine.
>
> Cheers.
>
>
> On 2 February 2016 at 10:25, Jorge Machado <jom...@me.com> wrote:
>
>> Hi Guru,
>>
>> Any results? :)
>>
>> On 01/02/2016, at 14:34, diplomatic Guru <diplomaticg...@gmail.com>
>> wrote:
>>
>> Hi Jorge,
>>
>> Thank you for the reply and your example.
>> I'll try your suggestion and will let you know the outcome.
>>
>> Cheers
>>
>>
>> On 1 February 2016 at 13:17, Jorge Machado <jom...@me.com> wrote:
>>
>>> Hi Guru,
>>>
>>> So first transform your page names with OneHotEncoder (
>>> https://spark.apache.org/docs/latest/ml-features.html#onehotencoder)
>>> then do the same thing for the months.
>>>
>>> You will end up with something like:
>>> (the first three are the page name, the others the month)
>>> (0,0,1,0,0,1)
>>>
>>> Then you have your label, which is what you want to predict. At the end
>>> you will have a LabeledPoint with (10000 -> (0,0,1,0,0,1)); this will
>>> represent (10000 -> (PageA, UV_NOV)).
>>> After that, try a regression tree with
>>>
>>> val model = DecisionTree.trainRegressor(trainingData,
>>>   categoricalFeaturesInfo, impurity, maxDepth, maxBins)
>>>
>>>
>>> Regards
>>> Jorge
>>>
>>> On 01/02/2016, at 12:29, diplomatic Guru <diplomaticg...@gmail.com>
>>> wrote:
>>>
>>> Any suggestions please?
>>>
>>>
>>> On 29 January 2016 at 22:31, diplomatic Guru <diplomaticg...@gmail.com>
>>> wrote:
>>>
>>>> Hello guys,
>>>>
>>>> I'm trying to understand how I could predict the next month's page views
>>>> based on the previous access pattern.
>>>>
>>>> For example, I've collected statistics on page views:
>>>>
>>>> e.g.
>>>> Page,UniqueView
>>>> -------------------------
>>>> pageA, 10000
>>>> pageB, 999
>>>> ...
>>>> pageZ,200
>>>>
>>>> I aggregate the statistics monthly.
>>>>
>>>> I've prepared a file containing the last 3 months, like this:
>>>>
>>>> e.g.
>>>> Page,UV_NOV, UV_DEC, UV_JAN
>>>> ---------------------------------------------------
>>>> pageA, 10000,9989,11000
>>>> pageB, 999,500,700
>>>> ...
>>>> pageZ,200,50,34
>>>>
>>>>
>>>> Based on the above information, I want to predict the next month (FEB).
>>>>
>>>> Which algorithm do you think would suit best? I think linear regression
>>>> is the safe bet. However, I'm struggling to prepare this data for LR ML,
>>>> especially how to prepare the X,Y relationship.
>>>>
>>>> The Y is easy (unique visitors), but I'm not sure about the X (it should be
>>>> the page, right?). However, how do I plot those three months of data?
>>>>
>>>> Could you give me an example based on the above example data?
>>>>
>>>>
>>>>
>>>> Page,UV_NOV, UV_DEC, UV_JAN
>>>> ---------------------------------------------------
>>>> 1, 10000,9989,11000
>>>> 2, 999,500,700
>>>> ...
>>>> 26,200,50,34
>>>>
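
For reference, the one-hot layout Jorge describes in this thread (month vector followed by page vector) can be sketched in plain Java, without Spark. This is only an illustration under assumptions: the class name OneHotFeatures and its two helper methods are hypothetical and not part of any Spark API, and the category orderings are chosen simply to reproduce the Nov/Dec/Jan and PageA..PageD vectors given above.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class for illustration only; not a Spark API.
class OneHotFeatures {

    /** Returns a vector with 1.0 at the position of value in categories, 0.0 elsewhere. */
    public static double[] oneHot(List<String> categories, String value) {
        int index = categories.indexOf(value);
        if (index < 0) {
            throw new IllegalArgumentException("Unknown category: " + value);
        }
        double[] encoded = new double[categories.size()];
        encoded[index] = 1.0;
        return encoded;
    }

    /** Concatenates the month encoding and the page encoding into one feature vector. */
    public static double[] features(List<String> months, String month,
                                    List<String> pages, String page) {
        double[] monthVec = oneHot(months, month);
        double[] pageVec = oneHot(pages, page);
        double[] out = Arrays.copyOf(monthVec, monthVec.length + pageVec.length);
        System.arraycopy(pageVec, 0, out, monthVec.length, pageVec.length);
        return out;
    }

    public static void main(String[] args) {
        // Orderings assumed so that Jan = 0,0,1 and PageA = 0,0,0,1 as in the thread.
        List<String> months = Arrays.asList("Nov", "Dec", "Jan");
        List<String> pages = Arrays.asList("PageD", "PageC", "PageB", "PageA");

        double[] x = features(months, "Jan", pages, "PageA");
        System.out.println(Arrays.toString(x)); // prints [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0]
    }
}
```

A vector built this way would then be paired with the value to predict (the label) when constructing each LabeledPoint.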