Hi Jorge,

Thanks for the example. I managed to get the job to run but the results are
appalling.

The best I could get is:
Test Mean Squared Error: 684.3709679595169
Learned regression tree model:
DecisionTreeModel regressor of depth 30 with 6905 nodes
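
Whether that MSE is large depends on the scale of the label. As a rough aid, it can be converted to an RMSE, which is in the same units as the label (roughly 26 here). A minimal sketch in plain Java; the class and method names are made up for illustration and are not part of Spark:

```java
public class RmseSketch {
    // Root mean squared error is the square root of the MSE,
    // which puts the error back in the units of the label.
    static double rmseFromMse(double mse) {
        return Math.sqrt(mse);
    }

    public static void main(String[] args) {
        double testMse = 684.3709679595169; // value reported above
        System.out.println("Test RMSE: " + rmseFromMse(testMse));
    }
}
```

An RMSE of that size would be small against a page with around 10000 views and large against one with a few hundred, so it may be worth comparing it with the typical label value in the test set.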

I tried tweaking maxDepth and maxBins but I couldn't get any better results.

Do you know how I could improve the results?
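
For reference, the feature layout described in the quoted example below (one-hot month, then one-hot page, then the numeric values) can be sketched in plain Java. This is only an illustration under assumptions: the class and helper names are hypothetical, nothing here calls Spark, and the category order is chosen to match the encodings in the quoted message.

```java
import java.util.Arrays;
import java.util.List;

public class OneHotSketch {
    // Returns a vector of length categories.size() with a single 1.0
    // at the position of the given value.
    static double[] oneHot(List<String> categories, String value) {
        int idx = categories.indexOf(value);
        if (idx < 0) {
            throw new IllegalArgumentException("Unknown category: " + value);
        }
        double[] vec = new double[categories.size()];
        vec[idx] = 1.0;
        return vec;
    }

    // Concatenates the one-hot month, the one-hot page, and the numeric values.
    static double[] features(List<String> months, String month,
                             List<String> pages, String page,
                             double... numeric) {
        double[] m = oneHot(months, month);
        double[] p = oneHot(pages, page);
        double[] out = new double[m.length + p.length + numeric.length];
        System.arraycopy(m, 0, out, 0, m.length);
        System.arraycopy(p, 0, out, m.length, p.length);
        System.arraycopy(numeric, 0, out, m.length + p.length, numeric.length);
        return out;
    }

    public static void main(String[] args) {
        List<String> months = Arrays.asList("Nov", "Dec", "Jan");
        // Ordered so that PageA maps to 0,0,0,1 as in the quoted example.
        List<String> pages = Arrays.asList("PageD", "PageC", "PageB", "PageA");
        double[] f = features(months, "Jan", pages, "PageA", 1.0, 2.0, 3.0);
        // prints [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 1.0, 2.0, 3.0]
        System.out.println(Arrays.toString(f));
    }
}
```

The resulting array has 3 + 4 + 3 = 10 entries, which is the shape the features of each LabeledPoint would need to have.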



On 5 February 2016 at 08:34, Jorge Machado <jom...@me.com> wrote:

> Hi,
>
> For example, as an array:
>
> 3 Categories : Nov,Dec, Jan.
>
> Nov = 1,0,0
> Dec = 0,1,0
> Jan = 0,0,1
> For the complete year you would have 12 categories, e.g.
> Jan = 1,0,0,0,0,0,0,0,0,0,0,0
> Pages:
> PageA: 0,0,0,1
> PageB: 0,0,1,0
> PageC: 0,1,0,0
> PageD: 1,0,0,0
>
> If you are using a decision tree, I think you do not need to normalize the
> other values.
>
> For January and PageA you should end up with something like:
>
> LabeledPoint(label, (0,0,1,0,0,0,1,1.0,2.0,3.0))
>
> Pass the LabeledPoint to the ML model.
>
> Test it.
>
> PS: label is what you want to predict.
>
> On 02/02/2016, at 20:44, diplomatic Guru <diplomaticg...@gmail.com> wrote:
>
> Hi Jorge,
>
> Unfortunately, I couldn't transform the data as you suggested.
>
> This is what I get:
>
> +---+---------+-------------+
> | id|pageIndex|      pageVec|
> +---+---------+-------------+
> |0.0|      3.0|    (3,[],[])|
> |1.0|      0.0|(3,[0],[1.0])|
> |2.0|      2.0|(3,[2],[1.0])|
> |3.0|      1.0|(3,[1],[1.0])|
> +---+---------+-------------+
>
>
> This is the snippet:
>
> JavaRDD<Row> jrdd = jsc.parallelize(Arrays.asList(
>         RowFactory.create(0.0, "PageA", 1.0, 2.0, 3.0),
>         RowFactory.create(1.0, "PageB", 4.0, 5.0, 6.0),
>         RowFactory.create(2.0, "PageC", 7.0, 8.0, 9.0),
>         RowFactory.create(3.0, "PageD", 10.0, 11.0, 12.0)
>
>     ));
>
>     StructType schema = new StructType(new StructField[] {
>         new StructField("id", DataTypes.DoubleType, false, Metadata.empty()),
>         new StructField("page", DataTypes.StringType, false, Metadata.empty()),
>         new StructField("Nov", DataTypes.DoubleType, false, Metadata.empty()),
>         new StructField("Dec", DataTypes.DoubleType, false, Metadata.empty()),
>         new StructField("Jan", DataTypes.DoubleType, false, Metadata.empty())
>     });
>
>     DataFrame df = sqlContext.createDataFrame(jrdd, schema);
>
>     StringIndexerModel indexer = new StringIndexer()
>         .setInputCol("page").setInputCol("Nov")
>         .setInputCol("Dec").setInputCol("Jan")
>         .setOutputCol("pageIndex").fit(df);
>
>     OneHotEncoder encoder = new OneHotEncoder()
>         .setInputCol("pageIndex").setOutputCol("pageVec");
>
>     DataFrame indexed = indexer.transform(df);
>
>     DataFrame encoded = encoder.transform(indexed);
>     encoded.select("id", "pageIndex", "pageVec").show();
>
>
> Could you please let me know what I'm doing wrong?
>
>
> PS: My cluster is running Spark 1.3.0, which doesn't support
> StringIndexer or OneHotEncoder, but for testing this I've installed 1.6.0
> on my local machine.
>
> Cheers.
>
>
> On 2 February 2016 at 10:25, Jorge Machado <jom...@me.com> wrote:
>
>> Hi Guru,
>>
>> Any results ? :)
>>
>> On 01/02/2016, at 14:34, diplomatic Guru <diplomaticg...@gmail.com>
>> wrote:
>>
>> Hi Jorge,
>>
>> Thank you for the reply and your example. I'll try your suggestion and
>> will let you know the outcome.
>>
>> Cheers
>>
>>
>> On 1 February 2016 at 13:17, Jorge Machado <jom...@me.com> wrote:
>>
>>> Hi Guru,
>>>
>>> First, transform your page names with OneHotEncoder (
>>> https://spark.apache.org/docs/latest/ml-features.html#onehotencoder),
>>> then do the same for the months.
>>>
>>> You will end up with something like:
>>> (the first three are the page name, the others the month)
>>> (0,0,1,0,0,1)
>>>
>>> Then you have your label, which is what you want to predict. At the end
>>> you will have a LabeledPoint with (10000 -> (0,0,1,0,0,1)), which
>>> represents (10000 -> (PageA, UV_NOV)).
>>> After that try a regression tree with
>>>
>>> val model = DecisionTree.trainRegressor(trainingData,
>>>     categoricalFeaturesInfo, impurity, maxDepth, maxBins)
>>>
>>>
>>> Regards
>>> Jorge
>>>
>>> On 01/02/2016, at 12:29, diplomatic Guru <diplomaticg...@gmail.com>
>>> wrote:
>>>
>>> Any suggestions please?
>>>
>>>
>>> On 29 January 2016 at 22:31, diplomatic Guru <diplomaticg...@gmail.com>
>>> wrote:
>>>
>>>> Hello guys,
>>>>
>>>> I'm trying to understand how I could predict next month's page views
>>>> based on the previous access pattern.
>>>>
>>>> For example, I've collected statistics on page views:
>>>>
>>>> e.g.
>>>> Page,UniqueView
>>>> -------------------------
>>>> pageA, 10000
>>>> pageB, 999
>>>> ...
>>>> pageZ,200
>>>>
>>>> I aggregate the statistics monthly.
>>>>
>>>> I've prepared a file containing the last 3 months, like this:
>>>>
>>>> e.g.
>>>> Page,UV_NOV, UV_DEC, UV_JAN
>>>> ---------------------------------------------------
>>>> pageA, 10000,9989,11000
>>>> pageB, 999,500,700
>>>> ...
>>>> pageZ,200,50,34
>>>>
>>>>
>>>> Based on above information, I want to predict the next month (FEB).
>>>>
>>>> Which algorithm do you think would suit best? I think linear regression
>>>> is the safe bet. However, I'm struggling to prepare this data for LR,
>>>> especially the X,Y relationship.
>>>>
>>>> The Y is easy (unique visitors), but I'm not sure about the X (it should
>>>> be the page, right?). Also, how do I plot those three months of data?
>>>>
>>>> Could you give me an example based on above example data?
>>>>
>>>>
>>>>
>>>> Page,UV_NOV, UV_DEC, UV_JAN
>>>> ---------------------------------------------------
>>>> 1, 10000,9989,11000
>>>> 2, 999,500,700
>>>> ...
>>>> 26,200,50,34
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>
>
