PMML doesn’t make a lot of sense when the model is a potentially massive 
matrix. One reason is that it will be pretty hard (impossible?) to parallelize 
read/write with the engines we use. JSON has the same problem and the only way 
SchemaRDD can read JSON is by bending the rules.

Seems like a good thing to support for algos that can make good use of it. Does 
that narrow it down to naive bayes today?

On Mar 5, 2015, at 2:19 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:

PMML is a machine-to-machine mechanism, not intended really for human
consumption or production.  Based on XML, it is, of course, bloated, but
that doesn't really matter for readability since reading isn't the goal.

The vision of making models easy to transfer from system to system is nice,
but the reality has fallen far short, unfortunately.  The problem is that
systems often have special aspects that make it hard to replicate exact
actions from one system to another.  Having a textual format for numerical
data doesn't help.
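
One concrete way a textual format can hurt, as a minimal sketch: a double written out with 15 significant digits does not always parse back to the same value, while 17 digits always does. The helper name `roundtrips` is made up for illustration, and whether any given PMML exporter truncates like this is an assumption, not something checked here.

```python
def roundtrips(value, digits):
    """Format a float with the given number of significant digits,
    parse the text back, and report whether the value survived."""
    text = f"{value:.{digits}g}"
    return float(text) == value


# A value whose shortest exact decimal form needs 17 significant digits.
x = 0.1 + 0.2

lost_at_15 = not roundtrips(x, 15)   # 15 digits collapses it to "0.3"
kept_at_17 = roundtrips(x, 17)       # 17 digits is enough for any double

print(f"{x:.15g}", lost_at_15)
print(f"{x:.17g}", kept_at_17)
```

A system that reads such a file back gets coefficients that are close but not bit-identical, which is one reason scores can differ slightly between the producing and consuming systems.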

Here, for instance, is a linear regression model that I created using R:

<PMML version="4.2" xmlns="http://www.dmg.org/PMML-4_2"
 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
 xsi:schemaLocation="http://www.dmg.org/PMML-4_2 http://www.dmg.org/v4-2/pmml-4-2.xsd">
<Header copyright="Copyright (c) 2015 tdunning" description="Linear Regression Model">
 <Extension name="user" value="tdunning" extender="Rattle/PMML"/>
 <Application name="Rattle/PMML" version="1.4"/>
 <Timestamp>2015-03-05 09:46:32</Timestamp>
</Header>
<DataDictionary numberOfFields="4">
 <DataField name="y" optype="continuous" dataType="double"/>
 <DataField name="x1" optype="continuous" dataType="double"/>
 <DataField name="x2" optype="continuous" dataType="double"/>
 <DataField name="x3" optype="continuous" dataType="double"/>
</DataDictionary>
<RegressionModel modelName="Linear_Regression_Model"
functionName="regression" algorithmName="least squares">
 <MiningSchema>
  <MiningField name="y" usageType="predicted"/>
  <MiningField name="x1" usageType="active"/>
  <MiningField name="x2" usageType="active"/>
  <MiningField name="x3" usageType="active"/>
 </MiningSchema>
 <Output>
  <OutputField name="Predicted_y" feature="predictedValue"/>
 </Output>
 <RegressionTable intercept="-0.000669089797102863">
  <NumericPredictor name="x1" exponent="1" coefficient="3.00018785681213"/>
  <NumericPredictor name="x2" exponent="1" coefficient="-1.00362806356329"/>
  <NumericPredictor name="x3" exponent="1" coefficient="0.998224481877296"/>
 </RegressionTable>
</RegressionModel>
</PMML>

This looks pretty reasonable (if verbose).   It takes 1.5kB to store a
model but this compresses to around 600 bytes.
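
For a sense of what a consumer has to do with a file like this, here is a rough sketch that reads the regression table and scores one row. It uses a trimmed copy of the model above (header, data dictionary and mining schema dropped, so it is not schema-valid PMML), and the function names `load_regression` and `predict` are hypothetical, not part of any PMML library.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the xmlns attribute of the model above.
PMML_NS = {"p": "http://www.dmg.org/PMML-4_2"}

# Trimmed copy of the model: only the parts needed for scoring.
PMML_DOC = """<PMML version="4.2" xmlns="http://www.dmg.org/PMML-4_2">
<RegressionModel modelName="Linear_Regression_Model"
 functionName="regression" algorithmName="least squares">
 <RegressionTable intercept="-0.000669089797102863">
  <NumericPredictor name="x1" exponent="1" coefficient="3.00018785681213"/>
  <NumericPredictor name="x2" exponent="1" coefficient="-1.00362806356329"/>
  <NumericPredictor name="x3" exponent="1" coefficient="0.998224481877296"/>
 </RegressionTable>
</RegressionModel>
</PMML>"""


def load_regression(pmml_text):
    """Return (intercept, [(name, exponent, coefficient), ...])."""
    root = ET.fromstring(pmml_text)
    table = root.find(".//p:RegressionTable", PMML_NS)
    intercept = float(table.get("intercept"))
    terms = [
        (p.get("name"), int(p.get("exponent", "1")), float(p.get("coefficient")))
        for p in table.findall("p:NumericPredictor", PMML_NS)
    ]
    return intercept, terms


def predict(model, row):
    """Score one row: intercept + sum(coefficient * x ** exponent)."""
    intercept, terms = model
    return intercept + sum(coef * row[name] ** exp for name, exp, coef in terms)


model = load_regression(PMML_DOC)
prediction = predict(model, {"x1": 1.0, "x2": 1.0, "x3": 1.0})
print(prediction)
```

This only covers numeric predictors in a single regression table; categorical predictors, transformations and the other PMML model types need a good deal more handling.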

More involved models are a different story.  I built a simple random forest
on the same data and simply converting it to PMML took several minutes.
Presumably the R package involved is kind of inefficient, but this still is
pretty daunting.  Manipulating the resulting PMML representation is
actually quite difficult.

Saving the random forest model ultimately resulted in a 50MB file.
Compression reduced that to about 6MB.  This is pretty massive for a fairly
simple model.




On Thu, Mar 5, 2015 at 4:25 AM, Andrew Musselman <andrew.mussel...@gmail.com> wrote:

> I think keeping it simple is best, try implementing one or two models in
> XML and then get fancy if it makes sense.
> 
> On Wednesday, March 4, 2015, Saikat Kanjilal <sxk1...@hotmail.com> wrote:
> 
>> Next question: Is the audience for PMML programmers or could it be folks
>> that can script?  I'm wondering how this intersects with a simple
>> Spark-like DSL; could Mahout implement an intersection between the two?  If
>> there's interest I can go into examples.
>> 
>> Sent from my iPhone
>> 
>>> On Mar 4, 2015, at 4:17 PM, Andrew Musselman <andrew.mussel...@gmail.com> wrote:
>>> 
>>> Sure, those would be options.
>>> 
>>>> On Wed, Mar 4, 2015 at 3:41 PM, Saikat Kanjilal <sxk1...@hotmail.com> wrote:
>>>> 
>>>> Question: is there a way to introduce PMML using a more lightweight
>>>> format like YAML or JSON?
>>>> 
>>>>> Date: Wed, 4 Mar 2015 13:25:29 -0800
>>>>> Subject: Re: PMML
>>>>> From: andrew.mussel...@gmail.com
>>>>> To: dev@mahout.apache.org
>>>>> 
>>>>> Yes, the limitations are often an issue for people doing things that
>>>>> aren't in the PMML spec yet; there could be room for suggesting new
>>>>> features in the spec by building them though, I suppose.
>>>>> 
>>>>> Also agree that XML is a lousy/bloated way of representing stuff like
>>>>> this, but in the end it's just a choice of representation so there may
>>>>> be reason to use some other encoding and then provide an XML-export
>>>>> function.
>>>>> 
>>>>>> On Wed, Mar 4, 2015 at 11:42 AM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
>>>>> 
>>>>>> I am willing to +1 any contribution at this point.
>>>>>> 
>>>>>> my previous company used pmml to serialize simple stuff, but i don't
>>>>>> have first hand experience. Its flexibility is ultimately pretty
>>>>>> limited, isn't it? And xml is ultimately a medium which is too ugly and
>>>>>> too verbose at the same time to represent models with any more or less
>>>>>> decent number of parameters?
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Tue, Mar 3, 2015 at 8:19 PM, Suneel Marthi <suneel.mar...@gmail.com> wrote:
>>>>>>> It makes sense to support PMML for classification and clustering tasks
>>>>>>> to be able to share and distribute trained models. Sean, Pat, Dmitriy
>>>>>>> and Ted, please chime in.
>>>>>>> 
>>>>>>> PMML support in Mahout has been talked about for a long time now but
>>>>>>> never really got any traction to take off.
>>>>>>> 
>>>>>>> +1 to build this.
>>>>>>> 
>>>>>>> On Tue, Mar 3, 2015 at 11:14 PM, Andrew Musselman <andrew.mussel...@gmail.com> wrote:
>>>>>>> 
>>>>>>>> How much interest is there in a mahout-pmml module, with a starting
>>>>>>>> point to be able to export a few analytic/scoring jobs to PMML
>>>>>>>> representation?
>>>>>>>> 
>>>>>>>> I've seen a lot of interest in being able to use PMML to translate
>>>>>>>> analytic work into production (though I think people talk about it
>>>>>>>> more than they do it), and it could be a benchmark as part of a
>>>>>>>> "definition of done" for any existing/new method we include since
>>>>>>>> there's a spec to build to.
>>>>>>>> 
>>>>>>>> Best
>>>>>>>> Andrew
>>>> 
>>>> 
>> 
> 
