Yes, it makes sense to have one for Naive Bayes and KMeans (once we have
that!).

On Thu, Mar 5, 2015 at 11:49 AM, Pat Ferrel <p...@occamsmachete.com> wrote:

> PMML doesn’t make a lot of sense when the model is a potentially massive
> matrix. One reason is that it will be pretty hard (impossible?) to
> parallelize read/write with the engines we use. JSON has the same problem
> and the only way SchemaRDD can read JSON is by bending the rules.
>
> Seems like a good thing to support for algos that can make good use of it.
> Does that narrow it down to naive bayes today?
>
> On Mar 5, 2015, at 2:19 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:
>
> PMML is a machine-to-machine mechanism, not intended really for human
> consumption or production.  Based on XML, it is, of course, bloated, but
> that doesn't really matter for readability since reading isn't the goal.
>
> The vision of making models easy to transfer from system to system is nice,
> but the reality has fallen far short, unfortunately.  The problem is that
> systems often have special aspects that make it hard to replicate exact
> actions from one system to another.  Having a textual format for numerical
> data doesn't help.
>
> Here, for instance, is a linear regression model that I created using R:
>
> <PMML version="4.2" xmlns="http://www.dmg.org/PMML-4_2"
> xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
> xsi:schemaLocation="http://www.dmg.org/PMML-4_2 http://www.dmg.org/v4-2/pmml-4-2.xsd">
> <Header copyright="Copyright (c) 2015 tdunning" description="Linear Regression Model">
>  <Extension name="user" value="tdunning" extender="Rattle/PMML"/>
>  <Application name="Rattle/PMML" version="1.4"/>
>  <Timestamp>2015-03-05 09:46:32</Timestamp>
> </Header>
> <DataDictionary numberOfFields="4">
>  <DataField name="y" optype="continuous" dataType="double"/>
>  <DataField name="x1" optype="continuous" dataType="double"/>
>  <DataField name="x2" optype="continuous" dataType="double"/>
>  <DataField name="x3" optype="continuous" dataType="double"/>
> </DataDictionary>
> <RegressionModel modelName="Linear_Regression_Model"
> functionName="regression" algorithmName="least squares">
>  <MiningSchema>
>   <MiningField name="y" usageType="predicted"/>
>   <MiningField name="x1" usageType="active"/>
>   <MiningField name="x2" usageType="active"/>
>   <MiningField name="x3" usageType="active"/>
>  </MiningSchema>
>  <Output>
>   <OutputField name="Predicted_y" feature="predictedValue"/>
>  </Output>
>  <RegressionTable intercept="-0.000669089797102863">
>   <NumericPredictor name="x1" exponent="1" coefficient="3.00018785681213"/>
>   <NumericPredictor name="x2" exponent="1"
> coefficient="-1.00362806356329"/>
>   <NumericPredictor name="x3" exponent="1"
> coefficient="0.998224481877296"/>
>  </RegressionTable>
> </RegressionModel>
> </PMML>
>
> This looks pretty reasonable (if verbose).   It takes 1.5kB to store a
> model but this compresses to around 600 bytes.
>
> More involved models are a different story.  I built a simple random forest
> on the same data, and just the conversion to PMML took several minutes.
> Presumably the R package involved is kind of inefficient, but this still is
> pretty daunting.  Manipulating the resulting PMML representation is
> actually quite difficult.
>
> Saving the random forest model ultimately resulted in a 50MB file.
> Compression reduced that to about 6MB.  This is pretty massive for a fairly
> simple model.
>
>
>
>
> On Thu, Mar 5, 2015 at 4:25 AM, Andrew Musselman <andrew.mussel...@gmail.com> wrote:
>
> > I think keeping it simple is best, try implementing one or two models in
> > XML and then get fancy if it makes sense.
> >
> > On Wednesday, March 4, 2015, Saikat Kanjilal <sxk1...@hotmail.com> wrote:
> >
> >> Next question: Is the audience for PMML programmers, or could it be folks
> >> who can script?  I'm wondering how this intersects with a simple
> >> Spark-like DSL; could Mahout implement an intersection between the two?
> >> If there's interest I can go into examples.
> >>
> >> Sent from my iPhone
> >>
> >>> On Mar 4, 2015, at 4:17 PM, Andrew Musselman <andrew.mussel...@gmail.com> wrote:
> >>>
> >>> Sure, those would be options.
> >>>
> >>>> On Wed, Mar 4, 2015 at 3:41 PM, Saikat Kanjilal <sxk1...@hotmail.com> wrote:
> >>>>
> >>>> Question: is there a way to introduce PMML using a more lightweight
> >>>> format like YAML or JSON?
> >>>>
> >>>>> Date: Wed, 4 Mar 2015 13:25:29 -0800
> >>>>> Subject: Re: PMML
> >>>>> From: andrew.mussel...@gmail.com
> >>>>> To: dev@mahout.apache.org
> >>>>>
> >>>>> Yes, the limitations are often an issue for people doing things that
> >>>>> aren't in the PMML spec yet; there could be room for suggesting new
> >>>>> features in the spec by building them though, I suppose.
> >>>>>
> >>>>> Also agree that XML is a lousy/bloated way of representing stuff like
> >>>>> this, but in the end it's just a choice of representation so there may
> >>>>> be reason to use some other encoding and then provide an XML-export
> >>>>> function.
> >>>>>
> >>>>>> On Wed, Mar 4, 2015 at 11:42 AM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
> >>>>>
> >>>>>> I am willing to +1 any contribution at this point.
> >>>>>>
> >>>>>> my previous company used pmml to serialize simple stuff, but i don't
> >>>>>> have first hand experience. Its flexibility is ultimately pretty
> >>>>>> limited, isn't it? And xml is ultimately a medium which is too ugly
> >>>>>> and too verbose at the same time to represent models with any more or
> >>>>>> less decent number of parameters?
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On Tue, Mar 3, 2015 at 8:19 PM, Suneel Marthi <suneel.mar...@gmail.com> wrote:
> >>>>>>> It makes sense to support PMML for classification and clustering
> >>>>>>> tasks to be able to share and distribute trained models. Sean, Pat,
> >>>>>>> Dmitriy and Ted please chime in.
> >>>>>>>
> >>>>>>> PMML support in Mahout has been talked about for a long time but
> >>>>>>> never really got any traction.
> >>>>>>>
> >>>>>>> +1 to build this.
> >>>>>>>
> >>>>>>> On Tue, Mar 3, 2015 at 11:14 PM, Andrew Musselman <andrew.mussel...@gmail.com> wrote:
> >>>>>>>
> >>>>>>>> How much interest is there in a mahout-pmml module, with a starting
> >>>>>>>> point to be able to export a few analytic/scoring jobs to PMML
> >>>>>>>> representation?
> >>>>>>>>
> >>>>>>>> I've seen a lot of interest in being able to use PMML to translate
> >>>>>>>> analytic work into production (though I think people talk about it
> >>>>>>>> more than they do it), and it could be a benchmark as part of a
> >>>>>>>> "definition of done" for any existing/new method we include since
> >>>>>>>> there's a spec to build to.
> >>>>>>>>
> >>>>>>>> Best
> >>>>>>>> Andrew
> >>>>
> >>>>
> >>
> >
>
>
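
To make the RegressionTable in Ted's quoted PMML concrete, here is a minimal
Python sketch of how a consumer might score it. It assumes a trimmed copy of
that document (Header, DataDictionary, MiningSchema and Output are omitted, so
the string below is not schema-valid PMML). load_regression and predict are
hypothetical helper names, not part of Mahout or of any PMML library, and only
NumericPredictor terms are handled.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the quoted model: only the RegressionTable is kept.
# Assumption: the PMML 4.2 namespace from the quoted document.
PMML_DOC = """<PMML version="4.2" xmlns="http://www.dmg.org/PMML-4_2">
 <RegressionModel modelName="Linear_Regression_Model"
   functionName="regression" algorithmName="least squares">
  <RegressionTable intercept="-0.000669089797102863">
   <NumericPredictor name="x1" exponent="1" coefficient="3.00018785681213"/>
   <NumericPredictor name="x2" exponent="1" coefficient="-1.00362806356329"/>
   <NumericPredictor name="x3" exponent="1" coefficient="0.998224481877296"/>
  </RegressionTable>
 </RegressionModel>
</PMML>"""

NS = {"pmml": "http://www.dmg.org/PMML-4_2"}


def load_regression(pmml_text):
    """Hypothetical helper: return (intercept, [(name, exponent, coefficient), ...])
    read from the first RegressionTable in the document."""
    root = ET.fromstring(pmml_text)
    table = root.find(".//pmml:RegressionTable", NS)
    intercept = float(table.get("intercept"))
    terms = [
        (p.get("name"), int(p.get("exponent", "1")), float(p.get("coefficient")))
        for p in table.findall("pmml:NumericPredictor", NS)
    ]
    return intercept, terms


def predict(model, row):
    """Hypothetical helper: intercept + sum(coefficient * x ** exponent)."""
    intercept, terms = model
    return intercept + sum(coef * row[name] ** exp for name, exp, coef in terms)


model = load_regression(PMML_DOC)
print(predict(model, {"x1": 1.0, "x2": 2.0, "x3": 3.0}))
```

Note that the coefficients travel as decimal text and are re-parsed with
float(), which is the "textual format for numerical data" concern raised in
the quoted message.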
