Yes, it makes sense to have one for Naive Bayes and KMeans (when we have that!!).

On Thu, Mar 5, 2015 at 11:49 AM, Pat Ferrel <p...@occamsmachete.com> wrote:

> PMML doesn’t make a lot of sense when the model is a potentially massive
> matrix. One reason is that it will be pretty hard (impossible?) to
> parallelize read/write with the engines we use. JSON has the same problem
> and the only way SchemaRDD can read JSON is by bending the rules.
>
> Seems like a good thing to support for algos that can make good use of it.
> Does that narrow it down to naive bayes today?
>
> On Mar 5, 2015, at 2:19 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:
>
> PMML is a machine-to-machine mechanism, not intended really for human
> consumption or production. Based on XML, it is, of course, bloated, but
> that doesn't really matter for readability since reading isn't the goal.
>
> The vision of making models easy to transfer from system to system is nice,
> but the reality has fallen far short, unfortunately. The problem is that
> systems often have special aspects that make it hard to replicate exact
> actions from one system to another. Having a textual format for numerical
> data doesn't help.
>
> Here, for instance, is a linear regression model that I created using R:
>
> <PMML version="4.2" xmlns="http://www.dmg.org/PMML-4_2"
>       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
>       xsi:schemaLocation="http://www.dmg.org/PMML-4_2 http://www.dmg.org/v4-2/pmml-4-2.xsd">
>   <Header copyright="Copyright (c) 2015 tdunning" description="Linear Regression Model">
>     <Extension name="user" value="tdunning" extender="Rattle/PMML"/>
>     <Application name="Rattle/PMML" version="1.4"/>
>     <Timestamp>2015-03-05 09:46:32</Timestamp>
>   </Header>
>   <DataDictionary numberOfFields="4">
>     <DataField name="y" optype="continuous" dataType="double"/>
>     <DataField name="x1" optype="continuous" dataType="double"/>
>     <DataField name="x2" optype="continuous" dataType="double"/>
>     <DataField name="x3" optype="continuous" dataType="double"/>
>   </DataDictionary>
>   <RegressionModel modelName="Linear_Regression_Model"
>                    functionName="regression" algorithmName="least squares">
>     <MiningSchema>
>       <MiningField name="y" usageType="predicted"/>
>       <MiningField name="x1" usageType="active"/>
>       <MiningField name="x2" usageType="active"/>
>       <MiningField name="x3" usageType="active"/>
>     </MiningSchema>
>     <Output>
>       <OutputField name="Predicted_y" feature="predictedValue"/>
>     </Output>
>     <RegressionTable intercept="-0.000669089797102863">
>       <NumericPredictor name="x1" exponent="1" coefficient="3.00018785681213"/>
>       <NumericPredictor name="x2" exponent="1" coefficient="-1.00362806356329"/>
>       <NumericPredictor name="x3" exponent="1" coefficient="0.998224481877296"/>
>     </RegressionTable>
>   </RegressionModel>
> </PMML>
>
> This looks pretty reasonable (if verbose). It takes 1.5kB to store a
> model but this compresses to around 600 bytes.
>
> More involved models are a different story. I built a simple random forest
> on the same data and simply converting it to PMML took several minutes.
> Presumably the R package involved is kind of inefficient, but this still is
> pretty daunting. Manipulating the resulting PMML representation is
> actually quite difficult.
>
> Saving the random forest model ultimately resulted in a 50MB file.
> Compression reduced that to about 6MB. This is pretty massive for a fairly
> simple model.
>
>
> On Thu, Mar 5, 2015 at 4:25 AM, Andrew Musselman <andrew.mussel...@gmail.com> wrote:
>
> > I think keeping it simple is best; try implementing one or two models in
> > XML and then get fancy if it makes sense.
> >
> > On Wednesday, March 4, 2015, Saikat Kanjilal <sxk1...@hotmail.com> wrote:
> >
> >> Next question: Is the audience for PMML programmers or could it be folks
> >> that can script? I'm wondering how this intersects with a simple
> >> Spark-like DSL; could Mahout implement an intersection between the two?
> >> If there's interest I can go into examples.
> >>
> >> Sent from my iPhone
> >>
> >>> On Mar 4, 2015, at 4:17 PM, Andrew Musselman <andrew.mussel...@gmail.com> wrote:
> >>>
> >>> Sure, those would be options.
> >>>
> >>>> On Wed, Mar 4, 2015 at 3:41 PM, Saikat Kanjilal <sxk1...@hotmail.com> wrote:
> >>>>
> >>>> Question: is there a way to introduce PMML using a more lightweight
> >>>> format like YAML or JSON?
> >>>>
> >>>>> Date: Wed, 4 Mar 2015 13:25:29 -0800
> >>>>> Subject: Re: PMML
> >>>>> From: andrew.mussel...@gmail.com
> >>>>> To: dev@mahout.apache.org
> >>>>>
> >>>>> Yes, the limitations are often an issue for people doing things that aren't
> >>>>> in the PMML spec yet; there could be room for suggesting new features in
> >>>>> the spec by building them though, I suppose.
> >>>>>
> >>>>> Also agree that XML is a lousy/bloated way of representing stuff like this,
> >>>>> but in the end it's just a choice of representation so there may be reason
> >>>>> to use some other encoding and then provide an XML-export function.
> >>>>>
> >>>>>> On Wed, Mar 4, 2015 at 11:42 AM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
> >>>>>>
> >>>>>> I am willing to +1 any contribution at this point.
> >>>>>>
> >>>>>> My previous company used PMML to serialize simple stuff, but I don't
> >>>>>> have first-hand experience. Its flexibility is ultimately pretty
> >>>>>> limited, isn't it? And XML is ultimately a medium which is too ugly and
> >>>>>> too verbose at the same time to represent models with any more or less
> >>>>>> decent number of parameters?
> >>>>>>
> >>>>>> On Tue, Mar 3, 2015 at 8:19 PM, Suneel Marthi <suneel.mar...@gmail.com> wrote:
> >>>>>>> It makes sense to support PMML for classification and clustering tasks to
> >>>>>>> be able to share and distribute trained models. Sean, Pat, Dmitriy and Ted
> >>>>>>> please chime in.
> >>>>>>>
> >>>>>>> PMML support in Mahout has been talked about for a long time now but never
> >>>>>>> really got any traction to take off.
> >>>>>>>
> >>>>>>> +1 to build this.
> >>>>>>>
> >>>>>>> On Tue, Mar 3, 2015 at 11:14 PM, Andrew Musselman <andrew.mussel...@gmail.com> wrote:
> >>>>>>>
> >>>>>>>> How much interest is there in a mahout-pmml module, with a starting point
> >>>>>>>> to be able to export a few analytic/scoring jobs to PMML representation?
> >>>>>>>>
> >>>>>>>> I've seen a lot of interest in being able to use PMML to translate
> >>>>>>>> analytic work into production (though I think people talk about it more than
> >>>>>>>> they do it), and it could be a benchmark as part of a "definition of done"
> >>>>>>>> for any existing/new method we include since there's a spec to build to.
> >>>>>>>>
> >>>>>>>> Best
> >>>>>>>> Andrew
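
For reference, here is a rough sketch of what consuming the linear regression PMML quoted in Ted's message might look like on the receiving side. It is not from the thread and not Mahout code: it is plain Python using only the standard library, the PMML is trimmed down to the RegressionTable for brevity, and the names PMML_DOC, load_regression and score, as well as the input row, are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the quoted PMML 4.2 document.
PMML_NS = {"p": "http://www.dmg.org/PMML-4_2"}

# Trimmed copy of the quoted model (header, data dictionary and mining
# schema omitted for brevity; this excerpt is an illustration only).
PMML_DOC = """<PMML version="4.2" xmlns="http://www.dmg.org/PMML-4_2">
  <RegressionModel modelName="Linear_Regression_Model"
                   functionName="regression" algorithmName="least squares">
    <RegressionTable intercept="-0.000669089797102863">
      <NumericPredictor name="x1" exponent="1" coefficient="3.00018785681213"/>
      <NumericPredictor name="x2" exponent="1" coefficient="-1.00362806356329"/>
      <NumericPredictor name="x3" exponent="1" coefficient="0.998224481877296"/>
    </RegressionTable>
  </RegressionModel>
</PMML>"""


def load_regression(pmml_text):
    """Return (intercept, [(name, coefficient, exponent), ...])."""
    root = ET.fromstring(pmml_text)
    table = root.find("p:RegressionModel/p:RegressionTable", PMML_NS)
    intercept = float(table.get("intercept"))
    terms = [
        (p.get("name"), float(p.get("coefficient")), int(p.get("exponent", "1")))
        for p in table.findall("p:NumericPredictor", PMML_NS)
    ]
    return intercept, terms


def score(model, row):
    """Evaluate intercept + sum(coefficient * x ** exponent) for one row."""
    intercept, terms = model
    return intercept + sum(coef * row[name] ** exp for name, coef, exp in terms)


model = load_regression(PMML_DOC)
# Hypothetical input row, chosen only to exercise the model.
prediction = score(model, {"x1": 1.0, "x2": 2.0, "x3": 3.0})
print(prediction)
```

A model this small is easy to read back; the random forest case described above is where the size and conversion time become the real problem.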