Re: DML transform() function

Deron Eriksson Thu, 10 Dec 2015 09:11:03 -0800

Hi Shirish,

Thank you for the explanation. That clears everything up. I misinterpreted
the meaning of the error message.


I'll add that useful table to the documentation.
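
As a starting point for the docs, here is a minimal sketch of how the table quoted below could be encoded as a lookup and rendered. The names `INVALID_PAIRS`, `cell`, and `render_table` are hypothetical and not part of SystemML; only the invalid pairs themselves are copied from Shirish's table.

```python
# Hypothetical helper, not SystemML code. It encodes the compatibility
# table from the quoted message as a set of unordered invalid pairs.
ABBREVIATIONS = ["OMIT", "MVI", "RCD", "BIN", "DCD", "SCL"]

INVALID_PAIRS = {
    frozenset(("OMIT", "MVI")),
    frozenset(("RCD", "BIN")),
    frozenset(("RCD", "SCL")),
    frozenset(("BIN", "SCL")),
    frozenset(("DCD", "SCL")),
}


def cell(a, b):
    """Return '-', 'x', or '*' for a pair, using the legend from the table."""
    if a == b:
        return "-"  # not applicable: a transformation paired with itself
    if frozenset((a, b)) in INVALID_PAIRS:
        return "x"  # invalid on the same column
    return "*"      # allowed on the same column


def render_table():
    """Render the full table as a list of text rows."""
    rows = ["      " + " ".join(f"{name:<4}" for name in ABBREVIATIONS)]
    for a in ABBREVIATIONS:
        symbols = " ".join(f" {cell(a, b)}  " for b in ABBREVIATIONS)
        rows.append(f"{a:<6}{symbols}".rstrip())
    return rows


if __name__ == "__main__":
    print("\n".join(render_table()))
```

Storing unordered pairs keeps the table symmetric by construction, so a row and its matching column cannot drift apart.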

Deron


On Wed, Dec 9, 2015 at 5:59 PM, Shirish Tatikonda <
shirish.tatiko...@gmail.com> wrote:

> Hi Deron,
>
> As the error says ("A column can not be binned and scaled."), no column can
> be subjected to both *binning* and *scaling*, because the combination does
> not make sense. *Binning* turns a scale column with continuous values into
> a categorical column, whereas *scaling* can only be applied to continuous
> values.
>
> The error does *not* mean that *scaling* is unsupported. We do support
> *scaling*.
>
> At some point, I wanted to add the following table (which is currently
> present in Java code as comments) to our documentation to indicate
> transformations that can be used *simultaneously* on a single column. While
> you are at it, could you make sure it is added to the documentation?
>
> x indicates the combination is invalid.
> * indicates the combination is allowed.
> - indicates the combination is not applicable.
>
>       OMIT MVI  RCD  BIN  DCD  SCL
> OMIT   -    x    *    *    *    *
> MVI    x    -    *    *    *    *
> RCD    *    *    -    x    *    x
> BIN    *    *    x    -    *    x
> DCD    *    *    *    *    -    x
> SCL    *    *    x    x    x    -
>
> OMIT = Missing value handling by *omitting* rows
> MVI  = Missing value handling by *imputation*
> RCD  = Recoding
> BIN  = Binning
> DCD  = Dummycoding
> SCL  = Scaling
>
> Let me know if you have any further questions.
>
> Thank you,
> Shirish
>
>
> On Wed, Dec 9, 2015 at 4:53 PM, Deron Eriksson <deroneriks...@gmail.com>
> wrote:
>
> > Hi,
> >
> > I'm working on updating the online docs for the DML transform() function
> > since a couple things didn't copy over in the conversion to markdown.
> > However, I've run into an issue when I execute the transform() example. In
> > summary, is the "scale" transformation no longer allowed, and "bin" is
> > allowed?
> >
> > I did the following:
> >
> > I created data.csv:
> >
> >
> > zipcode,district,sqft,numbedrooms,numbathrooms,floors,view,saleprice,askingprice
> > 95141,south,3002,6,3,2,FALSE,929,934
> > NA,west,1373,,1,3,FALSE,695,698
> > 91312,south,NA,6,2,2,FALSE,902,
> > 94555,NA,1835,3,,3,,888,892
> > 95141,west,2770,5,2.5,,TRUE,812,816
> > 95141,east,2833,6,2.5,2,TRUE,927,
> > 96334,NA,1339,6,3,1,FALSE,672,675
> > 96334,south,2742,6,2.5,2,FALSE,872,876
> > 96334,north,2195,5,2.5,2,FALSE,799,803
> >
> > I created data.csv.mtd:
> >
> > {
> >     "data_type": "frame",
> >     "format": "csv",
> >     "sep": ",",
> >     "header": true,
> >     "na.strings": [ "NA", "" ]
> > }
> >
> > I created data.spec.json:
> >
> > {
> >     "omit": [ "zipcode" ]
> >    ,"impute":
> >     [ { "name": "district"    , "method": "constant", "value": "south" }
> >      ,{ "name": "numbedrooms" , "method": "constant", "value": 2 }
> >      ,{ "name": "numbathrooms", "method": "constant", "value": 1 }
> >      ,{ "name": "floors"      , "method": "constant", "value": 1 }
> >      ,{ "name": "view"        , "method": "global_mode" }
> >      ,{ "name": "askingprice" , "method": "global_mean" }
> >     ]
> >
> >     ,"recode":
> >     [ "zipcode", "district", "numbedrooms", "numbathrooms", "floors",
> > "view" ]
> >
> >     ,"bin":
> >     [ { "name": "saleprice"  , "method": "equi-width", "numbins": 3 }
> >      ,{ "name": "sqft"       , "method": "equi-width", "numbins": 4 }
> >     ]
> >
> >     ,"dummycode":
> >     [ "district", "numbathrooms", "floors", "view", "saleprice", "sqft" ]
> >
> >     ,"scale":
> >     [ { "name": "sqft", "method": "mean-subtraction" }
> >      ,{ "name": "saleprice", "method": "z-score" }
> >      ,{ "name": "askingprice", "method": "z-score" }
> >     ]
> > }
> >
> > I executed the following DML:
> >
> > D = read("data.csv");
> > tfD = transform(target=D,
> >                 transformSpec="data.spec.json",
> >                 transformPath="example-transform");
> > s = sum(tfD);
> > print("Sum = " + s);
> >
> > This generated the following error:
> >
> > java.lang.IllegalArgumentException: Invalid transformations on column ID 3.
> > A column can not be binned and scaled.
> >
> > So, I removed the "scale" from data.spec.json:
> >
> > {
> >     "omit": [ "zipcode" ]
> >    ,"impute":
> >     [ { "name": "district"    , "method": "constant", "value": "south" }
> >      ,{ "name": "numbedrooms" , "method": "constant", "value": 2 }
> >      ,{ "name": "numbathrooms", "method": "constant", "value": 1 }
> >      ,{ "name": "floors"      , "method": "constant", "value": 1 }
> >      ,{ "name": "view"        , "method": "global_mode" }
> >      ,{ "name": "askingprice" , "method": "global_mean" }
> >     ]
> >
> >     ,"recode":
> >     [ "zipcode", "district", "numbedrooms", "numbathrooms", "floors",
> > "view" ]
> >
> >     ,"bin":
> >     [ { "name": "saleprice"  , "method": "equi-width", "numbins": 3 }
> >      ,{ "name": "sqft"       , "method": "equi-width", "numbins": 4 }
> >     ]
> >
> >     ,"dummycode":
> >     [ "district", "numbathrooms", "floors", "view", "saleprice", "sqft" ]
> >
> > }
> >
> > This generated:
> >
> > java.lang.RuntimeException: Encountered "NA" in column ID "3", when
> > expecting a numeric value. Consider adding "NA" to na.strings, along with
> > an appropriate imputation method.
> >
> > So, I set "sqft" to be "global_mean" in the "impute" section of the spec.
> >
> > {
> >     "omit": [ "zipcode" ]
> >    ,"impute":
> >     [ { "name": "district"    , "method": "constant", "value": "south" }
> >      ,{ "name": "numbedrooms" , "method": "constant", "value": 2 }
> >      ,{ "name": "numbathrooms", "method": "constant", "value": 1 }
> >      ,{ "name": "floors"      , "method": "constant", "value": 1 }
> >      ,{ "name": "view"        , "method": "global_mode" }
> >      ,{ "name": "askingprice" , "method": "global_mean" }
> >      ,{ "name": "sqft"        , "method": "global_mean" }
> >     ]
> >
> >     ,"recode":
> >     [ "zipcode", "district", "numbedrooms", "numbathrooms", "floors",
> > "view" ]
> >
> >     ,"bin":
> >     [ { "name": "saleprice"  , "method": "equi-width", "numbins": 3 }
> >      ,{ "name": "sqft"       , "method": "equi-width", "numbins": 4 }
> >     ]
> >
> >     ,"dummycode":
> >     [ "district", "numbathrooms", "floors", "view", "saleprice", "sqft" ]
> >
> > }
> >
> > This allowed the DML to execute successfully.
> >
> > So, is "scale" not allowed anymore? And "bin" is allowed (despite the
> > message saying it isn't allowed)?
> >
> > Thank you,
> > Deron
> >
>
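
The conflict discussed in this thread can be reproduced outside SystemML with a short check. This is only a sketch: `binned_and_scaled` is a hypothetical helper, the spec is trimmed to the "bin" and "scale" sections of the first data.spec.json above, and the assumption that column IDs are 1-based positions in the CSV header is mine, although it is consistent with the error naming column ID 3 (sqft).

```python
import json

# Header row of data.csv from the thread.
HEADER = (
    "zipcode,district,sqft,numbedrooms,numbathrooms,"
    "floors,view,saleprice,askingprice"
).split(",")

# Only the "bin" and "scale" sections of the first data.spec.json.
SPEC = json.loads("""
{
    "bin":
    [ { "name": "saleprice", "method": "equi-width", "numbins": 3 }
     ,{ "name": "sqft"     , "method": "equi-width", "numbins": 4 }
    ]
   ,"scale":
    [ { "name": "sqft"       , "method": "mean-subtraction" }
     ,{ "name": "saleprice"  , "method": "z-score" }
     ,{ "name": "askingprice", "method": "z-score" }
    ]
}
""")


def binned_and_scaled(spec, header):
    """Return (name, column_id) for columns listed under both bin and scale.

    Column IDs are assumed to be 1-based header positions.
    """
    binned = {entry["name"] for entry in spec.get("bin", [])}
    scaled = {entry["name"] for entry in spec.get("scale", [])}
    conflicts = binned & scaled
    return [(name, i + 1) for i, name in enumerate(header) if name in conflicts]


conflicts = binned_and_scaled(SPEC, HEADER)

if __name__ == "__main__":
    for name, column_id in conflicts:
        print(f"column ID {column_id} ({name}) is both binned and scaled")
```

Under these assumptions the check flags sqft and saleprice, while askingprice is only scaled and therefore fine.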
