Hi Shirish,

Thank you for the explanation. That clears everything up. I misinterpreted the
meaning of the error message. I'll add that useful table to the documentation.

Deron

On Wed, Dec 9, 2015 at 5:59 PM, Shirish Tatikonda <[email protected]> wrote:

> Hi Deron,
>
> As the error said "A column can not be binned and scaled.", no column can
> be subjected to both *binning* and *scaling* because it does not make
> sense. *Binning* turns a scale column with continuous values into a
> categorical column. On the other hand, *Scaling* can only be done on
> continuous values.
>
> The error *does not* mean that *Scaling* is not supported. We do support
> *Scaling*.
>
> At some point, I wanted to add the following table (which is currently
> present in Java code as comments) to our documentation to indicate
> transformations that can be used *simultaneously* on a single column. While
> you are at it, could you make sure it is added to the documentation?
>
> x indicates the combination is invalid.
> * indicates the combination is allowed.
> - indicates the combination is not applicable.
>
>        OMIT  MVI  RCD  BIN  DCD  SCL
> OMIT    -     x    *    *    *    *
> MVI     x     -    *    *    *    *
> RCD     *     *    -    x    *    x
> BIN     *     *    x    -    *    x
> DCD     *     *    *    *    -    x
> SCL     *     *    x    x    x    -
>
> OMIT = Missing value handling by *omitting* rows
> MVI  = Missing value handling by *imputation*
> RCD  = Recoding
> BIN  = Binning
> DCD  = Dummycoding
> SCL  = Scaling
>
> Let me know if you have any further questions.
>
> Thank you,
> Shirish
>
>
> On Wed, Dec 9, 2015 at 4:53 PM, Deron Eriksson <[email protected]>
> wrote:
>
> > Hi,
> >
> > I'm working on updating the online docs for the DML transform() function
> > since a couple things didn't copy over in the conversion to markdown.
> > However, I've run into an issue when I execute the transform() example.
> > In summary, is the "scale" transformation no longer allowed, and "bin" is
> > allowed?
> >
> > I did the following:
> >
> > I created data.csv:
> >
> > zipcode,district,sqft,numbedrooms,numbathrooms,floors,view,saleprice,askingprice
> > 95141,south,3002,6,3,2,FALSE,929,934
> > NA,west,1373,,1,3,FALSE,695,698
> > 91312,south,NA,6,2,2,FALSE,902,
> > 94555,NA,1835,3,,3,,888,892
> > 95141,west,2770,5,2.5,,TRUE,812,816
> > 95141,east,2833,6,2.5,2,TRUE,927,
> > 96334,NA,1339,6,3,1,FALSE,672,675
> > 96334,south,2742,6,2.5,2,FALSE,872,876
> > 96334,north,2195,5,2.5,2,FALSE,799,803
> >
> > I created data.csv.mtd:
> >
> > {
> >     "data_type": "frame",
> >     "format": "csv",
> >     "sep": ",",
> >     "header": true,
> >     "na.strings": [ "NA", "" ]
> > }
> >
> > I created data.spec.json:
> >
> > {
> >     "omit": [ "zipcode" ]
> >    ,"impute":
> >     [ { "name": "district"    , "method": "constant", "value": "south" }
> >      ,{ "name": "numbedrooms" , "method": "constant", "value": 2 }
> >      ,{ "name": "numbathrooms", "method": "constant", "value": 1 }
> >      ,{ "name": "floors"      , "method": "constant", "value": 1 }
> >      ,{ "name": "view"        , "method": "global_mode" }
> >      ,{ "name": "askingprice" , "method": "global_mean" }
> >     ]
> >
> >    ,"recode":
> >     [ "zipcode", "district", "numbedrooms", "numbathrooms", "floors",
> >       "view" ]
> >
> >    ,"bin":
> >     [ { "name": "saleprice"  , "method": "equi-width", "numbins": 3 }
> >      ,{ "name": "sqft"       , "method": "equi-width", "numbins": 4 }
> >     ]
> >
> >    ,"dummycode":
> >     [ "district", "numbathrooms", "floors", "view", "saleprice", "sqft" ]
> >
> >    ,"scale":
> >     [ { "name": "sqft", "method": "mean-subtraction" }
> >      ,{ "name": "saleprice", "method": "z-score" }
> >      ,{ "name": "askingprice", "method": "z-score" }
> >     ]
> > }
> >
> > I executed the following DML:
> >
> > D = read("data.csv");
> > tfD = transform(target=D,
> >                 transformSpec="data.spec.json",
> >                 transformPath="example-transform");
> > s = sum(tfD);
> > print("Sum = " + s);
> >
> > This generated the following error:
> >
> > java.lang.IllegalArgumentException: Invalid transformations on column
> > ID 3. A column can not be binned and scaled.
> >
> > So, I removed the "scale" from data.spec.json:
> >
> > {
> >     "omit": [ "zipcode" ]
> >    ,"impute":
> >     [ { "name": "district"    , "method": "constant", "value": "south" }
> >      ,{ "name": "numbedrooms" , "method": "constant", "value": 2 }
> >      ,{ "name": "numbathrooms", "method": "constant", "value": 1 }
> >      ,{ "name": "floors"      , "method": "constant", "value": 1 }
> >      ,{ "name": "view"        , "method": "global_mode" }
> >      ,{ "name": "askingprice" , "method": "global_mean" }
> >     ]
> >
> >    ,"recode":
> >     [ "zipcode", "district", "numbedrooms", "numbathrooms", "floors",
> >       "view" ]
> >
> >    ,"bin":
> >     [ { "name": "saleprice"  , "method": "equi-width", "numbins": 3 }
> >      ,{ "name": "sqft"       , "method": "equi-width", "numbins": 4 }
> >     ]
> >
> >    ,"dummycode":
> >     [ "district", "numbathrooms", "floors", "view", "saleprice", "sqft" ]
> >
> > }
> >
> > This generated:
> >
> > java.lang.RuntimeException: Encountered "NA" in column ID "3", when
> > expecting a numeric value. Consider adding "NA" to na.strings, along with
> > an appropriate imputation method.
> >
> > So, I set "sqft" to be "global_mean" in the "impute" section of the spec.
> >
> > {
> >     "omit": [ "zipcode" ]
> >    ,"impute":
> >     [ { "name": "district"    , "method": "constant", "value": "south" }
> >      ,{ "name": "numbedrooms" , "method": "constant", "value": 2 }
> >      ,{ "name": "numbathrooms", "method": "constant", "value": 1 }
> >      ,{ "name": "floors"      , "method": "constant", "value": 1 }
> >      ,{ "name": "view"        , "method": "global_mode" }
> >      ,{ "name": "askingprice" , "method": "global_mean" }
> >      ,{ "name": "sqft"        , "method": "global_mean" }
> >     ]
> >
> >    ,"recode":
> >     [ "zipcode", "district", "numbedrooms", "numbathrooms", "floors",
> >       "view" ]
> >
> >    ,"bin":
> >     [ { "name": "saleprice"  , "method": "equi-width", "numbins": 3 }
> >      ,{ "name": "sqft"       , "method": "equi-width", "numbins": 4 }
> >     ]
> >
> >    ,"dummycode":
> >     [ "district", "numbathrooms", "floors", "view", "saleprice", "sqft" ]
> >
> > }
> >
> > This allowed the DML to execute successfully.
> >
> > So, is "scale" not allowed anymore? And "bin" is allowed (despite the
> > message saying it isn't allowed)?
> >
> > Thank you,
> > Deron
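
For anyone who wants to check a spec against Shirish's table before running transform(), here is a minimal Python sketch. It is not SystemML's actual validation code: the names find_conflicts, column_names, INVALID_PAIRS and SPEC_KEY_TO_CODE are made up for illustration, and the mapping from spec keys ("omit", "impute", ...) to the table's abbreviations is an assumption based on the legend in the message above.

```python
# Sketch only: encodes the "x" cells of the compatibility table quoted above.
# All names here are hypothetical helpers, not part of SystemML.

# Pairs of transformations that may not be applied to the same column.
INVALID_PAIRS = {
    frozenset({"OMIT", "MVI"}),
    frozenset({"RCD", "BIN"}),
    frozenset({"RCD", "SCL"}),
    frozenset({"BIN", "SCL"}),
    frozenset({"DCD", "SCL"}),
}

# Assumed mapping from spec keys to the table's abbreviations.
SPEC_KEY_TO_CODE = {
    "omit": "OMIT",
    "impute": "MVI",
    "recode": "RCD",
    "bin": "BIN",
    "dummycode": "DCD",
    "scale": "SCL",
}


def column_names(entries):
    """Spec entries are either bare column names or objects with a "name"."""
    return [e["name"] if isinstance(e, dict) else e for e in entries]


def find_conflicts(spec):
    """Return sorted (column, code_a, code_b) triples for invalid pairs."""
    per_column = {}
    for key, entries in spec.items():
        code = SPEC_KEY_TO_CODE.get(key)
        if code is None:
            continue  # ignore keys the table does not cover
        for name in column_names(entries):
            per_column.setdefault(name, set()).add(code)

    conflicts = []
    for name, codes in per_column.items():
        for pair in INVALID_PAIRS:
            if pair <= codes:  # both members of the pair are requested
                code_a, code_b = sorted(pair)
                conflicts.append((name, code_a, code_b))
    return sorted(conflicts)


if __name__ == "__main__":
    # Trimmed version of the first data.spec.json from the thread.
    spec = {
        "omit": ["zipcode"],
        "recode": ["zipcode", "district"],
        "bin": [{"name": "sqft", "method": "equi-width", "numbins": 4}],
        "dummycode": ["district", "sqft"],
        "scale": [
            {"name": "sqft", "method": "mean-subtraction"},
            {"name": "askingprice", "method": "z-score"},
        ],
    }
    for column, code_a, code_b in find_conflicts(spec):
        print(f"{column}: {code_a} and {code_b} cannot be combined")
```

Run on the trimmed spec, this flags sqft twice: once for BIN with SCL (the combination the error message reported) and once for DCD with SCL, which the table also marks invalid. Dropping "scale" for that column, as in the later specs in the thread, clears both.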
