Hi,

I'm working on updating the online docs for the DML transform() function,
since a couple of things didn't copy over in the conversion to markdown.
However, I've run into an issue when I execute the transform() example. In
summary: is the "scale" transformation no longer allowed, while "bin" still
is?

I did the following:

I created data.csv:

zipcode,district,sqft,numbedrooms,numbathrooms,floors,view,saleprice,askingprice
95141,south,3002,6,3,2,FALSE,929,934
NA,west,1373,,1,3,FALSE,695,698
91312,south,NA,6,2,2,FALSE,902,
94555,NA,1835,3,,3,,888,892
95141,west,2770,5,2.5,,TRUE,812,816
95141,east,2833,6,2.5,2,TRUE,927,
96334,NA,1339,6,3,1,FALSE,672,675
96334,south,2742,6,2.5,2,FALSE,872,876
96334,north,2195,5,2.5,2,FALSE,799,803

I created data.csv.mtd:

{
    "data_type": "frame",
    "format": "csv",
    "sep": ",",
    "header": true,
    "na.strings": [ "NA", "" ]
}

I created data.spec.json:

{
    "omit": [ "zipcode" ]
   ,"impute":
    [ { "name": "district"    , "method": "constant", "value": "south" }
     ,{ "name": "numbedrooms" , "method": "constant", "value": 2 }
     ,{ "name": "numbathrooms", "method": "constant", "value": 1 }
     ,{ "name": "floors"      , "method": "constant", "value": 1 }
     ,{ "name": "view"        , "method": "global_mode" }
     ,{ "name": "askingprice" , "method": "global_mean" }
    ]

    ,"recode":
    [ "zipcode", "district", "numbedrooms", "numbathrooms", "floors", "view" ]

    ,"bin":
    [ { "name": "saleprice"  , "method": "equi-width", "numbins": 3 }
     ,{ "name": "sqft"       , "method": "equi-width", "numbins": 4 }
    ]

    ,"dummycode":
    [ "district", "numbathrooms", "floors", "view", "saleprice", "sqft" ]

    ,"scale":
    [ { "name": "sqft", "method": "mean-subtraction" }
     ,{ "name": "saleprice", "method": "z-score" }
     ,{ "name": "askingprice", "method": "z-score" }
    ]
}

I executed the following DML:

D = read("data.csv");
tfD = transform(target=D,
                transformSpec="data.spec.json",
                transformPath="example-transform");
s = sum(tfD);
print("Sum = " + s);

This generated the following error:

java.lang.IllegalArgumentException: Invalid transformations on column ID 3.
A column can not be binned and scaled.
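
One way to see which columns could trigger this check is to intersect the
names listed under "bin" and "scale". The following is a minimal Python
sketch run outside SystemML. The helper name binned_and_scaled is
hypothetical, and it assumes the check is purely name-based, which has not
been confirmed against the transform() implementation.

```python
import json

# The "bin" and "scale" sections of data.spec.json, copied from the spec above.
SPEC_JSON = """
{
    "bin":
    [ { "name": "saleprice", "method": "equi-width", "numbins": 3 }
     ,{ "name": "sqft"     , "method": "equi-width", "numbins": 4 }
    ]
   ,"scale":
    [ { "name": "sqft"       , "method": "mean-subtraction" }
     ,{ "name": "saleprice"  , "method": "z-score" }
     ,{ "name": "askingprice", "method": "z-score" }
    ]
}
"""


def binned_and_scaled(spec):
    """Return the sorted column names listed under both "bin" and "scale".

    Hypothetical helper: assumes the conflict is decided by name alone.
    """
    binned = {entry["name"] for entry in spec.get("bin", [])}
    scaled = {entry["name"] for entry in spec.get("scale", [])}
    return sorted(binned & scaled)


conflicts = binned_and_scaled(json.loads(SPEC_JSON))
print(conflicts)
```

With this spec, both sqft and saleprice appear in both sections. sqft is the
third column of data.csv, which would match "column ID 3" if column IDs are
1-based (an assumption).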

So, I removed the "scale" section from data.spec.json:

{
    "omit": [ "zipcode" ]
   ,"impute":
    [ { "name": "district"    , "method": "constant", "value": "south" }
     ,{ "name": "numbedrooms" , "method": "constant", "value": 2 }
     ,{ "name": "numbathrooms", "method": "constant", "value": 1 }
     ,{ "name": "floors"      , "method": "constant", "value": 1 }
     ,{ "name": "view"        , "method": "global_mode" }
     ,{ "name": "askingprice" , "method": "global_mean" }
    ]

    ,"recode":
    [ "zipcode", "district", "numbedrooms", "numbathrooms", "floors", "view" ]

    ,"bin":
    [ { "name": "saleprice"  , "method": "equi-width", "numbins": 3 }
     ,{ "name": "sqft"       , "method": "equi-width", "numbins": 4 }
    ]

    ,"dummycode":
    [ "district", "numbathrooms", "floors", "view", "saleprice", "sqft" ]

}

This generated a different error:

java.lang.RuntimeException: Encountered "NA" in column ID "3", when
expecting a numeric value. Consider adding "NA" to na.strings, along with
an appropriate imputation method.
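
To find which columns contain missing values but have no "impute" entry, a
small check outside SystemML can help. This is a hedged Python sketch: the
helper name columns_with_missing is hypothetical, and it assumes that
columns listed under "omit" do not need imputation.

```python
import csv
import io

# data.csv, copied from above.
DATA_CSV = """\
zipcode,district,sqft,numbedrooms,numbathrooms,floors,view,saleprice,askingprice
95141,south,3002,6,3,2,FALSE,929,934
NA,west,1373,,1,3,FALSE,695,698
91312,south,NA,6,2,2,FALSE,902,
94555,NA,1835,3,,3,,888,892
95141,west,2770,5,2.5,,TRUE,812,816
95141,east,2833,6,2.5,2,TRUE,927,
96334,NA,1339,6,3,1,FALSE,672,675
96334,south,2742,6,2.5,2,FALSE,872,876
96334,north,2195,5,2.5,2,FALSE,799,803
"""

NA_STRINGS = {"NA", ""}   # "na.strings" from data.csv.mtd
OMITTED = {"zipcode"}     # "omit" from data.spec.json
IMPUTED = {               # names under "impute" in data.spec.json
    "district", "numbedrooms", "numbathrooms",
    "floors", "view", "askingprice",
}


def columns_with_missing(csv_text, na_strings):
    """Return the set of column names holding at least one NA string."""
    missing = set()
    for row in csv.DictReader(io.StringIO(csv_text)):
        for name, value in row.items():
            if value in na_strings:
                missing.add(name)
    return missing


missing = columns_with_missing(DATA_CSV, NA_STRINGS)
# Columns with missing values that are neither imputed nor omitted.
unhandled = sorted(missing - IMPUTED - OMITTED)
print(unhandled)
```

For this data and spec, the only column left over is sqft, the third column
of data.csv.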

So, I added "sqft" with method "global_mean" to the "impute" section of the
spec:

{
    "omit": [ "zipcode" ]
   ,"impute":
    [ { "name": "district"    , "method": "constant", "value": "south" }
     ,{ "name": "numbedrooms" , "method": "constant", "value": 2 }
     ,{ "name": "numbathrooms", "method": "constant", "value": 1 }
     ,{ "name": "floors"      , "method": "constant", "value": 1 }
     ,{ "name": "view"        , "method": "global_mode" }
     ,{ "name": "askingprice" , "method": "global_mean" }
     ,{ "name": "sqft"        , "method": "global_mean" }
    ]

    ,"recode":
    [ "zipcode", "district", "numbedrooms", "numbathrooms", "floors", "view" ]

    ,"bin":
    [ { "name": "saleprice"  , "method": "equi-width", "numbins": 3 }
     ,{ "name": "sqft"       , "method": "equi-width", "numbins": 4 }
    ]

    ,"dummycode":
    [ "district", "numbathrooms", "floors", "view", "saleprice", "sqft" ]

}

This allowed the DML to execute successfully.
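
For reference, the sketch below illustrates the usual meaning of equi-width
binning, applied to the saleprice column with numbins 3 as in the "bin"
section. It is a generic Python illustration, not SystemML code: the
function name equi_width_bins and the 1-based bin IDs are assumptions, and
the boundary handling in transform() may differ.

```python
def equi_width_bins(values, numbins):
    """Assign each value a 1-based bin ID using equal-width intervals.

    Generic illustration only; not taken from the SystemML implementation.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / numbins
    if width == 0:
        # All values are identical, so everything lands in the first bin.
        return [1] * len(values)
    # The maximum sits on the upper edge, so clamp it into the last bin.
    return [min(int((v - lo) / width) + 1, numbins) for v in values]


# saleprice column of data.csv, in row order.
saleprice = [929, 695, 902, 888, 812, 927, 672, 872, 799]
bin_ids = equi_width_bins(saleprice, 3)
print(bin_ids)
```

Each bin here spans (929 - 672) / 3 of the saleprice range, so the bins are
equally wide but not equally populated.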

So, is "scale" no longer allowed? And is "bin" allowed (despite the first
error message suggesting that it isn't)?

Thank you,
Deron
