Also agree that double-quoted column names are not ideal. In addition to the net-new features described in this thread, it'd be nice to see non-double-quoted output as default behavior in the existing create_indicator_variables() function.
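To make the request concrete, here is a minimal sketch of how a value could be turned into a column name that needs no double quotes. It is not how create_indicator_variables() works today; the helper name `indicator_column_name` and the normalization rules (lowercase, underscores for anything outside `[a-z0-9_]`, truncation at PostgreSQL's default 63-character identifier limit) are assumptions chosen for illustration.

```python
import re

# PostgreSQL truncates identifiers to 63 bytes by default (NAMEDATALEN - 1).
PG_MAX_IDENTIFIER_LEN = 63


def indicator_column_name(column, value):
    """Hypothetical helper: build a name such as 'color_light_blue'
    that can be referenced without double quotes."""
    raw = "{0}_{1}".format(column, value).lower()
    # Collapse every run of characters outside [a-z0-9_] into one underscore.
    name = re.sub(r"[^a-z0-9_]+", "_", raw).strip("_")
    # An unquoted identifier cannot be empty or start with a digit.
    if not name or name[0].isdigit():
        name = "_" + name
    return name[:PG_MAX_IDENTIFIER_LEN]


print(indicator_column_name("Color", "Light Blue"))
print(indicator_column_name("2016", "Q4"))
```

Two caveats with this kind of normalization: distinct values can collapse to the same name (for example "a b" and "a-b"), and the result could still collide with a reserved word, so a real implementation would need a de-duplication step or a mapping table like the one Jarrod suggested.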
Thanks,
Woo

On Fri, Oct 28, 2016 at 1:05 PM, Woo Jae Jung <[email protected]> wrote:

> I like the one-hot encoded feature. Another variant of this idea would be
> an "all other" variable (distinct from the reference class) that contains
> occurrences of the less frequent category types. In both of these
> scenarios, the threshold for 'less frequent' could be user-supplied.
>
> Thanks,
> Woo
>
> On Fri, Oct 28, 2016 at 11:29 AM, Rahul Iyer <[email protected]> wrote:
>
>> An alternative to dropping is to assign the less frequent values to the
>> reference i.e. all one-hot encoded features will be 0.
>> Also important to note: total runtime will increase with this option since
>> we'll have to compute the exact frequency distribution.
>>
>> Another suggested change is to call this function 'one_hot_encoding' since
>> that is the output here (similar to sklearn's OneHotEncoder
>> <http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html>).
>> We can keep the current name as a deprecated alias till 2.0 is released.
>>
>> On Fri, Oct 28, 2016 at 11:17 AM, Frank McQuillan <[email protected]>
>> wrote:
>>
>> > Jarrod,
>> >
>> > Just trying to write up detailed requirements. How would you see this
>> > one working?
>> >
>> > "2) Option to dummy code only the top n most frequently occurring
>> > values in any column"
>> >
>> > With 1 column I can picture it, you would drop the rows with the less
>> > frequently occurring values and end up with a smaller table. But what
>> > if you are encoding multiple rows? Would you want a per row
>> > specification of n? i.e., top 3 values for column x, top 10 values for
>> > column y? If you did this then your result set might include low
>> > frequency values for column x (not in top 3) because they are in the
>> > top 10 for column y - this might be confusing.
>> >
>> > Frank
>> >
>> > On Wed, Oct 19, 2016 at 2:44 PM, Frank McQuillan <[email protected]>
>> > wrote:
>> >
>> >> great, thanks for the additional information
>> >>
>> >> Frank
>> >>
>> >> On Wed, Oct 19, 2016 at 1:57 PM, Jarrod Vawdrey <[email protected]>
>> >> wrote:
>> >>
>> >>> IMO
>> >>>
>> >>> 1) Option to define resulting column names. Please see pdltools
>> >>> implementation - the ability to pass in a function is especially
>> >>> useful
>> >>> (http://pivotalsoftware.github.io/PDLTools/group__grp__pivot01.html)
>> >>> 2) Option to dummy code only the top n most frequently occurring
>> >>> values in any column
>> >>> 3) Option to create numeric column names (E.g. pivotcol_val1,
>> >>> pivotcol_val2 ...) instead of values in column names + secondary
>> >>> mapping table
>> >>> 4) Option to exclude original column from results table
>> >>>
>> >>> (1) & (2) are much higher priority than (3) & (4).
>> >>>
>> >>> Agreed that these could also be applied to Pivoting (especially 1).
>> >>>
>> >>> Jarrod Vawdrey
>> >>> Sr. Data Scientist
>> >>> Data Science & Engineering | Pivotal
>> >>> (650) 315-8905
>> >>> https://pivotal.io/
>> >>>
>> >>> On Wed, Oct 19, 2016 at 4:47 PM, Frank McQuillan
>> >>> <[email protected]> wrote:
>> >>>
>> >>> > Thanks for those suggestions, Jarrod. They all sound pretty
>> >>> > useful - would you mind taking a crack at numbering them
>> >>> > 1,2,3... etc, in the order of priority as you see it?
>> >>> >
>> >>> > Also it seems like some of these could be applied to the Pivot
>> >>> > function as well, e.g., UDF for column naming.
>> >>> >
>> >>> > Frank
>> >>> >
>> >>> > On Fri, Oct 14, 2016 at 1:02 PM, Jarrod Vawdrey
>> >>> > <[email protected]> wrote:
>> >>> >
>> >>> >> Hey Frank,
>> >>> >>
>> >>> >> How are special character values handled today? It is often not
>> >>> >> ideal to end up with column names that require double quotes to
>> >>> >> call due to downstream scripts.
>> >>> >>
>> >>> >> A couple of features that would be useful
>> >>> >>
>> >>> >> * Option to define resulting column names. Please see pdltools
>> >>> >> implementation - the ability to pass in a function is especially
>> >>> >> useful
>> >>> >> (http://pivotalsoftware.github.io/PDLTools/group__grp__pivot01.html)
>> >>> >> * Option to dummy code only the top n most frequently occurring
>> >>> >> values in any column
>> >>> >> * Option to exclude original column from results table
>> >>> >> * Option to create numeric column names (E.g. pivotcol_val1,
>> >>> >> pivotcol_val2 ...) instead of values in column names + secondary
>> >>> >> mapping table
>> >>> >>
>> >>> >> Thank you
>> >>> >>
>> >>> >> Jarrod Vawdrey
>> >>> >> Sr. Data Scientist
>> >>> >> Data Science & Engineering | Pivotal
>> >>> >> (650) 315-8905
>> >>> >> https://pivotal.io/
>> >>> >>
>> >>> >> On Fri, Oct 14, 2016 at 3:35 PM, Frank McQuillan
>> >>> >> <[email protected]> wrote:
>> >>> >>
>> >>> >>> For the module encoding categorical variables
>> >>> >>> http://madlib.incubator.apache.org/docs/latest/group__grp__data__prep.html
>> >>> >>> does anyone have any suggestions on improvements that we could
>> >>> >>> make?
>> >>> >>>
>> >>> >>> Here is a video on how encoding categorical variables works for
>> >>> >>> those not familiar with it
>> >>> >>> https://www.youtube.com/watch?v=zxGgGMGJZRo&index=7&list=PL62pIycqXx-Qf6EXu5FDxUgXW23BHOtcQ
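For reference, the two treatments of less frequent values discussed in the thread (an "all other" indicator versus folding them into the reference class) can be sketched for a single column as follows. This is an illustration only, not MADlib code: the function name `one_hot_top_n` and its parameters `top_n` and `other_label` are hypothetical, and a per-column `top_n` would correspond to calling it once per column.

```python
from collections import Counter


def one_hot_top_n(values, top_n, other_label="all_other"):
    """Hypothetical sketch: one-hot encode only the top_n most frequent values.

    If other_label is a string, less frequent values share one extra
    indicator column. If other_label is None, they fall into the reference
    class, i.e. every indicator in their row is 0.
    """
    # Requires a full pass over the data to get the exact frequencies.
    counts = Counter(values)
    kept = [v for v, _ in counts.most_common(top_n)]
    columns = kept + ([other_label] if other_label is not None else [])
    rows = []
    for v in values:
        key = v if v in kept else other_label
        rows.append({c: int(c == key) for c in columns})
    return columns, rows


data = ["a", "b", "a", "c", "a", "b", "d"]
cols, encoded = one_hot_top_n(data, top_n=2)
print(cols)
print(encoded[3])
```

The sketch assumes `other_label` does not clash with an actual category value. The frequency count is the extra pass over the data that Rahul mentions as the source of added runtime.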
