An alternative to dropping those rows is to assign the less frequent values to the reference category, i.e., all of their one-hot encoded features will be 0. Note that total runtime will increase with this option, since we'll have to compute the exact frequency distribution of each encoded column.
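To make the idea concrete, here is a rough sketch in plain Python, not MADlib code. The function name `one_hot_top_n`, the list-of-dicts table, and the per-column `top_n` mapping are all hypothetical and only for illustration. It keeps every row and sends values outside the top n to the reference (all zeros).

```python
from collections import Counter


def one_hot_top_n(rows, top_n):
    """One-hot encode the n most frequent values of each listed column.

    rows  : list of dicts, a hypothetical in-memory stand-in for a table
    top_n : dict mapping column name -> number of top values to encode
    """
    # Pass 1: exact frequency distribution per column (the extra runtime cost).
    kept = {}
    for col, n in top_n.items():
        counts = Counter(row[col] for row in rows)
        kept[col] = [value for value, _ in counts.most_common(n)]

    # Pass 2: one indicator per kept value. Any other value gets all zeros,
    # i.e. it falls into the reference category, and no row is dropped.
    encoded = []
    for row in rows:
        out = {}
        for col, values in kept.items():
            for value in values:
                out[f"{col}_{value}"] = 1 if row[col] == value else 0
        encoded.append(out)
    return encoded


rows = [
    {"x": "a", "y": "p"},
    {"x": "a", "y": "q"},
    {"x": "b", "y": "q"},
    {"x": "c", "y": "q"},
]
# Top 1 value for column x, top 2 values for column y.
encoded = one_hot_top_n(rows, {"x": 1, "y": 2})
```

Because n is given per column and nothing is dropped, a low-frequency value in x cannot leak into the result just because the same row has a top value in y.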
Another suggested change is to rename this function to 'one_hot_encoding', since one-hot encoding is what it outputs (similar to sklearn's OneHotEncoder <http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html>). We can keep the current name as a deprecated alias until 2.0 is released.

On Fri, Oct 28, 2016 at 11:17 AM, Frank McQuillan <fmcquil...@pivotal.io> wrote:

> Jarrod,
>
> Just trying to write up detailed requirements. How would you see this one
> working?
>
> "2) Option to dummy code only the top n most frequently occurring values in
> any column"
>
> With 1 column I can picture it, you would drop the rows with the less
> frequently occurring values and end up with a smaller table. But what if
> you are encoding multiple rows? Would you want a per row specification
> of n? i.e., top 3 values for column x, top 10 values for column y? If you
> did this then your result set might include low frequency values for column
> x (not in top 3) because they are in the top 10 for column y - this might
> be confusing.
>
> Frank
>
> On Wed, Oct 19, 2016 at 2:44 PM, Frank McQuillan <fmcquil...@pivotal.io>
> wrote:
>
>> great, thanks for the additional information
>>
>> Frank
>>
>> On Wed, Oct 19, 2016 at 1:57 PM, Jarrod Vawdrey <jvawd...@pivotal.io>
>> wrote:
>>
>>> IMO
>>>
>>> 1) Option to define resulting column names. Please see pdltools
>>> implementation - the ability to pass in a function is especially useful
>>> (http://pivotalsoftware.github.io/PDLTools/group__grp__pivot01.html)
>>> 2) Option to dummy code only the top n most frequently occurring values
>>> in any column
>>> 3) Option to create numeric column names (E.g. pivotcol_val1,
>>> pivotcol_val2 ...) instead of values in column names + secondary
>>> mapping table
>>> 4) Option to exclude original column from results table
>>>
>>> (1) & (2) are much higher priority than (3) & (4).
>>>
>>> Agreed that these could also be applied to Pivoting (especially 1).
>>>
>>> Jarrod Vawdrey
>>> Sr. Data Scientist
>>> Data Science & Engineering | Pivotal
>>> (650) 315-8905
>>> https://pivotal.io/
>>>
>>> On Wed, Oct 19, 2016 at 4:47 PM, Frank McQuillan <fmcquil...@pivotal.io>
>>> wrote:
>>>
>>> > Thanks for those suggestions, Jarrod. They all sound pretty useful -
>>> > would you mind taking a crack at numbering them 1,2,3... etc, in the
>>> > order of priority as you see it?
>>> >
>>> > Also it seems like some of these could be applied to the Pivot
>>> > function as well, e.g., UDF for column naming.
>>> >
>>> > Frank
>>> >
>>> > On Fri, Oct 14, 2016 at 1:02 PM, Jarrod Vawdrey <jvawd...@pivotal.io>
>>> > wrote:
>>> >
>>> >> Hey Frank,
>>> >>
>>> >> How are special character values handled today? It is often not ideal
>>> >> to end up with column names that require double quotes to call due to
>>> >> downstream scripts.
>>> >>
>>> >> A couple of features that would be useful
>>> >>
>>> >> * Option to define resulting column names. Please see pdltools
>>> >> implementation - the ability to pass in a function is especially
>>> >> useful
>>> >> (http://pivotalsoftware.github.io/PDLTools/group__grp__pivot01.html)
>>> >> * Option to dummy code only the top n most frequently occurring
>>> >> values in any column
>>> >> * Option to exclude original column from results table
>>> >> * Option to create numeric column names (E.g. pivotcol_val1,
>>> >> pivotcol_val2 ...) instead of values in column names + secondary
>>> >> mapping table
>>> >>
>>> >> Thank you
>>> >>
>>> >> Jarrod Vawdrey
>>> >> Sr. Data Scientist
>>> >> Data Science & Engineering | Pivotal
>>> >> (650) 315-8905
>>> >> https://pivotal.io/
>>> >>
>>> >> On Fri, Oct 14, 2016 at 3:35 PM, Frank McQuillan
>>> >> <fmcquil...@pivotal.io> wrote:
>>> >>
>>> >>> For the module encoding categorical variables
>>> >>> http://madlib.incubator.apache.org/docs/latest/group__grp__data__prep.html
>>> >>> does anyone have any suggestions on improvements that we could make?
>>> >>>
>>> >>> Here is a video on how encoding categorical variables works for
>>> >>> those not familiar with it
>>> >>> https://www.youtube.com/watch?v=zxGgGMGJZRo&index=7&list=PL62pIycqXx-Qf6EXu5FDxUgXW23BHOtcQ