Yes, thanks Vatsan, we have been looking at that.

On Fri, Oct 28, 2016 at 2:39 PM, Srivatsan R <vatsan...@gmail.com> wrote:
> You guys may have already seen this, but linking just in case:
> http://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html
>
> On Fri, Oct 28, 2016 at 1:32 PM, Woo Jae Jung <wj...@pivotal.io> wrote:
>
> > +Vatsan for his thoughts as well!
> >
> > On Fri, Oct 28, 2016 at 1:29 PM, Woo Jae Jung <wj...@pivotal.io> wrote:
> >
> >> Also agree that double-quoted column names are not ideal. In addition to
> >> the net-new features described in this thread, it'd be nice to see
> >> non-double-quoted output as default behavior in the existing
> >> create_indicator_variables() function.
> >>
> >> Thanks,
> >> Woo
> >>
> >> On Fri, Oct 28, 2016 at 1:05 PM, Woo Jae Jung <wj...@pivotal.io> wrote:
> >>
> >>> I like the one-hot encoded feature. Another variant of this idea would
> >>> be an "all other" variable (distinct from the reference class) that
> >>> contains occurrences of the less frequent category types. In both of these
> >>> scenarios, the threshold for 'less frequent' could be user-supplied.
> >>>
> >>> Thanks,
> >>> Woo
> >>>
> >>> On Fri, Oct 28, 2016 at 11:29 AM, Rahul Iyer <rahulri...@gmail.com> wrote:
> >>>
> >>>> An alternative to dropping is to assign the less frequent values to the
> >>>> reference i.e. all one-hot encoded features will be 0.
> >>>> Also important to note: total runtime will increase with this option since
> >>>> we'll have to compute the exact frequency distribution.
> >>>>
> >>>> Another suggested change is to call this function 'one_hot_encoding' since
> >>>> that is the output here (similar to sklearn's OneHotEncoder
> >>>> <http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html>).
> >>>> We can keep the current name as a deprecated alias till 2.0 is released.
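To make the two variants above concrete (less frequent values folded into the reference class, or collected into an "all other" indicator), here is a minimal sketch in plain Python. The helper name `one_hot_top_n` and its parameters `n` and `other_label` are hypothetical, chosen for illustration only; this is not MADlib's implementation or API.

```python
from collections import Counter


def one_hot_top_n(values, n, other_label=None):
    """One-hot encode only the n most frequent values (hypothetical helper).

    Values outside the top n fall into the reference class (all indicators 0)
    unless other_label is given, in which case they set one extra indicator.
    """
    # Computing the exact frequency distribution needs a full pass over the
    # data; this is the extra runtime cost mentioned in the thread.
    counts = Counter(values)
    top = sorted(value for value, _ in counts.most_common(n))

    columns = list(top)
    if other_label is not None:
        columns.append(other_label)

    rows = []
    for value in values:
        row = {column: 0 for column in columns}
        if value in top:
            row[value] = 1
        elif other_label is not None:
            row[other_label] = 1
        # Otherwise the row stays all zeros: the reference class.
        rows.append(row)
    return rows


colors = ["red", "red", "red", "blue", "blue", "green", "teal"]
as_reference = one_hot_top_n(colors, 2)
as_other = one_hot_top_n(colors, 2, other_label="other")
```

With `n=2`, "green" and "teal" become all-zero rows in `as_reference` and set only the "other" indicator in `as_other`. The sketch assumes `other_label` does not collide with a real category value.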
> >>>>
> >>>> On Fri, Oct 28, 2016 at 11:17 AM, Frank McQuillan <fmcquil...@pivotal.io> wrote:
> >>>>
> >>>> > Jarrod,
> >>>> >
> >>>> > Just trying to write up detailed requirements. How would you see this one
> >>>> > working?
> >>>> >
> >>>> > "2) Option to dummy code only the top n most frequently occurring values in
> >>>> > any column"
> >>>> >
> >>>> > With 1 column I can picture it, you would drop the rows with the less
> >>>> > frequently occurring values and end up with a smaller table. But what if
> >>>> > you are encoding multiple columns? Would you want a per-column specification
> >>>> > of n? i.e., top 3 values for column x, top 10 values for column y? If you
> >>>> > did this then your result set might include low frequency values for column
> >>>> > x (not in top 3) because they are in the top 10 for column y - this might
> >>>> > be confusing.
> >>>> >
> >>>> > Frank
> >>>> >
> >>>> > On Wed, Oct 19, 2016 at 2:44 PM, Frank McQuillan <fmcquil...@pivotal.io> wrote:
> >>>> >
> >>>> >> great, thanks for the additional information
> >>>> >>
> >>>> >> Frank
> >>>> >>
> >>>> >> On Wed, Oct 19, 2016 at 1:57 PM, Jarrod Vawdrey <jvawd...@pivotal.io> wrote:
> >>>> >>
> >>>> >>> IMO
> >>>> >>>
> >>>> >>> 1) Option to define resulting column names. Please see pdltools
> >>>> >>> implementation - the ability to pass in a function is especially useful
> >>>> >>> (http://pivotalsoftware.github.io/PDLTools/group__grp__pivot01.html)
> >>>> >>> 2) Option to dummy code only the top n most frequently occurring values
> >>>> >>> in any column
> >>>> >>> 3) Option to create numeric column names (E.g. pivotcol_val1, pivotcol_val2 ...)
> >>>> >>> instead of values in column names + secondary mapping table
> >>>> >>> 4) Option to exclude original column from results table
> >>>> >>>
> >>>> >>> (1) & (2) are much higher priority than (3) & (4).
> >>>> >>>
> >>>> >>> Agreed that these could also be applied to Pivoting (especially 1).
> >>>> >>>
> >>>> >>> Jarrod Vawdrey
> >>>> >>> Sr. Data Scientist
> >>>> >>> Data Science & Engineering | Pivotal
> >>>> >>> (650) 315-8905
> >>>> >>> https://pivotal.io/
> >>>> >>>
> >>>> >>> On Wed, Oct 19, 2016 at 4:47 PM, Frank McQuillan <fmcquil...@pivotal.io> wrote:
> >>>> >>>
> >>>> >>> > Thanks for those suggestions, Jarrod. They all sound pretty useful -
> >>>> >>> > would you mind taking a crack at numbering them 1,2,3... etc, in the
> >>>> >>> > order of priority as you see it?
> >>>> >>> >
> >>>> >>> > Also it seems like some of these could be applied to the Pivot
> >>>> >>> > function as well, e.g., UDF for column naming.
> >>>> >>> >
> >>>> >>> > Frank
> >>>> >>> >
> >>>> >>> > On Fri, Oct 14, 2016 at 1:02 PM, Jarrod Vawdrey <jvawd...@pivotal.io> wrote:
> >>>> >>> >
> >>>> >>> >> Hey Frank,
> >>>> >>> >>
> >>>> >>> >> How are special character values handled today? It is often not ideal to
> >>>> >>> >> end up with column names that require double quotes to call due to
> >>>> >>> >> downstream scripts.
> >>>> >>> >>
> >>>> >>> >> A couple of features that would be useful
> >>>> >>> >>
> >>>> >>> >> * Option to define resulting column names.
> >>>> >>> >> Please see pdltools implementation - the ability to pass in a function
> >>>> >>> >> is especially useful
> >>>> >>> >> (http://pivotalsoftware.github.io/PDLTools/group__grp__pivot01.html)
> >>>> >>> >> * Option to dummy code only the top n most frequently occurring values in
> >>>> >>> >> any column
> >>>> >>> >> * Option to exclude original column from results table
> >>>> >>> >> * Option to create numeric column names (E.g. pivotcol_val1,
> >>>> >>> >> pivotcol_val2 ...) instead of values in column names + secondary mapping
> >>>> >>> >> table
> >>>> >>> >>
> >>>> >>> >> Thank you
> >>>> >>> >>
> >>>> >>> >> Jarrod Vawdrey
> >>>> >>> >> Sr. Data Scientist
> >>>> >>> >> Data Science & Engineering | Pivotal
> >>>> >>> >> (650) 315-8905
> >>>> >>> >> https://pivotal.io/
> >>>> >>> >>
> >>>> >>> >> On Fri, Oct 14, 2016 at 3:35 PM, Frank McQuillan <fmcquil...@pivotal.io> wrote:
> >>>> >>> >>
> >>>> >>> >>> For the module encoding categorical variables
> >>>> >>> >>> http://madlib.incubator.apache.org/docs/latest/group__grp__data__prep.html
> >>>> >>> >>> does anyone have any suggestions on improvements that we could make?
> >>>> >>> >>>
> >>>> >>> >>> Here is a video on how encoding categorical variables works for those not
> >>>> >>> >>> familiar with it
> >>>> >>> >>> https://www.youtube.com/watch?v=zxGgGMGJZRo&index=7&list=PL62pIycqXx-Qf6EXu5FDxUgXW23BHOtcQ
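
The column-naming requests in this thread (names that need no double quotes, and numeric names plus a secondary mapping table) could be sketched roughly as follows. Both function names are hypothetical and not taken from MADlib or PDLTools. The sanitizer assumes PostgreSQL-style rules, where an unquoted identifier starts with a letter or underscore and otherwise contains only lowercase letters, digits and underscores; it does not handle reserved words or length limits.

```python
import re


def sanitized_column_name(column, value):
    """Build a column name that needs no double quotes (hypothetical helper)."""
    raw = f"{column}_{value}".lower()
    # Collapse every run of characters outside [a-z0-9_] into one underscore.
    name = re.sub(r"[^a-z0-9_]+", "_", raw).strip("_")
    # Unquoted identifiers may not be empty or start with a digit.
    if not name or name[0].isdigit():
        name = "_" + name
    return name


def numbered_column_names(column, values):
    """Map numeric names (column_val1, column_val2, ...) to original values.

    The returned dict plays the role of the secondary mapping table.
    """
    distinct = sorted(set(values))
    return {f"{column}_val{i}": value for i, value in enumerate(distinct, start=1)}


quoted_free = sanitized_column_name("color", "Light Blue/Grey")
leading_digit = sanitized_column_name("2016 sales", "N/A")
mapping = numbered_column_names("pivotcol", ["b", "a", "b"])
```

Sanitizing can make distinct values collide (for example "a b" and "a-b" both become "a_b"), which is one reason the numeric names with a mapping table may be the safer option.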