[ https://issues.apache.org/jira/browse/MADLIB-1239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16535237#comment-16535237 ]
Frank McQuillan commented on MADLIB-1239:
-----------------------------------------

Added:
{code}
A summary table named <out_table>_summary is also created at the same time,
which has the following columns:
source_table            TEXT. Source table name.
feature_names           TEXT[]. Array of names of features.
total_rows_processed    INTEGER. Total number of rows processed.
total_rows_skipped      INTEGER. Total number of rows skipped due to failures.
{code}
Since we are writing out a summary table, we may as well add more info to it.

> Columns to Vector
> -----------------
>
>                 Key: MADLIB-1239
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1239
>             Project: Apache MADlib
>          Issue Type: New Feature
>          Components: Module: Utilities
>            Reporter: Frank McQuillan
>            Assignee: Himanshu Pandey
>            Priority: Major
>             Fix For: v1.15
>
>
> related to https://issues.apache.org/jira/browse/MADLIB-1240
>
> Columns to Vector
>
> Converts features from multiple columns of an input table into a feature
> array in a single column.
> This process can be reversed using the function vec2cols.
> {code}
> cols2vec(
>     source_table,
>     out_table,
>     list_of_features,
>     list_of_features_to_exclude,
>     cols_to_output
> )
>
> source_table
> TEXT. Name of the table containing the source data.
>
> out_table
> TEXT. Name of the generated table containing the output. If a table with the
> same name already exists, an error will be returned.
>
> list_of_features
> TEXT. Comma-separated string of column names or expressions to put into
> feature array. Can also be a '*' implying all columns are to be put into
> feature array (except for the ones included in the next argument that lists
> exclusions). Array columns in the source table are not supported in the
> 'list_of_features'.
> PostgreSQL arrays only allow elements of the same type. If multiple numeric
> types are present in the 'list_of_features', they will be cast to the largest
> type. For example, if there are INTEGER and DOUBLE PRECISION columns in the
> feature list, the feature array will be of type DOUBLE PRECISION[]. Invalid
> combinations like TEXT and INTEGER will result in an error.
>
> list_of_features_to_exclude (optional)
> TEXT, default NULL. Comma-separated string of column names to exclude from
> the feature array. Use only when 'list_of_features' is '*'.
>
> cols_to_output (optional)
> TEXT, default NULL. Comma-separated string of column names from the source
> table to keep in the output table, in addition to the feature array. To keep
> all columns from the source table, use '*'.
>
> Output
> The output table produced by the cols2vec function contains the following
> columns:
>
> <...>
> Columns from source table, depending on which ones are kept (if any).
>
> feature_vector
> Array of features. Array type will depend on feature type in the source
> table.
>
> A summary table named <out_table>_summary is also created at the same time,
> which has the following columns:
> source_table            TEXT. Source table name.
> feature_names           TEXT[]. Array of names of features.
> total_rows_processed    INTEGER. Total number of rows processed.
> total_rows_skipped      INTEGER. Total number of rows skipped due to
>                         failures.
> {code}
>
> Notes
>
> (1)
> The function
> http://pivotalsoftware.github.io/PDLTools/group__grp__array__utilities.html#cols2vec_example
> is similar but the proposed MADlib one has more options. To do the
> equivalent of the PDL Tools one in MADlib, you would do:
> {code}
> cols2vec(
>     table_name,
>     output_table,
>     '*',
>     exclude_columns,
>     '*'
> )
> {code}
>
> (2)
> Please put the feature vector on the right side of the output table, i.e., it
> will be the last column on the right.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
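
For illustration only, a call following the proposed signature might look like the sketch below. The table `golf_sample`, its columns, and the output table name are hypothetical, and the `madlib` schema prefix assumes the function is installed in a schema of that name; none of these come from the issue itself.

```sql
-- Hypothetical source table, used only to illustrate the proposed call.
DROP TABLE IF EXISTS golf_sample, golf_sample_vec, golf_sample_vec_summary;
CREATE TABLE golf_sample (
    id          INTEGER,
    temperature DOUBLE PRECISION,
    humidity    INTEGER,
    class       TEXT
);
INSERT INTO golf_sample VALUES
    (1, 85.0, 85, 'no'),
    (2, 72.5, 90, 'yes');

-- Assumption: cols2vec is installed in a schema named "madlib".
SELECT madlib.cols2vec(
    'golf_sample',      -- source_table
    'golf_sample_vec',  -- out_table (must not already exist)
    '*',                -- list_of_features: all columns ...
    'id, class',        -- ... except those listed here
    'id'                -- cols_to_output: kept alongside the feature array
);

-- Inspect the output table and the summary table described above.
SELECT * FROM golf_sample_vec;
SELECT * FROM golf_sample_vec_summary;
```

Going by the description above, the remaining features (`temperature` and `humidity`) mix DOUBLE PRECISION and INTEGER, so `feature_vector` would be expected to come out as DOUBLE PRECISION[], placed as the last column after `id`.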