[ https://issues.apache.org/jira/browse/MADLIB-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16575589#comment-16575589 ]
Frank McQuillan commented on MADLIB-1270:
-----------------------------------------

Here is one possible approach:
{code}
IF feature_names is specified THEN
    make the number of columns equal to the size of the feature_names array
ELSE
    make the number of columns equal to the max size of the array in vector_col
{code}
In both cases, some of the generated rows may have NULLs if the array for a particular row is smaller than the number of columns.

> Unexpected behavior in vec2cols function
> ----------------------------------------
>
>                 Key: MADLIB-1270
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1270
>             Project: Apache MADlib
>          Issue Type: Bug
>          Components: Module: Utilities
>            Reporter: Rashmi Raghu
>            Priority: Minor
>             Fix For: v1.15.1
>
>
> There is some unexpected behavior when the vector column to be split contains
> different numbers of elements in the vectors. E.g.
> Input table:
> select * from test order by id;
>  id |    t
> ----+---------
>   1 | {a,b}
>   2 | {c,d}
>   3 | {e,f}
>   4 | {g,h,i}
>   5 | {j}
> (5 rows)
>
> select madlib.vec2cols('test','test_out_5','t',array['c1','c2','c3'],'id');
> ERROR: plpy.Error: vec2cols: Mismatch between size of vector_col and number
> of cols in feature_names.
> CONTEXT: Traceback (most recent call last):
>   PL/Python function "vec2cols", line 23, in <module>
>     return vec2cols_obj.vec2cols(**globals())
>   PL/Python function "vec2cols", line 149, in vec2cols
>   PL/Python function "vec2cols", line 112, in get_names_for_split_output_cols
>   PL/Python function "vec2cols", line 77, in _assert
> PL/Python function "vec2cols"
>
> select madlib.vec2cols('test','test_out_5','t',array['c1','c2'],'id');
>  vec2cols
> ----------
>
> (1 row)
>
> select * from test_out_5 order by id;
>  id | c1 | c2
> ----+----+----
>   1 | a  | b
>   2 | c  | d
>   3 | e  | f
>   4 | g  | h
>   5 | j  |
> (5 rows)
>
>
> select madlib.vec2cols('test','test_out_6','t',array['c1'],'id');
> ERROR: plpy.Error: vec2cols: Mismatch between size of vector_col and number
> of cols in feature_names.
> CONTEXT: Traceback (most recent call last):
>   PL/Python function "vec2cols", line 23, in <module>
>     return vec2cols_obj.vec2cols(**globals())
>   PL/Python function "vec2cols", line 149, in vec2cols
>   PL/Python function "vec2cols", line 112, in get_names_for_split_output_cols
>   PL/Python function "vec2cols", line 77, in _assert
> PL/Python function "vec2cols"
>
> --- Update -----
> There are a couple of decisions to be made regarding supporting arrays of
> different lengths:
> - If we choose the array with maximal length in the vector_col, what do we do
>   if the user's passed-in feature_names does not have the same number of
>   elements?
> - What are the performance issues with looking through our vector_col for the
>   array with maximal length?
> - How will we handle default feature names: will we create a feature name for
>   every element of the longest array entry?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
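
To make the column-count rule proposed in the comment concrete, here is a minimal Python sketch that works on in-memory rows. It is not the MADlib implementation: the function name split_vectors, the default column names f1, f2, ..., and the choice to truncate arrays longer than feature_names (which mirrors the c1/c2 output in the report) are all assumptions made for illustration.

```python
def split_vectors(rows, feature_names=None):
    """Split (id, vector) pairs into fixed-width rows.

    Hypothetical helper, not part of MADlib. Returns (column_names, out_rows).
    """
    if feature_names:
        # Rule 1: the width is the size of the feature_names array.
        names = list(feature_names)
    else:
        # Rule 2: the width is the longest array found in the vector column.
        width = max((len(vec) for _, vec in rows), default=0)
        # Assumed default naming scheme: f1, f2, ..., fN.
        names = ["f%d" % (i + 1) for i in range(width)]

    n = len(names)
    out_rows = []
    for row_id, vec in rows:
        cells = list(vec[:n])                # assumption: drop extra elements
        cells += [None] * (n - len(cells))   # pad short arrays with NULLs
        out_rows.append((row_id, *cells))
    return names, out_rows


# The input table from the report.
test = [(1, ["a", "b"]), (2, ["c", "d"]), (3, ["e", "f"]),
        (4, ["g", "h", "i"]), (5, ["j"])]

names, out = split_vectors(test, ["c1", "c2", "c3"])
for row in out:
    print(row)
```

Under this rule all three calls from the report would succeed: three names give three columns with NULL padding, one name gives a single column, and omitting feature_names gives as many columns as the longest array. Finding that maximum needs one extra pass over vector_col; in SQL it could presumably be a single aggregate such as max(array_upper(t, 1)), though its cost on large tables is not measured here.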