[ https://issues.apache.org/jira/browse/SPARK-17498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15483514#comment-15483514 ]
Vincent edited comment on SPARK-17498 at 9/12/16 8:55 AM:
----------------------------------------------------------

Here is how we (cc [~qhuang]) see this issue; please correct me if I have misunderstood, [~miro.balaz].

    import org.apache.spark.ml.feature.StringIndexer
    import spark.implicits._  // needed for toDF; already in scope in spark-shell

    // fit() needs a DataFrame with a named input column, not a bare RDD;
    // the column names "id", "label" and "labelIndex" are illustrative
    val df = sc.parallelize(Seq((0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c")), 2).toDF("id", "label")
    val indexer = new StringIndexer().setInputCol("label").setOutputCol("labelIndex").fit(df)

When transform is called on a new DataFrame with unseen labels, say,

    val dfNew = sc.parallelize(Seq((0, "a"), (1, "b"), (2, "e"), (3, "d")), 2).toDF("id", "label")
    indexer.transform(dfNew)

it should return 3, 4 for the labels "d", "e" instead of skipping/deleting the new incoming labels, and IndexToString should return NaN for these added indexes 3, 4.

[~yanboliang] [~srowen] [~josephkb] what do you think of this issue? Currently StringIndexer can either skip the unseen labels or throw an error in such a case. Do you think we should add the proposed 'new' option as another way of handling them?
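
To make the proposal concrete, here is a rough plain-Scala model of the behaviour described above. It does not use Spark, and it is not how StringIndexer is implemented: the object and method names are hypothetical. It assumes that unseen labels get their new indices in alphabetical order (so that "d" gets 3 and "e" gets 4, as in the comment), that frequency ties are broken alphabetically, and it models the "no label for this index" case as None rather than NaN.

```scala
// Plain-Scala model of the proposed 'new' handling; no Spark dependency.
// All names below are hypothetical and exist only for illustration.
object ProposedNewHandling {

  /** Mimics the fitted label order: the most frequent label gets index 0.
    * Breaking ties alphabetically is an assumption of this sketch. */
  def fit(labels: Seq[String]): Vector[String] =
    labels.groupBy(identity).toVector
      .sortBy { case (label, occurrences) => (-occurrences.size, label) }
      .map(_._1)

  /** Proposed handling: append unseen labels after the known ones.
    * Sorting the unseen labels is an assumption, chosen to match the comment. */
  def extend(known: Vector[String], incoming: Seq[String]): Vector[String] =
    known ++ incoming.distinct.filterNot(l => known.contains(l)).sorted

  /** Index every incoming label against the extended vocabulary. */
  def transform(known: Vector[String], incoming: Seq[String]): Seq[Double] = {
    val vocab = extend(known, incoming)
    incoming.map(l => vocab.indexOf(l).toDouble)
  }

  /** Inverse mapping: indices past the fitted labels have no known label. */
  def indexToString(known: Vector[String], index: Double): Option[String] =
    known.lift(index.toInt)

  def main(args: Array[String]): Unit = {
    val known = fit(Seq("a", "b", "c", "a", "a", "c"))   // a, c, b
    val incoming = Seq("a", "b", "e", "d")
    println(transform(known, incoming))                  // a -> 0, b -> 2, e -> 4, d -> 3
    println(indexToString(known, 3.0))                   // no label: index 3 was added later
  }
}
```

Note that the issue description itself reads slightly differently: it maps every unseen label to a single index, the maximum known index + 1. In this model that would mean appending one shared placeholder entry instead of one entry per unseen label.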

> StringIndexer.setHandleInvalid should have another option 'new'
> ---------------------------------------------------------------
>
>                 Key: SPARK-17498
>                 URL: https://issues.apache.org/jira/browse/SPARK-17498
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>            Reporter: Miroslav Balaz
>
> That will map an unseen label to the maximum known label index + 1; IndexToString would map
> that index back to "<undef>", or to NA if there is something like that in Spark.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)