[ https://issues.apache.org/jira/browse/SPARK-21199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16063416#comment-16063416 ]
Franklyn Dsouza edited comment on SPARK-21199 at 6/26/17 5:16 PM:
------------------------------------------------------------------

For this particular scenario I have a table with two columns: one is a string, `document_type`, and the other is an array of tokens for the document. I want to do a TF-IDF on these tokens. The IDF needs to be done per `document_type`, so I pivot on `document_type` and then do the IDF on the TF vectors. This pivoting introduces nulls for missing columns, and those nulls need to be imputed. I can't impute array types either, and fixing it at the token generation step would involve a lot of left joins to align various data sources.

was (Author: franklyndsouza):
For this particular scenario I have a table with two columns: one is a string, `document_type`, and the other is an array of tokens for the document. I want to do a TF-IDF on these tokens. The IDF needs to be done per `document_type`, so I do pivot on `document_type` and then do the IDF on the TF vectors. This pivoting introduces nulls for missing columns, and those nulls need to be imputed. I can't impute array types either, and fixing it at the token generation step would involve a lot of left joins to align various data sources.

> Its not possible to impute Vector types
> ---------------------------------------
>
>                 Key: SPARK-21199
>                 URL: https://issues.apache.org/jira/browse/SPARK-21199
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.0.0, 2.1.1
>            Reporter: Franklyn Dsouza
>
> There are cases where nulls end up in vector columns in dataframes. Currently
> there is no way to fill in these nulls because it's not possible to create a
> literal vector column expression using lit().
> Also, the entire pyspark ml api fails when it encounters nulls, which
> makes it hard to work with the data.
> I think that either vector support should be added to the imputer or vectors
> should be supported in column expressions so they can be used in a coalesce.
> [~mlnick]
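
To make the scenario in the comment concrete, here is a minimal sketch in plain Python (deliberately not the Spark API) of how pivoting on `document_type` leaves null cells, and of what a coalesce-style fill with a zero vector would do. Everything named here is an assumption for illustration only: the `doc_id` grouping key, the `NUM_FEATURES` size, the toy hashing scheme, and the helper functions are hypothetical and do not come from the report.

```python
from collections import defaultdict

NUM_FEATURES = 8  # hypothetical size of the hashed TF vector


def term_frequencies(tokens, num_features=NUM_FEATURES):
    """Hash tokens into a fixed-size count vector (a toy stand-in for a TF vector)."""
    vec = [0] * num_features
    for tok in tokens:
        # Deterministic toy hash: sum of the token's UTF-8 bytes.
        vec[sum(tok.encode("utf-8")) % num_features] += 1
    return vec


def pivot_by_type(rows, doc_types):
    """Pivot (doc_id, document_type, tokens) rows into one row per doc_id
    with one TF-vector column per document type. Cells with no data stay None."""
    table = defaultdict(lambda: {t: None for t in doc_types})
    for doc_id, doc_type, tokens in rows:
        table[doc_id][doc_type] = term_frequencies(tokens)
    return dict(table)


def fill_missing(table, num_features=NUM_FEATURES):
    """Replace every None cell with an all-zero vector, which is the effect
    a coalesce(column, zero_vector) expression would have."""
    return {
        doc_id: {
            t: (vec if vec is not None else [0] * num_features)
            for t, vec in cols.items()
        }
        for doc_id, cols in table.items()
    }


# Hypothetical sample data: document 2 has no "memo" tokens.
rows = [
    (1, "email", ["hello", "world"]),
    (1, "memo", ["hello"]),
    (2, "email", ["spark"]),
]
pivoted = pivot_by_type(rows, ["email", "memo"])
filled = fill_missing(pivoted)
```

In this sketch `pivoted[2]["memo"]` is `None` after the pivot, while `filled[2]["memo"]` is a zero vector that a per-type IDF step could consume. Whether a zero vector is the right imputation value depends on the downstream model, so treat that choice as an assumption as well.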