Etienne Soulard-Geoffrion created SPARK-48685: -------------------------------------------------
             Summary: PySpark MinHashLSH when used with CountVectorizer doesn't meet requirements
                 Key: SPARK-48685
                 URL: https://issues.apache.org/jira/browse/SPARK-48685
             Project: Spark
          Issue Type: Bug
          Components: ML
    Affects Versions: 3.5.1
            Reporter: Etienne Soulard-Geoffrion
             Fix For: 3.5.1

I'm facing an issue with the MinHashLSH model: it complains that some rows contain only zero values, even though I filter those rows out before fitting the model. Here is sample code to demonstrate, using PySpark:

```python
from pyspark.ml.feature import CountVectorizer, MinHashLSH, NGram, Tokenizer
from pyspark.ml.linalg import SparseVector
from pyspark.sql import functions as F
from pyspark.sql import types


@F.udf(returnType=types.BooleanType())
def is_non_zero_vector(vector: SparseVector) -> bool:
    """Return True if the vector has at least one non-zero element."""
    return vector.numNonzeros() > 0


df_text = df.select("id", "text")

tokenizer = Tokenizer(inputCol="text", outputCol="text_tokens")
df_text = tokenizer.transform(df_text).select("id", "text_tokens")

ngram = NGram(inputCol="text_tokens", outputCol="text_ngrams", n=self.min_hash_lsh_ngram_size)
df_text = ngram.transform(df_text).select("id", "text_ngrams")

count_vectorizer = CountVectorizer(inputCol="text_ngrams", outputCol="text_count_vector").fit(df_text)
df_text = count_vectorizer.transform(df_text).select("id", "text_count_vector")

# Keep only the non-zero vectors
df_text = df_text.filter(is_non_zero_vector(F.col("text_count_vector")))

min_hash_lsh = MinHashLSH(
    inputCol="text_count_vector",
    outputCol="text_hashes",
    seed=self.min_hash_lsh_num_hash_tables,
    numHashTables=self.min_hash_lsh_num_hash_tables,
).fit(df_text)
df_text = min_hash_lsh.transform(df_text)

# Join the dataset with itself within a 0.6 distance threshold, then drop
# exact self-matches (distance == 0) to keep only the near-duplicate pairs
pairs_df = min_hash_lsh.approxSimilarityJoin(df_text, df_text, 0.6, distCol="vector_distance")
pairs_df = pairs_df.filter("vector_distance != 0")
```

I've also analyzed the dataframe: there are in fact no rows without at least one non-zero index.
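For reference, the check the UDF above performs can be illustrated without a Spark session. This is a minimal sketch using a plain dict (index -> value) as a stand-in for a SparseVector; the helper names are hypothetical, but the semantics mirror `SparseVector.numNonzeros()`, which counts stored values that are actually non-zero:

```python
def num_nonzeros(sparse: dict) -> int:
    """Count entries whose value is non-zero, like SparseVector.numNonzeros().

    Note that an index can be *stored* with an explicit 0.0 value; such
    entries do not count, which is why the filter must use numNonzeros()
    rather than just checking that the index list is non-empty.
    """
    return sum(1 for value in sparse.values() if value != 0)


def is_non_zero_vector(sparse: dict) -> bool:
    """Return True if the vector has at least one non-zero element."""
    return num_nonzeros(sparse) > 0


# Three stand-in vectors: a genuinely non-zero one, an empty one,
# and one that stores an index but with an explicit zero value.
vectors = [{0: 2, 3: 1}, {}, {1: 0}]
kept = [v for v in vectors if is_non_zero_vector(v)]
# Only the first vector survives the filter.
```

MinHashLSH rejects all-zero vectors because an empty set has no MinHash signature, so any row slipping past this filter (or re-materializing as all-zero when the plan is re-evaluated) would trigger the error described above.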
--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org