[ 
https://issues.apache.org/jira/browse/SPARK-48685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-48685:
----------------------------------
    Fix Version/s:     (was: 3.5.1)

> PySpark MinHashLSH when used with CountVectorizer doesn't meet requirements
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-48685
>                 URL: https://issues.apache.org/jira/browse/SPARK-48685
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 3.5.1
>            Reporter: Etienne Soulard-Geoffrion
>            Priority: Major
>
> I'm facing an issue when trying to use the MinHashLSH model: it complains 
> that some rows contain only zero values, even though I filter out all-zero 
> vectors before fitting the model. Here is sample code to demonstrate, using 
> PySpark:
> ```python
> from pyspark.sql import functions as F, types
> from pyspark.ml.linalg import SparseVector
> from pyspark.ml.feature import Tokenizer, NGram, CountVectorizer, MinHashLSH
>
> @F.udf(returnType=types.BooleanType())
> def is_non_zero_vector(vector: SparseVector) -> bool:
>     """Returns True if the vector has at least one non-zero element."""
>     return vector.numNonzeros() > 0
>
> df_text = df.select("id", "text")
> tokenizer = Tokenizer(inputCol="text", outputCol="text_tokens")
> df_text = tokenizer.transform(df_text).select("id", "text_tokens")
> ngram = NGram(inputCol="text_tokens", outputCol="text_ngrams",
>               n=self.min_hash_lsh_ngram_size)
> df_text = ngram.transform(df_text).select("id", "text_ngrams")
> count_vectorizer = CountVectorizer(inputCol="text_ngrams",
>                                    outputCol="text_count_vector").fit(df_text)
> df_text = count_vectorizer.transform(df_text).select("id", "text_count_vector")
> # Keep only the non-zero vectors
> df_text = df_text.filter(is_non_zero_vector(F.col("text_count_vector")))
> min_hash_lsh = MinHashLSH(
>     inputCol="text_count_vector",
>     outputCol="text_hashes",
>     seed=self.min_hash_lsh_num_hash_tables,
>     numHashTables=self.min_hash_lsh_num_hash_tables,
> ).fit(df_text)
> df_text = min_hash_lsh.transform(df_text)
> # Calculate the distance between all pairs of vectors and keep only the
> # pairs with a distance > 0 (that are duplicates)
> pairs_df = min_hash_lsh.approxSimilarityJoin(df_text, df_text, 0.6,
>                                              distCol="vector_distance")
> pairs_df = pairs_df.filter("vector_distance != 0")
> ```
> I've also analyzed the DataFrame, and there are in fact no rows without at 
> least one non-zero index.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
