[jira] [Created] (SPARK-48685) PySpark MinHashLSH when used with CountVectorizer doesn't meet requirements

Etienne Soulard-Geoffrion (Jira) Fri, 21 Jun 2024 14:41:04 -0700

Etienne Soulard-Geoffrion created SPARK-48685:
-------------------------------------------------


             Summary: PySpark MinHashLSH when used with CountVectorizer doesn't meet requirements
                 Key: SPARK-48685
                 URL: https://issues.apache.org/jira/browse/SPARK-48685
             Project: Spark
          Issue Type: Bug
          Components: ML
    Affects Versions: 3.5.1
            Reporter: Etienne Soulard-Geoffrion
             Fix For: 3.5.1


I'm facing an issue with the MinHashLSH model: it complains that some rows
contain only zero values, even though I filter out all-zero vectors before
using the model. Here is sample PySpark code that demonstrates the problem:


```python
from pyspark.ml.feature import CountVectorizer, MinHashLSH, NGram, Tokenizer
from pyspark.ml.linalg import SparseVector
from pyspark.sql import functions as F, types


@F.udf(returnType=types.BooleanType())
def is_non_zero_vector(vector: SparseVector) -> bool:
    """
    Returns True if the vector has at least one non zero element
    """
    return vector.numNonzeros() > 0


# Note: `df` and the `self.*` settings are defined outside this snippet.
df_text = df.select("id", "text")

# Split the text into tokens
tokenizer = Tokenizer(inputCol="text", outputCol="text_tokens")
df_text = tokenizer.transform(df_text).select("id", "text_tokens")

# Build n-grams from the tokens
ngram = NGram(
    inputCol="text_tokens",
    outputCol="text_ngrams",
    n=self.min_hash_lsh_ngram_size,
)
df_text = ngram.transform(df_text).select("id", "text_ngrams")

# Turn the n-grams into sparse count vectors
count_vectorizer = CountVectorizer(
    inputCol="text_ngrams",
    outputCol="text_count_vector",
).fit(df_text)
df_text = count_vectorizer.transform(df_text).select("id", "text_count_vector")

# Keep only the non zero vectors
df_text = df_text.filter(is_non_zero_vector(F.col("text_count_vector")))

min_hash_lsh = MinHashLSH(
    inputCol="text_count_vector",
    outputCol="text_hashes",
    seed=self.min_hash_lsh_num_hash_tables,
    numHashTables=self.min_hash_lsh_num_hash_tables,
).fit(df_text)
df_text = min_hash_lsh.transform(df_text)

# Calculate the distance between all pairs of vectors and keep only the pairs
# with a distance > 0 (that are duplicates)
pairs_df = min_hash_lsh.approxSimilarityJoin(
    df_text, df_text, 0.6, distCol="vector_distance"
)
pairs_df = pairs_df.filter("vector_distance != 0")
```

I've also analyzed the dataframe, and there are in fact no rows without at
least one non-zero index.
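
For context on the requirement itself: a MinHash signature is the minimum of a
hash taken over the vector's non-zero indices, so it is undefined for an
all-zero vector. The sketch below is a simplified pure-Python illustration of
that requirement, not Spark's implementation. The names `HASH_PRIME`,
`make_hash_params` and `min_hash`, the choice of prime, and the exact hash form
are assumptions made for the example.

```python
import random

# Assumption: any large prime works for this illustration.
HASH_PRIME = 2038074743


def make_hash_params(num_hash_tables, seed):
    """Draw one random (a, b) coefficient pair per hash table."""
    rng = random.Random(seed)
    return [
        (rng.randrange(1, HASH_PRIME), rng.randrange(0, HASH_PRIME))
        for _ in range(num_hash_tables)
    ]


def min_hash(nonzero_indices, params):
    """Return one min-hash value per (a, b) pair.

    Each value is a minimum over the non-zero indices, so an empty
    index list has no valid signature and is rejected up front.
    """
    if not nonzero_indices:
        raise ValueError("Must have at least 1 non zero entry.")
    return [
        min(((1 + i) * a + b) % HASH_PRIME for i in nonzero_indices)
        for a, b in params
    ]
```

Calling `min_hash([0, 3, 7], make_hash_params(3, seed=42))` returns three hash
values, while `min_hash([], ...)` raises. The all-zero check therefore has to
hold for every row that actually reaches the hashing step.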



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
