[ 
https://issues.apache.org/jira/browse/SPARK-19771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15889180#comment-15889180
 ] 

Mingjie Tang commented on SPARK-19771:
--------------------------------------

If we follow the AND-OR framework, one possible optimization is the following:

First, suppose we use BucketedRandomProjectionLSH with setNumHashTables(3) 
and setNumHashFunctions(5).

For one input vector in sparse format, e.g. [(6,[0,1,2],[1.0,1.0,1.0])], 
its mapped hash vectors are
WrappedArray([-1.0,-1.0,-1.0,0.0,0.0], [-1.0,0.0,0.0,-1.0,-1.0], 
[-1.0,-1.0,-1.0,-1.0,0.0])
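The hashing step above can be sketched in plain Python. This is a minimal stand-in for what BucketedRandomProjectionLSH computes, assuming the usual random-projection scheme where each hash value is floor(x·v / bucketLength) for a random unit vector v; the function name, bucket length, and seed here are illustrative, not Spark's API.

```python
import math
import random

def hash_vectors(x, num_hash_tables, num_hash_functions, bucket_length, seed):
    """Sketch of random-projection LSH: one signature (list of hash values)
    per hash table, each value = floor(dot(x, v) / bucket_length) for a
    random unit projection vector v. Not Spark's implementation."""
    rng = random.Random(seed)
    tables = []
    for _ in range(num_hash_tables):
        signature = []
        for _ in range(num_hash_functions):
            v = [rng.gauss(0.0, 1.0) for _ in x]          # random direction
            norm = math.sqrt(sum(c * c for c in v))        # normalize to unit length
            dot = sum(a * b / norm for a, b in zip(x, v))
            signature.append(math.floor(dot / bucket_length))
        tables.append(signature)
    return tables

# dense form of the sparse vector (6,[0,1,2],[1.0,1.0,1.0])
x = [1.0, 1.0, 1.0, 0.0, 0.0, 0.0]
hashes = hash_vectors(x, num_hash_tables=3, num_hash_functions=5,
                      bucket_length=2.0, seed=42)
print(hashes)  # 3 signatures of 5 hash values each
```

With numHashTables=3 and numHashFunctions=5 this yields three length-5 signatures, matching the shape of the WrappedArray above (the exact values depend on the random projections).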

For the similarity-join, we map this computed vector into
+--------------------+-----+--------------------+
|            datasetA|entry|           hashValue|
+--------------------+-----+--------------------+
|[0,(6,[0,1,2],[1....|    0|[-1.0,-1.0,-1.0,0...|
|[0,(6,[0,1,2],[1....|    1|[-1.0,0.0,0.0,-1....|
|[0,(6,[0,1,2],[1....|    2|[-1.0,-1.0,-1.0,-...|
+--------------------+-----+--------------------+
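The exploded layout above (one row per hash table, keyed by the table index) can be reproduced with a simple flat-map; the row key "a" and the list structure here are illustrative:

```python
# One input vector paired with its three hash-table signatures; the "entry"
# column is the hash-table index, mirroring the exploded DataFrame above.
dataset_a = [("a", [[-1.0, -1.0, -1.0, 0.0, 0.0],
                    [-1.0, 0.0, 0.0, -1.0, -1.0],
                    [-1.0, -1.0, -1.0, -1.0, 0.0]])]

exploded = [(key, entry, hash_value)
            for key, tables in dataset_a
            for entry, hash_value in enumerate(tables)]

for row in exploded:
    print(row)
```

Joining two such exploded tables on (entry, hashValue) implements the OR step across tables, while each hashValue equality is the AND step within a table.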

Then, based on the AND-OR principle, we use the entry and hashValue columns 
for an equi-join with the other table. Looking at the generated SQL, it needs 
to iterate through the hashValue vector to perform the equality join, so the 
computation and memory cost are very high. For example, if we apply a nested 
loop to compare two vectors of length NumHashFunctions, the computation cost 
is (NumHashFunctions * NumHashFunctions) and the memory overhead is 
(N * NumHashFunctions). 

Therefore, we can apply an optimization here: we do not need to store the 
full hash vector for each hash table, such as [-1.0,-1.0,-1.0,0.0,0.0]; 
instead, we can compress it with a simple secondary hash function. 

Suppose h_l = (h_0, h_1, ..., h_{k-1}), where k is the number of hash 
functions set by setNumHashFunctions. 
Then the mapped hash value is 
     H(h_l) = (sum_{i=0}^{k-1} h_i * r_i) mod prime,
where each r_i is a random integer and the prime can be set to 2^32 - 5 to 
reduce hash collisions.
Then, we only need to store the single hash value H(h_l) for the AND operation.  
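A minimal sketch of this compression, under the stated assumptions (random integer coefficients r_i shared across all rows, prime = 2^32 - 5); the function and variable names are illustrative:

```python
import random

PRIME = 2**32 - 5  # prime suggested in the text to reduce hash collisions

def compress_hash(signature, coeffs, prime=PRIME):
    """Compress one hash table's signature (h_0..h_{k-1}) into a single
    integer H = (sum h_i * r_i) mod prime, so the AND step becomes one
    scalar equality check instead of a length-k vector comparison."""
    return sum(int(h_i) * r_i for h_i, r_i in zip(signature, coeffs)) % prime

rng = random.Random(0)
k = 5
coeffs = [rng.randrange(1, PRIME) for _ in range(k)]  # shared random r_i

sig_a = [-1.0, -1.0, -1.0, 0.0, 0.0]
sig_b = [-1.0, -1.0, -1.0, 0.0, 0.0]   # identical signature
sig_c = [-1.0, 0.0, 0.0, -1.0, -1.0]   # different signature

print(compress_hash(sig_a, coeffs) == compress_hash(sig_b, coeffs))  # True
print(compress_hash(sig_a, coeffs) == compress_hash(sig_c, coeffs))  # almost surely False
```

Equal signatures always compress to the same value, so no candidate pairs are lost; distinct signatures collide only with probability about 1/prime, which is the trade-off accepted for O(1) comparison and O(N) storage per hash table.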

A similar approach can also be applied to approxNearestNeighbors 
search. 

If the multi-probe approach does not need to read the individual hash values 
of one hash table (e.g., h_l mentioned above), we can apply this H(h_l) to the 
preprocessed data to save memory. I will double-check the multi-probe approach 
and see whether it works. Then, I can submit a patch to improve the current 
implementation. Note that this hash-function trick to avoid vector-to-vector 
comparison is widely used, for example in the LSH implementation by Andoni. 
 
[~diefunction] [~mlnick] 


> Support OR-AND amplification in Locality Sensitive Hashing (LSH)
> ----------------------------------------------------------------
>
>                 Key: SPARK-19771
>                 URL: https://issues.apache.org/jira/browse/SPARK-19771
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 2.1.0
>            Reporter: Yun Ni
>
> The current LSH implementation only supports AND-OR amplification. We need to 
> discuss the following questions before we go to implementation:
> (1) Whether we should support OR-AND amplification
> (2) What API changes we need for OR-AND amplification
> (3) How we fix the approxNearestNeighbor and approxSimilarityJoin internally.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
