Hi,  

I don’t really see how this could be mapped into a graph-related problem. I 
would first try optimizing the original Python-based code to see if there is an 
obvious bottleneck somewhere.

If your S1 and S2 datasets are fully available before you run your algorithm 
and only a small fraction of the items are expected to have a non-empty 
intersection, one possible trick that you can use to speed things up is to 
create an “index map” for S1 and S2. For S1 (or S2), the index will be a 
dictionary that maps each element X that appears anywhere within S1 (or S2) to 
the indices of the rows that contain the element X in S1. Such indices can 
easily be constructed using a defaultdict(list) and a single pass through S1 or 
S2. Once you have the index maps for S1 and S2, you can realize that now it is 
enough to do an iteration as follows:

initialize an empty score dictionary
for every item X in the index of S1:
    if X is not a key in the index of S2:
        continue
    get the indices of the rows containing X in S1 from the index of S1
    get the indices of the rows containing X in S2 from the index of S2

    for each pair of the rows found in the previous two lines:
        if the row pair is already in the score dictionary, continue
        otherwise calculate the score of the row pair and store it in the score 
dictionary
         
The above idea basically ensures that disjoint row pairs are never intersected 
explicitly. The expense of this idea is that you have to create the index first 
(which costs time and space) and you also have to keep track of the intersected 
row pairs in the score cache (which also costs time and space) to ensure that 
you do not intersect rows twice or more (if they have more than one common 
item).

--  
T.


On Sunday, 10 November 2013 at 13:06, Message wrote:

> Hi All,  
>  
> I've been trying to write some logic in Python to correlate two sets of 
> simple data. For example:  
>  
> 1) set 1 (S1)  
> apple, 5, 150
> apple, 3, 150
> orange, 8, 200
> orange, 5, 150
>  
> 2) set 2 (S2)  
> apple, 8, 200
> apple, 5, 150
> orange, 8, 100
> orange, 3, 150
>  
>  
> In the above the following should match up :  
> S1- row 1 = S2 row 2  (100% match)
> S1- row 2 = S2 row 4  (66% match)
> S1- row 3 = S2 row 1  (66% match)
> S1- row 4 = S2 row 3  (33% match)
>  
>  
> I've written logic to do this, but it scales badly on larger sets of data due 
> to the number of comparisons I need to make.
> I confess I'm not familiar with Graph Theory, but I get the general feeling 
> it may be possible to emulate this type of correlation functionality.
>  
> I'd be very grateful if somebody could give me some initial feedback / 
> overview if this would be possible - and could igraph help?
>  
> Thanks for any feedback. All the best,  
> Marc  
>  
> _______________________________________________
> igraph-help mailing list
> [email protected] (mailto:[email protected])
> https://lists.nongnu.org/mailman/listinfo/igraph-help
>  
>  


_______________________________________________
igraph-help mailing list
[email protected]
https://lists.nongnu.org/mailman/listinfo/igraph-help

Reply via email to