* Arnaud Delobelle, on 21.09.2010 11:13:
On Sep 21, 7:19 am, "Alf P. Steinbach /Usenet"<alf.p.steinbach
+use...@gmail.com>  wrote:
* Alf P. Steinbach /Usenet, on 21.09.2010 01:09:





* Astley Le Jasper, on 20.09.2010 23:42:
I have a list of tuples that indicate a relationship, ie a is related
to b, b is related to c etc etc. What I want to do is cluster these
relationships into groups. An item will only be associated with a
single cluster.

Before I started, I wondered if there was any particular tool within
Python I should be looking at. I don't expect anyone to code this for
me, just say ... "you need to look at using x". I was going to use
populate a dictionary and

Sorry for being so vague.

Example Data:
[(a,b)
(a,c)
(a,d)
(b,c)
(b,d)
(c,d)
(e,f)
(e,g)
(f,g)
(h,i)]

Output (grouping id, item ref)
(1,a),
(1,b),
(1,c),
(1,d),
(2,e),
(2,f),
(2,g),
(3,h),
(3,i)

It seems to be the same problem as "equivalence sets".

This problem was solved early on because e.g. Fortran compilers had to construct
such sets (equivalence partitions of a set).

I though I'd just say "Google it!", because I know there's a standard algorithm
but I can't recall the name. However, googling it I found no mention of that
algorithm. Not even in the Wikipedia articles on equivalence sets.

A number of approaches spring to mind:

Approach 1:
Multi-pass. Originally you assign a distinct set number to each symbol.
In each pass through the symbols you replace one number with another as
per one of the equivalence specs. Stop when no replacements are done.

Approach 2:
Merging. For each new equivalence A=B you check whether A has been assigned
to a set, if not you create a new one, call that S1, and ditto for B, S2.
Merge S1 and S2, e.g. move all elements of S2 to S1.

Approach 3:
In a first pass convert the data to more explicit form, linking each symbol
to the symbols it's directly equivalent to. Then in second pass simply drag
out each equivalence group (recurse on each symbol's list of equivalences).

Approaches 1 and 2 seem to be pretty inefficient for a large number of symbols,
but I think approach 1 may be a practical option for a small number of symbols.

Uhm, thinking about it (it must have been my unconscious mind doing the work, it
just popped into my head), if you first sort each individual equivalence
relation so that you never have e.g. C=A but only A=C and so on, and then sort
the list of equivalences, then it should reduce to walking the list and just
starting a new set whenever a symbol is encountered that isn't yet in a set.

<code language="Py3">
class Attributes: pass

sym = Attributes()
for name in ("a", "b", "c", "d", "e", "f", "g", "h", "i"):
      setattr( sym, name, name )

eq_specs = [
      (sym.a, sym.d),
      (sym.b, sym.a),
      (sym.b, sym.c),
      (sym.b, sym.d),
      (sym.c, sym.d),
      (sym.c, sym.a),
      (sym.e, sym.f),
      (sym.e, sym.g),
      (sym.f, sym.g),
      (sym.h, sym.i),
      ]

equalities = []
for eq in eq_specs:
      sorted_eq = eq if eq[0]<= eq[1] else (eq[1], eq[0])
      equalities.append( sorted_eq )
equalities.sort()

eq_sets = {}
eq_set_ids = {}
current_eq_set_id = 0
for eq in equalities:

This would make the body of the loop easier to read:

   for x, y in equalities:

      if eq_set_ids.get( eq[0] ) is None:

Why not simply:

        if eq[0] in eq_set_ids:

          current_eq_set_id += 1
          eq_set_ids[eq[0]] = current_eq_set_id
          eq_sets[current_eq_set_id] = set( eq[0] )
      eq_set_id = eq_set_ids[eq[0]]
      eq_set_ids[eq[1]] = eq_set_id
      eq_sets[eq_set_id].add( eq[1] )

for eq_set_id in eq_sets.keys():
      for sym_name in eq_sets[eq_set_id]:
          print( "{}, {}".format( eq_set_id, sym_name ) )
</code>

<output>
1, a
1, c
1, b
1, d
2, e
2, g
2, f
3, i
3, h
</output>

Disclaimer: for me it's pretty late in day/night.

Cheers&  hth.,

- Alf

--
blog at<url:http://alfps.wordpress.com>

I think this won't work with the following graph:

eq_specs = [('a', 'c'), ('b', 'd'), ('c', 'd')]

Note that it is already sorted according to your sorting method.  I
don't have a Python machine to check this but I believe it will
output:

1, a
1, c
1, d
2, b

The flaw is that you fail to consider that the two vertices in the
current pair may already be part of a (partial) connected component.

Yeah, thanks.

I think the three general methods I listed will nicely work, though.

Moral: don't post insights from dream-world... ;-)


Cheers,

- Alf

--
blog at <url: http://alfps.wordpress.com>
--
http://mail.python.org/mailman/listinfo/python-list

Reply via email to