Re: Pimp my algorithm: finding repeating subsequences

Tyler Perkins Sat, 19 Feb 2011 11:05:31 -0800

Interesting. This problem is like compression, but the first thing I
thought of was pattern recognition. If your tokens are just characters
in a string, you can let regular expressions do all the heavy lifting:


(defn rep-groups [s] (re-seq #"(.+)\1+|." s))

Then (rep-groups ".,ababcababc.,cc,.abab.") for example, gives

(["." nil] ["," nil] ["ababcababc" "ababc"] ["." nil] ["," nil] ["cc"
"c"] ["," nil] ["." nil] ["abab" "ab"] ["." nil])

Now we just need to convert each of those match vectors into the form
you want, like this:

(map (fn [[mtch subst]] (if subst
                            [(/ (count mtch) (count subst)) subst]
                            mtch))
     *1)    ;; Where *1 is the match results, above.

Gives

("." "," [2 "ababc"] "." "," [2 "c"] "," "." [2 "ab"] ".")

Cool! Pretty close to what you need. But you wanted to use arbitrary
tokens, like :a, :b, etc. So let's just represent each token by a
unique character. (There are LOTS of characters available.) Here's a
function that creates two 1-to-1 maps -- one that converts tokens to
characters, and the other its inverse, converting characters to their
tokens:

(defn token-to-char-map-and-inverse [token-collection]
  (let [tokens (seq (set token-collection))
        chars  (map char (range (count tokens)))]
    [(zipmap tokens chars) (zipmap chars tokens)]))

We'll need a presentation function like the (fn ...) above. Since it
will also convert the characters back to tokens, it will also depend
upon the char-to-token map. So the following function takes such a map
and returns such a function:

(defn count-subst-fn [c-to-t]
  (fn [[mtch subst]]
    (if subst
      [(/ (count mtch) (count subst)) (map c-to-t subst)]
      (apply c-to-t mtch))))

Finally, put it all together:

(defn parse-reps [token-seq]
  (let
      [[t-to-c c-to-t] (token-to-char-map-and-inverse token-seq)
       s               (apply str (map t-to-c token-seq))]
    (map (count-subst-fn c-to-t) (rep-groups s))))

For example, (parse-reps [:_ :- :a :b :c :a :b :c :a :a :_]) gives

(:_ :- [2 (:a :b :c)] [2 (:a)] :_)

Notice that the input itself is used to define the set of tokens we're
using. We could even us a string as input. (parse-reps
".,ababcababc.,cc,.abab.") gives

(\. \, [2 (\a \b \a \b \c)] \. \, [2 (\c)] \, \. [2 (\a \b)] \.)

These four (defn ...) provide a simple solution to your problem. Using
regular expressions also yields a nice separation of concerns, I
think, and it allows for experimentation with the pattern. As defined
above, (parse-reps [:a :b :a :b :c :a :b :a :b :c]) gives

([2 (:a :b :a :b :c)])

But let's try redefining the regular expression (note the "?"):

(defn rep-groups [s] (re-seq #"(.+?)\1+|." s))

Then the same (parse-reps [:a :b :a :b :c :a :b :a :b :c]) gives the
less greedy result,

([2 (:a :b)] :c [2 (:a :b)] :c)

-- 
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to [email protected]
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
[email protected]
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en

Re: Pimp my algorithm: finding repeating subsequences

Reply via email to