On 20/01/12 18:45, Gwern Branwen wrote:
Recently I wanted to sort through a large folder of varied files and
figure out what is a 'natural' folder to split out, where natural
means something like>4 files with the same prefix.


My idea for an algorithm would be: build a trie for the input strings, and then look for the deepest subtries with more than one child.

For example, a trie containing the strings
  chorus-kiminoshiranaimonogatari.ogg
  chorus-mrmusic.ogg
  choucho-lastnightgoodnight.ogg

looks like:
 <root>  (3 items)
  c       (3 items)
   h       (3 items)
    o       (3 items)
     r       (2 items)
      u       (2 items)
       s       (2 items)
        -       (2 items)
         k       (1 item)
          i       (1 item)
           minoshiranaimonogatari.ogg
         m       (1 item)
          r       (1 item)
           music.ogg
     u       (1 item)
      c       (1 item)
       ho-lastnightgoodnight.ogg
Where actually the lines with more than one character are also subtrees of subtrees of subtrees.


Here is some example code (untested):


import qualified Data.Map as Map

-- A trie datatype
data Trie a = Trie { numLeafs, numDescendant :: !Int
                   , children :: Map.Map a (Trie a) }

-- The empty trie
empty :: Trie a
empty = Trie 0 0 Map.empty

-- A trie that contains a single string
singleton :: Ord a => [a] -> Trie a
singleton [] = Trie 1 1 Map.empty
singleton (x:xs) = Trie 0 1 (Map.singleton x (singleton xs)

-- Merge two tries
merge :: Ord a => Trie a -> Trie a -> Trie a
merge (Trie l d c) (Trie l' d' c')
    = Trie (l+l') (d+d') (Map.unionWith merge c c')

fromList :: Ord a => [[a]] -> Trie a
fromList = foldr merge empty . map singleton

toList :: Ord a => Trie a -> [[a]]
toList (Trie l _ c)
    = replicate l []
    ++ [ x:xs | (x,t) <- Map.toList c, xs <- toList t ]

data CommonPrefix a = Prefix { prefix :: [a], names :: Trie a }

atLeastThisManyDescendants :: Int -> Trie a -> [CommonPrefix a]
atLeastThisManyDescendants minD trie@(Trie l d t)
    | d < minD = []
    | null forChildren = [Prefix [] trie]
    | otherwise = forChildren
  where
    forChildren = [ Prefix (x:pfx) names
                  | (x,t) <- Map.toList c
                  , Prefix pfx names <- atLeastThisManyDescendants n t ]



Twan

_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Reply via email to