Re: [Haskell-cafe] Re: Optimizing spelling correction program
Kamil Dworakowski wrote: On Jun 22, 10:03 am, Eugene Kirpichov wrote: Hey, you're using String I/O! nWORDS <- fmap (train . map B.pack . words) (readFile "big.txt") This should be WORDS <- fmap (train . B.words) (B.readFile "big.txt") By the way, which exact file do you use as a misspellings file? The corpus linked to at Norvig's page has many. And do you have a driver program that I could run and obtain your timings? Yep, Don pointed that out and I have changed the program accordingly. It didn't make any difference though. The time spent on building the dictionary is a small portion of the overall run time. Please see the repository contents for the current version of the program: http://patch-tag.com/r/spellcorrect/snapshot/current/content/pretty The eval-bytestring.hs there is the program I used for timing. Inside of it you will find the name of the misspellings file needed. Thanks all for the suggestions. I'll try them when I get home tonight. Another suggestion, is that you should try to make sure that the lists constructed by getCommonSpellingMistakesWithCorrections get optimized away. As written I'm not sure there will be sufficient inlining to ensure that. If you want to be explicit about removing them, something like the following should help: > module Main where > import Prelude hiding (words) > import SpellingCorrection > import qualified Data.ByteString.Char8 as B > import Data.Char > import Data.IORef > import Control.Monad (forM_) > > main = do > corrector <- getCorrector > misspell_corpus <- B.readFile "FAWTHROP1DAT.643" > n <- newIORef (0::Int) > wrong <- newIORef (0::Int) > forM_ (B.lines misspell_corpus) $ \line -> do > modifyIORef' n (1+) > let [ms,c] = map (B.map toLower) . B.words $ line > if corrector ms /= c > then modifyIORef' wrong (1+) > else return () > accuracy <- do > n' <- readIORef n > w' <- readIORef wrong > return $! 100 * fromIntegral w' / fromIntegral n' > putStrLn $ "accuracy" ++ show (100 - accuracy) ++ "%" > > modifyIORef':: IORef a -> (a -> a) -> IO () > modifyIORef' r f = readIORef r >>= (writeIORef r $!) . f > {-# INLINE modifyIORef' #-} -- Live well, ~wren ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Re: Optimizing spelling correction program
Hello Kamil, Tuesday, June 23, 2009, 11:17:43 AM, you wrote: >> One easy way to fix the GC time is to increase the default heap size. >> >> ./a.out +RTS -A200M > It does make the GC only 1.4% of run time but it increases it overall > by 14s. not surprising - you lose L2 cache locality. try to use -A size that is equal to your L2 cache size -- Best regards, Bulatmailto:bulat.zigans...@gmail.com ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Re: Optimizing spelling correction program
kamil: > On Jun 22, 9:10 am, Ketil Malde wrote: > > Kamil Dworakowski writes: > > > Right... Python uses hashtables while here I have a tree with log n > > > access time. I did not want to use the Data.HashTable, it would > > > pervade my program with IO. The alternative is an ideal hashmap that never > > > gets changed. This program creates a dictionary at start which then is > > > only > > > being used to read from: an ideal application for the Data.PerfectHash > > > by Mark Wotton available on Hackage [3]. > > > > If you are considering alternative data structures, you might want to > > look at tries or Bloom filters, both have O(n) lookup, both have > > Haskell implementations. The latter is probably faster but > > probabilistic (i.e. it will occasionally fail to detect a > > misspelling - which you can of course check against a "real" > > dictionary). > > Using Bryan O'Sullivan's fantastic BloomFilter I got it down below > Python's run time! Now it is 35.56s, 28% of the time is spent on GC, > which I think means there is still some room for improvement. One easy way to fix the GC time is to increase the default heap size. ./a.out +RTS -A200M for example. ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Re: Optimizing spelling correction program
Am Montag 22 Juni 2009 22:54:49 schrieb Kamil Dworakowski: > Wait! Have you typed that definition into the msg off the top of your > head? :) No, took a bit of looking. > > I went back to using Strings instead of ByteStrings and with that > hashtable the program finishes in 31.5s! w00t! Nice :D ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Re: Optimizing spelling correction program
Hello Kamil, Tuesday, June 23, 2009, 12:54:49 AM, you wrote: > I went back to using Strings instead of ByteStrings and with that > hashtable the program finishes in 31.5s! w00t! and GC times are? also, try ByteString+HT, it should be pretty easy to write hashByteString -- Best regards, Bulatmailto:bulat.zigans...@gmail.com ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Re: Optimizing spelling correction program
Am Montag 22 Juni 2009 21:31:50 schrieb Kamil Dworakowski: > On Jun 22, 6:46 am, Bulat Ziganshin wrote: > > Hello Kamil, > > > > Monday, June 22, 2009, 12:01:40 AM, you wrote: > > > Right... Python uses hashtables while here I have a tree with log n > > > > you can try this pure hashtable approach: > > > > import Prelude hiding (lookup) > > import qualified Data.HashTable > > import Data.Array > > import qualified Data.List as List > > > > data HT a b = HT (a->Int) (Array Int [(a,b)]) > > > > -- size is the size of array (we implent closed hash) > > -- hash is the hash function (a->Int) > > -- list is assoclist of items to put in hash > > create size hash list = HT hashfunc > > (accumArray (flip (:)) > > [] > > (0, arrsize-1) > > (map (\(a,b) -> (hashfunc a,b)) Typo: should be map (\(a,b) -> (hashfunc a, (a,b)) > > list) ) > > > > where arrsize = head$ filter (>size)$ iterate (\x->3*x+1) 1 > > hashfunc a = hash a `mod` arrsize > > > > lookup a (HT hash arr) = List.lookup a (arr!hash a) > > > > main = do let assoclist = [("one", 1), ("two", 2), ("three", 3)] > > hash = create 10 (fromEnum . Data.HashTable.hashString) > > assoclist print (lookup "one" hash) > > print (lookup "zero" hash) > > It does not compile: > > No instance for (Num (String, b)) > arising from the literal `3' at foo.hs:23:61 > Possible fix: add an instance declaration for (Num (String, b)) > In the expression: 3 > In the expression: ("three", 3) > In the expression: [("one", 1), ("two", 2), ("three", 3)] > ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe