Re: [Haskell-cafe] Re: Optimizing spelling correction program

2009-06-24 Thread wren ng thornton

Kamil Dworakowski wrote:

On Jun 22, 10:03 am, Eugene Kirpichov  wrote:

Hey, you're using String I/O!

nWORDS <- fmap (train . map B.pack . words) (readFile "big.txt")

This should be

WORDS <- fmap (train . B.words) (B.readFile "big.txt")

By the way, which exact file do you use as a misspellings file? The
corpus linked to at Norvig's page has many.
And do you have a driver program that I could run and obtain your timings?


Yep, Don pointed that out and I have changed the program accordingly.
It didn't make any difference though. The time spent on building the
dictionary is a small portion of the overall run time.

Please see the repository contents for the current version of the
program:
http://patch-tag.com/r/spellcorrect/snapshot/current/content/pretty

The eval-bytestring.hs there is the program I used for timing. Inside
of it you will find the name of the misspellings file needed.

Thanks all for the suggestions. I'll try them when I get home tonight.



Another suggestion, is that you should try to make sure that the lists 
constructed by getCommonSpellingMistakesWithCorrections get optimized 
away. As written I'm not sure there will be sufficient inlining to 
ensure that. If you want to be explicit about removing them, something 
like the following should help:


> module Main where
> import Prelude hiding (words)
> import SpellingCorrection
> import qualified Data.ByteString.Char8 as B
> import Data.Char
> import Data.IORef
> import Control.Monad (forM_)
>
> main = do
> corrector   <- getCorrector
> misspell_corpus <- B.readFile "FAWTHROP1DAT.643"
> n   <- newIORef (0::Int)
> wrong   <- newIORef (0::Int)
> forM_ (B.lines misspell_corpus) $ \line -> do
> modifyIORef' n (1+)
> let [ms,c] = map (B.map toLower) . B.words $ line
> if corrector ms /= c
> then modifyIORef' wrong (1+)
> else return ()
> accuracy <- do
> n' <- readIORef n
> w' <- readIORef wrong
> return $! 100 * fromIntegral w' / fromIntegral n'
> putStrLn $ "accuracy" ++ show (100 - accuracy) ++ "%"
>
> modifyIORef':: IORef a -> (a -> a) -> IO ()
> modifyIORef' r f = readIORef r >>= (writeIORef r $!) . f
> {-# INLINE modifyIORef' #-}

--
Live well,
~wren
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Re: Optimizing spelling correction program

2009-06-23 Thread Bulat Ziganshin
Hello Kamil,

Tuesday, June 23, 2009, 11:17:43 AM, you wrote:
>> One easy way to fix the GC time is to increase the default heap size.
>>
>>  ./a.out +RTS -A200M

> It does make the GC only 1.4% of run time but it  increases it overall
> by 14s.

not surprising - you lose L2 cache locality. try to use -A size that
is equal to your L2 cache size


-- 
Best regards,
 Bulatmailto:bulat.zigans...@gmail.com

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Re: Optimizing spelling correction program

2009-06-22 Thread Don Stewart
kamil:
> On Jun 22, 9:10 am, Ketil Malde  wrote:
> > Kamil Dworakowski  writes:
> > > Right... Python uses hashtables while here I have a tree with log n
> > > access time. I did not want to use the Data.HashTable, it would
> > > pervade my program with IO. The alternative is an ideal hashmap that never
> > > gets changed. This program creates a dictionary at start which then is 
> > > only
> > > being used to read from: an ideal application for the Data.PerfectHash
> > > by Mark Wotton available on Hackage [3].
> >
> > If you are considering alternative data structures, you might want to
> > look at tries or Bloom filters, both have O(n) lookup, both have
> > Haskell implementations.  The latter is probably faster but
> > probabilistic (i.e. it will occasionally fail to detect a
> > misspelling - which you can of course check against a "real"
> > dictionary).
> 
> Using Bryan O'Sullivan's fantastic BloomFilter I got it down below
> Python's run time! Now it is 35.56s, 28% of the time is spent on GC,
> which I think means there is still some room for improvement.

One easy way to fix the GC time is to increase the default heap size.

 ./a.out +RTS -A200M 

for example.
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Re: Optimizing spelling correction program

2009-06-22 Thread Daniel Fischer
Am Montag 22 Juni 2009 22:54:49 schrieb Kamil Dworakowski:
> Wait! Have you typed that definition into the msg off the top of your
> head? :)

No, took a bit of looking.

>
> I went back to using Strings instead of ByteStrings and with that
> hashtable the program finishes in 31.5s! w00t!

Nice :D

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Re: Optimizing spelling correction program

2009-06-22 Thread Bulat Ziganshin
Hello Kamil,

Tuesday, June 23, 2009, 12:54:49 AM, you wrote:

> I went back to using Strings instead of ByteStrings and with that
> hashtable the program finishes in 31.5s! w00t!

and GC times are? also, try ByteString+HT, it should be pretty easy to
write hashByteString


-- 
Best regards,
 Bulatmailto:bulat.zigans...@gmail.com

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Re: Optimizing spelling correction program

2009-06-22 Thread Daniel Fischer
Am Montag 22 Juni 2009 21:31:50 schrieb Kamil Dworakowski:
> On Jun 22, 6:46 am, Bulat Ziganshin  wrote:
> > Hello Kamil,
> >
> > Monday, June 22, 2009, 12:01:40 AM, you wrote:
> > > Right... Python uses hashtables while here I have a tree with log n
> >
> > you can try this pure hashtable approach:
> >
> > import Prelude hiding (lookup)
> > import qualified Data.HashTable
> > import Data.Array
> > import qualified Data.List as List
> >
> > data HT a b = HT (a->Int) (Array Int [(a,b)])
> >
> > -- size is the size of array (we implent closed hash)
> > -- hash is the hash function (a->Int)
> > -- list is assoclist of items to put in hash
> > create size hash list = HT hashfunc
> >                            (accumArray (flip (:))
> >                                        []
> >                                        (0, arrsize-1)
> >                                        (map (\(a,b) -> (hashfunc a,b))

Typo: should be

map (\(a,b) -> (hashfunc a, (a,b))


> > list) )
> >
> >   where arrsize     =  head$ filter (>size)$ iterate (\x->3*x+1) 1
> >         hashfunc a  =  hash a `mod` arrsize
> >
> > lookup a (HT hash arr) = List.lookup a (arr!hash a)
> >
> > main = do let assoclist = [("one", 1), ("two", 2), ("three", 3)]
> >               hash = create 10 (fromEnum . Data.HashTable.hashString)
> > assoclist print (lookup "one" hash)
> >           print (lookup "zero" hash)
>
> It does not compile:
>
> No instance for (Num (String, b))
>   arising from the literal `3' at foo.hs:23:61
> Possible fix: add an instance declaration for (Num (String, b))
> In the expression: 3
> In the expression: ("three", 3)
> In the expression: [("one", 1), ("two", 2), ("three", 3)]
>


___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe