[Haskell-cafe] Bloom Filter

2007-04-30 Thread tom

Hi all,

I'm pretty new to Haskell, I've been working on a Bloom filter[1]
implementation as a learning exercise.

I'd really appreciate it if someone more experienced would comment on
the code. I'm sure there's plenty of places where I'm doing things in
silly or overly complex ways.

I've packaged it all with Cabal because I wanted to see how that all
worked, I don't think it's ready for release yet! I would like to at
some point get it polished up and release it if anyone thinks it might
be useful.

You can download it at:

http://www.almostobsolete.net/bloom-0.0.tar.gz

I've also put the Haddock docs for it online at:

http://www.almostobsolete.net/doc/html/Data-yBloom.html

All comments will be very much appreciated :p

Thanks

Tom

[1] There's a nice description of what a Bloom filter is here:
http://www.cs.wisc.edu/~cao/papers/summary-cache/node8.html
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


[Haskell-cafe] Bloom Filter

2007-05-01 Thread Dom
> 
> Reminds me of this code from Data.Binary:
> 
> unroll :: Integer -> [Word8]
> unroll = unfoldr step
>   where
> step 0 = Nothing
> step i = Just (fromIntegral i, i `shiftR` 8)
> 
> roll :: [Word8] -> Integer
> roll   = foldr unstep 0
>   where
> unstep b a = a `shiftL` 8 .|. fromIntegral b
> 
> Which is a bit stream-fusion inspired, I must admit.
> 

But better than what is in Codec.Utils:

> toBase x =
>map fromIntegral .
>reverse .
>map (flip mod x) .
>takeWhile (/=0) .
>iterate (flip div x)
> 
> -- | Take a number a convert it to base n as a list of octets.
> 
> toOctets :: (Integral a, Integral b) => a -> b -> [Octet]
> toOctets n x = (toBase n . fromIntegral) x

> powersOf n = 1 : (map (*n) (powersOf n))

> -- | Take a list of octets (a number expressed in base n) and convert it
> --   to a number.
> 
> fromOctets :: (Integral a, Integral b) => a -> [Octet] -> b
> fromOctets n x =
>fromIntegral $
>sum $
>zipWith (*) (powersOf n) (reverse (map fromIntegral x))

It seems a shame that everyone has to roll their own.

Dominic.

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Bloom Filter

2007-04-30 Thread ajb
G'day.

Quoting tom <[EMAIL PROTECTED]>:

> I'm pretty new to Haskell, I've been working on a Bloom filter[1]
> implementation as a learning exercise.

Excellent!  Sounds like a fun test.

> I'd really appreciate it if someone more experienced would comment on
> the code. I'm sure there's plenty of places where I'm doing things in
> silly or overly complex ways.

Sure.

All in all, very well done.  It works, and it looks pretty efficient.
My quibbles are mostly stylistic or syntactic in nature.  Please
understand that the relative triviality of my quibbles is a sign that
there are really no major problems.

This is not a criticism, but more an advertisement: What are you using
for source control here?  Darcs is nice, and as a bonus, it's trivially
browsable from a web browser, which saves downloading and unpacking.

General comments:

You overuse parentheses.  A lot.  Definitions like this:

ary = (listArray (0, wordc-1) (repeat 0))

don't need parentheses around them, and just add to the general noise
level.

And (.&. ((size b)-1)) is much more cleanly expressed as (.&. (size b - 1)).

Rather than carrying around a hash function, it might be better to use
a type class:

class BloomHash k where
bloomHash :: k -> [Word8]

In wordsize:

You don't need to hard-code this.  You can use:

wordsize = bitSize (undefined::Word32)  -- Or Int, of course!

bitSize is defined in Data.Bits.

In splitup:

I got a bit confused by the local binding names.  It's usual, especially
in generic code, to use "xs", "ys" etc for a list of "x" and "y".
Something like this might be more idiomatic:

splitup n xs = let (xs1, xs2) = splitAt n xs
   in xs1 : splitup n xs2

In indexes:

(fromIntegral $ x `div` wordsize, fromIntegral $ x .&. (wordsize-1))

Seems intuitively wasteful.  Either use divMod or bit operations.

Similarly, (hashfunc b) key is the same as hashfunc b key.  But even
better is:

split bytecount . hashfunc b $ key

That makes it obvious that it's a pipeline of functions applied to the key.

This looks cool:

bytes2int = foldr ((. (256 *)) . (+)) 0 . (map toInteger)

but I'm not smart enough to parse it.  This is both more readable and
shorter:

bytes2int = foldr (\x r -> r*256 + fromInteger x) 0

Integer log2's are probably better done using integers only, or at least
abstracted out into a separate function.

In bloom:

Function guards are your friends!  This:

bloom hf sz hc = if condition
 then b
 else error "Badness"

is almost always better expressed as:

bloom hf sz hc
  | condition = b
  | otherwise = error "Badness"

You can now inline b.  (I can see why you put it in a where clause; now
you don't have to.)

wordc, again, only needs integral arithmetic:

wordc = ceiling ((fromIntegral a) / (fromIntegral b :: Double))

is more or less:

wordc = (a+b-1) `div` b

And drop the parentheses around the definition of ary.

In add:

Try to use function names that are close to names in existing libraries,
like Data.Set.  "insert" sounds better here.

Also, rather than this:

add :: Bloom a -> a -> Bloom a

a better argument order is this:

insert :: a -> Bloom a -> Bloom a

That way, you can use it with foldr.

In test:

Again, probably misnamed.  Data.Set calls this "member".  And again,
arguably the wrong argument ordering.

Once again, well done.

Cheers,
Andrew Bromage
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Bloom Filter

2007-04-30 Thread Donald Bruce Stewart
ajb:
> Quoting tom <[EMAIL PROTECTED]>:
> This looks cool:
> 
> bytes2int = foldr ((. (256 *)) . (+)) 0 . (map toInteger)
> 
> but I'm not smart enough to parse it.  This is both more readable and
> shorter:
> 
> bytes2int = foldr (\x r -> r*256 + fromInteger x) 0
> 
> Integer log2's are probably better done using integers only, or at least
> abstracted out into a separate function.

Reminds me of this code from Data.Binary:

unroll :: Integer -> [Word8]
unroll = unfoldr step
  where
step 0 = Nothing
step i = Just (fromIntegral i, i `shiftR` 8)

roll :: [Word8] -> Integer
roll   = foldr unstep 0
  where
unstep b a = a `shiftL` 8 .|. fromIntegral b

Which is a bit stream-fusion inspired, I must admit.

-- Don
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Bloom Filter

2007-05-01 Thread Josef Svenningsson

Hi,

Just a small comment on one of the comments.

On 5/1/07, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:

Also, rather than this:

add :: Bloom a -> a -> Bloom a

a better argument order is this:

insert :: a -> Bloom a -> Bloom a

That way, you can use it with foldr.


Hmmm. If you want to create a Bloom using a fold wouldn't it make more
sense to use foldl'? I think the argument order is fine.

Cheers,

Josef
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Bloom Filter

2007-05-01 Thread ajb
G'day all.

Quoting Dom <[EMAIL PROTECTED]>:

> But better than what is in Codec.Utils:
[deletia]
> It seems a shame that everyone has to roll their own.

That and integer log base 2.

Cheers,
Andrew Bromage
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Bloom Filter

2007-05-01 Thread ajb
G'day all.

I wrote:

> > insert :: a -> Bloom a -> Bloom a
> >
> > That way, you can use it with foldr.

Quoting Josef Svenningsson <[EMAIL PROTECTED]>:

> Hmmm. If you want to create a Bloom using a fold wouldn't it make more
> sense to use foldl'? I think the argument order is fine.

You're right that foldl' makes more sense than foldr in this case.
Nevertheless, the usual Haskell convention is that insert-like functions
have the same argument order as the cons operator (:).

Haskell libraries have a real problem with the proliferation of
conventions for various things.  This order is the usual convention,
so follow it.  If you don't like it, there's always "flip".

Cheers,
Andrew Bromage
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Bloom Filter

2007-05-02 Thread Henning Thielemann

On Mon, 30 Apr 2007 [EMAIL PROTECTED] wrote:

> In bloom:
>
> Function guards are your friends!  This:
>
> bloom hf sz hc = if condition
>  then b
>  else error "Badness"
>
> is almost always better expressed as:
>
> bloom hf sz hc
>   | condition = b
>   | otherwise = error "Badness"

Why replacing the almost-function 'if' by a special syntactic construct?
Why replacing the two-branch decision 'if' by a multi-branch construct
(similar to "switch") with two-branches? Express simple things the simple
way!
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Bloom Filter

2007-05-02 Thread tom

Hi Andrew,

Thanks for the comments, it really helps to have someone else's
opinion on my code.  I'll be applying what you've said as soon as I
get a chance and I'm sure I'll have some more questions then. I'll
certainly look more closely at the Set interface and try and duplicate
all the parts which make sense.

I've been using Darcs for a while with non-haskell projects as well as
this project, however it seems that cabal strips out the darcs
meta-data when making up a distribution tar file. Is there an option
to have it include the darcs stuff? it seems like it could be quite
useful and I can't really see a downside. If you're interested the
Darcs repository is at:

http://www.almostobsolete.net/bloom/

Tom

On 5/1/07, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:

G'day.

Quoting tom <[EMAIL PROTECTED]>:

> I'm pretty new to Haskell, I've been working on a Bloom filter[1]
> implementation as a learning exercise.

Excellent!  Sounds like a fun test.

> I'd really appreciate it if someone more experienced would comment on
> the code. I'm sure there's plenty of places where I'm doing things in
> silly or overly complex ways.

Sure.

All in all, very well done.  It works, and it looks pretty efficient.
My quibbles are mostly stylistic or syntactic in nature.  Please
understand that the relative triviality of my quibbles is a sign that
there are really no major problems.

This is not a criticism, but more an advertisement: What are you using
for source control here?  Darcs is nice, and as a bonus, it's trivially
browsable from a web browser, which saves downloading and unpacking.

General comments:

You overuse parentheses.  A lot.  Definitions like this:

ary = (listArray (0, wordc-1) (repeat 0))

don't need parentheses around them, and just add to the general noise
level.

And (.&. ((size b)-1)) is much more cleanly expressed as (.&. (size b - 1)).

Rather than carrying around a hash function, it might be better to use
a type class:

class BloomHash k where
bloomHash :: k -> [Word8]

In wordsize:

You don't need to hard-code this.  You can use:

wordsize = bitSize (undefined::Word32)  -- Or Int, of course!

bitSize is defined in Data.Bits.

In splitup:

I got a bit confused by the local binding names.  It's usual, especially
in generic code, to use "xs", "ys" etc for a list of "x" and "y".
Something like this might be more idiomatic:

splitup n xs = let (xs1, xs2) = splitAt n xs
   in xs1 : splitup n xs2

In indexes:

(fromIntegral $ x `div` wordsize, fromIntegral $ x .&. (wordsize-1))

Seems intuitively wasteful.  Either use divMod or bit operations.

Similarly, (hashfunc b) key is the same as hashfunc b key.  But even
better is:

split bytecount . hashfunc b $ key

That makes it obvious that it's a pipeline of functions applied to the key.

This looks cool:

bytes2int = foldr ((. (256 *)) . (+)) 0 . (map toInteger)

but I'm not smart enough to parse it.  This is both more readable and
shorter:

bytes2int = foldr (\x r -> r*256 + fromInteger x) 0

Integer log2's are probably better done using integers only, or at least
abstracted out into a separate function.

In bloom:

Function guards are your friends!  This:

bloom hf sz hc = if condition
 then b
 else error "Badness"

is almost always better expressed as:

bloom hf sz hc
  | condition = b
  | otherwise = error "Badness"

You can now inline b.  (I can see why you put it in a where clause; now
you don't have to.)

wordc, again, only needs integral arithmetic:

wordc = ceiling ((fromIntegral a) / (fromIntegral b :: Double))

is more or less:

wordc = (a+b-1) `div` b

And drop the parentheses around the definition of ary.

In add:

Try to use function names that are close to names in existing libraries,
like Data.Set.  "insert" sounds better here.

Also, rather than this:

add :: Bloom a -> a -> Bloom a

a better argument order is this:

insert :: a -> Bloom a -> Bloom a

That way, you can use it with foldr.

In test:

Again, probably misnamed.  Data.Set calls this "member".  And again,
arguably the wrong argument ordering.

Once again, well done.

Cheers,
Andrew Bromage
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Bloom Filter

2007-05-02 Thread ajb
G'day all.

Quoting Henning Thielemann <[EMAIL PROTECTED]>:

> Why replacing the almost-function 'if' by a special syntactic construct?

In a sense, this is entirely stylistic.  One works just as well as
the other.

But I think it's superior in this case (obviously not all cases) because
this is at the top-level (i.e. on the RHS of a top-level function
definition), and one "arm" of the if-then-else is a call to "error", that
is, a precondition/domain/sanity check.

These pre-checks are almost always different from the "real" business
that the function performs.  Using guards makes the separation that much
more obvious.

Cheers,
Andrew Bromage
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe