Re: multiview using bloom filters
I have added the formatting changes and contacted the author of scalable bloom filters, it seems that bitarray (and the hipe version) came from a discussion on the erlang mailing lists http://groups.google.com/group/erlang-programming/browse_thread/thread/7c0191b1d709a5fe/ea5cf52b46d67d76?lnk=gst&q=bitarray#ea5cf52b46d67d76 but the author of the bloom.erl should be able to confirm. Any other comments, anyone had a chance to test it out?!! thanks, Norman On Sat, Sep 25, 2010 at 8:45 AM, Paul Davis wrote: > Although, I don't believe theirs is growable. But if it is, that might > be interesting to test for speed. Or we could add the growable parts. > > On Sat, Sep 25, 2010 at 5:44 AM, Robert Dionne > wrote: >> Norman, >> >> Basho also has a bloom filter implementation packaged as a separate >> project[1], that you might find useful. It's used in Bitcask. >> >> Cheers, >> >> Bob >> >> >> >> [1] http://github.com/basho/ebloom >> >> >> >> >> On Sep 24, 2010, at 11:21 PM, Norman Barker wrote: >> >>> Paul, >>> >>> yes, performance is actually much better (for some of our harder >>> queries, so all docs over time with field X (two views), 10x faster), >>> I am testing with docs that in total emit ~100K of keys (following the >>> raindrop megaview). >>> >>> Some of the scalable bloom filter project contained EPL headers, >>> others didn't, googling for the source code I had seen other projects >>> add the EPL headers to bit array so I did the same. I will contact the >>> author as he seems active on the erlang mailing lists and if not I >>> will write a bloom filter from scratch, the theory is well documented, >>> though I like his code! >>> >>> thanks for your help, let me know any suggestions you may have. >>> >>> thanks, >>> >>> Norman >>> >>> >>> >>> On Fri, Sep 24, 2010 at 7:16 PM, Paul Davis >>> wrote: Norman, Just glanced through. Looks better. Any feeling for a performance differences? Also, I glanced at the original files that you linked to. The bit array files didn't have a license, but what you've got there does have EPL headers. We need to make sure we have permission to do so. I would assume as much, but we have to be careful about such things in the ASF. You only need to get an email from the original author saying its ok. I'm a bit caught up with some other code at the moment, I'll give a more thorough combing over tomorrow. Paul On Fri, Sep 24, 2010 at 7:54 PM, Norman Barker wrote: > Hi, > > thanks to Paul's excellent suggestion I have rewritten the multiview > to use bloom filters, I had a concern that a bloom filter per view > would use too much memory but thanks in the main to excellent > implementation of bloom filters in erlang > (http://sites.google.com/site/scalablebloomfilters/) they seem to be > very space efficient. > > New code is here > > http://github.com/normanb/couchdb/ > > The code is simple, all one process, once we have agreed the approach > we can decide if there is any benefit in making the bloom filter > generation occur a separate process (using a genserver). > > Comments as always appreciated, I will continue adding to the test suite. > > thanks for the help, > > Norman > >> >> >
Re: multiview using bloom filters
Although, I don't believe theirs is growable. But if it is, that might be interesting to test for speed. Or we could add the growable parts. On Sat, Sep 25, 2010 at 5:44 AM, Robert Dionne wrote: > Norman, > > Basho also has a bloom filter implementation packaged as a separate > project[1], that you might find useful. It's used in Bitcask. > > Cheers, > > Bob > > > > [1] http://github.com/basho/ebloom > > > > > On Sep 24, 2010, at 11:21 PM, Norman Barker wrote: > >> Paul, >> >> yes, performance is actually much better (for some of our harder >> queries, so all docs over time with field X (two views), 10x faster), >> I am testing with docs that in total emit ~100K of keys (following the >> raindrop megaview). >> >> Some of the scalable bloom filter project contained EPL headers, >> others didn't, googling for the source code I had seen other projects >> add the EPL headers to bit array so I did the same. I will contact the >> author as he seems active on the erlang mailing lists and if not I >> will write a bloom filter from scratch, the theory is well documented, >> though I like his code! >> >> thanks for your help, let me know any suggestions you may have. >> >> thanks, >> >> Norman >> >> >> >> On Fri, Sep 24, 2010 at 7:16 PM, Paul Davis >> wrote: >>> Norman, >>> >>> Just glanced through. Looks better. Any feeling for a performance >>> differences? >>> >>> Also, I glanced at the original files that you linked to. The bit >>> array files didn't have a license, but what you've got there does have >>> EPL headers. We need to make sure we have permission to do so. I would >>> assume as much, but we have to be careful about such things in the >>> ASF. You only need to get an email from the original author saying its >>> ok. >>> >>> I'm a bit caught up with some other code at the moment, I'll give a >>> more thorough combing over tomorrow. >>> >>> Paul >>> >>> On Fri, Sep 24, 2010 at 7:54 PM, Norman Barker >>> wrote: Hi, thanks to Paul's excellent suggestion I have rewritten the multiview to use bloom filters, I had a concern that a bloom filter per view would use too much memory but thanks in the main to excellent implementation of bloom filters in erlang (http://sites.google.com/site/scalablebloomfilters/) they seem to be very space efficient. New code is here http://github.com/normanb/couchdb/ The code is simple, all one process, once we have agreed the approach we can decide if there is any benefit in making the bloom filter generation occur a separate process (using a genserver). Comments as always appreciated, I will continue adding to the test suite. thanks for the help, Norman >>> > >
Re: multiview using bloom filters
Norman, Basho also has a bloom filter implementation packaged as a separate project[1], that you might find useful. It's used in Bitcask. Cheers, Bob [1] http://github.com/basho/ebloom On Sep 24, 2010, at 11:21 PM, Norman Barker wrote: > Paul, > > yes, performance is actually much better (for some of our harder > queries, so all docs over time with field X (two views), 10x faster), > I am testing with docs that in total emit ~100K of keys (following the > raindrop megaview). > > Some of the scalable bloom filter project contained EPL headers, > others didn't, googling for the source code I had seen other projects > add the EPL headers to bit array so I did the same. I will contact the > author as he seems active on the erlang mailing lists and if not I > will write a bloom filter from scratch, the theory is well documented, > though I like his code! > > thanks for your help, let me know any suggestions you may have. > > thanks, > > Norman > > > > On Fri, Sep 24, 2010 at 7:16 PM, Paul Davis > wrote: >> Norman, >> >> Just glanced through. Looks better. Any feeling for a performance >> differences? >> >> Also, I glanced at the original files that you linked to. The bit >> array files didn't have a license, but what you've got there does have >> EPL headers. We need to make sure we have permission to do so. I would >> assume as much, but we have to be careful about such things in the >> ASF. You only need to get an email from the original author saying its >> ok. >> >> I'm a bit caught up with some other code at the moment, I'll give a >> more thorough combing over tomorrow. >> >> Paul >> >> On Fri, Sep 24, 2010 at 7:54 PM, Norman Barker >> wrote: >>> Hi, >>> >>> thanks to Paul's excellent suggestion I have rewritten the multiview >>> to use bloom filters, I had a concern that a bloom filter per view >>> would use too much memory but thanks in the main to excellent >>> implementation of bloom filters in erlang >>> (http://sites.google.com/site/scalablebloomfilters/) they seem to be >>> very space efficient. >>> >>> New code is here >>> >>> http://github.com/normanb/couchdb/ >>> >>> The code is simple, all one process, once we have agreed the approach >>> we can decide if there is any benefit in making the bloom filter >>> generation occur a separate process (using a genserver). >>> >>> Comments as always appreciated, I will continue adding to the test suite. >>> >>> thanks for the help, >>> >>> Norman >>> >>
Re: multiview using bloom filters
Hi, good work. I haven't had the time to look at it in detail, only glanced at it. My comments: - use indentation of 4 spaces - I see some places using indentation of 2 spaces, others using more, etc; - most of CouchDB's code uses a different style of case expression indentation - each case branch as the same indentation as the case line, e.g.: case lists:member(foobar, List) of true -> whatever; false -> else; end This avoids the code to grow too much horizontally (and more readable in my opinion, but this is very subjective). - keep lines to 80 characters at most - instead of using list_to_binary and binary_to_list, you can use the ?l2b and ?b2l macros defined in couch_db.hrl, they help reducing line length cheers On Sat, Sep 25, 2010 at 12:54 AM, Norman Barker wrote: > Hi, > > thanks to Paul's excellent suggestion I have rewritten the multiview > to use bloom filters, I had a concern that a bloom filter per view > would use too much memory but thanks in the main to excellent > implementation of bloom filters in erlang > (http://sites.google.com/site/scalablebloomfilters/) they seem to be > very space efficient. > > New code is here > > http://github.com/normanb/couchdb/ > > The code is simple, all one process, once we have agreed the approach > we can decide if there is any benefit in making the bloom filter > generation occur a separate process (using a genserver). > > Comments as always appreciated, I will continue adding to the test suite. > > thanks for the help, > > Norman > -- Filipe David Manana, fdman...@gmail.com, fdman...@apache.org "Reasonable men adapt themselves to the world. Unreasonable men adapt the world to themselves. That's why all progress depends on unreasonable men."
Re: multiview using bloom filters
I would say if you find it somewhere with an EPL header, then that's a good sign. But best to check with the original author that it was his intent. As an interesting aside, I'm not even sure if its possible that someone that's not an employee of Ericson can release something under the EPL. I'm not being a pain on purpose here, but the ASF is particular about provenance in code and licensing. Alternatively, I haven't studied the code, so in the worst case I'll read the white paper and try and apply it to Basho's ebloom project. Paul Davis On Fri, Sep 24, 2010 at 11:21 PM, Norman Barker wrote: > Paul, > > yes, performance is actually much better (for some of our harder > queries, so all docs over time with field X (two views), 10x faster), > I am testing with docs that in total emit ~100K of keys (following the > raindrop megaview). > > Some of the scalable bloom filter project contained EPL headers, > others didn't, googling for the source code I had seen other projects > add the EPL headers to bit array so I did the same. I will contact the > author as he seems active on the erlang mailing lists and if not I > will write a bloom filter from scratch, the theory is well documented, > though I like his code! > > thanks for your help, let me know any suggestions you may have. > > thanks, > > Norman > > > > On Fri, Sep 24, 2010 at 7:16 PM, Paul Davis > wrote: >> Norman, >> >> Just glanced through. Looks better. Any feeling for a performance >> differences? >> >> Also, I glanced at the original files that you linked to. The bit >> array files didn't have a license, but what you've got there does have >> EPL headers. We need to make sure we have permission to do so. I would >> assume as much, but we have to be careful about such things in the >> ASF. You only need to get an email from the original author saying its >> ok. >> >> I'm a bit caught up with some other code at the moment, I'll give a >> more thorough combing over tomorrow. >> >> Paul >> >> On Fri, Sep 24, 2010 at 7:54 PM, Norman Barker >> wrote: >>> Hi, >>> >>> thanks to Paul's excellent suggestion I have rewritten the multiview >>> to use bloom filters, I had a concern that a bloom filter per view >>> would use too much memory but thanks in the main to excellent >>> implementation of bloom filters in erlang >>> (http://sites.google.com/site/scalablebloomfilters/) they seem to be >>> very space efficient. >>> >>> New code is here >>> >>> http://github.com/normanb/couchdb/ >>> >>> The code is simple, all one process, once we have agreed the approach >>> we can decide if there is any benefit in making the bloom filter >>> generation occur a separate process (using a genserver). >>> >>> Comments as always appreciated, I will continue adding to the test suite. >>> >>> thanks for the help, >>> >>> Norman >>> >> >
Re: multiview using bloom filters
Paul, yes, performance is actually much better (for some of our harder queries, so all docs over time with field X (two views), 10x faster), I am testing with docs that in total emit ~100K of keys (following the raindrop megaview). Some of the scalable bloom filter project contained EPL headers, others didn't, googling for the source code I had seen other projects add the EPL headers to bit array so I did the same. I will contact the author as he seems active on the erlang mailing lists and if not I will write a bloom filter from scratch, the theory is well documented, though I like his code! thanks for your help, let me know any suggestions you may have. thanks, Norman On Fri, Sep 24, 2010 at 7:16 PM, Paul Davis wrote: > Norman, > > Just glanced through. Looks better. Any feeling for a performance differences? > > Also, I glanced at the original files that you linked to. The bit > array files didn't have a license, but what you've got there does have > EPL headers. We need to make sure we have permission to do so. I would > assume as much, but we have to be careful about such things in the > ASF. You only need to get an email from the original author saying its > ok. > > I'm a bit caught up with some other code at the moment, I'll give a > more thorough combing over tomorrow. > > Paul > > On Fri, Sep 24, 2010 at 7:54 PM, Norman Barker > wrote: >> Hi, >> >> thanks to Paul's excellent suggestion I have rewritten the multiview >> to use bloom filters, I had a concern that a bloom filter per view >> would use too much memory but thanks in the main to excellent >> implementation of bloom filters in erlang >> (http://sites.google.com/site/scalablebloomfilters/) they seem to be >> very space efficient. >> >> New code is here >> >> http://github.com/normanb/couchdb/ >> >> The code is simple, all one process, once we have agreed the approach >> we can decide if there is any benefit in making the bloom filter >> generation occur a separate process (using a genserver). >> >> Comments as always appreciated, I will continue adding to the test suite. >> >> thanks for the help, >> >> Norman >> >
Re: multiview using bloom filters
Norman, Just glanced through. Looks better. Any feeling for a performance differences? Also, I glanced at the original files that you linked to. The bit array files didn't have a license, but what you've got there does have EPL headers. We need to make sure we have permission to do so. I would assume as much, but we have to be careful about such things in the ASF. You only need to get an email from the original author saying its ok. I'm a bit caught up with some other code at the moment, I'll give a more thorough combing over tomorrow. Paul On Fri, Sep 24, 2010 at 7:54 PM, Norman Barker wrote: > Hi, > > thanks to Paul's excellent suggestion I have rewritten the multiview > to use bloom filters, I had a concern that a bloom filter per view > would use too much memory but thanks in the main to excellent > implementation of bloom filters in erlang > (http://sites.google.com/site/scalablebloomfilters/) they seem to be > very space efficient. > > New code is here > > http://github.com/normanb/couchdb/ > > The code is simple, all one process, once we have agreed the approach > we can decide if there is any benefit in making the bloom filter > generation occur a separate process (using a genserver). > > Comments as always appreciated, I will continue adding to the test suite. > > thanks for the help, > > Norman >
multiview using bloom filters
Hi, thanks to Paul's excellent suggestion I have rewritten the multiview to use bloom filters, I had a concern that a bloom filter per view would use too much memory but thanks in the main to excellent implementation of bloom filters in erlang (http://sites.google.com/site/scalablebloomfilters/) they seem to be very space efficient. New code is here http://github.com/normanb/couchdb/ The code is simple, all one process, once we have agreed the approach we can decide if there is any benefit in making the bloom filter generation occur a separate process (using a genserver). Comments as always appreciated, I will continue adding to the test suite. thanks for the help, Norman