On Mon, Aug 20, 2012 at 9:31 PM, Rahul <[email protected]> wrote:

> Josh,
>
> If you look at the current piece of code then it can be. But in general I
> want it to work on a PCollection. This was just a sample testbed where I
> was playing with it.
> If it works on a PCollection then it can be more useful. I am thinking of
> an Aggregation function which can do this.
>
> Also, what you said about building filters for a bunch of files/folders
> looks like an interesting use case to me. I can add something along the
> lines of piggybank and put it there. J
>
I look forward to the patch.

J

>
> regards
> Rahul
>
>
> On 20-08-2012 20:29, Josh Wills wrote:
>
>> Hey Rahul,
>>
>> Very cool use case. A thought: isn't the name of the file that
>> contains the bloom filter a better key than the boolean? That way, I
>> could point the input at an entire directory of files and have it
>> build bloom filters for all of them for me.
>>
>> It seems useful to me in general, but I'm not quite sure where to put
>> it-- it's more useful than an example, but not such a common use case
>> that we would put it in core. We need something like the equivalent of
>> Pig's piggybank.
>>
>> J
>>
>> On Mon, Aug 20, 2012 at 12:58 AM, Rahul <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> Today I tried to create BloomFilters using Crunch, attached is the
>>> testcase for the same. I do not know if there is a better way of
>>> accomplishing the same.
>>> I think APIs to create/load BloomFilters could be a good add-on to
>>> Crunch's existing set. If people feel like it could be added then I
>>> can make a patch for the same.
>>>
>>> regards,
>>> Rahul
>>>
>>>
>

--
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
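
For readers following the thread, the idea of keying each bloom filter by the name of its source file (rather than by a boolean) can be sketched roughly as below. This is only an illustration in plain Java and is not the attached testcase: it does not use Crunch or Hadoop classes, and the names `PerFileBloomFilters`, `SimpleBloomFilter`, and `buildFilters` are hypothetical. In a real pipeline the grouping by file name would presumably happen on a PCollection instead of an in-memory map.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PerFileBloomFilters {

    /** Minimal bloom filter for illustration only (hypothetical, not a library class). */
    static final class SimpleBloomFilter {
        private final BitSet bits;
        private final int numBits;
        private final int numHashes;

        SimpleBloomFilter(int numBits, int numHashes) {
            this.bits = new BitSet(numBits);
            this.numBits = numBits;
            this.numHashes = numHashes;
        }

        // Double hashing: the i-th bit index is (h1 + i * h2) mod numBits.
        private int index(String value, int i) {
            int h1 = value.hashCode();
            int h2 = (h1 >>> 16) | 1; // forced odd, so the step is never zero
            return Math.floorMod(h1 + i * h2, numBits);
        }

        void add(String value) {
            for (int i = 0; i < numHashes; i++) {
                bits.set(index(value, i));
            }
        }

        /** May return a false positive, but never a false negative. */
        boolean mightContain(String value) {
            for (int i = 0; i < numHashes; i++) {
                if (!bits.get(index(value, i))) {
                    return false;
                }
            }
            return true;
        }
    }

    /** Builds one filter per file name; the file name is the key, as suggested in the thread. */
    static Map<String, SimpleBloomFilter> buildFilters(
            Map<String, List<String>> recordsByFile, int numBits, int numHashes) {
        Map<String, SimpleBloomFilter> filters = new HashMap<>();
        for (Map.Entry<String, List<String>> entry : recordsByFile.entrySet()) {
            SimpleBloomFilter filter = new SimpleBloomFilter(numBits, numHashes);
            for (String record : entry.getValue()) {
                filter.add(record);
            }
            filters.put(entry.getKey(), filter);
        }
        return filters;
    }

    public static void main(String[] args) {
        Map<String, List<String>> input = new HashMap<>();
        input.put("a.txt", List.of("apple", "banana"));
        input.put("b.txt", List.of("cherry"));

        Map<String, SimpleBloomFilter> filters = buildFilters(input, 1024, 3);
        // Every record that was added is reported as possibly present.
        System.out.println(filters.get("a.txt").mightContain("apple"));
        System.out.println(filters.get("b.txt").mightContain("cherry"));
    }
}
```

With the file name as the key, pointing the input at a directory yields one filter per file, which is the use case described above.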
