Re: BloomFilters in Crunch

Josh Wills Tue, 21 Aug 2012 21:11:09 -0700

On Mon, Aug 20, 2012 at 9:31 PM, Rahul <[email protected]> wrote:

> Josh,
>
> If you look at the current piece of code then it can be. But in general I
> want it to work on a PCollection. This was just a sample testbed where I
> was playing with it.
> If it works on a PCollection, it will be more useful. I am thinking of
> an aggregation function that can do this.
>
> Also, what you said about building filters for a bunch of files/folders
> looks like an interesting use case to me. I can add something along the
> lines of piggybank and put it there. :)
>


I look forward to the patch.

J


>
> regards
> Rahul
>
>
> On 20-08-2012 20:29, Josh Wills wrote:
>
>> Hey Rahul,
>>
>> Very cool use case. A thought: isn't the name of the file that
>> contains the bloom filter a better key than the boolean? That way, I
>> could point the input at an entire directory of files and have it
>> build bloom filters for all of them for me.
>>
>> It seems useful to me in general, but I'm not quite sure where to put
>> it-- it's more useful than an example, but not such a common use case
>> that we would put it in core. We need something like the equivalent of
>> Pig's piggybank.
>>
>> J
>>
>> On Mon, Aug 20, 2012 at 12:58 AM, Rahul <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> Today I tried to create BloomFilters using Crunch; attached is the
>>> test case for the same. I do not know if there is a better way of
>>> accomplishing the same.
>>> I think APIs to create/load BloomFilters could be a good add-on to
>>> Crunch's existing set. If people feel like it could be added, then I
>>> can make a patch for the same.
>>>
>>> regards,
>>> Rahul
>>>
>>>
>


-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
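
The idea discussed in the thread, keying each Bloom filter by the name of the file it was built from rather than by a boolean, can be sketched roughly as below. This is a minimal illustration in plain Java, not Crunch code and not the attached test case: the class names PerFileBloomFilters and SimpleBloomFilter, the buildFilters helper, the in-memory map of records per file, and the filter parameters are all assumptions made for the example. A real patch would presumably operate on a PCollection and use an existing Bloom filter implementation.

```java
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PerFileBloomFilters {

    /** Toy Bloom filter over strings (hypothetical stand-in for a real implementation). */
    static final class SimpleBloomFilter {
        private final BitSet bits;
        private final int numBits;
        private final int numHashes;

        SimpleBloomFilter(int numBits, int numHashes) {
            this.bits = new BitSet(numBits);
            this.numBits = numBits;
            this.numHashes = numHashes;
        }

        // Derive the i-th bit position from two base hashes (double hashing).
        private int index(String value, int i) {
            int h1 = value.hashCode();
            int h2 = (h1 >>> 16) | 1; // forced odd, so it is never zero
            return Math.floorMod(h1 + i * h2, numBits);
        }

        void add(String value) {
            for (int i = 0; i < numHashes; i++) {
                bits.set(index(value, i));
            }
        }

        // May return true for values never added (false positive),
        // but never returns false for a value that was added.
        boolean mightContain(String value) {
            for (int i = 0; i < numHashes; i++) {
                if (!bits.get(index(value, i))) {
                    return false;
                }
            }
            return true;
        }
    }

    /** Builds one filter per file name; the input maps a file name to its records. */
    static Map<String, SimpleBloomFilter> buildFilters(
            Map<String, List<String>> recordsByFile, int numBits, int numHashes) {
        Map<String, SimpleBloomFilter> filters = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> entry : recordsByFile.entrySet()) {
            SimpleBloomFilter filter = filters.computeIfAbsent(
                    entry.getKey(), name -> new SimpleBloomFilter(numBits, numHashes));
            for (String record : entry.getValue()) {
                filter.add(record);
            }
        }
        return filters;
    }

    public static void main(String[] args) {
        // Hypothetical file names and records, standing in for a directory of inputs.
        Map<String, List<String>> input = new LinkedHashMap<>();
        input.put("part-00000", List.of("apple", "banana"));
        input.put("part-00001", List.of("cherry"));

        Map<String, SimpleBloomFilter> filters = buildFilters(input, 1024, 3);
        System.out.println(filters.get("part-00000").mightContain("apple"));
        System.out.println(filters.get("part-00001").mightContain("cherry"));
    }
}
```

With the file name as the key, pointing the input at a whole directory yields one filter per file, and a lookup can first pick the filter for a given file before testing membership.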
