Re: FP Growth Understanding

Neal Richter Mon, 15 Feb 2010 08:23:40 -0800

I have no problem with the repetition!

I'll have to poke at this a bit more, but I like the switches idea.
I often use Christian Borgelt's itemset implementations for playing
with data.  He's implemented a nice set of switches, see below.
Setting a minimum support threshold and a minimum itemset size are both
convenient and tend to make the algorithm run a bit faster.
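
To make the effect of those two switches concrete, here is a minimal Python
sketch that applies a minimum support and a minimum itemset size as a
post-filter to a few of the patterns quoted further down in this thread. The
name `filter_patterns` and the threshold values are made up for illustration;
this is not how fpgrowth_fim or Mahout applies them. A real miner uses the
support threshold during the search to prune candidates, which is where the
speedup comes from; filtering afterwards only trims the output.

```python
from typing import List, Tuple

# A pattern is (itemset, support), mirroring Pair<Itemset, support> in the thread.
Pattern = Tuple[Tuple[int, ...], int]


def filter_patterns(patterns: List[Pattern], min_support: int, min_size: int) -> List[Pattern]:
    """Keep patterns with support >= min_support and at least min_size items.

    Hypothetical helper for illustration only.
    """
    return [
        (items, support)
        for items, support in patterns
        if support >= min_support and len(items) >= min_size
    ]


# Sample taken from the key-68 output quoted in this thread.
patterns = [
    ((68,), 90692),
    ((17, 68), 90683),
    ((12, 68), 90490),
    ((17, 12, 68), 90481),
    ((18, 68), 90291),
]

# Illustrative thresholds: absolute support of 90400 and at least 2 items.
kept = filter_patterns(patterns, min_support=90400, min_size=2)
print(kept)
```

With these thresholds the singleton [68] is dropped for being too short and
[18, 68] for falling below the support cutoff.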


http://www.borgelt.net/software.html

ne...@nrichter-laptop:~$ fpgrowth_fim
usage: fpgrowth_fim [options] infile outfile
find frequent item sets with the fpgrowth algorithm
version 1.13 (2008.05.02)        (c) 2004-2008   Christian Borgelt
-m#      minimal number of items per item set (default: 1)
-n#      maximal number of items per item set (default: no limit)
-s#      minimal support of an item set (default: 10%)
         (positive: percentage, negative: absolute number)
-d#      minimal binary logarithm of support quotient (default: none)
-p#      output format for the item set support (default: "%.1f")
-a       print absolute support (number of transactions)
-g       write output in scanable form (quote certain characters)
-q#      sort items w.r.t. their frequency (default: -2)
         (1: ascending, -1: descending, 0: do not sort,
          2: ascending, -2: descending w.r.t. transaction size sum)
-u       use alternative tree projection method
-z       do not prune tree projections to bonsai
-j       use quicksort to sort the transactions (default: heapsort)
-i#      ignore records starting with a character in the given string
-b/f/r#  blank characters, field and record separators
         (default: " \t\r", " \t", "\n")
infile   file to read transactions from
outfile  file to write frequent item se

On Mon, Feb 15, 2010 at 9:14 AM, Robin Anil <[email protected]> wrote:
> Hi Neal,
>             I know there is repetition. I tried sticking to the
> original algorithm, that is, finding closed patterns and using the longest
> one.
>
> Say 68 and 12 occur together 1000 times
> and 68 12 17 also occurs 1000 times; there is no information that the former
> pattern gives you, so you can remove it. Therefore you say that 68 12 17 is
> a closed pattern and all the patterns it encloses are removed.
>
> Had 68 alone occurred 2000 times, it would no longer be enclosed by that
> closed pattern.
>
> Things could be made configurable by having a flag to remove closed patterns
> within a percentage of the support, or to mine only patterns > 3 items in
> length. These are tricky but could be done.
>
> Robin
>
>
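
A small Python sketch of the closed-pattern idea described above, using the
standard definition (an itemset is closed when no proper superset has the same
support). The function name `closed_patterns` and the dictionary layout are
assumptions for illustration, and the quadratic scan is only meant for tiny
inputs; it is not the Mahout implementation.

```python
from typing import Dict, FrozenSet


def closed_patterns(patterns: Dict[FrozenSet[int], int]) -> Dict[FrozenSet[int], int]:
    """Keep itemsets that have no proper superset with the same support.

    Hypothetical helper; compares every pair of itemsets, so O(n^2).
    """
    closed = {}
    for items, support in patterns.items():
        absorbed = any(
            items < other and support == other_support  # '<' is proper subset
            for other, other_support in patterns.items()
        )
        if not absorbed:
            closed[items] = support
    return closed


# The numbers from the example above: 68 alone occurs 2000 times,
# while 68 12 and 68 12 17 both occur 1000 times.
patterns = {
    frozenset({68}): 2000,
    frozenset({68, 12}): 1000,
    frozenset({68, 12, 17}): 1000,
}

result = closed_patterns(patterns)
for items, support in result.items():
    print(sorted(items), support)
```

Here 68 12 is dropped because 68 12 17 has the same support, while 68 on its
own is kept because no superset matches its count of 2000.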
> On Mon, Feb 15, 2010 at 9:34 PM, Neal Richter <[email protected]> wrote:
>
>> Grant:  Chapter 5 of Han and Kamber (Data Mining: Concepts and
>> Techniques) details itemset mining and the fpgrowth algorithm.  Han is a
>> co-inventor of it.
>>
>> There is a bit of repetition in the output compared to other itemset
>> mining packages, though this structure is convenient for relational
>> indexing by key.
>>
>> - Neal
>>
>> On Mon, Feb 15, 2010 at 6:49 AM, Robin Anil <[email protected]> wrote:
>> > Ok.. A bit more background..
>> >
>> > An itemset is a subset of the items I1, I2, I3 ... In,
>> >
>> > so [I2, I4, I7] is an itemset, and its support (the number of times it
>> > appears in the dataset) is, say, Y.
>> >
>> > A Pattern is Pair<Itemset, support>
>> >
>> > Take a look at it in this format
>> >
>> > 68:
>> >     ([68],90692),
>> >     ([17, 68],90683),
>> >     ([12, 68],90490),
>> >     ([17, 12, 68],90481),
>> >     ([18, 68],90291)
>> >
>> > These are the top patterns containing 68 and their support, in descending
>> > order. For example, 68 and 12 occur together 90490 times.
>> >
>> > Robin
>> >
>> >
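
One way to see where the repetition comes from is to build the per-feature
lists by hand. The Python sketch below groups (itemset, support) pairs under
every item they contain and keeps the top K by support for each key. The name
`top_k_by_feature` and the choice of K are assumptions for illustration; this
is not the Mahout code, only a toy version of the output layout discussed here.

```python
import heapq
from collections import defaultdict
from typing import Dict, List, Tuple

Pattern = Tuple[Tuple[int, ...], int]  # (itemset, support)


def top_k_by_feature(patterns: List[Pattern], k: int) -> Dict[int, List[Pattern]]:
    """Index each pattern under every item it contains, keeping the k best.

    Hypothetical helper for illustration only.
    """
    by_feature = defaultdict(list)
    for items, support in patterns:
        for item in items:
            by_feature[item].append((items, support))
    return {
        feature: heapq.nlargest(k, plist, key=lambda p: p[1])
        for feature, plist in by_feature.items()
    }


# Sample taken from the key-68 output quoted in this thread.
patterns = [
    ((68,), 90692),
    ((17, 68), 90683),
    ((12, 68), 90490),
    ((17, 12, 68), 90481),
    ((18, 68), 90291),
]

top = top_k_by_feature(patterns, k=3)
print(top[68])
print(top[12])
```

The pattern [17, 12, 68] ends up under keys 17 and 12 (and would appear under
68 with a larger K), so the same itemset is repeated once per key it is
indexed by.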
>> > On Mon, Feb 15, 2010 at 6:27 PM, Grant Ingersoll <[email protected]> wrote:
>> >
>> >>
>> >> On Feb 14, 2010, at 11:37 PM, Robin Anil wrote:
>> >>
>> >> > Each key is a feature and each attribute is the list of topK frequent
>> >> > patterns in which the feature exists
>> >>
>> >> Still a bit confused.
>> >> Given:
>> >> Key: 68: Value: ([68],90692), ([17, 68],90683), ([12, 68],90490),
>> >> ([17, 12, 68],90481), ([18, 68],90291), ([17, 18, 68],90282),
>> >> ([12, 18, 68],90229), ([17, 12, 18, 68],90220), ([31, 68],89071),
>> >> ([17, 31, 68],89062), ([12, 31, 68],88874), ([17, 12, 31, 68],88865),
>> >> ([18, 31, 68],88681), ([17, 18, 31, 68],88672), ([12, 18, 31, 68],88619),
>> >> ([17, 12, 18, 31, 68],88610), ([16, 68],87933),
>> >>
>> >> So, 68 is the feature in question.  That makes sense.  Then, what is the
>> >> significance of the [] areas, as in [68],90692 or [17,12,68], 90481.
>> >> Why all the repetition?
>> >>
>> >> -Grant
>> >
>>
>
