If you have a clean test case demonstrating the desired behavior, and
a change that makes it work that way, then yes, please open a JIRA and a PR.

On Fri, Jun 17, 2016 at 1:35 AM, Luyi Wang <wangluyi1...@gmail.com> wrote:
> Hey there:
>
> The frequent items computation (freqItems) in the DataFrame stat package does
> not seem accurate. The documentation does mention that it can return false
> positives, but the results still look incorrect.
>
> Is this a known problem?
>
>
> Here is a quick example showing the problem.
>
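> // Assumes an existing SparkContext `sc` (for example, in spark-shell).
> import org.apache.spark.sql.SQLContext
>
> val sqlContext = new SQLContext(sc)
> import sqlContext.implicits._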
>
> val rows = Seq((0,"a"),(1, "c"),(2, "a"),(3, "a"),(4, "b"),(5,
> "d")).toDF("id", "category")
> val history = rows.toDF("id", "category")
>
> history.stat.freqItems(Array("category"),0.5).show
> history.stat.freqItems(Array("category"),0.3).show
> history.stat.freqItems(Array("category"),0.51).show
> history.stat.freqItems(Array("category")).show
>
>
> Here is the output
>
> +------------------+
> |category_freqItems|
> +------------------+
> |            [d, a]|
> +------------------+
>
> +------------------+
> |category_freqItems|
> +------------------+
> |               [a]|
> +------------------+
>
> +------------------+
> |category_freqItems|
> +------------------+
> |                []|
> +------------------+
>
> +------------------+
> |category_freqItems|
> +------------------+
> |      [b, d, a, c]|
> +------------------+
>
>
>
> The problem comes from the add function of the freqItemCounter class, which
> is used in the aggregation stage of singlePassFreqItems.
>
> According to the paper, the returned frequent set cannot contain more than
> 1/minimum_support items. We call this bound k here, and singlePassFreqItems
> creates the counterMap with this size.
>
> The logic of the add function is as follows:
>
> - If the item already exists in the map, its counter is incremented.
> - If it does not exist and the map size is less than k, it is inserted.
> - If it does not exist and the map size equals k, the count being inserted is
>   compared with the current minimum counter in the map:
>   - If the count is larger than or equal to the current minimum, the item is
>     inserted. Entries whose counter is larger than the current minimum are
>     kept, and entries whose counter is smaller than or equal to it are
>     removed.
>   - If the count is smaller than the current minimum, the item is not
>     inserted, and the counters of all items in the map are reduced by the
>     count being inserted.
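
The update rule described above can be sketched in a few lines of plain
Scala. This is only an illustration of the logic as described in this message,
not the Spark source: the names FreqItemSketch, add and run are hypothetical,
it uses an immutable Map, and it processes all items in one sequential pass
(no partitions, no merge step).

```scala
object FreqItemSketch {
  // One step of the update rule as described in the message.
  // `k` is the map capacity, taken as (1 / support).toInt.
  def add(counts: Map[String, Long], key: String, count: Long, k: Int): Map[String, Long] =
    if (counts.contains(key)) {
      // Known item: add to its counter.
      counts.updated(key, counts(key) + count)
    } else if (counts.size < k) {
      // Room left in the map: insert the new item.
      counts.updated(key, count)
    } else {
      val minCount = counts.values.min
      if (count >= minCount) {
        // Insert, then keep only entries strictly above the old minimum.
        counts.updated(key, count).filter { case (_, v) => v > minCount }
      } else {
        // Reject the new item and reduce every stored counter by its count.
        counts.map { case (item, v) => item -> (v - count) }
      }
    }

  // Feed every item with count 1, in order, starting from an empty map.
  def run(items: Seq[String], support: Double): Map[String, Long] = {
    require(support > 0 && support <= 1, "support must be in (0, 1]")
    val k = (1 / support).toInt
    items.foldLeft(Map.empty[String, Long])((m, item) => add(m, item, 1L, k))
  }
}
```

Fed the values a, b, a, a, b, c, c, d in that order with support 0.5 (so
k = 2), this sketch ends up holding only "a" and "d". The real implementation
works per partition and merges the partial maps, so its intermediate states
may differ.
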
>
> Problem:
>
> Only items whose counter is strictly larger than the current minimum are
> retained. If the current minimum happens to be the count of the second most
> frequent item, that item is removed when the item being inserted has the
> same count. In this case, a less frequent item may be inserted into the map
> afterwards and returned later.
>
> Here is an example. "a" appears 3 times, "b" and "c" each appear 2 times,
> and "d" appears only once, for 8 rows in total. For minimum support 0.5, the
> map is created with size 2. The correct answer is the set of items that
> appear more than 4 times, which is empty. However, it returns "a" and "d".
> It returns two items because of the map size. "d" is returned because "b"
> and "c", which have the same count and appear more often than "d", are
> evicted once either of them has been inserted and the map has reached its
> size limit. When "d" is then inserted, the map is below the limit, so "d"
> is kept.
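
For a test case, the expected answer in the example above can be checked
against an exact count. The sketch below is a plain Scala baseline, not
anything from Spark: the names ExactFrequent and frequent are hypothetical,
and it assumes "frequent" means appearing strictly more than support * n
times, as stated in this message.

```scala
object ExactFrequent {
  // Exact baseline: items whose number of occurrences is strictly greater
  // than support * (total number of items).
  def frequent(items: Seq[String], support: Double): Set[String] = {
    val threshold = support * items.size
    items
      .groupBy(identity)                                   // item -> all its occurrences
      .collect { case (item, occ) if occ.size > threshold => item }
      .toSet
  }
}
```

For the eight values a, b, a, a, b, c, c, d, this baseline returns the empty
set for support 0.5 and only "a" for support 0.3, which gives a reference to
compare the freqItems output against.
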
>
>
> val rows = Seq((0,"a"),(1, "b"),(2, "a"),(3, "a"),(4, "b"),(5,
> "c"),(6,"c"),(7,"d")).toDF("id", "category")
> val history = rows.toDF("id", "category")
>
> history.stat.freqItems(Array("category"),0.5).show
> history.stat.freqItems(Array("category"),0.3).show
> history.stat.freqItems(Array("category"),0.51).show
> history.stat.freqItems(Array("category")).show
>
>
> +------------------+
> |category_freqItems|
> +------------------+
> |            [d, a]|
> +------------------+
>
> +------------------+
> |category_freqItems|
> +------------------+
> |         [b, a, c]|
> +------------------+
>
> +------------------+
> |category_freqItems|
> +------------------+
> |                []|
> +------------------+
>
> +------------------+
> |category_freqItems|
> +------------------+
> |      [b, d, a, c]|
> +------------------+
>
>
> Hope this explains the problem.
>
> Thanks.
>
> -Luyi.
>
>
>

