If you have a clean test case demonstrating the desired behavior, and a change which makes it work that way, yes make a JIRA and PR.
On Fri, Jun 17, 2016 at 1:35 AM, Luyi Wang <wangluyi1...@gmail.com> wrote: > Hey there: > > The frequent item in dataframe stat package seems not accurate. In the > documentation,it did mention that it has false positive but still seems > incorrect. > > Wondering if this is all known problem or not? > > > Here is a quick example showing the problem. > > val sqlContext = new SQLContext(sc) > import sqlContext.implicits._ > > val rows = Seq((0,"a"),(1, "c"),(2, "a"),(3, "a"),(4, "b"),(5, > "d")).toDF("id", "category") > val history = rows.toDF("id", "category") > > history.stat.freqItems(Array("category"),0.5).show > history.stat.freqItems(Array("category"),0.3).show > history.stat.freqItems(Array("category"),0.51).show > history.stat.freqItems(Array("category")).show > > > Here is the output > > +------------------+ > |category_freqItems| > +------------------+ > | [d, a]| > +------------------+ > > +------------------+ > |category_freqItems| > +------------------+ > | [a]| > +------------------+ > > +------------------+ > |category_freqItems| > +------------------+ > | []| > +------------------+ > > +------------------+ > |category_freqItems| > +------------------+ > | [b, d, a, c]| > +------------------+ > > > > The problem results from the freqItemCounter class's add function which is > used in the function singlePassFreqItems aggregation stage. > > Regarding on the paper, the return size of the frequent set can't be larger > than 1/minimum_support,which we indicated as k hereby, so that in > singlePassFreqItems the counterMap is created with this size. > > The logic of the add function is following: > > To add up the counter of a item, when it already exists in the map, the > counter is added up.If it doesn't exist and also map size less than k, it > inserts. if it doesn't exist and also current size just equal to size k, > then it will compare the inserted count with the minimum value. if the > counter of the to be inserted item is larger than or equals to the current > minimum, item is inserted and all items with counter value larger than > current minimum would and smaller and equals to will be removed. If counter > of the to be inserted item is smaller than the current minimum, item won't > be inserted and counters of all items in the map will be deduct the inserted > counter value. > > Problem: > > Since it would retain the items larger than the current minimum, if the > current minimum is just happened to be the count of second most frequent > item. it would be removed if the to be inserted item has the same count. In > this case, possibly a smaller one would be inserted in the map afterward and > returned later. > > Given one example here. "a" appears 3 times, "b" and "c" both appears 2 > times, "d" appears only once, total 8 times, For minimum support 0.5, the > map is initiated with size 2. The correct answer should return items > appears more than 4 times, which is empty. However it returns "a" and "d". > The reason it returned two items is because of map size. The reason "d" is > returned is because that "b" and "c" appear the same amount and more than > "d", but they are cleaned when either one of them already inserted and the > map reach the size limitation. and when "d" is to be inserted, size is > smaller and it is inserted. > > > val rows = Seq((0,"a"),(1, "b"),(2, "a"),(3, "a"),(4, "b"),(5, > "c"),(6,"c"),(7,"d")).toDF("id", "category") > val history = rows.toDF("id", "category") > > history.stat.freqItems(Array("category"),0.5).show > history.stat.freqItems(Array("category"),0.3).show > history.stat.freqItems(Array("category"),0.51).show > history.stat.freqItems(Array("category")).show > > > +------------------+ > |category_freqItems| > +------------------+ > | [d, a]| > +------------------+ > > +------------------+ > |category_freqItems| > +------------------+ > | [b, a, c]| > +------------------+ > > +------------------+ > |category_freqItems| > +------------------+ > | []| > +------------------+ > > +------------------+ > |category_freqItems| > +------------------+ > | [b, d, a, c]| > +------------------+ > > > Hope this explains the problem. > > Thanks. > > -Luyi. > > > --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org