[
https://issues.apache.org/jira/browse/MADLIB-1168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16299215#comment-16299215
]
ssoni commented on MADLIB-1168:
-------------------------------
[~iyerr3][[~okislal]]
Following are a couple of potential solutions for limiting the class sizes and
speeding random sampling query:
* *Solution 1:* Generate random row numbers and select the desired
fraction/number of observations. This requires just 1 seq scan and may use disk
merge sort if the desired number of observations is large. Otherwise, it uses
quicksort.
* *Solution 2:* Use 'order by random()' with the keyword limit. This requires
as many seq scans as the desired fraction/number specified in class_sizes. It
is more likely to use top-n heapsort (smaller limit e.g. 1000) instead of disk
merge sort (larger limit e.g. 100000).
According to our experiment with 10 million rows and 2 columns (1 column with 4
groups) both the solutions are comparable. But for smaller datasets, solution
2 is slightly faster than solution 1.
{code}
drop table test2 cascade;
create table test2 as
select generate_series(1,10000000) as id,
((random()*3)+1)::int as gr;
{code}
*Solution 1:*
Case with larger subset selection: 19-20 seconds
{code}
explain analyze
select
id, gr
from
(
SELECT
*,
row_number() OVER(PARTITION BY gr)
AS __row_no
FROM
test2
) as foo
where
(gr= '1' and __row_no<=100000 )
OR
(gr= '2' and __row_no<=200000 )
OR
(gr= '3' and __row_no<=300000 )
or
(gr not in ('1','2','3'))
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Subquery Scan on foo (cost=1580360.76..2042859.70 rows=3783218 width=8)
(actual time=9186.275..19784.115 rows=2267855 loops=1)
Filter: (((foo.gr = 1) AND (foo.__row_no <= 100000)) OR ((foo.gr = 2) AND
(foo.__row_no <= 200000)) OR ((foo.gr = 3) AND (foo.__row_no <= 300000)) OR
(foo.gr <> ALL ('{1,2,3}'::integer[])))
Rows Removed by Filter: 7732145
-> WindowAgg (cost=1580360.76..1755360.36 rows=9999977 width=16) (actual
time=9186.273..16999.496 rows=10000000 loops=1)
-> Sort (cost=1580360.76..1605360.71 rows=9999977 width=8) (actual
time=9186.265..11228.217 rows=10000000 loops=1)
Sort Key: test2.gr
Sort Method: external merge Disk: 176176kB
-> Seq Scan on test2 (cost=0.00..144247.77 rows=9999977
width=8) (actual time=0.017..2289.247 rows=10000000 loops=1)
Planning time: 0.210 ms
Execution time: 20010.551 ms
(10 rows)
Time: 20011.532 ms (00:20.012)
{code}
Case with smaller subset selection: 19-20 seconds
{code}
explain analyze
select
id, gr
from
(
SELECT
*,
row_number() OVER(PARTITION BY gr)
AS __row_no
FROM
test2
) as foo
where
(gr= '1' and __row_no<=1000 )
OR
(gr= '2' and __row_no<=2000 )
OR
(gr= '3' and __row_no<=3000 )
or
(gr not in ('1','2','3'))
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Subquery Scan on foo (cost=1580360.76..2042859.70 rows=3783218 width=8)
(actual time=8699.221..19181.321 rows=1673855 loops=1)
Filter: (((foo.gr = 1) AND (foo.__row_no <= 1000)) OR ((foo.gr = 2) AND
(foo.__row_no <= 2000)) OR ((foo.gr = 3) AND (foo.__row_no <= 3000)) OR (foo.gr
<> ALL ('{1,2,3}'::integer[])))
Rows Removed by Filter: 8326145
-> WindowAgg (cost=1580360.76..1755360.36 rows=9999977 width=16) (actual
time=8699.218..16369.833 rows=10000000 loops=1)
-> Sort (cost=1580360.76..1605360.71 rows=9999977 width=8) (actual
time=8699.210..10701.751 rows=10000000 loops=1)
Sort Key: test2.gr
Sort Method: external merge Disk: 176176kB
-> Seq Scan on test2 (cost=0.00..144247.77 rows=9999977
width=8) (actual time=0.017..2113.890 rows=10000000 loops=1)
Planning time: 0.181 ms
Execution time: 19471.310 ms
(10 rows)
{code}
*Solution 2*.
Case with larger subset selection: 19-20 seconds
{code}
explain analyze
select * from
(select * from test2 where gr = '1' order by random() limit 100000) as foo
UnION
(select * from test2 where gr= '2' order by random() limit 200000)
UNion
(select * from test2 where gr= '3' order by random() limit 300000)
UNiON
select * from test2 where gr not in ('1','2','3');
----------------------------------------------------------------------------------------------------------------------------------------------------------------
Unique (cost=2133228.64..2150256.12 rows=2270330 width=8) (actual
time=19347.622..20610.326 rows=2267855 loops=1)
-> Sort (cost=2133228.64..2138904.47 rows=2270330 width=8) (actual
time=19347.620..19854.822 rows=2267855 loops=1)
Sort Key: foo.id, foo.gr
Sort Method: external merge Disk: 39976kB
-> Append (cost=320764.65..1831461.56 rows=2270330 width=8) (actual
time=3527.894..18066.441 rows=2267855 loops=1)
-> Subquery Scan on foo (cost=320764.65..322014.65 rows=100000
width=8) (actual time=3527.894..3593.640 rows=100000 loops=1)
-> Limit (cost=320764.65..321014.65 rows=100000
width=16) (actual time=3527.891..3573.712 rows=100000 loops=1)
-> Sort (cost=320764.65..324947.97 rows=1673329
width=16) (actual time=3527.890..3561.467 rows=100000 loops=1)
Sort Key: (random())
Sort Method: external merge Disk: 42464kB
-> Seq Scan on test2 (cost=0.00..173431.04
rows=1673329 width=16) (actual time=0.031..1786.680 rows=1667258 loops=1)
Filter: (gr = 1)
Rows Removed by Filter: 8332742
-> Subquery Scan on "*SELECT* 2" (cost=652508.36..655008.36
rows=200000 width=8) (actual time=5818.155..5947.836 rows=200000 loops=1)
-> Limit (cost=652508.36..653008.36 rows=200000
width=16) (actual time=5818.152..5907.384 rows=200000 loops=1)
-> Sort (cost=652508.36..660839.18 rows=3332326
width=16) (actual time=5818.150..5882.304 rows=200000 loops=1)
Sort Key: (random())
Sort Method: external merge Disk: 84848kB
-> Seq Scan on test2 test2_1
(cost=0.00..177578.53 rows=3332326 width=16) (actual time=0.025..2010.535
rows=3332151 loops=1)
Filter: (gr = 2)
Rows Removed by Filter: 6667849
-> Subquery Scan on "*SELECT* 3" (cost=651237.57..654987.57
rows=300000 width=8) (actual time=5798.621..5985.493 rows=300000 loops=1)
-> Limit (cost=651237.57..651987.57 rows=300000
width=16) (actual time=5798.618..5927.191 rows=300000 loops=1)
-> Sort (cost=651237.57..659547.55 rows=3323992
width=16) (actual time=5798.617..5890.713 rows=300000 loops=1)
Sort Key: (random())
Sort Method: external merge Disk: 84856kB
-> Seq Scan on test2 test2_2
(cost=0.00..177557.69 rows=3323992 width=16) (actual time=0.021..1990.768
rows=3332736 loops=1)
Filter: (gr = 3)
Rows Removed by Filter: 6667264
-> Seq Scan on test2 test2_3 (cost=0.00..181747.68
rows=1670330 width=8) (actual time=0.028..2267.051 rows=1667855 loops=1)
Filter: (gr <> ALL ('{1,2,3}'::integer[]))
Rows Removed by Filter: 8332145
Planning time: 0.344 ms
Execution time: 20759.252 ms
(34 rows)
Time: 20760.563 ms (00:20.761)
{code}
Case with smaller subset selection: 13-14 seconds
{code}
explain analyze
select * from
(select * from test2 where gr = '1' order by random() limit 1000) as foo
UnION
(select * from test2 where gr= '2' order by random() limit 2000)
UNion
(select * from test2 where gr= '3' order by random() limit 3000)
UNiON
select * from test2 where gr not in ('1','2','3');
QUERY
PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------
Unique (cost=1423040.27..1435612.75 rows=1676330 width=8) (actual
time=12549.205..13420.110 rows=1673855 loops=1)
-> Sort (cost=1423040.27..1427231.10 rows=1676330 width=8) (actual
time=12549.194..12844.827 rows=1673855 loops=1)
Sort Key: foo.id, foo.gr
Sort Method: external merge Disk: 29496kB
-> Append (cost=265177.86..1226812.44 rows=1676330 width=8) (actual
time=2373.069..10965.185 rows=1673855 loops=1)
-> Subquery Scan on foo (cost=265177.86..265190.36 rows=1000
width=8) (actual time=2373.068..2373.567 rows=1000 loops=1)
-> Limit (cost=265177.86..265180.36 rows=1000 width=16)
(actual time=2373.066..2373.368 rows=1000 loops=1)
-> Sort (cost=265177.86..269361.18 rows=1673329
width=16) (actual time=2373.065..2373.223 rows=1000 loops=1)
Sort Key: (random())
Sort Method: top-N heapsort Memory: 141kB
-> Seq Scan on test2 (cost=0.00..173431.04
rows=1673329 width=16) (actual time=0.022..1884.731 rows=1667258 loops=1)
Filter: (gr = 1)
Rows Removed by Filter: 8332742
-> Subquery Scan on "*SELECT* 2" (cost=376948.00..376973.00
rows=2000 width=8) (actual time=2978.519..2979.438 rows=2000 loops=1)
-> Limit (cost=376948.00..376953.00 rows=2000 width=16)
(actual time=2978.517..2979.056 rows=2000 loops=1)
-> Sort (cost=376948.00..385278.81 rows=3332326
width=16) (actual time=2978.516..2978.830 rows=2000 loops=1)
Sort Key: (random())
Sort Method: top-N heapsort Memory: 280kB
-> Seq Scan on test2 test2_1
(cost=0.00..177578.53 rows=3332326 width=16) (actual time=0.024..2034.872
rows=3332151 loops=1)
Filter: (gr = 2)
Rows Removed by Filter: 6667849
-> Subquery Scan on "*SELECT* 3" (cost=386150.60..386188.10
rows=3000 width=8) (actual time=3098.020..3099.406 rows=3000 loops=1)
-> Limit (cost=386150.60..386158.10 rows=3000 width=16)
(actual time=3098.018..3098.834 rows=3000 loops=1)
-> Sort (cost=386150.60..394460.58 rows=3323992
width=16) (actual time=3098.017..3098.491 rows=3000 loops=1)
Sort Key: (random())
Sort Method: top-N heapsort Memory: 468kB
-> Seq Scan on test2 test2_2
(cost=0.00..177557.69 rows=3323992 width=16) (actual time=0.023..2117.887
rows=3332736 loops=1)
Filter: (gr = 3)
Rows Removed by Filter: 6667264
-> Seq Scan on test2 test2_3 (cost=0.00..181747.68
rows=1670330 width=8) (actual time=0.021..2314.743 rows=1667855 loops=1)
Filter: (gr <> ALL ('{1,2,3}'::integer[]))
Rows Removed by Filter: 8332145
Planning time: 0.295 ms
Execution time: 13525.515 ms
(34 rows)
{code}
> Balance datasets
> ----------------
>
> Key: MADLIB-1168
> URL: https://issues.apache.org/jira/browse/MADLIB-1168
> Project: Apache MADlib
> Issue Type: New Feature
> Components: Module: Sampling
> Reporter: Frank McQuillan
> Assignee: ssoni
> Fix For: v1.14
>
> Attachments: MADlib Balance Datasets Requirements.pdf,
> MADlib_Balance_Datasets_Requirements_v2.pdf
>
>
> From [1] here is the motivation behind balancing datasets:
> “Most classification algorithms will only perform optimally when the number
> of samples of each class is roughly the same. Highly skewed datasets, where
> the minority is heavily outnumbered by one or more classes, have proven to be
> a challenge while at the same time becoming more and more common.
> One way of addressing this issue is by re-sampling the dataset as to offset
> this imbalance with the hope of arriving at a more robust and fair decision
> boundary than you would otherwise.
> Re-sampling techniques can be divided in these categories:
> * Under-sampling the majority class(es).
> * Over-sampling the minority class.
> * Combining over- and under-sampling.
> * Create ensemble balanced sets.”
> There is an extensive literature on balancing datasets. The plan for MADlib
> in the initial phase is to offer basic functionality that can be extended in
> later phases based on feedback from users.
> Please see attached document for proposed scope of this story.
> References
> [1] imbalance-learn Python project
> http://contrib.scikit-learn.org/imbalanced-learn/stable/index.html
> https://github.com/scikit-learn-contrib/imbalanced-learn
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)