[
https://issues.apache.org/jira/browse/MADLIB-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jingyi Mei updated MADLIB-1229:
-------------------------------
Description:
In MADlib 1.13, if I run the following query:
{code:java}
DROP TABLE IF EXISTS vertex, "EDGE";
CREATE TABLE vertex(
id INTEGER
);
CREATE TABLE "EDGE"(
src INTEGER,
dest INTEGER,
user_id INTEGER
);
INSERT INTO vertex VALUES
(0),
(1),
(2);
INSERT INTO "EDGE" VALUES
(0, 1, 1),
(0, 2, 1),
(1, 2, 1),
(2, 1, 1),
(0, 1, 2);
DROP TABLE IF EXISTS pagerank_ppr_grp_out;
DROP TABLE IF EXISTS pagerank_ppr_grp_out_summary;
SELECT pagerank(
    'vertex',                -- Vertex table
    'id',                    -- Vertex id column
    '"EDGE"',                -- Edge table
    'src=src, dest=dest',    -- Edge args
    'pagerank_ppr_grp_out',  -- Output table of PageRank
    NULL,                    -- Default damping factor (0.85)
    NULL,                    -- Default max iters (100)
    NULL,                    -- Default threshold
    'user_id'                -- Grouping column
);{code}
I get the following result:
{code:java}
madlib=# select * from pagerank_ppr_grp_out order by user_id, id;
 user_id | id |     pagerank
---------+----+-------------------
       1 |  0 |              0.05
       1 |  0 |              0.05
       1 |  1 | 0.614906399170753
       1 |  2 | 0.614906399170753
       2 |  0 |             0.075
       2 |  1 |           0.13875
(6 rows){code}
where the row user_id=1, id=0, pagerank=0.05 appears twice.
We should fix this so that the output table contains only distinct rows.
In addition, for user_id=1, the PageRank scores should sum to 1. The scores for
user_id=1, id=1 and for user_id=1, id=2 should each be 0.475 (0.05 + 0.475 +
0.475 = 1). We should correct this calculation too.
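As a sanity check on the expected values, here is a minimal sketch of plain power-iteration PageRank over the user_id=1 edges. It is not MADlib's implementation: the Python function name `pagerank`, the `threshold` value, and the uniform starting vector are assumptions made for illustration, and the sketch assumes every vertex has at least one outgoing edge (true for this group).

```python
def pagerank(vertices, edges, damping=0.85, max_iter=100, threshold=1e-9):
    """Illustrative power-iteration PageRank (hypothetical helper, not MADlib code).

    Assumes every vertex has at least one outgoing edge.
    """
    n = len(vertices)
    out_degree = {v: 0 for v in vertices}
    for src, _ in edges:
        out_degree[src] += 1

    # Start from a uniform distribution (an assumption of this sketch).
    rank = {v: 1.0 / n for v in vertices}
    for _ in range(max_iter):
        # Every vertex receives the teleport share (1 - damping) / n.
        new_rank = {v: (1.0 - damping) / n for v in vertices}
        # Each edge passes an equal share of its source's rank to its destination.
        for src, dest in edges:
            new_rank[dest] += damping * rank[src] / out_degree[src]
        delta = max(abs(new_rank[v] - rank[v]) for v in vertices)
        rank = new_rank
        if delta < threshold:
            break
    return rank


# Rows of the "EDGE" table with user_id = 1, as (src, dest) pairs.
edges_user_1 = [(0, 1), (0, 2), (1, 2), (2, 1)]
scores = pagerank([0, 1, 2], edges_user_1)

for vertex_id in sorted(scores):
    print(vertex_id, round(scores[vertex_id], 6))
print("sum:", round(sum(scores.values()), 6))
```

Under these assumptions the sketch gives one score per vertex, roughly 0.05 for id=0 and 0.475 for each of id=1 and id=2, which sum to 1.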
was:
In MADlib 1.13, if I run the following query:
{code:java}
DROP TABLE IF EXISTS vertex, "EDGE";
CREATE TABLE vertex(
id INTEGER
);
CREATE TABLE "EDGE"(
src INTEGER,
dest INTEGER,
user_id INTEGER
);
INSERT INTO vertex VALUES
(0),
(1),
(2);
INSERT INTO "EDGE" VALUES
(0, 1, 1),
(0, 2, 1),
(1, 2, 1),
(2, 1, 1),
(0, 1, 2);
DROP TABLE IF EXISTS pagerank_ppr_grp_out;
DROP TABLE IF EXISTS pagerank_ppr_grp_out_summary;
SELECT pagerank(
    'vertex',                -- Vertex table
    'id',                    -- Vertex id column
    '"EDGE"',                -- Edge table
    'src=src, dest=dest',    -- Edge args
    'pagerank_ppr_grp_out',  -- Output table of PageRank
    NULL,                    -- Default damping factor (0.85)
    NULL,                    -- Default max iters (100)
    NULL,                    -- Default threshold
    'user_id'                -- Grouping column
);{code}
I get the following result:
{code:java}
madlib=# select * from pagerank_ppr_grp_out order by user_id, id;
 user_id | id |     pagerank
---------+----+-------------------
       1 |  0 |              0.05
       1 |  0 |              0.05
       1 |  1 | 0.614906399170753
       1 |  2 | 0.614906399170753
       2 |  0 |             0.075
       2 |  1 |           0.13875
(6 rows){code}
where the row user_id=1, id=0, pagerank=0.05 appears twice.
We should fix this so that the output table contains only distinct rows.
> Duplicated result in PageRank output table with grouping
> --------------------------------------------------------
>
> Key: MADLIB-1229
> URL: https://issues.apache.org/jira/browse/MADLIB-1229
> Project: Apache MADlib
> Issue Type: Bug
> Components: Module: Graph
> Reporter: Jingyi Mei
> Assignee: Himanshu Pandey
> Priority: Minor
> Fix For: v1.15
>
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)