[jira] [Assigned] (MADLIB-986) Stratified sampling

2017-06-15 Thread Frank McQuillan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MADLIB-986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Frank McQuillan reassigned MADLIB-986:
--

Assignee: Orhan Kislal

> Stratified sampling
> ---
>
> Key: MADLIB-986
> URL: https://issues.apache.org/jira/browse/MADLIB-986
> Project: Apache MADlib
>  Issue Type: New Feature
>  Components: Module: Sampling
>Reporter: Frank McQuillan
>Assignee: Orhan Kislal
>  Labels: starter
> Fix For: v1.12
>
>
> Story
> As a data scientist, I want to sample a data table in proportion to the 
> number of rows in each group, so that I can do model building on the sampled 
> data sets.
> The MVP for this story is:
> * sample proportion is global, i.e., single fractional value between 0 and 1
> * allow option to sample without replacement (default) and sample with 
> replacement
> * allow option to output a subset of columns to the output table
> Proposed Interface
> {code}
> stratified_sample ( 
>source_table,
>output_table,
>proportion,
>grouping_col -- optional
>with_replacement, -- optional
>target_cols -- optional
> )
> source_table
> TEXT. The name of the table containing the input data.
> output_table
> TEXT. Name of output table that contains the sampled data. 
> The output table contains all the columns present in the source table 
> unless otherwise specified in the 'target_cols' parameter below.
> proportion
> FLOAT8 in the range (0,1).  The size of the sample in each stratum will 
> be taken in proportion to the size of the stratum. 
> grouping_col (optional)
> TEXT, default: NULL. A single column or a list of comma-separated columns
>  that defines how to stratify.  When this parameter is NULL, 
> no grouping is used so the sampling is non-stratified.
> with_replacement (optional) 
> BOOLEAN, default FALSE.  Determines whether to sample with replacement 
> or without replacement (default).
> target_cols (optional)
> TEXT, default NULL. A comma-separated list of columns to appear in the 
> 'output_table'. 
> If NULL, all columns from the 'source_table'  will appear in the 
> 'output_table'.
> {code}
> Other notes
> PDL tools is one example implementation of stratified sampling to review [2]. 
>  
> Please review existing MADlib sample functions [3] to see if these can be 
> used as a basis, or built on, for this stratified sample story. 
> References
> [2] PDL tools sampling modules incl stratified sampling
> http://pivotalsoftware.github.io/PDLTools/group__grp__sampling.html
> [3] Existing MADlib sample function
> http://madlib.incubator.apache.org/docs/latest/group__grp__sample.html
> [4] Pandas/Selecting Random Samples
> http://pandas.pydata.org/pandas-docs/stable/indexing.html#selecting-random-samples
> [5] General
> https://en.wikipedia.org/wiki/Stratified_sampling



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MADLIB-1124) Graph - HITS algorithm

2017-06-15 Thread Frank McQuillan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MADLIB-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Frank McQuillan updated MADLIB-1124:

Description: Dependency on https://issues.apache.org/jira/browse/MADLIB-1122

> Graph - HITS algorithm
> --
>
> Key: MADLIB-1124
> URL: https://issues.apache.org/jira/browse/MADLIB-1124
> Project: Apache MADlib
>  Issue Type: New Feature
>  Components: Module: Graph
>Reporter: Frank McQuillan
> Fix For: v2.0
>
>
> Dependency on https://issues.apache.org/jira/browse/MADLIB-1122



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MADLIB-1123) Graph - Eigenvector centrality

2017-06-15 Thread Frank McQuillan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MADLIB-1123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Frank McQuillan updated MADLIB-1123:

Description: Dependency on https://issues.apache.org/jira/browse/MADLIB-1122

> Graph - Eigenvector centrality
> --
>
> Key: MADLIB-1123
> URL: https://issues.apache.org/jira/browse/MADLIB-1123
> Project: Apache MADlib
>  Issue Type: New Feature
>  Components: Module: Graph
>Reporter: Frank McQuillan
> Fix For: v2.0
>
>
> Dependency on https://issues.apache.org/jira/browse/MADLIB-1122



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MADLIB-1122) Graph - refactor pagerank code

2017-06-15 Thread Frank McQuillan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MADLIB-1122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Frank McQuillan updated MADLIB-1122:

Description: Refactor pagerank code such that we can re-use some functions 
for Eigenvector centrality and HITS  (was: Refactor pagerank code such that we 
can re-use some functions for Eigenvector and HITS)

> Graph - refactor pagerank code
> --
>
> Key: MADLIB-1122
> URL: https://issues.apache.org/jira/browse/MADLIB-1122
> Project: Apache MADlib
>  Issue Type: Improvement
>  Components: Module: Graph
>Reporter: Frank McQuillan
> Fix For: v2.0
>
>
> Refactor pagerank code such that we can re-use some functions for Eigenvector 
> centrality and HITS



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MADLIB-1073) Graph - Phase 1 measures

2017-06-15 Thread Frank McQuillan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MADLIB-1073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Frank McQuillan updated MADLIB-1073:

Description: 
Follow on from  https://issues.apache.org/jira/browse/MADLIB-1072. Given that 
this story is complete, what measures can we compute from APSP?

Story

As a MADlib developer, I want to implement the following measures:

* Closeness (uses APSP)
* Graph diameter  (uses APSP)
* Average path length (uses APSP)
* In/out degrees

Acceptance

1) Interface defined
2) Design document updated
3) Documentation and on-line help
4) IC and functional tests
5) Scale tests


  was:
Follow on from  https://issues.apache.org/jira/browse/MADLIB-1072. Given that 
this story is complete, what measures can we compute from APSP?

Story

As a MADlib developer, I want to implement the following measures:

* Closeness (uses APSP)
* Graph diameter  (uses APSP)
* Average path length   
* In/out degrees

Acceptance

1) Interface defined
2) Design document updated
3) Documentation and on-line help
4) IC and functional tests
5) Scale tests



> Graph - Phase 1 measures
> 
>
> Key: MADLIB-1073
> URL: https://issues.apache.org/jira/browse/MADLIB-1073
> Project: Apache MADlib
>  Issue Type: New Feature
>  Components: Module: Graph
>Reporter: Frank McQuillan
> Fix For: v1.12
>
>
> Follow on from  https://issues.apache.org/jira/browse/MADLIB-1072. Given that 
> this story is complete, what measures can we compute from APSP?
> Story
> As a MADlib developer, I want to implement the following measures:
> * Closeness (uses APSP)
> * Graph diameter  (uses APSP)
> * Average path length (uses APSP)
> * In/out degrees
> Acceptance
> 1) Interface defined
> 2) Design document updated
> 3) Documentation and on-line help
> 4) IC and functional tests
> 5) Scale tests



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MADLIB-1126) Graph - clustering coefficient

2017-06-15 Thread Frank McQuillan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MADLIB-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Frank McQuillan updated MADLIB-1126:

Description: 
Uses triangle counting
https://issues.apache.org/jira/browse/MADLIB-1126

  was:Uses traiangl counting


> Graph - clustering coefficient
> --
>
> Key: MADLIB-1126
> URL: https://issues.apache.org/jira/browse/MADLIB-1126
> Project: Apache MADlib
>  Issue Type: New Feature
>  Components: Module: Graph
>Reporter: Frank McQuillan
> Fix For: v2.0
>
>
> Uses triangle counting
> https://issues.apache.org/jira/browse/MADLIB-1126



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MADLIB-1122) Graph - refactor pagerank code

2017-06-15 Thread Frank McQuillan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MADLIB-1122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Frank McQuillan updated MADLIB-1122:

Fix Version/s: (was: v1.12)
   v2.0

> Graph - refactor pagerank code
> --
>
> Key: MADLIB-1122
> URL: https://issues.apache.org/jira/browse/MADLIB-1122
> Project: Apache MADlib
>  Issue Type: Improvement
>  Components: Module: Graph
>Reporter: Frank McQuillan
> Fix For: v2.0
>
>
> Refactor pagerank code such that we can re-use some functions for Eigenvector 
> and HITS



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MADLIB-1121) Graph - Betweenness centrality

2017-06-15 Thread Frank McQuillan (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-1121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16051210#comment-16051210
 ] 

Frank McQuillan commented on MADLIB-1121:
-

Comment from [~okislal]


This (betweenness) might require a minor addition to the APSP code, length (the 
number of vertices) of the shortest path, that can be done in the same story.
Here is my idea for betweenness. Let's say APSP shows that the shortest path 
from 1 to 5 is 1,2,3,4,5. In this case the parent will be 4. This means we 
increment the betweenness count for 4 by 1 and set a multiplier value for the 
1->4 shortest path since every vertex in that path has to be counted one more 
time than usual. When we look at the path 1->4 and see the parent is 3, we 
increase its betweenness score by 2 and increase the multiplier for 1->3 path 
to 2 since those vertices should be counted for both of the previous iterations 
(1->4 and 1->5).

The algorithm should start from the path with the highest length and collect 
these values in max(|P|) iterations where |P| is the length of a given path. I 
am pretty sure that the reverse order is doable as well.

This implementation would be problematic on the HAWQ environment since it 
relied heavily on updates, so a completely independent algorithm might prove 
useful.

> Graph - Betweenness centrality
> --
>
> Key: MADLIB-1121
> URL: https://issues.apache.org/jira/browse/MADLIB-1121
> Project: Apache MADlib
>  Issue Type: New Feature
>  Components: Module: Graph
>Reporter: Frank McQuillan
> Fix For: v1.12
>
>
> Follow on from  https://issues.apache.org/jira/browse/MADLIB-1072. Given that 
> this story is complete, what measures can we compute from APSP?
> Betweenness may use the APSP result set, or implement as a separate 
> algorithm, e.g., [1].
> References
> [1] A Faster Algorithm for Betweenness Centrality
> http://www.algo.uni-konstanz.de/publications/b-fabc-01.pdf
> (used in Gephi)
> [2] Network properties
> https://en.wikipedia.org/wiki/Network_science#Network_properties



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MADLIB-1121) Graph - Betweenness centrality

2017-06-15 Thread Frank McQuillan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MADLIB-1121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Frank McQuillan updated MADLIB-1121:

Description: 
Follow on from  https://issues.apache.org/jira/browse/MADLIB-1072. Given that 
this story is complete, what measures can we compute from APSP?

Betweenness may use the APSP result set, or implement as a separate algorithm, 
e.g., [1].

References

[1] A Faster Algorithm for Betweenness Centrality
http://www.algo.uni-konstanz.de/publications/b-fabc-01.pdf
(used in Gephi)

[2] Network properties
https://en.wikipedia.org/wiki/Network_science#Network_properties


  was:
Follow on from  https://issues.apache.org/jira/browse/MADLIB-1072. Given that 
this story is complete, what measures can we compute from APSP?

Betweenness may use APSP result set, or implement as a separate algorithm, 
e.g., [1].

References

[1] A Faster Algorithm for Betweenness Centrality
http://www.algo.uni-konstanz.de/publications/b-fabc-01.pdf
(used in Gephi)

[2] Network properties
https://en.wikipedia.org/wiki/Network_science#Network_properties

Other comments

This (betweenness) might require a minor addition to the APSP code, length (the 
number of vertices) of the shortest path, that can be done in the same story.
Here is my idea for betweenness. Let's say APSP shows that the shortest path 
from 1 to 5 is 1,2,3,4,5. In this case the parent will be 4. This means we 
increment the betweenness count for 4 by 1 and set a multiplier value for the 
1->4 shortest path since every vertex in that path has to be counted one more 
time than usual. When we look at the path 1->4 and see the parent is 3, we 
increase its betweenness score by 2 and increase the multiplier for 1->3 path 
to 2 since those vertices should be counted for both of the previous iterations 
(1->4 and 1->5).
The algorithm should start from the path with the highest length and collect 
these values in max(|P|) iterations where |P| is the length of a given path. I 
am pretty sure that the reverse order is doable as well.
This implementation would be problematic on the HAWQ environment since it 
relied heavily on updates, so a completely independent algorithm might prove 
useful.



> Graph - Betweenness centrality
> --
>
> Key: MADLIB-1121
> URL: https://issues.apache.org/jira/browse/MADLIB-1121
> Project: Apache MADlib
>  Issue Type: New Feature
>  Components: Module: Graph
>Reporter: Frank McQuillan
> Fix For: v1.12
>
>
> Follow on from  https://issues.apache.org/jira/browse/MADLIB-1072. Given that 
> this story is complete, what measures can we compute from APSP?
> Betweenness may use the APSP result set, or implement as a separate 
> algorithm, e.g., [1].
> References
> [1] A Faster Algorithm for Betweenness Centrality
> http://www.algo.uni-konstanz.de/publications/b-fabc-01.pdf
> (used in Gephi)
> [2] Network properties
> https://en.wikipedia.org/wiki/Network_science#Network_properties



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MADLIB-1121) Graph - Betweenness centrality

2017-06-15 Thread Frank McQuillan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MADLIB-1121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Frank McQuillan updated MADLIB-1121:

Description: 
Follow on from  https://issues.apache.org/jira/browse/MADLIB-1072. Given that 
this story is complete, what measures can we compute from APSP?

Betweenness may use APSP result set, or implement as a separate algorithm, 
e.g., [1].

References

[1] A Faster Algorithm for Betweenness Centrality
http://www.algo.uni-konstanz.de/publications/b-fabc-01.pdf
(used in Gephi)

[2] Network properties
https://en.wikipedia.org/wiki/Network_science#Network_properties

Other comments

This (betweenness) might require a minor addition to the APSP code, length (the 
number of vertices) of the shortest path, that can be done in the same story.
Here is my idea for betweenness. Let's say APSP shows that the shortest path 
from 1 to 5 is 1,2,3,4,5. In this case the parent will be 4. This means we 
increment the betweenness count for 4 by 1 and set a multiplier value for the 
1->4 shortest path since every vertex in that path has to be counted one more 
time than usual. When we look at the path 1->4 and see the parent is 3, we 
increase its betweenness score by 2 and increase the multiplier for 1->3 path 
to 2 since those vertices should be counted for both of the previous iterations 
(1->4 and 1->5).
The algorithm should start from the path with the highest length and collect 
these values in max(|P|) iterations where |P| is the length of a given path. I 
am pretty sure that the reverse order is doable as well.
This implementation would be problematic on the HAWQ environment since it 
relied heavily on updates, so a completely independent algorithm might prove 
useful.


  was:
Follow on from  https://issues.apache.org/jira/browse/MADLIB-1072. Given that 
this story is complete, what measures can we compute from APSP?

Betweenness may use APSP result set, or implement as a separate algorithm, 
e.g., [1].

References

[1] A Faster Algorithm for Betweenness Centrality
http://www.algo.uni-konstanz.de/publications/b-fabc-01.pdf
(used in Gephi)

[2] Network properties
https://en.wikipedia.org/wiki/Network_science#Network_properties



> Graph - Betweenness centrality
> --
>
> Key: MADLIB-1121
> URL: https://issues.apache.org/jira/browse/MADLIB-1121
> Project: Apache MADlib
>  Issue Type: New Feature
>  Components: Module: Graph
>Reporter: Frank McQuillan
> Fix For: v1.12
>
>
> Follow on from  https://issues.apache.org/jira/browse/MADLIB-1072. Given that 
> this story is complete, what measures can we compute from APSP?
> Betweenness may use APSP result set, or implement as a separate algorithm, 
> e.g., [1].
> References
> [1] A Faster Algorithm for Betweenness Centrality
> http://www.algo.uni-konstanz.de/publications/b-fabc-01.pdf
> (used in Gephi)
> [2] Network properties
> https://en.wikipedia.org/wiki/Network_science#Network_properties
> Other comments
> This (betweenness) might require a minor addition to the APSP code, length 
> (the number of vertices) of the shortest path, that can be done in the same 
> story.
> Here is my idea for betweenness. Let's say APSP shows that the shortest path 
> from 1 to 5 is 1,2,3,4,5. In this case the parent will be 4. This means we 
> increment the betweenness count for 4 by 1 and set a multiplier value for the 
> 1->4 shortest path since every vertex in that path has to be counted one more 
> time than usual. When we look at the path 1->4 and see the parent is 3, we 
> increase its betweenness score by 2 and increase the multiplier for 1->3 path 
> to 2 since those vertices should be counted for both of the previous 
> iterations (1->4 and 1->5).
> The algorithm should start from the path with the highest length and collect 
> these values in max(|P|) iterations where |P| is the length of a given path. 
> I am pretty sure that the reverse order is doable as well.
> This implementation would be problematic on the HAWQ environment since it 
> relied heavily on updates, so a completely independent algorithm might prove 
> useful.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MADLIB-1126) Graph - clustering coefficient

2017-06-15 Thread Frank McQuillan (JIRA)
Frank McQuillan created MADLIB-1126:
---

 Summary: Graph - clustering coefficient
 Key: MADLIB-1126
 URL: https://issues.apache.org/jira/browse/MADLIB-1126
 Project: Apache MADlib
  Issue Type: New Feature
  Components: Module: Graph
Reporter: Frank McQuillan


Uses traiangl counting



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MADLIB-1125) Graph - triangle counting

2017-06-15 Thread Frank McQuillan (JIRA)
Frank McQuillan created MADLIB-1125:
---

 Summary: Graph - triangle counting
 Key: MADLIB-1125
 URL: https://issues.apache.org/jira/browse/MADLIB-1125
 Project: Apache MADlib
  Issue Type: New Feature
  Components: Module: Graph
Reporter: Frank McQuillan
 Fix For: v2.0






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MADLIB-1124) Graph - HITS algorithm

2017-06-15 Thread Frank McQuillan (JIRA)
Frank McQuillan created MADLIB-1124:
---

 Summary: Graph - HITS algorithm
 Key: MADLIB-1124
 URL: https://issues.apache.org/jira/browse/MADLIB-1124
 Project: Apache MADlib
  Issue Type: New Feature
  Components: Module: Graph
Reporter: Frank McQuillan
 Fix For: v1.12






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MADLIB-1121) Graph - Betweenness centrality

2017-06-15 Thread Frank McQuillan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MADLIB-1121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Frank McQuillan updated MADLIB-1121:

Description: 
Follow on from  https://issues.apache.org/jira/browse/MADLIB-1072. Given that 
this story is complete, what measures can we compute from APSP?

Betweenness may use APSP result set, or implement as a separate algorithm, 
e.g., [1].

References

[1] A Faster Algorithm for Betweenness Centrality
http://www.algo.uni-konstanz.de/publications/b-fabc-01.pdf
(used in Gephi)

[2] Network properties
https://en.wikipedia.org/wiki/Network_science#Network_properties


  was:
Follow on from  https://issues.apache.org/jira/browse/MADLIB-1072. Given that 
this story is complete, what measures can we compute from APSP?

Story

As a MADlib developer, I want to implement the following measures:

1 Betweenness (possibly use APSP)
2 Closeness (uses APSP)
3 Eigenvector centrality
4 Graph diameter  (uses APSP)
5 HITS  
6 Average path length   
7 In/out degrees
8 Clustering coefficient  (uses triangle/triplet counting)

Acceptance

1) Interface defined
2) Design document updated
3) Documentation and on-line help
4) IC and Tinc tests
5) Scale tests

References

[1] A Faster Algorithm for Betweenness Centrality
http://www.algo.uni-konstanz.de/publications/b-fabc-01.pdf
(used in Gephi)

[2] Network properties
https://en.wikipedia.org/wiki/Network_science#Network_properties



> Graph - Betweenness centrality
> --
>
> Key: MADLIB-1121
> URL: https://issues.apache.org/jira/browse/MADLIB-1121
> Project: Apache MADlib
>  Issue Type: New Feature
>  Components: Module: Graph
>Reporter: Frank McQuillan
> Fix For: v1.12
>
>
> Follow on from  https://issues.apache.org/jira/browse/MADLIB-1072. Given that 
> this story is complete, what measures can we compute from APSP?
> Betweenness may use APSP result set, or implement as a separate algorithm, 
> e.g., [1].
> References
> [1] A Faster Algorithm for Betweenness Centrality
> http://www.algo.uni-konstanz.de/publications/b-fabc-01.pdf
> (used in Gephi)
> [2] Network properties
> https://en.wikipedia.org/wiki/Network_science#Network_properties



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MADLIB-1121) Graph - Betweenness centrality

2017-06-15 Thread Frank McQuillan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MADLIB-1121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Frank McQuillan updated MADLIB-1121:

Summary: Graph - Betweenness centrality  (was: CLONE - Graph - Phase 1 
measures)

> Graph - Betweenness centrality
> --
>
> Key: MADLIB-1121
> URL: https://issues.apache.org/jira/browse/MADLIB-1121
> Project: Apache MADlib
>  Issue Type: New Feature
>  Components: Module: Graph
>Reporter: Frank McQuillan
> Fix For: v1.12
>
>
> Follow on from  https://issues.apache.org/jira/browse/MADLIB-1072. Given that 
> this story is complete, what measures can we compute from APSP?
> Story
> As a MADlib developer, I want to implement the following measures:
> 1 Betweenness (possibly use APSP)
> 2 Closeness (uses APSP)
> 3 Eigenvector centrality
> 4 Graph diameter  (uses APSP)
> 5 HITS
> 6 Average path length 
> 7 In/out degrees
> 8 Clustering coefficient  (uses triangle/triplet counting)
> Acceptance
> 1) Interface defined
> 2) Design document updated
> 3) Documentation and on-line help
> 4) IC and Tinc tests
> 5) Scale tests
> References
> [1] A Faster Algorithm for Betweenness Centrality
> http://www.algo.uni-konstanz.de/publications/b-fabc-01.pdf
> (used in Gephi)
> [2] Network properties
> https://en.wikipedia.org/wiki/Network_science#Network_properties



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MADLIB-1073) Graph - Phase 1 measures

2017-06-15 Thread Frank McQuillan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MADLIB-1073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Frank McQuillan updated MADLIB-1073:

Description: 
Follow on from  https://issues.apache.org/jira/browse/MADLIB-1072. Given that 
this story is complete, what measures can we compute from APSP?

Story

As a MADlib developer, I want to implement the following measures:

* Closeness (uses APSP)
* Graph diameter  (uses APSP)
* Average path length   
* In/out degrees

Acceptance

1) Interface defined
2) Design document updated
3) Documentation and on-line help
4) IC and functional tests
5) Scale tests


  was:
Follow on from  https://issues.apache.org/jira/browse/MADLIB-1072. Given that 
this story is complete, what measures can we compute from APSP?

Story

As a MADlib developer, I want to implement the following measures:

1 Betweenness (possibly use APSP)
2 Closeness (uses APSP)
3 Eigenvector centrality
4 Graph diameter  (uses APSP)
5 HITS  
6 Average path length   
7 In/out degrees
8 Clustering coefficient  (uses triangle/triplet counting)

Acceptance

1) Interface defined
2) Design document updated
3) Documentation and on-line help
4) IC and Tinc tests
5) Scale tests

References

[1] A Faster Algorithm for Betweenness Centrality
http://www.algo.uni-konstanz.de/publications/b-fabc-01.pdf
(used in Gephi)

[2] Network properties
https://en.wikipedia.org/wiki/Network_science#Network_properties



> Graph - Phase 1 measures
> 
>
> Key: MADLIB-1073
> URL: https://issues.apache.org/jira/browse/MADLIB-1073
> Project: Apache MADlib
>  Issue Type: New Feature
>  Components: Module: Graph
>Reporter: Frank McQuillan
> Fix For: v1.12
>
>
> Follow on from  https://issues.apache.org/jira/browse/MADLIB-1072. Given that 
> this story is complete, what measures can we compute from APSP?
> Story
> As a MADlib developer, I want to implement the following measures:
> * Closeness (uses APSP)
> * Graph diameter  (uses APSP)
> * Average path length 
> * In/out degrees
> Acceptance
> 1) Interface defined
> 2) Design document updated
> 3) Documentation and on-line help
> 4) IC and functional tests
> 5) Scale tests



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MADLIB-1121) CLONE - Graph - Phase 1 measures

2017-06-15 Thread Frank McQuillan (JIRA)
Frank McQuillan created MADLIB-1121:
---

 Summary: CLONE - Graph - Phase 1 measures
 Key: MADLIB-1121
 URL: https://issues.apache.org/jira/browse/MADLIB-1121
 Project: Apache MADlib
  Issue Type: New Feature
  Components: Module: Graph
Reporter: Frank McQuillan
 Fix For: v1.12


Follow on from  https://issues.apache.org/jira/browse/MADLIB-1072. Given that 
this story is complete, what measures can we compute from APSP?

Story

As a MADlib developer, I want to implement the following measures:

1 Betweenness (possibly use APSP)
2 Closeness (uses APSP)
3 Eigenvector centrality
4 Graph diameter  (uses APSP)
5 HITS  
6 Average path length   
7 In/out degrees
8 Clustering coefficient  (uses triangle/triplet counting)

Acceptance

1) Interface defined
2) Design document updated
3) Documentation and on-line help
4) IC and Tinc tests
5) Scale tests

References

[1] A Faster Algorithm for Betweenness Centrality
http://www.algo.uni-konstanz.de/publications/b-fabc-01.pdf
(used in Gephi)

[2] Network properties
https://en.wikipedia.org/wiki/Network_science#Network_properties




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MADLIB-1073) Graph - Phase 1 measures

2017-06-15 Thread Frank McQuillan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MADLIB-1073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Frank McQuillan updated MADLIB-1073:

Description: 
Follow on from  https://issues.apache.org/jira/browse/MADLIB-1072. Given that 
this story is complete, what measures can we compute from APSP?

Story

As a MADlib developer, I want to implement the following measures:

1 Betweenness (possibly use APSP)
2 Closeness (uses APSP)
3 Eigenvector centrality
4 Graph diameter  (uses APSP)
5 HITS  
6 Average path length   
7 In/out degrees
8 Clustering coefficient  (uses triangle/triplet counting)

Acceptance

1) Interface defined
2) Design document updated
3) Documentation and on-line help
4) IC and Tinc tests
5) Scale tests

References

[1] A Faster Algorithm for Betweenness Centrality
http://www.algo.uni-konstanz.de/publications/b-fabc-01.pdf
(used in Gephi)

[2] Network properties
https://en.wikipedia.org/wiki/Network_science#Network_properties


  was:
Follow on from  https://issues.apache.org/jira/browse/MADLIB-1072. Given that 
this story is complete, what measures can we compute from APSP?

Story

As a MADlib developer, I want to implement the following measures:

1 Betweenness  (uses APSP)
2 Closeness (uses APSP)
3 Eigenvector centrality  (uses APSP?)
4 Graph diameter  (uses APSP)
5 HITS  
6 Average path length   
7 In/out degrees
8 clustering  (uses triangle counting?)

Acceptance

1) Interface defined
2) Design document updated
3) Documentation and on-line help
4) IC and Tinc tests
5) Scale tests

References

[1] A Faster Algorithm for Betweenness Centrality
http://www.algo.uni-konstanz.de/publications/b-fabc-01.pdf
(used in Gephi)

[2] Network properties
https://en.wikipedia.org/wiki/Network_science#Network_properties



> Graph - Phase 1 measures
> 
>
> Key: MADLIB-1073
> URL: https://issues.apache.org/jira/browse/MADLIB-1073
> Project: Apache MADlib
>  Issue Type: New Feature
>  Components: Module: Graph
>Reporter: Frank McQuillan
> Fix For: v1.12
>
>
> Follow on from  https://issues.apache.org/jira/browse/MADLIB-1072. Given that 
> this story is complete, what measures can we compute from APSP?
> Story
> As a MADlib developer, I want to implement the following measures:
> 1 Betweenness (possibly use APSP)
> 2 Closeness (uses APSP)
> 3 Eigenvector centrality
> 4 Graph diameter  (uses APSP)
> 5 HITS
> 6 Average path length 
> 7 In/out degrees
> 8 Clustering coefficient  (uses triangle/triplet counting)
> Acceptance
> 1) Interface defined
> 2) Design document updated
> 3) Documentation and on-line help
> 4) IC and Tinc tests
> 5) Scale tests
> References
> [1] A Faster Algorithm for Betweenness Centrality
> http://www.algo.uni-konstanz.de/publications/b-fabc-01.pdf
> (used in Gephi)
> [2] Network properties
> https://en.wikipedia.org/wiki/Network_science#Network_properties



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Closed] (MADLIB-1099) Graph - all pairs shortest path grouping

2017-06-15 Thread Frank McQuillan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MADLIB-1099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Frank McQuillan closed MADLIB-1099.
---
Resolution: Fixed

> Graph - all pairs shortest path grouping
> 
>
> Key: MADLIB-1099
> URL: https://issues.apache.org/jira/browse/MADLIB-1099
> Project: Apache MADlib
>  Issue Type: Improvement
>  Components: Module: Graph
>Reporter: Frank McQuillan
>Assignee: Frank McQuillan
> Fix For: v1.12
>
> Attachments: screenshot-1.png
>
>
> Story
> As a MADlib developer, I want to implement grouping for all pairs shortest 
> path in an efficient and scaleable manner, so that I can compute other 
> measures that use APSP (eg., centrality measures).
> Acceptance
> 1) Grouping param in interface
> 2) Documentation and on-line help
> 3) IC and Tinc tests
> 4) Scale testing done



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MADLIB-1099) Graph - all pairs shortest path grouping

2017-06-15 Thread Frank McQuillan (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-1099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16051110#comment-16051110
 ] 

Frank McQuillan commented on MADLIB-1099:
-

I attached a screen shot of some perf testing; this is an expensive algo as we 
know.

This is what I wrote in the user docs for APSP:

"APSP is an expensive algorithm for run-time because it finds the shortest path 
between all nodes in the graph. The worst case run-time for this implementation 
is O(V^2 * E) where V is the number of vertices and E is the number of edges. 
In practice, run-time will be generally be much less than this, depending on 
the graph."



> Graph - all pairs shortest path grouping
> 
>
> Key: MADLIB-1099
> URL: https://issues.apache.org/jira/browse/MADLIB-1099
> Project: Apache MADlib
>  Issue Type: Improvement
>  Components: Module: Graph
>Reporter: Frank McQuillan
>Assignee: Frank McQuillan
> Fix For: v1.12
>
> Attachments: screenshot-1.png
>
>
> Story
> As a MADlib developer, I want to implement grouping for all pairs shortest 
> path in an efficient and scaleable manner, so that I can compute other 
> measures that use APSP (eg., centrality measures).
> Acceptance
> 1) Grouping param in interface
> 2) Documentation and on-line help
> 3) IC and Tinc tests
> 4) Scale testing done



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Reopened] (MADLIB-1099) Graph - all pairs shortest path grouping

2017-06-15 Thread Frank McQuillan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MADLIB-1099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Frank McQuillan reopened MADLIB-1099:
-
  Assignee: Frank McQuillan  (was: Orhan Kislal)

> Graph - all pairs shortest path grouping
> 
>
> Key: MADLIB-1099
> URL: https://issues.apache.org/jira/browse/MADLIB-1099
> Project: Apache MADlib
>  Issue Type: Improvement
>  Components: Module: Graph
>Reporter: Frank McQuillan
>Assignee: Frank McQuillan
> Fix For: v1.12
>
>
> Story
> As a MADlib developer, I want to implement grouping for all pairs shortest 
> path in an efficient and scaleable manner, so that I can compute other 
> measures that use APSP (eg., centrality measures).
> Acceptance
> 1) Grouping param in interface
> 2) Documentation and on-line help
> 3) IC and Tinc tests
> 4) Scale testing done



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Closed] (MADLIB-1099) Graph - all pairs shortest path grouping

2017-06-15 Thread Frank McQuillan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MADLIB-1099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Frank McQuillan closed MADLIB-1099.
---

> Graph - all pairs shortest path grouping
> 
>
> Key: MADLIB-1099
> URL: https://issues.apache.org/jira/browse/MADLIB-1099
> Project: Apache MADlib
>  Issue Type: Improvement
>  Components: Module: Graph
>Reporter: Frank McQuillan
>Assignee: Orhan Kislal
> Fix For: v1.12
>
>
> Story
> As a MADlib developer, I want to implement grouping for all pairs shortest 
> path in an efficient and scaleable manner, so that I can compute other 
> measures that use APSP (eg., centrality measures).
> Acceptance
> 1) Grouping param in interface
> 2) Documentation and on-line help
> 3) IC and Tinc tests
> 4) Scale testing done



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Closed] (MADLIB-1106) Graph - Get path for APSP

2017-06-15 Thread Frank McQuillan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MADLIB-1106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Frank McQuillan closed MADLIB-1106.
---

> Graph - Get path for APSP
> -
>
> Key: MADLIB-1106
> URL: https://issues.apache.org/jira/browse/MADLIB-1106
> Project: Apache MADlib
>  Issue Type: New Feature
>  Components: Module: Graph
>Reporter: Frank McQuillan
>Assignee: Orhan Kislal
>Priority: Minor
> Fix For: v1.12
>
>
> Follow on to APSP stories
> https://issues.apache.org/jira/browse/MADLIB-1072
> https://issues.apache.org/jira/browse/MADLIB-1099
> Get path function similar in nature to 
> http://madlib.incubator.apache.org/docs/latest/group__grp__sssp.html
> but you need to add a source vertex:
> graph_apsp_get_path( apsp_table,
>source_vertex,
> dest_vertex,
> path_table
>)
> sssp_table
> TEXT. Name of the table that contains the APSP output.
> source_vertex
> INTEGER. The vertex that will be the source of the desired path.
> dest_vertex
> INTEGER. The vertex that will be the destination of the desired path.
> path_table
> TEXT. Name of the output table that contains the path. It contains a row for 
> every group and has the following columns:
> * grouping_cols : The grouping columns given in the creation of the APSP 
> table. If there are no grouping columns, these columns will not exist and the 
> table will have a single row.
> * path (ARRAY) : The shortest path from the source vertex to the destination 
> vertex.
> Open questions:
> 1) Are there any other helper functions related to APSP that we should build?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Closed] (MADLIB-1072) Graph - all pairs shortest path

2017-06-15 Thread Frank McQuillan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MADLIB-1072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Frank McQuillan closed MADLIB-1072.
---

> Graph - all pairs shortest path
> ---
>
> Key: MADLIB-1072
> URL: https://issues.apache.org/jira/browse/MADLIB-1072
> Project: Apache MADlib
>  Issue Type: New Feature
>  Components: Module: Graph
>Reporter: Frank McQuillan
>Assignee: Orhan Kislal
> Fix For: v1.12
>
>
> Story
> As a MADlib developer, I want to implement all pairs shortest path in an 
> efficient and scaleable manner.
> Acceptance
> 1) Interface defined
> 2) Design document updated
> 3) Documentation and on-line help
> 4) IC and functional tests
> 5) Scale tests
> Refs
> [1] Floyd-Warshall is one possible implemention
> https://en.wikipedia.org/wiki/Floyd%E2%80%93Warshall_algorithm
> [2] All-Pairs Shortest Paths with Real Weights in O(n3/ logn) Time
> https://pdfs.semanticscholar.org/82fc/17e993f4bca9efd468b5087d9d02c8ca5f6d.pdf



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MADLIB-1072) Graph - all pairs shortest path

2017-06-15 Thread Frank McQuillan (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-1072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16051071#comment-16051071
 ] 

Frank McQuillan commented on MADLIB-1072:
-

pls refer to
https://github.com/apache/incubator-madlib/pull/136




> Graph - all pairs shortest path
> ---
>
> Key: MADLIB-1072
> URL: https://issues.apache.org/jira/browse/MADLIB-1072
> Project: Apache MADlib
>  Issue Type: New Feature
>  Components: Module: Graph
>Reporter: Frank McQuillan
>Assignee: Orhan Kislal
> Fix For: v1.12
>
>
> Story
> As a MADlib developer, I want to implement all pairs shortest path in an 
> efficient and scaleable manner.
> Acceptance
> 1) Interface defined
> 2) Design document updated
> 3) Documentation and on-line help
> 4) IC and functional tests
> 5) Scale tests
> Refs
> [1] Floyd-Warshall is one possible implemention
> https://en.wikipedia.org/wiki/Floyd%E2%80%93Warshall_algorithm
> [2] All-Pairs Shortest Paths with Real Weights in O(n3/ logn) Time
> https://pdfs.semanticscholar.org/82fc/17e993f4bca9efd468b5087d9d02c8ca5f6d.pdf



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MADLIB-1102) Graph - Breadth First Search / Traversal

2017-06-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-1102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16051020#comment-16051020
 ] 

ASF GitHub Bot commented on MADLIB-1102:


GitHub user rashmi815 opened a pull request:

https://github.com/apache/incubator-madlib/pull/141

Graph: Add Breadth-first Search algorithm with grouping support

JIRA: MADLIB-1102

Graph: Add Breadth-first Search algorithm with grouping support.
This algorithm searches or traverses connected nodes in a graph in 
breadth-first order starting at a user-specified origin node.

Documentation and install-check to follow in subsequent commits

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rashmi815/incubator-madlib feature/bfs_v02

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-madlib/pull/141.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #141


commit 5069dd95d25d1f62190913b13a207e86d62cef13
Author: Rashmi Raghu 
Date:   2017-06-15T20:25:31Z

JIRA: MADLIB-1102

Add Breadth-first Search algorithm with grouping support.
This algorithm searches or traverses connected nodes in a graph in 
breadth-first order starting at a user-specified origin node.

Documentation and install-check to follow in subsequent commits




> Graph - Breadth First Search / Traversal
> 
>
> Key: MADLIB-1102
> URL: https://issues.apache.org/jira/browse/MADLIB-1102
> Project: Apache MADlib
>  Issue Type: New Feature
>  Components: Module: Graph
>Reporter: Rashmi Raghu
>Assignee: Rashmi Raghu
> Fix For: v1.12
>
>
> Story
> As a MADlib user and developer, I want to implement Breadth First Search / 
> Traversal for a graph. BFS is also a core part of the connected components 
> graph algorithm.
> Accpetance:
> 1) Interface defined
> 2) Design doc updated
> 3) Documentation and on-line help
> 4) IC and functional tests
> 5) Scale tests
> References:
> [0] [https://en.wikipedia.org/wiki/Breadth-first_search] 
> "Breadth-first search (BFS) is an algorithm for traversing or searching tree 
> or graph data structures. It starts at the tree root (or some arbitrary node 
> of a graph, sometimes referred to as a 'search key'[1]) and explores the 
> neighbor nodes first, before moving to the next level neighbors."
> [1] [http://www.geeksforgeeks.org/breadth-first-traversal-for-a-graph/]



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (MADLIB-1071) Graph - weakly connect components

2017-06-15 Thread Nandish Jayaram (JIRA)

 [ 
https://issues.apache.org/jira/browse/MADLIB-1071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandish Jayaram reassigned MADLIB-1071:
---

Assignee: Nandish Jayaram  (was: Rashmi Raghu)

> Graph - weakly connect components
> -
>
> Key: MADLIB-1071
> URL: https://issues.apache.org/jira/browse/MADLIB-1071
> Project: Apache MADlib
>  Issue Type: New Feature
>  Components: Module: Graph
>Reporter: Frank McQuillan
>Assignee: Nandish Jayaram
> Fix For: v1.12
>
>
> Story
> As a MADlib developer, I want to implement  weakly connected components (ref 
> [0]) in an efficient and scaleable way.
> Acceptance
> 1) Interface defined
> 2) Design document updated
> 3) Documentation and on-line help
> 4) IC and functional tests
> 5) Scale tests
> References
> [0] https://en.wikipedia.org/wiki/Connectivity_(graph_theory)
> "A directed graph is called weakly connected if replacing all of its directed 
> edges with undirected edges produces a connected (undirected) graph."
> [1] Grails paper
> http://pages.cs.wisc.edu/~jignesh/publ/Grail.pdf
> [2] Grails deck
> http://pages.cs.wisc.edu/~jignesh/publ/Grail-slides.pdf
> [3] Grails repo with page rank example
> https://github.com/UWQuickstep/Grail
> https://github.com/UWQuickstep/Grail/blob/master/analytics/wcc.sql
> (weakly connected components implementation)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Resolved] (MADLIB-1106) Graph - Get path for APSP

2017-06-15 Thread Frank McQuillan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MADLIB-1106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Frank McQuillan resolved MADLIB-1106.
-
Resolution: Fixed

> Graph - Get path for APSP
> -
>
> Key: MADLIB-1106
> URL: https://issues.apache.org/jira/browse/MADLIB-1106
> Project: Apache MADlib
>  Issue Type: New Feature
>  Components: Module: Graph
>Reporter: Frank McQuillan
>Assignee: Orhan Kislal
>Priority: Minor
> Fix For: v1.12
>
>
> Follow on to APSP stories
> https://issues.apache.org/jira/browse/MADLIB-1072
> https://issues.apache.org/jira/browse/MADLIB-1099
> Get path function similar in nature to 
> http://madlib.incubator.apache.org/docs/latest/group__grp__sssp.html
> but you need to add a source vertex:
> graph_apsp_get_path( apsp_table,
>source_vertex,
> dest_vertex,
> path_table
>)
> sssp_table
> TEXT. Name of the table that contains the APSP output.
> source_vertex
> INTEGER. The vertex that will be the source of the desired path.
> dest_vertex
> INTEGER. The vertex that will be the destination of the desired path.
> path_table
> TEXT. Name of the output table that contains the path. It contains a row for 
> every group and has the following columns:
> * grouping_cols : The grouping columns given in the creation of the APSP 
> table. If there are no grouping columns, these columns will not exist and the 
> table will have a single row.
> * path (ARRAY) : The shortest path from the source vertex to the destination 
> vertex.
> Open questions:
> 1) Are there any other helper functions related to APSP that we should build?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Resolved] (MADLIB-1099) Graph - all pairs shortest path grouping

2017-06-15 Thread Frank McQuillan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MADLIB-1099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Frank McQuillan resolved MADLIB-1099.
-
Resolution: Fixed

> Graph - all pairs shortest path grouping
> 
>
> Key: MADLIB-1099
> URL: https://issues.apache.org/jira/browse/MADLIB-1099
> Project: Apache MADlib
>  Issue Type: Improvement
>  Components: Module: Graph
>Reporter: Frank McQuillan
>Assignee: Orhan Kislal
> Fix For: v1.12
>
>
> Story
> As a MADlib developer, I want to implement grouping for all pairs shortest 
> path in an efficient and scaleable manner, so that I can compute other 
> measures that use APSP (eg., centrality measures).
> Acceptance
> 1) Grouping param in interface
> 2) Documentation and on-line help
> 3) IC and Tinc tests
> 4) Scale testing done



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)