[jira] [Commented] (MADLIB-1146) Elastic Net fails when used without normalization with grouping

2017-08-18 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-1146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16133934#comment-16133934
 ] 

ASF GitHub Bot commented on MADLIB-1146:


Github user asfgit closed the pull request at:

https://github.com/apache/incubator-madlib/pull/172


> Elastic Net fails when used without normalization with grouping
> ---
>
> Key: MADLIB-1146
> URL: https://issues.apache.org/jira/browse/MADLIB-1146
> Project: Apache MADlib
>  Issue Type: Bug
>Reporter: Cooper Sloan
>Assignee: Nandish Jayaram
>Priority: Minor
>
> ```
> DROP TABLE IF EXISTS house_en,house_en_summary;
> SELECT madlib.elastic_net_train(
> 'lin_housing_wi',
> 'house_en',
> 'y',
> 'x',
> 'gaussian',
> 0.5,
> 0.5,
> False,
> 'grp_by_col',
> 'fista',
> '',
> NULL,
> 1,
> 1e-6
> );
> psql:/Users/csloan/elastic_net.sql:1: NOTICE:  table "house_en" does not 
> exist, skipping
> psql:/Users/csloan/elastic_net.sql:1: NOTICE:  table "house_en_summary" does 
> not exist, skipping
> DROP TABLE
> psql:/Users/csloan/elastic_net.sql:17: ERROR:  KeyError: 'select_grp'
> CONTEXT:  Traceback (most recent call last):
>   PL/Python function "elastic_net_train", line 27, in 
> excluded, max_iter, tolerance)
>   PL/Python function "elastic_net_train", line 467, in elastic_net_train
>   PL/Python function "elastic_net_train", line 502, in 
> _internal_elastic_net_train
>   PL/Python function "elastic_net_train", line 24, in 
> _elastic_net_gaussian_fista_train
>   PL/Python function "elastic_net_train", line 171, in 
> _elastic_net_fista_train
>   PL/Python function "elastic_net_train", line 297, in 
> _elastic_net_fista_train_compute
>   PL/Python function "elastic_net_train", line 83, in 
> _elastic_net_generate_result
> PL/Python function "elastic_net_train"
> ```
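
For reference, on a build that does not yet include the fix, a possible workaround is to leave data standardization enabled, since the failure is specific to the non-normalized path with grouping. Below is a sketch reusing the call from the report; the comments label the positional arguments of elastic_net_train, the eighth of which is the standardization flag:

{code}
DROP TABLE IF EXISTS house_en, house_en_summary;
SELECT madlib.elastic_net_train(
    'lin_housing_wi',  -- source table
    'house_en',        -- output table
    'y',               -- dependent variable
    'x',               -- independent variables
    'gaussian',        -- regression family
    0.5,               -- alpha
    0.5,               -- lambda
    TRUE,              -- standardize: keep normalization ON to avoid the bug
    'grp_by_col',      -- grouping column
    'fista',           -- optimizer
    '',                -- optimizer params
    NULL,              -- excluded columns
    1,                 -- max iterations
    1e-6               -- tolerance
);
{code}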





[jira] [Commented] (MADLIB-1119) Train-test split

2017-08-18 Thread Orhan Kislal (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-1119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16133806#comment-16133806
 ] 

Orhan Kislal commented on MADLIB-1119:
--

Created a PR to address both of these issues: 
https://github.com/apache/incubator-madlib/pull/174

> Train-test split
> 
>
> Key: MADLIB-1119
> URL: https://issues.apache.org/jira/browse/MADLIB-1119
> Project: Apache MADlib
>  Issue Type: New Feature
>  Components: Module: Sampling
>Reporter: Frank McQuillan
>Assignee: Orhan Kislal
> Fix For: v1.12
>
>
> Context
> See related story on stratified sampling 
> https://issues.apache.org/jira/browse/MADLIB-986
> Story
> As a data scientist, I want to split a data table into training and test sets 
> including grouping support, so that I can use the resulting sets for model 
> development in the usual way.
> The MVP for this story is:
> * support split by group
> * allow option to sample without replacement (default) and sample with 
> replacement
> * allow option to output a subset of columns to the output table
> * output one table with a new test/train column, or optionally two separate 
> tables
> Proposed Interface
> {code}
> train_test_split (
>     source_table,
>     output_table,
>     train_proportion,
>     test_proportion,         -- optional
>     grouping_col,            -- optional
>     with_replacement,        -- optional
>     target_cols,             -- optional
>     separate_output_tables   -- optional
> )
> source_table
> TEXT. The name of the table containing the input data.
> output_table
> TEXT. Name of the output table. A new INTEGER column called 'split' will be 
> appended on the right, with value 1 for rows in the train set and 0 for rows 
> in the test set, unless the 'separate_output_tables' parameter below is TRUE, 
> in which case two output tables will be created using the 'output_table' name 
> with the suffixes '_train' and '_test'. The output table contains all the 
> columns present in the source table unless otherwise specified in the 
> 'target_cols' parameter below.
> train_proportion
> FLOAT8 in the range (0,1).  Proportion of the dataset to include 
> in the train split.  If the 'grouping_col' parameter is specified below, 
> each group will be sampled independently using the 
> train proportion, i.e., in a stratified fashion.
> test_proportion (optional)
> FLOAT8 in the range (0,1).  Proportion of the dataset to include 
> in the test split.  Default is the complement of the train 
> proportion (1 - 'train_proportion').  If the 'grouping_col' 
> parameter is specified below, each group will be sampled 
> independently using the test proportion, 
> i.e., in a stratified fashion.
> grouping_col (optional)
> TEXT, default: NULL. A single column or a list of comma-separated columns
>  that defines how to stratify.  When this parameter is NULL, 
> the train-test split is not stratified.
> with_replacement (optional) 
> BOOLEAN, default FALSE.  Determines whether to sample with replacement 
> or without replacement (default).
> target_cols (optional)
> TEXT, default NULL. A comma-separated list of columns to appear in the 
> 'output_table'. 
> If NULL, all columns from the 'source_table'  will appear in the 
> 'output_table'.
> separate_output_tables (optional)
> BOOLEAN, default FALSE.  If TRUE, two output tables will be created using 
> the 'output_table' name with the suffixes '_train' and '_test'.
> {code}
> Other notes
> 1) PDL tools is one example implementation of train/test split to review [2]. 
>  
> 2) From Rahul Iyer: "The goal of having both train and test is to provide 
> subsampling and a train/test split in one function. 
> For example, if train_size = 0.4 and test_size = 0.1, then only half the input 
> data will be output. This is tremendously useful in situations where a user 
> wants to prototype/evaluate a couple of models on a smaller i.i.d. sample 
> before running them on the whole dataset. 
> Under no circumstances would the train_size + test_size be allowed to be more 
> than 1. The implementation will also ensure that there are no "leaks" (leak = 
> same data occurring in both train and test) as that defeats the whole purpose 
> of building an independent dataset for model evaluation. 
> Of course, the interface does get a little complex and could confuse users. 
> Explanatory documentation with examples is the only solution to that problem. 
> The alternative to having both sizes in one function is to run a subsample 
> function (using various sampling methods) and then perform the train_test 
> split. The downside to this approach is it requires writing an intermediate 
> table to disk (inefficient)."

[jira] [Commented] (MADLIB-1073) Graph - Phase 1 measures

2017-08-18 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-1073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16133772#comment-16133772
 ] 

ASF GitHub Bot commented on MADLIB-1073:


GitHub user iyerr3 opened a pull request:

https://github.com/apache/incubator-madlib/pull/173

Measures: Use outer join for in-out degrees computation

JIRA: MADLIB-1073

Commit 06788cc added the graph measure functions described in the JIRA.
This commit fixes a bug from that commit in the graph_vertex_degrees
function. The bug led to results not containing vertices that
either had 0 in-degree or out-degree.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/iyerr3/incubator-madlib bugfix/in_out_degrees

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-madlib/pull/173.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #173


commit f3697fdaaebeb851dfa23a0503c2c143c54f7f69
Author: Rahul Iyer 
Date:   2017-08-18T23:19:39Z

Measures: Use outer join for in-out degrees computation

JIRA: MADLIB-1073

Commit 06788cc added the graph measure functions described in the JIRA.
This commit fixes a bug from that commit in the graph_vertex_degrees
function. The bug led to results not containing vertices that
either had 0 in-degree or out-degree.




> Graph - Phase 1 measures
> 
>
> Key: MADLIB-1073
> URL: https://issues.apache.org/jira/browse/MADLIB-1073
> Project: Apache MADlib
>  Issue Type: New Feature
>  Components: Module: Graph
>Reporter: Frank McQuillan
>Assignee: Rahul Iyer
> Fix For: v1.12
>
> Attachments: Graph Measures Interfaces - JIRA.pdf
>
>
> Follow on from  https://issues.apache.org/jira/browse/MADLIB-1072. Given that 
> this story is complete, what measures can we compute from APSP?
> Story
> As a MADlib developer, I want to implement the following measures:
> * Closeness (uses APSP)
> * Graph diameter  (uses APSP)
> * Average path length (uses APSP)
> * In/out degrees
> Acceptance
> 1) Interface defined
> 2) Design document updated
> 3) Documentation and on-line help
> 4) IC and functional tests
> 5) Scale tests





[jira] [Resolved] (MADLIB-1073) Graph - Phase 1 measures

2017-08-18 Thread Frank McQuillan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MADLIB-1073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Frank McQuillan resolved MADLIB-1073.
-
Resolution: Fixed

> Graph - Phase 1 measures
> 
>
> Key: MADLIB-1073
> URL: https://issues.apache.org/jira/browse/MADLIB-1073
> Project: Apache MADlib
>  Issue Type: New Feature
>  Components: Module: Graph
>Reporter: Frank McQuillan
>Assignee: Rahul Iyer
> Fix For: v1.12
>
> Attachments: Graph Measures Interfaces - JIRA.pdf
>
>
> Follow on from  https://issues.apache.org/jira/browse/MADLIB-1072. Given that 
> this story is complete, what measures can we compute from APSP?
> Story
> As a MADlib developer, I want to implement the following measures:
> * Closeness (uses APSP)
> * Graph diameter  (uses APSP)
> * Average path length (uses APSP)
> * In/out degrees
> Acceptance
> 1) Interface defined
> 2) Design document updated
> 3) Documentation and on-line help
> 4) IC and functional tests
> 5) Scale tests





[jira] [Reopened] (MADLIB-1119) Train-test split

2017-08-18 Thread Frank McQuillan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MADLIB-1119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Frank McQuillan reopened MADLIB-1119:
-

> Train-test split
> 
>
> Key: MADLIB-1119
> URL: https://issues.apache.org/jira/browse/MADLIB-1119
> Project: Apache MADlib
>  Issue Type: New Feature
>  Components: Module: Sampling
>Reporter: Frank McQuillan
>Assignee: Orhan Kislal
> Fix For: v1.12
>
>
> Context
> See related story on stratified sampling 
> https://issues.apache.org/jira/browse/MADLIB-986
> Story
> As a data scientist, I want to split a data table into training and test sets 
> including grouping support, so that I can use the resulting sets for model 
> development in the usual way.
> The MVP for this story is:
> * support split by group
> * allow option to sample without replacement (default) and sample with 
> replacement
> * allow option to output a subset of columns to the output table
> * output one table with a new test/train column, or optionally two separate 
> tables
> Proposed Interface
> {code}
> train_test_split (
>     source_table,
>     output_table,
>     train_proportion,
>     test_proportion,         -- optional
>     grouping_col,            -- optional
>     with_replacement,        -- optional
>     target_cols,             -- optional
>     separate_output_tables   -- optional
> )
> source_table
> TEXT. The name of the table containing the input data.
> output_table
> TEXT. Name of the output table. A new INTEGER column called 'split' will be 
> appended on the right, with value 1 for rows in the train set and 0 for rows 
> in the test set, unless the 'separate_output_tables' parameter below is TRUE, 
> in which case two output tables will be created using the 'output_table' name 
> with the suffixes '_train' and '_test'. The output table contains all the 
> columns present in the source table unless otherwise specified in the 
> 'target_cols' parameter below.
> train_proportion
> FLOAT8 in the range (0,1).  Proportion of the dataset to include 
> in the train split.  If the 'grouping_col' parameter is specified below, 
> each group will be sampled independently using the 
> train proportion, i.e., in a stratified fashion.
> test_proportion (optional)
> FLOAT8 in the range (0,1).  Proportion of the dataset to include 
> in the test split.  Default is the complement of the train 
> proportion (1 - 'train_proportion').  If the 'grouping_col' 
> parameter is specified below, each group will be sampled 
> independently using the test proportion, 
> i.e., in a stratified fashion.
> grouping_col (optional)
> TEXT, default: NULL. A single column or a list of comma-separated columns
>  that defines how to stratify.  When this parameter is NULL, 
> the train-test split is not stratified.
> with_replacement (optional) 
> BOOLEAN, default FALSE.  Determines whether to sample with replacement 
> or without replacement (default).
> target_cols (optional)
> TEXT, default NULL. A comma-separated list of columns to appear in the 
> 'output_table'. 
> If NULL, all columns from the 'source_table'  will appear in the 
> 'output_table'.
> separate_output_tables (optional)
> BOOLEAN, default FALSE.  If TRUE, two output tables will be created using 
> the 'output_table' name with the suffixes '_train' and '_test'.
> {code}
> Other notes
> 1) PDL tools is one example implementation of train/test split to review [2]. 
>  
> 2) From Rahul Iyer: "The goal of having both train and test is to provide 
> subsampling and a train/test split in one function. 
> For example, if train_size = 0.4 and test_size = 0.1, then only half the input 
> data will be output. This is tremendously useful in situations where a user 
> wants to prototype/evaluate a couple of models on a smaller i.i.d. sample 
> before running them on the whole dataset. 
> Under no circumstances would the train_size + test_size be allowed to be more 
> than 1. The implementation will also ensure that there are no "leaks" (leak = 
> same data occurring in both train and test) as that defeats the whole purpose 
> of building an independent dataset for model evaluation. 
> Of course, the interface does get a little complex and could confuse users. 
> Explanatory documentation with examples is the only solution to that problem. 
> The alternative to having both sizes in one function is to run a subsample 
> function (using various sampling methods) and then perform the train_test 
> split. The downside to this approach is it requires writing an intermediate 
> table to disk (inefficient). "
> Acceptance
> 1) Code, user docs, on-line docs, IC, Tinc tests complete.
> 2) Radar green for all supported 

[jira] [Commented] (MADLIB-1119) Train-test split

2017-08-18 Thread Frank McQuillan (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-1119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16133739#comment-16133739
 ] 

Frank McQuillan commented on MADLIB-1119:
-

Re-opening

1) Please rename this module to train-test split, not test-train split, and 
update the user docs.  This is what scikit-learn does, and it is the more 
common term in the industry.

2) For test_proportion:
* It should be an optional parameter (it is mandatory currently).
* The default should be the complement of the train proportion 
(1 - 'train_proportion'); this does not seem to be implemented. See the 
sketch below.
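
For illustration, here is a sketch of how the optional default and the stratified 
split would look under the proposed interface quoted below. The function is not 
released under this name, and the 'houses' table and 'state' column are 
hypothetical names.

{code}
-- Sketch only, based on the proposed train_test_split interface in this issue.

-- Explicit test proportion:
SELECT madlib.train_test_split('houses', 'houses_split', 0.8, 0.2);

-- test_proportion omitted: should default to the complement, 1 - 0.8 = 0.2:
SELECT madlib.train_test_split('houses', 'houses_split', 0.8);

-- Stratified by 'state', written to two separate output tables
-- ('houses_split_train' and 'houses_split_test'):
SELECT madlib.train_test_split('houses', 'houses_split', 0.8, NULL,
                               'state', FALSE, NULL, TRUE);
{code}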

Thanks

> Train-test split
> 
>
> Key: MADLIB-1119
> URL: https://issues.apache.org/jira/browse/MADLIB-1119
> Project: Apache MADlib
>  Issue Type: New Feature
>  Components: Module: Sampling
>Reporter: Frank McQuillan
>Assignee: Orhan Kislal
> Fix For: v1.12
>
>
> Context
> See related story on stratified sampling 
> https://issues.apache.org/jira/browse/MADLIB-986
> Story
> As a data scientist, I want to split a data table into training and test sets 
> including grouping support, so that I can use the resulting sets for model 
> development in the usual way.
> The MVP for this story is:
> * support split by group
> * allow option to sample without replacement (default) and sample with 
> replacement
> * allow option to output a subset of columns to the output table
> * output one table with a new test/train column, or optionally two separate 
> tables
> Proposed Interface
> {code}
> train_test_split (
>     source_table,
>     output_table,
>     train_proportion,
>     test_proportion,         -- optional
>     grouping_col,            -- optional
>     with_replacement,        -- optional
>     target_cols,             -- optional
>     separate_output_tables   -- optional
> )
> source_table
> TEXT. The name of the table containing the input data.
> output_table
> TEXT. Name of the output table. A new INTEGER column called 'split' will be 
> appended on the right, with value 1 for rows in the train set and 0 for rows 
> in the test set, unless the 'separate_output_tables' parameter below is TRUE, 
> in which case two output tables will be created using the 'output_table' name 
> with the suffixes '_train' and '_test'. The output table contains all the 
> columns present in the source table unless otherwise specified in the 
> 'target_cols' parameter below.
> train_proportion
> FLOAT8 in the range (0,1).  Proportion of the dataset to include 
> in the train split.  If the 'grouping_col' parameter is specified below, 
> each group will be sampled independently using the 
> train proportion, i.e., in a stratified fashion.
> test_proportion (optional)
> FLOAT8 in the range (0,1).  Proportion of the dataset to include 
> in the test split.  Default is the complement of the train 
> proportion (1 - 'train_proportion').  If the 'grouping_col' 
> parameter is specified below, each group will be sampled 
> independently using the test proportion, 
> i.e., in a stratified fashion.
> grouping_col (optional)
> TEXT, default: NULL. A single column or a list of comma-separated columns
>  that defines how to stratify.  When this parameter is NULL, 
> the train-test split is not stratified.
> with_replacement (optional) 
> BOOLEAN, default FALSE.  Determines whether to sample with replacement 
> or without replacement (default).
> target_cols (optional)
> TEXT, default NULL. A comma-separated list of columns to appear in the 
> 'output_table'. 
> If NULL, all columns from the 'source_table'  will appear in the 
> 'output_table'.
> separate_output_tables (optional)
> BOOLEAN, default FALSE.  If TRUE, two output tables will be created using 
> the 'output_table' name with the suffixes '_train' and '_test'.
> {code}
> Other notes
> 1) PDL tools is one example implementation of train/test split to review [2]. 
>  
> 2) From Rahul Iyer: "The goal of having both train and test is to provide 
> subsampling and a train/test split in one function. 
> For example, if train_size = 0.4 and test_size = 0.1, then only half the input 
> data will be output. This is tremendously useful in situations where a user 
> wants to prototype/evaluate a couple of models on a smaller i.i.d. sample 
> before running them on the whole dataset. 
> Under no circumstances would the train_size + test_size be allowed to be more 
> than 1. The implementation will also ensure that there are no "leaks" (leak = 
> same data occurring in both train and test) as that defeats the whole purpose 
> of building an independent dataset for model evaluation. 
> Of course, the interface does get a little complex and could confuse users. 
> Explanatory documentation with examples is the only solution to that problem. 

[jira] [Commented] (MADLIB-413) Neural Networks - MLP - Phase 1

2017-08-18 Thread Frank McQuillan (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16133665#comment-16133665
 ] 

Frank McQuillan commented on MADLIB-413:


(2)
{code}
SELECT madlib.mlp_classification(
'iris_data',  -- Source table
'mlp_model',  -- Destination table
'attributes', -- Input features
'class_text' -- Label
);
{code}
works now

On-line help looks good, though I did not examine all of the details.

> Neural Networks - MLP - Phase 1
> ---
>
> Key: MADLIB-413
> URL: https://issues.apache.org/jira/browse/MADLIB-413
> Project: Apache MADlib
>  Issue Type: New Feature
>  Components: Module: Neural Networks
>Reporter: Caleb Welton
>Assignee: Cooper Sloan
> Fix For: v1.12
>
> Attachments: mlp.sql_in, screenshot-1.png
>
>
> Multilayer perceptron with backpropagation
> Modules:
> * mlp_classification
> * mlp_regression
> Interface
> {code}
> source_table VARCHAR,
> output_table VARCHAR,
> independent_varname VARCHAR, -- Column name for input features; should be a 
> real-valued array
> dependent_varname VARCHAR, -- Column name for target values; should be a 
> real-valued array of size 1 or greater
> hidden_layer_sizes INTEGER[], -- Number of units per hidden layer (can be 
> empty or NULL, in which case there are no hidden layers)
> optimizer_params VARCHAR, -- Specified below
> weights VARCHAR, -- Column name for weights; weights the loss for each input 
> vector. Column should contain positive real values
> activation_function VARCHAR, -- One of 'sigmoid' (default), 'tanh', 'relu', 
> or any prefix (e.g. 't', 's')
> grouping_cols
> )
> {code}
> where
> {code}
> optimizer_params: -- e.g. "step_size=0.5, n_tries=5"
> {
> step_size DOUBLE PRECISION, -- Learning rate
> n_iterations INTEGER, -- Number of iterations per try
> n_tries INTEGER, -- Total number of training cycles, with random 
> initializations to avoid local minima.
> tolerance DOUBLE PRECISION, -- Maximum distance between weights before 
> training stops (or until it reaches n_iterations)
> }
> {code}
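
For illustration, a call that exercises hidden_layer_sizes and optimizer_params 
per the interface above might look like the following sketch. It reuses the 
iris_data/attributes/class_text names from the working example earlier in this 
thread; 'mlp_model_2' and the optimizer values are made-up placeholders.

{code}
-- Sketch only: mlp_classification with explicit hidden layers and optimizer
-- parameters, following the interface described in this issue.
SELECT madlib.mlp_classification(
    'iris_data',    -- Source table
    'mlp_model_2',  -- Destination table (hypothetical name)
    'attributes',   -- Input features
    'class_text',   -- Label
    ARRAY[5, 5],    -- hidden_layer_sizes: two hidden layers of 5 units each
    'step_size=0.5, n_iterations=500, n_tries=3'  -- optimizer_params
);
{code}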





[jira] [Closed] (MADLIB-1134) Neural Networks - MLP - Phase 2

2017-08-18 Thread Frank McQuillan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MADLIB-1134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Frank McQuillan closed MADLIB-1134.
---

> Neural Networks - MLP - Phase 2
> ---
>
> Key: MADLIB-1134
> URL: https://issues.apache.org/jira/browse/MADLIB-1134
> Project: Apache MADlib
>  Issue Type: Improvement
>  Components: Module: Neural Networks
>Reporter: Frank McQuillan
>Assignee: Cooper Sloan
> Fix For: v1.12
>
>
> Follow on from https://issues.apache.org/jira/browse/MADLIB-413
> Story
> As a MADlib developer, I want to get the 2nd phase of the NN implementation 
> going, with training and prediction functions, so that I can use it to build 
> toward an MVP version for GA.
> Features to add:
> * weights for inputs
> * logic for n_tries
> * normalize inputs
> * L2 regularization
> * learning rate policy
> * warm start





[jira] [Reopened] (MADLIB-1073) Graph - Phase 1 measures

2017-08-18 Thread Frank McQuillan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MADLIB-1073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Frank McQuillan reopened MADLIB-1073:
-

> Graph - Phase 1 measures
> 
>
> Key: MADLIB-1073
> URL: https://issues.apache.org/jira/browse/MADLIB-1073
> Project: Apache MADlib
>  Issue Type: New Feature
>  Components: Module: Graph
>Reporter: Frank McQuillan
>Assignee: Rahul Iyer
> Fix For: v1.12
>
> Attachments: Graph Measures Interfaces - JIRA.pdf
>
>
> Follow on from  https://issues.apache.org/jira/browse/MADLIB-1072. Given that 
> this story is complete, what measures can we compute from APSP?
> Story
> As a MADlib developer, I want to implement the following measures:
> * Closeness (uses APSP)
> * Graph diameter  (uses APSP)
> * Average path length (uses APSP)
> * In/out degrees
> Acceptance
> 1) Interface defined
> 2) Design document updated
> 3) Documentation and on-line help
> 4) IC and functional tests
> 5) Scale tests





[jira] [Commented] (MADLIB-1073) Graph - Phase 1 measures

2017-08-18 Thread Frank McQuillan (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-1073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16133549#comment-16133549
 ] 

Frank McQuillan commented on MADLIB-1073:
-

Regarding
https://github.com/apache/incubator-madlib/pull/152

Closeness, diameter and average path length look good.

In-out degree is dropping the last vertex.  E.g., from the user docs:

{code}
DROP TABLE IF EXISTS degrees;
SELECT madlib.graph_vertex_degrees(
    'vertex',   -- Vertex table
    'id',       -- Vertex id column (NULL means use default naming)
    'edge',     -- Edge table
    'src=src_id, dest=dest_id, weight=edge_weight',
    'degrees'); -- Output table of degrees
SELECT * FROM degrees ORDER BY id;
{code}

produces
{code}
 id | indegree | outdegree 
----+----------+-----------
  0 |        2 |         3
  1 |        1 |         2
  2 |        2 |         3
  3 |        2 |         1
  4 |        1 |         1
  5 |        1 |         1
  6 |        2 |         1
(7 rows)
{code}

Vertex 7 has indegree=1 and outdegree=0, but it does not appear in the output above.

Same issue with and without grouping.
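
For reference, here is a rough sketch of how in/out degrees can be computed with 
outer joins so that vertices with a zero in- or out-degree (such as vertex 7 
above) are retained. It uses the vertex/edge schema from the example and is not 
the MADlib implementation itself:

{code}
-- Left-join the vertex table to per-direction edge counts so every vertex is
-- kept, with 0 substituted where a vertex has no incoming or outgoing edges.
SELECT v.id,
       COALESCE(i.indegree, 0)  AS indegree,
       COALESCE(o.outdegree, 0) AS outdegree
FROM vertex v
LEFT JOIN (SELECT dest_id AS id, COUNT(*) AS indegree
           FROM edge GROUP BY dest_id) i ON v.id = i.id
LEFT JOIN (SELECT src_id AS id, COUNT(*) AS outdegree
           FROM edge GROUP BY src_id) o ON v.id = o.id
ORDER BY v.id;
{code}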






> Graph - Phase 1 measures
> 
>
> Key: MADLIB-1073
> URL: https://issues.apache.org/jira/browse/MADLIB-1073
> Project: Apache MADlib
>  Issue Type: New Feature
>  Components: Module: Graph
>Reporter: Frank McQuillan
>Assignee: Rahul Iyer
> Fix For: v1.12
>
> Attachments: Graph Measures Interfaces - JIRA.pdf
>
>
> Follow on from  https://issues.apache.org/jira/browse/MADLIB-1072. Given that 
> this story is complete, what measures can we compute from APSP?
> Story
> As a MADlib developer, I want to implement the following measures:
> * Closeness (uses APSP)
> * Graph diameter  (uses APSP)
> * Average path length (uses APSP)
> * In/out degrees
> Acceptance
> 1) Interface defined
> 2) Design document updated
> 3) Documentation and on-line help
> 4) IC and functional tests
> 5) Scale tests





[jira] [Commented] (MADLIB-1146) Elastic Net fails when used without normalization with grouping

2017-08-18 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MADLIB-1146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16133294#comment-16133294
 ] 

ASF GitHub Bot commented on MADLIB-1146:


GitHub user njayaram2 opened a pull request:

https://github.com/apache/incubator-madlib/pull/172

Elastic_net: Fix grouping without normalization bug

JIRA: MADLIB-1146

Selecting grouping columns into the output table was not working
when data was NOT scaled, but grouping was used. This commit
fixes it.

Closes #172

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/njayaram2/incubator-madlib bugfix/MADlib_1146

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-madlib/pull/172.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #172


commit 54c09a64b98a070cd4f18c85bf47bf77a2546264
Author: Nandish Jayaram 
Date:   2017-08-18T17:10:58Z

Elastic_net: Fix grouping without normalization bug

JIRA: MADLIB-1146

Selecting grouping columns into the output table was not working
when data was NOT scaled, but grouping was used. This commit
fixes it.

Closes #172




> Elastic Net fails when used without normalization with grouping
> ---
>
> Key: MADLIB-1146
> URL: https://issues.apache.org/jira/browse/MADLIB-1146
> Project: Apache MADlib
>  Issue Type: Bug
>Reporter: Cooper Sloan
>Assignee: Nandish Jayaram
>Priority: Minor
>
> ```
> DROP TABLE IF EXISTS house_en,house_en_summary;
> SELECT madlib.elastic_net_train(
> 'lin_housing_wi',
> 'house_en',
> 'y',
> 'x',
> 'gaussian',
> 0.5,
> 0.5,
> False,
> 'grp_by_col',
> 'fista',
> '',
> NULL,
> 1,
> 1e-6
> );
> psql:/Users/csloan/elastic_net.sql:1: NOTICE:  table "house_en" does not 
> exist, skipping
> psql:/Users/csloan/elastic_net.sql:1: NOTICE:  table "house_en_summary" does 
> not exist, skipping
> DROP TABLE
> psql:/Users/csloan/elastic_net.sql:17: ERROR:  KeyError: 'select_grp'
> CONTEXT:  Traceback (most recent call last):
>   PL/Python function "elastic_net_train", line 27, in 
> excluded, max_iter, tolerance)
>   PL/Python function "elastic_net_train", line 467, in elastic_net_train
>   PL/Python function "elastic_net_train", line 502, in 
> _internal_elastic_net_train
>   PL/Python function "elastic_net_train", line 24, in 
> _elastic_net_gaussian_fista_train
>   PL/Python function "elastic_net_train", line 171, in 
> _elastic_net_fista_train
>   PL/Python function "elastic_net_train", line 297, in 
> _elastic_net_fista_train_compute
>   PL/Python function "elastic_net_train", line 83, in 
> _elastic_net_generate_result
> PL/Python function "elastic_net_train"
> ```





[jira] [Resolved] (MADLIB-1119) Train-test split

2017-08-18 Thread Frank McQuillan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MADLIB-1119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Frank McQuillan resolved MADLIB-1119.
-
Resolution: Fixed

> Train-test split
> 
>
> Key: MADLIB-1119
> URL: https://issues.apache.org/jira/browse/MADLIB-1119
> Project: Apache MADlib
>  Issue Type: New Feature
>  Components: Module: Sampling
>Reporter: Frank McQuillan
>Assignee: Orhan Kislal
> Fix For: v1.12
>
>
> Context
> See related story on stratified sampling 
> https://issues.apache.org/jira/browse/MADLIB-986
> Story
> As a data scientist, I want to split a data table into training and test sets 
> including grouping support, so that I can use the resulting sets for model 
> development in the usual way.
> The MVP for this story is:
> * support split by group
> * allow option to sample without replacement (default) and sample with 
> replacement
> * allow option to output a subset of columns to the output table
> * output one table with a new test/train column, or optionally two separate 
> tables
> Proposed Interface
> {code}
> train_test_split (
>     source_table,
>     output_table,
>     train_proportion,
>     test_proportion,         -- optional
>     grouping_col,            -- optional
>     with_replacement,        -- optional
>     target_cols,             -- optional
>     separate_output_tables   -- optional
> )
> source_table
> TEXT. The name of the table containing the input data.
> output_table
> TEXT. Name of the output table. A new INTEGER column called 'split' will be 
> appended on the right, with value 1 for rows in the train set and 0 for rows 
> in the test set, unless the 'separate_output_tables' parameter below is TRUE, 
> in which case two output tables will be created using the 'output_table' name 
> with the suffixes '_train' and '_test'. The output table contains all the 
> columns present in the source table unless otherwise specified in the 
> 'target_cols' parameter below.
> train_proportion
> FLOAT8 in the range (0,1).  Proportion of the dataset to include 
> in the train split.  If the 'grouping_col' parameter is specified below, 
> each group will be sampled independently using the 
> train proportion, i.e., in a stratified fashion.
> test_proportion (optional)
> FLOAT8 in the range (0,1).  Proportion of the dataset to include 
> in the test split.  Default is the complement of the train 
> proportion (1 - 'train_proportion').  If the 'grouping_col' 
> parameter is specified below, each group will be sampled 
> independently using the test proportion, 
> i.e., in a stratified fashion.
> grouping_col (optional)
> TEXT, default: NULL. A single column or a list of comma-separated columns
>  that defines how to stratify.  When this parameter is NULL, 
> the train-test split is not stratified.
> with_replacement (optional) 
> BOOLEAN, default FALSE.  Determines whether to sample with replacement 
> or without replacement (default).
> target_cols (optional)
> TEXT, default NULL. A comma-separated list of columns to appear in the 
> 'output_table'. 
> If NULL, all columns from the 'source_table'  will appear in the 
> 'output_table'.
> separate_output_tables (optional)
> BOOLEAN, default FALSE.  If TRUE, two output tables will be created using 
> the 'output_table' name with the suffixes '_train' and '_test'.
> {code}
> Other notes
> 1) PDL tools is one example implementation of train/test split to review [2]. 
>  
> 2) From Rahul Iyer: "The goal of having both train and test is to provide 
> subsampling and a train/test split in one function. 
> For example, if train_size = 0.4 and test_size = 0.1, then only half the input 
> data will be output. This is tremendously useful in situations where a user 
> wants to prototype/evaluate a couple of models on a smaller i.i.d. sample 
> before running them on the whole dataset. 
> Under no circumstances would the train_size + test_size be allowed to be more 
> than 1. The implementation will also ensure that there are no "leaks" (leak = 
> same data occurring in both train and test) as that defeats the whole purpose 
> of building an independent dataset for model evaluation. 
> Of course, the interface does get a little complex and could confuse users. 
> Explanatory documentation with examples is the only solution to that problem. 
> The alternative to having both sizes in one function is to run a subsample 
> function (using various sampling methods) and then perform the train_test 
> split. The downside to this approach is it requires writing an intermediate 
> table to disk (inefficient). "
> Acceptance
> 1) Code, user docs, on-line docs, IC, Tinc tests complete.
> 2) Radar