[madlib-site] branch asf-site updated (d27cc96 -> 222c2eb)
This is an automated email from the ASF dual-hosted git repository. fmcquillan pushed a change to branch asf-site in repository https://gitbox.apache.org/repos/asf/madlib-site.git. from d27cc96 Add docs for 1.18.0 RC1 new 1b38647 updated jupyter notebooks for 1dot18dot0 release new e6b2ff6 2nd update jupyter notebooks for 1dot18dot0 release new 8bb79b8 trivial update new 36413e3 minor edits to multiple workbooks new 222c2eb Merge pull request #21 from fmcquillan99/1dot18dot0-jupyter-notebooks The 113 revisions listed above as "new" are entirely new to this repository and will be described in separate emails. The revisions listed as "add" were already present in the repository and have only been added to this reference. Summary of changes: .../Encoding-categorical-variables-v2.ipynb| 57 +- .../Path-demo-4.ipynb | 856 ++- .../Load-model-selection-table-v1.ipynb| 955 --- .../Deep-learning/MADlib-Keras-MLP-v2.ipynb| 4057 .../MADlib-Keras-cifar10-inference-v1.ipynb| 601 -- .../MADlib-Keras-model-selection-MLP-v1.ipynb | 5709 .../Define-custom-functions-v1.ipynb | 531 ++ .../Define-model-architecture-v2.ipynb}| 270 +- ...rocessor-for-images-distribution-rules-v1.ipynb | 10 +- .../Preprocessor-for-images-v2.ipynb | 739 +-- .../Train-multiple-models/AutoML-MLP-v1.ipynb | 6937 .../Define-model-configurations-v2.ipynb | 2025 ++ ...Dlib-Keras-model-selection-CNN-cifar10-v1.ipynb |0 .../MADlib-Keras-model-selection-MLP-v1.ipynb | 6279 ++ .../Train-single-model/MADlib-Keras-MLP-v2.ipynb | 5025 ++ .../MADlib-Keras-cifar10-cnn-v3.ipynb | 74 +- .../MADlib-Keras-cifar10-inference-v1.ipynb| 829 +++ .../MADlib-Keras-imagenet-inference-v1.ipynb |2 +- .../MADlib-Keras-transfer-learning-v3.ipynb| 778 ++- .../{ => Utilities}/Load-images-v1.ipynb | 382 +- .../{ => Utilities}/madlib_image_loader.py |0 .../automl/hyperband-diag-cifar10-v1.ipynb | 5288 --- .../MADlib-e2e-ds-workflow-abalone.ipynb | 2181 +++--- community-artifacts/Graph/PageRank-v2.ipynb| 93 +- .../Supervised-learning/Decision-trees-v2.ipynb| 346 +- .../Supervised-learning/Linear-regression-v1.ipynb | 100 +- .../Supervised-learning/MLP-mnist-v3.ipynb | 386 +- .../SVM-binary-classification-v1.ipynb | 201 +- .../SVM-novelty-detection-v2.ipynb | 484 +- .../Kmeans-auto-k-selection-v1.ipynb | 241 +- 30 files changed, 25559 insertions(+), 19877 deletions(-) delete mode 100644 community-artifacts/Deep-learning/Load-model-selection-table-v1.ipynb delete mode 100644 community-artifacts/Deep-learning/MADlib-Keras-MLP-v2.ipynb delete mode 100644 community-artifacts/Deep-learning/MADlib-Keras-cifar10-inference-v1.ipynb delete mode 100644 community-artifacts/Deep-learning/MADlib-Keras-model-selection-MLP-v1.ipynb create mode 100755 community-artifacts/Deep-learning/Model-preparation/Define-custom-functions-v1.ipynb rename community-artifacts/Deep-learning/{Load-model-architecture-v2.ipynb => Model-preparation/Define-model-architecture-v2.ipynb} (68%) rename community-artifacts/Deep-learning/{ => Model-preparation}/Preprocessor-for-images-distribution-rules-v1.ipynb (98%) rename community-artifacts/Deep-learning/{ => Model-preparation}/Preprocessor-for-images-v2.ipynb (61%) create mode 100755 community-artifacts/Deep-learning/Train-multiple-models/AutoML-MLP-v1.ipynb create mode 100755 community-artifacts/Deep-learning/Train-multiple-models/Define-model-configurations-v2.ipynb rename community-artifacts/Deep-learning/{ => Train-multiple-models}/MADlib-Keras-model-selection-CNN-cifar10-v1.ipynb (100%) create mode 100644 community-artifacts/Deep-learning/Train-multiple-models/MADlib-Keras-model-selection-MLP-v1.ipynb create mode 100644 community-artifacts/Deep-learning/Train-single-model/MADlib-Keras-MLP-v2.ipynb rename community-artifacts/Deep-learning/{ => Train-single-model}/MADlib-Keras-cifar10-cnn-v3.ipynb (99%) create mode 100644 community-artifacts/Deep-learning/Train-single-model/MADlib-Keras-cifar10-inference-v1.ipynb rename community-artifacts/Deep-learning/{ => Train-single-model}/MADlib-Keras-imagenet-inference-v1.ipynb (99%) mode change 100644 => 100755 rename community-artifacts/Deep-learning/{ => Train-single-model}/MADlib-Keras-transfer-learning-v3.ipynb (68%) rename community-artifacts/Deep-learning/{ => Utilities}/Load-images-v1.ipynb (83%) rename community-artifacts/Deep-learning/{ => Utilities}/madlib_image_loader.py (100%) delete mode 100644 community-artifacts/Deep-learning/automl/hyperband-diag-cifar10-v1.ipynb
[madlib] branch master updated: release notes for 1dot18dot0
This is an automated email from the ASF dual-hosted git repository. fmcquillan pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/madlib.git The following commit(s) were added to refs/heads/master by this push: new c6a5883 release notes for 1dot18dot0 c6a5883 is described below commit c6a5883e193a8f89d1b29dd0317f7976e7a969fa Author: Frank McQuillan AuthorDate: Tue Mar 9 11:19:26 2021 -0800 release notes for 1dot18dot0 --- RELEASE_NOTES | 52 1 file changed, 52 insertions(+) diff --git a/RELEASE_NOTES b/RELEASE_NOTES index 030d28c..918cdf4 100644 --- a/RELEASE_NOTES +++ b/RELEASE_NOTES @@ -10,6 +10,58 @@ commit history located at https://github.com/apache/madlib/commits/master. Current list of bugs and issues can be found at https://issues.apache.org/jira/browse/MADLIB. —- +MADlib v1.18.0: + +Release Date: 2021-Mar-16 + +New features +- DL: setup methods for grid search and random search (MADLIB-1439) +- DL: Add support for custom loss functions (MADLIB-1441) +- DL: Hyperband phase 1 - print run schedule (MADLIB-1445) +- DL: Hyperband phase 2 - generate MST table (MADLIB-1446) +- DL: Hyperband phase 3 - logic for diagonal runs (MADLIB-1447) +- DL: Hyperband phase 4 - implement full logic with default params (MADLIB-1448) +- DL: Hyperband phase 5 - implement full logic with optional params (MADLIB-1449) +- AutoML: add Hyperopt for deep learning (MADLIB-1453) +- DL: Add Multiple input/output support to load, fit, and evaluate (MADLIB-1457) +- DL: Add multiple input/output support on advanced features (MADLIB-1458) +- DL: add caching param to autoML interface (MADLIB-1461) +- DL: Add support for TensorBoard (MADLIB-1474) +- DBSCAN clustering algo - phase 1 (MADLIB-1017) + +Improvements: +- DL: cache data to speed training (MADLIB-1427) +- DL: reduce GPU idle time between hops (MADLIB-1428) +- DL: utility to load and delete custom Python functions (MADLIB-1429) +- DL: support custom loss functions (MADLIB-1432) +- DL: support custom metrics (MADLIB-1433) +- DL: Fit multiple does not print timing for validation evaluate (MADLIB-1462) +- DL: Fix gpu_memory_fraction for distribution_policy != 'all_segments' (MADLIB-1463) +- DL: add object table info in load MST table utility function (MADLIB-1430) +- DL: improve speed of evaluate for multiple model training (MADLIB-1431) +- DL: improve existing grid search method (MADLIB-1440) +- DL: Remove dependency on keras (MADLIB-1450) +- DL: Improve output of predict (MADLIB-1451) +- DL: Add top n to evalute() (MADLIB-1452) +- DL - Write best so far to console for autoML methods (MADLIB-1454) +- Do not try to drop output tables (MADLIB-1442) +- Prevent an "integer out of range" exception in linear regression train (MADLIB-1460) + +Bug fixes: +- DL: Fix fit_multiple when output_table or mst_table is passed as NULL (MADLIB-1464) +- DL: Iris predict accuracy has regressed (MADLIB-1465) +- DL: madlib_keras_fit_multiple_model goes down with an IndexError: tuple index out of range (MADLIB-1467) +- DL: Crash in fit_multiple when any model reaches loss=nan (MADLIB-1443) +- DL: BYOM fails at get_num_classes (MADLIB-1472) +- DL: Hyperband cumulative output time is not correct (MADLIB-1456) +- check bigint support for all graph methods (MADLIB-1444) +- MLP: weights param not working (MADLIB-1471) + +Other: +- Create build trigger jobs on cloudbees (MADLIB-1466) + + +—- MADlib v1.17.0: Release Date: 2020-Mar-31
[madlib] branch master updated: clarify input row weights vs network weights in user docs for MLP
This is an automated email from the ASF dual-hosted git repository. fmcquillan pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/madlib.git The following commit(s) were added to refs/heads/master by this push: new fe1c1f5 clarify input row weights vs network weights in user docs for MLP fe1c1f5 is described below commit fe1c1f5915cc7c5c0dfa7422e3b6a7713402524f Author: Frank McQuillan AuthorDate: Mon Mar 8 15:35:08 2021 -0800 clarify input row weights vs network weights in user docs for MLP --- src/ports/postgres/modules/convex/mlp.sql_in | 26 ++ 1 file changed, 14 insertions(+), 12 deletions(-) diff --git a/src/ports/postgres/modules/convex/mlp.sql_in b/src/ports/postgres/modules/convex/mlp.sql_in index d6ce7ce..d98f8c4 100644 --- a/src/ports/postgres/modules/convex/mlp.sql_in +++ b/src/ports/postgres/modules/convex/mlp.sql_in @@ -152,19 +152,20 @@ mlp_classification( weights (optional) TEXT, default: 1. -Weights for input rows. Column name which specifies the weight for each input row. -This weight will be incorporated into the update during stochastic gradient -descent (SGD), but will not be used for loss calculations. If not specified, - weight for each row will default to 1 (equal weights). Column should be a - numeric type. +Column name for giving different weights to different rows during training. +E.g., a weight of two for a specific row is equivalent to dupicating that row. +This weight is incorporated into the update during stochastic gradient +descent (SGD), but is not be used for loss calculations. If not specified, +weight for each row will default to 1 (equal weights). Column should be a +numeric type. @note -The 'weights' parameter is not currently for mini-batching. +The 'weights' parameter cannot be used if you use mini-batching of the source dataset. warm_start (optional) BOOLEAN, default: FALSE. -Initalize weights with the coefficients from the last call of the training -function. If set to true, weights will be initialized from the output_table +Initalize neural network weights with the coefficients from the last call of the training +function. If set to true, neural network weights will be initialized from the output_table generated by the previous run. Note that all parameters other than optimizer_params and verbose must remain constant between calls when warm_start is used. @@ -173,7 +174,7 @@ mlp_classification( The warm start feature works based on the name of the output_table. When using warm start, do not drop the output table or the output table summary before calling the training function, since these are needed to obtain the -weights from the previous run. +neural network weights from the previous run. If you are not using warm start, the output table and the output table summary must be dropped in the usual way before calling the training function. @@ -294,7 +295,8 @@ A summary table named \_summary is also created, which has the fo weights -The weight column used during training. +The weight column used during training for giving different +weights to different rows. grouping_col @@ -421,7 +423,7 @@ a factor of gamma. Valid for learning rate policy = 'step'. n_tries Default: 1. Number of times to retrain the network with randomly initialized -weights. +neural network weights. lambda @@ -954,7 +956,7 @@ num_iterations | 450 Notice that the loss is lower compared to the previous example, despite having the same values for every other parameter. This is because the algorithm -learnt three different models starting with a different set of initial weights +learned three different models starting with a different set of initial weights for the coefficients, and chose the best model among them as the initial weights for the coefficients when run with warm start.
[madlib] branch master updated: update example in multi-fit to use new model config generator
This is an automated email from the ASF dual-hosted git repository. fmcquillan pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/madlib.git The following commit(s) were added to refs/heads/master by this push: new 33ad16c update example in multi-fit to use new model config generator 33ad16c is described below commit 33ad16c29af1e99a02a8a153671a9a16608e74c6 Author: Frank McQuillan AuthorDate: Fri Mar 5 16:54:34 2021 -0800 update example in multi-fit to use new model config generator --- .../madlib_keras_fit_multiple_model.sql_in | 69 -- 1 file changed, 37 insertions(+), 32 deletions(-) diff --git a/src/ports/postgres/modules/deep_learning/madlib_keras_fit_multiple_model.sql_in b/src/ports/postgres/modules/deep_learning/madlib_keras_fit_multiple_model.sql_in index 67ee2c7..e8c4d51 100644 --- a/src/ports/postgres/modules/deep_learning/madlib_keras_fit_multiple_model.sql_in +++ b/src/ports/postgres/modules/deep_learning/madlib_keras_fit_multiple_model.sql_in @@ -999,41 +999,44 @@ $$ 'MLP with 2 hidden layers' -- Descr ); --# Define model selection tuples and load. Select the model(s) from the model architecture -table that you want to run, along with the compile and fit parameters. Combinations will be -created for the set of model selection parameters will be loaded: +-# Generate model configurations using grid search. The output table for grid +search contains the unique combinations of model architectures, compile and +fit parameters. DROP TABLE IF EXISTS mst_table, mst_table_summary; -SELECT madlib.load_model_selection_table('model_arch_library', -- model architecture table - 'mst_table', -- model selection table output - ARRAY[1,2], -- model ids from model architecture table - ARRAY[ -- compile params - $$loss='categorical_crossentropy',optimizer='Adam(lr=0.1)',metrics=['accuracy']$$, - $$loss='categorical_crossentropy', optimizer='Adam(lr=0.01)',metrics=['accuracy']$$, - $$loss='categorical_crossentropy',optimizer='Adam(lr=0.001)',metrics=['accuracy']$$ - ], - ARRAY[-- fit params - $$batch_size=4,epochs=1$$, - $$batch_size=8,epochs=1$$ - ] +SELECT madlib.generate_model_configs( +'model_arch_library', -- model architecture table +'mst_table', -- model selection table output + ARRAY[1,2], -- model ids from model architecture table + $$ +{'loss': ['categorical_crossentropy'], + 'optimizer_params_list': [ {'optimizer': ['Adam'], 'lr': [0.001, 0.01, 0.1]} ], + 'metrics': ['accuracy']} + $$, -- compile_param_grid + $$ + { 'batch_size': [4, 8], + 'epochs': [1] + } + $$, -- fit_param_grid + 'grid' -- search_type ); SELECT * FROM mst_table ORDER BY mst_key; - mst_key | model_id | compile_params | fit_params + mst_key | model_id | compile_params | fit_params -+--+-+--- - 1 |1 | loss='categorical_crossentropy',optimizer='Adam(lr=0.1)',metrics=['accuracy'] | batch_size=4,epochs=1 - 2 |1 | loss='categorical_crossentropy',optimizer='Adam(lr=0.1)',metrics=['accuracy'] | batch_size=8,epochs=1 - 3 |1 | loss='categorical_crossentropy', optimizer='A
[madlib] branch master updated: clarify example in user docs for loading model arch
This is an automated email from the ASF dual-hosted git repository. fmcquillan pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/madlib.git The following commit(s) were added to refs/heads/master by this push: new 7eeb29c clarify example in user docs for loading model arch 7eeb29c is described below commit 7eeb29c6827ff6968e9533536d6f32f8bc6de3c8 Author: Frank McQuillan AuthorDate: Thu Mar 4 15:41:33 2021 -0800 clarify example in user docs for loading model arch --- .../deep_learning/keras_model_arch_table.sql_in| 46 +- 1 file changed, 28 insertions(+), 18 deletions(-) diff --git a/src/ports/postgres/modules/deep_learning/keras_model_arch_table.sql_in b/src/ports/postgres/modules/deep_learning/keras_model_arch_table.sql_in index ee30f94..0c099e0 100644 --- a/src/ports/postgres/modules/deep_learning/keras_model_arch_table.sql_in +++ b/src/ports/postgres/modules/deep_learning/keras_model_arch_table.sql_in @@ -237,8 +237,15 @@ output table 'iris_model' from a previous run of 'madlib_keras_fit()' : UPDATE model_arch_library SET model_weights = model_weights FROM iris_model WHERE model_id = 2; +SELECT model_id, name, description, (model_weights IS NOT NULL) AS has_model_weights FROM model_arch_library ORDER BY model_id; -To load weights from Keras using a PL/Python function, + + model_id | name | description | has_model_weights +--++-+--- +1 | Sophie | A simple model | f +2 | Maria | Also a simple model | t + +-# To load weights from Keras using a PL/Python function, we need to flatten then serialize the weights to store as a PostgreSQL binary data type. Byte format is more efficient on space and memory compared to a numeric array. @@ -273,15 +280,16 @@ plpy.execute(load_query, [model.to_json(), weights_bytea]) $$ language plpythonu; -- Call load function SELECT load_weights(); --- Check weights loaded OK -SELECT COUNT(*) FROM model_arch_library WHERE model_weights IS NOT NULL; +SELECT model_id, name, description, (model_weights IS NOT NULL) AS has_model_weights FROM model_arch_library ORDER BY model_id; - count + - 1 + model_id | name | description | has_model_weights +--++-+--- +1 | Sophie | A simple model | f +2 | Maria | Also a simple model | t +3 | Ella | Model x | t -Load weights from Keras using psycopg2. (Psycopg is a PostgreSQL database adapter for the +-# Load weights from Keras using psycopg2. (Psycopg is a PostgreSQL database adapter for the Python programming language.) As above we need to flatten then serialize the weights to store as a PostgreSQL binary data type. Note that the psycopg2.Binary function used below will increase the size of the Python object for the weights, so if your model is large it might be better to use a PL/Python function as above. @@ -310,27 +318,29 @@ weights_bytea = psycopg2.Binary(weights1d.tostring()) query = "SELECT madlib.load_keras_model('model_arch_library', %s,%s)" cur.execute(query,[model.to_json(),weights_bytea]) conn.commit() - -From SQL check if weights loaded OK: - -SELECT COUNT(*) FROM model_arch_library WHERE model_weights IS NOT NULL; +SELECT model_id, name, description, (model_weights IS NOT NULL) AS has_model_weights FROM model_arch_library ORDER BY model_id; - count + - 2 + model_id | name | description | has_model_weights +--++-+--- +1 | Sophie | A simple model | f +2 | Maria | Also a simple model | t +3 | Ella | Model x | t +4 | Grace | Model y | t -# Delete one of the models: SELECT madlib.delete_keras_model('model_arch_library', -- Output table 1 -- Model id ); -SELECT COUNT(*) FROM model_arch_library; +SELECT model_id, name, description, (model_weights IS NOT NULL) AS has_model_weights FROM model_arch_library ORDER BY model_id; - count + - 2 + model_id | name | description | has_model_weights +--+---+-+--- +2 | Maria | Also a simple model | t +3 | Ella | Model x | t +4 | Grace | Model y | t @anchor related
[madlib] branch master updated: move notes to bottom of page for consistency in user docs
This is an automated email from the ASF dual-hosted git repository. fmcquillan pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/madlib.git The following commit(s) were added to refs/heads/master by this push: new f29674b move notes to bottom of page for consistency in user docs f29674b is described below commit f29674b8bf3d500b3dcca38e81356a4a39591bec Author: Frank McQuillan AuthorDate: Mon Feb 8 12:58:26 2021 -0800 move notes to bottom of page for consistency in user docs --- src/ports/postgres/modules/graph/apsp.sql_in | 26 +++--- src/ports/postgres/modules/graph/bfs.sql_in | 28 src/ports/postgres/modules/graph/hits.sql_in | 22 +-- src/ports/postgres/modules/graph/pagerank.sql_in | 10 ++--- src/ports/postgres/modules/graph/sssp.sql_in | 28 src/ports/postgres/modules/graph/wcc.sql_in | 7 ++ 6 files changed, 66 insertions(+), 55 deletions(-) diff --git a/src/ports/postgres/modules/graph/apsp.sql_in b/src/ports/postgres/modules/graph/apsp.sql_in index 893cd79..bab6d83 100644 --- a/src/ports/postgres/modules/graph/apsp.sql_in +++ b/src/ports/postgres/modules/graph/apsp.sql_in @@ -34,8 +34,8 @@ m4_include(`SQLCommon.m4') Contents APSP -Notes Examples +Notes Literature @@ -159,18 +159,6 @@ It contains a row for every group and has the following columns: -@anchor notes -@par Notes - -Graphs with negative edges are supported but graphs with negative cycles are not. - -The implementation is analogous to a matrix multiplication procedure. -Please refer to the MADlib design document and references [1] and [2] -for more details. - -Also see the Grail project [3] for more background on graph analytics processing -in relational databases. - @anchor examples @examp @@ -369,6 +357,18 @@ SELECT * FROM out_gr_path ORDER BY grp; 1 | {0,4,5} +@anchor notes +@par Notes + +1. Graphs with negative edges are supported but graphs with negative cycles are not. + +2. The implementation for APSP is analogous to a matrix multiplication operation. +Please refer to the MADlib design document and references [1] and [2] +for more details. + +3. Also see the Grail project [3] for more background on graph analytics processing +in relational databases. + @anchor literature @par Literature diff --git a/src/ports/postgres/modules/graph/bfs.sql_in b/src/ports/postgres/modules/graph/bfs.sql_in index f9507d9..d2474f0 100644 --- a/src/ports/postgres/modules/graph/bfs.sql_in +++ b/src/ports/postgres/modules/graph/bfs.sql_in @@ -33,8 +33,8 @@ m4_include(`SQLCommon.m4') Contents Breadth-First Search -Notes Examples +Notes Literature @@ -130,19 +130,6 @@ and a single BFS result is generated. -@note On a Greenplum cluster, the edge table should be distributed -by the source vertex id column for better performance. - -@anchor notes -@par Notes - -The graph_bfs function is a SQL implementation of the well-known breadth-first -search algorithm [1] modified appropriately for a relational database. It will -find any node in the graph reachable from the source_vertex only once. If a node -is reachable by many different paths from the source_vertex (i.e. has more than -one parent), then only one of those parents is present in the output table. -The BFS result will, in general, be different for different choices of source_vertex. - @anchor examples @examp @@ -388,6 +375,19 @@ SELECT * FROM out_gr ORDER BY g1,g2,dist,id; (7 rows) +@anchor notes +@par Notes + +1. On a Greenplum cluster, the edge table should be distributed +by the source vertex id column for better performance. + +2. The graph_bfs function is a SQL implementation of the well-known breadth-first +search algorithm [1] modified appropriately for a relational database. It will +find any node in the graph reachable from the 'source_vertex' only once. If a node +is reachable by many different paths from the 'source_vertex' (i.e. has more than +one parent), then only one of those parents is present in the output table. +The BFS result will, in general, be different for different choices of 'source_vertex'. + @anchor literature @par Literature diff --git a/src/ports/postgres/modules/graph/hits.sql_in b/src/ports/postgres/modules/graph/hits.sql_in index d2d6cfc..6f140c8 100644 --- a/src/ports/postgres/modules/graph/hits.sql_in +++ b/src/ports/postgres/modules/graph/hits.sql_in @@ -34,13 +34,13 @@ m4_include(`SQLCommon.m4') Contents HITS -Notes Examples +Notes Literature -@brief Find the HITS scores(authority and hub) of all vertices in a directed +@brief Find the HITS scores (authority and hub) of all vertices in a directed graph. Given a graph, the HITS (Hyperlink-Induced Topic Search) algorithm outputs the @@ -127,15 +127,6 @@ parameter. -@note On a Greenplum cluster, the edge table should b
[madlib] branch master updated: clarify grouping not part of arima currently in user docs
This is an automated email from the ASF dual-hosted git repository. fmcquillan pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/madlib.git The following commit(s) were added to refs/heads/master by this push: new 2e75913 clarify grouping not part of arima currently in user docs 2e75913 is described below commit 2e75913b32d6ee6282da5fd7e77c2fee80befd6a Author: Frank McQuillan AuthorDate: Tue Feb 2 12:45:15 2021 -0800 clarify grouping not part of arima currently in user docs --- src/ports/postgres/modules/tsa/arima.sql_in | 10 +++--- 1 file changed, 7 insertions(+), 3 deletions(-) diff --git a/src/ports/postgres/modules/tsa/arima.sql_in b/src/ports/postgres/modules/tsa/arima.sql_in index 48f0abd..12930a6 100644 --- a/src/ports/postgres/modules/tsa/arima.sql_in +++ b/src/ports/postgres/modules/tsa/arima.sql_in @@ -158,13 +158,17 @@ arima_train( input_table, TEXT. The name of the column containing the time series data. This data is currently restricted to DOUBLE PRECISION. -grouping_columns (optional) -TEXT, default: NULL. Not currently implemented. Any non-NULL value is ignored. +grouping_columns (not currently implemented) +TEXT, default: NULL. A comma-separated list of column names used to group the input dataset into discrete groups, training one ARIMA model per group. It is similar to the SQL GROUP BY clause. When this value is null, no grouping is -used and a single result model is generated. +used and a single result model is generated. + +@note Grouping is not currently implemented for ARIMA, but +will be added in the future. Any non-NULL value for this parameter +is ignored. include_mean (optional) BOOLEAN, default: FALSE. Mean value of the data series is added in the ARIMA model
[madlib] branch master updated: fix error in marginal effects example
This is an automated email from the ASF dual-hosted git repository. fmcquillan pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/madlib.git The following commit(s) were added to refs/heads/master by this push: new a70a877 fix error in marginal effects example a70a877 is described below commit a70a8776fea111afef353f91f0bad93ffa13b6ab Author: Frank McQuillan AuthorDate: Tue Feb 2 11:48:55 2021 -0800 fix error in marginal effects example --- src/ports/postgres/modules/regress/marginal.sql_in | 12 +--- 1 file changed, 5 insertions(+), 7 deletions(-) diff --git a/src/ports/postgres/modules/regress/marginal.sql_in b/src/ports/postgres/modules/regress/marginal.sql_in index a19424e..3cb3f8a 100644 --- a/src/ports/postgres/modules/regress/marginal.sql_in +++ b/src/ports/postgres/modules/regress/marginal.sql_in @@ -38,9 +38,7 @@ computed is the average of the marginal effect at every data point present in th source table. MADlib provides marginal effects regression functions for linear, logistic and -multinomial logistic regressions. - -@warning The margins_logregr() and margins_mlogregr() functions have been deprecated in favor of the margins() function. +multinomial logistic regressions. The implementation is similar to reference [1]. @anchor margins @par Marginal Effects with Interaction Terms @@ -321,11 +319,11 @@ DROP TABLE IF EXISTS margins_table; SELECT madlib.logregr_train( 'patients', 'model_table', 'second_attack', - 'ARRAY[1, treatment, trait_anxiety, treatment^2, treatment * trait_anxiety]' + 'ARRAY[1, treatment, trait_anxiety, treatment * trait_anxiety]' ); SELECT madlib.margins( 'model_table', 'margins_table', - 'intercept, treatment, trait_anxiety, treatment^2, treatment*trait_anxiety', + 'intercept, treatment, trait_anxiety, treatment*trait_anxiety', NULL, NULL ); @@ -347,7 +345,7 @@ and view the results (using different names in 'x_design'). DROP TABLE IF EXISTS result_table; SELECT madlib.margins( 'model_table', 'result_table', - 'i, tre, tra, tre^2, tre*tra', + 'i, tre, tra, tre*tra', NULL, 'tre' ); @@ -475,7 +473,7 @@ We use the delta method for calculating standard errors on the marginal effects. @literature -[1] mfx function in STATA: http://www.stata.com/help.cgi?mfx_option +[1] Marginal effects in Stata: https://www.stata.com/ @anchor related @par Related Topics
[madlib-site] branch asf-site updated: add ipython notebook for window functions
This is an automated email from the ASF dual-hosted git repository. fmcquillan pushed a commit to branch asf-site in repository https://gitbox.apache.org/repos/asf/madlib-site.git The following commit(s) were added to refs/heads/asf-site by this push: new 3a7f9ed add ipython notebook for window functions 3a7f9ed is described below commit 3a7f9ed2e8dbaa6d0b0a406593f00b0598e1bbf0 Author: Frank McQuillan AuthorDate: Fri Apr 24 12:45:54 2020 -0700 add ipython notebook for window functions --- .../Time-series/Window-functions-v1.ipynb | 1910 1 file changed, 1910 insertions(+) diff --git a/community-artifacts/Time-series/Window-functions-v1.ipynb b/community-artifacts/Time-series/Window-functions-v1.ipynb new file mode 100644 index 000..9c30a40 --- /dev/null +++ b/community-artifacts/Time-series/Window-functions-v1.ipynb @@ -0,0 +1,1910 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ +"# Time series example - window functions\n", +"\n", +"Some example queries on time series data using aggregates and window functions. Thanks to Divya Bhargov from VMware for this example notebook.\n", +"\n", +"Data from https://data.cityofchicago.org/Transportation/Potholes-Patched/wqdh-9gek/data which is loaded from CSV format.\n", +"\n", +"## Table of contents \n", +"\n", +"1. Connect to database\n", +"\n", +"2. Load data\n", +"\n", +"3. Window functions\n", +"\n", +"4. Mapping for gap filling" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ +"\n", +"## 1. Connect to database" + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "metadata": {}, + "outputs": [ +{ + "name": "stdout", + "output_type": "stream", + "text": [ + "The sql extension is already loaded. To reload it, use:\n", + " %reload_ext sql\n", + "1 rows affected.\n" + ] +}, +{ + "data": { + "text/html": [ + "\n", + "\n", + "version\n", + "\n", + "\n", + "PostgreSQL 8.3.23 (Greenplum Database 5.18.0 build commit:6aec9959d367d46c6b4391eb9ffc82c735d20102) on x86_64-pc-linux-gnu, compiled by GCC gcc (GCC) 6.2.0, 64-bit compiled on Apr 3 2019 14:45:51\n", + "\n", + "" + ], + "text/plain": [ + "[(u'PostgreSQL 8.3.23 (Greenplum Database 5.18.0 build commit:6aec9959d367d46c6b4391eb9ffc82c735d20102) on x86_64-pc-linux-gnu, compiled by GCC gcc (GCC) 6.2.0, 64-bit compiled on Apr 3 2019 14:45:51',)]" + ] + }, + "execution_count": 24, + "metadata": {}, + "output_type": "execute_result" +} + ], + "source": [ +"%load_ext sql\n", +"\n", +"# Greenplum Database 5.x on GCP (PM demo machine) - via tunnel\n", +"%sql postgresql://gpadmin@localhost:8000/madlib\n", +"\n", +"# PostgreSQL local\n", +"#%sql postgresql://fmcquillan@localhost:5432/madlib\n", +"\n", +"%sql SELECT version();" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ +"\n", +"## 2. Load data\n", +"Load from CSV. You will need to change the path to the location of the CSV file." + ] + }, + { + "cell_type": "code", + "execution_count": 47, + "metadata": {}, + "outputs": [ +{ + "name": "stdout", + "output_type": "stream", + "text": [ + "Done.\n", + "Done.\n", + "65544 rows affected.\n" + ] +}, +{ + "data": { + "text/plain": [ + "[]" + ] + }, + "execution_count": 47, + "metadata": {}, + "output_type": "execute_result" +} + ], + "source": [ +"%%sql\n", +"DROP TABLE IF EXISTS chicago_potholes_patched;\n", +"CREATE TABLE chicago_potholes_patched (\n", +"id serial NOT NULL,\n", +"address TEXT,\n", +"request_date TIMESTAMP,
[madlib-site] branch asf-site updated: fix download links for archived 1.16 release artifacts
This is an automated email from the ASF dual-hosted git repository. fmcquillan pushed a commit to branch asf-site in repository https://gitbox.apache.org/repos/asf/madlib-site.git The following commit(s) were added to refs/heads/asf-site by this push: new e2df99d fix download links for archived 1.16 release artifacts e2df99d is described below commit e2df99d059fd2224e5bdfc1119d4afc71c93efee Author: Frank McQuillan AuthorDate: Fri Apr 17 11:32:29 2020 -0700 fix download links for archived 1.16 release artifacts --- download.html | 10 +- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/download.html b/download.html index 0d89739..66f5052 100644 --- a/download.html +++ b/download.html @@ -109,15 +109,15 @@ Latest stable release: - https://dist.apache.org/repos/dist/release/madlib/1.16/apache-madlib-1.16-src.tar.gz";>Source code tar.gz (https://www.apache.org/dist/madlib/1.16/apache-madlib-1.16-src.tar.gz.asc";>pgp, https://www.apache.org/dist/madlib/1.16/apache-madlib-1.16-src.tar.gz.sha512";>sha512) + https://archive.apache.org/dist/madlib/1.16/apache-madlib-1.16-src.tar.gz";>Source code tar.gz (https://archive.apache.org/dist/madlib/1.16/apache-madlib-1.16-src.tar.gz.asc";>pgp, https://archive.apache.org/dist/madlib/1.16/apache-madlib-1.16-src.tar.gz.sha512";>sha512) - https://dist.apache.org/repos/dist/release/madlib/1.16/apache-madlib-1.16-bin-Linux-GPDB43.rpm";>Linux (https://www.apache.org/dist/madlib/1.16/apache-madlib-1.16-bin-Linux-GPDB43.rpm.asc";>pgp, https://www.apache.org/dist/madlib/1.16/apache-madlib-1.16-bin-Linux-GPDB43.rpm.sha512";>sha512) — CentOS / Red Hat 5 and higher (64 bit). [...] + https://archive.apache.org/dist/madlib/1.16/apache-madlib-1.16-bin-Linux-GPDB43.rpm";>Linux (https://archive.apache.org/dist/madlib/1.16/apache-madlib-1.16-bin-Linux-GPDB43.rpm.asc";>pgp, https://archive.apache.org/dist/madlib/1.16/apache-madlib-1.16-bin-Linux-GPDB43.rpm.sha512";>sha512) — CentOS / Red Hat 5 and higher (64 bit). GP [...] - https://dist.apache.org/repos/dist/release/madlib/1.16/apache-madlib-1.16-bin-Linux.rpm";>Linux (https://www.apache.org/dist/madlib/1.16/apache-madlib-1.16-bin-Linux.rpm.asc";>pgp, https://www.apache.org/dist/madlib/1.16/apache-madlib-1.16-bin-Linux.rpm.sha512";>sha512) — CentOS / Red Hat 6 and higher (64 bit). GPDB 5.x, PostgreSQL 10.x a [...] + https://archive.apache.org/dist/madlib/1.16/apache-madlib-1.16-bin-Linux.rpm";>Linux (https://archive.apache.org/dist/madlib/1.16/apache-madlib-1.16-bin-Linux.rpm.asc";>pgp, https://archive.apache.org/dist/madlib/1.16/apache-madlib-1.16-bin-Linux.rpm.sha512";>sha512) — CentOS / Red Hat 6 and higher (64 bit). GPDB 5.x, PostgreSQL 10.x and [...] - https://dist.apache.org/repos/dist/release/madlib/1.16/apache-madlib-1.16-bin-Linux.deb";>Linux (https://www.apache.org/dist/madlib/1.16/apache-madlib-1.16-bin-Linux.deb.asc";>pgp, https://www.apache.org/dist/madlib/1.16/apache-madlib-1.16-bin-Linux.deb.sha512";>sha512) — Ubuntu 16.04. GPDB 5.x, PostgreSQL 10.x and 11.x. + https://archive.apache.org/dist/madlib/1.16/apache-madlib-1.16-bin-Linux.deb";>Linux (https://archive.apache.org/dist/madlib/1.16/apache-madlib-1.16-bin-Linux.deb.asc";>pgp, https://archive.apache.org/dist/madlib/1.16/apache-madlib-1.16-bin-Linux.deb.sha512";>sha512) — Ubuntu 16.04. GPDB 5.x, PostgreSQL 10.x and 11.x. - https://dist.apache.org/repos/dist/release/madlib/1.16/apache-madlib-1.16-bin-Darwin.dmg";>Mac OS X (https://www.apache.org/dist/madlib/1.16/apache-madlib-1.16-bin-Darwin.dmg.asc";>pgp, https://www.apache.org/dist/madlib/1.16/apache-madlib-1.16-bin-Darwin.dmg.sha512";>sha512) — OS 10.6 and higher. PostgreSQL 10.x and 11.x. + https://archive.apache.org/dist/madlib/1.16/apache-madlib-1.16-bin-Darwin.dmg";>Mac OS X (https://archive.apache.org/dist/madlib/1.16/apache-madlib-1.16-bin-Darwin.dmg.asc";>pgp, https://archive.apache.org/dist/madlib/1.16/apache-madlib-1.16-bin-Darwin.dmg.sha512";>sha512) — OS 10.6 and higher. PostgreSQL 10.x and 11.x. v1.15.1
[madlib-site] branch asf-site updated: update download page to say ubuntu 18 for 1.17.0
This is an automated email from the ASF dual-hosted git repository. fmcquillan pushed a commit to branch asf-site in repository https://gitbox.apache.org/repos/asf/madlib-site.git The following commit(s) were added to refs/heads/asf-site by this push: new 002cf96 update download page to say ubuntu 18 for 1.17.0 002cf96 is described below commit 002cf96cd02ffc461c7adf3a0b99128ebda3371c Author: Frank McQuillan AuthorDate: Tue Apr 14 11:43:13 2020 -0700 update download page to say ubuntu 18 for 1.17.0 --- download.html | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/download.html b/download.html index 76af104..0d89739 100644 --- a/download.html +++ b/download.html @@ -72,7 +72,7 @@ https://dist.apache.org/repos/dist/release/madlib/1.17.0/apache-madlib-1.17.0-bin-Linux.rpm";>Linux (https://www.apache.org/dist/madlib/1.17.0/apache-madlib-1.17.0-bin-Linux.rpm.asc";>pgp, https://www.apache.org/dist/madlib/1.17.0/apache-madlib-1.17.0-bin-Linux.rpm.sha512";>sha512) — CentOS / Red Hat 6 and higher (64 bit). GPDB 5.x, GPD [...] - https://dist.apache.org/repos/dist/release/madlib/1.17.0/apache-madlib-1.17.0-bin-Linux.deb";>Linux (https://www.apache.org/dist/madlib/1.17.0/apache-madlib-1.17.0-bin-Linux.deb.asc";>pgp, https://www.apache.org/dist/madlib/1.17.0/apache-madlib-1.17.0-bin-Linux.deb.sha512";>sha512) — Ubuntu 16.04. GPDB 5.x, GPDB 6.x, PostgreSQL 11.x and [...] + https://dist.apache.org/repos/dist/release/madlib/1.17.0/apache-madlib-1.17.0-bin-Linux.deb";>Linux (https://www.apache.org/dist/madlib/1.17.0/apache-madlib-1.17.0-bin-Linux.deb.asc";>pgp, https://www.apache.org/dist/madlib/1.17.0/apache-madlib-1.17.0-bin-Linux.deb.sha512";>sha512) — Ubuntu 18.04. GPDB 5.x, GPDB 6.x, PostgreSQL 11.x and [...] https://dist.apache.org/repos/dist/release/madlib/1.17.0/apache-madlib-1.17.0-bin-Darwin.dmg";>Mac OS X (https://www.apache.org/dist/madlib/1.17.0/apache-madlib-1.17.0-bin-Darwin.dmg.asc";>pgp, https://www.apache.org/dist/madlib/1.17.0/apache-madlib-1.17.0-bin-Darwin.dmg.sha512";>sha512) — OS 10.6 and higher. PostgreSQL 11.x and 12.x.
[madlib-site] branch asf-site updated: update website for 1.17.0 release
This is an automated email from the ASF dual-hosted git repository. fmcquillan pushed a commit to branch asf-site in repository https://gitbox.apache.org/repos/asf/madlib-site.git The following commit(s) were added to refs/heads/asf-site by this push: new 1d4f9be update website for 1.17.0 release new 81dc279 Merge branch 'asf-site' of github.com:apache/madlib-site into asf-site 1d4f9be is described below commit 1d4f9bec751b1b2323f05d8e512a60cb8240aa92 Author: Frank McQuillan AuthorDate: Fri Apr 10 11:08:02 2020 -0700 update website for 1.17.0 release --- _media/logos/vmw.png | Bin 0 -> 5231 bytes community.html | 4 +-- documentation.html | 1 + download.html| 31 index.html | 81 --- 5 files changed, 73 insertions(+), 44 deletions(-) diff --git a/_media/logos/vmw.png b/_media/logos/vmw.png new file mode 100644 index 000..c216caa Binary files /dev/null and b/_media/logos/vmw.png differ diff --git a/community.html b/community.html index 4cec998..62ef336 100644 --- a/community.html +++ b/community.html @@ -58,7 +58,7 @@ http://pivotal.io/"; class="center"> - + Providing core development and scalability testing Learn More @@ -210,7 +210,7 @@ http://postgresql.org";>PostgreSQL http://greenplum.org/";>Greenplum Database -http://hawq.incubator.apache.org";>Apache HAWQ +http://hawq.apache.org";>Apache HAWQ http://cran.r-project.org/web/packages/PivotalR/";>PivotalR diff --git a/documentation.html b/documentation.html index b5d93bf..7d92d11 100644 --- a/documentation.html +++ b/documentation.html @@ -55,6 +55,7 @@ jQuery(document).ready(function() { The primary documentation reference material providing detailed information on the functions and algorithms within MADlib as well as background theory and references into the literature. Older Documentation +MADlib v1.16 MADlib v1.15.1 MADlib v1.15 MADlib v1.14 diff --git a/download.html b/download.html index b68b47f..76af104 100644 --- a/download.html +++ b/download.html @@ -58,7 +58,7 @@ Current Release - v1.16 + v1.17.0 Source Code and Convenience Binaries MADlib® source code and convenience binaries are available from the Apache distribution site. @@ -66,15 +66,15 @@ Latest stable release: - https://dist.apache.org/repos/dist/release/madlib/1.16/apache-madlib-1.16-src.tar.gz";>Source code tar.gz (https://www.apache.org/dist/madlib/1.16/apache-madlib-1.16-src.tar.gz.asc";>pgp, https://www.apache.org/dist/madlib/1.16/apache-madlib-1.16-src.tar.gz.sha512";>sha512) + https://dist.apache.org/repos/dist/release/madlib/1.17.0/apache-madlib-1.17.0-src.tar.gz";>Source code tar.gz (https://www.apache.org/dist/madlib/1.17.0/apache-madlib-1.17.0-src.tar.gz.asc";>pgp, https://www.apache.org/dist/madlib/1.17.0/apache-madlib-1.17.0-src.tar.gz.sha512";>sha512) - https://dist.apache.org/repos/dist/release/madlib/1.16/apache-madlib-1.16-bin-Linux-GPDB43.rpm";>Linux (https://www.apache.org/dist/madlib/1.16/apache-madlib-1.16-bin-Linux-GPDB43.rpm.asc";>pgp, https://www.apache.org/dist/madlib/1.16/apache-madlib-1.16-bin-Linux-GPDB43.rpm.sha512";>sha512) — CentOS / Red Hat 5 and higher (64 bit). [...] + https://dist.apache.org/repos/dist/release/madlib/1.17.0/apache-madlib-1.17.0-bin-Linux-GPDB43.rpm";>Linux (https://www.apache.org/dist/madlib/1.17.0/apache-madlib-1.17.0-bin-Linux-GPDB43.rpm.asc";>pgp, https://www.apache.org/dist/madlib/1.17.0/apache-madlib-1.17.0-bin-Linux-GPDB43.rpm.sha512";>sha512) — CentOS / Red Hat 5 and hi [...] - https://dist.apache.org/repos/dist/release/madlib/1.16/apache-madlib-1.16-bin-Linux.rpm";>Linux (https://www.apache.org/dist/madlib/1.16/apache-madlib-1.16-bin-Linux.rpm.asc";>pgp, https://www.apache.org/dist/madlib/1.16/apache-madlib-1.16-bin-Linux.rpm.sha512";>sha512) — CentOS /
[madlib-site] branch asf-site updated: fix broken links for datasets on community page
This is an automated email from the ASF dual-hosted git repository. fmcquillan pushed a commit to branch asf-site in repository https://gitbox.apache.org/repos/asf/madlib-site.git The following commit(s) were added to refs/heads/asf-site by this push: new bda2c9b fix broken links for datasets on community page bda2c9b is described below commit bda2c9b2335a80914332ebf16fdc16009988f71f Author: Frank McQuillan AuthorDate: Mon Mar 30 15:34:03 2020 -0700 fix broken links for datasets on community page --- community.html | 8 +--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/community.html b/community.html index 6f892a8..4cec998 100644 --- a/community.html +++ b/community.html @@ -197,10 +197,12 @@ Datasets +There is a growing set of publically available datasets. Here are some examples: -http://archive.ics.uci.edu/ml/datasets.html"; title="UCI Machine Learning Repository: Data Sets">http://archive.ics.uci.edu/ml/datasets.html -http://mlcomp.org/datasets"; title="MLcomp - Viewing All Datasets">http://mlcomp.org/datasets -http://mldata.org/"; title="mldata :: Welcome">http://mldata.org/ +https://archive.ics.uci.edu/ml/index.php";>UCI Machine Learning Repository +https://datasetsearch.research.google.com/";>Google Dataset Search +https://www.kaggle.com/datasets";>Kaggle Datasets +https://www.kdnuggets.com/datasets/index.html";>KDnuggets List of Datasets
[madlib] branch master updated: correct disk space comment for gp5 and 6 in keras multi fit user docs
This is an automated email from the ASF dual-hosted git repository. fmcquillan pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/madlib.git The following commit(s) were added to refs/heads/master by this push: new 2f0bb2e correct disk space comment for gp5 and 6 in keras multi fit user docs 2f0bb2e is described below commit 2f0bb2e0b01e060150b443c43f00c5e1d664a5c6 Author: Frank McQuillan AuthorDate: Thu Mar 26 15:25:49 2020 -0700 correct disk space comment for gp5 and 6 in keras multi fit user docs --- .../modules/deep_learning/madlib_keras_fit_multiple_model.sql_in | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/src/ports/postgres/modules/deep_learning/madlib_keras_fit_multiple_model.sql_in b/src/ports/postgres/modules/deep_learning/madlib_keras_fit_multiple_model.sql_in index 4d1eb09..9238652 100644 --- a/src/ports/postgres/modules/deep_learning/madlib_keras_fit_multiple_model.sql_in +++ b/src/ports/postgres/modules/deep_learning/madlib_keras_fit_multiple_model.sql_in @@ -88,10 +88,11 @@ You can set up the models and hyperparameters to try with the Model Selection utility to define the unique combinations of model architectures, compile and fit parameters. -@note If 'madlib_keras_fit_multiple_model()' is running on GPDB 5, the database will +@note If 'madlib_keras_fit_multiple_model()' is running on GPDB 5 and some versions +of GPDB 6, the database will keep adding to the disk space (in proportion to model size) and will only release the disk space once the fit multiple query has completed execution. -This is not the case for GPDB 6+ where disk space is released during the +This is not the case for GPDB 6.5.0+ where disk space is released during the fit multiple query. @note CUDA GPU memory cannot be released until the process holding it is terminated.
[madlib] branch master updated: add clarification in DL user docs re GPU memory release
This is an automated email from the ASF dual-hosted git repository. fmcquillan pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/madlib.git The following commit(s) were added to refs/heads/master by this push: new 2896c24 add clarification in DL user docs re GPU memory release 2896c24 is described below commit 2896c24acba9f25cc30d1a412ee2d84cc4cf5187 Author: Frank McQuillan AuthorDate: Thu Mar 26 13:14:41 2020 -0700 add clarification in DL user docs re GPU memory release --- .../postgres/modules/deep_learning/madlib_keras.sql_in| 15 ++- .../deep_learning/madlib_keras_fit_multiple_model.sql_in | 15 ++- 2 files changed, 28 insertions(+), 2 deletions(-) diff --git a/src/ports/postgres/modules/deep_learning/madlib_keras.sql_in b/src/ports/postgres/modules/deep_learning/madlib_keras.sql_in index e4794a3..75fa56a 100644 --- a/src/ports/postgres/modules/deep_learning/madlib_keras.sql_in +++ b/src/ports/postgres/modules/deep_learning/madlib_keras.sql_in @@ -84,9 +84,20 @@ but rather imported from an external source. This is in the section called "Predict BYOM" below, where "BYOM" stands for "Bring Your Own Model." Note that the following MADlib functions are targeting a specific Keras -version (2.2.4) with a specific Tensorflow kernel version (1.14). +version (2.2.4) with a specific TensorFlow kernel version (1.14). Using a newer or older version may or may not work as intended. +@note CUDA GPU memory cannot be released until the process holding it is terminated. +When a MADlib deep learning function is called with GPUs, Greenplum internally +creates a process (called a slice) which calls TensorFlow to do the computation. +This process holds the GPU memory until one of the following two things happen: +query finishes and user logs out of the Postgres client/session; or, +query finishes and user waits for the timeout set by `gp_vmem_idle_resource_timeout`. +The default value for this timeout is 18 sec [8]. So the recommendation is: +log out/reconnect to the session after every GPU query; or +wait for `gp_vmem_idle_resource_timeout` before you run another GPU query (you can +also set it to a lower value). + @anchor keras_fit @par Fit The fit (training) function has the following format: @@ -1620,6 +1631,8 @@ http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf Yuhao Zhang, and Arun Kumar, Technical Report, Computer Science and Engineering, University of California, San Diego https://adalabucsd.github.io/papers/TR_2019_Cerebro.pdf. +[8] Greenplum Database server configuration parameters https://gpdb.docs.pivotal.io/latest/ref_guide/config_params/guc-list.html + @anchor related @par Related Topics diff --git a/src/ports/postgres/modules/deep_learning/madlib_keras_fit_multiple_model.sql_in b/src/ports/postgres/modules/deep_learning/madlib_keras_fit_multiple_model.sql_in index cd58d93..b929724 100644 --- a/src/ports/postgres/modules/deep_learning/madlib_keras_fit_multiple_model.sql_in +++ b/src/ports/postgres/modules/deep_learning/madlib_keras_fit_multiple_model.sql_in @@ -94,6 +94,17 @@ release the disk space once the fit multiple query has completed execution. This is not the case for GPDB 6+ where disk space is released during the fit multiple query. +@note CUDA GPU memory cannot be released until the process holding it is terminated. +When a MADlib deep learning function is called with GPUs, Greenplum internally +creates a process (called a slice) which calls TensorFlow to do the computation. +This process holds the GPU memory until one of the following two things happen: +query finishes and user logs out of the Postgres client/session; or, +query finishes and user waits for the timeout set by `gp_vmem_idle_resource_timeout`. +The default value for this timeout is 18 sec [8]. So the recommendation is: +log out/reconnect to the session after every GPU query; or +wait for `gp_vmem_idle_resource_timeout` before you run another GPU query (you can +also set it to a lower value). + @anchor keras_fit @par Fit The fit (training) function has the following format: @@ -1381,10 +1392,12 @@ https://adalabucsd.github.io/papers/TR_2019_Cerebro.pdf Geoffrey Hinton with Nitish Srivastava and Kevin Swersky, http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf -[6] Deep learning section of Apache MADlib wiki, https://cwiki.apache.org/confluence/display/MADLIB/Deep+Learning +[6] Deep learning section of Apache MADlib wiki https://cwiki.apache.org/confluence/display/MADLIB/Deep+Learning [7] Deep Learning, Ian Goodfellow, Yoshua Bengio and Aaron Courville, MIT Press, 2016. +[8] Greenplum Database server configuration parameters https://gpdb.docs.pivotal.io/latest/ref_guide/config_params/guc-list.html + @anchor related @par Related Topics
[madlib] branch master updated: indicate optional param in elastic net
This is an automated email from the ASF dual-hosted git repository. fmcquillan pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/madlib.git The following commit(s) were added to refs/heads/master by this push: new 62e2a46 indicate optional param in elastic net 62e2a46 is described below commit 62e2a46173761e1d6ef4db8304e15506f724a708 Author: Frank McQuillan AuthorDate: Wed Feb 12 17:30:19 2020 -0800 indicate optional param in elastic net --- src/ports/postgres/modules/elastic_net/elastic_net.sql_in | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/ports/postgres/modules/elastic_net/elastic_net.sql_in b/src/ports/postgres/modules/elastic_net/elastic_net.sql_in index 157851d..c1aaebf 100644 --- a/src/ports/postgres/modules/elastic_net/elastic_net.sql_in +++ b/src/ports/postgres/modules/elastic_net/elastic_net.sql_in @@ -163,7 +163,7 @@ empty string, no columns are excluded. max_iter (optional) INTEGER, default: 1000. The maximum number of iterations allowed. -tolerance +tolerance (optional) FLOAT8, default: 1e-6. This is the criterion to stop iterating. Both the 'fista' and 'igd' optimizers compute the difference between the log likelihood of two consecutive iterations, and when the difference is smaller
[madlib] branch master updated: indicate optional params in DR and RF
This is an automated email from the ASF dual-hosted git repository. fmcquillan pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/madlib.git The following commit(s) were added to refs/heads/master by this push: new ac30a3c indicate optional params in DR and RF ac30a3c is described below commit ac30a3c508509a6996f872b0a7505b215c94fd85 Author: Frank McQuillan AuthorDate: Wed Feb 12 17:06:19 2020 -0800 indicate optional params in DR and RF --- .../postgres/modules/recursive_partitioning/decision_tree.sql_in| 6 +++--- .../postgres/modules/recursive_partitioning/random_forest.sql_in| 2 +- 2 files changed, 4 insertions(+), 4 deletions(-) diff --git a/src/ports/postgres/modules/recursive_partitioning/decision_tree.sql_in b/src/ports/postgres/modules/recursive_partitioning/decision_tree.sql_in index 2408770..04f7b82 100644 --- a/src/ports/postgres/modules/recursive_partitioning/decision_tree.sql_in +++ b/src/ports/postgres/modules/recursive_partitioning/decision_tree.sql_in @@ -537,7 +537,7 @@ tree_predict(tree_model, 'estimated_prob_dep_value', where dep_value represents each value of the response variable. - type + type (optional) TEXT, optional, default: 'response'. For regression trees, the output is always the predicted value of the dependent variable. For classification trees, the type variable can be 'response', giving the @@ -580,10 +580,10 @@ split for a tuple. tree_model TEXT. Name of the table containing the decision tree model. -dot_format +dot_format (optional) BOOLEAN, default = TRUE. Output can either be in a dot format or a text format. If TRUE, the result is in the dot format, else output is in text format. -verbosity +verbosity (optional) BOOLEAN, default = FALSE. If set to TRUE, the dot format output will contain additional information (impurity, sample size, number of weighted rows for each response variable, classification or prediction if the tree diff --git a/src/ports/postgres/modules/recursive_partitioning/random_forest.sql_in b/src/ports/postgres/modules/recursive_partitioning/random_forest.sql_in index 251dfbc..888388c 100644 --- a/src/ports/postgres/modules/recursive_partitioning/random_forest.sql_in +++ b/src/ports/postgres/modules/recursive_partitioning/random_forest.sql_in @@ -545,7 +545,7 @@ forest_predict(random_forest_model, 'estimated_prob_dep_value', where dep_value represents each value of the response variable. - type + type (optional) TEXT, optional, default: 'response'. For regression trees, the output is always the predicted value of the dependent variable. For classification trees, the type variable can be 'response', giving the
[madlib] branch master updated: misc user doc clarifications
This is an automated email from the ASF dual-hosted git repository. fmcquillan pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/madlib.git The following commit(s) were added to refs/heads/master by this push: new 515dc25 misc user doc clarifications 515dc25 is described below commit 515dc2574f2800c0459ec2f0b10d17071f456186 Author: Frank McQuillan AuthorDate: Wed Jan 15 16:41:10 2020 -0800 misc user doc clarifications --- .../madlib_keras_fit_multiple_model.sql_in | 97 -- .../madlib_keras_model_selection.sql_in| 10 +++ 2 files changed, 64 insertions(+), 43 deletions(-) diff --git a/src/ports/postgres/modules/deep_learning/madlib_keras_fit_multiple_model.sql_in b/src/ports/postgres/modules/deep_learning/madlib_keras_fit_multiple_model.sql_in index 0468942..33699a4 100644 --- a/src/ports/postgres/modules/deep_learning/madlib_keras_fit_multiple_model.sql_in +++ b/src/ports/postgres/modules/deep_learning/madlib_keras_fit_multiple_model.sql_in @@ -130,6 +130,13 @@ madlib_keras_fit_multiple_model( num_iterations INTEGER. Number of iterations to train. + +@note +This parameter is different than the number of passes over the dataset, +which is commonly referred to as the number of epochs. Since MADlib operates +in a distributed system, the number of +epochs is actually equal to this parameter 'num_iterations' X 'epochs' as +specified in the Keras fit parameter. use_gpus (optional) @@ -1016,18 +1023,18 @@ SELECT * FROM iris_multi_model_info ORDER BY training_metrics_final DESC, traini mst_key | model_id | compile_params | fit_params | model_type | model_size | metrics_elapsed_time | metrics_type | training_metrics_final | training_loss_final | training_metrics |training_loss| validation_metrics_final | validation_loss_final | validation_metrics | validation_loss -+--+-+---+--+--+--+--++-+-+-+--+---++- - 9 |2 | loss='categorical_crossentropy', optimizer='Adam(lr=0.01)',metrics=['accuracy'] | batch_size=4,epochs=1 | madlib_keras | 1.2197265625 | {0.189763069152832} | {accuracy} | 0.98349228 | 0.102392569184 | {0.98349227905} | {0.102392569184303} | | | | - 4 |1 | loss='categorical_crossentropy', optimizer='Adam(lr=0.01)',metrics=['accuracy'] | batch_size=8,epochs=1 | madlib_keras | 0.7900390625 | {0.170287847518921} | {accuracy} | 0.97523842 | 0.159002527595 | {0.97523841858} | {0.159002527594566} | | | | - 3 |1 | loss='categorical_crossentropy', optimizer='Adam(lr=0.01)',metrics=['accuracy'] | batch_size=4,epochs=1 | madlib_keras | 0.7900390625 | {0.165465116500854} | {accuracy} | 0.96638851 | 0.10245500505 | {0.96638851166} | {0.102455005049706} | | | | - 10 |2 | loss='categorical_crossentropy', optimizer='Adam(lr=0.01)',metrics=['accuracy'] | batch_size=8,epochs=1 | madlib_keras | 1.2197265625 | {0.199872970581055} | {accuracy} | 0.94162693 | 0.12242924422 | {0.94162693024} | {0.122429244220257} | | | | - 5 |1 | loss='categorical_crossentropy',optimizer='Adam(lr=0.001)',metrics=['accuracy'] | batch_size=4,epochs=1 | madlib_keras | 0.7900390625 | {0.16815185546875} | {accuracy} | 0.88325386 | 0.437314987183 | {0.88325386047} | {0.437314987182617} | | || - 11 |2 | loss='categorical_crossentropy',optimizer='Adam(lr=0.001)',metrics=['accuracy'] | batch_size=4,epochs=1 | madlib_keras | 1.2197265625 | {0.430488109588623} | {accuracy} | 0.85849228 | 0.400548309088 | {0.85849227905} | {0.400548309087753} | | || - 6 |1 | loss='categorical_crossentropy',optimizer='Adam(lr=0.001)',metrics=['accuracy'] | batch_size=8,epochs=1 | madlib_keras | 0.7900390625 | {0
[madlib] 02/02: Update Apache Copyright date
This is an automated email from the ASF dual-hosted git repository. fmcquillan pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/madlib.git commit 273301e3c2e150b9648a886761607695a04ce236 Author: Domino Valdano AuthorDate: Wed Jan 15 10:42:52 2020 -0800 Update Apache Copyright date --- NOTICE | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/NOTICE b/NOTICE index 7cbfa51..10b9387 100644 --- a/NOTICE +++ b/NOTICE @@ -1,5 +1,5 @@ Apache MADlib -Copyright 2016-2019 The Apache Software Foundation. +Copyright 2016-2020 The Apache Software Foundation. This product includes software developed at The Apache Software Foundation (http://www.apache.org/).
[madlib] 01/02: Decrease the learning rate for transfer learning test
This is an automated email from the ASF dual-hosted git repository. fmcquillan pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/madlib.git commit 96a44244f6110c34e2ca3742b0b6f72593e3ada8 Author: Domino Valdano AuthorDate: Tue Jan 14 17:52:02 2020 -0800 Decrease the learning rate for transfer learning test This helps smooth out the learning curve, making the test be more predictable... much less likely to fail due to a random fluctuation. Co-authored-by: Orhan Kislal --- .../modules/deep_learning/test/madlib_keras_transfer_learning.sql_in | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/src/ports/postgres/modules/deep_learning/test/madlib_keras_transfer_learning.sql_in b/src/ports/postgres/modules/deep_learning/test/madlib_keras_transfer_learning.sql_in index d17ea20..92f2277 100644 --- a/src/ports/postgres/modules/deep_learning/test/madlib_keras_transfer_learning.sql_in +++ b/src/ports/postgres/modules/deep_learning/test/madlib_keras_transfer_learning.sql_in @@ -290,8 +290,8 @@ SELECT load_model_selection_table( 'mst_table', ARRAY[1,3], ARRAY[ - $$loss='categorical_crossentropy',optimizer='Adam(lr=0.01)',metrics=['accuracy']$$, -$$loss='categorical_crossentropy', optimizer='Adam(lr=0.001)',metrics=['accuracy']$$ + $$loss='categorical_crossentropy',optimizer='Adam(lr=0.1)',metrics=['accuracy']$$, +$$loss='categorical_crossentropy', optimizer='Adam(lr=0.2)',metrics=['accuracy']$$ ], ARRAY[ $$batch_size=5,epochs=1$$
[madlib] branch master updated (7625ae0 -> 273301e)
This is an automated email from the ASF dual-hosted git repository. fmcquillan pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/madlib.git. from 7625ae0 DL: Fix failure on GPDB6 for preprocessor new 96a4424 Decrease the learning rate for transfer learning test new 273301e Update Apache Copyright date The 2 revisions listed above as "new" are entirely new to this repository and will be described in separate emails. The revisions listed as "add" were already present in the repository and have only been added to this reference. Summary of changes: NOTICE| 2 +- .../modules/deep_learning/test/madlib_keras_transfer_learning.sql_in | 4 ++-- 2 files changed, 3 insertions(+), 3 deletions(-)
[madlib] branch master updated: clarify warm start with model selection in user docs
This is an automated email from the ASF dual-hosted git repository. fmcquillan pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/madlib.git The following commit(s) were added to refs/heads/master by this push: new 1fa020b clarify warm start with model selection in user docs 1fa020b is described below commit 1fa020b08c2a4a8971d1957674794894d6c71783 Author: Frank McQuillan AuthorDate: Tue Jan 7 17:33:42 2020 -0800 clarify warm start with model selection in user docs --- .../modules/deep_learning/madlib_keras_fit_multiple_model.sql_in | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/src/ports/postgres/modules/deep_learning/madlib_keras_fit_multiple_model.sql_in b/src/ports/postgres/modules/deep_learning/madlib_keras_fit_multiple_model.sql_in index fbf3497..0468942 100644 --- a/src/ports/postgres/modules/deep_learning/madlib_keras_fit_multiple_model.sql_in +++ b/src/ports/postgres/modules/deep_learning/madlib_keras_fit_multiple_model.sql_in @@ -84,7 +84,7 @@ for the training data. For example, you may only want to train models on segments that reside on hosts that are GPU enabled. You can set up the models and hyperparameters to try with the -Setup +Setup Model Selection utility to define the unique combinations of model architectures, compile and fit parameters. @@ -1320,6 +1320,8 @@ set the 'warm_start' parameter to TRUE in the fit function. Transfer learning uses initial model state (weights) stored in the 'model_arch_table' - in this case set the 'warm_start' parameter to FALSE in the fit function. +4. Here are some more details on how warm start works. These details are mostly applicable when implementing autoML algorithms on top of MADlib's model selection. In short, the 'model_selection_table' dictates which models get trained and output to the 'model_output_table' and associated summary and info tables. When 'warm_start' is TRUE, models are built for each 'mst_key' in the 'model_selection_table'. If there are prior runs for an 'mst_key' then the weights from that run will be [...] + @anchor background @par Technical Background
[madlib] branch master updated: clarify warm start vs transfer learning in user docs
This is an automated email from the ASF dual-hosted git repository. fmcquillan pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/madlib.git The following commit(s) were added to refs/heads/master by this push: new 87758d7 clarify warm start vs transfer learning in user docs 87758d7 is described below commit 87758d701df0cba6f924fd1ff27f3ab931a4b8f8 Author: Frank McQuillan AuthorDate: Fri Dec 20 16:53:26 2019 -0800 clarify warm start vs transfer learning in user docs --- src/ports/postgres/modules/deep_learning/madlib_keras.sql_in | 10 -- .../deep_learning/madlib_keras_fit_multiple_model.sql_in | 10 -- 2 files changed, 16 insertions(+), 4 deletions(-) diff --git a/src/ports/postgres/modules/deep_learning/madlib_keras.sql_in b/src/ports/postgres/modules/deep_learning/madlib_keras.sql_in index 0a395e8..1d2d0ba 100644 --- a/src/ports/postgres/modules/deep_learning/madlib_keras.sql_in +++ b/src/ports/postgres/modules/deep_learning/madlib_keras.sql_in @@ -1566,11 +1566,17 @@ metrics_iters | {10} @anchor notes @par Notes -1. Refer to the deep learning section of the Apache MADlib +1. Refer to the deep learning section of the Apache MADlib wiki [5] for important information including supported libraries and versions. -2. Classification is currently supported, not regression. +2. Classification is currently supported, not regression. + +3. Reminder about the distinction between warm start and transfer learning. Warm start uses model +state (weights) from the model output table from a previous training run - +set the 'warm_start' parameter to TRUE in the fit function. +Transfer learning uses initial model state (weights) stored in the 'model_arch_table' - in this case set the +'warm_start' parameter to FALSE in the fit function. @anchor background @par Technical Background diff --git a/src/ports/postgres/modules/deep_learning/madlib_keras_fit_multiple_model.sql_in b/src/ports/postgres/modules/deep_learning/madlib_keras_fit_multiple_model.sql_in index 669c5db..fbf3497 100644 --- a/src/ports/postgres/modules/deep_learning/madlib_keras_fit_multiple_model.sql_in +++ b/src/ports/postgres/modules/deep_learning/madlib_keras_fit_multiple_model.sql_in @@ -1308,11 +1308,17 @@ Note that the loss and accuracy values pick up from where the previous run left @anchor notes @par Notes -1. Refer to the deep learning section of the Apache MADlib +1. Refer to the deep learning section of the Apache MADlib wiki [6] for important information including supported libraries and versions. -2. Classification is currently supported, not regression. +2. Classification is currently supported, not regression. + +3. Reminder about the distinction between warm start and transfer learning. Warm start uses model +state (weights) from the model output table from a previous training run - +set the 'warm_start' parameter to TRUE in the fit function. +Transfer learning uses initial model state (weights) stored in the 'model_arch_table' - in this case set the +'warm_start' parameter to FALSE in the fit function. @anchor background @par Technical Background
[madlib] branch master updated: correct fit function call in user docs for multi fit
This is an automated email from the ASF dual-hosted git repository. fmcquillan pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/madlib.git The following commit(s) were added to refs/heads/master by this push: new fc9cd64 correct fit function call in user docs for multi fit fc9cd64 is described below commit fc9cd64ea53433353d7db205113f0e499d920f14 Author: Frank McQuillan AuthorDate: Thu Dec 19 16:11:58 2019 -0800 correct fit function call in user docs for multi fit --- .../modules/deep_learning/madlib_keras_fit_multiple_model.sql_in| 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/ports/postgres/modules/deep_learning/madlib_keras_fit_multiple_model.sql_in b/src/ports/postgres/modules/deep_learning/madlib_keras_fit_multiple_model.sql_in index c0a68b3..669c5db 100644 --- a/src/ports/postgres/modules/deep_learning/madlib_keras_fit_multiple_model.sql_in +++ b/src/ports/postgres/modules/deep_learning/madlib_keras_fit_multiple_model.sql_in @@ -93,7 +93,7 @@ of model architectures, compile and fit parameters. The fit (training) function has the following format: -madlib_keras_fit( +madlib_keras_fit_multiple_model( source_table, model_output_table, model_selection_table,
[madlib] branch master updated: misc user doc updates for 1dot17
This is an automated email from the ASF dual-hosted git repository. fmcquillan pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/madlib.git The following commit(s) were added to refs/heads/master by this push: new ec5614f misc user doc updates for 1dot17 ec5614f is described below commit ec5614fe34fc4e410ac226a60985051fc166ee8e Author: Frank McQuillan AuthorDate: Tue Dec 17 12:38:01 2019 -0800 misc user doc updates for 1dot17 --- doc/mainpage.dox.in| 6 +-- .../deep_learning/input_data_preprocessor.sql_in | 4 +- .../deep_learning/keras_model_arch_table.sql_in| 9 ++-- .../modules/deep_learning/madlib_keras.sql_in | 57 +++--- .../madlib_keras_fit_multiple_model.sql_in | 28 ++- src/ports/postgres/modules/knn/knn.sql_in | 4 ++ 6 files changed, 69 insertions(+), 39 deletions(-) diff --git a/doc/mainpage.dox.in b/doc/mainpage.dox.in index 0e7b426..82be4a5 100644 --- a/doc/mainpage.dox.in +++ b/doc/mainpage.dox.in @@ -292,9 +292,9 @@ Interface and implementation are subject to change. @defgroup grp_gpu_configuration GPU Configuration @defgroup grp_keras Keras @defgroup grp_keras_model_arch Load Models -@defgroup grp_model_selection Model Selection -@brief Train multiple deep learning models at the same time. -@details Train multiple deep learning models at the same time. +@defgroup grp_model_selection Model Selection for DL +@brief Train multiple deep learning models at the same time for model architecture search and hyperparameter selection. +@details Train multiple deep learning models at the same time for model architecture search and hyperparameter selection. @{ @defgroup grp_automl AutoML @defgroup grp_keras_run_model_selection Run Model Selection diff --git a/src/ports/postgres/modules/deep_learning/input_data_preprocessor.sql_in b/src/ports/postgres/modules/deep_learning/input_data_preprocessor.sql_in index ddc356f..f243417 100644 --- a/src/ports/postgres/modules/deep_learning/input_data_preprocessor.sql_in +++ b/src/ports/postgres/modules/deep_learning/input_data_preprocessor.sql_in @@ -853,7 +853,9 @@ Geoffrey Hinton with Nitish Srivastava and Kevin Swersky, http://www.cs.toronto. @anchor related @par Related Topics -minibatch_preprocessing.sql_in +training_preprocessor_dl() + +validation_preprocessor_dl() gpu_configuration() diff --git a/src/ports/postgres/modules/deep_learning/keras_model_arch_table.sql_in b/src/ports/postgres/modules/deep_learning/keras_model_arch_table.sql_in index b1bf150..cc915bb 100644 --- a/src/ports/postgres/modules/deep_learning/keras_model_arch_table.sql_in +++ b/src/ports/postgres/modules/deep_learning/keras_model_arch_table.sql_in @@ -275,11 +275,10 @@ SELECT COUNT(*) FROM model_arch_library WHERE model_weights IS NOT NULL; ---+ 1 -Load weights from Keras using psycopg2. -(Psycopg is a PostgreSQL database adapter for the -Python programming language.) As above we need to -flatten then serialize the weights to store as a -PostgreSQL binary data type. +Load weights from Keras using psycopg2. (Psycopg is a PostgreSQL database adapter for the +Python programming language.) As above we need to flatten then serialize the weights to store as a +PostgreSQL binary data type. Note that the psycopg2.Binary function used below will increase the size of the +Python object for the weights, so if your model is large it might be better to use a PL/Python function as above. import psycopg2 import psycopg2 as p2 diff --git a/src/ports/postgres/modules/deep_learning/madlib_keras.sql_in b/src/ports/postgres/modules/deep_learning/madlib_keras.sql_in index 6127031..0a395e8 100644 --- a/src/ports/postgres/modules/deep_learning/madlib_keras.sql_in +++ b/src/ports/postgres/modules/deep_learning/madlib_keras.sql_in @@ -737,7 +737,12 @@ madlib_keras_predict_byom( class_values (optional) TEXT[], default: NULL. List of class labels that were used while training the model. See the 'output_table' -column for more details. +column above for more details. + +@note +If you specify the class values parameter, +it must reflect how the dependent variable was 1-hot encoded for training. If you accidently +pick another order that does not match the 1-hot encoding, the predictions would be wrong. normalizing_const (optional) @@ -1166,7 +1171,7 @@ WHERE iris_predict.estimated_class_text != iris_test.class_text; 6 (1 row) -Percent missclassifications: +Accuracy: SELECT round(count(*)*100/(150*0.2),2) as test_accuracy_percent from (select iris_test.class_text as actual, iris_predict.estimated_class_text as estimated @@ -1188,10 +1193,18 @@ syntax. See load_keras_model for details on how to load the model architecture and weights. In this example w
[madlib-site] branch automl updated: hyperband diagonal E2E update
This is an automated email from the ASF dual-hosted git repository. fmcquillan pushed a commit to branch automl in repository https://gitbox.apache.org/repos/asf/madlib-site.git The following commit(s) were added to refs/heads/automl by this push: new 94a7f7e hyperband diagonal E2E update 94a7f7e is described below commit 94a7f7e81077ccd67710648850b696e2344e39d9 Author: Frank McQuillan AuthorDate: Fri Nov 22 16:29:51 2019 -0800 hyperband diagonal E2E update --- .../hyperband_diag_v2_mnist-checkpoint.ipynb | 157 + .../automl/hyperband_diag_v2_mnist.ipynb | 130 - 2 files changed, 135 insertions(+), 152 deletions(-) diff --git a/community-artifacts/Deep-learning/automl/.ipynb_checkpoints/hyperband_diag_v2_mnist-checkpoint.ipynb b/community-artifacts/Deep-learning/automl/.ipynb_checkpoints/hyperband_diag_v2_mnist-checkpoint.ipynb index 091e6fd..b62f8d5 100644 --- a/community-artifacts/Deep-learning/automl/.ipynb_checkpoints/hyperband_diag_v2_mnist-checkpoint.ipynb +++ b/community-artifacts/Deep-learning/automl/.ipynb_checkpoints/hyperband_diag_v2_mnist-checkpoint.ipynb @@ -30,19 +30,17 @@ }, { "cell_type": "code", - "execution_count": 1, + "execution_count": 16, "metadata": { "scrolled": true }, "outputs": [ { - "name": "stderr", + "name": "stdout", "output_type": "stream", "text": [ - "/Users/fmcquillan/anaconda/lib/python2.7/site-packages/IPython/config.py:13: ShimWarning: The `IPython.config` package has been deprecated since IPython 4.0. You should import from traitlets.config instead.\n", - " \"You should import from traitlets.config instead.\", ShimWarning)\n", - "/Users/fmcquillan/anaconda/lib/python2.7/site-packages/IPython/utils/traitlets.py:5: UserWarning: IPython.utils.traitlets has moved to a top-level traitlets package.\n", - " warn(\"IPython.utils.traitlets has moved to a top-level traitlets package.\")\n" + "The sql extension is already loaded. To reload it, use:\n", + " %reload_ext sql\n" ] } ], @@ -52,7 +50,7 @@ }, { "cell_type": "code", - "execution_count": 2, + "execution_count": 17, "metadata": {}, "outputs": [], "source": [ @@ -74,7 +72,7 @@ }, { "cell_type": "code", - "execution_count": 3, + "execution_count": 19, "metadata": {}, "outputs": [ { @@ -100,7 +98,7 @@ "[(u'MADlib version: 1.17-dev, git revision: rel/v1.16-47-g5a1717e, cmake configuration time: Tue Nov 19 01:02:39 UTC 2019, build type: release, build system: Linux-3.10.0-957.27.2.el7.x86_64, C compiler: gcc 4.8.5, C++ compiler: g++ 4.8.5',)]" ] }, - "execution_count": 3, + "execution_count": 19, "metadata": {}, "output_type": "execute_result" } @@ -121,24 +119,9 @@ }, { "cell_type": "code", - "execution_count": 4, + "execution_count": 20, "metadata": {}, - "outputs": [ -{ - "name": "stderr", - "output_type": "stream", - "text": [ - "Using TensorFlow backend.\n" - ] -}, -{ - "name": "stdout", - "output_type": "stream", - "text": [ - "Couldn't import dot_parser, loading of dot files will not be possible.\n" - ] -} - ], + "outputs": [], "source": [ "from __future__ import print_function\n", "\n", @@ -180,7 +163,7 @@ }, { "cell_type": "code", - "execution_count": 5, + "execution_count": 21, "metadata": {}, "outputs": [], "source": [ @@ -794,7 +777,7 @@ }, { "cell_type": "code", - "execution_count": 17, + "execution_count": 22, "metadata": {}, "outputs": [ { @@ -821,7 +804,7 @@ "[]" ] }, - "execution_count": 17, + "execution_count": 22, "metadata": {}, "output_type": "execute_result" } @@ -896,7 +879,7 @@ }, { "cell_type": "code", - "execution_count": 18, + "execution_count": 23, "metadata": {}, "outputs": [], "source": [ @@ -924,7 +907,7 @@
[madlib-site] branch automl updated: hyperband diagonal E2E still in work...
This is an automated email from the ASF dual-hosted git repository. fmcquillan pushed a commit to branch automl in repository https://gitbox.apache.org/repos/asf/madlib-site.git The following commit(s) were added to refs/heads/automl by this push: new c606abc hyperband diagonal E2E still in work... c606abc is described below commit c606abcf87684808eaa68fc47b700ae247a7f20c Author: Frank McQuillan AuthorDate: Thu Nov 21 17:20:43 2019 -0800 hyperband diagonal E2E still in work... --- .../hyperband_diag_v2_mnist-checkpoint.ipynb | 924 ++--- .../automl/hyperband_diag_v2_mnist.ipynb | 924 ++--- 2 files changed, 866 insertions(+), 982 deletions(-) diff --git a/community-artifacts/Deep-learning/automl/.ipynb_checkpoints/hyperband_diag_v2_mnist-checkpoint.ipynb b/community-artifacts/Deep-learning/automl/.ipynb_checkpoints/hyperband_diag_v2_mnist-checkpoint.ipynb index 09598ea..091e6fd 100644 --- a/community-artifacts/Deep-learning/automl/.ipynb_checkpoints/hyperband_diag_v2_mnist-checkpoint.ipynb +++ b/community-artifacts/Deep-learning/automl/.ipynb_checkpoints/hyperband_diag_v2_mnist-checkpoint.ipynb @@ -23,7 +23,9 @@ "\n", "5. Hyperband diagonal\n", "\n", -"6. Plot results" +"6. Plot results\n", +"\n", +"7. Print run schedules" ] }, { @@ -792,7 +794,7 @@ }, { "cell_type": "code", - "execution_count": 6, + "execution_count": 17, "metadata": {}, "outputs": [ { @@ -819,7 +821,7 @@ "[]" ] }, - "execution_count": 6, + "execution_count": 17, "metadata": {}, "output_type": "execute_result" } @@ -894,7 +896,7 @@ }, { "cell_type": "code", - "execution_count": 7, + "execution_count": 18, "metadata": {}, "outputs": [], "source": [ @@ -917,344 +919,12 @@ "cell_type": "markdown", "metadata": {}, "source": [ -"Pretty print reg Hyperband run schedule" - ] - }, - { - "cell_type": "code", - "execution_count": 7, - "metadata": {}, - "outputs": [ -{ - "name": "stdout", - "output_type": "stream", - "text": [ - "max_iter = 3\n", - "eta = 3\n", - "B = 2*max_iter = 6\n", - " \n", - "s=1\n", - "n_i r_i\n", - "\n", - "3 1.0\n", - "1.0 3.0\n", - " \n", - "s=0\n", - "n_i r_i\n", - "\n", - "2 3\n", - " \n", - "sum of configurations at leaf nodes across all s = 3.0\n", - "(if have more workers than this, they may not be 100% busy)\n" - ] -} - ], - "source": [ -"import numpy as np\n", -"from math import log, ceil\n", -"\n", -"#input\n", -"max_iter = 3 # maximum iterations/epochs per configuration\n", -"eta = 3 # defines downsampling rate (default=3)\n", -"\n", -"logeta = lambda x: log(x)/log(eta)\n", -"s_max = int(logeta(max_iter)) # number of unique executions of Successive Halving (minus one)\n", -"B = (s_max+1)*max_iter # total number of iterations (without reuse) per execution of Succesive Halving (n,r)\n", -"\n", -"#echo output\n", -"print (\"max_iter = \" + str(max_iter))\n", -"print (\"eta = \" + str(eta))\n", -"print (\"B = \" + str(s_max+1) + \"*max_iter = \" + str(B))\n", -"\n", -"sum_leaf_n_i = 0 # count configurations at leaf nodes across all s\n", -"\n", -" Begin Finite Horizon Hyperband outlerloop. Repeat indefinitely.\n", -"for s in reversed(range(s_max+1)):\n", -"\n", -"print (\" \")\n", -"print (\"s=\" + str(s))\n", -"print (\"n_i r_i\")\n", -"print (\"\")\n", -"counter = 0\n", -"\n", -"n = int(ceil(int(B/max_iter/(s+1))*eta**s)) # initial number of configurations\n", -"r = max_iter*eta**(-s) # initial number of iterations to run configurations for\n", -"\n", -" Begin Finite Ho
[madlib-site] branch automl created (now 0c8e677)
This is an automated email from the ASF dual-hosted git repository. fmcquillan pushed a change to branch automl in repository https://gitbox.apache.org/repos/asf/madlib-site.git. at 0c8e677 hyperband in work This branch includes the following new commits: new 0c8e677 hyperband in work The 1 revisions listed above as "new" are entirely new to this repository and will be described in separate emails. The revisions listed as "add" were already present in the repository and have only been added to this reference.
[madlib] 02/02: Address review comments
This is an automated email from the ASF dual-hosted git repository. fmcquillan pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/madlib.git commit 72dfd30df1b7e79525de825c56e81b423e0c3d1b Author: Orhan Kislal AuthorDate: Fri Oct 25 11:35:51 2019 -0400 Address review comments --- RELEASE_NOTES| 2 +- src/ports/postgres/modules/deep_learning/madlib_keras.sql_in | 5 +++-- 2 files changed, 4 insertions(+), 3 deletions(-) diff --git a/RELEASE_NOTES b/RELEASE_NOTES index d4296ec..5b8e0ae 100644 --- a/RELEASE_NOTES +++ b/RELEASE_NOTES @@ -15,7 +15,7 @@ MADlib v1.17: Release Date: Other: -- DL: Supported keras version is fixed to 2.2.4 +- DL: Supported keras version is capped at 2.2.4, tensorflow version is capped at 1.14. —- MADlib v1.16: diff --git a/src/ports/postgres/modules/deep_learning/madlib_keras.sql_in b/src/ports/postgres/modules/deep_learning/madlib_keras.sql_in index 9c4f39a..4ffec57 100644 --- a/src/ports/postgres/modules/deep_learning/madlib_keras.sql_in +++ b/src/ports/postgres/modules/deep_learning/madlib_keras.sql_in @@ -77,8 +77,9 @@ typically resulting faster and smoother convergence [3]. You can also do inference on models that have not been trained with MADlib, but rather imported from an external source. -Note that the following MADlib functions are targetting a specific Keras -version (2.2.4). Using a newer or older version may or may not work as intended. +Note that the following MADlib functions are targeting a specific Keras +version (2.2.4) with a specific Tensorflow kernel version (1.14). +Using a newer or older version may or may not work as intended. @brief Solves image classification problems by calling the Keras API
[madlib] branch master updated (35e959d -> 72dfd30)
This is an automated email from the ASF dual-hosted git repository. fmcquillan pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/madlib.git. from 35e959d DL: Remove quote_ident to allow tables on schemas new 24c6e73 Add keras version to the docs and release notes new 72dfd30 Address review comments The 2 revisions listed above as "new" are entirely new to this repository and will be described in separate emails. The revisions listed as "add" were already present in the repository and have only been added to this reference. Summary of changes: RELEASE_NOTES| 8 src/ports/postgres/modules/deep_learning/madlib_keras.sql_in | 4 2 files changed, 12 insertions(+)
[madlib] 01/02: Add keras version to the docs and release notes
This is an automated email from the ASF dual-hosted git repository. fmcquillan pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/madlib.git commit 24c6e730c5dd4faa2fc60fd054a88d85643cf63c Author: Orhan Kislal AuthorDate: Mon Oct 21 14:10:12 2019 -0400 Add keras version to the docs and release notes --- RELEASE_NOTES| 8 src/ports/postgres/modules/deep_learning/madlib_keras.sql_in | 3 +++ 2 files changed, 11 insertions(+) diff --git a/RELEASE_NOTES b/RELEASE_NOTES index 49a4cd6..d4296ec 100644 --- a/RELEASE_NOTES +++ b/RELEASE_NOTES @@ -10,6 +10,14 @@ commit history located at https://github.com/apache/madlib/commits/master. Current list of bugs and issues can be found at https://issues.apache.org/jira/browse/MADLIB. —- +MADlib v1.17: + +Release Date: + +Other: +- DL: Supported keras version is fixed to 2.2.4 + +—- MADlib v1.16: Release Date: 2019-Jul-02 diff --git a/src/ports/postgres/modules/deep_learning/madlib_keras.sql_in b/src/ports/postgres/modules/deep_learning/madlib_keras.sql_in index cf4f2d1..9c4f39a 100644 --- a/src/ports/postgres/modules/deep_learning/madlib_keras.sql_in +++ b/src/ports/postgres/modules/deep_learning/madlib_keras.sql_in @@ -77,6 +77,9 @@ typically resulting faster and smoother convergence [3]. You can also do inference on models that have not been trained with MADlib, but rather imported from an external source. +Note that the following MADlib functions are targetting a specific Keras +version (2.2.4). Using a newer or older version may or may not work as intended. + @brief Solves image classification problems by calling the Keras API
[madlib] branch master updated: SVM: Lower bound the default for n_components
This is an automated email from the ASF dual-hosted git repository. fmcquillan pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/madlib.git The following commit(s) were added to refs/heads/master by this push: new 1b5ba4a SVM: Lower bound the default for n_components 1b5ba4a is described below commit 1b5ba4afd58ca9b263ccac47769ef281b45e3466 Author: Orhan Kislal AuthorDate: Fri Oct 4 14:45:06 2019 -0400 SVM: Lower bound the default for n_components JIRA: MADLIB-1384 --- src/ports/postgres/modules/svm/svm.py_in | 2 +- src/ports/postgres/modules/svm/svm.sql_in | 30 ++ 2 files changed, 15 insertions(+), 17 deletions(-) diff --git a/src/ports/postgres/modules/svm/svm.py_in b/src/ports/postgres/modules/svm/svm.py_in index b4f4f45..1532cb2 100644 --- a/src/ports/postgres/modules/svm/svm.py_in +++ b/src/ports/postgres/modules/svm/svm.py_in @@ -1330,7 +1330,7 @@ def _process_epsilon(is_svc, args): def _extract_kernel_params(kernel_params='', n_features=10): params_default = { # common params -'n_components': 2 * n_features, +'n_components': max(100, 2 * n_features), 'fit_intercept': False, 'random_state': 1, diff --git a/src/ports/postgres/modules/svm/svm.sql_in b/src/ports/postgres/modules/svm/svm.sql_in index cb6b69e..ba05e86 100644 --- a/src/ports/postgres/modules/svm/svm.sql_in +++ b/src/ports/postgres/modules/svm/svm.sql_in @@ -319,23 +319,22 @@ to the end of the feature list - thus the last element of the coefficient list is the intercept. n_components -Default: 2*num_features. The dimensionality of the transformed feature space. +Default: max(100, 2*num_features). The dimensionality of the transformed feature space. A larger value lowers the variance of the estimate of the kernel but requires more memory and takes longer to train. @note -Setting the \e n_components kernel parameter properly is important -to generate an accurate decision boundary. This parameter -is the dimensionality of the transformed feature space that arises -from using the primal formulation. We use primal in MADlib -because we are implementing in a distributed system, -compared to an R or other single node implementations -that can use the dual formulation. The primal approach -implements an approximation of the kernel function using random -feature maps, so in the case of a gaussian kernel, the -dimensionality of the transformed feature space is not -infinite (as in dual), but rather of size \e n_components. -Try increasing \e n_components higher than the default if you are -not getting an accurate decision boundary. +Setting the \e n_components kernel parameter properly is important to +generate an accurate decision boundary and can make the difference between a +good model and a useless model. Try increasing the value of \e n_components + if you are not getting an accurate decision boundary. This parameter arises +from using the primal formulation, in which we map data into a relatively +low-dimensional randomized feature space [2, 3]. The parameter +\e n_components is the dimension of that feature space. We use the primal in +MADlib to support scaling to large data sets, compared to R or other single +node implementations that use the dual formulation and hence do not have this +type of mapping, since the the dimensionality of the transformed feature +space in the dual is effectively infinite. + random_state Default: 1. Seed used by a random number generator. @@ -641,8 +640,7 @@ WHERE houses_pred.prediction != (houses.price < 10); -# Train using Gaussian kernel. This time we specify the initial step size and maximum number of iterations to run. As part of the kernel parameter, we choose 10 as the dimension of the space where we train -SVM. A larger number will lead to a more powerful model but run the risk of -overfitting. As a result, the model will be a 10 dimensional vector, instead +SVM. As a result, the model will be a 10 dimensional vector, instead of 4 as in the case of linear model. DROP TABLE IF EXISTS houses_svm_gaussian, houses_svm_gaussian_summary, houses_svm_gaussian_random;
[madlib] branch master updated: updated DL preprocessor docs for bytea (#445)
This is an automated email from the ASF dual-hosted git repository. fmcquillan pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/madlib.git The following commit(s) were added to refs/heads/master by this push: new 63f40e7 updated DL preprocessor docs for bytea (#445) 63f40e7 is described below commit 63f40e70f8dbb6c9ed2b1b91c847fd3819b1a627 Author: Frank McQuillan AuthorDate: Tue Oct 1 13:52:40 2019 -0700 updated DL preprocessor docs for bytea (#445) * updated DL preprocessor docs for bytea * address review comments --- .../deep_learning/input_data_preprocessor.sql_in | 210 ++--- 1 file changed, 98 insertions(+), 112 deletions(-) diff --git a/src/ports/postgres/modules/deep_learning/input_data_preprocessor.sql_in b/src/ports/postgres/modules/deep_learning/input_data_preprocessor.sql_in index a3f4281..8d70431 100644 --- a/src/ports/postgres/modules/deep_learning/input_data_preprocessor.sql_in +++ b/src/ports/postgres/modules/deep_learning/input_data_preprocessor.sql_in @@ -18,7 +18,7 @@ * under the License. * * @file input_preprocessor_dl.sql_in - * @brief TODO + * @brief Utilities to prepare input image data for use by deep learning modules. * @date December 2018 * */ @@ -86,9 +86,10 @@ training_preprocessor_dl(source_table, TEXT. Name of the output table from the training preprocessor which will be used as input to algorithms that support mini-batching. Note that the arrays packed into the output table are shuffled - and normalized (by dividing each element in the independent variable array - by the optional 'normalizing_const' parameter), so they will not match - up in an obvious way with the rows in the source table. + and normalized, by dividing each element in the independent variable array + by the optional 'normalizing_const' parameter. For performance reasons, + packed arrays are converted to PostgreSQL bytea format, which is a + variable-length binary string. In the case a validation data set is used (see later on this page), this output table is also used @@ -158,11 +159,15 @@ validation_preprocessor_dl(source_table, output_table TEXT. Name of the output table from the validation - preprocessor which will be used as input to algorithms that support mini-batching. The arrays packed into the output table are + preprocessor which will be used as input to algorithms that support mini-batching. + The arrays packed into the output table are normalized using the same normalizing constant from the training preprocessor as specified in the 'training_preprocessor_table' parameter described below. Validation data is not shuffled. + For performance reasons, + packed arrays are converted to PostgreSQL bytea format, which is a + variable-length binary string. dependent_varname @@ -209,25 +214,43 @@ validation_preprocessor_dl(source_table, validation_preprocessor_dl() contain the following columns: -buffer_id -INTEGER. Unique id for each row in the packed table. +independent_var +BYTEA. Packed array of independent variables in PostgreSQL bytea format. +Arrays of independent variables packed into the output table are +normalized by dividing each element in the independent variable array by the +optional 'normalizing_const' parameter. Training data is shuffled, but +validation data is not. dependent_var -ANYARRAY[]. Packed array of dependent variables. +BYTEA. Packed array of dependent variables in PostgreSQL bytea format. The dependent variable is always one-hot encoded as an -INTEGER[] array. For now, we are assuming that +integer array. For now, we are assuming that input_preprocessor_dl() will be used only for classification problems using deep learning. So the dependent variable is one-hot encoded, unless it's already a numeric array in which case we assume it's already one-hot -encoded and just cast it to an INTEGER[] array. +encoded and just cast it to an integer array. -independent_var -REAL[]. Packed array of independent variables. +independent_var_shape +INTEGER[]. Shape of the independent variable array after preprocessing. +The first element is the number of images packed per row, and subsequent +elements will depend on how the image is described (e.g., channels first or last). + + + +dependent_var_shape +INTEGER[]. Shape of the dependent variable array after preprocessing. +The first element is the number of images packed per row, and the second +element is the number of class values. + + + +buffer_id +I
[madlib] branch load_mst_user_docs created (now 2059ddb)
This is an automated email from the ASF dual-hosted git repository. fmcquillan pushed a change to branch load_mst_user_docs in repository https://gitbox.apache.org/repos/asf/madlib.git. at 2059ddb user docs for setting up model selection table This branch includes the following new commits: new 2059ddb user docs for setting up model selection table The 1 revisions listed above as "new" are entirely new to this repository and will be described in separate emails. The revisions listed as "add" were already present in the repository and have only been added to this reference.
[madlib] 01/01: user docs for setting up model selection table
This is an automated email from the ASF dual-hosted git repository. fmcquillan pushed a commit to branch load_mst_user_docs in repository https://gitbox.apache.org/repos/asf/madlib.git commit 2059ddbbe7c7fe9a2d2fd76f20796285a42da022 Author: Frank McQuillan AuthorDate: Wed Sep 4 17:03:03 2019 -0700 user docs for setting up model selection table --- doc/mainpage.dox.in| 2 + .../madlib_keras_model_selection.sql_in| 336 + 2 files changed, 338 insertions(+) diff --git a/doc/mainpage.dox.in b/doc/mainpage.dox.in index daedce5..b31dedb 100644 --- a/doc/mainpage.dox.in +++ b/doc/mainpage.dox.in @@ -13,6 +13,7 @@ Useful links: https://mail-archives.apache.org/mod_mbox/madlib-user/";>User mailing list https://mail-archives.apache.org/mod_mbox/madlib-dev/";>Dev mailing list User documentation for earlier releases: +v1.15, v1.15.1, v1.15, v1.14, @@ -292,6 +293,7 @@ Interface and implementation are subject to change. @defgroup grp_keras Keras @defgroup grp_keras_model_arch Load Model @defgroup grp_input_preprocessor_dl Preprocessor for Images +@defgroup grp_keras_model_selection Setup Model Selection @} @defgroup grp_bayes Naive Bayes Classification @defgroup grp_sample Random Sampling diff --git a/src/ports/postgres/modules/deep_learning/madlib_keras_model_selection.sql_in b/src/ports/postgres/modules/deep_learning/madlib_keras_model_selection.sql_in index f26a541..37914d4 100644 --- a/src/ports/postgres/modules/deep_learning/madlib_keras_model_selection.sql_in +++ b/src/ports/postgres/modules/deep_learning/madlib_keras_model_selection.sql_in @@ -27,7 +27,343 @@ *//* --- */ m4_include(`SQLCommon.m4') +/** +@addtogroup grp_keras_model_selection +@brief Utility function to set up a model selection table +for hyperparameter tuning and model architecture search. + +\warning This MADlib method is still in early stage development. +Interface and implementation are subject to change. + +Contents +Load Model Selection Table +Examples +Related Topics + + +This utility function sets up a model selection table +for use by the multiple model Keras fit feature of MADlib. +By model selection we mean both hyperparameter tuning and +model architecture search. The table defines the unique combinations +of model architectures, compile and fit parameters for the tests +to run on a massively parallel processing database cluster. + +@anchor load_mst_table +@par Load Model Selection Table + + +load_model_selection_table( +model_arch_table, +model_selection_table, +model_arch_id_list, +compile_params_list, +fit_params_list +) + + +\b Arguments + + model_arch_table + VARCHAR. Table containing model architectures and weights. + For more information on this table + refer to Load Model. + + + model_selection_table + VARCHAR. Model selection table created by this utility. Content of + this table is described below. + + + model_arch_id_list + INTEGER[]. Array of model IDs from the 'model_arch_table' to be included + in the run combinations. For hyperparameter search, this will typically be + one model ID. For model architecture search, this will be the different model IDs + that you want to test. + + + compile_params_list + VARCHAR[]. Array of compile parameters to be tested. Each element + of the array should consist of a string of compile parameters + exactly as it is to be passed to Keras. + + + fit_params_list + VARCHAR[]. Array of fit parameters to be tested. Each element + of the array should consist of a string of fit parameters + exactly as it is to be passed to Keras. + + + + +Output table + +The model selection output table contains the following columns: + + +mst_key +INTEGER. ID that defines a unique +model architecture-compile parameters-fit parameters tuple. + + + +model_arch_table +VARCHAR. Name of the table corresponding to the model architecture ID. + + + +model_arch_id +VARCHAR. Model architecture ID from the 'model_arch_table'. + + + +compile_params +VARCHAR. Keras compile parameters. + + + +fit_params +VARCHAR. Keras fit parameters. + + + + + +@anchor example +@par Examples +-# The model selection table works in conjunction with a model architecture table, +so we first create a model architecture table with two different models. Use Keras to define +a model architecture with 1 hidden layer: + +import keras +from keras.models import Sequential +from keras.layers import Dense +model1 = Sequential() +model1.add(Dense(10, activation='relu', input_shape=(4,))) +model1.add(Dense(10,
[madlib-site] branch asf-site updated: updated 3 notebooks with minor changes
This is an automated email from the ASF dual-hosted git repository. fmcquillan pushed a commit to branch asf-site in repository https://gitbox.apache.org/repos/asf/madlib-site.git The following commit(s) were added to refs/heads/asf-site by this push: new 388c4b3 updated 3 notebooks with minor changes 388c4b3 is described below commit 388c4b34a08c96e059baca60f17350db327491ca Author: Frank McQuillan AuthorDate: Tue Aug 27 13:28:24 2019 -0700 updated 3 notebooks with minor changes --- .../Deep-learning/Load-images-v1.ipynb | 571 +++-- .../Deep-learning/Load-model-architecture-v1.ipynb | 228 ++-- .../MADlib-Keras-cifar10-cnn-v2.ipynb | 4 +- 3 files changed, 486 insertions(+), 317 deletions(-) diff --git a/community-artifacts/Deep-learning/Load-images-v1.ipynb b/community-artifacts/Deep-learning/Load-images-v1.ipynb index 3209aaf..1750cfc 100644 --- a/community-artifacts/Deep-learning/Load-images-v1.ipynb +++ b/community-artifacts/Deep-learning/Load-images-v1.ipynb @@ -134,35 +134,14 @@ "source": [ "import sys\n", "import os\n", -"from keras.datasets import cifar10" - ] - }, - { - "cell_type": "code", - "execution_count": 5, - "metadata": {}, - "outputs": [], - "source": [ +"from keras.datasets import cifar10\n", +"\n", "madlib_site_dir = '/Users/fmcquillan/Documents/Product/MADlib/Demos/data'\n", -"sys.path.append(madlib_site_dir)" - ] - }, - { - "cell_type": "code", - "execution_count": 6, - "metadata": {}, - "outputs": [], - "source": [ +"sys.path.append(madlib_site_dir)\n", +"\n", "# Import image loader module\n", -"from madlib_image_loader import ImageLoader, DbCredentials" - ] - }, - { - "cell_type": "code", - "execution_count": 7, - "metadata": {}, - "outputs": [], - "source": [ +"from madlib_image_loader import ImageLoader, DbCredentials\n", +"\n", "# Specify database credentials, for connecting to db\n", "#db_creds = DbCredentials(user='gpadmin',\n", "# host='35.239.240.26',\n", @@ -173,15 +152,8 @@ "db_creds = DbCredentials(user='fmcquillan',\n", " host='localhost',\n", " port='5432',\n", -" password='')" - ] - }, - { - "cell_type": "code", - "execution_count": 8, - "metadata": {}, - "outputs": [], - "source": [ +" password='')\n", +"\n", "# Initialize ImageLoader (increase num_workers to run faster)\n", "iloader = ImageLoader(num_workers=5, db_creds=db_creds)" ] @@ -201,15 +173,12 @@ "- data_y is a 1D np.array of the image categories (labels).\n", "\n", "\n", -"- If the user passes a table_name while creating ImageLoader object, it will be used for all further calls to load_dataset_from_np. It can be changed by passing it as a parameter during the actual call to load_dataset_from_np, and if so future calls will load to that table name instead. This avoids needing to pass the table_name again every time, but also allows it to be changed at any time.\n", -"\n", -" \n", -"- append=False attempts to create a new table, while append=True appends more images to an existing table." +"- If the user passes a table_name while creating ImageLoader object, it will be used for all further calls to load_dataset_from_np. It can be changed by passing it as a parameter during the actual call to load_dataset_from_np, and if so future calls will load to that table name instead. This avoids needing to pass the table_name again every time, but also allows it to be changed at any time." ] }, { "cell_type": "code", - "execution_count": 12, + "execution_count": 5, "metadata": {}, "outputs": [], "source": [ @@ -219,7 +188,7 @@ }, { "cell_type": "code", - "execution_count": 13, + "execution_count": 6, "metadata": {}, "outputs": [ { @@ -232,180 +201,180 @@ "CREATE TABLE\n", "Created table cifar_10_train_data in madlib db\n", "Spawning 5 wor
[madlib-site] 02/02: Update demo notebook
This is an automated email from the ASF dual-hosted git repository. fmcquillan pushed a commit to branch asf-site in repository https://gitbox.apache.org/repos/asf/madlib-site.git commit 3b665af751366e349420d72fcf1505744fb129cc Author: Domino Valdano AuthorDate: Mon Aug 26 13:04:12 2019 -0700 Update demo notebook --- .../Deep-learning/Load-images-v1.ipynb | 20 +++- 1 file changed, 7 insertions(+), 13 deletions(-) diff --git a/community-artifacts/Deep-learning/Load-images-v1.ipynb b/community-artifacts/Deep-learning/Load-images-v1.ipynb index 15aa948..3209aaf 100644 --- a/community-artifacts/Deep-learning/Load-images-v1.ipynb +++ b/community-artifacts/Deep-learning/Load-images-v1.ipynb @@ -193,7 +193,7 @@ "\n", "# 2. Fetch images then load NumPy array into table\n", "\n", -"iloader.load_dataset_from_np(data_x, data_y, table_name, append=False, no_temp_files=False)\n", +"iloader.load_dataset_from_np(data_x, data_y, table_name, append=False)\n", "\n", "- data_x contains image data in np.array format\n", "\n", @@ -204,13 +204,7 @@ "- If the user passes a table_name while creating ImageLoader object, it will be used for all further calls to load_dataset_from_np. It can be changed by passing it as a parameter during the actual call to load_dataset_from_np, and if so future calls will load to that table name instead. This avoids needing to pass the table_name again every time, but also allows it to be changed at any time.\n", "\n", " \n", -"- append=False attempts to create a new table, while append=True appends more images to an existing table.\n", -"\n", -"\n", -"- EXPERIMENTAL: If no_temp_files=True, the operation will happen without\n", -" writing out the tables to temporary files before loading them.\n", -" Instead, an in-memory filelike buffer (StringIO) will be used\n", -" to build the tables before loading." +"- append=False attempts to create a new table, while append=True appends more images to an existing table." ] }, { @@ -420,8 +414,8 @@ "%sql DROP TABLE IF EXISTS cifar_10_train_data, cifar_10_test_data;\n", "\n", "# Save images to temporary directories and load into database\n", -"iloader.load_dataset_from_np(x_train, y_train, 'cifar_10_train_data', append=False, no_temp_files=False)\n", -"iloader.load_dataset_from_np(x_test, y_test, 'cifar_10_test_data', append=False, no_temp_files=False)" +"iloader.load_dataset_from_np(x_train, y_train, 'cifar_10_train_data', append=False)\n", +"iloader.load_dataset_from_np(x_test, y_test, 'cifar_10_test_data', append=False)" ] }, { @@ -434,12 +428,12 @@ "Uses the Python Imaging Library so supports multiple formats\n", "http://www.pythonware.com/products/pil/\n";, "\n", -"load_dataset_from_disk(root_dir, table_name, num_labels='all', append=False, no_temp_files=False)\n", +"load_dataset_from_disk(root_dir, table_name, num_labels='all', append=False)\n", "\n", "- Calling this function will look in root_dir on the local disk of wherever this is being run. It will skip over any files in that directory, but will load images contained in each of its subdirectories. The images should be organized by category/class, where the name of each subdirectory is the label for the images contained within it.\n", "\n", "\n", -"- The table_name, append, and no_temp_files parameters are the same as above The parameter num_labels is an optional parameter which can be used to restrict the number of labels (image classes) loaded, even if more are found in root_dir. For example, for a large dataset you may have hundreds of labels, but only wish to use a subset of that containing a few dozen.\n", +"- The table_name and append parameters are the same as above The parameter num_labels is an optional parameter which can be used to restrict the number of labels (image classes) loaded, even if more are found in root_dir. For example, for a large dataset you may have hundreds of labels, but only wish to use a subset of that containing a few dozen.\n", "\n", "For example, if we put the CIFAR-10 training data is in 10 subdirectories under directory cifar10, with one subdirectory for each class:" ] @@ -596,7 +590,7 @@ "source": [ "%sql drop table if exists cifar_10_train_data_filesystem;\n", "# Load images from file system\n", -"iloader.load_dataset_from_disk('/Users/fmcquillan/tmp/cifar10', 'cifar_10_train_data_filesystem', num_labels='all', append=False, no_temp_files=False)" +"iloader.load_dataset_from_disk('/Users/fmcquillan/tmp/cifar10', 'cifar_10_train_data_filesystem', num_labels='all', append=False)" ] }, {
[madlib-site] 01/02: Disable --no-temp-files|-m option, since it doesn't work
This is an automated email from the ASF dual-hosted git repository. fmcquillan pushed a commit to branch asf-site in repository https://gitbox.apache.org/repos/asf/madlib-site.git commit 6a530b1b23b609aefd2dc5cb3ca9098ea7849c81 Author: Domino Valdano AuthorDate: Mon Aug 26 11:58:42 2019 -0700 Disable --no-temp-files|-m option, since it doesn't work --- .../Deep-learning/madlib_image_loader.py | 42 +- 1 file changed, 16 insertions(+), 26 deletions(-) diff --git a/community-artifacts/Deep-learning/madlib_image_loader.py b/community-artifacts/Deep-learning/madlib_image_loader.py index 09a170d..1dc45b3 100755 --- a/community-artifacts/Deep-learning/madlib_image_loader.py +++ b/community-artifacts/Deep-learning/madlib_image_loader.py @@ -54,7 +54,7 @@ # 2a. Perform parallel image loading from numpy arrays: # # iloader.load_dataset_from_np(data_x, data_y, table_name, -#append=False, no_temp_files=False) +#append=False) # # data_x contains image data in np.array format, and data_y is a 1D np.array # of the image categories (labels). @@ -73,18 +73,12 @@ # name instead. This avoids needing to pass the table_name again every # time, but also allows it to be changed at any time. # -# EXPERIMENTAL: If no_temp_files=True, the operation will happen without -# writing out the tables to temporary files before loading them. -# Instead, an in-memory filelike buffer (StringIO) will be used -# to build the tables before loading. Currently not working, -# for unknown reason. -# # or, # # 2b. Perform parallel image loading from disk: # # load_dataset_from_disk(self, root_dir, table_name, num_labels='all', -# append=False, no_temp_files=False): +# append=False): # # Calling this function instead will look in root_dir on the local disk of # wherever this is being run. It will skip over any files in that @@ -93,7 +87,7 @@ # where the name of each subdirectory is the label for the images # contained within it. # -# The table_name, append, and no_temp_files parameters are the same as +# The table_name and append parameters are the same as described # above. num_labels is an optional parameter which can be used to # restrict the number of labels (image classes) loaded, even if more # are found in root_dir. For example, for a large dataset you may @@ -107,7 +101,7 @@ # # usage: madlib_image_loader.py [-h] [-r ROOT_DIR] [-n NUM_LABELS] [-d DB_NAME] # [-a] [-w NUM_WORKERS] [-p PORT] [-U USERNAME] -# [-t HOST] [-P PASSWORD] [-m] +# [-t HOST] [-P PASSWORD] # table_name # # positional arguments: @@ -247,7 +241,7 @@ class ImageLoader: self.table_name = table_name self.root_dir = None self.pool = None -self.no_temp_files = None +self.no_temp_files = False global iloader # Singleton per process iloader = self @@ -435,7 +429,7 @@ class ImageLoader: self.db_close() def load_dataset_from_np(self, data_x, data_y, table_name=None, - append=False, no_temp_files=False): + append=False): """ Loads a numpy array into db. For append=False, creates a new table and loads the data. For append=True, appends data to existing table. @@ -450,14 +444,12 @@ class ImageLoader: @table_name Name of table in db to load data into @append Whether to create a new table (False) or append to an existing one (True). If unspecified, default is False -@no_temp_files If specified, no temporary files are written--all -operations are performed in-memory. - """ start_time = time.time() self.mother = True self.from_disk = False self.append = append + if table_name: self.table_name = table_name @@ -477,7 +469,7 @@ class ImageLoader: initargs=(current_process().pid, self.table_name, self.append, - no_temp_files, + False, self.db_creds, False)) @@ -539,7 +531,7 @@ class ImageLoader: _call_np_worker(data) def load_dataset_from_disk(self, root_dir, table_name, num_labels='all', - append=False, no_temp_files=False): + app
[madlib-site] branch asf-site updated (a283758 -> 3b665af)
This is an automated email from the ASF dual-hosted git repository. fmcquillan pushed a change to branch asf-site in repository https://gitbox.apache.org/repos/asf/madlib-site.git. from a283758 update Load-model-architecture-v1.ipynb with faster way to load model weights - minor fix new 6a530b1 Disable --no-temp-files|-m option, since it doesn't work new 3b665af Update demo notebook The 2 revisions listed above as "new" are entirely new to this repository and will be described in separate emails. The revisions listed as "add" were already present in the repository and have only been added to this reference. Summary of changes: .../Deep-learning/Load-images-v1.ipynb | 20 --- .../Deep-learning/madlib_image_loader.py | 42 +- 2 files changed, 23 insertions(+), 39 deletions(-)
[madlib-site] branch asf-site updated: update Load-model-architecture-v1.ipynb with faster way to load model weights - minor fix
This is an automated email from the ASF dual-hosted git repository. fmcquillan pushed a commit to branch asf-site in repository https://gitbox.apache.org/repos/asf/madlib-site.git The following commit(s) were added to refs/heads/asf-site by this push: new a283758 update Load-model-architecture-v1.ipynb with faster way to load model weights - minor fix a283758 is described below commit a283758e9439f36b4b1334fb27d55049d71d7457 Author: Frank McQuillan AuthorDate: Tue Aug 20 16:20:45 2019 -0700 update Load-model-architecture-v1.ipynb with faster way to load model weights - minor fix --- .../Deep-learning/Load-model-architecture-v1.ipynb | 108 ++--- 1 file changed, 74 insertions(+), 34 deletions(-) diff --git a/community-artifacts/Deep-learning/Load-model-architecture-v1.ipynb b/community-artifacts/Deep-learning/Load-model-architecture-v1.ipynb index 4a8a5d5..0cb9fa6 100644 --- a/community-artifacts/Deep-learning/Load-model-architecture-v1.ipynb +++ b/community-artifacts/Deep-learning/Load-model-architecture-v1.ipynb @@ -25,17 +25,15 @@ }, { "cell_type": "code", - "execution_count": 1, + "execution_count": 2, "metadata": {}, "outputs": [ { - "name": "stderr", + "name": "stdout", "output_type": "stream", "text": [ - "/Users/fmcquillan/anaconda/lib/python2.7/site-packages/IPython/config.py:13: ShimWarning: The `IPython.config` package has been deprecated since IPython 4.0. You should import from traitlets.config instead.\n", - " \"You should import from traitlets.config instead.\", ShimWarning)\n", - "/Users/fmcquillan/anaconda/lib/python2.7/site-packages/IPython/utils/traitlets.py:5: UserWarning: IPython.utils.traitlets has moved to a top-level traitlets package.\n", - " warn(\"IPython.utils.traitlets has moved to a top-level traitlets package.\")\n" + "The sql extension is already loaded. To reload it, use:\n", + " %reload_ext sql\n" ] } ], @@ -45,7 +43,7 @@ }, { "cell_type": "code", - "execution_count": 2, + "execution_count": 3, "metadata": {}, "outputs": [ { @@ -54,7 +52,7 @@ "u'Connected: gpadmin@madlib'" ] }, - "execution_count": 2, + "execution_count": 3, "metadata": {}, "output_type": "execute_result" } @@ -72,7 +70,7 @@ }, { "cell_type": "code", - "execution_count": 3, + "execution_count": 4, "metadata": {}, "outputs": [ { @@ -98,7 +96,7 @@ "[(u'MADlib version: 1.17-dev, git revision: rel/v1.16-10-g205bdba, cmake configuration time: Thu Aug 15 17:53:15 UTC 2019, build type: release, build system: Linux-3.10.0-957.21.3.el7.x86_64, C compiler: gcc 4.8.5, C++ compiler: g++ 4.8.5',)]" ] }, - "execution_count": 3, + "execution_count": 4, "metadata": {}, "output_type": "execute_result" } @@ -120,7 +118,7 @@ }, { "cell_type": "code", - "execution_count": 4, + "execution_count": 5, "metadata": {}, "outputs": [ { @@ -153,7 +151,7 @@ }, { "cell_type": "code", - "execution_count": 5, + "execution_count": 6, "metadata": {}, "outputs": [ { @@ -187,7 +185,7 @@ }, { "cell_type": "code", - "execution_count": 6, + "execution_count": 7, "metadata": {}, "outputs": [ { @@ -196,7 +194,7 @@ "'{\"class_name\": \"Sequential\", \"keras_version\": \"2.1.6\", \"config\": [{\"class_name\": \"Dense\", \"config\": {\"kernel_initializer\": {\"class_name\": \"VarianceScaling\", \"config\": {\"distribution\": \"uniform\", \"scale\": 1.0, \"seed\": null, \"mode\": \"fan_avg\"}}, \"name\": \"dense_1\", \"kernel_constraint\": null, \"bias_regularizer\": null, \"bias_constraint\": null, \"dtype\": \"float32\", \"activation\": \"relu\", \"trainable\": true, \"kernel_regularizer\": n [...] ] }, - "execution_count": 6, + "execution_count": 7, "metadata": {},
[madlib-site] branch asf-site updated: update Load-model-architecture-v1.ipynb with faster way to load model weights
This is an automated email from the ASF dual-hosted git repository. fmcquillan pushed a commit to branch asf-site in repository https://gitbox.apache.org/repos/asf/madlib-site.git The following commit(s) were added to refs/heads/asf-site by this push: new 4ef7217 update Load-model-architecture-v1.ipynb with faster way to load model weights 4ef7217 is described below commit 4ef7217f0adf914a7f0e63115a199e60a334677f Author: Frank McQuillan AuthorDate: Tue Aug 20 16:14:41 2019 -0700 update Load-model-architecture-v1.ipynb with faster way to load model weights --- .../Deep-learning/Load-model-architecture-v1.ipynb | 164 ++--- 1 file changed, 78 insertions(+), 86 deletions(-) diff --git a/community-artifacts/Deep-learning/Load-model-architecture-v1.ipynb b/community-artifacts/Deep-learning/Load-model-architecture-v1.ipynb index e7cd2f1..4a8a5d5 100644 --- a/community-artifacts/Deep-learning/Load-model-architecture-v1.ipynb +++ b/community-artifacts/Deep-learning/Load-model-architecture-v1.ipynb @@ -90,12 +90,12 @@ "version\n", "\n", "\n", - "MADlib version: 1.16-dev, git revision: rel/v1.15.1-119-gea1e0ac, cmake configuration time: Sat Jun 8 00:55:28 UTC 2019, build type: release, build system: Linux-3.10.0-957.12.1.el7.x86_64, C compiler: gcc 4.8.5, C++ compiler: g++ 4.8.5\n", + "MADlib version: 1.17-dev, git revision: rel/v1.16-10-g205bdba, cmake configuration time: Thu Aug 15 17:53:15 UTC 2019, build type: release, build system: Linux-3.10.0-957.21.3.el7.x86_64, C compiler: gcc 4.8.5, C++ compiler: g++ 4.8.5\n", "\n", "" ], "text/plain": [ - "[(u'MADlib version: 1.16-dev, git revision: rel/v1.15.1-119-gea1e0ac, cmake configuration time: Sat Jun 8 00:55:28 UTC 2019, build type: release, build system: Linux-3.10.0-957.12.1.el7.x86_64, C compiler: gcc 4.8.5, C++ compiler: g++ 4.8.5',)]" + "[(u'MADlib version: 1.17-dev, git revision: rel/v1.16-10-g205bdba, cmake configuration time: Thu Aug 15 17:53:15 UTC 2019, build type: release, build system: Linux-3.10.0-957.21.3.el7.x86_64, C compiler: gcc 4.8.5, C++ compiler: g++ 4.8.5',)]" ] }, "execution_count": 3, @@ -120,9 +120,24 @@ }, { "cell_type": "code", - "execution_count": 28, + "execution_count": 4, "metadata": {}, - "outputs": [], + "outputs": [ +{ + "name": "stderr", + "output_type": "stream", + "text": [ + "Using TensorFlow backend.\n" + ] +}, +{ + "name": "stdout", + "output_type": "stream", + "text": [ + "Couldn't import dot_parser, loading of dot files will not be possible.\n" + ] +} + ], "source": [ "import keras\n", "from keras.models import Sequential\n", @@ -138,7 +153,7 @@ }, { "cell_type": "code", - "execution_count": 30, + "execution_count": 5, "metadata": {}, "outputs": [ { @@ -148,11 +163,11 @@ "_\n", "Layer (type) Output Shape Param # \n", "=\n", - "dense_13 (Dense) (None, 10)50\n", + "dense_1 (Dense) (None, 10)50\n", "_\n", - "dense_14 (Dense) (None, 10)110 \n", + "dense_2 (Dense) (None, 10)110 \n", "_\n", - "dense_15 (Dense) (None, 3) 33\n", + "dense_3 (Dense) (None, 3) 33\n", "=\n", "Total params: 193\n", "Trainable params: 193\n", @@ -172,16 +187,16 @@ }, { "cell_type": "code", - "execution_count": 31, + "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "'{\"class_name\": \"Sequential\", \"keras_version\": \"2.1.6\", \"config\": [{\"cla
[madlib] 01/01: updated user docs for madlib-keras with BYOM inference
This is an automated email from the ASF dual-hosted git repository. fmcquillan pushed a commit to branch keras_byom in repository https://gitbox.apache.org/repos/asf/madlib.git commit 49cf31f1416326044974fa6053d03594ead4ca22 Author: Frank McQuillan AuthorDate: Thu Aug 15 10:31:29 2019 -0700 updated user docs for madlib-keras with BYOM inference --- .../modules/deep_learning/madlib_keras.sql_in | 182 - 1 file changed, 68 insertions(+), 114 deletions(-) diff --git a/src/ports/postgres/modules/deep_learning/madlib_keras.sql_in b/src/ports/postgres/modules/deep_learning/madlib_keras.sql_in index c69f158..84c1054 100644 --- a/src/ports/postgres/modules/deep_learning/madlib_keras.sql_in +++ b/src/ports/postgres/modules/deep_learning/madlib_keras.sql_in @@ -74,6 +74,9 @@ can perform better than stochastic gradient descent because it uses more than one training example at a time, typically resulting faster and smoother convergence [3]. +You can also do inference on models that have not been trained with MADlib, +but rather imported from an external source. + @brief Solves image classification problems by calling the Keras API @@ -620,8 +623,10 @@ madlib_keras_predict( @anchor keras_predict_byom -@par Predict BYOM (Bring your own model) -The predict byom function has the following format: +@par Predict BYOM (bring your own model) +The predict BYOM function allows you to do inference on models that +have not been trained on MADlib, but rather imported from elsewhere. +It has the following format: madlib_keras_predict_byom( model_arch_table, @@ -645,11 +650,11 @@ madlib_keras_predict_byom( TEXT. Name of the architecture table containing the model to use for prediction. The model weights and architecture can be loaded to this table by using the - load_keras_model function + load_keras_model function. model_arch_id - INTEGER. This is the id in 'model_arch_table'containing the model + INTEGER. This is the id in 'model_arch_table' containing the model architecture and model weights to use for prediction. @@ -657,9 +662,7 @@ madlib_keras_predict_byom( TEXT. Name of the table containing the dataset to predict on. Note that test data is not preprocessed (unlike fit and evaluate) so put one test image per row for prediction. - Also see the comment below for the 'independent_varname' parameter - regarding normalization. - + Set the 'normalizing_const' below for the independent variable if necessary. id_col @@ -668,9 +671,7 @@ madlib_keras_predict_byom( independent_varname TEXT. Column with independent variables in the test table. - If a 'normalizing_const' is specified when preprocessing the - training dataset, this same normalization will be applied to - the independent variables used in predict. + Set the 'normalizing_const' below if necessary. output_table @@ -679,14 +680,14 @@ madlib_keras_predict_byom( id -Gives the 'id' for each prediction, corresponding to each row from the test_table. +Gives the 'id' for each prediction, corresponding to each row from the 'test_table'. estimated_dependent_var -(For pred_type='response') The estimated class for classification. If -class_values is passed in as NULL, then we assume that the class -labels are [0,1,2...,n] where n in the num of classes in the model +(For pred_type='response') Estimated class for classification. If +the 'class_values' parameter is passed in as NULL, then we assume that the class +labels are [0,1,2...,n-1] where n-1 is the number of classes in the model architecture. @@ -694,11 +695,11 @@ madlib_keras_predict_byom( prob_CLASS (For pred_type='prob' for classification) - The probability of a given class. - If class_values is passed in as NULL, we create just one column called - 'prob' which is an array of probabilities of all the classes. - Otherwise if class_values is not NULL, then there will be one - column for each class in the training data. + Probability of a given class. + If 'class_values' is passed in as NULL, we create one column called + 'prob' which is an array of probabilities for each class. + If 'class_values' is not NULL, then there will be one + column for each class. @@ -725,8 +726,8 @@ madlib_keras_predict_byom( class_values (optional) TEXT[], default: NULL. -List of class labels that were used while training the model. See the -output_table column for more details. +List of class labels that were used while t
[madlib] branch keras_byom created (now 49cf31f)
This is an automated email from the ASF dual-hosted git repository. fmcquillan pushed a change to branch keras_byom in repository https://gitbox.apache.org/repos/asf/madlib.git. at 49cf31f updated user docs for madlib-keras with BYOM inference This branch includes the following new commits: new 49cf31f updated user docs for madlib-keras with BYOM inference The 1 revisions listed above as "new" are entirely new to this repository and will be described in separate emails. The revisions listed as "add" were already present in the repository and have only been added to this reference.
[madlib-site] branch asf-site updated: update jupyter notebooks for new image loader script
This is an automated email from the ASF dual-hosted git repository. fmcquillan pushed a commit to branch asf-site in repository https://gitbox.apache.org/repos/asf/madlib-site.git The following commit(s) were added to refs/heads/asf-site by this push: new b53d4bd update jupyter notebooks for new image loader script b53d4bd is described below commit b53d4bd6b95373c5b05db554d52c3ef5fd094f0c Author: Frank McQuillan AuthorDate: Tue Jul 23 10:41:42 2019 -0700 update jupyter notebooks for new image loader script --- .../Deep-learning/Load-images-v1.ipynb | 662 + ...-v1.ipynb => MADlib-Keras-cifar10-cnn-v2.ipynb} | 348 ++- ...ynb => MADlib-Keras-transfer-learning-v2.ipynb} | 516 .../Deep-learning/Madlib Image Loader Demo.ipynb | 488 --- 4 files changed, 1088 insertions(+), 926 deletions(-) diff --git a/community-artifacts/Deep-learning/Load-images-v1.ipynb b/community-artifacts/Deep-learning/Load-images-v1.ipynb new file mode 100644 index 000..15aa948 --- /dev/null +++ b/community-artifacts/Deep-learning/Load-images-v1.ipynb @@ -0,0 +1,662 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ +"# Load images into table\n", +"\n", +"This demonstrates different ways to load images into a database table.\n", +"\n", +"We use the script called madlib_image_loader.py located at https://github.com/apache/madlib-site/tree/asf-site/community-artifacts/Deep-learning which uses the Python Imaging Library so supports multiple formats http://www.pythonware.com/products/pil/\n";, +"\n", +"## Table of contents\n", +"\n", +"1. Setup image loader\n", +"\n", +"2. Fetch images then load NumPy array into table\n", +"\n", +"3. Load from file system into table" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [ +{ + "name": "stderr", + "output_type": "stream", + "text": [ + "/Users/fmcquillan/anaconda/lib/python2.7/site-packages/IPython/config.py:13: ShimWarning: The `IPython.config` package has been deprecated since IPython 4.0. You should import from traitlets.config instead.\n", + " \"You should import from traitlets.config instead.\", ShimWarning)\n", + "/Users/fmcquillan/anaconda/lib/python2.7/site-packages/IPython/utils/traitlets.py:5: UserWarning: IPython.utils.traitlets has moved to a top-level traitlets package.\n", + " warn(\"IPython.utils.traitlets has moved to a top-level traitlets package.\")\n" + ] +} + ], + "source": [ +"%load_ext sql" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ +{ + "data": { + "text/plain": [ + "u'Connected: fmcquillan@madlib'" + ] + }, + "execution_count": 2, + "metadata": {}, + "output_type": "execute_result" +} + ], + "source": [ +"# Greenplum Database 5.x on GCP for deep learning (PM demo machine)\n", +"#%sql postgresql://gpadmin@35.239.240.26:5432/madlib\n", +"\n", +"# PostgreSQL local\n", +"%sql postgresql://fmcquillan@localhost:5432/madlib" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ +{ + "name": "stdout", + "output_type": "stream", + "text": [ + "1 rows affected.\n" + ] +}, +{ + "data": { + "text/html": [ + "\n", + "\n", + "version\n", + "\n", + "\n", + "MADlib version: 1.16, git revision: rc/1.16-rc1, cmake configuration time: Mon Jul 1 17:45:09 UTC 2019, build type: Release, build system: Darwin-16.7.0, C compiler: Clang, C++ compiler: Clang\n", + "\n", + "" + ], + "text/plain": [ + "[(u'MADlib version: 1.16, git revision: rc/1.16-rc1, cmake configuration time: Mon Jul 1 17:45:09 UTC 2019, build type: Release, build system: Darwin-16.7.0, C compiler: Clang, C++ compiler: Clang',)]" + ] + }, + "execution_count": 3, + "metadata": {}, + &q
[madlib-site] branch asf-site updated: add links to deep learning notes and Jupyter notebooks
This is an automated email from the ASF dual-hosted git repository. fmcquillan pushed a commit to branch asf-site in repository https://gitbox.apache.org/repos/asf/madlib-site.git The following commit(s) were added to refs/heads/asf-site by this push: new c30b4ca add links to deep learning notes and Jupyter notebooks c30b4ca is described below commit c30b4cab2616d8633ee3fdd0d32662033ef2973a Author: Frank McQuillan AuthorDate: Thu Jul 11 13:34:12 2019 -0700 add links to deep learning notes and Jupyter notebooks --- index.html | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/index.html b/index.html index 9060809..3a4c7a0 100644 --- a/index.html +++ b/index.html @@ -84,7 +84,10 @@ K-nearest neighbors - Improve performance with kd-tree approximate method. Association rules - Set default maximum itemset rules to 10 to reduce runtime. - You are invited to https://dist.apache.org/repos/dist/release/madlib/1.16/";>download the 1.16 release and https://github.com/apache/madlib/blob/master/RELEASE_NOTES";>review the release notes. + You are invited to https://dist.apache.org/repos/dist/release/madlib/1.16/";>download the 1.16 release and https://github.com/apache/madlib/blob/master/RELEASE_NOTES";>review the release notes. + For more details about the new deep learning feature, please refer to the + https://cwiki.apache.org/confluence/display/MADLIB/Deep+Learning";>Apache MADlib deep learning notes and + the https://github.com/apache/madlib-site/tree/asf-site/community-artifacts/Deep-learning";>Jupyter notebook examples.
[madlib-site] branch asf-site updated: update website for 1.16 release
This is an automated email from the ASF dual-hosted git repository. fmcquillan pushed a commit to branch asf-site in repository https://gitbox.apache.org/repos/asf/madlib-site.git The following commit(s) were added to refs/heads/asf-site by this push: new f42ca4a update website for 1.16 release f42ca4a is described below commit f42ca4a590c78dade76362424c8f2b97cae82961 Author: Frank McQuillan AuthorDate: Mon Jul 8 12:14:27 2019 -0700 update website for 1.16 release --- _media/featured/neural-net.png | Bin 0 -> 9616 bytes _media/featured/topic-modelling.png | Bin 5673 -> 9614 bytes _media/featured/validation.png | Bin 2522 -> 3687 bytes _media/logos/ucsd.png | Bin 0 -> 7585 bytes community.html | 15 ++-- documentation.html | 1 + download.html | 31 + index.html | 66 +++- product.html| 22 ++-- 9 files changed, 85 insertions(+), 50 deletions(-) diff --git a/_media/featured/neural-net.png b/_media/featured/neural-net.png new file mode 100644 index 000..84c3b53 Binary files /dev/null and b/_media/featured/neural-net.png differ diff --git a/_media/featured/topic-modelling.png b/_media/featured/topic-modelling.png index 813d1d6..8460bb0 100644 Binary files a/_media/featured/topic-modelling.png and b/_media/featured/topic-modelling.png differ diff --git a/_media/featured/validation.png b/_media/featured/validation.png index 0be0d92..ed2f5c4 100644 Binary files a/_media/featured/validation.png and b/_media/featured/validation.png differ diff --git a/_media/logos/ucsd.png b/_media/logos/ucsd.png new file mode 100644 index 000..890d59a Binary files /dev/null and b/_media/logos/ucsd.png differ diff --git a/community.html b/community.html index 893f0a1..6f892a8 100644 --- a/community.html +++ b/community.html @@ -38,7 +38,7 @@ Community - + @@ -49,7 +49,7 @@ Apache MADlib is an open source project that endeavors to adhere in all respects to the principles of http://apache.org/foundation/governance/";>The Apache Way. MADlib grew out of discussions between database engine developers, data scientists, IT architects and academics interested in new approaches to scalable, sophisticated in-database analytics. These discussions were written up in a paper in VLDB 2009 that coined the term “MAD Skills” for data analysis. The MADlib software project began the following year as a collaboration between researchers at UC Berkeley and engineers and data scientists at EMC/Greenplum (late [...] In September 2015 MADlib was accepted into the Apache Software Foundation Incubator and graduated to a Top Level Project in July 2017. -Some of the original participants in the development of this project were: +Some of the past and present participants in this project are: @@ -91,6 +91,15 @@ Learn More + +http://cseweb.ucsd.edu/~arunkk/"; class="center"> + + + +Providing key research in artificial neural networks on distributed systems +Learn More + + If you are interested in joining our project please consider joining our User or Developer mailing lists. Everyone is welcome. @@ -199,7 +208,7 @@ http://postgresql.org";>PostgreSQL http://greenplum.org/";>Greenplum Database -http://hawq.incubator.apache.org";>Apache HAWQ (incubating) +http://hawq.incubator.apache.org";>Apache HAWQ http://cran.r-project.org/web/packages/PivotalR/";>PivotalR diff --git a/documentation.html b/documentation.html index 8727541..b5d93bf 100644 --- a/documentation.html +++ b/documentation.html @@ -55,6 +55,7 @@ jQuery(document).ready(function() { The primary documentation reference material providing detailed information on the functions and algorithms within MADlib as well as background theory and references into the literature. Older Documentation +MADlib v1.15.1 MADlib v1.15 MADlib v1.14 MADlib v1.13 diff --git a/download.html b/download.html index 1781c9d..b68b47f 100644 --- a/download.html +++ b/download.html @@ -58,7 +58,7 @@ Current Release - v1.15.1 + v1.16
[madlib] branch master updated: add comment to graph user docs to distribute edge table by source vertex id
This is an automated email from the ASF dual-hosted git repository. fmcquillan pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/madlib.git The following commit(s) were added to refs/heads/master by this push: new 874d189 add comment to graph user docs to distribute edge table by source vertex id 874d189 is described below commit 874d1892c5e35436c6e5bfc46ad9983a6587b159 Author: Frank McQuillan AuthorDate: Fri May 17 14:10:30 2019 -0700 add comment to graph user docs to distribute edge table by source vertex id --- src/ports/postgres/modules/graph/apsp.sql_in | 2 ++ src/ports/postgres/modules/graph/bfs.sql_in | 3 +++ src/ports/postgres/modules/graph/hits.sql_in | 3 +++ src/ports/postgres/modules/graph/pagerank.sql_in | 3 +++ src/ports/postgres/modules/graph/sssp.sql_in | 3 +++ src/ports/postgres/modules/graph/wcc.sql_in | 5 +++-- 6 files changed, 17 insertions(+), 2 deletions(-) diff --git a/src/ports/postgres/modules/graph/apsp.sql_in b/src/ports/postgres/modules/graph/apsp.sql_in index c7bf210..7cd77d3 100644 --- a/src/ports/postgres/modules/graph/apsp.sql_in +++ b/src/ports/postgres/modules/graph/apsp.sql_in @@ -55,6 +55,8 @@ for this implementation is O(V^2 * E) where V is the number of vertices and E is the number of edges. In practice, run-time will be generally be much less than this, but it depends on the graph. +On a Greenplum cluster, the edge table should be distributed +by the source vertex id column for better performance. @anchor apsp @par APSP diff --git a/src/ports/postgres/modules/graph/bfs.sql_in b/src/ports/postgres/modules/graph/bfs.sql_in index c1c27fe..ea991fa 100644 --- a/src/ports/postgres/modules/graph/bfs.sql_in +++ b/src/ports/postgres/modules/graph/bfs.sql_in @@ -130,6 +130,9 @@ and a single BFS result is generated. +@note On a Greenplum cluster, the edge table should be distributed +by the source vertex id column for better performance. + @anchor notes @par Notes diff --git a/src/ports/postgres/modules/graph/hits.sql_in b/src/ports/postgres/modules/graph/hits.sql_in index 96a507c..83f838d 100644 --- a/src/ports/postgres/modules/graph/hits.sql_in +++ b/src/ports/postgres/modules/graph/hits.sql_in @@ -127,6 +127,9 @@ parameter. +@note On a Greenplum cluster, the edge table should be distributed +by the source vertex id column for better performance. + @anchor notes @par Notes diff --git a/src/ports/postgres/modules/graph/pagerank.sql_in b/src/ports/postgres/modules/graph/pagerank.sql_in index b81b58e..cd239bd 100644 --- a/src/ports/postgres/modules/graph/pagerank.sql_in +++ b/src/ports/postgres/modules/graph/pagerank.sql_in @@ -132,6 +132,9 @@ for personalized PageRank. When this parameter is provided, personalized PageRan will run. In the absence of this parameter, regular PageRank will run. +@note On a Greenplum cluster, the edge table should be distributed +by the source vertex id column for better performance. + @anchor examples @examp diff --git a/src/ports/postgres/modules/graph/sssp.sql_in b/src/ports/postgres/modules/graph/sssp.sql_in index 372f1fb..8175624 100644 --- a/src/ports/postgres/modules/graph/sssp.sql_in +++ b/src/ports/postgres/modules/graph/sssp.sql_in @@ -104,6 +104,9 @@ A summary table named _summary is also created. This is an internal t TEXT, default = NULL. List of columns used to group the input into discrete subgraphs. These columns must exist in the edge table. When this value is null, no grouping is used and a single SSSP result is generated. +@note On a Greenplum cluster, the edge table should be distributed +by the source vertex id column for better performance. + @par Path Retrieval The path retrieval function returns the shortest path from the diff --git a/src/ports/postgres/modules/graph/wcc.sql_in b/src/ports/postgres/modules/graph/wcc.sql_in index 1c3808b..bc6ce7a 100644 --- a/src/ports/postgres/modules/graph/wcc.sql_in +++ b/src/ports/postgres/modules/graph/wcc.sql_in @@ -115,8 +115,9 @@ weakly connected components are generated for all data -@note On Greenplum cluster, the edge table should be distributed on the src -column for better performance. In addition, the user should note that this +@note On a Greenplum cluster, the edge table should be distributed +by the source vertex id column for better performance. +In addition, the user should note that this function creates a duplicate of the edge table (on Greenplum cluster) for better performance.
[madlib] branch master updated: user doc updates to multiple modules
This is an automated email from the ASF dual-hosted git repository. fmcquillan pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/madlib.git The following commit(s) were added to refs/heads/master by this push: new e015e0f user doc updates to multiple modules e015e0f is described below commit e015e0f5257cdf3cd37e601df51af155c23ca5a5 Author: Frank McQuillan AuthorDate: Fri May 17 13:38:44 2019 -0700 user doc updates to multiple modules --- doc/mainpage.dox.in| 2 +- src/ports/postgres/modules/bayes/bayes.sql_in | 5 ++--- .../conjugate_gradient/conjugate_gradient.sql_in | 5 ++--- src/ports/postgres/modules/convex/mlp.sql_in | 20 + .../deep_learning/input_data_preprocessor.sql_in | 3 +++ .../deep_learning/keras_model_arch_table.sql_in| 25 -- .../recursive_partitioning/decision_tree.sql_in| 3 ++- src/ports/postgres/modules/sample/sample.sql_in| 5 ++--- src/ports/postgres/modules/svm/svm.sql_in | 14 9 files changed, 60 insertions(+), 22 deletions(-) diff --git a/doc/mainpage.dox.in b/doc/mainpage.dox.in index d874e5f..b63ee5d 100644 --- a/doc/mainpage.dox.in +++ b/doc/mainpage.dox.in @@ -290,7 +290,7 @@ Interface and implementation are subject to change. @brief A collection of modules for deep learning. @details A collection of modules for deep learning. @{ -@defgroup grp_keras_model_arch Load Model Architecture +@defgroup grp_keras_model_arch Load Model @defgroup grp_input_preprocessor_dl Preprocessor for Images @} @defgroup grp_bayes Naive Bayes Classification diff --git a/src/ports/postgres/modules/bayes/bayes.sql_in b/src/ports/postgres/modules/bayes/bayes.sql_in index 40b71d2..9121cc1 100644 --- a/src/ports/postgres/modules/bayes/bayes.sql_in +++ b/src/ports/postgres/modules/bayes/bayes.sql_in @@ -32,9 +32,8 @@ m4_include(`SQLCommon.m4') independently contributes to the probability that a data point belongs to a category. -\warning This MADlib method is still in early stage development. There may be some -issues that will be addressed in a future version. Interface and implementation -is subject to change. +\warning This MADlib method is still in early stage development. +Interface and implementation are subject to change. Naive Bayes refers to a stochastic model where all independent variables \f$ a_1, \dots, a_n \f$ (often referred to as attributes in this context) diff --git a/src/ports/postgres/modules/conjugate_gradient/conjugate_gradient.sql_in b/src/ports/postgres/modules/conjugate_gradient/conjugate_gradient.sql_in index 2dfafc5..0636314 100644 --- a/src/ports/postgres/modules/conjugate_gradient/conjugate_gradient.sql_in +++ b/src/ports/postgres/modules/conjugate_gradient/conjugate_gradient.sql_in @@ -22,9 +22,8 @@ @brief Finds the solution to the function \f$ \boldsymbol Ax = \boldsymbol b \f$, where \f$A\f$ is a symmetric, positive-definite matrix and \f$x\f$ and \f$ \boldsymbol b \f$ are vectors. -\warning This MADlib method is still in early stage development. There may be some -issues that will be addressed in a future version. Interface and implementation -is subject to change. +\warning This MADlib method is still in early stage development. +Interface and implementation are subject to change. This function uses the iterative conjugate gradient method [1] to find a solution to the function: \f[ \boldsymbol Ax = \boldsymbol b \f] where \f$ \boldsymbol A \f$ is a symmetric, positive definite matrix and \f$x\f$ and \f$ \boldsymbol b \f$ are vectors. diff --git a/src/ports/postgres/modules/convex/mlp.sql_in b/src/ports/postgres/modules/convex/mlp.sql_in index 0d06c54..d6ce7ce 100644 --- a/src/ports/postgres/modules/convex/mlp.sql_in +++ b/src/ports/postgres/modules/convex/mlp.sql_in @@ -182,6 +182,18 @@ mlp_classification( verbose (optional) BOOLEAN, default: FALSE. Provides verbose output of the results of training, including the value of loss at each iteration. + @note +There are some subtleties on the reported per-iteration loss +values because we are working in a distributed system. +When mini-batching is used (i.e., batch gradient descent), +loss per iteration is an average of losses across all mini-batches +and epochs on a segment. Losses across all segments then get +averaged to give overall loss for the model for the iteration. +This will tend to be a pessimistic estimate of loss. +When mini-batching is not used (i.e., stochastic gradient descent), +we use the model state from the previous iteration to compute the loss +at the start of the current iteration on the whole data set. This +is an accurate computation of loss for the iteration. grouping_col (optional) TEXT, default: NULL. @@ -1376,6 +1388,14 @@ For an overview of multilayer perceptrons, see [1].
[madlib] branch master updated: add examples for generalize cross validation
This is an automated email from the ASF dual-hosted git repository. fmcquillan pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/madlib.git The following commit(s) were added to refs/heads/master by this push: new e4b53a7 add examples for generalize cross validation e4b53a7 is described below commit e4b53a75f62e9e6688d611fc3bc029af26961b0f Author: Frank McQuillan AuthorDate: Thu May 2 15:32:28 2019 -0700 add examples for generalize cross validation --- .../modules/validation/cross_validation.sql_in | 198 ++--- 1 file changed, 173 insertions(+), 25 deletions(-) diff --git a/src/ports/postgres/modules/validation/cross_validation.sql_in b/src/ports/postgres/modules/validation/cross_validation.sql_in index a5eeeff..77b2c2b 100644 --- a/src/ports/postgres/modules/validation/cross_validation.sql_in +++ b/src/ports/postgres/modules/validation/cross_validation.sql_in @@ -28,7 +28,8 @@ m4_include(`SQLCommon.m4') -Estimates the fit of a predictive model given a data set and specifications for the training, prediction, and error estimation functions. +Estimates the fit of a predictive model given a data set and specifications for +the training, prediction, and error estimation functions. Cross validation, sometimes called rotation estimation, is a technique for assessing how the results of a statistical analysis will generalize to an @@ -56,12 +57,12 @@ output table. The prediction function should take a unique ID column name in the data table as one of the inputs, so that the prediction result can be compared with the validation values. Note: Prediction function in some MADlib modules do not save results into an output -table. These prediction functions are not suitable for cross-validation. +table. These prediction functions are not suitable for this cross-validation module. - The error metric function compares the prediction results with the known values of the dependent variables in the data set that was fed into the prediction function. It computes the error metric using the specified error -metric function, storing the results in a table. +metric function, and stores the results in a table. Other inputs include the output table name, k value for the k-fold cross validation, and how many folds to try. For example, you can choose to run a @@ -94,40 +95,54 @@ cross_validation_general( modelling_func, modelling_func VARCHAR. The name of the function that trains the model. + modelling_params VARCHAR[]. An array of parameters to supply to the modelling function. + modelling_params_type VARCHAR[]. An array of data type names for each of the parameters supplied to the modelling function. + param_explored VARCHAR. The name of the parameter that will be checked to find the optimum value. The name must appear in the \e modelling_params array. + explore_values VARCHAR. The name of the parameter whose values are to be studied. + predict_func VARCHAR. The name of the prediction function. + predict_params VARCHAR[]. An array of parameters to supply to the prediction function. + predict_params_type VARCHAR[]. An array of data type names for each of the parameters supplied to the prediction function. + metric_func VARCHAR. The name of the function for measuring errors. + metric_params VARCHAR[]. An array of parameters to supply to the error metric function. + metric_params_type VARCHAR[]. An array of data type names for each of the parameters supplied to the metric function. + data_tbl VARCHAR. The name of the data table that will be split into training and validation parts. + data_id VARCHAR. The name of the column containing a unique ID associated with each row, or NULL if the table has no such column. -Ideally, the data set has a unique ID for each row, so that it is easier to +Ideally, the data set has a unique ID for each row so that it is easier to partition the data set into the training part and the validation part. Set the \e id_is_random argument to inform the cross-validation function whether the ID value is randomly assigned to each row. If it is not randomly assigned, the cross-validation function generates a random ID for each row. + id_is_random BOOLEAN. TRUE if the provided ID is randomly assigned to each row. + validation_result VARCHAR. The name of the table to store the output of the cross-validation function. The output table has the following columns: @@ -146,6 +161,7 @@ same name specified in the \e param_explored argument of the \e cross_validation + data_cols A comma-separated list of names of data columns to use in the calculation. When its value is NULL, the function will automatically figure out all the column names of the data table. @@ -183,42 +199,174 @@ The parameter arrays for the modelling, prediction and metric functions can incl @anchor examples @examp -This example uses cross validation with an elastic net regression to
[madlib] branch master updated: add sections to RF and DT user docs on run-time and memory usage
This is an automated email from the ASF dual-hosted git repository. fmcquillan pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/madlib.git The following commit(s) were added to refs/heads/master by this push: new 20c87fa add sections to RF and DT user docs on run-time and memory usage 20c87fa is described below commit 20c87faefd3a166c5456112fba1c8b6ab107ad18 Author: Frank McQuillan AuthorDate: Fri Apr 19 17:23:51 2019 -0700 add sections to RF and DT user docs on run-time and memory usage --- .../deep_learning/keras_model_arch_table.sql_in| 2 +- .../recursive_partitioning/decision_tree.sql_in| 34 + .../recursive_partitioning/random_forest.sql_in| 43 +- .../modules/regress/clustered_variance.sql_in | 6 +-- .../postgres/modules/sample/balance_sample.sql_in | 2 +- src/ports/postgres/modules/svm/svm.sql_in | 4 +- 6 files changed, 67 insertions(+), 24 deletions(-) diff --git a/src/ports/postgres/modules/deep_learning/keras_model_arch_table.sql_in b/src/ports/postgres/modules/deep_learning/keras_model_arch_table.sql_in index bb734ab..16037c2 100644 --- a/src/ports/postgres/modules/deep_learning/keras_model_arch_table.sql_in +++ b/src/ports/postgres/modules/deep_learning/keras_model_arch_table.sql_in @@ -129,7 +129,7 @@ model.add(Dense(3, name='dense_2')) model.to_json This is represented by the following JSON: - + '{"class_name": "Sequential", "keras_version": "2.1.6", "config": [{"class_name": "Dense", "config": {"kernel_initializer": {"class_name": "VarianceScaling", "config": {"distribution": "uniform", diff --git a/src/ports/postgres/modules/recursive_partitioning/decision_tree.sql_in b/src/ports/postgres/modules/recursive_partitioning/decision_tree.sql_in index 8ad7a9d..bf1c883 100644 --- a/src/ports/postgres/modules/recursive_partitioning/decision_tree.sql_in +++ b/src/ports/postgres/modules/recursive_partitioning/decision_tree.sql_in @@ -17,6 +17,7 @@ m4_include(`SQLCommon.m4') Contents Training Function +Run-time and Memory Usage Prediction Function Tree Display Importance Display @@ -109,7 +110,7 @@ tree_train( by their value. - list_of_features_to_exclude + list_of_features_to_exclude (optional) TEXT. Comma-separated string of column names to exclude from the predictors list. If the dependent_variable is an expression (including cast of a column name), then this list should include the columns present in the @@ -118,7 +119,7 @@ tree_train( The names in this parameter should be identical to the names used in the table and quoted appropriately. - split_criterion + split_criterion (optional) TEXT, default = 'gini' for classification, 'mse' for regression. Impurity function to compute the feature to use to split a node. Supported criteria are 'gini', 'entropy', 'misclassification' for @@ -148,7 +149,8 @@ tree_train( INTEGER, default: 7. Maximum depth of any node of the final tree, with the root node counted as depth 0. A deeper tree can lead to better prediction but will also result in - longer processing time and higher memory usage. + longer processing time and higher memory usage. + Current allowed maximum is 100. min_split (optional) INTEGER, default: 20. Minimum number of observations that must exist @@ -475,11 +477,27 @@ provided cp and explore all possible sub-trees (up to a single-node tre to compute the optimal sub-tree. The optimal sub-tree and the 'cp' corresponding to this optimal sub-tree is placed in the output_table, with the columns named as tree and pruning_cp respectively. -- The main parameters that affect memory usage are: depth of -tree (‘max_depth’), number of features, number of values per -categorical feature, and number of bins for continuous features (‘num_splits’). -If you are hitting memory limits, consider reducing one or -more of these parameters. + +@anchor runtime +@par Run-time and Memory Usage + +The number of features and the number of class values per categorical feature have a direct +impact on run-time and memory. In addition, here is a summary of the main parameters +in the training function that affect run-time and memory: + +| Parameter | Run-time | Memory | Notes | +| :-- | :-- | :-- | :-- | +| 'max_depth' | High | High | Deeper trees can take longer to run and use more memory. | +| 'min_split' | No or little effect, unless very small. | No or little effect, unless very small. | If too small, can impact run-time by building trees that are very thick. | +| 'min_bucket' | No or little effect, unless very small. | No or little effect, unless very small. |
[madlib] branch master updated: update user docs for loading model arch
This is an automated email from the ASF dual-hosted git repository. fmcquillan pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/madlib.git The following commit(s) were added to refs/heads/master by this push: new fd04db3 update user docs for loading model arch fd04db3 is described below commit fd04db3f96c07a865a345b3072f9d0b4c3cc5bda Author: Frank McQuillan AuthorDate: Thu Mar 28 17:53:32 2019 -0700 update user docs for loading model arch --- doc/mainpage.dox.in| 8 +- .../deep_learning/keras_model_arch_table.sql_in| 224 +++-- .../utilities/minibatch_preprocessing_dl.sql_in| 3 + 3 files changed, 211 insertions(+), 24 deletions(-) diff --git a/doc/mainpage.dox.in b/doc/mainpage.dox.in index 826e8d7..e221319 100644 --- a/doc/mainpage.dox.in +++ b/doc/mainpage.dox.in @@ -287,11 +287,11 @@ Interface and implementation are subject to change. @{ @defgroup grp_cg Conjugate Gradient @defgroup grp_dl Deep Learning -@brief A collection of deep learning interfaces. -@details A collection of deep learning interfaces. +@brief A collection of modules for deep learning. +@details A collection of modules for deep learning. @{ -@defgroup grp_minibatch_preprocessing_dl Mini-Batch Preprocessor for Image Data -@defgroup grp_keras_model_arch Helper Function to Load Model Architectures to Table +@defgroup grp_keras_model_arch Load Model Architecture +@defgroup grp_minibatch_preprocessing_dl Mini-Batch Preprocessor for Images @} @defgroup grp_bayes Naive Bayes Classification @defgroup grp_sample Random Sampling diff --git a/src/ports/postgres/modules/deep_learning/keras_model_arch_table.sql_in b/src/ports/postgres/modules/deep_learning/keras_model_arch_table.sql_in index 7626107..bb734ab 100644 --- a/src/ports/postgres/modules/deep_learning/keras_model_arch_table.sql_in +++ b/src/ports/postgres/modules/deep_learning/keras_model_arch_table.sql_in @@ -30,22 +30,28 @@ m4_include(`SQLCommon.m4') /** @addtogroup grp_keras_model_arch +@brief Utility function to load model architectures and weights into a table for +use by deep learning algorithms. + Contents -Helper Function to Load Model Architectures to Table -Helper Function to Delete Model Architectures from Table +Load Model Architecture +Delete Model Architecture Examples -The architecture of the model to be used in madlib_keras_train() -function must be stored in a table, the details of which must be -provided as parameters to the madlib_keras_train module. load_keras_model is -a helper function to help users insert JSON blobs of Keras model -architectures into a table. If the output table already exists, the model_arch -specified will be added as a new row into the table. The output table could thus -act as a repository of Keras model architectures. +This utility function loads model architectures and +weights into a table for use by deep learning algorithms. +Model architecture is in JSON form +and model weights are in the form of double precision arrays. +If the output table already exists, a new row is inserted +into the table so it can act as a repository for multiple model +architectures. + +There is also a utility function to delete a model architecture +from the model architecture table. -delete_keras_model can be used to delete the model architecture corresponding -to the provided model_id from the model architecture repository table (keras_model_arch_table). +@anchor load_keras_model +@par Load Model Architecture load_keras_model( @@ -56,17 +62,17 @@ load_keras_model( \b Arguments keras_model_arch_table - VARCHAR. Output table to load keras model arch. + VARCHAR. Output table to load keras model architecture. model_arch - JSON. JSON of the model architecture to insert. + JSON. JSON of the model architecture to load. Output table -The output table produced by load_keras_model contains the following columns: +The output table contains the following columns: model_id @@ -80,17 +86,19 @@ load_keras_model( model_weights -DOUBLE PRECISION[]. Weights of the model for warm start. +DOUBLE PRECISION[]. Weights of the model which may be use for warm start. __internal_madlib_id__ -TEXT. Unique id for model arch. +TEXT. Unique id for model arch. This is an id used internally be MADlib. +@anchor delete_keras_model +@par Delete Model Architecture delete_keras_model( @@ -101,18 +109,194 @@ delete_keras_model( \b Arguments keras_model_arch_table - VARCHAR. Table containing Keras model architectures. + VARCHAR. Table containing model architectures. model_id - INTEGER. The id of the model arch to be deleted. + INTEGER. The id of the model architecture
[madlib] 01/01: mini-batch preprocessor for image user doc improvements
This is an automated email from the ASF dual-hosted git repository. fmcquillan pushed a commit to branch mini-batch-dl-v1 in repository https://gitbox.apache.org/repos/asf/madlib.git commit 4671a0a5db179fa3c1be896c43abd3c15817a280 Author: Frank McQuillan AuthorDate: Wed Feb 13 14:27:43 2019 -0800 mini-batch preprocessor for image user doc improvements --- .../utilities/minibatch_preprocessing_dl.sql_in| 355 + 1 file changed, 218 insertions(+), 137 deletions(-) diff --git a/src/ports/postgres/modules/utilities/minibatch_preprocessing_dl.sql_in b/src/ports/postgres/modules/utilities/minibatch_preprocessing_dl.sql_in index 0caca98..6cbe249 100644 --- a/src/ports/postgres/modules/utilities/minibatch_preprocessing_dl.sql_in +++ b/src/ports/postgres/modules/utilities/minibatch_preprocessing_dl.sql_in @@ -32,25 +32,29 @@ m4_include(`SQLCommon.m4') Contents Mini-Batch Preprocessor for Deep Learning Examples +Related Topics -For Deep Learning based techniques such as Convolutional Neural Nets, the input -data is mostly images. These images can be represented as an array of numbers -where all elements are between 0 and 255 in value. It is standard practice -to divide each of these numbers by 255.0 to normalize the image data. -minibatch_preprocessor() is for general use-cases, but for deep learning based -use-cases we provide minibatch_preprocessor_dl() that is light-weight and is -specific to image datasets. The normalizing constant is parameterized, and can -be specified based on the kind of image data used. +For deep learning techniques such as convolutional neural networks, the input +data is often images. These images can represented as an array of numbers +with elements between 0 and 255, representing grayscale or RGB channel values +for each pixel in the image. It is standard practice to divide by 255 to +normalize the image data. The normalizing constant is parameterized, and can +be set depending on the format of image data used. + +This mini-batch preprocessor is a lightweight version designed specifically +for image data. A separate more general minibatch_preprocessor() is also +available for other MADlib modules using non-image input data. + -minibatch_preprocessor_dl(source_table, -output_table, -dependent_varname, -independent_varname, -buffer_size, -normalizing_const, -dependent_offset -) +minibatch_preprocessor_dl( source_table, + output_table, + dependent_varname, + independent_varname, + buffer_size, + normalizing_const, + dependent_offset + ) \b Arguments @@ -62,9 +66,9 @@ minibatch_preprocessor_dl(source_table, output_table TEXT. Name of the output table from the preprocessor which will be used as input to algorithms that support mini-batching. - Note that the arrays packed into the output table are randomized + Note that the arrays packed into the output table are shuffled and normalized (by dividing each element in the independent variable array - by the normalizing_const), so they will not match up in an obvious way with + by the "normalizing_const"), so they will not match up in an obvious way with the rows in the source table. @@ -73,7 +77,7 @@ minibatch_preprocessor_dl(source_table, independent_varname - TEXT. Name of the independent variable column. The column must be of + TEXT. Name of the independent variable column. The column must be a numeric array type. @@ -82,19 +86,21 @@ minibatch_preprocessor_dl(source_table, number of rows from the source table that are packed into one row of the preprocessor output table. The default value is computed considering size of - the source table, number of independent variables, number of groups, - and number of segments in the database cluster. For larger data sets, - the computed buffer size will typically be a value in the millions. + the source table, number of independent variables, + and number of segments in the database cluster. normalizing_const (optional) - DOUBLE PRECISION, default: 255.0. The normalizing constant to divide + DOUBLE PRECISION, default: 255. The normalizing constant to divide each value in the independent_varname array by. dependent_offset (optional) INTEGER, default: NULL. If specified, shifts all dependent - variable values by this number (should only be used for numeric types). + variable values by this number (should only be used for numeric + types). For example, can be used to handle the case where you need to + convert between class values that start at 0 or 1, which saves + you from having to run a separate quer
[madlib] branch mini-batch-dl-v1 created (now 4671a0a)
This is an automated email from the ASF dual-hosted git repository. fmcquillan pushed a change to branch mini-batch-dl-v1 in repository https://gitbox.apache.org/repos/asf/madlib.git. at 4671a0a mini-batch preprocessor for image user doc improvements This branch includes the following new commits: new 4671a0a mini-batch preprocessor for image user doc improvements The 1 revisions listed above as "new" are entirely new to this repository and will be described in separate emails. The revisions listed as "add" were already present in the repository and have only been added to this reference.
madlib git commit: update NOTICE file to 2019
Repository: madlib Updated Branches: refs/heads/master 70afde269 -> d00f09166 update NOTICE file to 2019 Project: http://git-wip-us.apache.org/repos/asf/madlib/repo Commit: http://git-wip-us.apache.org/repos/asf/madlib/commit/d00f0916 Tree: http://git-wip-us.apache.org/repos/asf/madlib/tree/d00f0916 Diff: http://git-wip-us.apache.org/repos/asf/madlib/diff/d00f0916 Branch: refs/heads/master Commit: d00f09166fbb06c8a6ac9a3eb6d75fc20cc6fef8 Parents: 70afde2 Author: Frank McQuillan Authored: Wed Jan 9 17:34:11 2019 -0800 Committer: Frank McQuillan Committed: Wed Jan 9 17:34:11 2019 -0800 -- NOTICE | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/madlib/blob/d00f0916/NOTICE -- diff --git a/NOTICE b/NOTICE index feb18f0..7cbfa51 100644 --- a/NOTICE +++ b/NOTICE @@ -1,5 +1,5 @@ Apache MADlib -Copyright 2016-2018 The Apache Software Foundation. +Copyright 2016-2019 The Apache Software Foundation. This product includes software developed at The Apache Software Foundation (http://www.apache.org/).
madlib-site git commit: update website for 1.15.1 release
Repository: madlib-site Updated Branches: refs/heads/asf-site 127c0b7e7 -> 6d7f908b5 update website for 1.15.1 release Project: http://git-wip-us.apache.org/repos/asf/madlib-site/repo Commit: http://git-wip-us.apache.org/repos/asf/madlib-site/commit/6d7f908b Tree: http://git-wip-us.apache.org/repos/asf/madlib-site/tree/6d7f908b Diff: http://git-wip-us.apache.org/repos/asf/madlib-site/diff/6d7f908b Branch: refs/heads/asf-site Commit: 6d7f908b550848b438cf94b3176ce963814bf367 Parents: 127c0b7 Author: Frank McQuillan Authored: Mon Oct 15 10:50:17 2018 -0700 Committer: Frank McQuillan Committed: Mon Oct 15 10:50:17 2018 -0700 -- documentation.html | 1 + download.html | 37 - index.html | 16 3 files changed, 45 insertions(+), 9 deletions(-) -- http://git-wip-us.apache.org/repos/asf/madlib-site/blob/6d7f908b/documentation.html -- diff --git a/documentation.html b/documentation.html index 0d01094..8727541 100644 --- a/documentation.html +++ b/documentation.html @@ -55,6 +55,7 @@ jQuery(document).ready(function() { The primary documentation reference material providing detailed information on the functions and algorithms within MADlib as well as background theory and references into the literature. Older Documentation +MADlib v1.15 MADlib v1.14 MADlib v1.13 MADlib v1.12 http://git-wip-us.apache.org/repos/asf/madlib-site/blob/6d7f908b/download.html -- diff --git a/download.html b/download.html index ce790ad..1781c9d 100644 --- a/download.html +++ b/download.html @@ -58,7 +58,7 @@ Current Release - v1.15 + v1.15.1 Source Code and Convenience Binaries MADlib® source code and convenience binaries are available from the Apache distribution site. @@ -66,13 +66,15 @@ Latest stable release: - http://apache.org/dyn/closer.cgi?filename=madlib/1.15/apache-madlib-1.15-src.tar.gz&action=download";>Source code tar.gz (https://www.apache.org/dist/madlib/1.15/apache-madlib-1.15-src.tar.gz.asc";>pgp, https://www.apache.org/dist/madlib/1.15/apache-madlib-1.15-src.tar.gz.sha512";>sha512) + https://dist.apache.org/repos/dist/release/madlib/1.15.1/apache-madlib-1.15.1-src.tar.gz";>Source code tar.gz (https://www.apache.org/dist/madlib/1.15.1/apache-madlib-1.15.1-src.tar.gz.asc";>pgp, https://www.apache.org/dist/madlib/1.15.1/apache-madlib-1.15.1-src.tar.gz.sha512";>sha512) + + https://dist.apache.org/repos/dist/release/madlib/1.15.1/apache-madlib-1.15.1-bin-Linux-GPDB43.rpm";>Linux (https://www.apache.org/dist/madlib/1.15.1/apache-madlib-1.15.1-bin-Linux-GPDB43.rpm.asc";>pgp, https://www.apache.org/dist/madlib/1.15.1/apache-madlib-1.15.1-bin-Linux-GPDB43.rpm.sha512";>sha512) â CentOS / Red Hat 5 and higher (64 bit). GPDB 4.3.x. + + https://dist.apache.org/repos/dist/release/madlib/1.15.1/apache-madlib-1.15.1-bin-Linux.rpm";>Linux (https://www.apache.org/dist/madlib/1.15.1/apache-madlib-1.15.1-bin-Linux.rpm.asc";>pgp, https://www.apache.org/dist/madlib/1.15.1/apache-madlib-1.15.1-bin-Linux.rpm.sha512";>sha512) â CentOS / Red Hat 6 and higher (64 bit). GPDB 5.x, PostgreSQL 9.6 and 10.x. - http://apache.org/dyn/closer.cgi?filename=madlib/1.15/apache-madlib-1.15-bin-Linux-GPDB43.rpm&action=download";>Linux (https://www.apache.org/dist/madlib/1.15/apache-madlib-1.15-bin-Linux-GPDB43.rpm.asc";>pgp, https://www.apache.org/dist/madlib/1.15/apache-madlib-1.15-bin-Linux-GPDB43.rpm.sha512";>sha512) â CentOS / Red Hat 5 and higher (64 bit). GPDB 4.3.x. + https://dist.apache.org/repos/dist/release/madlib/1.15.1/apache-madlib-1.15.1-bin-Linux.deb";>Linux (https://www.apache.org/dist/madlib/1.15.1/apache-madlib-1.15.1-bin-Linux.deb.asc";>pgp, https://www.apache.org/dist/madlib/1.15.1/apache-madlib-1.15.1-bin-Linux.deb.sha512";>sha512) â Ubuntu 16.04. GPDB 5.x, PostgreSQL 9.6 and 10.x. - http://apache.org/dyn/closer.cgi?filename=madlib/1.15/apache-madlib-1.15-bin-Linux.rpm&action=download";>Linux (https://www.apache.org/dist/madlib/1.15/apac
madlib git commit: add caution on run-times to assoc rules user docs re: max itemset size usage
Repository: madlib Updated Branches: refs/heads/master d62e5516b -> e0f76db8b add caution on run-times to assoc rules user docs re: max itemset size usage Project: http://git-wip-us.apache.org/repos/asf/madlib/repo Commit: http://git-wip-us.apache.org/repos/asf/madlib/commit/e0f76db8 Tree: http://git-wip-us.apache.org/repos/asf/madlib/tree/e0f76db8 Diff: http://git-wip-us.apache.org/repos/asf/madlib/diff/e0f76db8 Branch: refs/heads/master Commit: e0f76db8bf2d7ca478d972cef302939b6f2babb5 Parents: d62e551 Author: Frank McQuillan Authored: Tue Sep 18 15:02:18 2018 -0700 Committer: Frank McQuillan Committed: Tue Sep 18 15:02:18 2018 -0700 -- .../modules/assoc_rules/assoc_rules.sql_in | 17 + 1 file changed, 13 insertions(+), 4 deletions(-) -- http://git-wip-us.apache.org/repos/asf/madlib/blob/e0f76db8/src/ports/postgres/modules/assoc_rules/assoc_rules.sql_in -- diff --git a/src/ports/postgres/modules/assoc_rules/assoc_rules.sql_in b/src/ports/postgres/modules/assoc_rules/assoc_rules.sql_in index ec3c330..bcd5464 100644 --- a/src/ports/postgres/modules/assoc_rules/assoc_rules.sql_in +++ b/src/ports/postgres/modules/assoc_rules/assoc_rules.sql_in @@ -161,6 +161,12 @@ Given a frequent itemset \f$ A \f$ generated from the Apriori algorithm, and all subsets \f$ B \f$ , we generate rules such that \f$ B \Rightarrow (A - B) \f$ meets minimum confidence requirements. +@note Beware of combinatorial explosion. The Apriori algorithm can potentially +generate a huge number of rules, even for fairly simple data sets, resulting +in run-times that are unreasonably long. To avoid this, it is recommended +to cap the maximum itemset size to a small number to start with, then +increase it gradually. Support and confidence values are +parameters that can also be used to control rule generation. @anchor syntax @par Function Syntax @@ -257,14 +263,16 @@ This generates all association rules that satisfy the specified minimum \c conviction columns are calculated as described earlier. - verbose + verbose (optional) BOOLEAN, default: FALSE. Determines if details are printed for each iteration as the algorithm progresses. - max_itemset_size + max_itemset_size (optional) INTEGER, default: generate itemsets of all sizes. Determines the maximum size of frequent itemsets that are used for generating association rules. Must be 2 or more. - This parameter can be used to reduce run time for data sets where itemset size is large. + This parameter can be used to reduce run time for data sets where itemset size is large, + which is a common situation. If your query is not returning or is running too long, + try using a lower value for this parameter. @@ -338,7 +346,8 @@ Result: (7 rows) --# Limit association rules generated from itemsets of size at most 2: +-# Limit association rules generated from itemsets of size at most 2. This parameter is +a good way to reduce long run times. SELECT * FROM madlib.assoc_rules( .25,-- Support .5, -- Confidence
madlib git commit: minor docs update to svm and elastic net on cross validation table naming
Repository: madlib Updated Branches: refs/heads/master 3db98babe -> 85d09e675 minor docs update to svm and elastic net on cross validation table naming Project: http://git-wip-us.apache.org/repos/asf/madlib/repo Commit: http://git-wip-us.apache.org/repos/asf/madlib/commit/85d09e67 Tree: http://git-wip-us.apache.org/repos/asf/madlib/tree/85d09e67 Diff: http://git-wip-us.apache.org/repos/asf/madlib/diff/85d09e67 Branch: refs/heads/master Commit: 85d09e6750570ca2faf653495f5ee6146c25b536 Parents: 3db98ba Author: Frank McQuillan Authored: Thu Sep 13 12:22:21 2018 -0700 Committer: Frank McQuillan Committed: Thu Sep 13 12:22:21 2018 -0700 -- .../modules/elastic_net/elastic_net.sql_in| 2 ++ src/ports/postgres/modules/svm/svm.sql_in | 18 ++ 2 files changed, 20 insertions(+) -- http://git-wip-us.apache.org/repos/asf/madlib/blob/85d09e67/src/ports/postgres/modules/elastic_net/elastic_net.sql_in -- diff --git a/src/ports/postgres/modules/elastic_net/elastic_net.sql_in b/src/ports/postgres/modules/elastic_net/elastic_net.sql_in index e30c98c..157851d 100644 --- a/src/ports/postgres/modules/elastic_net/elastic_net.sql_in +++ b/src/ports/postgres/modules/elastic_net/elastic_net.sql_in @@ -239,6 +239,8 @@ averaged over all folds and all rows. For classification, the accuracy metric used is the ratio of correct classifications. For regression, the accuracy metric used is the negative of mean squared error (negative to make it a concave problem, thus selecting \e max means the highest accuracy). +Cross validation scores are written out to a separate table with the +user specified name given in the 'validation_result' parameter. The values of a parameter to cross validate should be provided in a list. For example, to regularize with the L1 norm and use a lambda value http://git-wip-us.apache.org/repos/asf/madlib/blob/85d09e67/src/ports/postgres/modules/svm/svm.sql_in -- diff --git a/src/ports/postgres/modules/svm/svm.sql_in b/src/ports/postgres/modules/svm/svm.sql_in index ccf2c26..ddfa134 100644 --- a/src/ports/postgres/modules/svm/svm.sql_in +++ b/src/ports/postgres/modules/svm/svm.sql_in @@ -229,6 +229,24 @@ A summary table named \_summary is also created, which has the fol + If cross validation is used, a table is created with a + user specified name having the following columns: + + +... +Names of cross validation parameters + + +mean_score +Mean value of accuracy when predicted on the +validation fold, averaged over all folds and all rows. + + +std_dev_score +Standard deviation of accuracy when predicted on the +validation fold, averaged over all folds and all rows. + + @anchor svm_regression @par Regression Training Function
madlib git commit: add note to user docs on vec2cols about unequal arrays
Repository: madlib Updated Branches: refs/heads/master a3b59356f -> 5e707f745 add note to user docs on vec2cols about unequal arrays Project: http://git-wip-us.apache.org/repos/asf/madlib/repo Commit: http://git-wip-us.apache.org/repos/asf/madlib/commit/5e707f74 Tree: http://git-wip-us.apache.org/repos/asf/madlib/tree/5e707f74 Diff: http://git-wip-us.apache.org/repos/asf/madlib/diff/5e707f74 Branch: refs/heads/master Commit: 5e707f745c50343dd7395a3e8f86c04428210977 Parents: a3b5935 Author: Frank McQuillan Authored: Fri Aug 17 13:38:20 2018 -0700 Committer: Frank McQuillan Committed: Fri Aug 17 13:38:20 2018 -0700 -- .../postgres/modules/stats/correlation.sql_in| 10 +- .../postgres/modules/utilities/cols2vec.sql_in | 4 ++-- .../postgres/modules/utilities/vec2cols.sql_in | 19 --- 3 files changed, 19 insertions(+), 14 deletions(-) -- http://git-wip-us.apache.org/repos/asf/madlib/blob/5e707f74/src/ports/postgres/modules/stats/correlation.sql_in -- diff --git a/src/ports/postgres/modules/stats/correlation.sql_in b/src/ports/postgres/modules/stats/correlation.sql_in index 64ed27e..3bf3e46 100644 --- a/src/ports/postgres/modules/stats/correlation.sql_in +++ b/src/ports/postgres/modules/stats/correlation.sql_in @@ -222,7 +222,7 @@ SELECT * FROM example_data_output ORDER BY column_position; column_position | variable | temperature | humidity -+-+-+-- - 1 | temperature | 1 | + 1 | temperature | 1 | 2 | humidity| 0.00607993890408995 |1 (2 rows) @@ -259,11 +259,11 @@ SELECT * FROM example_data_output ORDER BY day, column_position; column_position | variable | day |temperature| humidity -+-+--+---+-- - 1 | temperature | Mon | 1 | + 1 | temperature | Mon | 1 | 2 | humidity| Mon | 0.616876934548786 |1 - 1 | temperature | Tues | 1 | + 1 | temperature | Tues | 1 | 2 | humidity| Tues | 0.616876934548786 |1 - 1 | temperature | Wed | 1 | + 1 | temperature | Wed | 1 | 2 | humidity| Wed | -0.28969669368457 |1 (6 rows) @@ -315,7 +315,7 @@ SELECT * FROM example_data_output ORDER BY column_position; column_position | variable | temperature| humidity -+-+--+-- - 1 | temperature | 507.926664293343 | + 1 | temperature | 507.926664293343 | 2 | humidity| 2.40227839088644 | 307.359914560342 (2 rows) http://git-wip-us.apache.org/repos/asf/madlib/blob/5e707f74/src/ports/postgres/modules/utilities/cols2vec.sql_in -- diff --git a/src/ports/postgres/modules/utilities/cols2vec.sql_in b/src/ports/postgres/modules/utilities/cols2vec.sql_in index 82a1f94..0c54ab5 100644 --- a/src/ports/postgres/modules/utilities/cols2vec.sql_in +++ b/src/ports/postgres/modules/utilities/cols2vec.sql_in @@ -82,8 +82,8 @@ values. list_of_features_to_exclude (optional) TEXT. Default NULL. -Comma-separated string of column names to exclude from the feature array. -Typically used when 'list_of_features' is set to '*'. +Comma-separated string of column names to exclude from the feature array. Typically used +when 'list_of_features' is set to '*'. cols_to_output (optional) TEXT. Default NULL. http://git-wip-us.apache.org/repos/asf/madlib/blob/5e707f74/src/ports/postgres/modules/utilities/vec2cols.sql_in -- diff --git a/src/ports/postgres/modules/utilities/vec2cols.sql_in b/src/ports/postgres/modules/utilities/vec2cols.sql_in index 989074c..115e015 100644 --- a/src/ports/postgres/modules/utilities/vec2cols.sql_in +++ b/src/ports/postgres/modules/utilities/vec2cols.sql_in @@ -72,23 +72,28 @@ vec2cols( same name already exists, an error will be returned. vector_col -TEXT. Name of the column containing the feature array. -Must be a one-dimensional array. +TEXT. Name of the column containing the feature array. Must be a one-dimensional array. feature_names (optional) -TEXT[]. Array of names associated with the feature array. -Note that this array exists in the -summary table created by the function 'cols2vec'. -If the 'feature_names' array is not specified, +TEXT[].
madlib-site git commit: website update for 1dot15
Repository: madlib-site Updated Branches: refs/heads/asf-site 573d66d85 -> 127c0b7e7 website update for 1dot15 Project: http://git-wip-us.apache.org/repos/asf/madlib-site/repo Commit: http://git-wip-us.apache.org/repos/asf/madlib-site/commit/127c0b7e Tree: http://git-wip-us.apache.org/repos/asf/madlib-site/tree/127c0b7e Diff: http://git-wip-us.apache.org/repos/asf/madlib-site/diff/127c0b7e Branch: refs/heads/asf-site Commit: 127c0b7e7dd5d760bdbee5e82928e70418af1447 Parents: 573d66d Author: Frank McQuillan Authored: Sat Aug 11 09:32:51 2018 -0700 Committer: Frank McQuillan Committed: Sat Aug 11 09:37:21 2018 -0700 -- documentation.html | 3 ++- download.html | 42 +++--- index.html | 42 +++--- 3 files changed, 56 insertions(+), 31 deletions(-) -- http://git-wip-us.apache.org/repos/asf/madlib-site/blob/127c0b7e/documentation.html -- diff --git a/documentation.html b/documentation.html index 4670603..0d01094 100644 --- a/documentation.html +++ b/documentation.html @@ -55,6 +55,7 @@ jQuery(document).ready(function() { The primary documentation reference material providing detailed information on the functions and algorithms within MADlib as well as background theory and references into the literature. Older Documentation +MADlib v1.14 MADlib v1.13 MADlib v1.12 MADlib v1.11 @@ -102,7 +103,7 @@ jQuery(document).ready(function() { https://github.com/apache/madlib-site/tree/asf-site/community-artifacts";>Jupyter Notebooks for Getting Started -Includes many of the most commonly used algorithms by data scientists. +Includes many commonly used algorithms by data scientists. http://git-wip-us.apache.org/repos/asf/madlib-site/blob/127c0b7e/download.html -- diff --git a/download.html b/download.html index 8728d10..ce790ad 100644 --- a/download.html +++ b/download.html @@ -58,7 +58,7 @@ Current Release - v1.14 + v1.15 Source Code and Convenience Binaries MADlib® source code and convenience binaries are available from the Apache distribution site. @@ -66,10 +66,13 @@ Latest stable release: - http://apache.org/dyn/closer.cgi?filename=madlib/1.14/apache-madlib-1.14-src.tar.gz&action=download";>Source code tar.gz (https://www.apache.org/dist/madlib/1.14/apache-madlib-1.14-src.tar.gz.asc";>pgp, https://www.apache.org/dist/madlib/1.14/apache-madlib-1.14-src.tar.gz.sha512";>sha512) - http://apache.org/dyn/closer.cgi?filename=madlib/1.14/apache-madlib-1.14-bin-Linux-GPDB43.rpm&action=download";>Linux (https://www.apache.org/dist/madlib/1.14/apache-madlib-1.14-bin-Linux-GPDB43.rpm.asc";>pgp, https://www.apache.org/dist/madlib/1.14/apache-madlib-1.14-bin-Linux-GPDB43.rpm.sha512";>sha512) â CentOS / Red Hat 5 and higher (64 bit). GPDB 4.3.x. - http://apache.org/dyn/closer.cgi?filename=madlib/1.14/apache-madlib-1.14-bin-Linux.rpm&action=download";>Linux (https://www.apache.org/dist/madlib/1.14/apache-madlib-1.14-bin-Linux.rpm.asc";>pgp, https://www.apache.org/dist/madlib/1.14/apache-madlib-1.14-bin-Linux.rpm.sha512";>sha512) â CentOS / Red Hat 6 and higher (64 bit). GPDB 5.x, PostgreSQL 9.6 and 10.2. - http://apache.org/dyn/closer.cgi?filename=madlib/1.14/apache-madlib-1.14-bin-Darwin.dmg&action=download";>Mac OS X (https://www.apache.org/dist/madlib/1.14/apache-madlib-1.14-bin-Darwin.dmg.asc";>pgp, https://www.apache.org/dist/madlib/1.14/apache-madlib-1.14-bin-Darwin.dmg.sha512";>sha512) â OS 10.6 and higher. For PostgreSQL 9.6 and 10.2. + http://apache.org/dyn/closer.cgi?filename=madlib/1.15/apache-madlib-1.15-src.tar.gz&action=download";>Source code tar.gz (https://www.apache.org/dist/madlib/1.15/apache-madlib-1.15-src.tar.gz.asc";>pgp, https://www.apache.org/dist/madlib/1.15/apache-madlib-1.15-src.tar.gz.sha512";>sha512) + + http://apache.org/dyn/closer.cgi?filename=madlib/1.15/apache-madlib-1.15-bin-Linux-GPDB43.rpm&action=download";>Linux (https://www.apache.org/dist/madlib/1.15/apache-madlib-1.15-bin-Linux-GPDB43.rpm.asc";>pgp, https://www.apache.org/d
madlib git commit: minor edit to minibatch preproc user doc
Repository: madlib Updated Branches: refs/heads/master 186390f7c -> 298fed799 minor edit to minibatch preproc user doc Project: http://git-wip-us.apache.org/repos/asf/madlib/repo Commit: http://git-wip-us.apache.org/repos/asf/madlib/commit/298fed79 Tree: http://git-wip-us.apache.org/repos/asf/madlib/tree/298fed79 Diff: http://git-wip-us.apache.org/repos/asf/madlib/diff/298fed79 Branch: refs/heads/master Commit: 298fed799f3e8c728195882bea01479b644ee248 Parents: 186390f Author: Frank McQuillan Authored: Thu Aug 2 10:34:17 2018 -0700 Committer: Frank McQuillan Committed: Thu Aug 2 10:34:17 2018 -0700 -- .../utilities/minibatch_preprocessing.sql_in| 16 1 file changed, 16 deletions(-) -- http://git-wip-us.apache.org/repos/asf/madlib/blob/298fed79/src/ports/postgres/modules/utilities/minibatch_preprocessing.sql_in -- diff --git a/src/ports/postgres/modules/utilities/minibatch_preprocessing.sql_in b/src/ports/postgres/modules/utilities/minibatch_preprocessing.sql_in index 75adcc9..ead43d9 100644 --- a/src/ports/postgres/modules/utilities/minibatch_preprocessing.sql_in +++ b/src/ports/postgres/modules/utilities/minibatch_preprocessing.sql_in @@ -144,22 +144,6 @@ already encoded the dependent variable yourself, you can ignore this parameter. Also, if you want to encode float values for some reason, cast them to text first. - - one_hot_encode_int_dep_var (optional) - BOOLEAN. default: FALSE. - A flag to decide whether to one-hot encode dependent variables that are -scalar integers. This parameter is ignored if the dependent variable is not a -scalar integer. - -@note The mini-batch preprocessor automatically encodes -dependent variables that are boolean and character types such as text, char and -varchar. However, scalar integers are a special case because they can be used -in both classification and regression problems, so you must tell the mini-batch -preprocessor whether you want to encode them or not. In the case that you have -already encoded the dependent variable yourself, you can ignore this parameter. -Also, if you want to encode float values for some reason, cast them to text -first. - Output tables
[12/18] madlib-site git commit: update jupyter notebooks for 1dot15
http://git-wip-us.apache.org/repos/asf/madlib-site/blob/acd339f6/community-artifacts/KNN-v4.ipynb -- diff --git a/community-artifacts/KNN-v4.ipynb b/community-artifacts/KNN-v4.ipynb new file mode 100644 index 000..a4b3304 --- /dev/null +++ b/community-artifacts/KNN-v4.ipynb @@ -0,0 +1,857 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ +"# k-Nearest Neighbors\n", +"Finds k nearest data points to a given data point and outputs majority vote value of output classes in case of classification, and average value of target values in case of regression. KNN was first added in MADlib 1.10 with updates in 1.13 and 1.14." + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [ +{ + "name": "stderr", + "output_type": "stream", + "text": [ + "/Users/fmcquillan/anaconda/lib/python2.7/site-packages/IPython/config.py:13: ShimWarning: The `IPython.config` package has been deprecated. You should import from traitlets.config instead.\n", + " \"You should import from traitlets.config instead.\", ShimWarning)\n", + "/Users/fmcquillan/anaconda/lib/python2.7/site-packages/IPython/utils/traitlets.py:5: UserWarning: IPython.utils.traitlets has moved to a top-level traitlets package.\n", + " warn(\"IPython.utils.traitlets has moved to a top-level traitlets package.\")\n" + ] +} + ], + "source": [ +"%load_ext sql" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ +{ + "data": { + "text/plain": [ + "u'Connected: gpadmin@madlib'" + ] + }, + "execution_count": 2, + "metadata": {}, + "output_type": "execute_result" +} + ], + "source": [ +"# Greenplum Database 5.4.0 on GCP (demo machine)\n", +"%sql postgresql://gpadmin@35.184.253.255:5432/madlib\n", +"\n", +"# PostgreSQL local\n", +"#%sql postgresql://fmcquillan@localhost:5432/madlib\n", +"\n", +"# Greenplum Database 4.3.10.0\n", +"#%sql postgresql://gpdbchina@10.194.10.68:61000/madlib" + ] + }, + { + "cell_type": "code", + "execution_count": 67, + "metadata": {}, + "outputs": [ +{ + "name": "stdout", + "output_type": "stream", + "text": [ + "1 rows affected.\n" + ] +}, +{ + "data": { + "text/html": [ + "\n", + "\n", + "version\n", + "\n", + "\n", + "MADlib version: 1.14-dev, git revision: rc/1.13-rc1-12-gb8a306e, cmake configuration time: Mon Feb 12 19:57:54 UTC 2018, build type: release, build system: Linux-2.6.32-696.20.1.el6.x86_64, C compiler: gcc 4.4.7, C++ compiler: g++ 4.4.7\n", + "\n", + "" + ], + "text/plain": [ + "[(u'MADlib version: 1.14-dev, git revision: rc/1.13-rc1-12-gb8a306e, cmake configuration time: Mon Feb 12 19:57:54 UTC 2018, build type: release, build system: Linux-2.6.32-696.20.1.el6.x86_64, C compiler: gcc 4.4.7, C++ compiler: g++ 4.4.7',)]" + ] + }, + "execution_count": 67, + "metadata": {}, + "output_type": "execute_result" +} + ], + "source": [ +"%sql select madlib.version();\n", +"#%sql select version();" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ +"# 1. Load data for classification" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ +{ + "name": "stdout", + "output_type": "stream", + "text": [ + "Done.\n", + "Done.\n", + "9 rows affected.\n", + "9 rows affected.\n" + ] +}, +{ + "data": { + "text/html": [ + "\n", + "\n", + "id\n", + "data\n", + "label\n", + &
[08/18] madlib-site git commit: update jupyter notebooks for 1dot15
http://git-wip-us.apache.org/repos/asf/madlib-site/blob/acd339f6/community-artifacts/Random-forest-v1.ipynb -- diff --git a/community-artifacts/Random-forest-v1.ipynb b/community-artifacts/Random-forest-v1.ipynb deleted file mode 100644 index bac8363..000 --- a/community-artifacts/Random-forest-v1.ipynb +++ /dev/null @@ -1,2899 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ -"# Random forest\n", -"\n", -"Random forests build an ensemble of classifiers, each of which is a tree model constructed using bootstrapped samples from the input data. The results of these models are then combined to yield a single prediction, which, at the expense of some loss in interpretation, have been found to be highly accurate.\n", -"\n", -"Please also refer to the decision tree user documentation for information relevant to the implementation of random forests in MADlib." - ] - }, - { - "cell_type": "code", - "execution_count": 72, - "metadata": {}, - "outputs": [ -{ - "name": "stdout", - "output_type": "stream", - "text": [ - "The sql extension is already loaded. To reload it, use:\n", - " %reload_ext sql\n" - ] -} - ], - "source": [ -"%load_ext sql" - ] - }, - { - "cell_type": "code", - "execution_count": 73, - "metadata": {}, - "outputs": [ -{ - "data": { - "text/plain": [ - "u'Connected: gpadmin@madlib'" - ] - }, - "execution_count": 73, - "metadata": {}, - "output_type": "execute_result" -} - ], - "source": [ -"# Greenplum Database 5.4.0 on GCP (demo machine)\n", -"%sql postgresql://gpadmin@35.184.253.255:5432/madlib\n", -"\n", -"# PostgreSQL local\n", -"#%sql postgresql://fmcquillan@localhost:5432/madlib\n", -"\n", -"# Greenplum Database 4.3.10.0\n", -"#%sql postgresql://gpdbchina@10.194.10.68:61000/madlib" - ] - }, - { - "cell_type": "code", - "execution_count": 75, - "metadata": {}, - "outputs": [ -{ - "name": "stdout", - "output_type": "stream", - "text": [ - "1 rows affected.\n" - ] -}, -{ - "data": { - "text/html": [ - "\n", - "\n", - "version\n", - "\n", - "\n", - "MADlib version: 1.14-dev, git revision: rc/1.13-rc1-40-ga1360f3, cmake configuration time: Wed Mar 28 18:16:08 UTC 2018, build type: release, build system: Linux-2.6.32-696.20.1.el6.x86_64, C compiler: gcc 4.4.7, C++ compiler: g++ 4.4.7\n", - "\n", - "" - ], - "text/plain": [ - "[(u'MADlib version: 1.14-dev, git revision: rc/1.13-rc1-40-ga1360f3, cmake configuration time: Wed Mar 28 18:16:08 UTC 2018, build type: release, build system: Linux-2.6.32-696.20.1.el6.x86_64, C compiler: gcc 4.4.7, C++ compiler: g++ 4.4.7',)]" - ] - }, - "execution_count": 75, - "metadata": {}, - "output_type": "execute_result" -} - ], - "source": [ -"%sql select madlib.version();\n", -"#%sql select version();" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ -"# Random forest classification examples" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ -"# 1. Load data\n", -"Data set related to whether to play golf or not." - ] - }, - { - "cell_type": "code", - "execution_count": 76, - "metadata": {}, - "outputs": [ -{ - "name": "stdout", - "output_type": "stream", - "text": [ - "Done.\n", - "Done.\n", - "14 rows affected.\n", - "14 rows affected.\n" - ] -}, -{ - "data": { - "text/html": [ - "\n", - "\n", - "id\n", - "OUTLOOK\n", - "
[17/18] madlib-site git commit: update jupyter notebooks for 1dot15
http://git-wip-us.apache.org/repos/asf/madlib-site/blob/acd339f6/community-artifacts/Covariance-and-correlation-v1.ipynb -- diff --git a/community-artifacts/Covariance-and-correlation-v1.ipynb b/community-artifacts/Covariance-and-correlation-v1.ipynb new file mode 100644 index 000..aa17628 --- /dev/null +++ b/community-artifacts/Covariance-and-correlation-v1.ipynb @@ -0,0 +1,1318 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ +"# Covariance and Correlation\n", +"\n", +"Generates a covariance or Pearson correlation matrix for pairs of numeric columns in a table. Grouping added in 1.15." + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": { +"scrolled": true + }, + "outputs": [ +{ + "name": "stderr", + "output_type": "stream", + "text": [ + "/Users/fmcquillan/anaconda/lib/python2.7/site-packages/IPython/config.py:13: ShimWarning: The `IPython.config` package has been deprecated. You should import from traitlets.config instead.\n", + " \"You should import from traitlets.config instead.\", ShimWarning)\n", + "/Users/fmcquillan/anaconda/lib/python2.7/site-packages/IPython/utils/traitlets.py:5: UserWarning: IPython.utils.traitlets has moved to a top-level traitlets package.\n", + " warn(\"IPython.utils.traitlets has moved to a top-level traitlets package.\")\n" + ] +} + ], + "source": [ +"%load_ext sql" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ +{ + "data": { + "text/plain": [ + "u'Connected: gpadmin@madlib'" + ] + }, + "execution_count": 2, + "metadata": {}, + "output_type": "execute_result" +} + ], + "source": [ +"# Greenplum Database 5.4.0 on GCP (demo machine)\n", +"%sql postgresql://gpadmin@35.184.253.255:5432/madlib\n", +"\n", +"# PostgreSQL local\n", +"#%sql postgresql://fmcquillan@localhost:5432/madlib\n", +"\n", +"# Greenplum Database 4.3.10.0\n", +"#%sql postgresql://gpdbchina@10.194.10.68:61000/madlib" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ +{ + "name": "stdout", + "output_type": "stream", + "text": [ + "1 rows affected.\n" + ] +}, +{ + "data": { + "text/html": [ + "\n", + "\n", + "version\n", + "\n", + "\n", + "MADlib version: 1.15-dev, git revision: rc/1.14-rc1-6-g3b80a32, cmake configuration time: Wed May 16 19:29:52 UTC 2018, build type: release, build system: Linux-2.6.32-696.20.1.el6.x86_64, C compiler: gcc 4.4.7, C++ compiler: g++ 4.4.7\n", + "\n", + "" + ], + "text/plain": [ + "[(u'MADlib version: 1.15-dev, git revision: rc/1.14-rc1-6-g3b80a32, cmake configuration time: Wed May 16 19:29:52 UTC 2018, build type: release, build system: Linux-2.6.32-696.20.1.el6.x86_64, C compiler: gcc 4.4.7, C++ compiler: g++ 4.4.7',)]" + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" +} + ], + "source": [ +"%sql select madlib.version();\n", +"#%sql select version();" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ +"# 1. Load data" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ +{ + "name": "stdout", + "output_type": "stream", + "text": [ + "Done.\n", + "Done.\n", + "53 rows affected.\n", + "53 rows affected.\n" + ] +}, +{ + "data": { + "text/html": [ + "\n", + "\n", + "id\n", + "outlook\n", + "t
[13/18] madlib-site git commit: update jupyter notebooks for 1dot15
http://git-wip-us.apache.org/repos/asf/madlib-site/blob/acd339f6/community-artifacts/Elastic-net-v3.ipynb -- diff --git a/community-artifacts/Elastic-net-v3.ipynb b/community-artifacts/Elastic-net-v3.ipynb new file mode 100644 index 000..7592fe6 --- /dev/null +++ b/community-artifacts/Elastic-net-v3.ipynb @@ -0,0 +1,2049 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ +"# Elastic net (MADlib v1.10+)\n", +"Demonstrates elastic net, including these updates:\n", +"- in MADlib 1.10: grouping and cross validation introduced \n", +"- in MADlib 1.13: report negative root mean squared error instead of the negative mean squared error" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ +{ + "name": "stdout", + "output_type": "stream", + "text": [ + "The sql extension is already loaded. To reload it, use:\n", + " %reload_ext sql\n" + ] +} + ], + "source": [ +"%load_ext sql" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ +{ + "data": { + "text/plain": [ + "u'Connected: gpadmin@madlib'" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" +} + ], + "source": [ +"# Greenplum Database 5.4.0 on GCP (demo machine)\n", +"%sql postgresql://gpadmin@35.184.253.255:5432/madlib\n", +"\n", +"# PostgreSQL local\n", +"#%sql postgresql://fmcquillan@localhost:5432/madlib" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [ +{ + "name": "stdout", + "output_type": "stream", + "text": [ + "1 rows affected.\n" + ] +}, +{ + "data": { + "text/html": [ + "\n", + "\n", + "version\n", + "\n", + "\n", + "MADlib version: 1.15-dev, git revision: rc/1.14-rc1-23-gabafa66, cmake configuration time: Wed Jul 11 00:36:05 UTC 2018, build type: release, build system: Linux-2.6.32-696.20.1.el6.x86_64, C compiler: gcc 4.4.7, C++ compiler: g++ 4.4.7\n", + "\n", + "" + ], + "text/plain": [ + "[(u'MADlib version: 1.15-dev, git revision: rc/1.14-rc1-23-gabafa66, cmake configuration time: Wed Jul 11 00:36:05 UTC 2018, build type: release, build system: Linux-2.6.32-696.20.1.el6.x86_64, C compiler: gcc 4.4.7, C++ compiler: g++ 4.4.7',)]" + ] + }, + "execution_count": 13, + "metadata": {}, + "output_type": "execute_result" +} + ], + "source": [ +"%sql select madlib.version();\n", +"#%sql select version();" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ +"## 1. Create data set\n", +"House prices and characteristics." + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ +{ + "name": "stdout", + "output_type": "stream", + "text": [ + "Done.\n", + "Done.\n", + "27 rows affected.\n", + "27 rows affected.\n" + ] +}, +{ + "data": { + "text/html": [ + "\n", + "\n", + "id\n", + "tax\n", + "bedroom\n", + "bath\n", + "price\n", + "size\n", + "lot\n", + "zipcode\n", + "\n", + "\n", + "1\n", + "590\n", + "2\n", + "1.0\n", + "5\n", + "770\n", + "22100\n", + "94301\n", + "\n", + "
[09/18] madlib-site git commit: update jupyter notebooks for 1dot15
http://git-wip-us.apache.org/repos/asf/madlib-site/blob/acd339f6/community-artifacts/Novelty-detection-demo-1.ipynb -- diff --git a/community-artifacts/Novelty-detection-demo-1.ipynb b/community-artifacts/Novelty-detection-demo-1.ipynb deleted file mode 100755 index 563bda4..000 --- a/community-artifacts/Novelty-detection-demo-1.ipynb +++ /dev/null @@ -1,478 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ -"# Novelty detection using 1-class SVM\n", -"\n", -"Classifies new data as similar or different to the training set. This method is an unsupervised method that builds a decision boundary between the data and origin in kernel space and can be used as a novelty detector." - ] - }, - { - "cell_type": "code", - "execution_count": 37, - "metadata": { -"collapsed": false - }, - "outputs": [ -{ - "name": "stdout", - "output_type": "stream", - "text": [ - "The sql extension is already loaded. To reload it, use:\n", - " %reload_ext sql\n" - ] -} - ], - "source": [ -"# Setup\n", -"%load_ext sql\n", -"# %sql postgresql://gpdbchina@10.194.10.68:55000/madlib\n", -"%sql postgresql://fmcquillan@localhost:5432/madlib\n", -"%matplotlib inline\n", -"\n", -"import pandas as pd\n", -"import numpy as np\n", -"import matplotlib.pyplot as plt\n", -"import matplotlib.font_manager" - ] - }, - { - "cell_type": "code", - "execution_count": 38, - "metadata": { -"collapsed": false - }, - "outputs": [ -{ - "data": { - "image/png": "iVBORw0KGgoNSUhEUgAAAW8AAAD7CAYAAAClvBX1BHNCSVQICAgIfAhkiAlwSFlz\nAAALEgAACxIB0t1+/AAAHdNJREFUeJzt3W9wXNWZ5/HvkWWRTrADso2J48SAnRk2hICMijJFaqVN\n0mqGqWhG0hvCwDSwi3Zq+WOsNiiOCHFheRUnESQwM8WYsEhhirCVYTUjZid9LZKSqkSF7LA2lJeB\nAHaGTUIYYpydGOiJsHX2xbndarW69cfq7tu3+/ep6qL76va9R23z+PRznnOOsdYiIiLhUhd0A0RE\nZOkUvEVEQkjBW0QkhBS8RURCSMFbRCSEFLxFREKovlw3MsaoJlFE5DRYa03usbL2vK21gT6++tWv\nBt6GSnnos9Bnoc8iHJ9FIUqbiIiEkIK3iEgI1VTwbm1tDboJFUOfxQx9FjP0Wcyo9M/CzJdTKeqN\njLHlupeISLUwxmCDHrAUEZHiUPAWEQkhBW8RkRBS8BYRCSEFbxGREFLwFhEJIQVvEZEQUvAWEQkh\nBW8RkRAKffD2PI+utja62trwPC/o5oiIlEWop8d7nke8o4N9qRQAvZEIwyMjxGKxot5HRCQohabH\nhzp4d7W10T42Rtx/PQyMRqM8eeBAUe8jIhKUkq9tYoypM8YcNMaMFuuaIiKSXzG3QdsO/BOwuojX\nnFd3IkF8chKy0yaJRLluLyISmKL0vI0xG4Grge8U43qLFYvFGB4ZYTQaZTQaVb5bQk2D77IURcl5\nG2O+D+wFPgwkrLXtec7Ret4iBWjwXQoplPNedtrEGPOHwL9Ya583xrQCc26Stnv37szz1tbWit+p\nQqRc9g8Osi+Vygy+k0qxf3BQwbsGjY+PMz4+vuB5y+55G2P+K3AdcBKIAKuA/2Gt/dOc89TzFilA\nlVNSSFlKBY0xLShtIrJkSptIISVLm4jI8qUH3/cPD gIwnEgocMu8Qj1JR0Sk2mkDYhGRKqLgLSIS\nQgreIiIhpOAtIhJCCt4iIiGk4C0iUkKlWrNGpYIiIiVSjMlXVbkZg4hIJSvGsgeq8xYRqSKaHi8i\nUiKl3DBGaRMRkRLyPC+zZk33aaxZo5y3iEgIKectIlJFFLxFpCy0R2dxKW0iIiWnzSZOn3LeIhIY\nbfN2+pTzFhE8z6OtrYu2ti6lLkJOdd4iNcLzPDo64qRS+wCYnIwzMjJcltRFKeuda5XSJiI1oq2t\ni7GxdshKXkSjoxw48GRZ7r/ceudapQ2IRSQvz/MYHNwPQCLRXbKgGovFFLCLSMFbpEYkEt1MTsbT\nmQsikV5aWm4LLJUiy6O0iUgNye1lDw7uDzSVIgtT2kRE5qQu0oFcwkelgiI1Zu/evaxZs4U1a7aw\nYcMqIpFeXOX1MJFIL4lEd9BNlEVQ2kSkhuzdu5e77/468IB/5Hbi8Q7eeOMEUNoBSzk9mmEpIqxZ\ns4Xjx79Cdo67sXEPb7/9WpDNknlohqWILJpmYlY+BW8JBa1IN7/FBtuenhuB20nnuOF2/9jsa3V0\nxBkba2dsrJ2Ojrg+80pkrS3Lw91KZOmSyaRdH4nYIbBDYNdHIjaZTAbdrIqRTCZtJLLewpCFIRuJ\nrC/4+SSTSbt58ydtff05dtWqj9v+/v4550Sjnf61rP8YstFoZ+b9TU0ttrFxs21qulJ/DmXgx865\nMTXfwVI8FLzldHVGo3ZoJpLYIbCd0WjQzaoY8wXbbPMF+WQyaaPRThuNdtqmppY511u16mM2Ho9b\nYxotbLOQsLDaGnO2bWpqURAvoULBW3XeIjVi164BfyalG6xMpWbqvLNnWTY03MGKFTs4dSr9zp2c\nOJFiePgJ4FbgYuAOoB5r7+PQIfd+zcwsL+W8peJ1JxJuFTpclrY3EqG7BlekK5TXTiS689Zq33DD\nDaxcuZ6VK9cTjUZ54YX/k74S 0AU8xNGjr3LttbeQSp0PnAucy9TURk6deh94CBgF/hr4C+BDwLP+\neWcCv+c/d4FfE37KLF93vBQPlDaRZUgmk7YzGrWd0WhNfkVfKK+dnfZIJpM2Ho9bWJ053z2/0sKZ\nFtbmHE/4z8+ysM5/vm1O6sQd+4iF9VnvX28hWTBVI8uHct4i4TVfXru/v982Nm62kci5NhJZZxsb\nN1tjPphzfsIPzp/KE5Q3+wE4O2Anc4L82f41zi0Q1FfnHfyU5SsUvJU2EQmx9IzJ48e/Qir1NVIp\ny/Hjf4y19cBTWWc+A3wLl+rItQ64BvhnXKrEA2K43PhOXH77s8AQkG+i3a+Bm5mYOFicX0oWZdkD\nlsaYjcB3gfXANPCwtfaB+d8lIkuRbznXRGKYa6+9BTfVPZ519qh/bAcuDw7wsv/f7qxzD+MC8hrg\nJHAecCVwLXAO8BbwH4CngQNAJ/ATYHvWvdK59jeBnxXjV5VFKkbP+yTQY629CLgCuMUYc2ERrisi\nvlgsxsiIW641Gh1dVGVHQ0MD9fV3UV9/F5//fLM/qPkmcB2uauQRYBD4MvABXOB+BNgAvAPcCEwC\n/wnYBPwt8CX/9Xagx7/Wm6e1oJVmcS5P0dc2Mcb8LfCgtfaHOcdtse8lUuvmLjS1E9ezfpj+/rsA\nuO++RwH4whc+k1mA6ujRoxw5cgfM2s/9YeCnwDf9Y7244PwM8Aug3z/fA3ZTV/cSH/rQarZsuYCB\ngV1LKhPM3U8zEulVqWEBZVnP2xhzHnAp7ruViJRYX18fAPfdt4dU6l3AUF//fc45ZyOPPvo4R478\nM/BpAIaHn2Dz5n/HBRdcwOrVH8650mHgVVzgzk7BfA2XPpn2z/H8n+9jehpOnNjJ
[03/18] madlib-site git commit: update jupyter notebooks for 1dot15
http://git-wip-us.apache.org/repos/asf/madlib-site/blob/acd339f6/community-artifacts/mlp-mnist-v2.ipynb -- diff --git a/community-artifacts/mlp-mnist-v2.ipynb b/community-artifacts/mlp-mnist-v2.ipynb deleted file mode 100644 index 3c1ad14..000 --- a/community-artifacts/mlp-mnist-v2.ipynb +++ /dev/null @@ -1,1154 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ -"# Neural networks\n", -"\n", -"Multilayer perceptron (MLP) using the well known MNIST data set.\n", -"\n", -"Updated to include mini-batching which was added in the 1.14 release.\n", -"\n", -"# Intro" - ] - }, - { - "cell_type": "code", - "execution_count": 1, - "metadata": {}, - "outputs": [ -{ - "data": { - "image/jpeg": "/9j/4R5fRXhpZgAATU0AKggABwESAAMBAAEAAAEaAAUBYgEbAAUB\nagEoAAMBAAIAAAExAAIccgEyAAIUjodpAAQBpNAACvyA\nAAAnEAAK/IAAACcQQWRvYmUgUGhvdG9zaG9wIENTNSBXaW5kb3dzADIwMTU6MDc6MjQgMTA6NTk6\nNTEAA6ABAAMBAAEAAKACAAQBAAACoKADAAQBAAABcwAGAQMAAwAA\nAAEABgAAARoABQEAAAEeARsABQEAAAEmASgAAwEAAgAAAgEABAEAAAEuAgIA\nBAEAAB0pAEgBSAH/2P/tAAxBZG9iZV9DTQAB/+4ADkFkb2JlAGSA\nAf/bAIQADAgICAkIDAkJDBELCgsRFQ8MDA8VGBMTFRMTGBEMDAwMDAwRDAwMDAwMDAwMDAwM\nDAwMDAwMDAwMDAwMDAwMDAENCwsNDg0QDg4QFA4ODhQUDg4ODhQRDAwMDAwREQwMDAwMDBEMDAwM\nDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwM/8AAEQgAWACgAwEiAAIRAQMRAf/dAAQACv/EAT8AAAEF\nAQEBAQEBAAMAAQIEBQYHCAkKCwEAAQUBAQEBAQEAAQACAwQFBgcICQoLEAAB\nBAEDAgQCBQcGCAUDDDMBAAIRAwQhEjEFQVFhEyJxgTIGFJGhsUIjJBVSwWIzNHKC0UMHJZJT8OHx\nY3M1FqKygyZEk1RkRcKjdDYX0lXiZfKzhMPTdePzRieUpIW0lcTU5PSltcXV5fVWZnaGlqa2xtbm\n9jdHV2d3h5ent8fX5/cRAAICAQIEBAMEBQYHBwYF NQEAAhEDITESBEFRYXEiEwUygZEUobFCI8FS\n0fAzJGLhcoKSQ1MVY3M08SUGFqKygwcmNcLSRJNUoxdkRVU2dGXi8rOEw9N14/NGlKSFtJXE1OT0\npbXF1eX1VmZ2hpamtsbW5vYnN0dXZ3eHl6e3x//aAAwDAQACEQMRAD8A8+yL677Bjsa4VU+1pr5c\nZ997qHex2/8AkvqVmrEDunZFjr2OqqurbVZqNjnh+/1G7fUZQ7+bf/w3p2s/PR+o9Lz+kVMuusdm\nY1h9mVSG2UknX0vXs3Pqs/kW1/8AFb0OjM9YNu2FhYSygXuL91h/wTXN9KnZs/nv0P8AwX+EVwRH\nERI1OvlP8v0WrxXEGGsL+Yfi6n1b+qNudU578dz6GkObbYfTZPZ9su2V4j/of4Sy/wDwXpLQ+tXS\nOidDbVl0Xs6hmOALNAcdriZda9u57MjZ7fRxv5j/AE3s/QrAOX1nIoi8bq2GHAlpx/3dwZLfsX/W\n/wBH/wAUhWU+ptPUdlTK2D1Qx732j6XpsrbvuZ6j2t9m5VZ8vmMwYnhiP3gYxr+t+jxf3nRx83y8\ncUoyjxEg7cMpX/e+fh/2f/VHP9PJ6hkvfY51j3AudY4+0azue7/BsRsu+uPtNW71AQL9hibOPVtd\n9JzLo3Naz9H/ADnvVhl7La/s2JS19FmjaCSywfvW2tZ77/67LX/9bUR07Ira66k1uraD6raSCA3u\n17bW73s/66rAgRHT1dZS31aEpgy19P7sfBosyxbIbtxbXab2NAYf65hz6/625W+k5WS3Os6dkOc6\nvPY7FtredwBcZoewas3svbW6t6rkV2GWFjmNGrbWw5o8fUq3OsZ/L/z1ZxaDe0Vh4FteuPaTIkcU\nG5m5vu/wPqemm2QQbutfCX9X/CZBAT9OwP8AzfH/AAXO22Y73V3s3ODi19TuxB2O/lMf7UW+trqz\nbS42AfSB+nWB/pGt+kz/AId ns/4pdD9b+gHBycfMqE09Rory2gCNrntHrV/9ublzM2UvFlbi17T7\nSNCFBDNGVgai6/rNjLy04Ue4v+qUYrLQCSId2lELQGyHNDDoDrz3Uox8ru3HyI+Fdh/6nHf/AOAf\n8Qp112Ue8si9mlbXabf+HP8A6K/z/wDBqQD7O7AT337Mr7NlDGma3NeQA0agDbE/y91aLVjV5YL2\nAG0iLGAR/wChDG/+fa0G5jWUVusJBcBHftvc7/PuR6qMmusCtjqrLoDCdH7eXP8A3m7voMTrAJMv\nlA1WxgZUIfMTo3cetwz2VUVB9bQC+x2jWNI29/5KNlYFLb6rWPDgAfc7RrgP39v57VY6b0rqTG2W\nZ9fpYEB92Y7QV/ubv9I6z6Laq/0j0LO6sy+jKq6QQPQaLG3Fv6Z1YivJZW1381Tsd6/s9/6L9Ips\nOXBkxmQPFqTED+r+7+8s5jl8+HJwyiYmgDekfV++hf0NptORfdXTV/g6LPa6I3fzbvd/nIb8zpWP\nXNFbr36tftbtAP8AJts3PZ/Yas3EzH7ybALHO/Odq6f3nOSNFlgL2PLGEk+72Nn+S523d/ZQOSNf\nq4Cz39Ulvtm6nIkDt6Ypm9YbUf1WivHIduFhm2wT/wAJb/5FA+1W+u251jrTIIJ4Gu72fu+5RdTc\n0x6lTncwXt/79tU62vYSL2+m0CZI0I/kOHtUfFM6GwB4cMf+9ZOGI1Gv1uX/AHz/AP/Q5Tp+bfhd\nQNFDt+Na4NdU8B9dlbiN1N1T5rtb+Yl9Yei04edQ2uxuLhWVCzFYSS5u91nrV/n3Xena3+d/0foq\n5iYWFTe1vrnLvYQ5lTKHtqY8wd5teTZbWzd7WVM+mq31nysPL6kxl15NePWzGZ6bdzy6ou9d3q2P\npZ7rbH/yFp5Yj27IF8VCzXp/d/uubCROb02Bwni0/wAU/wB5yrc/ErtF1Hvyg0NdkWNhriNN9dbC\n51Vm3/ WtTr6jk10NfXY5jQTuLX7AS76Xs9rfcosxeie5nqZFV4gtN7Q2l2vua5+P9otr9v0H7LP+\ntqX2BmVWa6wxkRse29trASf8I2WWVb/9a1XByEmqvX0w/a2CMdCwfOf7GVGRg5FzXiaskNg2D3F3\n9lv538v1N607KMT0baKrWnMsbNo8h+839/8A8+f8YsNmBkY9pY7HvfYzuGnaD/J27t3tVlvSC8Nf\nYy0thpLnVlsal7/e76Wytjv7akhOfCYmAJOh/R/lJbMR6TqOhH6X8otW3Cdj2OBvre6r6fpl7XVu\n/wBGfVrZ/nfzX/CKz07GnIrscfQfIhzSNjp/lsOyh/8AX/RP/wCCTMz8k+y5hlsltpbuewk7thkO\nd6as/YnmoZEso3fzdm4sZaZ1Yxz9/p2/29n/ABagnhE4nhvby4Wzgz+3IGVb+fE+kk9Mz/q2K7nl\n2fgsdFThHtmYdW4ek7a1eddV+r2RWW5Lsd+NjWFw9dw2tLgRuZTjndfb6e737G+/9J+4rGH1HMbf\nXiZRsqJikWD81p09LLriux1Tf33s/Rfy6lW64d/Ucm6x9hvLx7gXBhaGtb+lDHOv/M2+p9BV+X5I\nwuc5cW0K+T5dpzv9Nu81z8cg9vGOEEyy6+vh498eLh4f1fF6nOpq6I2zZZ9otaP5zIY5lZYP324l\njLPV2/ufaP0v/BqfUS/CvOLfGTjPa1+JazQmp38zbjv9zms/0mO/ez1vU/wqDk5mdG0trfXyWFu+\nQPzv0+65zW/1lp9KnqeIMWzF9O7FJtwDXvAe4/pLMT3+vt9X+ep/m6vUVqIs8EdD/d08v0uL/Ccy\nZI9ctY9fV/zv0eFp5NIOTQ3HLbW4VMbZAO8Tb7mE+/8ASWN/fVn6vY9Z6gcnqdr8XDxh6+Vbtmxw\nnayqhr/5zJybXbGfufpL/wCbpeh19LynMyLL22V22kMHqV7LPcd9n6K1zf8AO9VXX9N zMTCrxBi2\nZFYd9ozi8gMBINePW6zcfTtopc+2yqv+bfk/ziWXAcg2MRLXv/U4aX4OaGGYIIkYn6/vcTf+sX15\nd1ikYNdBqopG3GoZ7obG0b3aufZs/PXJY9jsPKrue4M2E7gDLy1w2WAbPou2O/PVi+uu
[11/18] madlib-site git commit: update jupyter notebooks for 1dot15
http://git-wip-us.apache.org/repos/asf/madlib-site/blob/acd339f6/community-artifacts/MLP-mnist-v3.ipynb -- diff --git a/community-artifacts/MLP-mnist-v3.ipynb b/community-artifacts/MLP-mnist-v3.ipynb new file mode 100644 index 000..1fa6210 --- /dev/null +++ b/community-artifacts/MLP-mnist-v3.ipynb @@ -0,0 +1,1329 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ +"# Neural networks\n", +"\n", +"Multilayer perceptron (MLP) using the well known MNIST data set.\n", +"\n", +"Updated to include mini-batching which was added in 1.14. Momentum was added in 1.15.\n", +"\n", +"# Intro" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [ +{ + "data": { + "image/jpeg": "/9j/4R5fRXhpZgAATU0AKggABwESAAMBAAEAAAEaAAUBYgEbAAUB\nagEoAAMBAAIAAAExAAIccgEyAAIUjodpAAQBpNAACvyA\nAAAnEAAK/IAAACcQQWRvYmUgUGhvdG9zaG9wIENTNSBXaW5kb3dzADIwMTU6MDc6MjQgMTA6NTk6\nNTEAA6ABAAMBAAEAAKACAAQBAAACoKADAAQBAAABcwAGAQMAAwAA\nAAEABgAAARoABQEAAAEeARsABQEAAAEmASgAAwEAAgAAAgEABAEAAAEuAgIA\nBAEAAB0pAEgBSAH/2P/tAAxBZG9iZV9DTQAB/+4ADkFkb2JlAGSA\nAf/bAIQADAgICAkIDAkJDBELCgsRFQ8MDA8VGBMTFRMTGBEMDAwMDAwRDAwMDAwMDAwMDAwM\nDAwMDAwMDAwMDAwMDAwMDAENCwsNDg0QDg4QFA4ODhQUDg4ODhQRDAwMDAwREQwMDAwMDBEMDAwM\nDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwM/8AAEQgAWACgAwEiAAIRAQMRAf/dAAQACv/EAT8AAAEF\nAQEBAQEBAAMAAQIEBQYHCAkKCwEAAQUBAQEBAQEAAQACAwQFBgcICQoLEAAB\nBAEDAgQCBQcGCAUDDDMBAAIRAwQhEjEFQVFhEyJxgTIGFJGhsUIjJBVSwWIzNHKC0UMHJZJT8OHx\nY3M1FqKygyZEk1RkRcKjdDYX0lXiZfKzhMPTdePzRieUpIW0lcTU5PSltcXV5fVWZnaGlqa2xtbm\n9jdHV2d3h5ent8fX5/cRAAICAQIEBAMEBQYHBwYF NQEAAhEDITESBEFRYXEiEwUygZEUobFCI8FS\n0fAzJGLhcoKSQ1MVY3M08SUGFqKygwcmNcLSRJNUoxdkRVU2dGXi8rOEw9N14/NGlKSFtJXE1OT0\npbXF1eX1VmZ2hpamtsbW5vYnN0dXZ3eHl6e3x//aAAwDAQACEQMRAD8A8+yL677Bjsa4VU+1pr5c\nZ997qHex2/8AkvqVmrEDunZFjr2OqqurbVZqNjnh+/1G7fUZQ7+bf/w3p2s/PR+o9Lz+kVMuusdm\nY1h9mVSG2UknX0vXs3Pqs/kW1/8AFb0OjM9YNu2FhYSygXuL91h/wTXN9KnZs/nv0P8AwX+EVwRH\nERI1OvlP8v0WrxXEGGsL+Yfi6n1b+qNudU578dz6GkObbYfTZPZ9su2V4j/of4Sy/wDwXpLQ+tXS\nOidDbVl0Xs6hmOALNAcdriZda9u57MjZ7fRxv5j/AE3s/QrAOX1nIoi8bq2GHAlpx/3dwZLfsX/W\n/wBH/wAUhWU+ptPUdlTK2D1Qx732j6XpsrbvuZ6j2t9m5VZ8vmMwYnhiP3gYxr+t+jxf3nRx83y8\ncUoyjxEg7cMpX/e+fh/2f/VHP9PJ6hkvfY51j3AudY4+0azue7/BsRsu+uPtNW71AQL9hibOPVtd\n9JzLo3Naz9H/ADnvVhl7La/s2JS19FmjaCSywfvW2tZ77/67LX/9bUR07Ira66k1uraD6raSCA3u\n17bW73s/66rAgRHT1dZS31aEpgy19P7sfBosyxbIbtxbXab2NAYf65hz6/625W+k5WS3Os6dkOc6\nvPY7FtredwBcZoewas3svbW6t6rkV2GWFjmNGrbWw5o8fUq3OsZ/L/z1ZxaDe0Vh4FteuPaTIkcU\nG5m5vu/wPqemm2QQbutfCX9X/CZBAT9OwP8AzfH/AAXO22Y73V3s3ODi19TuxB2O/lMf7UW+trqz\nbS42AfSB+nWB/pGt+kz/AId ns/4pdD9b+gHBycfMqE09Rory2gCNrntHrV/9ublzM2UvFlbi17T7\nSNCFBDNGVgai6/rNjLy04Ue4v+qUYrLQCSId2lELQGyHNDDoDrz3Uox8ru3HyI+Fdh/6nHf/AOAf\n8Qp112Ue8si9mlbXabf+HP8A6K/z/wDBqQD7O7AT337Mr7NlDGma3NeQA0agDbE/y91aLVjV5YL2\nAG0iLGAR/wChDG/+fa0G5jWUVusJBcBHftvc7/PuR6qMmusCtjqrLoDCdH7eXP8A3m7voMTrAJMv\nlA1WxgZUIfMTo3cetwz2VUVB9bQC+x2jWNI29/5KNlYFLb6rWPDgAfc7RrgP39v57VY6b0rqTG2W\nZ9fpYEB92Y7QV/ubv9I6z6Laq/0j0LO6sy+jKq6QQPQaLG3Fv6Z1YivJZW1381Tsd6/s9/6L9Ips\nOXBkxmQPFqTED+r+7+8s5jl8+HJwyiYmgDekfV++hf0NptORfdXTV/g6LPa6I3fzbvd/nIb8zpWP\nXNFbr36tftbtAP8AJts3PZ/Yas3EzH7ybALHO/Odq6f3nOSNFlgL2PLGEk+72Nn+S523d/ZQOSNf\nq4Cz39Ulvtm6nIkDt6Ypm9YbUf1WivHIduFhm2wT/wAJb/5FA+1W+u251jrTIIJ4Gu72fu+5RdTc\n0x6lTncwXt/79tU62vYSL2+m0CZI0I/kOHtUfFM6GwB4cMf+9ZOGI1Gv1uX/AHz/AP/Q5Tp+bfhd\nQNFDt+Na4NdU8B9dlbiN1N1T5rtb+Yl9Yei04edQ2uxuLhWVCzFYSS5u91nrV/n3Xena3+d/0foq\n5iYWFTe1vrnLvYQ5lTKHtqY8wd5teTZbWzd7WVM+mq31nysPL6kxl15NePWzGZ6bdzy6ou9d3q2P\npZ7rbH/yFp5Yj27IF8VCzXp/d/uubCROb02Bwni0/wAU/wB5yrc/ErtF1Hvyg0NdkWNhriNN9dbC\n51Vm3/ WtTr6jk10NfXY5jQTuLX7AS76Xs9rfcosxeie5nqZFV4gtN7Q2l2vua5+P9otr9v0H7LP+\ntqX2BmVWa6wxkRse29trASf8I2WWVb/9a1XByEmqvX0w/a2CMdCwfOf7GVGRg5FzXiaskNg2D3F3\n9lv538v1N607KMT0baKrWnMsbNo8h+839/8A8+f8YsNmBkY9pY7HvfYzuGnaD/J27t3tVlvSC8Nf\nYy0thpLnVlsal7/e76Wytjv7akhOfCYmAJOh/R/lJbMR6TqOhH6X8otW3Cdj2OBvre6r6fpl7XVu\n/wBGfVrZ/nfzX/CKz07GnIrscfQfIhzSNjp/lsOyh/8AX/RP/wCCTMz8k+y5hlsltpbuewk7thkO\nd6as/YnmoZEso3fzdm4sZaZ1Yxz9/p2/29n/ABagnhE4nhvby4Wzgz+3IGVb+fE+kk9Mz/q2K7nl\n2fgsdFThHtmYdW4ek7a1eddV+r2RWW5Lsd+NjWFw9dw2tLgRuZTjndfb6e737G+/9J+4rGH1HMbf\nXiZRsqJikWD81p09LLriux1Tf33s/Rfy6lW64d/Ucm6x9hvLx7gXBhaGtb+lDHOv/M2+p9BV+X5I\nwuc5cW0K+T5dpzv9Nu81z8cg9vGOEEyy6+vh498eLh4f1fF6nOpq6I2zZZ9otaP5zIY5lZYP324l\njLPV2/ufaP0v/BqfUS/CvOLfGTjPa1+JazQmp38zbjv9zms/0mO/ez1vU/wqDk5mdG0trfXyWFu+\nQPzv0+65zW/1lp9KnqeIMWzF9O7FJtwDXvAe4/pLMT3+vt9X+ep/m6vUVqIs8EdD/d08v0uL/Ccy\nZI9ctY9fV/zv0eFp5NIOTQ3HLbW4VMbZAO8Tb7mE+/8ASWN/fVn6vY9Z6gcnqdr8XDxh6+Vbtmxw\nnayqhr/5zJybXbGfufpL/wCbpeh19LynMyLL22V22kMHqV7LPcd9n6K1zf8AO9VXX9N zMTCrxBi2\nZFYd9ozi8gMBINePW6zcfTtopc+2yqv+bfk/ziWXAcg2MRLXv/U4aX4OaGGYIIkYn6/vcTf+sX15\nd1ikYNdBqopG3GoZ7obG0b3aufZs/PXJY9jsPKrue4M2E7gDLy1w2W
[14/18] madlib-site git commit: update jupyter notebooks for 1dot15
http://git-wip-us.apache.org/repos/asf/madlib-site/blob/acd339f6/community-artifacts/Elastic-net-v2.ipynb -- diff --git a/community-artifacts/Elastic-net-v2.ipynb b/community-artifacts/Elastic-net-v2.ipynb deleted file mode 100644 index b6082f0..000 --- a/community-artifacts/Elastic-net-v2.ipynb +++ /dev/null @@ -1,2078 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ -"# Elastic net (MADlib v1.10+)\n", -"Demonstrates elastic net, including these updates:\n", -"- in MADlib 1.10: grouping and cross validation which were introduced \n", -"- in MADlib 1.13: report negative root mean squared error instead of the negative mean squared error" - ] - }, - { - "cell_type": "code", - "execution_count": 1, - "metadata": {}, - "outputs": [ -{ - "name": "stderr", - "output_type": "stream", - "text": [ - "/Users/fmcquillan/anaconda/lib/python2.7/site-packages/IPython/config.py:13: ShimWarning: The `IPython.config` package has been deprecated. You should import from traitlets.config instead.\n", - " \"You should import from traitlets.config instead.\", ShimWarning)\n", - "/Users/fmcquillan/anaconda/lib/python2.7/site-packages/IPython/utils/traitlets.py:5: UserWarning: IPython.utils.traitlets has moved to a top-level traitlets package.\n", - " warn(\"IPython.utils.traitlets has moved to a top-level traitlets package.\")\n" - ] -} - ], - "source": [ -"%load_ext sql" - ] - }, - { - "cell_type": "code", - "execution_count": 2, - "metadata": {}, - "outputs": [ -{ - "data": { - "text/plain": [ - "u'Connected: gpdbchina@madlib'" - ] - }, - "execution_count": 2, - "metadata": {}, - "output_type": "execute_result" -} - ], - "source": [ -"# Greenplum 4.3.10.0\n", -"%sql postgresql://gpdbchina@10.194.10.68:61000/madlib\n", -"\n", -"# PostgreSQL local\n", -"#%sql postgresql://fmcquillan@localhost:5432/madlib\n", -"\n", -"# Greenplum 4.2.3.0\n", -"#%sql postgresql://gpdbchina@10.194.10.68:55000/madlib" - ] - }, - { - "cell_type": "code", - "execution_count": 3, - "metadata": {}, - "outputs": [ -{ - "name": "stdout", - "output_type": "stream", - "text": [ - "1 rows affected.\n" - ] -}, -{ - "data": { - "text/html": [ - "\n", - "\n", - "version\n", - "\n", - "\n", - "MADlib version: 1.13-dev, git revision: rel/v1.12-42-gedc93f5, cmake configuration time: Fri Dec 8 18:28:18 UTC 2017, build type: Release, build system: Linux-2.6.18-238.27.1.el5.hotfix.bz516490, C compiler: gcc 4.4.0, C++ compiler: g++ 4.4.0\n", - "\n", - "" - ], - "text/plain": [ - "[(u'MADlib version: 1.13-dev, git revision: rel/v1.12-42-gedc93f5, cmake configuration time: Fri Dec 8 18:28:18 UTC 2017, build type: Release, build system: Linux-2.6.18-238.27.1.el5.hotfix.bz516490, C compiler: gcc 4.4.0, C++ compiler: g++ 4.4.0',)]" - ] - }, - "execution_count": 3, - "metadata": {}, - "output_type": "execute_result" -} - ], - "source": [ -"%sql select madlib.version();\n", -"#%sql select version();" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ -"## 1. Create data set\n", -"House prices and characteristics." - ] - }, - { - "cell_type": "code", - "execution_count": 4, - "metadata": {}, - "outputs": [ -{ - "name": "stdout", - "output_type": "stream", - "text": [ - "Done.\n", - "Done.\n", - "27 rows affected.\n", - "27 rows affected.\n" - ] -}, -{ - "data": { - "text/html": [ - "\n", - "\n", - "id\n", -
[04/18] madlib-site git commit: update jupyter notebooks for 1dot15
http://git-wip-us.apache.org/repos/asf/madlib-site/blob/acd339f6/community-artifacts/Stratified-sampling-v2.ipynb -- diff --git a/community-artifacts/Stratified-sampling-v2.ipynb b/community-artifacts/Stratified-sampling-v2.ipynb new file mode 100644 index 000..daa417b --- /dev/null +++ b/community-artifacts/Stratified-sampling-v2.ipynb @@ -0,0 +1,672 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ +"# Stratified sampling\n", +"Stratified sampling is a method for sampling subpopulations (strata) independently. It is commonly used to reduce sampling error by ensuring that subgroups are adequately represented in the sample.\n", +"\n", +"Stratified sampling was added in MADlib 1.12." + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": { +"scrolled": true + }, + "outputs": [ +{ + "name": "stdout", + "output_type": "stream", + "text": [ + "The sql extension is already loaded. To reload it, use:\n", + " %reload_ext sql\n" + ] +} + ], + "source": [ +"%load_ext sql" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [ +{ + "data": { + "text/plain": [ + "u'Connected: gpdbchina@madlib'" + ] + }, + "execution_count": 10, + "metadata": {}, + "output_type": "execute_result" +} + ], + "source": [ +"# Greenplum Database 5.4.0 on GCP (demo machine)\n", +"#%sql postgresql://gpadmin@35.184.253.255:5432/madlib\n", +"\n", +"# PostgreSQL local\n", +"%sql postgresql://fmcquillan@localhost:5432/madlib\n", +"\n", +"# Greenplum Database 4.3.10.0\n", +"#%sql postgresql://gpdbchina@10.194.10.68:61000/madlib" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ +{ + "name": "stdout", + "output_type": "stream", + "text": [ + "1 rows affected.\n" + ] +}, +{ + "data": { + "text/html": [ + "\n", + "\n", + "version\n", + "\n", + "\n", + "MADlib version: 1.12-dev, git revision: rel/v1.11-23-gfdf7b6d, cmake configuration time: Wed Jun 28 18:06:35 UTC 2017, build type: Release, build system: Linux-2.6.18-238.27.1.el5.hotfix.bz516490, C compiler: gcc 4.4.0, C++ compiler: g++ 4.4.0\n", + "\n", + "" + ], + "text/plain": [ + "[(u'MADlib version: 1.12-dev, git revision: rel/v1.11-23-gfdf7b6d, cmake configuration time: Wed Jun 28 18:06:35 UTC 2017, build type: Release, build system: Linux-2.6.18-238.27.1.el5.hotfix.bz516490, C compiler: gcc 4.4.0, C++ compiler: g++ 4.4.0',)]" + ] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" +} + ], + "source": [ +"%sql select madlib.version();\n", +"#%sql select version();" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ +"# 1. Create input table" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [ +{ + "name": "stdout", + "output_type": "stream", + "text": [ + "Done.\n", + "Done.\n", + "25 rows affected.\n", + "25 rows affected.\n" + ] +}, +{ + "data": { + "text/html": [ + "\n", + "\n", + "id1\n", + "id2\n", + "gr1\n", + "gr2\n", + "\n", + "\n", + "1\n", + "0\n", + "1\n", + "1\n", + "\n", + "\n", + "2\n", + "0\n", + &
[01/18] madlib-site git commit: update jupyter notebooks for 1dot15
Repository: madlib-site Updated Branches: refs/heads/asf-site 5fa1ac070 -> acd339f65 http://git-wip-us.apache.org/repos/asf/madlib-site/blob/acd339f6/community-artifacts/stratified-sampling-v1.ipynb -- diff --git a/community-artifacts/stratified-sampling-v1.ipynb b/community-artifacts/stratified-sampling-v1.ipynb deleted file mode 100644 index 75e02fd..000 --- a/community-artifacts/stratified-sampling-v1.ipynb +++ /dev/null @@ -1,672 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ -"# Stratified sampling\n", -"Stratified sampling is a method for sampling subpopulations (strata) independently. It is commonly used to reduce sampling error by ensuring that subgroups are adequately represented in the sample.\n", -"\n", -"Stratified sampling was added in MADlib 1.12." - ] - }, - { - "cell_type": "code", - "execution_count": 9, - "metadata": { -"scrolled": true - }, - "outputs": [ -{ - "name": "stdout", - "output_type": "stream", - "text": [ - "The sql extension is already loaded. To reload it, use:\n", - " %reload_ext sql\n" - ] -} - ], - "source": [ -"%load_ext sql" - ] - }, - { - "cell_type": "code", - "execution_count": 10, - "metadata": {}, - "outputs": [ -{ - "data": { - "text/plain": [ - "u'Connected: gpdbchina@madlib'" - ] - }, - "execution_count": 10, - "metadata": {}, - "output_type": "execute_result" -} - ], - "source": [ -"# Greenplum 4.3.10.0\n", -"%sql postgresql://gpdbchina@10.194.10.68:61000/madlib\n", -"\n", -"# PostgreSQL local\n", -"#%sql postgresql://fmcquillan@localhost:5432/madlib\n", -"\n", -"# Greenplum 4.2.3.0\n", -"#%sql postgresql://gpdbchina@10.194.10.68:55000/madlib" - ] - }, - { - "cell_type": "code", - "execution_count": 11, - "metadata": {}, - "outputs": [ -{ - "name": "stdout", - "output_type": "stream", - "text": [ - "1 rows affected.\n" - ] -}, -{ - "data": { - "text/html": [ - "\n", - "\n", - "version\n", - "\n", - "\n", - "MADlib version: 1.12-dev, git revision: rel/v1.11-23-gfdf7b6d, cmake configuration time: Wed Jun 28 18:06:35 UTC 2017, build type: Release, build system: Linux-2.6.18-238.27.1.el5.hotfix.bz516490, C compiler: gcc 4.4.0, C++ compiler: g++ 4.4.0\n", - "\n", - "" - ], - "text/plain": [ - "[(u'MADlib version: 1.12-dev, git revision: rel/v1.11-23-gfdf7b6d, cmake configuration time: Wed Jun 28 18:06:35 UTC 2017, build type: Release, build system: Linux-2.6.18-238.27.1.el5.hotfix.bz516490, C compiler: gcc 4.4.0, C++ compiler: g++ 4.4.0',)]" - ] - }, - "execution_count": 11, - "metadata": {}, - "output_type": "execute_result" -} - ], - "source": [ -"%sql select madlib.version();\n", -"#%sql select version();" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ -"# 1. Create input table" - ] - }, - { - "cell_type": "code", - "execution_count": 15, - "metadata": {}, - "outputs": [ -{ - "name": "stdout", - "output_type": "stream", - "text": [ - "Done.\n", - "Done.\n", - "25 rows affected.\n", - "25 rows affected.\n" - ] -}, -{ - "data": { - "text/html": [ - "\n", - "\n", - "id1\n", - "id2\n", - "gr1\n", - "gr2\n", - "\n", - "\n", - "1\n", - "0\n", - "1\n", - "1\n", - "\n", - "\n", - "2\n&
[06/18] madlib-site git commit: update jupyter notebooks for 1dot15
http://git-wip-us.apache.org/repos/asf/madlib-site/blob/acd339f6/community-artifacts/SVM-novelty-detection-v2.ipynb -- diff --git a/community-artifacts/SVM-novelty-detection-v2.ipynb b/community-artifacts/SVM-novelty-detection-v2.ipynb new file mode 100755 index 000..678d7c9 --- /dev/null +++ b/community-artifacts/SVM-novelty-detection-v2.ipynb @@ -0,0 +1,511 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ +"# Novelty detection using 1-class SVM\n", +"\n", +"Classifies new data as similar or different to the training set. This method is an unsupervised method that builds a decision boundary between the data and origin in kernel space and can be used as a novelty detector." + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [ +{ + "name": "stderr", + "output_type": "stream", + "text": [ + "/Users/fmcquillan/anaconda/lib/python2.7/site-packages/IPython/config.py:13: ShimWarning: The `IPython.config` package has been deprecated. You should import from traitlets.config instead.\n", + " \"You should import from traitlets.config instead.\", ShimWarning)\n", + "/Users/fmcquillan/anaconda/lib/python2.7/site-packages/IPython/utils/traitlets.py:5: UserWarning: IPython.utils.traitlets has moved to a top-level traitlets package.\n", + " warn(\"IPython.utils.traitlets has moved to a top-level traitlets package.\")\n" + ] +} + ], + "source": [ +"%load_ext sql" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ +{ + "data": { + "text/plain": [ + "u'Connected: gpadmin@madlib'" + ] + }, + "execution_count": 2, + "metadata": {}, + "output_type": "execute_result" +} + ], + "source": [ +"# Greenplum Database 5.4.0 on GCP (demo machine)\n", +"%sql postgresql://gpadmin@35.184.253.255:5432/madlib\n", +"\n", +"# PostgreSQL local\n", +"#%sql postgresql://fmcquillan@localhost:5432/madlib\n", +"\n", +"# Greenplum Database 4.3.10.0\n", +"#%sql postgresql://gpdbchina@10.194.10.68:61000/madlib" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": { +"collapsed": true + }, + "outputs": [], + "source": [ +"# Setup\n", +"%matplotlib inline\n", +"\n", +"import pandas as pd\n", +"import numpy as np\n", +"import matplotlib.pyplot as plt\n", +"import matplotlib.font_manager" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ +{ + "data": { + "image/png": "iVBORw0KGgoNSUhEUgAAAW8AAAD7CAYAAAClvBX1BHNCSVQICAgIfAhkiAlwSFlz\nAAALEgAACxIB0t1+/AAAHfZJREFUeJzt3X9wXOV97/H3IwulS2yMZTmG4OCACOMCHiOby3DHnWsN\nYXcZOlUq6x9CSJUfjSZznZofx0RQU6IEcYkTtvnVtB6RTKzAMPQmvmrVznSPlXbEHTE3vQk2lDhQ\niIcyIQZSYXJBwyay0XP/eM6uVqtdayWt9uzZ/bxmdtgfZ88+LObjZ7/Pj2OstYiISLQ0hd0AERFZ\nPIW3iEgEKbxFRCJI4S0iEkEKbxGRCFJ4i4hEUHO1PsgYozmJIiJLYK01hc9VtedtrQ319oUvfCH0\nNtTKTd+Fvgt9F9H4LkpR2UREJIIU3iIiEdRQ4d3Z2Rl2E2qGvotZ+i5m6buYVevfhTlbTaWiH2SM\nrdZniYjUC2MMNuwBSxERqQyFt4hIBCm8RUQiSOEtIhJBCm+RMvm+T08iQU8ige/7YTdHGpxmm4iU\nwfd9eru7OZDJANAfizE8MkIymQy5ZVLvNNtEZBmGUikOZDL0Ar3AgUyGoVQq7GZFin65VFbVNqYS\nkcZV+Muld2JCv1yWSeEtUoY+z6N3YgLyyyaeF3KroiP/lwsAwS8XhffSKbxFypBMJhkeGcmVSoY9\nT8EjodKApTQ83/dzodynUF4RGvBdulIDlgpvaWgKlerRX5JLs+LhbYxpAn4KvGKt7SryusJbak5P\nIkHX2FiuFjsMjMbjHD5yJMxmieRUY6rgbcDPK3g+EREpoSLhbYzZBNwEfKcS5xOplj7Pc6USXK+7\nPxajT7NIJAIqUjYxxvwAeABYC3gqm0iUqBYrtaxU2WTZUwWNMX8IvG6tfdoY0wnM+5CsgYGB3P3O\nzs6av1KFNIZkMqnAlpoxPj7O+Pj4gsctu+dtjPkfw K3AGSAGrAH+l7X2TwqOU89bRGSRqjJV0Biz\nC5VNREQqRhtTiYjUES3SERGpYep5i4jUEYV3hWnPYhGpBoV3BWX3yegaG3NLrru7FeAiDWqlO3IK\n7wrS1VYam351SVY1OnLaz1ukAnSlGMlXjYtPKLwrSFdbaVy6UoxUm8K7gnS1FRGB6nTkNM9bpAJ0\nUQcpVKkNz3QlHZEVpt0JZSUovEVEIkgrLEVE6ojCW0QkghTeIiIRpPAWEYkghbeISAQpvEVEIkjh\nLSISQQpvkQainQ/rhxbpiDQILeGPJq2wFGlwPYmE21s6eDwMjMbjHD5yJMxmyQK0wlJEpI4ovEUa\nRJ/nuVIJrtfdH4vRt8htSn3fJ5HoIZHoUc08ZCqbiDSQ5ex86Ps+3d29ZDIHAIjF+hkZGVbNfIWp\n5i0i8/i+Tyo1BIDn9Z01iBOJHsbGuiCvah6Pj3LkyOGVb2gDKxXeupKOSAPJD+tdu7bzwAPfyvWk\nJyZ61ZOO
[07/18] madlib-site git commit: update jupyter notebooks for 1dot15
http://git-wip-us.apache.org/repos/asf/madlib-site/blob/acd339f6/community-artifacts/Random-forest-v2.ipynb -- diff --git a/community-artifacts/Random-forest-v2.ipynb b/community-artifacts/Random-forest-v2.ipynb new file mode 100644 index 000..87605b7 --- /dev/null +++ b/community-artifacts/Random-forest-v2.ipynb @@ -0,0 +1,3082 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ +"# Random forest\n", +"\n", +"Random forests build an ensemble of classifiers, each of which is a tree model constructed using bootstrapped samples from the input data. The results of these models are then combined to yield a single prediction, which, at the expense of some loss in interpretation, have been found to be highly accurate.\n", +"\n", +"Please also refer to the decision tree user documentation for information relevant to the implementation of random forests in MADlib.\n", +"\n", +"This notebook includes impurity importance which was added in 1.15." + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [ +{ + "name": "stderr", + "output_type": "stream", + "text": [ + "/Users/fmcquillan/anaconda/lib/python2.7/site-packages/IPython/config.py:13: ShimWarning: The `IPython.config` package has been deprecated. You should import from traitlets.config instead.\n", + " \"You should import from traitlets.config instead.\", ShimWarning)\n", + "/Users/fmcquillan/anaconda/lib/python2.7/site-packages/IPython/utils/traitlets.py:5: UserWarning: IPython.utils.traitlets has moved to a top-level traitlets package.\n", + " warn(\"IPython.utils.traitlets has moved to a top-level traitlets package.\")\n" + ] +} + ], + "source": [ +"%load_ext sql" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ +{ + "data": { + "text/plain": [ + "u'Connected: gpadmin@madlib'" + ] + }, + "execution_count": 2, + "metadata": {}, + "output_type": "execute_result" +} + ], + "source": [ +"# Greenplum Database 5.4.0 on GCP (demo machine)\n", +"%sql postgresql://gpadmin@35.184.253.255:5432/madlib\n", +"\n", +"# PostgreSQL local\n", +"#%sql postgresql://fmcquillan@localhost:5432/madlib" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ +{ + "name": "stdout", + "output_type": "stream", + "text": [ + "1 rows affected.\n" + ] +}, +{ + "data": { + "text/html": [ + "\n", + "\n", + "version\n", + "\n", + "\n", + "MADlib version: 1.15-dev, git revision: rc/1.14-rc1-45-g3ab7554, cmake configuration time: Wed Aug 1 18:34:10 UTC 2018, build type: release, build system: Linux-2.6.32-696.20.1.el6.x86_64, C compiler: gcc 4.4.7, C++ compiler: g++ 4.4.7\n", + "\n", + "" + ], + "text/plain": [ + "[(u'MADlib version: 1.15-dev, git revision: rc/1.14-rc1-45-g3ab7554, cmake configuration time: Wed Aug 1 18:34:10 UTC 2018, build type: release, build system: Linux-2.6.32-696.20.1.el6.x86_64, C compiler: gcc 4.4.7, C++ compiler: g++ 4.4.7',)]" + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" +} + ], + "source": [ +"%sql select madlib.version();\n", +"#%sql select version();" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ +"# Random forest classification examples" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ +"# 1. Load data\n", +"Data set related to whether to play golf or not." + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ +{ + "name": &quo
[10/18] madlib-site git commit: update jupyter notebooks for 1dot15
http://git-wip-us.apache.org/repos/asf/madlib-site/blob/acd339f6/community-artifacts/MLP-v4.ipynb -- diff --git a/community-artifacts/MLP-v4.ipynb b/community-artifacts/MLP-v4.ipynb new file mode 100644 index 000..a6b62d6 --- /dev/null +++ b/community-artifacts/MLP-v4.ipynb @@ -0,0 +1,4588 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ +"# Multilayer Perceptron\n", +"\n", +"Multilayer Perceptron (MLP) is a type of neural network that can be used for regression and classification.\n", +"\n", +"This version of the workbook includes mini-batching added in 1.14 and momentum added in 1.15" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": { +"scrolled": true + }, + "outputs": [ +{ + "name": "stderr", + "output_type": "stream", + "text": [ + "/Users/fmcquillan/anaconda/lib/python2.7/site-packages/IPython/config.py:13: ShimWarning: The `IPython.config` package has been deprecated. You should import from traitlets.config instead.\n", + " \"You should import from traitlets.config instead.\", ShimWarning)\n", + "/Users/fmcquillan/anaconda/lib/python2.7/site-packages/IPython/utils/traitlets.py:5: UserWarning: IPython.utils.traitlets has moved to a top-level traitlets package.\n", + " warn(\"IPython.utils.traitlets has moved to a top-level traitlets package.\")\n" + ] +} + ], + "source": [ +"%load_ext sql" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ +{ + "data": { + "text/plain": [ + "u'Connected: gpadmin@madlib'" + ] + }, + "execution_count": 2, + "metadata": {}, + "output_type": "execute_result" +} + ], + "source": [ +"# Greenplum Database 5.4.0 on GCP (demo machine)\n", +"%sql postgresql://gpadmin@35.184.253.255:5432/madlib\n", +"\n", +"# PostgreSQL local\n", +"#%sql postgresql://fmcquillan@localhost:5432/madlib\n", +"\n", +"# Greenplum Database 4.3.10.0\n", +"#%sql postgresql://gpdbchina@10.194.10.68:61000/madlib" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ +{ + "name": "stdout", + "output_type": "stream", + "text": [ + "1 rows affected.\n" + ] +}, +{ + "data": { + "text/html": [ + "\n", + "\n", + "version\n", + "\n", + "\n", + "MADlib version: 1.15-dev, git revision: rc/1.14-rc1-23-g5c4331d, cmake configuration time: Thu Jul 5 17:46:06 UTC 2018, build type: release, build system: Linux-2.6.32-696.20.1.el6.x86_64, C compiler: gcc 4.4.7, C++ compiler: g++ 4.4.7\n", + "\n", + "" + ], + "text/plain": [ + "[(u'MADlib version: 1.15-dev, git revision: rc/1.14-rc1-23-g5c4331d, cmake configuration time: Thu Jul 5 17:46:06 UTC 2018, build type: release, build system: Linux-2.6.32-696.20.1.el6.x86_64, C compiler: gcc 4.4.7, C++ compiler: g++ 4.4.7',)]" + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" +} + ], + "source": [ +"%sql select madlib.version();\n", +"#%sql select version();" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ +"# Classification without Mini-Batching\n", +"\n", +"# 1. Create input table for classification" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ +{ + "name": "stdout", + "output_type": "stream", + "text": [ + "Done.\n", + "Done.\n", + "52 rows affected.\n", + "52 rows affected.\n" + ] +}, +{ + "data": { + "text/html": [ + "
[05/18] madlib-site git commit: update jupyter notebooks for 1dot15
http://git-wip-us.apache.org/repos/asf/madlib-site/blob/acd339f6/community-artifacts/SVM-v1.ipynb -- diff --git a/community-artifacts/SVM-v1.ipynb b/community-artifacts/SVM-v1.ipynb new file mode 100644 index 000..405710d --- /dev/null +++ b/community-artifacts/SVM-v1.ipynb @@ -0,0 +1,2806 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ +"# Support Vector Machines\n", +"Support Vector Machines (SVMs) are models for regression and classification tasks. SVM models have two particularly desirable features: robustness in the presence of noisy data and applicability to a variety of data configurations. At its core, a linear SVM model is a hyperplane separating two distinct classes of data (in the case of classification problems), in such a way that the distance between the hyperplane and the nearest training data point (called the margin) is maximized. Vectors that lie on this margin are called support vectors. With the support vectors fixed, perturbations of vectors beyond the margin will not affect the model; this contributes to the modelâs robustness. By substituting a kernel function for the usual inner product, one can approximate a large variety of decision boundaries in addition to linear hyperplanes." + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [ +{ + "name": "stdout", + "output_type": "stream", + "text": [ + "The sql extension is already loaded. To reload it, use:\n", + " %reload_ext sql\n" + ] +} + ], + "source": [ +"%load_ext sql" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [ +{ + "data": { + "text/plain": [ + "u'Connected: gpadmin@madlib'" + ] + }, + "execution_count": 15, + "metadata": {}, + "output_type": "execute_result" +} + ], + "source": [ +"# Greenplum Database 5.4.0 on GCP (demo machine)\n", +"%sql postgresql://gpadmin@35.184.253.255:5432/madlib\n", +"\n", +"# PostgreSQL local\n", +"#%sql postgresql://fmcquillan@localhost:5432/madlib\n", +"\n", +"# Greenplum Database 4.3.10.0\n", +"#%sql postgresql://gpdbchina@10.194.10.68:61000/madlib" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": {}, + "outputs": [ +{ + "name": "stdout", + "output_type": "stream", + "text": [ + "1 rows affected.\n" + ] +}, +{ + "data": { + "text/html": [ + "\n", + "\n", + "version\n", + "\n", + "\n", + "MADlib version: 1.15-dev, git revision: rc/1.14-rc1-25-gda13eb7, cmake configuration time: Tue Jul 10 21:37:52 UTC 2018, build type: release, build system: Linux-2.6.32-696.20.1.el6.x86_64, C compiler: gcc 4.4.7, C++ compiler: g++ 4.4.7\n", + "\n", + "" + ], + "text/plain": [ + "[(u'MADlib version: 1.15-dev, git revision: rc/1.14-rc1-25-gda13eb7, cmake configuration time: Tue Jul 10 21:37:52 UTC 2018, build type: release, build system: Linux-2.6.32-696.20.1.el6.x86_64, C compiler: gcc 4.4.7, C++ compiler: g++ 4.4.7',)]" + ] + }, + "execution_count": 16, + "metadata": {}, + "output_type": "execute_result" +} + ], + "source": [ +"%sql select madlib.version();" + ] + }, + { + "cell_type": "markdown", + "metadata": { +"collapsed": true + }, + "source": [ +"# Classification\n", +"# 1. Create input data set" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": {}, + "outputs": [ +{ + "name": "stdout", + "output_type": "stream", + "text": [ + "Done.\n", + "Done.\n", + "15 rows affected.\n", + "15 rows affected.\n" + ] +}, +{ + "data": { + "text/html": [ + "\n&
[18/18] madlib-site git commit: update jupyter notebooks for 1dot15
update jupyter notebooks for 1dot15 Project: http://git-wip-us.apache.org/repos/asf/madlib-site/repo Commit: http://git-wip-us.apache.org/repos/asf/madlib-site/commit/acd339f6 Tree: http://git-wip-us.apache.org/repos/asf/madlib-site/tree/acd339f6 Diff: http://git-wip-us.apache.org/repos/asf/madlib-site/diff/acd339f6 Branch: refs/heads/asf-site Commit: acd339f65ab5b6b9c2f95ca370cc1fb8460fd7c6 Parents: 5fa1ac0 Author: Frank McQuillan Authored: Wed Aug 1 13:13:25 2018 -0700 Committer: Frank McQuillan Committed: Wed Aug 1 13:13:25 2018 -0700 -- .../Column-vector-operations-v1.ipynb | 2553 ++ .../Covariance-and-correlation-v1.ipynb | 1318 + community-artifacts/Decision-trees-v1.ipynb | 3051 community-artifacts/Decision-trees-v2.ipynb | 3208 community-artifacts/Elastic-net-v2.ipynb| 2078 community-artifacts/Elastic-net-v3.ipynb| 2049 community-artifacts/KNN-v4.ipynb| 857 community-artifacts/MLP-mnist-v3.ipynb | 1329 + community-artifacts/MLP-v4.ipynb| 4588 ++ .../Novelty-detection-demo-1.ipynb | 478 -- community-artifacts/Random-forest-v1.ipynb | 2899 --- community-artifacts/Random-forest-v2.ipynb | 3082 .../SVM-novelty-detection-v2.ipynb | 511 ++ community-artifacts/SVM-v1.ipynb| 2806 +++ .../Stratified-sampling-v2.ipynb| 672 +++ community-artifacts/kNN-v3.ipynb| 857 community-artifacts/mlp-mnist-v2.ipynb | 1154 - community-artifacts/mlp-v3.ipynb| 4584 - .../stratified-sampling-v1.ipynb| 672 --- 19 files changed, 22973 insertions(+), 15773 deletions(-) -- http://git-wip-us.apache.org/repos/asf/madlib-site/blob/acd339f6/community-artifacts/Column-vector-operations-v1.ipynb -- diff --git a/community-artifacts/Column-vector-operations-v1.ipynb b/community-artifacts/Column-vector-operations-v1.ipynb new file mode 100644 index 000..147b328 --- /dev/null +++ b/community-artifacts/Column-vector-operations-v1.ipynb @@ -0,0 +1,2553 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ +"# Column and vector operations\n", +"\n", +"Column and vector operations were added in 1.15.\n", +"\n", +"* cols2vec\n", +"* vec2cols\n", +"* drop columns" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": { +"scrolled": true + }, + "outputs": [ +{ + "name": "stderr", + "output_type": "stream", + "text": [ + "/Users/fmcquillan/anaconda/lib/python2.7/site-packages/IPython/config.py:13: ShimWarning: The `IPython.config` package has been deprecated. You should import from traitlets.config instead.\n", + " \"You should import from traitlets.config instead.\", ShimWarning)\n", + "/Users/fmcquillan/anaconda/lib/python2.7/site-packages/IPython/utils/traitlets.py:5: UserWarning: IPython.utils.traitlets has moved to a top-level traitlets package.\n", + " warn(\"IPython.utils.traitlets has moved to a top-level traitlets package.\")\n" + ] +} + ], + "source": [ +"%load_ext sql" + ] + }, + { + "cell_type": "code", + "execution_count": 26, + "metadata": {}, + "outputs": [ +{ + "data": { + "text/plain": [ + "u'Connected: gpadmin@madlib'" + ] + }, + "execution_count": 26, + "metadata": {}, + "output_type": "execute_result" +} + ], + "source": [ +"# Greenplum Database 5.4.0 on GCP (demo machine)\n", +"%sql postgresql://gpadmin@35.184.253.255:5432/madlib\n", +"\n", +"# PostgreSQL local\n", +"#%sql postgresql://fmcquillan@localhost:5432/madlib" + ] + }, + { + "cell_type": "code", + "execution_count": 27, + "metadata": {}, + "outputs": [ +{ + "name": "stdout", + "output_type": "stream", + "text": [ + "1 rows affected.\n" + ] +}, +{ + "data": { + "text/html"
[02/18] madlib-site git commit: update jupyter notebooks for 1dot15
http://git-wip-us.apache.org/repos/asf/madlib-site/blob/acd339f6/community-artifacts/mlp-v3.ipynb -- diff --git a/community-artifacts/mlp-v3.ipynb b/community-artifacts/mlp-v3.ipynb deleted file mode 100644 index 8c585a6..000 --- a/community-artifacts/mlp-v3.ipynb +++ /dev/null @@ -1,4584 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ -"# Multilayer Perceptron\n", -"\n", -"Multilayer Perceptron (MLP) is a type of neural network that can be used for regression and classification.\n", -"\n", -"This version of the workbook includes mini-batching which was added in the 1.14 release." - ] - }, - { - "cell_type": "code", - "execution_count": 1, - "metadata": { -"scrolled": true - }, - "outputs": [ -{ - "name": "stderr", - "output_type": "stream", - "text": [ - "/Users/fmcquillan/anaconda/lib/python2.7/site-packages/IPython/config.py:13: ShimWarning: The `IPython.config` package has been deprecated. You should import from traitlets.config instead.\n", - " \"You should import from traitlets.config instead.\", ShimWarning)\n", - "/Users/fmcquillan/anaconda/lib/python2.7/site-packages/IPython/utils/traitlets.py:5: UserWarning: IPython.utils.traitlets has moved to a top-level traitlets package.\n", - " warn(\"IPython.utils.traitlets has moved to a top-level traitlets package.\")\n" - ] -} - ], - "source": [ -"%load_ext sql" - ] - }, - { - "cell_type": "code", - "execution_count": 2, - "metadata": {}, - "outputs": [ -{ - "data": { - "text/plain": [ - "u'Connected: gpadmin@madlib'" - ] - }, - "execution_count": 2, - "metadata": {}, - "output_type": "execute_result" -} - ], - "source": [ -"# Greenplum Database 5.4.0 on GCP (demo machine)\n", -"%sql postgresql://gpadmin@35.184.253.255:5432/madlib\n", -"\n", -"# PostgreSQL local\n", -"#%sql postgresql://fmcquillan@localhost:5432/madlib\n", -"\n", -"# Greenplum Database 4.3.10.0\n", -"#%sql postgresql://gpdbchina@10.194.10.68:61000/madlib" - ] - }, - { - "cell_type": "code", - "execution_count": 3, - "metadata": {}, - "outputs": [ -{ - "name": "stdout", - "output_type": "stream", - "text": [ - "1 rows affected.\n" - ] -}, -{ - "data": { - "text/html": [ - "\n", - "\n", - "version\n", - "\n", - "\n", - "MADlib version: 1.14-dev, git revision: rc/1.13-rc1-66-g4cced1b, cmake configuration time: Mon Apr 23 16:26:17 UTC 2018, build type: release, build system: Linux-2.6.32-696.20.1.el6.x86_64, C compiler: gcc 4.4.7, C++ compiler: g++ 4.4.7\n", - "\n", - "" - ], - "text/plain": [ - "[(u'MADlib version: 1.14-dev, git revision: rc/1.13-rc1-66-g4cced1b, cmake configuration time: Mon Apr 23 16:26:17 UTC 2018, build type: release, build system: Linux-2.6.32-696.20.1.el6.x86_64, C compiler: gcc 4.4.7, C++ compiler: g++ 4.4.7',)]" - ] - }, - "execution_count": 3, - "metadata": {}, - "output_type": "execute_result" -} - ], - "source": [ -"%sql select madlib.version();\n", -"#%sql select version();" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ -"# Classification without Mini-Batching\n", -"\n", -"# 1. Create input table for classification" - ] - }, - { - "cell_type": "code", - "execution_count": 4, - "metadata": {}, - "outputs": [ -{ - "name": "stdout", - "output_type": "stream", - "text": [ - "Done.\n", - "Done.\n", - "52 rows affected.\n", - "52 rows affected.\n" - ] -}, -{ - "data": { - "text/html": [ - "
[16/18] madlib-site git commit: update jupyter notebooks for 1dot15
http://git-wip-us.apache.org/repos/asf/madlib-site/blob/acd339f6/community-artifacts/Decision-trees-v1.ipynb -- diff --git a/community-artifacts/Decision-trees-v1.ipynb b/community-artifacts/Decision-trees-v1.ipynb deleted file mode 100644 index 02a60ef..000 --- a/community-artifacts/Decision-trees-v1.ipynb +++ /dev/null @@ -1,3051 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ -"# Decision trees\n", -"\n", -"A decision tree is a supervised learning method that can be used for classification and regression. It consists of a structure in which internal nodes represent tests on attributes, and the branches from nodes represent the result of those tests. Each leaf node is a class label and the paths from root to leaf nodes define the set of classification or regression rules." - ] - }, - { - "cell_type": "code", - "execution_count": 1, - "metadata": {}, - "outputs": [ -{ - "name": "stderr", - "output_type": "stream", - "text": [ - "/Users/fmcquillan/anaconda/lib/python2.7/site-packages/IPython/config.py:13: ShimWarning: The `IPython.config` package has been deprecated. You should import from traitlets.config instead.\n", - " \"You should import from traitlets.config instead.\", ShimWarning)\n", - "/Users/fmcquillan/anaconda/lib/python2.7/site-packages/IPython/utils/traitlets.py:5: UserWarning: IPython.utils.traitlets has moved to a top-level traitlets package.\n", - " warn(\"IPython.utils.traitlets has moved to a top-level traitlets package.\")\n" - ] -} - ], - "source": [ -"%load_ext sql" - ] - }, - { - "cell_type": "code", - "execution_count": 2, - "metadata": {}, - "outputs": [ -{ - "data": { - "text/plain": [ - "u'Connected: gpadmin@madlib'" - ] - }, - "execution_count": 2, - "metadata": {}, - "output_type": "execute_result" -} - ], - "source": [ -"# Greenplum Database 5.4.0 on GCP (demo machine)\n", -"%sql postgresql://gpadmin@35.184.253.255:5432/madlib\n", -"\n", -"# PostgreSQL local\n", -"#%sql postgresql://fmcquillan@localhost:5432/madlib\n", -"\n", -"# Greenplum Database 4.3.10.0\n", -"#%sql postgresql://gpdbchina@10.194.10.68:61000/madlib" - ] - }, - { - "cell_type": "code", - "execution_count": 3, - "metadata": {}, - "outputs": [ -{ - "name": "stdout", - "output_type": "stream", - "text": [ - "1 rows affected.\n" - ] -}, -{ - "data": { - "text/html": [ - "\n", - "\n", - "version\n", - "\n", - "\n", - "MADlib version: 1.14, git revision: rc/1.13-rc1-68-g1c81cb1, cmake configuration time: Tue Apr 24 15:54:15 UTC 2018, build type: release, build system: Linux-2.6.32-696.20.1.el6.x86_64, C compiler: gcc 4.4.7, C++ compiler: g++ 4.4.7\n", - "\n", - "" - ], - "text/plain": [ - "[(u'MADlib version: 1.14, git revision: rc/1.13-rc1-68-g1c81cb1, cmake configuration time: Tue Apr 24 15:54:15 UTC 2018, build type: release, build system: Linux-2.6.32-696.20.1.el6.x86_64, C compiler: gcc 4.4.7, C++ compiler: g++ 4.4.7',)]" - ] - }, - "execution_count": 3, - "metadata": {}, - "output_type": "execute_result" -} - ], - "source": [ -"%sql select madlib.version();\n", -"#%sql select version();" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ -"# Decision tree classification examples" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ -"# 1. Load data\n", -"Data set related to whether to play golf or not." - ] - }, - { - "cell_type": "code", - "execution_count": 4, - "metadata": {}, - "outputs": [ -{ - "name": "stdout", - "output_type": "stream", - &q
[15/18] madlib-site git commit: update jupyter notebooks for 1dot15
http://git-wip-us.apache.org/repos/asf/madlib-site/blob/acd339f6/community-artifacts/Decision-trees-v2.ipynb -- diff --git a/community-artifacts/Decision-trees-v2.ipynb b/community-artifacts/Decision-trees-v2.ipynb new file mode 100644 index 000..5b55b03 --- /dev/null +++ b/community-artifacts/Decision-trees-v2.ipynb @@ -0,0 +1,3208 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ +"# Decision trees\n", +"\n", +"A decision tree is a supervised learning method that can be used for classification and regression. It consists of a structure in which internal nodes represent tests on attributes, and the branches from nodes represent the result of those tests. Each leaf node is a class label and the paths from root to leaf nodes define the set of classification or regression rules.\n", +"\n", +"This notebook includes impurity importance which was added in 1.15." + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [ +{ + "name": "stderr", + "output_type": "stream", + "text": [ + "/Users/fmcquillan/anaconda/lib/python2.7/site-packages/IPython/config.py:13: ShimWarning: The `IPython.config` package has been deprecated. You should import from traitlets.config instead.\n", + " \"You should import from traitlets.config instead.\", ShimWarning)\n", + "/Users/fmcquillan/anaconda/lib/python2.7/site-packages/IPython/utils/traitlets.py:5: UserWarning: IPython.utils.traitlets has moved to a top-level traitlets package.\n", + " warn(\"IPython.utils.traitlets has moved to a top-level traitlets package.\")\n" + ] +} + ], + "source": [ +"%load_ext sql" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ +{ + "data": { + "text/plain": [ + "u'Connected: gpadmin@madlib'" + ] + }, + "execution_count": 2, + "metadata": {}, + "output_type": "execute_result" +} + ], + "source": [ +"# Greenplum Database 5.4.0 on GCP (demo machine)\n", +"%sql postgresql://gpadmin@35.184.253.255:5432/madlib\n", +"\n", +"# PostgreSQL local\n", +"#%sql postgresql://fmcquillan@localhost:5432/madlib" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ +{ + "name": "stdout", + "output_type": "stream", + "text": [ + "1 rows affected.\n" + ] +}, +{ + "data": { + "text/html": [ + "\n", + "\n", + "version\n", + "\n", + "\n", + "MADlib version: 1.15-dev, git revision: rc/1.14-rc1-45-g3ab7554, cmake configuration time: Wed Aug 1 18:34:10 UTC 2018, build type: release, build system: Linux-2.6.32-696.20.1.el6.x86_64, C compiler: gcc 4.4.7, C++ compiler: g++ 4.4.7\n", + "\n", + "" + ], + "text/plain": [ + "[(u'MADlib version: 1.15-dev, git revision: rc/1.14-rc1-45-g3ab7554, cmake configuration time: Wed Aug 1 18:34:10 UTC 2018, build type: release, build system: Linux-2.6.32-696.20.1.el6.x86_64, C compiler: gcc 4.4.7, C++ compiler: g++ 4.4.7',)]" + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" +} + ], + "source": [ +"%sql select madlib.version();\n", +"#%sql select version();" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ +"# Decision tree classification examples" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ +"# 1. Load data\n", +"Data set related to whether to play golf or not." + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "metadata": {}, + "outputs": [ +{ + "name": "stdout", + "output_type": "stream", + "text"
madlib-site git commit: updated 1.13 links to archive.apache.org
Repository: madlib-site Updated Branches: refs/heads/asf-site e76da81ae -> 5fa1ac070 updated 1.13 links to archive.apache.org Project: http://git-wip-us.apache.org/repos/asf/madlib-site/repo Commit: http://git-wip-us.apache.org/repos/asf/madlib-site/commit/5fa1ac07 Tree: http://git-wip-us.apache.org/repos/asf/madlib-site/tree/5fa1ac07 Diff: http://git-wip-us.apache.org/repos/asf/madlib-site/diff/5fa1ac07 Branch: refs/heads/asf-site Commit: 5fa1ac07007dce077c18ee36052a500faaad19fd Parents: e76da81 Author: Frank McQuillan Authored: Thu May 3 15:39:39 2018 -0700 Committer: Frank McQuillan Committed: Thu May 3 15:39:39 2018 -0700 -- download.html | 8 index.html| 2 +- 2 files changed, 5 insertions(+), 5 deletions(-) -- http://git-wip-us.apache.org/repos/asf/madlib-site/blob/5fa1ac07/download.html -- diff --git a/download.html b/download.html index 0e03047..8728d10 100644 --- a/download.html +++ b/download.html @@ -104,10 +104,10 @@ Release artifacts: - http://apache.org/dyn/closer.cgi?filename=madlib/1.13/apache-madlib-1.13-src.tar.gz&action=download";>Source code tar.gz (https://www.apache.org/dist/madlib/1.13/apache-madlib-1.13-src.tar.gz.asc";>pgp, https://www.apache.org/dist/madlib/1.13/apache-madlib-1.13-src.tar.gz.md5";>md5, https://www.apache.org/dist/madlib/1.13/apache-madlib-1.13-src.tar.gz.sha512";>sha512) - http://apache.org/dyn/closer.cgi?filename=madlib/1.13/apache-madlib-1.13-bin-Linux.rpm&action=download";>Linux (https://www.apache.org/dist/madlib/1.13/apache-madlib-1.13-bin-Linux.rpm.asc";>pgp, https://www.apache.org/dist/madlib/1.13/apache-madlib-1.13-bin-Linux.rpm.md5";>md5, https://www.apache.org/dist/madlib/1.13/apache-madlib-1.13-bin-Linux.rpm.sha512";>sha512) â CentOS / Red Hat 5 and higher (64 bit). GPDB 4.3.x, PostgreSQL 9.5 and 9.6. - http://apache.org/dyn/closer.cgi?filename=madlib/1.13/apache-madlib-1.13-bin-Linux-GPDB5.rpm&action=download";>Linux (https://www.apache.org/dist/madlib/1.13/apache-madlib-1.13-bin-Linux-GPDB5.rpm.asc";>pgp, https://www.apache.org/dist/madlib/1.13/apache-madlib-1.13-bin-Linux-GPDB5.rpm.md5";>md5, https://www.apache.org/dist/madlib/1.13/apache-madlib-1.13-bin-Linux.rpm.sha512";>sha512) â CentOS / Red Hat 6 and higher (64 bit). GPDB 5.3.x. - http://apache.org/dyn/closer.cgi?filename=madlib/1.13/apache-madlib-1.13-bin-Darwin.dmg&action=download";>Mac OS X (https://www.apache.org/dist/madlib/1.13/apache-madlib-1.13-bin-Darwin.dmg.asc";>pgp, https://www.apache.org/dist/madlib/1.13/apache-madlib-1.13-bin-Darwin.dmg.md5";>md5, https://www.apache.org/dist/madlib/1.13/apache-madlib-1.13-bin-Darwin.dmg.sha512";>sha512) â OS 10.6 and higher. For PostgreSQL 9.5 and 9.6. + https://archive.apache.org/dist/madlib/1.13/apache-madlib-1.13-src.tar.gz";>Source code tar.gz (https://archive.apache.org/dist/madlib/1.13/apache-madlib-1.13-src.tar.gz.asc";>pgp, https://archive.apache.org/dist/madlib/1.13/apache-madlib-1.13-src.tar.gz.md5";>md5, https://archive.apache.org/dist/madlib/1.13/apache-madlib-1.13-src.tar.gz.sha512";>sha512) + https://archive.apache.org/dist/madlib/1.13/apache-madlib-1.13-bin-Linux.rpm";>Linux (https://archive.apache.org/dist/madlib/1.13/apache-madlib-1.13-bin-Linux.rpm.asc";>pgp, https://archive.apache.org/dist/madlib/1.13/apache-madlib-1.13-bin-Linux.rpm.md5";>md5, https://archive.apache.org/dist/madlib/1.13/apache-madlib-1.13-bin-Linux.rpm.sha512";>sha512) â CentOS / Red Hat 5 and higher (64 bit). GPDB 4.3.x, PostgreSQL 9.5 and 9.6. + https://archive.apache.org/dist/madlib/1.13/apache-madlib-1.13-bin-Linux-GPDB5.rpm";>Linux (https://archive.apache.org/dist/madlib/1.13/apache-madlib-1.13-bin-Linux-GPDB5.rpm.asc";>pgp, https://archive.apache.org/dist/madlib/1.13/apache-madlib-1.13-bin-Linux-GPDB5.rpm.md5";>md5, https://archive.apache.org/dist/madlib/1.13/apache-madlib-1.13-bin-Linux.rpm.sha512";>sha512) â CentOS / Red Hat 6 and higher (64 bit). GPDB 5.3.x. + https://archive.apache.org/dist/madlib/1.13/apache-madlib-1.13-bin-Darwin.dmg";>Mac OS X (https://archive.apache.org/dist/madlib/1.13/apache-madlib-1.13-bin-Darwin.dmg.asc";>pgp, https://archive.apache.org/dist/madlib/1.13/apache-madlib-1.13-bin-Darwin.dmg.md5";>md5, https://archive.apache.org/dist/madlib/1.1
madlib-site git commit: website update for 1.14 release
Repository: madlib-site Updated Branches: refs/heads/asf-site f732f863c -> 39604a00c website update for 1.14 release Project: http://git-wip-us.apache.org/repos/asf/madlib-site/repo Commit: http://git-wip-us.apache.org/repos/asf/madlib-site/commit/39604a00 Tree: http://git-wip-us.apache.org/repos/asf/madlib-site/tree/39604a00 Diff: http://git-wip-us.apache.org/repos/asf/madlib-site/diff/39604a00 Branch: refs/heads/asf-site Commit: 39604a00c6284e43d211480a5f9054e33fbb0dc1 Parents: f732f86 Author: Frank McQuillan Authored: Wed May 2 09:46:48 2018 -0700 Committer: Frank McQuillan Committed: Wed May 2 09:46:48 2018 -0700 -- design.pdf | Bin 1929401 -> 1930975 bytes documentation.html | 9 + download.html | 24 +++- index.html | 39 ++- 4 files changed, 62 insertions(+), 10 deletions(-) -- http://git-wip-us.apache.org/repos/asf/madlib-site/blob/39604a00/design.pdf -- diff --git a/design.pdf b/design.pdf index 073fdb1..164ecc1 100644 Binary files a/design.pdf and b/design.pdf differ http://git-wip-us.apache.org/repos/asf/madlib-site/blob/39604a00/documentation.html -- diff --git a/documentation.html b/documentation.html index 41cc7f0..4670603 100644 --- a/documentation.html +++ b/documentation.html @@ -55,6 +55,7 @@ jQuery(document).ready(function() { The primary documentation reference material providing detailed information on the functions and algorithms within MADlib as well as background theory and references into the literature. Older Documentation +MADlib v1.13 MADlib v1.12 MADlib v1.11 MADlib v1.10 @@ -98,6 +99,14 @@ jQuery(document).ready(function() { + + +https://github.com/apache/madlib-site/tree/asf-site/community-artifacts";>Jupyter Notebooks for Getting Started +Includes many of the most commonly used algorithms by data scientists. + + + + Community Portal http://git-wip-us.apache.org/repos/asf/madlib-site/blob/39604a00/download.html -- diff --git a/download.html b/download.html index 997c93f..59ded0d 100644 --- a/download.html +++ b/download.html @@ -58,7 +58,7 @@ Current Release - v1.13 + v1.14 Source Code and Convenience Binaries MADlib® source code and convenience binaries are available from the Apache distribution site. @@ -66,10 +66,10 @@ Latest stable release: - http://apache.org/dyn/closer.cgi?filename=madlib/1.13/apache-madlib-1.13-src.tar.gz&action=download";>Source code tar.gz (https://www.apache.org/dist/madlib/1.13/apache-madlib-1.13-src.tar.gz.asc";>pgp, https://www.apache.org/dist/madlib/1.13/apache-madlib-1.13-src.tar.gz.md5";>md5, https://www.apache.org/dist/madlib/1.13/apache-madlib-1.13-src.tar.gz.sha512";>sha512) - http://apache.org/dyn/closer.cgi?filename=madlib/1.13/apache-madlib-1.13-bin-Linux.rpm&action=download";>Linux (https://www.apache.org/dist/madlib/1.13/apache-madlib-1.13-bin-Linux.rpm.asc";>pgp, https://www.apache.org/dist/madlib/1.13/apache-madlib-1.13-bin-Linux.rpm.md5";>md5, https://www.apache.org/dist/madlib/1.13/apache-madlib-1.13-bin-Linux.rpm.sha512";>sha512) â CentOS / Red Hat 5 and higher (64 bit). GPDB 4.3.x, PostgreSQL 9.5 and 9.6. - http://apache.org/dyn/closer.cgi?filename=madlib/1.13/apache-madlib-1.13-bin-Linux-GPDB5.rpm&action=download";>Linux (https://www.apache.org/dist/madlib/1.13/apache-madlib-1.13-bin-Linux-GPDB5.rpm.asc";>pgp, https://www.apache.org/dist/madlib/1.13/apache-madlib-1.13-bin-Linux-GPDB5.rpm.md5";>md5, https://www.apache.org/dist/madlib/1.13/apache-madlib-1.13-bin-Linux.rpm.sha512";>sha512) â CentOS / Red Hat 6 and higher (64 bit). GPDB 5.3.x. - http://apache.org/dyn/closer.cgi?filename=madlib/1.13/apache-madlib-1.13-bin-Darwin.dmg&action=download";>Mac OS X (https://www.apache.org/dist/madlib/1.13/apache-madlib-1.13-bin-Darwin.dmg.asc";>pgp, https://www.apache.org/dist/madlib/1.13/apache-madlib-1.13-bin-Darwin.dmg.md5";>md5, https://www.apache.org/dist/madlib/1.13/apache-madlib-1.13-bin-Darwin.dmg.sha512";>sha512) â OS 10.6 and higher. For PostgreSQL 9.5 and 9.6. +
[1/2] madlib-site git commit: fix decision tree jupyter notebook
Repository: madlib-site Updated Branches: refs/heads/asf-site 418f361cf -> f732f863c http://git-wip-us.apache.org/repos/asf/madlib-site/blob/f732f863/community-artifacts/Decision-trees-v1.ipynb -- diff --git a/community-artifacts/Decision-trees-v1.ipynb b/community-artifacts/Decision-trees-v1.ipynb index e97b943..02a60ef 100644 --- a/community-artifacts/Decision-trees-v1.ipynb +++ b/community-artifacts/Decision-trees-v1.ipynb @@ -11,15 +11,17 @@ }, { "cell_type": "code", - "execution_count": 34, + "execution_count": 1, "metadata": {}, "outputs": [ { - "name": "stdout", + "name": "stderr", "output_type": "stream", "text": [ - "The sql extension is already loaded. To reload it, use:\n", - " %reload_ext sql\n" + "/Users/fmcquillan/anaconda/lib/python2.7/site-packages/IPython/config.py:13: ShimWarning: The `IPython.config` package has been deprecated. You should import from traitlets.config instead.\n", + " \"You should import from traitlets.config instead.\", ShimWarning)\n", + "/Users/fmcquillan/anaconda/lib/python2.7/site-packages/IPython/utils/traitlets.py:5: UserWarning: IPython.utils.traitlets has moved to a top-level traitlets package.\n", + " warn(\"IPython.utils.traitlets has moved to a top-level traitlets package.\")\n" ] } ], @@ -29,26 +31,26 @@ }, { "cell_type": "code", - "execution_count": 35, + "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "u'Connected: fmcquillan@madlib'" + "u'Connected: gpadmin@madlib'" ] }, - "execution_count": 35, + "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Greenplum Database 5.4.0 on GCP (demo machine)\n", -"#%sql postgresql://gpadmin@35.184.253.255:5432/madlib\n", +"%sql postgresql://gpadmin@35.184.253.255:5432/madlib\n", "\n", "# PostgreSQL local\n", -"%sql postgresql://fmcquillan@localhost:5432/madlib\n", +"#%sql postgresql://fmcquillan@localhost:5432/madlib\n", "\n", "# Greenplum Database 4.3.10.0\n", "#%sql postgresql://gpdbchina@10.194.10.68:61000/madlib" @@ -56,9 +58,37 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 3, "metadata": {}, - "outputs": [], + "outputs": [ +{ + "name": "stdout", + "output_type": "stream", + "text": [ + "1 rows affected.\n" + ] +}, +{ + "data": { + "text/html": [ + "\n", + "\n", + "version\n", + "\n", + "\n", + "MADlib version: 1.14, git revision: rc/1.13-rc1-68-g1c81cb1, cmake configuration time: Tue Apr 24 15:54:15 UTC 2018, build type: release, build system: Linux-2.6.32-696.20.1.el6.x86_64, C compiler: gcc 4.4.7, C++ compiler: g++ 4.4.7\n", + "\n", + "" + ], + "text/plain": [ + "[(u'MADlib version: 1.14, git revision: rc/1.13-rc1-68-g1c81cb1, cmake configuration time: Tue Apr 24 15:54:15 UTC 2018, build type: release, build system: Linux-2.6.32-696.20.1.el6.x86_64, C compiler: gcc 4.4.7, C++ compiler: g++ 4.4.7',)]" + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" +} + ], "source": [ "%sql select madlib.version();\n", "#%sql select version();" @@ -81,7 +111,7 @@ }, { "cell_type": "code", - "execution_count": 36, + "execution_count": 4, "metadata": {}, "outputs": [ { @@ -282,7 +312,7 @@ " (14, u'rain', 71.0, 80.0, [71.0, 80.0], [u'low', u'unhealthy'], True, u\"Don't Play\", 1.0)]" ] }, - "execution_count": 36, + "execution_count": 4, "metadata": {}, "output_type": "execute_result" } @@ -332,7 +362,7 @@
[2/2] madlib-site git commit: fix decision tree jupyter notebook
fix decision tree jupyter notebook Project: http://git-wip-us.apache.org/repos/asf/madlib-site/repo Commit: http://git-wip-us.apache.org/repos/asf/madlib-site/commit/f732f863 Tree: http://git-wip-us.apache.org/repos/asf/madlib-site/tree/f732f863 Diff: http://git-wip-us.apache.org/repos/asf/madlib-site/diff/f732f863 Branch: refs/heads/asf-site Commit: f732f863cead81f4ecee5fbe3efb9dd362964c57 Parents: 418f361 Author: Frank McQuillan Authored: Tue Apr 24 09:10:53 2018 -0700 Committer: Frank McQuillan Committed: Tue Apr 24 09:10:53 2018 -0700 -- community-artifacts/Decision-trees-v1.ipynb | 1785 -- 1 file changed, 1623 insertions(+), 162 deletions(-) --
[madlib-site] Git Push Summary
Repository: madlib-site Updated Branches: refs/heads/notebook-updates-1dot14 [deleted] 3f849b9e4
[12/15] madlib-site git commit: jupyter notebooks for 1.14 release
http://git-wip-us.apache.org/repos/asf/madlib-site/blob/418f361c/community-artifacts/Encoding-categorical-variables-1dot10-v1.ipynb -- diff --git a/community-artifacts/Encoding-categorical-variables-1dot10-v1.ipynb b/community-artifacts/Encoding-categorical-variables-1dot10-v1.ipynb deleted file mode 100644 index 409de20..000 --- a/community-artifacts/Encoding-categorical-variables-1dot10-v1.ipynb +++ /dev/null @@ -1,2748 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ -"# Encoding categorical variables (MADlib v1.10+)\n", -"This is the new module that replaces create_indicator_variables() which has been deprecated as of MADlib v1.10" - ] - }, - { - "cell_type": "code", - "execution_count": 44, - "metadata": { -"collapsed": false - }, - "outputs": [ -{ - "name": "stdout", - "output_type": "stream", - "text": [ - "The sql extension is already loaded. To reload it, use:\n", - " %reload_ext sql\n" - ] -} - ], - "source": [ -"%load_ext sql" - ] - }, - { - "cell_type": "code", - "execution_count": 45, - "metadata": { -"collapsed": false - }, - "outputs": [ -{ - "data": { - "text/plain": [ - "u'Connected: gpdbchina@madlib'" - ] - }, - "execution_count": 45, - "metadata": {}, - "output_type": "execute_result" -} - ], - "source": [ -"%sql postgresql://gpdbchina@10.194.10.68:55000/madlib\n", -"#%sql postgresql://fmcquillan@localhost:5432/madlib\n", -"#%sql postgresql://gpadmin@54.197.30.46:10432/gpadmin" - ] - }, - { - "cell_type": "code", - "execution_count": 46, - "metadata": { -"collapsed": false - }, - "outputs": [ -{ - "name": "stdout", - "output_type": "stream", - "text": [ - "1 rows affected.\n" - ] -}, -{ - "data": { - "text/html": [ - "\n", - "\n", - "version\n", - "\n", - "\n", - "MADlib version: 1.10.0-dev, git revision: rel/v1.9.1-47-g2d5a5ed, cmake configuration time: Tue Feb 7 19:45:19 UTC 2017, build type: Release, build system: Linux-2.6.18-238.27.1.el5.hotfix.bz516490, C compiler: gcc 4.4.0, C++ compiler: g++ 4.4.0\n", - "\n", - "" - ], - "text/plain": [ - "[(u'MADlib version: 1.10.0-dev, git revision: rel/v1.9.1-47-g2d5a5ed, cmake configuration time: Tue Feb 7 19:45:19 UTC 2017, build type: Release, build system: Linux-2.6.18-238.27.1.el5.hotfix.bz516490, C compiler: gcc 4.4.0, C++ compiler: g++ 4.4.0',)]" - ] - }, - "execution_count": 46, - "metadata": {}, - "output_type": "execute_result" -} - ], - "source": [ -"%sql select madlib.version();\n", -"#%sql select version();" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ -"## 1. Load data set\n", -"Use a subset of the abalone dataset:" - ] - }, - { - "cell_type": "code", - "execution_count": 47, - "metadata": { -"collapsed": false - }, - "outputs": [ -{ - "name": "stdout", - "output_type": "stream", - "text": [ - "Done.\n", - "Done.\n", - "20 rows affected.\n", - "20 rows affected.\n" - ] -}, -{ - "data": { - "text/html": [ - "\n", - "\n", - "id\n", - "sex\n", - "length\n", - "diameter\n", - "height\n", - "rings\n", - "\n", - "\n", - "1\n", - "M\n", - "0.455\n", - "0.365\n", - "0.095\n", - "15\n", - "\n", - "\n", - "
[15/15] madlib-site git commit: jupyter notebooks for 1.14 release
jupyter notebooks for 1.14 release Project: http://git-wip-us.apache.org/repos/asf/madlib-site/repo Commit: http://git-wip-us.apache.org/repos/asf/madlib-site/commit/418f361c Tree: http://git-wip-us.apache.org/repos/asf/madlib-site/tree/418f361c Diff: http://git-wip-us.apache.org/repos/asf/madlib-site/diff/418f361c Branch: refs/heads/asf-site Commit: 418f361cffb2d03563634047cb2a26a4c1b71caf Parents: 4fe8cfb Author: Frank McQuillan Authored: Mon Apr 23 14:56:06 2018 -0700 Committer: Frank McQuillan Committed: Mon Apr 23 16:14:50 2018 -0700 -- community-artifacts/Balanced-sampling-v1.ipynb | 3706 ++ community-artifacts/Decision-trees-v1.ipynb | 1590 ++ ...coding-categorical-variables-1dot10-v1.ipynb | 2748 --- .../Encoding-categorical-variables-v2.ipynb | 4026 +++ community-artifacts/LDA-v1.ipynb| 2034 community-artifacts/MLP.ipynb | 514 -- .../Minibatch-preprocessor-v1.ipynb | 1330 + community-artifacts/PageRank-v1.ipynb | 774 --- community-artifacts/PageRank-v2.ipynb | 889 community-artifacts/Random-forest-v1.ipynb | 2899 +++ community-artifacts/Summary-v1.ipynb| 1026 community-artifacts/Summary-v2.ipynb| 1017 community-artifacts/Term-frequency-v1.ipynb | 1062 community-artifacts/kNN-v2.ipynb| 751 --- community-artifacts/kNN-v3.ipynb| 857 community-artifacts/mlp-mnist-v2.ipynb | 1154 + community-artifacts/mlp-v2.ipynb| 3755 -- community-artifacts/mlp-v3.ipynb| 4584 ++ images/neural-net-head.jpg | Bin 0 -> 326157 bytes 19 files changed, 25148 insertions(+), 9568 deletions(-) --
[14/15] madlib-site git commit: jupyter notebooks for 1.14 release
http://git-wip-us.apache.org/repos/asf/madlib-site/blob/418f361c/community-artifacts/Balanced-sampling-v1.ipynb -- diff --git a/community-artifacts/Balanced-sampling-v1.ipynb b/community-artifacts/Balanced-sampling-v1.ipynb new file mode 100644 index 000..5f6ec23 --- /dev/null +++ b/community-artifacts/Balanced-sampling-v1.ipynb @@ -0,0 +1,3706 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ +"# Balanced sampling\n", +"\n", +"This module offers a number of re-sampling techniques including under-sampling majority classes, over-sampling minority classes, and combinations of the two.\n", +"\n", +"Balanced sampling was added in MADlib 1.14." + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ +{ + "name": "stdout", + "output_type": "stream", + "text": [ + "The sql extension is already loaded. To reload it, use:\n", + " %reload_ext sql\n" + ] +} + ], + "source": [ +"%load_ext sql" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ +{ + "data": { + "text/plain": [ + "u'Connected: gpadmin@madlib'" + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" +} + ], + "source": [ +"# Greenplum Database 5.4.0 on GCP (demo machine)\n", +"%sql postgresql://gpadmin@35.184.253.255:5432/madlib\n", +"\n", +"# PostgreSQL local\n", +"#%sql postgresql://fmcquillan@localhost:5432/madlib\n", +"\n", +"# Greenplum Database 4.3.10.0\n", +"#%sql postgresql://gpdbchina@10.194.10.68:61000/madlib" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ +{ + "name": "stdout", + "output_type": "stream", + "text": [ + "1 rows affected.\n" + ] +}, +{ + "data": { + "text/html": [ + "\n", + "\n", + "version\n", + "\n", + "\n", + "MADlib version: 1.14-dev, git revision: rc/1.13-rc1-22-g0bfcaf5, cmake configuration time: Wed Mar 14 21:35:16 UTC 2018, build type: release, build system: Linux-2.6.32-696.20.1.el6.x86_64, C compiler: gcc 4.4.7, C++ compiler: g++ 4.4.7\n", + "\n", + "" + ], + "text/plain": [ + "[(u'MADlib version: 1.14-dev, git revision: rc/1.13-rc1-22-g0bfcaf5, cmake configuration time: Wed Mar 14 21:35:16 UTC 2018, build type: release, build system: Linux-2.6.32-696.20.1.el6.x86_64, C compiler: gcc 4.4.7, C++ compiler: g++ 4.4.7',)]" + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" +} + ], + "source": [ +"%sql select madlib.version();\n", +"#%sql select version();" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ +"# 1. Load data\n", +"Based in part on the flags data set from https://archive.ics.uci.edu/ml/datasets/Flags"; + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ +{ + "name": "stdout", + "output_type": "stream", + "text": [ + "Done.\n", + "Done.\n", + "22 rows affected.\n", + "22 rows affected.\n" + ] +}, +{ + "data": { + "text/html": [ + "\n", + "\n", + "id\n", + "name\n", + "landmass\n", + "zone\n", + "area\n", + "population\n", + "language\n", + "colours\n", + "mainhue\n", + "\n", + "\n", + "1\n", + "
[11/15] madlib-site git commit: jupyter notebooks for 1.14 release
http://git-wip-us.apache.org/repos/asf/madlib-site/blob/418f361c/community-artifacts/Encoding-categorical-variables-v2.ipynb -- diff --git a/community-artifacts/Encoding-categorical-variables-v2.ipynb b/community-artifacts/Encoding-categorical-variables-v2.ipynb new file mode 100644 index 000..5e4cb6f --- /dev/null +++ b/community-artifacts/Encoding-categorical-variables-v2.ipynb @@ -0,0 +1,4026 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ +"# Encoding categorical variables\n", +"This is the new module that replaces create_indicator_variables() which was deprecated as of MADlib v1.10" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [ +{ + "name": "stderr", + "output_type": "stream", + "text": [ + "/Users/fmcquillan/anaconda/lib/python2.7/site-packages/IPython/config.py:13: ShimWarning: The `IPython.config` package has been deprecated. You should import from traitlets.config instead.\n", + " \"You should import from traitlets.config instead.\", ShimWarning)\n", + "/Users/fmcquillan/anaconda/lib/python2.7/site-packages/IPython/utils/traitlets.py:5: UserWarning: IPython.utils.traitlets has moved to a top-level traitlets package.\n", + " warn(\"IPython.utils.traitlets has moved to a top-level traitlets package.\")\n" + ] +} + ], + "source": [ +"%load_ext sql" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ +{ + "data": { + "text/plain": [ + "u'Connected: gpadmin@madlib'" + ] + }, + "execution_count": 2, + "metadata": {}, + "output_type": "execute_result" +} + ], + "source": [ +"# Greenplum Database 5.4.0 on GCP (demo machine)\n", +"%sql postgresql://gpadmin@35.184.253.255:5432/madlib\n", +"\n", +"# PostgreSQL local\n", +"#%sql postgresql://fmcquillan@localhost:5432/madlib\n", +"\n", +"# Greenplum Database 4.3.10.0\n", +"#%sql postgresql://gpdbchina@10.194.10.68:61000/madlib" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ +{ + "name": "stdout", + "output_type": "stream", + "text": [ + "1 rows affected.\n" + ] +}, +{ + "data": { + "text/html": [ + "\n", + "\n", + "version\n", + "\n", + "\n", + "MADlib version: 1.14-dev, git revision: rc/1.13-rc1-21-g3af2d70, cmake configuration time: Mon Feb 26 18:00:54 UTC 2018, build type: release, build system: Linux-2.6.32-696.20.1.el6.x86_64, C compiler: gcc 4.4.7, C++ compiler: g++ 4.4.7\n", + "\n", + "" + ], + "text/plain": [ + "[(u'MADlib version: 1.14-dev, git revision: rc/1.13-rc1-21-g3af2d70, cmake configuration time: Mon Feb 26 18:00:54 UTC 2018, build type: release, build system: Linux-2.6.32-696.20.1.el6.x86_64, C compiler: gcc 4.4.7, C++ compiler: g++ 4.4.7',)]" + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" +} + ], + "source": [ +"%sql select madlib.version();\n", +"#%sql select version();" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ +"## 1. Load data set\n", +"Use a subset of the abalone dataset:" + ] + }, + { + "cell_type": "code", + "execution_count": 28, + "metadata": {}, + "outputs": [ +{ + "name": "stdout", + "output_type": "stream", + "text": [ + "Done.\n", + "Done.\n", + "20 rows affected.\n", + "20 rows affected.\n" + ] +}, +{ + "data": { + "text/html": [ + "\n", + "\n", + "id\n", + "sex\n", + "length\n&
[03/15] madlib-site git commit: jupyter notebooks for 1.14 release
http://git-wip-us.apache.org/repos/asf/madlib-site/blob/418f361c/community-artifacts/mlp-mnist-v2.ipynb -- diff --git a/community-artifacts/mlp-mnist-v2.ipynb b/community-artifacts/mlp-mnist-v2.ipynb new file mode 100644 index 000..3c1ad14 --- /dev/null +++ b/community-artifacts/mlp-mnist-v2.ipynb @@ -0,0 +1,1154 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ +"# Neural networks\n", +"\n", +"Multilayer perceptron (MLP) using the well known MNIST data set.\n", +"\n", +"Updated to include mini-batching which was added in the 1.14 release.\n", +"\n", +"# Intro" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [ +{ + "data": { + "image/jpeg": "/9j/4R5fRXhpZgAATU0AKggABwESAAMBAAEAAAEaAAUBYgEbAAUB\nagEoAAMBAAIAAAExAAIccgEyAAIUjodpAAQBpNAACvyA\nAAAnEAAK/IAAACcQQWRvYmUgUGhvdG9zaG9wIENTNSBXaW5kb3dzADIwMTU6MDc6MjQgMTA6NTk6\nNTEAA6ABAAMBAAEAAKACAAQBAAACoKADAAQBAAABcwAGAQMAAwAA\nAAEABgAAARoABQEAAAEeARsABQEAAAEmASgAAwEAAgAAAgEABAEAAAEuAgIA\nBAEAAB0pAEgBSAH/2P/tAAxBZG9iZV9DTQAB/+4ADkFkb2JlAGSA\nAf/bAIQADAgICAkIDAkJDBELCgsRFQ8MDA8VGBMTFRMTGBEMDAwMDAwRDAwMDAwMDAwMDAwM\nDAwMDAwMDAwMDAwMDAwMDAENCwsNDg0QDg4QFA4ODhQUDg4ODhQRDAwMDAwREQwMDAwMDBEMDAwM\nDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwM/8AAEQgAWACgAwEiAAIRAQMRAf/dAAQACv/EAT8AAAEF\nAQEBAQEBAAMAAQIEBQYHCAkKCwEAAQUBAQEBAQEAAQACAwQFBgcICQoLEAAB\nBAEDAgQCBQcGCAUDDDMBAAIRAwQhEjEFQVFhEyJxgTIGFJGhsUIjJBVSwWIzNHKC0UMHJZJT8OHx\nY3M1FqKygyZEk1RkRcKjdDYX0lXiZfKzhMPTdePzRieUpIW0lcTU5PSltcXV5fVWZnaGlqa2xtbm\n9jdHV2d3h5ent8fX5/cRAAICAQIEBAMEBQYHBwYF NQEAAhEDITESBEFRYXEiEwUygZEUobFCI8FS\n0fAzJGLhcoKSQ1MVY3M08SUGFqKygwcmNcLSRJNUoxdkRVU2dGXi8rOEw9N14/NGlKSFtJXE1OT0\npbXF1eX1VmZ2hpamtsbW5vYnN0dXZ3eHl6e3x//aAAwDAQACEQMRAD8A8+yL677Bjsa4VU+1pr5c\nZ997qHex2/8AkvqVmrEDunZFjr2OqqurbVZqNjnh+/1G7fUZQ7+bf/w3p2s/PR+o9Lz+kVMuusdm\nY1h9mVSG2UknX0vXs3Pqs/kW1/8AFb0OjM9YNu2FhYSygXuL91h/wTXN9KnZs/nv0P8AwX+EVwRH\nERI1OvlP8v0WrxXEGGsL+Yfi6n1b+qNudU578dz6GkObbYfTZPZ9su2V4j/of4Sy/wDwXpLQ+tXS\nOidDbVl0Xs6hmOALNAcdriZda9u57MjZ7fRxv5j/AE3s/QrAOX1nIoi8bq2GHAlpx/3dwZLfsX/W\n/wBH/wAUhWU+ptPUdlTK2D1Qx732j6XpsrbvuZ6j2t9m5VZ8vmMwYnhiP3gYxr+t+jxf3nRx83y8\ncUoyjxEg7cMpX/e+fh/2f/VHP9PJ6hkvfY51j3AudY4+0azue7/BsRsu+uPtNW71AQL9hibOPVtd\n9JzLo3Naz9H/ADnvVhl7La/s2JS19FmjaCSywfvW2tZ77/67LX/9bUR07Ira66k1uraD6raSCA3u\n17bW73s/66rAgRHT1dZS31aEpgy19P7sfBosyxbIbtxbXab2NAYf65hz6/625W+k5WS3Os6dkOc6\nvPY7FtredwBcZoewas3svbW6t6rkV2GWFjmNGrbWw5o8fUq3OsZ/L/z1ZxaDe0Vh4FteuPaTIkcU\nG5m5vu/wPqemm2QQbutfCX9X/CZBAT9OwP8AzfH/AAXO22Y73V3s3ODi19TuxB2O/lMf7UW+trqz\nbS42AfSB+nWB/pGt+kz/AId ns/4pdD9b+gHBycfMqE09Rory2gCNrntHrV/9ublzM2UvFlbi17T7\nSNCFBDNGVgai6/rNjLy04Ue4v+qUYrLQCSId2lELQGyHNDDoDrz3Uox8ru3HyI+Fdh/6nHf/AOAf\n8Qp112Ue8si9mlbXabf+HP8A6K/z/wDBqQD7O7AT337Mr7NlDGma3NeQA0agDbE/y91aLVjV5YL2\nAG0iLGAR/wChDG/+fa0G5jWUVusJBcBHftvc7/PuR6qMmusCtjqrLoDCdH7eXP8A3m7voMTrAJMv\nlA1WxgZUIfMTo3cetwz2VUVB9bQC+x2jWNI29/5KNlYFLb6rWPDgAfc7RrgP39v57VY6b0rqTG2W\nZ9fpYEB92Y7QV/ubv9I6z6Laq/0j0LO6sy+jKq6QQPQaLG3Fv6Z1YivJZW1381Tsd6/s9/6L9Ips\nOXBkxmQPFqTED+r+7+8s5jl8+HJwyiYmgDekfV++hf0NptORfdXTV/g6LPa6I3fzbvd/nIb8zpWP\nXNFbr36tftbtAP8AJts3PZ/Yas3EzH7ybALHO/Odq6f3nOSNFlgL2PLGEk+72Nn+S523d/ZQOSNf\nq4Cz39Ulvtm6nIkDt6Ypm9YbUf1WivHIduFhm2wT/wAJb/5FA+1W+u251jrTIIJ4Gu72fu+5RdTc\n0x6lTncwXt/79tU62vYSL2+m0CZI0I/kOHtUfFM6GwB4cMf+9ZOGI1Gv1uX/AHz/AP/Q5Tp+bfhd\nQNFDt+Na4NdU8B9dlbiN1N1T5rtb+Yl9Yei04edQ2uxuLhWVCzFYSS5u91nrV/n3Xena3+d/0foq\n5iYWFTe1vrnLvYQ5lTKHtqY8wd5teTZbWzd7WVM+mq31nysPL6kxl15NePWzGZ6bdzy6ou9d3q2P\npZ7rbH/yFp5Yj27IF8VCzXp/d/uubCROb02Bwni0/wAU/wB5yrc/ErtF1Hvyg0NdkWNhriNN9dbC\n51Vm3/ WtTr6jk10NfXY5jQTuLX7AS76Xs9rfcosxeie5nqZFV4gtN7Q2l2vua5+P9otr9v0H7LP+\ntqX2BmVWa6wxkRse29trASf8I2WWVb/9a1XByEmqvX0w/a2CMdCwfOf7GVGRg5FzXiaskNg2D3F3\n9lv538v1N607KMT0baKrWnMsbNo8h+839/8A8+f8YsNmBkY9pY7HvfYzuGnaD/J27t3tVlvSC8Nf\nYy0thpLnVlsal7/e76Wytjv7akhOfCYmAJOh/R/lJbMR6TqOhH6X8otW3Cdj2OBvre6r6fpl7XVu\n/wBGfVrZ/nfzX/CKz07GnIrscfQfIhzSNjp/lsOyh/8AX/RP/wCCTMz8k+y5hlsltpbuewk7thkO\nd6as/YnmoZEso3fzdm4sZaZ1Yxz9/p2/29n/ABagnhE4nhvby4Wzgz+3IGVb+fE+kk9Mz/q2K7nl\n2fgsdFThHtmYdW4ek7a1eddV+r2RWW5Lsd+NjWFw9dw2tLgRuZTjndfb6e737G+/9J+4rGH1HMbf\nXiZRsqJikWD81p09LLriux1Tf33s/Rfy6lW64d/Ucm6x9hvLx7gXBhaGtb+lDHOv/M2+p9BV+X5I\nwuc5cW0K+T5dpzv9Nu81z8cg9vGOEEyy6+vh498eLh4f1fF6nOpq6I2zZZ9otaP5zIY5lZYP324l\njLPV2/ufaP0v/BqfUS/CvOLfGTjPa1+JazQmp38zbjv9zms/0mO/ez1vU/wqDk5mdG0trfXyWFu+\nQPzv0+65zW/1lp9KnqeIMWzF9O7FJtwDXvAe4/pLMT3+vt9X+ep/m6vUVqIs8EdD/d08v0uL/Ccy\nZI9ctY9fV/zv0eFp5NIOTQ3HLbW4VMbZAO8Tb7mE+/8ASWN/fVn6vY9Z6gcnqdr8XDxh6+Vbtmxw\nnayqhr/5zJybXbGfufpL/wCbpeh19LynMyLL22V22kMHqV7LPcd9n6K1zf8AO9VXX9N zMTCrxBi2\nZFYd9ozi8gMBINePW6zcfTtopc+2yqv+bfk/ziWXAcg2MRLXv/U4aX4OaGGYIIkYn6/vcTf+sX15\nd1ikYNdBqopG3GoZ7obG0b3aufZs/PXJY9jsPKrue4M2E7gDLy1w2WAbPou2O/PVi+uut+y7
[04/15] madlib-site git commit: jupyter notebooks for 1.14 release
http://git-wip-us.apache.org/repos/asf/madlib-site/blob/418f361c/community-artifacts/Term-frequency-v1.ipynb -- diff --git a/community-artifacts/Term-frequency-v1.ipynb b/community-artifacts/Term-frequency-v1.ipynb new file mode 100644 index 000..99a0cd0 --- /dev/null +++ b/community-artifacts/Term-frequency-v1.ipynb @@ -0,0 +1,1062 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ +"# Term Frequency\n", +"Term frequency computes the number of times that a word or term occurs in a document. Term frequency is often used as part of a larger text processing pipeline, which may include operations such as stemming, stop word removal and topic modelling." + ] + }, + { + "cell_type": "code", + "execution_count": 36, + "metadata": {}, + "outputs": [ +{ + "name": "stdout", + "output_type": "stream", + "text": [ + "The sql extension is already loaded. To reload it, use:\n", + " %reload_ext sql\n" + ] +} + ], + "source": [ +"%load_ext sql" + ] + }, + { + "cell_type": "code", + "execution_count": 37, + "metadata": {}, + "outputs": [ +{ + "data": { + "text/plain": [ + "u'Connected: fmcquillan@madlib'" + ] + }, + "execution_count": 37, + "metadata": {}, + "output_type": "execute_result" +} + ], + "source": [ +"# Greenplum 4.3.10.0\n", +"# %sql postgresql://gpdbchina@10.194.10.68:61000/madlib\n", +"\n", +"# PostgreSQL local\n", +"%sql postgresql://fmcquillan@localhost:5432/madlib" + ] + }, + { + "cell_type": "code", + "execution_count": 38, + "metadata": {}, + "outputs": [ +{ + "name": "stdout", + "output_type": "stream", + "text": [ + "1 rows affected.\n" + ] +}, +{ + "data": { + "text/html": [ + "\n", + "\n", + "version\n", + "\n", + "\n", + "MADlib version: 1.13, git revision: unknown, cmake configuration time: Wed Dec 20 08:02:21 UTC 2017, build type: Release, build system: Darwin-17.3.0, C compiler: Clang, C++ compiler: Clang\n", + "\n", + "" + ], + "text/plain": [ + "[(u'MADlib version: 1.13, git revision: unknown, cmake configuration time: Wed Dec 20 08:02:21 UTC 2017, build type: Release, build system: Darwin-17.3.0, C compiler: Clang, C++ compiler: Clang',)]" + ] + }, + "execution_count": 38, + "metadata": {}, + "output_type": "execute_result" +} + ], + "source": [ +"%sql select madlib.version();\n", +"#%sql select version();" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ +"# 1. Prepare documents\n", +"First we create a document table with one document per row:" + ] + }, + { + "cell_type": "code", + "execution_count": 58, + "metadata": {}, + "outputs": [ +{ + "name": "stdout", + "output_type": "stream", + "text": [ + "Done.\n", + "Done.\n", + "4 rows affected.\n", + "4 rows affected.\n" + ] +}, +{ + "data": { + "text/html": [ + "\n", + "\n", + "docid\n", + "contents\n", + "\n", + "\n", + "0\n", + "I like to eat broccoli and bananas. I ate a banana and spinach smoothie for breakfast.\n", + "\n", + "\n", + "1\n", + "Chinchillas and kittens are cute.\n", + "\n", + "\n", + "2\n", + "My sister adopted two kittens yesterday.\n", + "\n", + "\n", + "3\n", + "Look at this cute hamster munching on a piece of broccoli.\n&quo
[07/15] madlib-site git commit: jupyter notebooks for 1.14 release
http://git-wip-us.apache.org/repos/asf/madlib-site/blob/418f361c/community-artifacts/PageRank-v1.ipynb -- diff --git a/community-artifacts/PageRank-v1.ipynb b/community-artifacts/PageRank-v1.ipynb deleted file mode 100644 index 32b1caf..000 --- a/community-artifacts/PageRank-v1.ipynb +++ /dev/null @@ -1,774 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ -"# PageRank\n", -"The PageRank algorithm produces a probability distribution representing the likelihood that a person randomly traversing a graph will arrive at any particular vertex. PageRank was added in MADlib 1.11." - ] - }, - { - "cell_type": "code", - "execution_count": 2, - "metadata": { -"collapsed": false - }, - "outputs": [ -{ - "name": "stderr", - "output_type": "stream", - "text": [ - "/Users/fmcquillan/anaconda/lib/python2.7/site-packages/IPython/config.py:13: ShimWarning: The `IPython.config` package has been deprecated. You should import from traitlets.config instead.\n", - " \"You should import from traitlets.config instead.\", ShimWarning)\n", - "/Users/fmcquillan/anaconda/lib/python2.7/site-packages/IPython/utils/traitlets.py:5: UserWarning: IPython.utils.traitlets has moved to a top-level traitlets package.\n", - " warn(\"IPython.utils.traitlets has moved to a top-level traitlets package.\")\n" - ] -} - ], - "source": [ -"%load_ext sql" - ] - }, - { - "cell_type": "code", - "execution_count": 3, - "metadata": { -"collapsed": false - }, - "outputs": [ -{ - "data": { - "text/plain": [ - "u'Connected: fmcquillan@madlib'" - ] - }, - "execution_count": 3, - "metadata": {}, - "output_type": "execute_result" -} - ], - "source": [ -"# Greenplum 4.3.10.0\n", -"#%sql postgresql://gpdbchina@10.194.10.68:61000/madlib\n", -"\n", -"# PostgreSQL local\n", -"%sql postgresql://fmcquillan@localhost:5432/madlib\n", -"\n", -"# Greenplum 4.2.3.0\n", -"#%sql postgresql://gpdbchina@10.194.10.68:55000/madlib" - ] - }, - { - "cell_type": "code", - "execution_count": 3, - "metadata": { -"collapsed": false - }, - "outputs": [ -{ - "name": "stdout", - "output_type": "stream", - "text": [ - "1 rows affected.\n" - ] -}, -{ - "data": { - "text/html": [ - "\n", - "\n", - "version\n", - "\n", - "\n", - "MADlib version: 1.11-dev, git revision: rc/v1.9alpha-rc1-138-gcc5ce09, cmake configuration time: Tue Apr 11 20:47:30 UTC 2017, build type: Release, build system: Linux-2.6.18-238.27.1.el5.hotfix.bz516490, C compiler: gcc 4.4.0, C++ compiler: g++ 4.4.0\n", - "\n", - "" - ], - "text/plain": [ - "[(u'MADlib version: 1.11-dev, git revision: rc/v1.9alpha-rc1-138-gcc5ce09, cmake configuration time: Tue Apr 11 20:47:30 UTC 2017, build type: Release, build system: Linux-2.6.18-238.27.1.el5.hotfix.bz516490, C compiler: gcc 4.4.0, C++ compiler: g++ 4.4.0',)]" - ] - }, - "execution_count": 3, - "metadata": {}, - "output_type": "execute_result" -} - ], - "source": [ -"%sql select madlib.version();\n", -"#%sql select version();" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ -"# 1. Create vertex and edge tables" - ] - }, - { - "cell_type": "code", - "execution_count": 4, - "metadata": { -"collapsed": false - }, - "outputs": [ -{ - "name": "stdout", - "output_type": "stream", - "text": [ - "Done.\n", - "Done.\n", - "Done.\n", - "7 rows affected.\n", - "22 rows affected.\n", - "22 rows affected.\n" - ] -}, -{ - "data": { - "
[10/15] madlib-site git commit: jupyter notebooks for 1.14 release
http://git-wip-us.apache.org/repos/asf/madlib-site/blob/418f361c/community-artifacts/LDA-v1.ipynb -- diff --git a/community-artifacts/LDA-v1.ipynb b/community-artifacts/LDA-v1.ipynb new file mode 100644 index 000..19a199c --- /dev/null +++ b/community-artifacts/LDA-v1.ipynb @@ -0,0 +1,2034 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ +"# Latent Dirichlet Allocation \n", +"\n", +"Latent Dirichlet Allocation (LDA) is a generative probabilistic model for natural texts. It is used in problems such as automated topic discovery, collaborative filtering, and document classification.\n", +"\n", +"In addition to an implementation of LDA, this MADlib module also provides a number of additional helper functions to interpret results of the LDA output." + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [ +{ + "name": "stderr", + "output_type": "stream", + "text": [ + "/Users/fmcquillan/anaconda/lib/python2.7/site-packages/IPython/config.py:13: ShimWarning: The `IPython.config` package has been deprecated. You should import from traitlets.config instead.\n", + " \"You should import from traitlets.config instead.\", ShimWarning)\n", + "/Users/fmcquillan/anaconda/lib/python2.7/site-packages/IPython/utils/traitlets.py:5: UserWarning: IPython.utils.traitlets has moved to a top-level traitlets package.\n", + " warn(\"IPython.utils.traitlets has moved to a top-level traitlets package.\")\n" + ] +} + ], + "source": [ +"%load_ext sql" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ +{ + "data": { + "text/plain": [ + "u'Connected: gpadmin@madlib'" + ] + }, + "execution_count": 2, + "metadata": {}, + "output_type": "execute_result" +} + ], + "source": [ +"# Greenplum Database 5.4.0 on GCP (demo machine)\n", +"%sql postgresql://gpadmin@35.184.253.255:5432/madlib\n", +"\n", +"# PostgreSQL local\n", +"#%sql postgresql://fmcquillan@localhost:5432/madlib\n", +"\n", +"# Greenplum Database 4.3.10.0\n", +"#%sql postgresql://gpdbchina@10.194.10.68:61000/madlib" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ +{ + "name": "stdout", + "output_type": "stream", + "text": [ + "1 rows affected.\n" + ] +}, +{ + "data": { + "text/html": [ + "\n", + "\n", + "version\n", + "\n", + "\n", + "MADlib version: 1.14-dev, git revision: rc/1.13-rc1-15-g7ffad03, cmake configuration time: Wed Feb 21 01:33:31 UTC 2018, build type: release, build system: Linux-2.6.32-696.20.1.el6.x86_64, C compiler: gcc 4.4.7, C++ compiler: g++ 4.4.7\n", + "\n", + "" + ], + "text/plain": [ + "[(u'MADlib version: 1.14-dev, git revision: rc/1.13-rc1-15-g7ffad03, cmake configuration time: Wed Feb 21 01:33:31 UTC 2018, build type: release, build system: Linux-2.6.32-696.20.1.el6.x86_64, C compiler: gcc 4.4.7, C++ compiler: g++ 4.4.7',)]" + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" +} + ], + "source": [ +"%sql select madlib.version();\n", +"#%sql select version();" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ +"# 1. Prepare documents\n", +"The examples below are short strings extracted from various Wikipedia documents. First we create a document table with one document per row:" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ +{ + "name": "stdout", + "output_type": "stream", + "text": [ + "Done.\n", + "Done.\n&qu
[06/15] madlib-site git commit: jupyter notebooks for 1.14 release
http://git-wip-us.apache.org/repos/asf/madlib-site/blob/418f361c/community-artifacts/Random-forest-v1.ipynb -- diff --git a/community-artifacts/Random-forest-v1.ipynb b/community-artifacts/Random-forest-v1.ipynb new file mode 100644 index 000..bac8363 --- /dev/null +++ b/community-artifacts/Random-forest-v1.ipynb @@ -0,0 +1,2899 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ +"# Random forest\n", +"\n", +"Random forests build an ensemble of classifiers, each of which is a tree model constructed using bootstrapped samples from the input data. The results of these models are then combined to yield a single prediction, which, at the expense of some loss in interpretation, have been found to be highly accurate.\n", +"\n", +"Please also refer to the decision tree user documentation for information relevant to the implementation of random forests in MADlib." + ] + }, + { + "cell_type": "code", + "execution_count": 72, + "metadata": {}, + "outputs": [ +{ + "name": "stdout", + "output_type": "stream", + "text": [ + "The sql extension is already loaded. To reload it, use:\n", + " %reload_ext sql\n" + ] +} + ], + "source": [ +"%load_ext sql" + ] + }, + { + "cell_type": "code", + "execution_count": 73, + "metadata": {}, + "outputs": [ +{ + "data": { + "text/plain": [ + "u'Connected: gpadmin@madlib'" + ] + }, + "execution_count": 73, + "metadata": {}, + "output_type": "execute_result" +} + ], + "source": [ +"# Greenplum Database 5.4.0 on GCP (demo machine)\n", +"%sql postgresql://gpadmin@35.184.253.255:5432/madlib\n", +"\n", +"# PostgreSQL local\n", +"#%sql postgresql://fmcquillan@localhost:5432/madlib\n", +"\n", +"# Greenplum Database 4.3.10.0\n", +"#%sql postgresql://gpdbchina@10.194.10.68:61000/madlib" + ] + }, + { + "cell_type": "code", + "execution_count": 75, + "metadata": {}, + "outputs": [ +{ + "name": "stdout", + "output_type": "stream", + "text": [ + "1 rows affected.\n" + ] +}, +{ + "data": { + "text/html": [ + "\n", + "\n", + "version\n", + "\n", + "\n", + "MADlib version: 1.14-dev, git revision: rc/1.13-rc1-40-ga1360f3, cmake configuration time: Wed Mar 28 18:16:08 UTC 2018, build type: release, build system: Linux-2.6.32-696.20.1.el6.x86_64, C compiler: gcc 4.4.7, C++ compiler: g++ 4.4.7\n", + "\n", + "" + ], + "text/plain": [ + "[(u'MADlib version: 1.14-dev, git revision: rc/1.13-rc1-40-ga1360f3, cmake configuration time: Wed Mar 28 18:16:08 UTC 2018, build type: release, build system: Linux-2.6.32-696.20.1.el6.x86_64, C compiler: gcc 4.4.7, C++ compiler: g++ 4.4.7',)]" + ] + }, + "execution_count": 75, + "metadata": {}, + "output_type": "execute_result" +} + ], + "source": [ +"%sql select madlib.version();\n", +"#%sql select version();" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ +"# Random forest classification examples" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ +"# 1. Load data\n", +"Data set related to whether to play golf or not." + ] + }, + { + "cell_type": "code", + "execution_count": 76, + "metadata": {}, + "outputs": [ +{ + "name": "stdout", + "output_type": "stream", + "text": [ + "Done.\n", + "Done.\n", + "14 rows affected.\n", + "14 rows affected.\n" + ] +}, +{ + "data": { + "text/html": [ + "\n", + "\n", + "id\n", + "OUTLOOK\n", + "
[02/15] madlib-site git commit: jupyter notebooks for 1.14 release
http://git-wip-us.apache.org/repos/asf/madlib-site/blob/418f361c/community-artifacts/mlp-v2.ipynb -- diff --git a/community-artifacts/mlp-v2.ipynb b/community-artifacts/mlp-v2.ipynb deleted file mode 100644 index 145b3e2..000 --- a/community-artifacts/mlp-v2.ipynb +++ /dev/null @@ -1,3755 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ -"# Multilayer Perceptron" - ] - }, - { - "cell_type": "code", - "execution_count": 117, - "metadata": { -"scrolled": true - }, - "outputs": [ -{ - "name": "stdout", - "output_type": "stream", - "text": [ - "The sql extension is already loaded. To reload it, use:\n", - " %reload_ext sql\n" - ] -} - ], - "source": [ -"%load_ext sql" - ] - }, - { - "cell_type": "code", - "execution_count": 118, - "metadata": {}, - "outputs": [ -{ - "data": { - "text/plain": [ - "u'Connected: gpdbchina@madlib'" - ] - }, - "execution_count": 118, - "metadata": {}, - "output_type": "execute_result" -} - ], - "source": [ -"# Greenplum 4.3.10.0\n", -"%sql postgresql://gpdbchina@10.194.10.68:61000/madlib\n", -"\n", -"# PostgreSQL local\n", -"#%sql postgresql://fmcquillan@localhost:5432/madlib\n", -"\n", -"# Greenplum 4.2.3.0\n", -"#%sql postgresql://gpdbchina@10.194.10.68:55000/madlib" - ] - }, - { - "cell_type": "code", - "execution_count": 119, - "metadata": {}, - "outputs": [ -{ - "name": "stdout", - "output_type": "stream", - "text": [ - "1 rows affected.\n" - ] -}, -{ - "data": { - "text/html": [ - "\n", - "\n", - "version\n", - "\n", - "\n", - "MADlib version: 2.0-dev, git revision: rel/v1.12-9-gf790a61, cmake configuration time: Tue Sep 19 17:56:02 UTC 2017, build type: Release, build system: Linux-2.6.18-238.27.1.el5.hotfix.bz516490, C compiler: gcc 4.4.0, C++ compiler: g++ 4.4.0\n", - "\n", - "" - ], - "text/plain": [ - "[(u'MADlib version: 2.0-dev, git revision: rel/v1.12-9-gf790a61, cmake configuration time: Tue Sep 19 17:56:02 UTC 2017, build type: Release, build system: Linux-2.6.18-238.27.1.el5.hotfix.bz516490, C compiler: gcc 4.4.0, C++ compiler: g++ 4.4.0',)]" - ] - }, - "execution_count": 119, - "metadata": {}, - "output_type": "execute_result" -} - ], - "source": [ -"%sql select madlib.version();\n", -"#%sql select version();" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ -"# 1. Create input table for classification" - ] - }, - { - "cell_type": "code", - "execution_count": 120, - "metadata": {}, - "outputs": [ -{ - "name": "stdout", - "output_type": "stream", - "text": [ - "Done.\n", - "Done.\n", - "52 rows affected.\n", - "52 rows affected.\n" - ] -}, -{ - "data": { - "text/html": [ - "\n", - "\n", - "id\n", - "attributes\n", - "class_text\n", - "class\n", - "state\n", - "\n", - "\n", - "1\n", - "[Decimal('5.0'), Decimal('3.2'), Decimal('1.2'), Decimal('0.2')]\n", - "Iris_setosa\n", - "1\n", - "Alaska\n", - "\n", - "\n", - "2\n", - "[Decimal('5.5'), Decimal('3.5'), Decimal('1.3'), Decimal('0.2')]\n", - "Iris_setosa\n", - "1\n", -
[01/15] madlib-site git commit: jupyter notebooks for 1.14 release
Repository: madlib-site Updated Branches: refs/heads/asf-site 4fe8cfb2f -> 418f361cf http://git-wip-us.apache.org/repos/asf/madlib-site/blob/418f361c/community-artifacts/mlp-v3.ipynb -- diff --git a/community-artifacts/mlp-v3.ipynb b/community-artifacts/mlp-v3.ipynb new file mode 100644 index 000..8c585a6 --- /dev/null +++ b/community-artifacts/mlp-v3.ipynb @@ -0,0 +1,4584 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ +"# Multilayer Perceptron\n", +"\n", +"Multilayer Perceptron (MLP) is a type of neural network that can be used for regression and classification.\n", +"\n", +"This version of the workbook includes mini-batching which was added in the 1.14 release." + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": { +"scrolled": true + }, + "outputs": [ +{ + "name": "stderr", + "output_type": "stream", + "text": [ + "/Users/fmcquillan/anaconda/lib/python2.7/site-packages/IPython/config.py:13: ShimWarning: The `IPython.config` package has been deprecated. You should import from traitlets.config instead.\n", + " \"You should import from traitlets.config instead.\", ShimWarning)\n", + "/Users/fmcquillan/anaconda/lib/python2.7/site-packages/IPython/utils/traitlets.py:5: UserWarning: IPython.utils.traitlets has moved to a top-level traitlets package.\n", + " warn(\"IPython.utils.traitlets has moved to a top-level traitlets package.\")\n" + ] +} + ], + "source": [ +"%load_ext sql" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ +{ + "data": { + "text/plain": [ + "u'Connected: gpadmin@madlib'" + ] + }, + "execution_count": 2, + "metadata": {}, + "output_type": "execute_result" +} + ], + "source": [ +"# Greenplum Database 5.4.0 on GCP (demo machine)\n", +"%sql postgresql://gpadmin@35.184.253.255:5432/madlib\n", +"\n", +"# PostgreSQL local\n", +"#%sql postgresql://fmcquillan@localhost:5432/madlib\n", +"\n", +"# Greenplum Database 4.3.10.0\n", +"#%sql postgresql://gpdbchina@10.194.10.68:61000/madlib" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ +{ + "name": "stdout", + "output_type": "stream", + "text": [ + "1 rows affected.\n" + ] +}, +{ + "data": { + "text/html": [ + "\n", + "\n", + "version\n", + "\n", + "\n", + "MADlib version: 1.14-dev, git revision: rc/1.13-rc1-66-g4cced1b, cmake configuration time: Mon Apr 23 16:26:17 UTC 2018, build type: release, build system: Linux-2.6.32-696.20.1.el6.x86_64, C compiler: gcc 4.4.7, C++ compiler: g++ 4.4.7\n", + "\n", + "" + ], + "text/plain": [ + "[(u'MADlib version: 1.14-dev, git revision: rc/1.13-rc1-66-g4cced1b, cmake configuration time: Mon Apr 23 16:26:17 UTC 2018, build type: release, build system: Linux-2.6.32-696.20.1.el6.x86_64, C compiler: gcc 4.4.7, C++ compiler: g++ 4.4.7',)]" + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" +} + ], + "source": [ +"%sql select madlib.version();\n", +"#%sql select version();" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ +"# Classification without Mini-Batching\n", +"\n", +"# 1. Create input table for classification" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ +{ + "name": "stdout", + "output_type": "stream", + "text": [ + "Done.\n", + "Done.\n", + "52 rows affected.\n", + "52 rows affected.\n" + ] +}, +{ +
[09/15] madlib-site git commit: jupyter notebooks for 1.14 release
http://git-wip-us.apache.org/repos/asf/madlib-site/blob/418f361c/community-artifacts/MLP.ipynb -- diff --git a/community-artifacts/MLP.ipynb b/community-artifacts/MLP.ipynb deleted file mode 100644 index dcd0cdb..000 --- a/community-artifacts/MLP.ipynb +++ /dev/null @@ -1,514 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ -"- This demo uses the popular MNIST dataset, which consists of 70,000 hand written digits and is used for \n", -"classification.\n", -"\n", -"## Current best accuracy on postgres\n", -"\n", -"### train_accuracy\n", -"\n", -"- 99.64%\n", -"\n", -"### test_accuracy\n", -"\n", -"- 96.79%\n", -"\n", -"### Parameters\n", -"- Hidden layers: [200,200], tanh activation, n_iterations=10, learning_rate_init=0.001, learning_rate_policy=constant, lambda=0.0001, tolerance=0" - ] - }, - { - "cell_type": "code", - "execution_count": 1, - "metadata": {}, - "outputs": [ -{ - "data": { - "text/plain": [ - "u'Connected: csloan@postgres'" - ] - }, - "execution_count": 1, - "metadata": {}, - "output_type": "execute_result" -} - ], - "source": [ -"%load_ext sql\n", -"%sql postgresql://csloan@localhost:5432/postgres" - ] - }, - { - "cell_type": "code", - "execution_count": 2, - "metadata": {}, - "outputs": [ -{ - "name": "stdout", - "output_type": "stream", - "text": [ - "DROP TABLE\n", - "CREATE TABLE\n", - "COPY 6\n", - "DROP TABLE\n", - "CREATE TABLE\n", - "COPY 1\n" - ] -} - ], - "source": [ -"%%bash\n", -"# Note that these datasets are available from https://github.com/apache/incubator-madlib-site\n";, -"gunzip -c ../data/mnist_train.sql.gz > ../data/mnist_train.sql\n", -"gunzip -c ../data/mnist_test.sql.gz > ../data/mnist_test.sql\n", -"psql -f ../data/mnist_train.sql\n", -"psql -f ../data/mnist_test.sql" - ] - }, - { - "cell_type": "code", - "execution_count": 3, - "metadata": {}, - "outputs": [ -{ - "data": { - "image/png": "iVBORw0KGgoNSUhEUgAAA9YAAAGDCAIAAABBVx+IAABFLklEQVR42u3df3AU953n/7ElukHM\nGCOBsSQ2TFivZZ8FxsjEsXyxoZxS4iOK76I9fw/OW18l32/B9+vvslvl2dqUtZVjvHVfcanyuGqB\nS1Wou0TZJOZqK3IdYflWzMYb4hxKiKzlh0UBwpYH2xLIFgIxg2BaI/Pt6Y7aTXfPaDQaTXd/5vn4\ng5GE0OfT/Zn59IuP3vPpO28BKKE7AwBKiAgOEMEBIjgAIjgAAABABAcA\nAABABAcAAACI4AARHAARHCCCAwAAACCCA0ARpYb7jmT3zihnCADK55IQz3FJ6DmX\nIIIDQHEk+mJ/ld2+gRSnCADKxcUj0RyXhB8VdkkgggMlRQQHgGykx1/5xVG7V5plzg0AlItw\n+8/sV4Jf/fTPVxHBAWB+MngoJDvgxABAOXG4EIRC0px+ZOER/A4AyILpGgCAeYngAIjg\nQBlE8FsAoGFKBQCgRBEcABEcAEonFT8YjXREY7FoJNodT4nRlngHJeQwud6oB/vghW4I/wRg\nIiKCA4DrAbxv1/Z9oW07o5HIS+1S146951L+b0u8gxJymFxv1IN98EI3hH8CMBERwQHA/QTe3334\ncl1TXWa3cDncXHPx4P55u4F9ydoS76CEHCbXG/VgH7zQDeGfAExERHAAcF8iHlek4B/u0SBJNVIy\nPqz4vS3xDkrIYXK9UQ/2wQvdEP4JwEREBAcA9ykp80pJZgUlkUz4vS3xDkrIYXK9UQ/2wQvdEP4J\nwEREBAcA90m33bA+M4nLkuT3tsQ7KCGHyfVGPdgHL3RD+CcAExERHADcFwrXSkoyof/GUlGSSjBc\nF/J7W+IdlJDD5HqjHuyDF7oh/BOAiYgIDgDukxu3t tQM92g7WKXiR+K1LVsaZb+3Jd5BCTlMrjfq\nwT54oRvCPwGYiIjgAOCFDN4U2d2e3NfRGYvt6gpseyWyRvZ/W+IdlJDD5HqjHuyDF7oh/BOAiYgI\nDgBeEGpo69wT64hEorFoW4MsRlviHZSQw+R6ox7sgxe6IfwTgImICA4AAAD4DBEcIIID\nRHAARHCACA4AAACACA4QwQEiOAAiOEAEBwAAAEAEBwAAAIjg\nAIjgABEcIIIDIIIDgEel4gejkY5oLBaNRLvjKTHaEu+ghBwm1xv1YB+80A3hnwBM\nRERwAHA9gPft2r4vtG1nNBJ5qV3q2rH3XMr/bYl3UEIOk+uNerAPXuiG8E8AJiIiOAC4n8D7uw9f\nrmuqk9WP5XBzzcWD+wdSfm9LvIMScphcb9SDffBCN4R/AjAREcEBwH2JeFyRgiFJ+0SSaqRkfFjx\ne1viHZSQw+R6ox7sgxe6IfwTgImICA4A7lNS5pWSzApKIpnwe1viHZSQw+R6ox7sgxe6IfwTgImI\nCA4A7pNk2fRZZhKXJcnvbYl3UEIOk+uNerAPXuiG8E8AJiIiOAC4LxSulZRkQv+NpaIklWC4LuT3\ntsQ7KCGHyfVGPdgHL3RD+CcAExERHADcJzdubakZ7tF2sErFj8RrW7Y0yn5vS7yDEnKYXG/Ug33w\nQjeEfwIwERHBAcALGbwpsrs9ua+jMxbb1RXY9kpkjez/tsQ7KCGHyfVGPdgHL3RD+CcAExERHAC8\nINTQ1rkn1hGJRGPRtgZZjLbEOyghh8n1Rj3YBy90Q/gnABMRERwAAADwGSI4QAQHiOAA\niOAAERwARCCtWLF48dIVSyVOBQDASejepeqF4o/uLuxCQQQHADtlZOT69SsjVxROBQDA\nSeLSFfVC8eHVwi4URPBy8alm amoqrVE0qVTqpubGjRsTmuvXryc1Cc01zfj4+FWbcZNrJuq/Uv+5\n+nPUn6b+WPWHpzSTGrXpT21u2TBeAABAYETwcqHmWj2C6ylcTcP5p3A9Z88Yvo38rdJ/lBHB1Yb0\n/K22TvIGAABEcJRXBNdXwfU1aTUZ2/O3To/gRrY2Urjxpz2FJ6aZV8H1JXAjgltWwW85YbAR\nHCLQ87e+8q1mbjUfGyFbjdRXrly5fPnyqGZkZOSSZlgzNDT0geaCif6VDz/88KOPPtL/HNKo33/x\n4kX1J6g/R/2BY2NjekZX29JXxNXW9SA+ZWKO44wUAAAggkMQ+iq4ynEV3FgIz7EK7liIYq9F0QtR\nLLXgllVwvTOO69+kcACY0Y/moFh9+LtCMXwAEby8IrgqWy24UYiSoxbcSOG5q1D0QhQ9fxuFKPrb\nMc214MZ/Cczhm/wNAACI4BCHnryNzK1HbTUu68HaKET55JNPPv74Y70WRa9C0UtNshWiGFUo+p96\nIYr6b9Wfo/60sbEx9YfrodxYEU9PM78107IozngBAAAiOHwv9yq4eSE8Oc1Y1bYsgZsXwvU18mw7\nEtrfjmmp/LaEb4YJAAAQwSFUBLfvCz6XQhTj6+ZClGz7glsiuGMteIBCcPhWKn4wGumIxmLRSLQ7\nnhKjLfEOSshhytb+J+/806FDv/gn9Y+3LlwvefNTYwO9v+vt/d3RI2++1f+JIvZL0jsdYCLy16gR\nwcuF/kZMNQebC1HUrGwuRFF
[08/15] madlib-site git commit: jupyter notebooks for 1.14 release
http://git-wip-us.apache.org/repos/asf/madlib-site/blob/418f361c/community-artifacts/Minibatch-preprocessor-v1.ipynb -- diff --git a/community-artifacts/Minibatch-preprocessor-v1.ipynb b/community-artifacts/Minibatch-preprocessor-v1.ipynb new file mode 100644 index 000..fe03a27 --- /dev/null +++ b/community-artifacts/Minibatch-preprocessor-v1.ipynb @@ -0,0 +1,1330 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ +"# Mini-batch preprocessor\n", +"\n", +"The mini-batch preprocessor is a utility that prepares input data for use by models that support mini-batch as an optimization option. (This is currently only the case for Neural Networks.) It is effectively a packing operation that builds arrays of dependent and independent variables from the source data table.\n", +"\n", +"The mini-batch preprocessor was added in MADlib 1.14." + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [ +{ + "name": "stderr", + "output_type": "stream", + "text": [ + "/Users/fmcquillan/anaconda/lib/python2.7/site-packages/IPython/config.py:13: ShimWarning: The `IPython.config` package has been deprecated. You should import from traitlets.config instead.\n", + " \"You should import from traitlets.config instead.\", ShimWarning)\n", + "/Users/fmcquillan/anaconda/lib/python2.7/site-packages/IPython/utils/traitlets.py:5: UserWarning: IPython.utils.traitlets has moved to a top-level traitlets package.\n", + " warn(\"IPython.utils.traitlets has moved to a top-level traitlets package.\")\n" + ] +} + ], + "source": [ +"%load_ext sql" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ +{ + "data": { + "text/plain": [ + "u'Connected: gpadmin@madlib'" + ] + }, + "execution_count": 2, + "metadata": {}, + "output_type": "execute_result" +} + ], + "source": [ +"# Greenplum Database 5.4.0 on GCP (demo machine)\n", +"%sql postgresql://gpadmin@35.184.253.255:5432/madlib\n", +"\n", +"# PostgreSQL local\n", +"#%sql postgresql://fmcquillan@localhost:5432/madlib\n", +"\n", +"# Greenplum Database 4.3.10.0\n", +"#%sql postgresql://gpdbchina@10.194.10.68:61000/madlib" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ +{ + "name": "stdout", + "output_type": "stream", + "text": [ + "1 rows affected.\n" + ] +}, +{ + "data": { + "text/html": [ + "\n", + "\n", + "version\n", + "\n", + "\n", + "MADlib version: 1.14-dev, git revision: rc/1.13-rc1-66-g4cced1b, cmake configuration time: Mon Apr 23 16:26:17 UTC 2018, build type: release, build system: Linux-2.6.32-696.20.1.el6.x86_64, C compiler: gcc 4.4.7, C++ compiler: g++ 4.4.7\n", + "\n", + "" + ], + "text/plain": [ + "[(u'MADlib version: 1.14-dev, git revision: rc/1.13-rc1-66-g4cced1b, cmake configuration time: Mon Apr 23 16:26:17 UTC 2018, build type: release, build system: Linux-2.6.32-696.20.1.el6.x86_64, C compiler: gcc 4.4.7, C++ compiler: g++ 4.4.7',)]" + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" +} + ], + "source": [ +"%sql select madlib.version();\n", +"#%sql select version();" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ +"# 1. Load data\n", +"Based on the well known iris dataset." + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ +{ + "name": "stdout", + "output_type": "stream", + "text": [ + "Done.\n", + "Done.\n", + "52 rows affect
[13/15] madlib-site git commit: jupyter notebooks for 1.14 release
http://git-wip-us.apache.org/repos/asf/madlib-site/blob/418f361c/community-artifacts/Decision-trees-v1.ipynb -- diff --git a/community-artifacts/Decision-trees-v1.ipynb b/community-artifacts/Decision-trees-v1.ipynb new file mode 100644 index 000..e97b943 --- /dev/null +++ b/community-artifacts/Decision-trees-v1.ipynb @@ -0,0 +1,1590 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ +"# Decision trees\n", +"\n", +"A decision tree is a supervised learning method that can be used for classification and regression. It consists of a structure in which internal nodes represent tests on attributes, and the branches from nodes represent the result of those tests. Each leaf node is a class label and the paths from root to leaf nodes define the set of classification or regression rules." + ] + }, + { + "cell_type": "code", + "execution_count": 34, + "metadata": {}, + "outputs": [ +{ + "name": "stdout", + "output_type": "stream", + "text": [ + "The sql extension is already loaded. To reload it, use:\n", + " %reload_ext sql\n" + ] +} + ], + "source": [ +"%load_ext sql" + ] + }, + { + "cell_type": "code", + "execution_count": 35, + "metadata": {}, + "outputs": [ +{ + "data": { + "text/plain": [ + "u'Connected: fmcquillan@madlib'" + ] + }, + "execution_count": 35, + "metadata": {}, + "output_type": "execute_result" +} + ], + "source": [ +"# Greenplum Database 5.4.0 on GCP (demo machine)\n", +"#%sql postgresql://gpadmin@35.184.253.255:5432/madlib\n", +"\n", +"# PostgreSQL local\n", +"%sql postgresql://fmcquillan@localhost:5432/madlib\n", +"\n", +"# Greenplum Database 4.3.10.0\n", +"#%sql postgresql://gpdbchina@10.194.10.68:61000/madlib" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ +"%sql select madlib.version();\n", +"#%sql select version();" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ +"# Decision tree classification examples" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ +"# 1. Load data\n", +"Data set related to whether to play golf or not." + ] + }, + { + "cell_type": "code", + "execution_count": 36, + "metadata": {}, + "outputs": [ +{ + "name": "stdout", + "output_type": "stream", + "text": [ + "Done.\n", + "Done.\n", + "14 rows affected.\n", + "14 rows affected.\n" + ] +}, +{ + "data": { + "text/html": [ + "\n", + "\n", + "id\n", + "OUTLOOK\n", + "temperature\n", + "humidity\n", + "Temp_Humidity\n", + "clouds_airquality\n", + "windy\n", + "class\n", + "observation_weight\n", + "\n", + "\n", + "1\n", + "sunny\n", + "85.0\n", + "85.0\n", + "[85.0, 85.0]\n", + "[u'none', u'unhealthy']\n", + "False\n", + "Don't Play\n", + "5.0\n", + "\n", + "\n", + "2\n", + "sunny\n", + "80.0\n", + "90.0\n", + "[80.0, 90.0]\n", + "[u'none', u'moderate']\n", + "True\n", + "Don't Play\n", + "5.0\n", + "\n", + "\n", + "3\n", + "overcast\n", + "
[05/15] madlib-site git commit: jupyter notebooks for 1.14 release
http://git-wip-us.apache.org/repos/asf/madlib-site/blob/418f361c/community-artifacts/Summary-v1.ipynb -- diff --git a/community-artifacts/Summary-v1.ipynb b/community-artifacts/Summary-v1.ipynb deleted file mode 100644 index 57c3611..000 --- a/community-artifacts/Summary-v1.ipynb +++ /dev/null @@ -1,1026 +0,0 @@ -{ - "cells": [ - { - "cell_type": "code", - "execution_count": 13, - "metadata": {}, - "outputs": [ -{ - "name": "stdout", - "output_type": "stream", - "text": [ - "The sql extension is already loaded. To reload it, use:\n", - " %reload_ext sql\n" - ] -} - ], - "source": [ -"%load_ext sql" - ] - }, - { - "cell_type": "code", - "execution_count": 14, - "metadata": {}, - "outputs": [ -{ - "data": { - "text/plain": [ - "u'Connected: fmcquillan@madlib'" - ] - }, - "execution_count": 14, - "metadata": {}, - "output_type": "execute_result" -} - ], - "source": [ -"# Greenplum 4.3.10.0\n", -"# %sql postgresql://gpdbchina@10.194.10.68:61000/madlib\n", -"\n", -"# PostgreSQL local\n", -"%sql postgresql://fmcquillan@localhost:5432/madlib" - ] - }, - { - "cell_type": "code", - "execution_count": 15, - "metadata": {}, - "outputs": [ -{ - "name": "stdout", - "output_type": "stream", - "text": [ - "1 rows affected.\n" - ] -}, -{ - "data": { - "text/html": [ - "\n", - "\n", - "version\n", - "\n", - "\n", - "MADlib version: 1.12, git revision: unknown, cmake configuration time: Wed Aug 23 23:07:18 UTC 2017, build type: Release, build system: Darwin-16.7.0, C compiler: Clang, C++ compiler: Clang\n", - "\n", - "" - ], - "text/plain": [ - "[(u'MADlib version: 1.12, git revision: unknown, cmake configuration time: Wed Aug 23 23:07:18 UTC 2017, build type: Release, build system: Darwin-16.7.0, C compiler: Clang, C++ compiler: Clang',)]" - ] - }, - "execution_count": 15, - "metadata": {}, - "output_type": "execute_result" -} - ], - "source": [ -"%sql select madlib.version();\n", -"#%sql select version();" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ -"# 1. On-line help" - ] - }, - { - "cell_type": "code", - "execution_count": 16, - "metadata": {}, - "outputs": [ -{ - "name": "stdout", - "output_type": "stream", - "text": [ - "1 rows affected.\n" - ] -}, -{ - "data": { - "text/html": [ - "\n", - "\n", - "summary\n", - "\n", - "\n", - "'summary' is a generic function used to produce summary statisticsof any data table. The function invokes particular 'methods' fromthe MADlib library to provide an overview of the data.---For an overview on usage, run:SELECT madlib.summary('usage'); ---For an example, run:SELECT madlib.summary('example')\n", - "\n", - "" - ], - "text/plain": [ - "[(u\"\\n'summary' is a generic function used to produce summary statistics\\nof any data table. The function invokes particular 'methods' from\\nthe MADlib library to provide an overview of the data.\\n---\\nFor an overview on usage, run:\\nSELECT madlib.summary('usage');\\n ---\\nFor an example, run:\\nSELECT madlib.summary('example')\\n\",)]" - ] - }, - "execution_count": 16, - "metadata": {}, - "output_type": "execute_result" -}
[01/15] madlib-site git commit: jupyter notebooks for 1.14 release
Repository: madlib-site Updated Branches: refs/heads/notebook-updates-1dot14 [created] 3f849b9e4 http://git-wip-us.apache.org/repos/asf/madlib-site/blob/3f849b9e/community-artifacts/mlp-v3.ipynb -- diff --git a/community-artifacts/mlp-v3.ipynb b/community-artifacts/mlp-v3.ipynb new file mode 100644 index 000..8c585a6 --- /dev/null +++ b/community-artifacts/mlp-v3.ipynb @@ -0,0 +1,4584 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ +"# Multilayer Perceptron\n", +"\n", +"Multilayer Perceptron (MLP) is a type of neural network that can be used for regression and classification.\n", +"\n", +"This version of the workbook includes mini-batching which was added in the 1.14 release." + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": { +"scrolled": true + }, + "outputs": [ +{ + "name": "stderr", + "output_type": "stream", + "text": [ + "/Users/fmcquillan/anaconda/lib/python2.7/site-packages/IPython/config.py:13: ShimWarning: The `IPython.config` package has been deprecated. You should import from traitlets.config instead.\n", + " \"You should import from traitlets.config instead.\", ShimWarning)\n", + "/Users/fmcquillan/anaconda/lib/python2.7/site-packages/IPython/utils/traitlets.py:5: UserWarning: IPython.utils.traitlets has moved to a top-level traitlets package.\n", + " warn(\"IPython.utils.traitlets has moved to a top-level traitlets package.\")\n" + ] +} + ], + "source": [ +"%load_ext sql" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ +{ + "data": { + "text/plain": [ + "u'Connected: gpadmin@madlib'" + ] + }, + "execution_count": 2, + "metadata": {}, + "output_type": "execute_result" +} + ], + "source": [ +"# Greenplum Database 5.4.0 on GCP (demo machine)\n", +"%sql postgresql://gpadmin@35.184.253.255:5432/madlib\n", +"\n", +"# PostgreSQL local\n", +"#%sql postgresql://fmcquillan@localhost:5432/madlib\n", +"\n", +"# Greenplum Database 4.3.10.0\n", +"#%sql postgresql://gpdbchina@10.194.10.68:61000/madlib" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ +{ + "name": "stdout", + "output_type": "stream", + "text": [ + "1 rows affected.\n" + ] +}, +{ + "data": { + "text/html": [ + "\n", + "\n", + "version\n", + "\n", + "\n", + "MADlib version: 1.14-dev, git revision: rc/1.13-rc1-66-g4cced1b, cmake configuration time: Mon Apr 23 16:26:17 UTC 2018, build type: release, build system: Linux-2.6.32-696.20.1.el6.x86_64, C compiler: gcc 4.4.7, C++ compiler: g++ 4.4.7\n", + "\n", + "" + ], + "text/plain": [ + "[(u'MADlib version: 1.14-dev, git revision: rc/1.13-rc1-66-g4cced1b, cmake configuration time: Mon Apr 23 16:26:17 UTC 2018, build type: release, build system: Linux-2.6.32-696.20.1.el6.x86_64, C compiler: gcc 4.4.7, C++ compiler: g++ 4.4.7',)]" + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" +} + ], + "source": [ +"%sql select madlib.version();\n", +"#%sql select version();" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ +"# Classification without Mini-Batching\n", +"\n", +"# 1. Create input table for classification" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ +{ + "name": "stdout", + "output_type": "stream", + "text": [ + "Done.\n", + "Done.\n", + "52 rows affected.\n", + "52 rows affected.\n" + ] +},
[11/15] madlib-site git commit: jupyter notebooks for 1.14 release
http://git-wip-us.apache.org/repos/asf/madlib-site/blob/3f849b9e/community-artifacts/Encoding-categorical-variables-v2.ipynb -- diff --git a/community-artifacts/Encoding-categorical-variables-v2.ipynb b/community-artifacts/Encoding-categorical-variables-v2.ipynb new file mode 100644 index 000..5e4cb6f --- /dev/null +++ b/community-artifacts/Encoding-categorical-variables-v2.ipynb @@ -0,0 +1,4026 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ +"# Encoding categorical variables\n", +"This is the new module that replaces create_indicator_variables() which was deprecated as of MADlib v1.10" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [ +{ + "name": "stderr", + "output_type": "stream", + "text": [ + "/Users/fmcquillan/anaconda/lib/python2.7/site-packages/IPython/config.py:13: ShimWarning: The `IPython.config` package has been deprecated. You should import from traitlets.config instead.\n", + " \"You should import from traitlets.config instead.\", ShimWarning)\n", + "/Users/fmcquillan/anaconda/lib/python2.7/site-packages/IPython/utils/traitlets.py:5: UserWarning: IPython.utils.traitlets has moved to a top-level traitlets package.\n", + " warn(\"IPython.utils.traitlets has moved to a top-level traitlets package.\")\n" + ] +} + ], + "source": [ +"%load_ext sql" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ +{ + "data": { + "text/plain": [ + "u'Connected: gpadmin@madlib'" + ] + }, + "execution_count": 2, + "metadata": {}, + "output_type": "execute_result" +} + ], + "source": [ +"# Greenplum Database 5.4.0 on GCP (demo machine)\n", +"%sql postgresql://gpadmin@35.184.253.255:5432/madlib\n", +"\n", +"# PostgreSQL local\n", +"#%sql postgresql://fmcquillan@localhost:5432/madlib\n", +"\n", +"# Greenplum Database 4.3.10.0\n", +"#%sql postgresql://gpdbchina@10.194.10.68:61000/madlib" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ +{ + "name": "stdout", + "output_type": "stream", + "text": [ + "1 rows affected.\n" + ] +}, +{ + "data": { + "text/html": [ + "\n", + "\n", + "version\n", + "\n", + "\n", + "MADlib version: 1.14-dev, git revision: rc/1.13-rc1-21-g3af2d70, cmake configuration time: Mon Feb 26 18:00:54 UTC 2018, build type: release, build system: Linux-2.6.32-696.20.1.el6.x86_64, C compiler: gcc 4.4.7, C++ compiler: g++ 4.4.7\n", + "\n", + "" + ], + "text/plain": [ + "[(u'MADlib version: 1.14-dev, git revision: rc/1.13-rc1-21-g3af2d70, cmake configuration time: Mon Feb 26 18:00:54 UTC 2018, build type: release, build system: Linux-2.6.32-696.20.1.el6.x86_64, C compiler: gcc 4.4.7, C++ compiler: g++ 4.4.7',)]" + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" +} + ], + "source": [ +"%sql select madlib.version();\n", +"#%sql select version();" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ +"## 1. Load data set\n", +"Use a subset of the abalone dataset:" + ] + }, + { + "cell_type": "code", + "execution_count": 28, + "metadata": {}, + "outputs": [ +{ + "name": "stdout", + "output_type": "stream", + "text": [ + "Done.\n", + "Done.\n", + "20 rows affected.\n", + "20 rows affected.\n" + ] +}, +{ + "data": { + "text/html": [ + "\n", + "\n", + "id\n", + "sex\n", + "length\n&