[
https://issues.apache.org/jira/browse/MADLIB-1166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16279516#comment-16279516
]
Frank McQuillan commented on MADLIB-1166:
-----------------------------------------
This seems to work for me. Let’s say we have a data set with a lot of NULLs in
the last column:
{code}
DROP TABLE IF EXISTS example_data;
CREATE TABLE example_data(
id SERIAL,
outlook TEXT,
temperature FLOAT8,
humidity FLOAT8,
windy TEXT,
class TEXT,
var1 FLOAT8);
INSERT INTO example_data VALUES
(1, 'sunny', 85, 85, 'false', 'Dont Play', 1),
(2, 'sunny', 80, 90, 'true', 'Dont Play', 2),
(3, 'overcast', 83, 78, 'false', 'Play', 3),
(4, 'rain', 70, 96, 'false', 'Play', NULL),
(5, 'rain', 68, 80, 'false', 'Play', NULL),
(6, 'rain', 65, 70, 'true', 'Dont Play', NULL),
(7, 'overcast', 64, 65, 'true', 'Play', NULL),
(8, 'sunny', 72, 95, 'false', 'Dont Play', NULL),
(9, 'sunny', 69, 70, 'false', 'Play', NULL),
(10, 'rain', 75, 80, 'false', 'Play', NULL),
(11, 'sunny', 75, 70, 'true', 'Play', NULL),
(12, 'overcast', 72, 90, 'true', 'Play', NULL),
(13, 'overcast', 81, 75, 'false', 'Play', NULL),
(14, 'rain', 71, 80, 'true', 'Dont Play', NULL),
(15, NULL, 100, 100, 'true', NULL, NULL),
(16, NULL, 110, 100, 'true', NULL, NULL);
{code}
Now run correlation function:
{code}
DROP TABLE IF EXISTS example_data_output, example_data_output_summary;
SELECT madlib.correlation( 'example_data',
'example_data_output',
'temperature, humidity, var1'
);
SELECT * FROM example_data_output ORDER BY column_position;
{code}
gives:
{code}
column_position | variable | temperature | humidity |
var1
-----------------+-------------+---------------------+--------------------+------
1 | temperature | 1 | |
2 | humidity | 0.616876934548786 | 1 |
3 | var1 | -0.0292352673102343 | -0.112407201999569 | 1
(3 rows)
{code}
Note the correlation between temp and humidity is 0.616876934548786
And
{code}
madlib=# SELECT * FROM example_data_output_summary;
{code}
produces
{code}
method | source | output_table | column_names
| mean_vector | total_rows_processed
-------------+--------------+---------------------+-----------------------------+----------------+----------------------
Correlation | example_data | example_data_output | {temperature,humidity,var1}
| {77.5,82.75,2} | 16
(1 row)
{code}
so it is processing all columns and not dropping the nulls
If I run:
{code}
DROP TABLE IF EXISTS example_data_output, example_data_output_summary;
SELECT madlib.correlation( 'example_data',
'example_data_output',
'temperature, humidity'
);
SELECT * FROM example_data_output ORDER BY column_position;
{code}
gives:
{code}
column_position | variable | temperature | humidity
-----------------+-------------+-------------------+----------
1 | temperature | 1 |
2 | humidity | 0.616876934548786 | 1
(2 rows)
{code}
with the same correlation between temp and humidity of 0.616876934548786.
> Correlation - do not drop whole row if a target column is NULL
> --------------------------------------------------------------
>
> Key: MADLIB-1166
> URL: https://issues.apache.org/jira/browse/MADLIB-1166
> Project: Apache MADlib
> Issue Type: Improvement
> Components: Module: Descriptive Statistics
> Reporter: Frank McQuillan
> Priority: Minor
> Fix For: v1.13
>
>
> http://madlib.apache.org/docs/latest/group__grp__correlation.html
> Current implementation ignores a row that contains NULL entirely. This means
> any correlation in such a row (with NULLs) does not contribute to the final
> answer.
> This improvement is not to drop the whole row if a target column is NULL, but
> rather to do pairwise correlation for any target columns that exist, to make
> most use of the data.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)