[
https://issues.apache.org/jira/browse/MADLIB-338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Rahul Iyer closed MADLIB-338.
-----------------------------
> Kmeans: canopy takes several hours to fail with "No space left on device" on
> large dataset
> ------------------------------------------------------------------------------------------
>
> Key: MADLIB-338
> URL: https://issues.apache.org/jira/browse/MADLIB-338
> Project: Apache MADlib
> Issue Type: Bug
> Reporter: Ruilong Huo
> Assignee: Florian Schoppmann
> Priority: Blocker
>
> On a large dataset, kmeans with canopy initialization runs for several hours
> and then fails with a "No space left on device" error.
> 1. Dataset: the UCI us_census_1990 dataset, which has 2458285 rows and 68
> dimensions
> {noformat}
> madlib=# select count(*) from madlibtestdata.km_us_census_1990;
>   count
> ---------
>  2458285
> (1 row)
> madlib=# \d madlibtestdata.km_us_census_1990
> Table "madlibtestdata.km_us_census_1990"
>   Column  |        Type        | Modifiers
> ----------+--------------------+-----------
>  pid      | bigint             |
>  position | double precision[] |
> Distributed randomly
> {noformat}
> 2. Kmeans canopy invocation
> {noformat}
> SELECT * FROM madlib.kmeans('madlibtestdata.km_us_census_1990', 'position',
> 'pid', 'canopy', 0.01, NULL, NULL, 'l2norm', 20, 0.0001, True,
> 'madlibtestresult.kmeans_canopy_baseline_out_points',
> 'madlibtestresult.kmeans_canopy_baseline_out_centroids', True, True) AS q;
> {noformat}
>
> 3. Test result
> {noformat}
> ...
> eans_canopy_baseline_out_points
> CONTEXT: PL/Python function "kmeans"
> INFO: * output_centroids = madlibtestresult.kmeans_canopy_baseline_out_centroids
> CONTEXT: PL/Python function "kmeans"
> INFO: * verbose = True
> CONTEXT: PL/Python function "kmeans"
> INFO: Input:
> CONTEXT: PL/Python function "kmeans"
> INFO: ... analyzing data points
> CONTEXT: PL/Python function "kmeans"
> INFO: * points: 2458285 (68 dimensions), kept 2458285 after removing NULLs
> CONTEXT: PL/Python function "kmeans"
> INFO: ... generating initial centroids
> CONTEXT: PL/Python function "kmeans"
> ERROR: plpy.SPIError: could not write 32768 bytes to temporary file: No space left on device (buffile.c:501) (seg2 gpdb2.delta.sm.greenplum.com:40002 pid=15985) (plpython.c:4700)
> CONTEXT: Traceback (most recent call last):
> PL/Python function "kmeans", line 36, in <module>
> , verbose
> PL/Python function "kmeans", line 830, in kmeans
> PL/Python function "kmeans", line 459, in __init_canopy
> PL/Python function "kmeans"
> .
> (1 row)
> {noformat}
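For scale, the raw payload of the position column can be estimated from the row count and dimensionality reported above. The sketch below assumes 8 bytes per double precision value and ignores array headers, per-row overhead, and whatever intermediate results the canopy step writes to temporary files. It is therefore only a rough lower bound on the input size (on the order of 1.3 GB), not an explanation of the failure. The function name raw_payload_bytes is hypothetical and not part of MADlib.

```python
# Back-of-envelope size of the raw "position" data described in the report.
ROWS = 2458285          # from the count(*) output in the report
DIMS = 68               # from the kmeans INFO line ("68 dimensions")
BYTES_PER_DOUBLE = 8    # assumption: 8-byte double, payload only, no overhead


def raw_payload_bytes(rows, dims, bytes_per_value=BYTES_PER_DOUBLE):
    """Return the payload size in bytes of rows x dims numeric values."""
    return rows * dims * bytes_per_value


payload = raw_payload_bytes(ROWS, DIMS)
print("values:  %d" % (ROWS * DIMS))
print("payload: %d bytes (%.2f GiB)" % (payload, payload / 2.0 ** 30))
```

Comparing this figure with the free space on each segment's temporary directory gives a first indication of how much larger than the input the spilled intermediate data must have been.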