[ https://issues.apache.org/jira/browse/MADLIB-338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rahul Iyer closed MADLIB-338.
-----------------------------

> Kmeans: canopy takes several hours to fail with "No space left on device" on 
> large dataset
> ------------------------------------------------------------------------------------------
>
>                 Key: MADLIB-338
>                 URL: https://issues.apache.org/jira/browse/MADLIB-338
>             Project: Apache MADlib
>          Issue Type: Bug
>            Reporter: Ruilong Huo
>            Assignee: Florian Schoppmann
>            Priority: Blocker
>
> On a large dataset, kmeans canopy runs for several hours before failing with
> a "No space left on device" error.
> 1. Dataset: the UCI dataset us_census_1990, which has 2458285 rows and 68
> dimensions
> {noformat}
> madlib=# select count(*) from madlibtestdata.km_us_census_1990;
>   count  
> ---------
>  2458285
> (1 row)
> madlib=# \d madlibtestdata.km_us_census_1990
>  Table "madlibtestdata.km_us_census_1990"
>   Column  |        Type        | Modifiers 
> ----------+--------------------+-----------
>  pid      | bigint             | 
>  position | double precision[] | 
> Distributed randomly
> {noformat}
> 2. Kmeans canopy invocation
> {noformat}
> SELECT * FROM madlib.kmeans('madlibtestdata.km_us_census_1990', 'position', 
> 'pid', 'canopy', 0.01, NULL, NULL, 'l2norm', 20, 0.0001, True, 
> 'madlibtestresult.kmeans_canopy_baseline_out_points', 
> 'madlibtestresult.kmeans_canopy_baseline_out_centroids', True, True) AS q;
> {noformat}
> 3. Test result
> {noformat}
> ...
> eans_canopy_baseline_out_points
>  CONTEXT:  PL/Python function "kmeans"
>  INFO:   * output_centroids = madlibtestresult.kmeans_canopy_baseline_out_centroids
>  CONTEXT:  PL/Python function "kmeans"
>  INFO:   * verbose = True
>  CONTEXT:  PL/Python function "kmeans"
>  INFO:  Input:
>  CONTEXT:  PL/Python function "kmeans"
>  INFO:  ... analyzing data points
>  CONTEXT:  PL/Python function "kmeans"
>  INFO:   * points: 2458285 (68 dimensions), kept 2458285 after removing NULLs
>  CONTEXT:  PL/Python function "kmeans"
>  INFO:  ... generating initial centroids
>  CONTEXT:  PL/Python function "kmeans"
>  ERROR:  plpy.SPIError: could not write 32768 bytes to temporary file: No space left on device (buffile.c:501)  (seg2 gpdb2.delta.sm.greenplum.com:40002 pid=15985) (plpython.c:4700)
>  CONTEXT:  Traceback (most recent call last):
>    PL/Python function "kmeans", line 36, in <module>
>      , verbose
>    PL/Python function "kmeans", line 830, in kmeans
>    PL/Python function "kmeans", line 459, in __init_canopy
>  PL/Python function "kmeans"
>  .
> (1 row)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
