[jira] [Updated] (SPARK-5405) Spark clusterer should support high dimensional data

Derrick Burns (JIRA) Sun, 25 Jan 2015 22:31:54 -0800

     [ https://issues.apache.org/jira/browse/SPARK-5405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Derrick Burns updated SPARK-5405:
---------------------------------
    Description: 
The MLlib clusterer works well for low-dimensional (<200 dimensions) data.
However, its running time grows linearly with the number of dimensions, so for
practical purposes it is not very useful for high-dimensional data.

Depending on the data type, one can embed the high-dimensional data into a
lower-dimensional space in a distance-preserving way.  The Spark clusterer
should support such embeddings.
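
As one illustration of what a distance-preserving embedding can look like, here
is a minimal sketch of a Gaussian random projection, which approximately
preserves Euclidean distances (in the spirit of the Johnson-Lindenstrauss
lemma).  It is an assumption made for illustration only: it is not taken from
MLlib or from the implementation linked in this issue, the function names
(make_projection, project, euclidean) are hypothetical, and other data types or
distance functions may need a different embedding.

```python
import math
import random


def make_projection(input_dim, output_dim, seed=0):
    """Build an output_dim x input_dim Gaussian random projection matrix.

    Entries are N(0, 1) scaled by 1/sqrt(output_dim), so squared Euclidean
    distances are preserved in expectation (not exactly).
    """
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(output_dim)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(input_dim)]
            for _ in range(output_dim)]


def project(matrix, point):
    """Map one high-dimensional point into the lower-dimensional space."""
    return [sum(w * x for w, x in zip(row, point)) for row in matrix]


def euclidean(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


if __name__ == "__main__":
    rng = random.Random(1)
    a = [rng.random() for _ in range(1000)]
    b = [rng.random() for _ in range(1000)]
    m = make_projection(1000, 200)
    # The two printed distances should be close, but not identical.
    print(euclidean(a, b), euclidean(project(m, a), project(m, b)))
```

In this sketch, each distance computation on projected points touches 200
coordinates instead of 1000, which is where a clusterer whose cost is linear in
the number of dimensions would save time.  The price is a small, random
distortion of the distances.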

An example implementation that supports high dimensional data is here:
https://github.com/derrickburns/generalized-kmeans-clustering

  was:
The MLlib clusterer works well for low-dimensional (<200 dimensions) data.
However, its running time grows linearly with the number of dimensions, so for
practical purposes it is not very useful for high-dimensional data.

Depending on the data type, one can embed the high-dimensional data into a
lower-dimensional space in a distance-preserving way.  The Spark clusterer
should support such embeddings.


> Spark clusterer should support high dimensional data
> ----------------------------------------------------
>
>                 Key: SPARK-5405
>                 URL: https://issues.apache.org/jira/browse/SPARK-5405
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>    Affects Versions: 1.2.0
>            Reporter: Derrick Burns
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>
> The MLlib clusterer works well for low-dimensional (<200 dimensions) data.
> However, its running time grows linearly with the number of dimensions, so
> for practical purposes it is not very useful for high-dimensional data.
> Depending on the data type, one can embed the high-dimensional data into a
> lower-dimensional space in a distance-preserving way.  The Spark clusterer
> should support such embeddings.
> An example implementation that supports high dimensional data is here:
> https://github.com/derrickburns/generalized-kmeans-clustering



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

