Xiangrui Meng created SPARK-10388:
-------------------------------------

             Summary: Public dataset loader interface
                 Key: SPARK-10388
                 URL: https://issues.apache.org/jira/browse/SPARK-10388
             Project: Spark
          Issue Type: New Feature
          Components: ML
            Reporter: Xiangrui Meng
            Assignee: Xiangrui Meng


It would be very useful to have a public dataset loader that fetches ML 
datasets from popular repos, e.g., libsvm and UCI. This JIRA is to discuss the 
design, requirements, and initial implementation.

{code}
val loader = new DatasetLoader(sqlContext)
val df = loader.get("libsvm", "rcv1_train.binary")
{code}

Users should be able to list (or preview) datasets, e.g.,
{code}
val datasets = loader.ls("libsvm") // returns a local DataFrame
datasets.show() // list all datasets under libsvm repo
{code}

It would be nice to allow third-party packages to register new repos. Both the 
API and implementation are pending discussion. Note that fetching datasets from 
remote repos requires HTTP and HTTPS support.
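
To make the registration idea concrete for discussion, here is a minimal sketch in plain Scala with no Spark dependency. Everything in it is hypothetical and not part of any existing Spark API: the names DatasetInfo, DatasetRepo, DatasetRepos, and ExampleRepo, and the example.org URL, are placeholders. It only shows one possible shape: a repo lists its datasets and resolves a dataset name to a URI, and a registry maps repo names to implementations.

```scala
import java.net.URI
import scala.collection.concurrent.TrieMap

// Hypothetical: one entry in a repo's dataset listing.
final case class DatasetInfo(name: String, uri: URI)

// Hypothetical: the interface a third-party package would implement.
trait DatasetRepo {
  // Repo identifier, e.g. the "libsvm" in loader.get("libsvm", ...).
  def name: String

  // All datasets this repo knows about.
  def list(): Seq[DatasetInfo]

  // Resolve a dataset name to its download URI, if the repo has it.
  def resolve(dataset: String): Option[URI] =
    list().find(_.name == dataset).map(_.uri)
}

// Hypothetical: a process-wide registry keyed by repo name.
object DatasetRepos {
  private val repos = TrieMap.empty[String, DatasetRepo]

  // Register a repo; fails if the name is already taken.
  def register(repo: DatasetRepo): Unit = {
    val previous = repos.putIfAbsent(repo.name, repo)
    require(previous.isEmpty, s"repo '${repo.name}' is already registered")
  }

  def lookup(name: String): Option[DatasetRepo] = repos.get(name)
}

// Placeholder repo for illustration only; the base URL is not a real dataset host.
object ExampleRepo extends DatasetRepo {
  val name = "example"
  private val base = "https://example.org/datasets/"

  def list(): Seq[DatasetInfo] =
    Seq("a.libsvm", "b.libsvm").map(n => DatasetInfo(n, URI.create(base + n)))
}
```

Under these assumptions, a package would call DatasetRepos.register(ExampleRepo) once, and a loader could implement get(repo, dataset) by calling DatasetRepos.lookup(repo).flatMap(_.resolve(dataset)) and then downloading the resulting URI over HTTP or HTTPS. Whether registration should be explicit like this or discovered automatically is one of the open design questions.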



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

