Xiangrui Meng created SPARK-10388:
-------------------------------------

             Summary: Public dataset loader interface
                 Key: SPARK-10388
                 URL: https://issues.apache.org/jira/browse/SPARK-10388
             Project: Spark
          Issue Type: New Feature
          Components: ML
            Reporter: Xiangrui Meng
            Assignee: Xiangrui Meng

A public dataset loader that fetches ML datasets from popular repos, e.g., libsvm and UCI, would be very useful. This JIRA is to discuss the design, requirements, and initial implementation.

{code}
val loader = new DatasetLoader(sqlContext)
val df = loader.get("libsvm", "rcv1_train.binary")
{code}

Users should be able to list (or preview) datasets, e.g.,

{code}
val datasets = loader.ls("libsvm") // returns a local DataFrame
datasets.show() // list all datasets under libsvm repo
{code}

It would be nice to allow 3rd-party packages to register new repos. Both the API and implementation are pending discussion. Note that fetching datasets requires HTTP and HTTPS support.
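
To make the repo registration idea concrete, here is one possible shape for it, written as a minimal sketch in plain Scala with no Spark dependency. Everything in it is an assumption for discussion: the names DatasetRepo, RepoRegistry, and StaticRepo are hypothetical, the base URL in the usage note is a placeholder, and nothing here is an agreed or existing Spark API. A real loader would sit on top of something like this, download the resolved URL over HTTP or HTTPS, and return a DataFrame.

```scala
import scala.collection.mutable

// Hypothetical sketch only: none of these names exist in Spark.

/** A source of datasets, e.g. "libsvm" or "uci". */
trait DatasetRepo {
  /** Short identifier that users pass to the loader. */
  def name: String

  /** Names of the datasets this repo knows about. */
  def ls(): Seq[String]

  /** Download URL for a dataset, or None if the repo does not have it. */
  def resolve(dataset: String): Option[String]
}

/** Process-wide registry that 3rd-party packages could add repos to. */
object RepoRegistry {
  private val repos = mutable.Map.empty[String, DatasetRepo]

  /** Registers a repo, replacing any earlier repo with the same name. */
  def register(repo: DatasetRepo): Unit = synchronized { repos(repo.name) = repo }

  /** Looks up a repo by name. */
  def get(name: String): Option[DatasetRepo] = synchronized { repos.get(name) }

  /** All registered repo names in sorted order. */
  def names: Seq[String] = synchronized { repos.keys.toSeq.sorted }
}

/** Simplest possible repo: a fixed file list under one base URL. */
final class StaticRepo(val name: String, baseUrl: String, files: Seq[String])
    extends DatasetRepo {
  def ls(): Seq[String] = files

  def resolve(dataset: String): Option[String] =
    if (files.contains(dataset)) Some(s"$baseUrl/$dataset") else None
}
```

A package would call RepoRegistry.register(new StaticRepo("libsvm", "https://example.org/libsvm", Seq("rcv1_train.binary"))) once at startup, after which RepoRegistry.get("libsvm") can back both the ls and get calls of the loader. Keeping resolve separate from the actual download means the registry can be tested without network access.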