[ https://issues.apache.org/jira/browse/SPARK-10388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14955567#comment-14955567 ]
Xiangrui Meng commented on SPARK-10388:
---------------------------------------

Discussed with [~rams] offline and he is interested in working together on this feature.

> Public dataset loader interface
> -------------------------------
>
>                 Key: SPARK-10388
>                 URL: https://issues.apache.org/jira/browse/SPARK-10388
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML
>            Reporter: Xiangrui Meng
>            Assignee: Xiangrui Meng
>
> It is very useful to have a public dataset loader to fetch ML datasets from
> popular repos, e.g., libsvm and UCI. This JIRA is to discuss the design,
> requirements, and initial implementation.
> {code}
> val loader = new DatasetLoader(sqlContext)
> val df = loader.get("libsvm", "rcv1_train.binary")
> {code}
> Users should be able to list (or preview) datasets, e.g.
> {code}
> val datasets = loader.ls("libsvm") // returns a local DataFrame
> datasets.show() // list all datasets under libsvm repo
> {code}
> It would be nice to allow 3rd-party packages to register new repos. Both the
> API and implementation are pending discussion. Note that this requires HTTP
> and HTTPS support.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
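
For discussion, one possible shape for the "3rd-party packages register new repos" idea is sketched below. This is only a sketch under stated assumptions: `DatasetRepo`, `DatasetRepos`, `StaticRepo`, and the `example.org` base URL are all hypothetical names invented for illustration, not existing Spark APIs, and the JIRA itself says the API and implementation are still pending discussion. The sketch deliberately avoids Spark and networking so that it runs as plain Scala; a real loader would presumably download the resolved URL over HTTP/HTTPS and return a DataFrame.

```scala
import scala.collection.mutable

// Hypothetical interface for one dataset repository (not an existing Spark API).
trait DatasetRepo {
  def name: String
  // Names of the datasets this repo offers.
  def ls(): Seq[String]
  // Download URL for a dataset, or None if the repo does not know it.
  def resolve(dataset: String): Option[String]
}

// Hypothetical registry that third-party packages could add repos to.
object DatasetRepos {
  private val repos = mutable.Map.empty[String, DatasetRepo]

  // Registering a repo under an existing name replaces the earlier entry.
  def register(repo: DatasetRepo): Unit = synchronized { repos(repo.name) = repo }
  def get(name: String): Option[DatasetRepo] = synchronized { repos.get(name) }
  def names: Seq[String] = synchronized { repos.keys.toSeq.sorted }
}

// Example repo backed by a fixed in-memory catalog; baseUrl is a placeholder.
class StaticRepo(val name: String, baseUrl: String, datasets: Seq[String])
    extends DatasetRepo {
  def ls(): Seq[String] = datasets
  def resolve(dataset: String): Option[String] =
    if (datasets.contains(dataset)) Some(s"$baseUrl/$dataset") else None
}
```

With something like this, a package would call `DatasetRepos.register(new StaticRepo("libsvm", "https://example.org/libsvm", Seq("rcv1_train.binary")))`, and a `loader.ls("libsvm")` or `loader.get("libsvm", ...)` call could delegate to `DatasetRepos.get("libsvm")`. Returning `Option` rather than throwing keeps an unknown repo or dataset an ordinary, checkable outcome.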