[jira] [Commented] (SPARK-10388) Public dataset loader interface

Xiangrui Meng (JIRA) Tue, 15 Sep 2015 09:58:28 -0700

    [ https://issues.apache.org/jira/browse/SPARK-10388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14745729#comment-14745729 ]


Xiangrui Meng commented on SPARK-10388:
---------------------------------------

[~lewuathe] Thanks for the discussion! I agree that it would be great to cache 
the data locally and to add other enhancements, but let's design an MVP first. 
Improvements can be done as follow-ups.

For example, I don't think JSON and ORC are commonly used for ML datasets; 
LIBSVM and CSV are more common. Either way, everything depends on fetching data 
over HTTP. A proper implementation would expose HTTP as a Hadoop FileSystem, 
though the initial version might not support file splits. A hacky 
implementation would be 
`sc.parallelize(Seq(1)).flatMap( ... // download and generate records)`.
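
To make the hacky variant concrete, here is a rough sketch of the function that could go inside the `flatMap`. It is only an illustration under my own assumptions: `LibSvmRecord`, `HttpDatasetSketch`, `parseLine`, and `download` are hypothetical names, not existing Spark APIs, and the parser assumes plain LIBSVM lines of the form `<label> <index>:<value> ...` with 1-based indices.

```scala
import scala.io.Source

// Hypothetical record type for one LIBSVM row; not part of any Spark API.
final case class LibSvmRecord(label: Double, indices: Array[Int], values: Array[Double])

object HttpDatasetSketch {

  // Parses one LIBSVM line: "<label> <index>:<value> ...".
  // Indices are assumed to be 1-based in the file and are converted to 0-based.
  def parseLine(line: String): LibSvmRecord = {
    val tokens = line.trim.split("\\s+")
    val (indices, values) = tokens.tail.map { t =>
      val sep = t.indexOf(':')
      (t.substring(0, sep).toInt - 1, t.substring(sep + 1).toDouble)
    }.unzip
    LibSvmRecord(tokens.head.toDouble, indices, values)
  }

  // Downloads `url` and returns its non-empty lines as parsed records.
  // The records are materialized before the underlying stream is closed.
  def download(url: String): Vector[LibSvmRecord] = {
    val source = Source.fromURL(url, "UTF-8")
    try source.getLines().map(_.trim).filter(_.nonEmpty).map(parseLine).toVector
    finally source.close()
  }
}
```

With a `SparkContext` named `sc`, the hacky version would then look roughly like `sc.parallelize(Seq(url), 1).flatMap(HttpDatasetSketch.download)`. The whole download runs in a single task, so the file is not split, which matches the limitation mentioned above.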

It would be great if you could help with the design. Please keep the features 
minimal. Thanks!

> Public dataset loader interface
> -------------------------------
>
>                 Key: SPARK-10388
>                 URL: https://issues.apache.org/jira/browse/SPARK-10388
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML
>            Reporter: Xiangrui Meng
>            Assignee: Xiangrui Meng
>
> It is very useful to have a public dataset loader to fetch ML datasets from 
> popular repos, e.g., libsvm and UCI. This JIRA is to discuss the design, 
> requirements, and initial implementation.
> {code}
> val loader = new DatasetLoader(sqlContext)
> val df = loader.get("libsvm", "rcv1_train.binary")
> {code}
> Users should be able to list (or preview) datasets, e.g.,
> {code}
> val datasets = loader.ls("libsvm") // returns a local DataFrame
> datasets.show() // list all datasets under libsvm repo
> {code}
> It would be nice to allow 3rd-party packages to register new repos. Both the 
> API and implementation are pending discussion. Note that this requires HTTP 
> and HTTPS support.



