[ 
https://issues.apache.org/jira/browse/SPARK-10388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15099118#comment-15099118
 ] 

Xiangrui Meng commented on SPARK-10388:
---------------------------------------

[~zjffdu] Thanks for posting the design doc! There might be some 
miscommunication in my description. We shouldn't assume any additional work on 
the server side. LIBSVM and UCI repos are out of our control, and we cannot 
mirror the repos and implement servers because of license issues and 
maintenance cost. We should only consider what we can do on the Spark side. 
Essentially, we need the following:

1) a catalog of public datasets
2) a way to fetch datasets into Spark (via HTTP/FTP)
3) a way to expand the catalog
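
To make the three pieces concrete, here is a minimal Scala sketch of a
Spark-side catalog. Every name in it (DatasetEntry, Catalog, ls, resolve,
register) is hypothetical and not an existing Spark API, and the URL used
in the usage note is a placeholder. The sketch only resolves a dataset to a
URL; the actual http/ftp download is left out.

```scala
// Hypothetical sketch: none of these types exist in Spark.
// One catalog row: which repo, which dataset, where to download it from.
final case class DatasetEntry(repo: String, name: String, url: String, format: String)

// Immutable catalog keyed by (repo, name).
final class Catalog private (entries: Map[(String, String), DatasetEntry]) {

  // 1) Catalog: list the datasets known for a repo, sorted by name.
  def ls(repo: String): Seq[DatasetEntry] =
    entries.values.filter(_.repo == repo).toSeq.sortBy(_.name)

  // 2) Fetch: resolve a dataset to the entry whose URL a loader would download.
  def resolve(repo: String, name: String): Option[DatasetEntry] =
    entries.get((repo, name))

  // 3) Expand: return a new catalog with the entry added;
  //    an entry with the same (repo, name) replaces the old one.
  def register(entry: DatasetEntry): Catalog =
    new Catalog(entries + ((entry.repo, entry.name) -> entry))
}

object Catalog {
  val empty: Catalog = new Catalog(Map.empty)
}
```

With this shape, a third-party package could extend the catalog by calling
Catalog.empty.register(DatasetEntry("libsvm", "rcv1_train.binary",
"http://example.org/rcv1_train.binary.bz2", "libsvm")) without any change
on the server side.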

For example, we can host the catalog as a resource file inside the Spark repo, 
but then it won't be updated frequently because of the Spark release cycle. Or 
we can put the catalog file on spark.apache.org, in which case we need to make 
sure it is compatible across Spark versions (or maintain a catalog file for 
each Spark release).
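
One possible way to handle the compatibility concern is to put a format
version in the catalog file and have each Spark release refuse catalogs it
does not understand. The sketch below assumes a made-up file layout (a
"format=<n>" header followed by "repo,name,url" rows) and hypothetical
names (CatalogFile, SupportedFormat); nothing here is an agreed design.

```scala
// Hypothetical sketch: the file layout and all names are assumptions.
object CatalogFile {
  // Highest catalog format this (hypothetical) Spark build understands.
  val SupportedFormat = 1

  // Parse catalog text whose first non-empty line is "format=<n>" and whose
  // remaining lines are "repo,name,url". Returns Left(reason) on any problem.
  // Note: this naive split does not support commas inside a URL.
  def parse(text: String): Either[String, Seq[(String, String, String)]] = {
    val lines = text.split("\n").map(_.trim).filter(_.nonEmpty).toList
    lines match {
      case header :: rows if header.startsWith("format=") =>
        scala.util.Try(header.stripPrefix("format=").toInt).toOption match {
          case Some(v) if v <= SupportedFormat =>
            val parsed = rows.map(_.split(",", -1).map(_.trim))
            parsed.find(_.length != 3) match {
              case Some(bad) => Left(s"malformed row: ${bad.mkString(",")}")
              case None      => Right(parsed.map(a => (a(0), a(1), a(2))))
            }
          case Some(v) =>
            Left(s"catalog format $v is newer than supported format $SupportedFormat")
          case None =>
            Left(s"unreadable header: $header")
        }
      case _ => Left("missing format header")
    }
  }
}
```

The alternative mentioned above, one catalog file per Spark release, would
avoid the version check entirely at the cost of maintaining several files.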

I think this should be the main focus of the design. Do you have time to check 
the details and update the doc? Thanks!



> Public dataset loader interface
> -------------------------------
>
>                 Key: SPARK-10388
>                 URL: https://issues.apache.org/jira/browse/SPARK-10388
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML
>            Reporter: Xiangrui Meng
>         Attachments: SPARK-10388PublicDataSetLoaderInterface.pdf
>
>
> It is very useful to have a public dataset loader to fetch ML datasets from 
> popular repos, e.g., libsvm and UCI. This JIRA is to discuss the design, 
> requirements, and initial implementation.
> {code}
> val loader = new DatasetLoader(sqlContext)
> val df = loader.get("libsvm", "rcv1_train.binary")
> {code}
> Users should be able to list (or preview) datasets, e.g.
> {code}
> val datasets = loader.ls("libsvm") // returns a local DataFrame
> datasets.show() // list all datasets under libsvm repo
> {code}
> It would be nice to allow 3rd-party packages to register new repos. Both the 
> API and implementation are pending discussion. Note that this requires http 
> and https support.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

