[ 
https://issues.apache.org/jira/browse/SPARK-10388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14802747#comment-14802747
 ] 

Kai Sasaki commented on SPARK-10388:
------------------------------------

[~mengxr] I totally agree with you. The initial version should be minimal and 
simple, so my previous suggestions were only nice-to-have features. In that 
sense, the initial proposal might be sufficient as an MVP. 
{quote}
For example, I don't think json and orc are commonly used for ML datasets.
{quote}
Yes, json and orc are not commonly used for machine learning data. I just think 
the public dataset loader should be open to later extension, meaning that other 
dataset formats can be added as plugins. 
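
As a rough sketch of what I mean by a format plugin: the names below 
({{DatasetFormat}}, {{FormatRegistry}}, {{LibSVMFormat}}) are hypothetical and 
not part of any proposed API, and the parser works on plain Scala collections 
rather than DataFrames just to keep the example self-contained.

```scala
import scala.collection.mutable

// Hypothetical plugin interface: parses raw text lines into
// (label, sparse features) rows. Not an existing Spark API.
trait DatasetFormat {
  def name: String
  def parse(lines: Iterator[String]): Seq[(Double, Map[Int, Double])]
}

// Example plugin for LIBSVM-style text: "<label> <index>:<value> ..."
object LibSVMFormat extends DatasetFormat {
  val name = "libsvm"

  def parse(lines: Iterator[String]): Seq[(Double, Map[Int, Double])] =
    lines.map(_.trim).filter(_.nonEmpty).map { line =>
      val tokens = line.split("\\s+")
      val features = tokens.tail.map { token =>
        val Array(index, value) = token.split(":")
        index.toInt -> value.toDouble
      }.toMap
      (tokens.head.toDouble, features)
    }.toList
}

// Hypothetical registry that 3rd-party packages could add formats to.
object FormatRegistry {
  private val formats = mutable.Map.empty[String, DatasetFormat]

  def register(format: DatasetFormat): Unit =
    formats.update(format.name, format)

  def lookup(name: String): Option[DatasetFormat] = formats.get(name)
}

object FormatRegistryDemo {
  def main(args: Array[String]): Unit = {
    FormatRegistry.register(LibSVMFormat)
    val sample = Iterator("+1 1:0.5 3:1.2", "-1 2:2.0")
    // Look the format up by name, as a loader might do internally.
    FormatRegistry.lookup("libsvm").foreach { format =>
      format.parse(sample).foreach(println)
    }
  }
}
```

The MVP could ship with only the libsvm format registered, and other formats 
could be added later without changing the loader itself.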
{quote}
A proper implementation would be implementing HTTP as a Hadoop FileSystem.
{quote}
Does that mean a public dataset could be loaded as an RDD directly? For example, 
could we write {{val data = sc.textFile(/* public dataset url */)}}?

> Public dataset loader interface
> -------------------------------
>
>                 Key: SPARK-10388
>                 URL: https://issues.apache.org/jira/browse/SPARK-10388
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML
>            Reporter: Xiangrui Meng
>            Assignee: Xiangrui Meng
>
> It is very useful to have a public dataset loader to fetch ML datasets from 
> popular repos, e.g., libsvm and UCI. This JIRA is to discuss the design, 
> requirements, and initial implementation.
> {code}
> val loader = new DatasetLoader(sqlContext)
> val df = loader.get("libsvm", "rcv1_train.binary")
> {code}
> Users should be able to list (or preview) datasets, e.g.
> {code}
> val datasets = loader.ls("libsvm") // returns a local DataFrame
> datasets.show() // list all datasets under libsvm repo
> {code}
> It would be nice to allow 3rd-party packages to register new repos. Both the 
> API and implementation are pending discussion. Note that this requires http 
> and https support.


