[ https://issues.apache.org/jira/browse/SPARK-7230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14520424#comment-14520424 ]
Antonio Piccolboni commented on SPARK-7230:
-------------------------------------------

plyrmr on Spark depends on the RDD API and has hundreds of downloads per month. We also have an experimental doParallelSpark that interfaces Spark with foreach, and hence with 50+ R packages, including mainstream ones like caret. I know the current API wasn't meant to be stable, but retiring the whole thing is, I think, a declaration of war on the people developing on top of it. Sure, the proposed changes and the DataFrame API will give you more mainstream appeal, no argument there, but as far as appealing to developers goes, it is an unambiguous F-you directed at them. rmr2 is a package that interfaces R with MapReduce at a similar level of abstraction as SparkR; it has thousands of downloads per month and a commercial product based on it.

> Make RDD API private in SparkR for Spark 1.4
> --------------------------------------------
>
>                 Key: SPARK-7230
>                 URL: https://issues.apache.org/jira/browse/SPARK-7230
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SparkR
>    Affects Versions: 1.4.0
>            Reporter: Shivaram Venkataraman
>            Assignee: Shivaram Venkataraman
>            Priority: Critical
>
> This ticket proposes making the RDD API in SparkR private for the 1.4 release. The motivation for doing so is discussed in a larger design document aimed at a more top-down design of the SparkR APIs. A first cut that discusses the motivation and proposed changes can be found at http://goo.gl/GLHKZI
> The main points in that document that relate to this ticket are:
> - The RDD API requires knowledge of the distributed system and is pretty low-level. This is not very suitable for many R users, who are used to higher-level packages that work out of the box.
> - The RDD implementation in SparkR is not fully robust right now: we are missing features such as spilling for aggregation and handling partitions that don't fit in memory. There are further limitations, such as the lack of a hashCode for non-native types, which may affect the user experience.
> The only change we will make for now is to not export the RDD functions as public methods in the SparkR package; I will create another ticket for discussing the details of the public API for 1.5.
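As a minimal sketch (not part of the ticket) of what "not exporting the RDD functions" means in practice: the functions would remain inside the SparkR namespace but disappear from the exported API, so downstream packages such as plyrmr would have to reach them through R's ':::' operator. The function names used below (sparkR.init, parallelize, lapply, collect) follow the 1.4-era SparkR RDD API; treat the exact signatures as assumptions.

# Illustrative sketch only: the effect of removing the RDD functions from the
# SparkR NAMESPACE exports, not a prescription from the ticket.
library(SparkR)
sc <- sparkR.init(master = "local")                 # start a local SparkContext

# Before this change the RDD functions were exported and callable directly:
#   rdd <- parallelize(sc, 1:100, 4L)
# Afterwards they stay inside the package but must be reached via ':::',
# which is what packages built on the RDD API would have to do:
rdd     <- SparkR:::parallelize(sc, 1:100, 4L)      # distribute a local vector
squares <- SparkR:::lapply(rdd, function(x) x * x)  # transform each element
SparkR:::collect(squares)                           # bring results back as an R list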