Reynold Xin created SPARK-7025:
----------------------------------

             Summary: Create a Java-friendly input source API
                 Key: SPARK-7025
                 URL: https://issues.apache.org/jira/browse/SPARK-7025
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core
            Reporter: Reynold Xin
            Assignee: Reynold Xin


The goal of this ticket is to create a simple input source API that we can 
maintain and support long term.

Spark currently has two de facto input source APIs:
1. RDD API
2. Hadoop MapReduce InputFormat API

Neither of the above is ideal:

1. RDD: It is hard for Java developers to implement an RDD, given the implicit 
class tags. In addition, the RDD API depends on Scala's runtime library, which 
does not preserve binary compatibility across Scala versions. If a developer 
chooses Java to implement an input source, that input source should ideally 
remain binary compatible for years to come.

2. Hadoop InputFormat: The Hadoop InputFormat API is overly restrictive. For 
example, it forces key-value semantics, and does not support running arbitrary 
code on the driver side (an example of why this is useful is broadcast). In 
addition, it is somewhat awkward to tell developers that in order to implement 
an input source for Spark, they should learn the Hadoop MapReduce API first.


So here's the proposal:

An InputSource is described by:
* an array of InputPartition that specifies the data partitioning
* a RecordReader that specifies how data on each partition can be read

This interface would be similar to Hadoop's InputFormat, except that there is 
no explicit key/value separation.
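
To make the proposal a bit more concrete, here is a rough Java sketch of what 
such an interface could look like. This is only an illustration: the method 
names (getPartitions, createRecordReader, hasNext/next), the RangeSource toy 
implementation, and the readAll helper are all assumptions made for this 
example and are not part of any existing or agreed-upon Spark API.

```java
import java.io.Closeable;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical sketch: none of these names or signatures are a committed Spark API.
class InputSourceSketch {

    /** One slice of the input. Serializable so it could be shipped to executors. */
    interface InputPartition extends Serializable {
        int index();
    }

    /** Reads the records of one partition; a record is a single value, not a key/value pair. */
    interface RecordReader<T> extends Closeable {
        boolean hasNext();
        T next();
        @Override
        void close(); // narrowed: no checked exception in this sketch
    }

    /** An input source: an array of partitions plus a way to read each one. */
    interface InputSource<T> extends Serializable {
        InputPartition[] getPartitions();
        RecordReader<T> createRecordReader(InputPartition partition);
    }

    /** Toy partition covering the half-open integer range [start, end). */
    static final class RangePartition implements InputPartition {
        final int index, start, end;
        RangePartition(int index, int start, int end) {
            this.index = index; this.start = start; this.end = end;
        }
        @Override public int index() { return index; }
    }

    /** Toy source producing 0..total-1, split into numPartitions contiguous ranges. */
    static final class RangeSource implements InputSource<Integer> {
        private final int total, numPartitions;
        RangeSource(int total, int numPartitions) {
            this.total = total; this.numPartitions = numPartitions;
        }

        @Override public InputPartition[] getPartitions() {
            InputPartition[] parts = new InputPartition[numPartitions];
            for (int i = 0; i < numPartitions; i++) {
                int start = (int) ((long) i * total / numPartitions);
                int end = (int) ((long) (i + 1) * total / numPartitions);
                parts[i] = new RangePartition(i, start, end);
            }
            return parts;
        }

        @Override public RecordReader<Integer> createRecordReader(InputPartition partition) {
            final RangePartition p = (RangePartition) partition;
            return new RecordReader<Integer>() {
                private int cursor = p.start;
                @Override public boolean hasNext() { return cursor < p.end; }
                @Override public Integer next() {
                    if (!hasNext()) throw new NoSuchElementException();
                    return cursor++;
                }
                @Override public void close() { }
            };
        }
    }

    /** Stand-in for the engine: reads every partition in order on the local JVM. */
    static <T> List<T> readAll(InputSource<T> source) {
        List<T> out = new ArrayList<>();
        for (InputPartition p : source.getPartitions()) {
            try (RecordReader<T> reader = source.createRecordReader(p)) {
                while (reader.hasNext()) out.add(reader.next());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
        System.out.println(readAll(new RangeSource(10, 3)));
    }
}
```

Note that the sketch uses only plain Java interfaces and java.io.Serializable, 
with no Scala types or class tags, which is the property the ticket is after. 
In a real implementation the engine, not a local loop like readAll, would 
decide where each partition's reader runs.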




