Reynold Xin created SPARK-7025:
----------------------------------

             Summary: Create a Java-friendly input source API
                 Key: SPARK-7025
                 URL: https://issues.apache.org/jira/browse/SPARK-7025
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core
            Reporter: Reynold Xin
            Assignee: Reynold Xin

The goal of this ticket is to create a simple input source API that we can maintain and support long term.

Spark currently has two de facto input source APIs:
1. RDD API
2. Hadoop MapReduce InputFormat API

Neither of the above is ideal:

1. RDD: It is hard for Java developers to implement an RDD, given the implicit class tags. In addition, the RDD API depends on Scala's runtime library, which does not preserve binary compatibility across Scala versions. If a developer chooses Java to implement an input source, it would be great if that input source could remain binary compatible for years to come.

2. Hadoop InputFormat: The Hadoop InputFormat API is overly restrictive. For example, it forces key-value semantics and does not support running arbitrary code on the driver side (broadcast is one example of why this is useful). In addition, it is somewhat awkward to tell developers that in order to implement an input source for Spark, they should learn the Hadoop MapReduce API first.

So here's the proposal:

An InputSource is described by:
* an array of InputPartition that specifies the data partitioning
* a RecordReader that specifies how data on each partition can be read

This interface would be similar to Hadoop's InputFormat, except that there is no explicit key/value separation.
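
To make the proposal a bit more concrete, here is a rough Java sketch of the three pieces it names (InputSource, InputPartition, RecordReader). Only those three names come from the ticket; every method name, the generic record type T, the Serializable/Closeable choices, and the toy RangeInputSource are assumptions made up for illustration, not a settled Spark API.

```java
import java.io.Closeable;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;
import java.util.NoSuchElementException;

// Illustrative sketch only: all signatures below are assumptions, not Spark's API.
final class InputSourceSketch {

    /** Describes one slice of the input; assumed Serializable so it can be shipped to executors. */
    interface InputPartition extends Serializable {
        int index();
    }

    /** Reads the records of one partition. Records are plain values: no key/value split. */
    interface RecordReader<T> extends Closeable {
        boolean hasNext();
        T next();
        @Override
        void close(); // narrowed so callers need not handle IOException
    }

    /** An input source: an array of partitions plus a way to read each one. */
    interface InputSource<T> extends Serializable {
        InputPartition[] getPartitions();
        RecordReader<T> createRecordReader(InputPartition partition);
    }

    /** Hypothetical partition covering the half-open range [from, until). */
    static final class RangePartition implements InputPartition {
        final int index;
        final long from;
        final long until;

        RangePartition(int index, long from, long until) {
            this.index = index;
            this.from = from;
            this.until = until;
        }

        @Override
        public int index() {
            return index;
        }
    }

    /** Toy source producing the longs in [start, end), split into numPartitions ranges. */
    static final class RangeInputSource implements InputSource<Long> {
        private final long start;
        private final long end;
        private final int numPartitions;

        RangeInputSource(long start, long end, int numPartitions) {
            this.start = start;
            this.end = end;
            this.numPartitions = numPartitions;
        }

        @Override
        public InputPartition[] getPartitions() {
            long total = end - start;
            InputPartition[] parts = new InputPartition[numPartitions];
            for (int i = 0; i < numPartitions; i++) {
                long from = start + total * i / numPartitions;
                long until = start + total * (i + 1) / numPartitions;
                parts[i] = new RangePartition(i, from, until);
            }
            return parts;
        }

        @Override
        public RecordReader<Long> createRecordReader(InputPartition partition) {
            final RangePartition range = (RangePartition) partition;
            return new RecordReader<Long>() {
                private long cursor = range.from;

                @Override
                public boolean hasNext() {
                    return cursor < range.until;
                }

                @Override
                public Long next() {
                    if (!hasNext()) {
                        throw new NoSuchElementException();
                    }
                    return cursor++;
                }

                @Override
                public void close() {
                    // nothing to release for an in-memory range
                }
            };
        }
    }

    /** Reads every partition in order, in-process; stands in for what an engine would distribute. */
    static <T> List<T> readAll(InputSource<T> source) {
        List<T> out = new ArrayList<>();
        for (InputPartition partition : source.getPartitions()) {
            try (RecordReader<T> reader = source.createRecordReader(partition)) {
                while (reader.hasNext()) {
                    out.add(reader.next());
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(readAll(new RangeInputSource(0, 10, 3)));
    }
}
```

Nothing here touches Scala types or class tags, and the reader yields plain records rather than key/value pairs, which are the two pain points listed above. Since getPartitions() would presumably run on the driver, an implementation could execute arbitrary setup code there before any partition is read.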