[ https://issues.apache.org/jira/browse/SPARK-15466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Apache Spark reassigned SPARK-15466:
------------------------------------

    Assignee:     (was: Apache Spark)

> Make `SparkSession` as the entry point to programming with RDD too
> ------------------------------------------------------------------
>
>                 Key: SPARK-15466
>                 URL: https://issues.apache.org/jira/browse/SPARK-15466
>             Project: Spark
>          Issue Type: Improvement
>          Components: Examples, SQL
>            Reporter: Dongjoon Hyun
>
> `SparkSession` greatly reduces the number of concepts a Spark user must know.
> Currently, `SparkSession` is defined as the entry point for programming Spark
> with the Dataset and DataFrame APIs, and an `RDD` can easily be obtained by
> calling `Dataset.rdd` or `DataFrame.rdd`, too.
> However, many usages (including the bundled examples) extract
> `SparkSession.sparkContext` and keep it in a separate variable just to call
> `parallelize`. If `SparkSession` supported RDDs seamlessly too, it would
> improve usability. We can do this by simply adding a `parallelize` API.
> **Example**
> {code:title=SparkPi.scala|borderStyle=solid}
> object SparkPi {
>   def main(args: Array[String]) {
>     val spark = SparkSession
>       .builder
>       .appName("Spark Pi")
>       .getOrCreate()
> -   val sc = spark.sparkContext
>     val slices = if (args.length > 0) args(0).toInt else 2
>     val n = math.min(100000L * slices, Int.MaxValue).toInt // avoid overflow
> -   val count = sc.parallelize(1 until n, slices).map { i =>
> +   val count = spark.parallelize(1 until n, slices).map { i =>
>       val x = random * 2 - 1
>       val y = random * 2 - 1
>       if (x*x + y*y < 1) 1 else 0
>     }.reduce(_ + _)
>     println("Pi is roughly " + 4.0 * count / n)
>     spark.stop()
>   }
> }
> {code}
> {code:title=pi.py|borderStyle=solid}
> spark = SparkSession\
>     .builder\
>     .appName("PythonPi")\
>     .getOrCreate()
> -sc = spark._sc
> -
> partitions = int(sys.argv[1]) if len(sys.argv) > 1 else 2
> n = 100000 * partitions
>
> def f(_):
>     x = random() * 2 - 1
>     y = random() * 2 - 1
>     return 1 if x ** 2 + y ** 2 < 1 else 0
>
> -count = sc.parallelize(range(1, n + 1), partitions).map(f).reduce(add)
> +count = spark.parallelize(range(1, n + 1), partitions).map(f).reduce(add)
> print("Pi is roughly %f" % (4.0 * count / n))
> spark.stop()
> {code}
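> A minimal sketch of what the proposed delegate could look like inside
> `SparkSession` (hypothetical; the exact placement and signature would be
> decided in review, here simply mirroring `SparkContext.parallelize`):
> {code:title=SparkSession.scala (sketch)|borderStyle=solid}
> import scala.reflect.ClassTag
> import org.apache.spark.rdd.RDD
>
> // Hypothetical addition to SparkSession: forward to the underlying
> // SparkContext so callers never need to extract it themselves.
> def parallelize[T: ClassTag](
>     seq: Seq[T],
>     numSlices: Int = sparkContext.defaultParallelism): RDD[T] = {
>   sparkContext.parallelize(seq, numSlices)
> }
> {code}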