Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14467#discussion_r74857555
  
    --- Diff: python/pyspark/context.py ---
    @@ -173,9 +173,8 @@ def _do_init(self, master, appName, sparkHome, pyFiles, environment, batchSize,
             # they will be passed back to us through a TCP server
             self._accumulatorServer = accumulators._start_update_server()
             (host, port) = self._accumulatorServer.server_address
    -        self._javaAccumulator = self._jsc.accumulator(
    -            self._jvm.java.util.ArrayList(),
    -            self._jvm.PythonAccumulatorParam(host, port))
    +        self._javaAccumulator = self._jvm.PythonAccumulatorV2(host, port)
    +        self._jsc.sc().register(self._javaAccumulator)
    --- End diff --
    
    So in general you would have one SparkContext and many RDDs. The accumulator here doesn't represent a specific accumulator; rather, it is the general mechanism that all of the Python accumulators are built on top of. The design is certainly a bit confusing if you try to think of it as a regular accumulator - I found it helped to look at how the Scala side "merge" is implemented.
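    To make the pointer to the Scala-side "merge" concrete, here is a minimal, hedged sketch of the shape such an accumulator takes. The class name `PythonAccumulatorSketch` and the wire format are illustrative assumptions, not the actual `PythonAccumulatorV2` source: the idea is that one registered accumulator collects the pickled updates produced by Python workers, and its driver-side `merge` ships them back to the accumulator server that `_start_update_server()` opened in context.py, where they are applied to the real Python accumulators.

```scala
import java.io.{BufferedOutputStream, DataOutputStream}
import java.net.Socket
import java.util.{ArrayList => JArrayList, List => JList}

import org.apache.spark.util.AccumulatorV2

// Sketch of a collection-style AccumulatorV2 that funnels Python accumulator
// updates (already pickled into byte arrays by the Python workers) back to the
// driver-side Python process over TCP. Illustrative only, not Spark's source.
class PythonAccumulatorSketch(host: String, port: Int)
  extends AccumulatorV2[Array[Byte], JList[Array[Byte]]] {

  private val buffer = new JArrayList[Array[Byte]]()

  override def isZero: Boolean = buffer.isEmpty

  override def copy(): PythonAccumulatorSketch = {
    val copied = new PythonAccumulatorSketch(host, port)
    copied.buffer.addAll(buffer)
    copied
  }

  override def reset(): Unit = buffer.clear()

  // On executors, tasks simply append pickled update blobs.
  override def add(v: Array[Byte]): Unit = buffer.add(v)

  override def value: JList[Array[Byte]] = buffer

  // On the driver, merging a task's local copy forwards its updates to the
  // accumulator server started by _start_update_server() in context.py.
  // The wire format below is purely illustrative.
  override def merge(other: AccumulatorV2[Array[Byte], JList[Array[Byte]]]): Unit = {
    val socket = new Socket(host, port)
    val out = new DataOutputStream(new BufferedOutputStream(socket.getOutputStream))
    try {
      val updates = other.value
      out.writeInt(updates.size())
      val it = updates.iterator()
      while (it.hasNext) {
        val update = it.next()
        out.writeInt(update.length)
        out.write(update)
      }
      out.flush()
    } finally {
      socket.close()
    }
  }
}
```

    This is also why only one JVM-side accumulator needs to be registered per SparkContext in the diff above: it acts as a transport for updates rather than a per-accumulator value holder.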

