[ https://issues.apache.org/jira/browse/SPARK-3992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Walter Bogorad updated SPARK-3992:
----------------------------------
Description:

A dictionary accumulator defined as a global variable is not visible inside a function called by "foreach()".

Here is the minimal code snippet:

    from collections import defaultdict
    from pyspark import SparkContext
    from pyspark.accumulators import AccumulatorParam

    class DictAccumParam(AccumulatorParam):
        def zero(self, value):
            value.clear()

        def addInPlace(self, val1, val2):
            return val1

    sc = SparkContext("local", "Dict Accumulator Bug")
    va = sc.accumulator(defaultdict(int), DictAccumParam())

    def foo(x):
        global va
        print "va is:", va

    rdd = sc.parallelize([1, 2, 3]).foreach(foo)

When run, the code snippet produced the following output:

    ...
    va is: None
    va is: None
    va is: None
    ...

I have verified that global variables are visible inside foo() when it is called by foreach() only if they hold scalars or lists, as in the API doc at http://spark.apache.org/docs/latest/api/python/. The problem exists with standard dictionaries and other collections. I also verified that if foo() is called directly, i.e. outside foreach(), the global variables are visible as expected.

> Spark 1.1.0 python binding cannot use any collections but list as Accumulators
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-3992
>                 URL: https://issues.apache.org/jira/browse/SPARK-3992
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 1.1.0
>         Environment: 3.13.0-36-generic #63-Ubuntu SMP Wed Sep 3 21:30:07 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
>            Reporter: Walter Bogorad
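A note on the snippet above: AccumulatorParam.zero() is expected to return the zero value, but value.clear() returns None, so zero() implicitly returns None; worker-side copies of an accumulator are re-created from zero(), which would explain why foo() prints "va is: None". Below is a sketch of a dict accumulator param whose zero() returns a fresh dict and whose addInPlace() merges per-key counts. It is written as a plain class mirroring the AccumulatorParam interface so it runs without a Spark installation; in a real job it would subclass pyspark.accumulators.AccumulatorParam.

```python
from collections import defaultdict

class DictAccumParam(object):
    # Mirrors the pyspark.accumulators.AccumulatorParam interface;
    # in a real job this would subclass AccumulatorParam.

    def zero(self, value):
        # Must RETURN the zero value. In the reported snippet,
        # value.clear() returns None, so every worker-side copy of
        # the accumulator starts out as None.
        return defaultdict(int)

    def addInPlace(self, d1, d2):
        # Merge the per-key counts of d2 into d1 and return d1.
        for key, count in d2.items():
            d1[key] += count
        return d1
```

On the driver one would create the accumulator with `va = sc.accumulator(defaultdict(int), DictAccumParam())` and, inside the function passed to foreach(), call `va.add({x: 1})` rather than printing va: worker tasks can only add to an accumulator, and the merged value is read back on the driver via `va.value`.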
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org