Dear All,

I'm trying to implement a procedure that iteratively updates an RDD using the results of GaussianMixtureModel.predictSoft. To avoid problems with the local variable (the obtained GMM) being overwritten in each pass of the loop, I'm doing the following:

#######################################################
for i in xrange(10):
    gmm = GaussianMixture.train(rdd, 2)

    def getSafePredictor(unsafeGMM):
        return lambda x: \
            (unsafeGMM.predictSoft(x.features),
             [g.mu for g in unsafeGMM.gaussians])

    safePredictor = getSafePredictor(gmm)
    predictionsRDD = (labelledpointrddselectedfeatsNansPatched
          .map(safePredictor)
    )
    print predictionsRDD.take(1)
    (... - rest of code - update rdd with results from predictionsRdd)
#######################################################

Unfortunately this ends with:

#######################################################
Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
#######################################################

Any idea why I'm getting this behaviour? My expectation would be that the GMM should be a "simple" object without a SparkContext in it. I'm using Spark 1.5.2.

 Thanks,
   Tomasz


PS As a workaround, I'm currently doing

########################
    def getSafeGMM(unsafeGMM):
        return lambda x: unsafeGMM.predictSoft(x)

    safeGMM = getSafeGMM(gmm)
    predictionsRDD = \
        safeGMM(labelledpointrddselectedfeatsNansPatched.map(lambda x: x.features))
########################
which works fine. If possible, I would like to avoid this approach, since it would require another closure over gmm.gaussians later in my code.
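
For completeness, one alternative I've been considering is to copy the model's parameters (weights and each component's mu/sigma) out of the GMM on the driver, and then compute the soft assignments with plain NumPy inside the map closure, so that nothing wrapping a SparkContext is ever captured. A rough sketch (the helper names gaussian_pdf and predict_soft are mine, not from MLlib):

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    # Multivariate normal density at x, computed with plain NumPy
    # (no Spark objects involved).
    d = len(mu)
    diff = x - mu
    inv = np.linalg.inv(sigma)
    norm = 1.0 / np.sqrt(((2 * np.pi) ** d) * np.linalg.det(sigma))
    return norm * np.exp(-0.5 * diff.dot(inv).dot(diff))

def predict_soft(x, weights, mus, sigmas):
    # Posterior component memberships for a single point, analogous to
    # what predictSoft returns: weighted densities, normalized to sum to 1.
    p = np.array([w * gaussian_pdf(x, m, s)
                  for w, m, s in zip(weights, mus, sigmas)])
    return p / p.sum()
```

On the driver one would extract weights = list(gmm.weights), mus = [g.mu for g in gmm.gaussians], sigmas = [g.sigma for g in gmm.gaussians], and then map a closure over predict_soft; the closure only captures plain NumPy arrays, so serialization to the workers should be unproblematic.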

