[ https://issues.apache.org/jira/browse/SPARK-24447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16505335#comment-16505335 ]
Perry Chu commented on SPARK-24447:
-----------------------------------

Thanks for following up. I tried running on a fresh Spark 2.3 download, and it worked for me too. After testing a few things, I realized there's one more thing I'm doing that causes the error: within pyspark, I stop and restart the Spark session to adjust some config settings. Could you try the following?

{code}
## Do this before running the code snippet from the issue description
import pyspark

spark.stop()
spark = pyspark.sql.SparkSession.builder.getOrCreate()

## The code snippet should now reproduce the error
{code}

So it looks like the RDD returned by RowMatrix.columnSimilarities() still refers to the original SparkSession (the one I stopped) and doesn't have a handle on the new one.

I'm not sure this counts as a bug anymore. The obvious workaround is to set my config on the command line or in the Spark conf rather than stopping and restarting Spark. Still, it seems strange to me that Spark appears to do all the work for RowMatrix.columnSimilarities() (I see tasks executing in the UI, and sims.numCols() works), but then simply can't display the resulting matrix!
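For reference, the workaround I mean looks like this: apply the settings up front, either on the command line or through the session builder before the session first exists, instead of restarting it later. A minimal sketch (the config key is just an example, not one of my actual settings):

{code}
## In the shell, pass settings on the command line instead of restarting, e.g.
##   pyspark --conf spark.sql.shuffle.partitions=64
##
## In a script, pass settings to the builder before the session is first
## created. The config key below is only an example.
import pyspark

spark = (
    pyspark.sql.SparkSession.builder
    .config("spark.sql.shuffle.partitions", "64")
    .getOrCreate()
)

## With the session configured at creation, the columnSimilarities()
## snippet runs without any stop()/getOrCreate() dance.
{code}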
> Pyspark RowMatrix.columnSimilarities() loses spark context
> ----------------------------------------------------------
>
>                 Key: SPARK-24447
>                 URL: https://issues.apache.org/jira/browse/SPARK-24447
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib, PySpark
>    Affects Versions: 2.3.0
>            Reporter: Perry Chu
>            Priority: Major
>
> The RDD behind the CoordinateMatrix returned by RowMatrix.columnSimilarities() appears to be losing track of the Spark context.
> I'm pretty new to Spark - not sure if the problem is on the Python side or the Scala side - would appreciate someone more experienced taking a look.
> This snippet should reproduce the error:
> {code:java}
> from pyspark.mllib.linalg.distributed import RowMatrix
> rows = spark.sparkContext.parallelize([[0,1,2],[1,1,1]])
> matrix = RowMatrix(rows)
> sims = matrix.columnSimilarities()
> ## This works, prints "3 3" as expected (3 columns = 3x3 matrix)
> print(sims.numRows(), sims.numCols())
> ## This throws an error (stack trace below)
> print(sims.entries.first())
> ## Later I tried this
> print(rows.context)         # <SparkContext master=yarn appName=Spark ML Pipeline>
> print(sims.entries.context) # <SparkContext master=yarn appName=PySparkShell>, then throws an error
> {code}
> Error stack trace:
> {code:java}
> ---------------------------------------------------------------------------
> AttributeError                            Traceback (most recent call last)
> <ipython-input-47-50f83a6cf449> in <module>()
> ----> 1 sims.entries.first()
>
> /usr/lib/spark/python/pyspark/rdd.py in first(self)
>    1374         ValueError: RDD is empty
>    1375         """
> -> 1376         rs = self.take(1)
>    1377         if rs:
>    1378             return rs[0]
>
> /usr/lib/spark/python/pyspark/rdd.py in take(self, num)
>    1356
>    1357             p = range(partsScanned, min(partsScanned + numPartsToTry, totalParts))
> -> 1358             res = self.context.runJob(self, takeUpToNumLeft, p)
>    1359
>    1360             items += res
>
> /usr/lib/spark/python/pyspark/context.py in runJob(self, rdd, partitionFunc, partitions, allowLocal)
>     999         # SparkContext#runJob.
>    1000         mappedRDD = rdd.mapPartitions(partitionFunc)
> -> 1001         port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions)
>    1002         return list(_load_from_socket(port, mappedRDD._jrdd_deserializer))
>    1003
>
> AttributeError: 'NoneType' object has no attribute 'sc'
> {code}
> PySpark columnSimilarities documentation:
> http://spark.apache.org/docs/latest/api/python/_modules/pyspark/mllib/linalg/distributed.html#RowMatrix.columnSimilarities
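One more data point: the AttributeError at the bottom of that trace isn't specific to MLlib. Any RDD handle that outlives its SparkContext fails the same way, because stop() clears the context's _jsc handle, which runJob() then dereferences. A minimal sketch (local master and toy data, purely illustrative):

{code}
## Illustrative only: hold an RDD across a session restart, then act on it.
from pyspark.sql import SparkSession

old = SparkSession.builder.master("local[2]").getOrCreate()
stale = old.sparkContext.parallelize([[0, 1, 2], [1, 1, 1]])

old.stop()
spark = SparkSession.builder.master("local[2]").getOrCreate()

## stale.context is still the stopped SparkContext, whose _jsc is now None,
## so any action on the RDD fails before a job is even submitted.
try:
    stale.first()
except AttributeError as e:
    print(e)  ## 'NoneType' object has no attribute 'sc'
{code}

The difference in my case is that I never held a stale RDD on purpose: sims.entries came back from columnSimilarities() already bound to the stopped context.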