[ https://issues.apache.org/jira/browse/SPARK-24447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16505335#comment-16505335 ]
Perry Chu commented on SPARK-24447:
-----------------------------------

Thanks for following up. I tried running on a fresh Spark 2.3 download, and it worked for me too. After testing a few things, I realized there's one more thing I'm doing that causes the error: within pyspark, I stop and restart the Spark session to adjust some config settings. Could you try the following?

{code}
## Do this before running the code snippet from the issue description
import pyspark

spark.stop()
spark = pyspark.sql.SparkSession.builder.getOrCreate()

## The code snippet should now reproduce the error
{code}

So it looks like the RDD returned by RowMatrix.columnSimilarities() still refers to the original SparkSession (the one I stopped) and doesn't have a handle on the new one.

I'm not sure this counts as a bug anymore. The obvious workaround is to set my config on the command line or in the Spark conf rather than stopping and restarting Spark. Still, it seems strange to me that Spark appears to do all the work for RowMatrix.columnSimilarities() (I see tasks executing in the UI, and sims.numCols() works), but then simply can't display the resulting matrix!
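For reference, the workaround I mean looks like this: apply the settings up front, either on the command line or through the session builder before the session first exists, instead of restarting it later. A minimal sketch (the config key is just an example, not one of my actual settings):

{code}
## In the shell, pass settings on the command line instead of restarting, e.g.
##   pyspark --conf spark.sql.shuffle.partitions=64
##
## In a script, pass settings to the builder before the session is first
## created. The config key below is only an example.
import pyspark

spark = (
    pyspark.sql.SparkSession.builder
    .config("spark.sql.shuffle.partitions", "64")
    .getOrCreate()
)

## With the session configured at creation, the columnSimilarities()
## snippet runs without any stop()/getOrCreate() dance.
{code}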
> Pyspark RowMatrix.columnSimilarities() loses spark context
> ----------------------------------------------------------
>
>                 Key: SPARK-24447
>                 URL: https://issues.apache.org/jira/browse/SPARK-24447
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib, PySpark
>    Affects Versions: 2.3.0
>            Reporter: Perry Chu
>            Priority: Major
>
> The RDD behind the CoordinateMatrix returned by RowMatrix.columnSimilarities() appears to be losing track of the Spark context.
> I'm pretty new to Spark - not sure if the problem is on the Python side or the Scala side - would appreciate someone more experienced taking a look.
> This snippet should reproduce the error:
> {code:java}
> from pyspark.mllib.linalg.distributed import RowMatrix
> rows = spark.sparkContext.parallelize([[0,1,2],[1,1,1]])
> matrix = RowMatrix(rows)
> sims = matrix.columnSimilarities()
> ## This works, prints "3 3" as expected (3 columns = 3x3 matrix)
> print(sims.numRows(), sims.numCols())
> ## This throws an error (stack trace below)
> print(sims.entries.first())
> ## Later I tried this
> print(rows.context)         # <SparkContext master=yarn appName=Spark ML Pipeline>
> print(sims.entries.context) # <SparkContext master=yarn appName=PySparkShell>, then throws an error
> {code}
> Error stack trace:
> {code:java}
> ---------------------------------------------------------------------------
> AttributeError                            Traceback (most recent call last)
> <ipython-input-47-50f83a6cf449> in <module>()
> ----> 1 sims.entries.first()
>
> /usr/lib/spark/python/pyspark/rdd.py in first(self)
>    1374         ValueError: RDD is empty
>    1375         """
> -> 1376         rs = self.take(1)
>    1377         if rs:
>    1378             return rs[0]
>
> /usr/lib/spark/python/pyspark/rdd.py in take(self, num)
>    1356
>    1357             p = range(partsScanned, min(partsScanned + numPartsToTry, totalParts))
> -> 1358             res = self.context.runJob(self, takeUpToNumLeft, p)
>    1359
>    1360             items += res
>
> /usr/lib/spark/python/pyspark/context.py in runJob(self, rdd, partitionFunc, partitions, allowLocal)
>     999         # SparkContext#runJob.
>    1000         mappedRDD = rdd.mapPartitions(partitionFunc)
> -> 1001         port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions)
>    1002         return list(_load_from_socket(port, mappedRDD._jrdd_deserializer))
>    1003
>
> AttributeError: 'NoneType' object has no attribute 'sc'
> {code}
> PySpark columnSimilarities documentation:
> http://spark.apache.org/docs/latest/api/python/_modules/pyspark/mllib/linalg/distributed.html#RowMatrix.columnSimilarities
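One more data point: the AttributeError at the bottom of that trace isn't specific to MLlib. Any RDD handle that outlives its SparkContext fails the same way, because stop() clears the context's _jsc handle, which runJob() then dereferences. A minimal sketch (local master and toy data, purely illustrative):

{code}
## Illustrative only: hold an RDD across a session restart, then act on it.
from pyspark.sql import SparkSession

old = SparkSession.builder.master("local[2]").getOrCreate()
stale = old.sparkContext.parallelize([[0, 1, 2], [1, 1, 1]])

old.stop()
spark = SparkSession.builder.master("local[2]").getOrCreate()

## stale.context is still the stopped SparkContext, whose _jsc is now None,
## so any action on the RDD fails before a job is even submitted.
try:
    stale.first()
except AttributeError as e:
    print(e)  ## 'NoneType' object has no attribute 'sc'
{code}

The difference in my case is that I never held a stale RDD on purpose: sims.entries came back from columnSimilarities() already bound to the stopped context.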