I'm new to PySpark and to Python in general. I'm trying to use PySpark to process a large text file with Gensim's preprocess_string function. What I did was simply put preprocess_string inside PySpark's map function:
    rdd1 = text.map(lambda x: (x.preprocess_string(x, CUSTOM_FILTERS)))

It gave me this error when I ran it:

    Traceback (most recent call last):
        RDD1 = text.map(lambda x: (x.preprocess_string(x, CUSTOM_FILTERS))).cache()
      File "$SPARK_HOME/python/lib/pyspark.zip/pyspark/rdd.py", line 226, in cache
      File "$SPARK_HOME/python/lib/pyspark.zip/pyspark/rdd.py", line 242, in persist
      File "$SPARK_HOME/python/lib/pyspark.zip/pyspark/rdd.py", line 2380, in _jrdd
      File "$SPARK_HOME/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
      File "$SPARK_HOME/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 312, in get_return_value
    py4j.protocol.Py4JError: An error occurred while calling o24.rdd. Trace:
    py4j.Py4JException: Method rdd([]) does not exist
        at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:335)
        at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:344)
        at py4j.Gateway.invoke(Gateway.java:252)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:209)
        at java.lang.Thread.run(Thread.java:745)

Gensim is installed on all the nodes in my cluster. My understanding of how PySpark works is that it ships the Python code from the driver program to the worker nodes and invokes Python processes there to execute it. My question is: what kind of user-defined map function can PySpark handle? Thanks!
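For context, here is a minimal standard-library sketch (no Spark or Gensim needed) of the property a map function generally has to satisfy: PySpark serializes the function and ships it to the workers, so a module-level function that references only serializable objects is the safe case. The `tokenize` function below is a hypothetical stand-in for `preprocess_string`; note also that the lambda above looks up `preprocess_string` as a method on the string `x`, whereas the intended call was presumably the free function `preprocess_string(x, CUSTOM_FILTERS)`.

```python
import pickle

# A module-level function like this is straightforward to serialize:
# pickle stores a reference to it by module and name, and the worker
# re-imports it on the other side. (PySpark uses cloudpickle, which can
# additionally serialize lambdas and closures by value.)
def tokenize(line):
    """Hypothetical stand-in for gensim's preprocess_string."""
    return line.lower().split()

# Round-trip the function through pickle, as a driver -> worker ship would.
restored = pickle.loads(pickle.dumps(tokenize))
assert restored("Hello World") == ["hello", "world"]

# The equivalent Spark call would then be (assuming `text` is an RDD of lines):
#     rdd1 = text.map(tokenize)
# or, with gensim on every worker:
#     rdd1 = text.map(lambda x: preprocess_string(x, CUSTOM_FILTERS))
```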