I'm new to PySpark and to Python in general. I'm trying to use PySpark to process a large text file with Gensim's preprocess_string function. What I did was simply put preprocess_string inside PySpark's map function:
    rdd1 = text.map(lambda x: (x.preprocess_string(x, CUSTOM_FILTERS)))

It gave me this error when I ran it:

    Traceback (most recent call last):
        RDD1 = text.map(lambda x: (x.preprocess_string(x, CUSTOM_FILTERS))).cache()
      File "$SPARK_HOME/python/lib/pyspark.zip/pyspark/rdd.py", line 226, in cache
      File "$SPARK_HOME/python/lib/pyspark.zip/pyspark/rdd.py", line 242, in persist
      File "$SPARK_HOME/python/lib/pyspark.zip/pyspark/rdd.py", line 2380, in _jrdd
      File "$SPARK_HOME/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
      File "$SPARK_HOME/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 312, in get_return_value
    py4j.protocol.Py4JError: An error occurred while calling o24.rdd. Trace:
    py4j.Py4JException: Method rdd([]) does not exist
        at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:335)
        at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:344)
        at py4j.Gateway.invoke(Gateway.java:252)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:209)
        at java.lang.Thread.run(Thread.java:745)

Gensim is installed on all the nodes in my cluster. My understanding of how PySpark works is that it ships the Python code from the driver program to the worker nodes and invokes Python processes there to execute it. My question is: what kind of user-defined map function can PySpark handle? Thanks!
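For context, here is a minimal standard-library sketch (no Spark or Gensim needed) of the property a map function generally has to satisfy: PySpark serializes the function and ships it to the workers, so a module-level function that references only serializable objects is the safe case. The `tokenize` function below is a hypothetical stand-in for `preprocess_string`; note also that the lambda above looks up `preprocess_string` as a method on the string `x`, whereas the intended call was presumably the free function `preprocess_string(x, CUSTOM_FILTERS)`.

```python
import pickle

# A module-level function like this is straightforward to serialize:
# pickle stores a reference to it by module and name, and the worker
# re-imports it on the other side. (PySpark uses cloudpickle, which can
# additionally serialize lambdas and closures by value.)
def tokenize(line):
    """Hypothetical stand-in for gensim's preprocess_string."""
    return line.lower().split()

# Round-trip the function through pickle, as a driver -> worker ship would.
restored = pickle.loads(pickle.dumps(tokenize))
assert restored("Hello World") == ["hello", "world"]

# The equivalent Spark call would then be (assuming `text` is an RDD of lines):
#     rdd1 = text.map(tokenize)
# or, with gensim on every worker:
#     rdd1 = text.map(lambda x: preprocess_string(x, CUSTOM_FILTERS))
```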