[ https://issues.apache.org/jira/browse/SPARK-15679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15308436#comment-15308436 ]

nalin garg commented on SPARK-15679:
------------------------------------

The purpose of the code is to compute some logic via the "myFunc" method,
which operates on an RDD to get the benefit of parallelization.

The following lines:
df_rdd = ParallelBuild().run().map(lambda line: line).persist()
r = df_rdd.map(ParallelBuild().myFunc)

gave me exit status 0. Searching on Google suggested that Spark uses lazy
evaluation, so some action is needed to trigger the computation, and I added:

r.count()

which gives the above-mentioned stack trace.
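
For reference, this is Spark's lazy evaluation at work: transformations such
as map only record lineage, and nothing executes until an action such as
count is called. A minimal standalone sketch, using plain RDDs rather than
the code from this issue:

{code}
rdd = sc.parallelize([1, 2, 3, 4])
doubled = rdd.map(lambda x: x * 2)  # transformation: recorded lazily, nothing runs yet
print doubled.count()               # action: triggers the actual computation, prints 4
{code}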

The noticeable thing was that:

r = df_rdd.map(ParallelBuild().myFunc)

returns a "PipelinedRDD". I am not sure what that is, but it looks like the
result of some transformation?

The interesting part was that when I removed the run method and created the
DataFrame directly in my main function:

data = [(1,'a'), (1,'b'), (1,'c'), (2,'d'), (3,'r'), (4,'a'), (2,'t'), (3,'y'),
        (1,'f')]
df = sqlContext.createDataFrame(data, schema=['uid', 'address_uid'])

then things worked just fine. But obviously I lose the modularity of my code.
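
One way to keep the code modular while avoiding this is to pass a plain
module-level function to map instead of a bound method: PySpark has to
pickle whatever is passed to a transformation, and a bound method such as
ParallelBuild().myFunc drags the whole instance along with it (the Spark
programming guide's "Passing Functions to Spark" section warns about
exactly this). A hedged sketch of that approach; my_func is a hypothetical
rename, and the Row-to-string conversion is an assumption about what the
original myFunc expects, since DataFrame rows are Row objects rather than
comma-separated strings:

{code}
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext("local", "Summary Report")
sqlContext = SQLContext(sc)

# Module-level function: pickles cleanly, no enclosing instance to serialize.
def my_func(s):
    l = s.split(',')
    return l[0]

class ParallelBuild(object):
    def run(self):
        data = [(1,'a'), (1,'b'), (1,'c'), (2,'d'), (3,'r'), (4,'a'),
                (2,'t'), (3,'y'), (1,'f')]
        return sqlContext.createDataFrame(data, schema=['uid', 'address_uid'])

if __name__ == "__main__":
    df = ParallelBuild().run()
    # Rows are Row objects, so build the comma-separated string my_func expects.
    r = df.rdd.map(lambda row: my_func("%s,%s" % (row.uid, row.address_uid)))
    print r.count()
{code}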



> Passing functions do not work in classes
> ----------------------------------------
>
>                 Key: SPARK-15679
>                 URL: https://issues.apache.org/jira/browse/SPARK-15679
>             Project: Spark
>          Issue Type: Bug
>            Reporter: nalin garg
>
> {code}
> from pyspark import SparkContext
> from pyspark.sql import SQLContext, HiveContext
> import csv, io, StringIO
> from pyspark.sql.functions import *
> from pyspark.sql import Row
> from pyspark.sql import *
> from pyspark.sql import functions as F
> from pyspark.sql.functions import asc, desc
> sc = SparkContext("local", "Summary Report")
> sqlContext = SQLContext(sc)
> class ParallelBuild(object):
>     def myFunc(self, s):
>         l = s.split(',')
>         print l[0], l[1]
>         return l[0]
>     def list_to_csv_str(self, x):
>         output = StringIO.StringIO("")
>         csv.writer(output).writerow(x)
>         return output.getvalue().strip()
>     def run(self):
>         data = [(1,'a'), (1,'b'), (1,'c'), (2,'d'), (3,'r'), (4,'a'),
>                 (2,'t'), (3,'y'), (1,'f')]
>         df = sqlContext.createDataFrame(data, schema= ['uid', 'address_uid'])
>         return df
> if __name__ == "__main__":
>     df_rdd = ParallelBuild().run().map(lambda line: line).persist()
>     r = df_rdd.map(ParallelBuild().myFunc)
>     r.count()
> {code}
> The above code raises a "JavaPackage not callable" exception.


