[jira] [Created] (SPARK-15787) Display more helpful error messages for several invalid operations
nalin garg created SPARK-15787:
----------------------------------

             Summary: Display more helpful error messages for several invalid operations
                 Key: SPARK-15787
                 URL: https://issues.apache.org/jira/browse/SPARK-15787
             Project: Spark
          Issue Type: Bug
    Affects Versions: 1.6.1
            Reporter: nalin garg
             Fix For: 1.2.1


Referencing SPARK-5063. The issue has reappeared.
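For context, SPARK-5063 added explicit error messages for operations that reference the SparkContext, an RDD, or a broadcast variable from code that runs on executors. A minimal sketch of that class of invalid operation follows (illustrative only: the variable names and the local master setting are assumptions, not taken from this report):

{code}
from pyspark import SparkContext

sc = SparkContext("local", "SPARK-5063 sketch")
rdd = sc.parallelize(range(10))

# Invalid: the lambda closes over `rdd`, so Spark would have to ship the
# RDD (and transitively the SparkContext) to the executors. SPARK-5063
# is meant to fail this with an explicit "cannot reference an RDD from
# an action or transformation" style message instead of a pickling error.
# nested = rdd.map(lambda x: rdd.count()).collect()

# Valid: compute the driver-side value first, then close over plain data.
total = rdd.count()
print rdd.map(lambda x: x * total).collect()
{code}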
[jira] [Comment Edited] (SPARK-15679) Passing functions do not work in classes
[ https://issues.apache.org/jira/browse/SPARK-15679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15308425#comment-15308425 ]

nalin garg edited comment on SPARK-15679 at 5/31/16 7:35 PM:
-------------------------------------------------------------

Below is the stack trace:

Traceback (most recent call last):
  File "/Users/nalingarg/PycharmProjects/parallel_check/parallel_check.py", line 90, in <module>
    r.count()
  File "/Users/nalingarg/Documents/spark-1.6.1-bin-hadoop2.6/python/pyspark/rdd.py", line 1004, in count
    return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
  File "/Users/nalingarg/Documents/spark-1.6.1-bin-hadoop2.6/python/pyspark/rdd.py", line 995, in sum
    return self.mapPartitions(lambda x: [sum(x)]).fold(0, operator.add)
  File "/Users/nalingarg/Documents/spark-1.6.1-bin-hadoop2.6/python/pyspark/rdd.py", line 869, in fold
    vals = self.mapPartitions(func).collect()
  File "/Users/nalingarg/Documents/spark-1.6.1-bin-hadoop2.6/python/pyspark/rdd.py", line 771, in collect
    port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
  File "/Users/nalingarg/Documents/spark-1.6.1-bin-hadoop2.6/python/pyspark/rdd.py", line 2379, in _jrdd
    pickled_cmd, bvars, env, includes = _prepare_for_python_RDD(self.ctx, command, self)
  File "/Users/nalingarg/Documents/spark-1.6.1-bin-hadoop2.6/python/pyspark/rdd.py", line 2299, in _prepare_for_python_RDD
    pickled_command = ser.dumps(command)
  File "/Users/nalingarg/Documents/spark-1.6.1-bin-hadoop2.6/python/pyspark/serializers.py", line 428, in dumps
    return cloudpickle.dumps(obj, 2)
  File "/Users/nalingarg/Documents/spark-1.6.1-bin-hadoop2.6/python/pyspark/cloudpickle.py", line 646, in dumps
    cp.dump(obj)
  File "/Users/nalingarg/Documents/spark-1.6.1-bin-hadoop2.6/python/pyspark/cloudpickle.py", line 107, in dump
    return Pickler.dump(self, obj)
  File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 224, in dump
    self.save(obj)
  File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 568, in save_tuple
    save(element)
  File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/Users/nalingarg/Documents/spark-1.6.1-bin-hadoop2.6/python/pyspark/cloudpickle.py", line 199, in save_function
    self.save_function_tuple(obj)
  File "/Users/nalingarg/Documents/spark-1.6.1-bin-hadoop2.6/python/pyspark/cloudpickle.py", line 236, in save_function_tuple
    save((code, closure, base_globals))
  File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 554, in save_tuple
    save(element)
  File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 606, in save_list
    self._batch_appends(iter(obj))
  File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 639, in _batch_appends
    save(x)
  File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/Users/nalingarg/Documents/spark-1.6.1-bin-hadoop2.6/python/pyspark/cloudpickle.py", line 199, in save_function
    self.save_function_tuple(obj)
  File "/Users/nalingarg/Documents/spark-1.6.1-bin-hadoop2.6/python/pyspark/cloudpickle.py", line 236, in save_function_tuple
    save((code, closure, base_globals))
  File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 554, in save_tuple
    save(element)
  File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 606, in save_list
    self._batch_appends(iter(obj))
  File "/usr/local/Cellar/python/2.7.11/
[jira] [Commented] (SPARK-15679) Passing functions do not work in classes
[ https://issues.apache.org/jira/browse/SPARK-15679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15308436#comment-15308436 ]

nalin garg commented on SPARK-15679:
------------------------------------

The purpose of the code is to compute some logic in the "myFunc" method, operating on an RDD to get the benefit of parallelization. The following lines:

df_rdd = ParallelBuild().run().map(lambda line: line).persist()
r = df_rdd.map(ParallelBuild().myFunc)

exited with code 0. Reading around suggested that Spark evaluates lazily, so some action is needed to trigger the work, and I added:

r.count()

which gives the stack trace above. One noticeable thing: r = df_rdd.map(ParallelBuild().myFunc) gives a "PipelinedRDD"; I am not sure what that is, but it looks like some transformation? The interesting part was that when I removed the run method and implemented:

data = [(1,'a'), (1,'b'), (1,'c'), (2,'d'), (3,'r'), (4,'a'), (2,'t'), (3,'y'), (1,'f')]
df = sqlContext.createDataFrame(data, schema= ['uid', 'address_uid'])

directly in my main function, things worked just fine. But then I obviously lose the modular structure of my code.

> Passing functions do not work in classes
> -----------------------------------------
>
>                 Key: SPARK-15679
>                 URL: https://issues.apache.org/jira/browse/SPARK-15679
>             Project: Spark
>          Issue Type: Bug
>            Reporter: nalin garg
>
> {code}
> from pyspark import SparkContext
> from pyspark.sql import SQLContext, HiveContext
> import csv, io, StringIO
> from pyspark.sql.functions import *
> from pyspark.sql import Row
> from pyspark.sql import *
> from pyspark.sql import functions as F
> from pyspark.sql.functions import asc, desc
>
> sc = SparkContext("local", "Summary Report")
> sqlContext = SQLContext(sc)
>
> class ParallelBuild(object):
>
>     def myFunc(self, s):
>         l = s.split(',')
>         print l[0], l[1]
>         return l[0]
>
>     def list_to_csv_str(x):
>         output = StringIO.StringIO("")
>         csv.writer(output).writerow(x)
>         return output.getvalue().strip()
>
>     def run(self):
>         data = [(1,'a'), (1,'b'), (1,'c'), (2,'d'), (3,'r'), (4,'a'), (2,'t'), (3,'y'), (1,'f')]
>         df = sqlContext.createDataFrame(data, schema= ['uid', 'address_uid'])
>         return df
>
> if __name__ == "__main__":
>     df_rdd = ParallelBuild().run().map(lambda line: line).persist()
>     r = df_rdd.map(ParallelBuild().myFunc)
>     r.count()
> {code}
> Above code returns JavaPackage not callable exception.
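A common workaround for this pattern (a sketch under assumptions, not the reporter's code: {{first_field}} and the reduced class are illustrative) is to pass a plain module-level function to {{map}} instead of a bound method. Pickling {{ParallelBuild().myFunc}} forces cloudpickle to serialize the instance, its class, and the surrounding globals, which include {{sc}} and {{sqlContext}}; a top-level function that only touches its argument avoids that. Note also that mapping over {{df.rdd}} yields {{Row}} objects rather than comma-separated strings, so indexing replaces {{s.split(',')}}:

{code}
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext("local", "Summary Report")
sqlContext = SQLContext(sc)

def first_field(row):
    # Module-level function: serializing it does not drag along a class
    # instance or the SparkContext held in this module.
    return row[0]

class ParallelBuild(object):
    def run(self):
        data = [(1, 'a'), (1, 'b'), (1, 'c'), (2, 'd'), (3, 'r'),
                (4, 'a'), (2, 't'), (3, 'y'), (1, 'f')]
        return sqlContext.createDataFrame(data, schema=['uid', 'address_uid'])

if __name__ == "__main__":
    # df.rdd gives an RDD of Row objects; map a plain function over it.
    r = ParallelBuild().run().rdd.map(first_field).persist()
    print r.count()
{code}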
[jira] [Commented] (SPARK-15679) Passing functions do not work in classes
[ https://issues.apache.org/jira/browse/SPARK-15679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15308425#comment-15308425 ]

nalin garg commented on SPARK-15679:
------------------------------------

[~sowen] Below is the stack trace:

Traceback (most recent call last):
  File "/Users/nalingarg/PycharmProjects/parallel_check/parallel_check.py", line 90, in <module>
    r.count()
  File "/Users/nalingarg/Documents/spark-1.6.1-bin-hadoop2.6/python/pyspark/rdd.py", line 1004, in count
    return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
  File "/Users/nalingarg/Documents/spark-1.6.1-bin-hadoop2.6/python/pyspark/rdd.py", line 995, in sum
    return self.mapPartitions(lambda x: [sum(x)]).fold(0, operator.add)
  File "/Users/nalingarg/Documents/spark-1.6.1-bin-hadoop2.6/python/pyspark/rdd.py", line 869, in fold
    vals = self.mapPartitions(func).collect()
  File "/Users/nalingarg/Documents/spark-1.6.1-bin-hadoop2.6/python/pyspark/rdd.py", line 771, in collect
    port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
  File "/Users/nalingarg/Documents/spark-1.6.1-bin-hadoop2.6/python/pyspark/rdd.py", line 2379, in _jrdd
    pickled_cmd, bvars, env, includes = _prepare_for_python_RDD(self.ctx, command, self)
  File "/Users/nalingarg/Documents/spark-1.6.1-bin-hadoop2.6/python/pyspark/rdd.py", line 2299, in _prepare_for_python_RDD
    pickled_command = ser.dumps(command)
  File "/Users/nalingarg/Documents/spark-1.6.1-bin-hadoop2.6/python/pyspark/serializers.py", line 428, in dumps
    return cloudpickle.dumps(obj, 2)
  File "/Users/nalingarg/Documents/spark-1.6.1-bin-hadoop2.6/python/pyspark/cloudpickle.py", line 646, in dumps
    cp.dump(obj)
  File "/Users/nalingarg/Documents/spark-1.6.1-bin-hadoop2.6/python/pyspark/cloudpickle.py", line 107, in dump
    return Pickler.dump(self, obj)
  File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 224, in dump
    self.save(obj)
  File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 568, in save_tuple
    save(element)
  File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/Users/nalingarg/Documents/spark-1.6.1-bin-hadoop2.6/python/pyspark/cloudpickle.py", line 199, in save_function
    self.save_function_tuple(obj)
  File "/Users/nalingarg/Documents/spark-1.6.1-bin-hadoop2.6/python/pyspark/cloudpickle.py", line 236, in save_function_tuple
    save((code, closure, base_globals))
  File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 554, in save_tuple
    save(element)
  File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 606, in save_list
    self._batch_appends(iter(obj))
  File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 639, in _batch_appends
    save(x)
  File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/Users/nalingarg/Documents/spark-1.6.1-bin-hadoop2.6/python/pyspark/cloudpickle.py", line 199, in save_function
    self.save_function_tuple(obj)
  File "/Users/nalingarg/Documents/spark-1.6.1-bin-hadoop2.6/python/pyspark/cloudpickle.py", line 236, in save_function_tuple
    save((code, closure, base_globals))
  File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 554, in save_tuple
    save(element)
  File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 606, in save_list
    self._batch_appends(iter(obj))
  File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7
[jira] [Updated] (SPARK-15679) Passing functions do not work in classes
[ https://issues.apache.org/jira/browse/SPARK-15679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

nalin garg updated SPARK-15679:
-------------------------------
    Description: 
{code}
from pyspark import SparkContext
from pyspark.sql import SQLContext, HiveContext
import csv, io, StringIO
from pyspark.sql.functions import *
from pyspark.sql import Row
from pyspark.sql import *
from pyspark.sql import functions as F
from pyspark.sql.functions import asc, desc

sc = SparkContext("local", "Summary Report")
sqlContext = SQLContext(sc)

class ParallelBuild(object):

    def myFunc(self, s):
        l = s.split(',')
        print l[0], l[1]
        return l[0]

    def list_to_csv_str(x):
        output = StringIO.StringIO("")
        csv.writer(output).writerow(x)
        return output.getvalue().strip()

    def run(self):
        data = [(1,'a'), (1,'b'), (1,'c'), (2,'d'), (3,'r'), (4,'a'), (2,'t'), (3,'y'), (1,'f')]
        df = sqlContext.createDataFrame(data, schema= ['uid', 'address_uid'])
        return df

if __name__ == "__main__":
    df_rdd = ParallelBuild().run().map(lambda line: line).persist()
    r = df_rdd.map(ParallelBuild().myFunc)
    r.count()
{code}
Above code returns JavaPackage not callable exception.

  was:
{code}
from pyspark import SparkContext
from pyspark.sql import SQLContext, HiveContext
import csv, io, StringIO
from pyspark.sql.functions import *
from pyspark.sql import Row
from pyspark.sql import *
from pyspark.sql import functions as F
from pyspark.sql.functions import asc, desc

sc = SparkContext("local", "Summary Report")
sqlContext = SQLContext(sc)

class ParallelBuild(object):

    # def myFunc(self, s, i, k):
    #     # words = s.split(" ")
    #     # print words
    #     # return len(words)
    #     k = k + 1
    #     print s, i, k
    #     return s, i, k

    def myFunc(self, s):
        l = s.split(',')
        print l[0], l[1]
        return l[0]

    def list_to_csv_str(x):
        output = StringIO.StringIO("")
        csv.writer(output).writerow(x)
        return output.getvalue().strip()

    def run(self):
        data = [(1,'a'), (1,'b'), (1,'c'), (2,'d'), (3,'r'), (4,'a'), (2,'t'), (3,'y'), (1,'f')]
        df = sqlContext.createDataFrame(data, schema= ['uid', 'address_uid'])
        return df

if __name__ == "__main__":
    df_rdd = ParallelBuild().run().map(lambda line: line).persist()
    r = df_rdd.map(ParallelBuild().myFunc)
    r.count()
{code}
Above code returns JavaPackage not callable exception.

> Passing functions do not work in classes
> -----------------------------------------
>
>                 Key: SPARK-15679
>                 URL: https://issues.apache.org/jira/browse/SPARK-15679
>             Project: Spark
>          Issue Type: Bug
>            Reporter: nalin garg
>
> {code}
> from pyspark import SparkContext
> from pyspark.sql import SQLContext, HiveContext
> import csv, io, StringIO
> from pyspark.sql.functions import *
> from pyspark.sql import Row
> from pyspark.sql import *
> from pyspark.sql import functions as F
> from pyspark.sql.functions import asc, desc
>
> sc = SparkContext("local", "Summary Report")
> sqlContext = SQLContext(sc)
>
> class ParallelBuild(object):
>
>     def myFunc(self, s):
>         l = s.split(',')
>         print l[0], l[1]
>         return l[0]
>
>     def list_to_csv_str(x):
>         output = StringIO.StringIO("")
>         csv.writer(output).writerow(x)
>         return output.getvalue().strip()
>
>     def run(self):
>         data = [(1,'a'), (1,'b'), (1,'c'), (2,'d'), (3,'r'), (4,'a'), (2,'t'), (3,'y'), (1,'f')]
>         df = sqlContext.createDataFrame(data, schema= ['uid', 'address_uid'])
>         return df
>
> if __name__ == "__main__":
>     df_rdd = ParallelBuild().run().map(lambda line: line).persist()
>     r = df_rdd.map(ParallelBuild().myFunc)
>     r.count()
> {code}
> Above code returns JavaPackage not callable exception.
[jira] [Updated] (SPARK-15679) Passing functions do not work in classes
[ https://issues.apache.org/jira/browse/SPARK-15679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

nalin garg updated SPARK-15679:
-------------------------------
    Description: 
from pyspark import SparkContext
from pyspark.sql import SQLContext, HiveContext
import csv, io, StringIO
from pyspark.sql.functions import *
from pyspark.sql import Row
from pyspark.sql import *
from pyspark.sql import functions as F
from pyspark.sql.functions import asc, desc

sc = SparkContext("local", "Summary Report")
sqlContext = SQLContext(sc)

class ParallelBuild(object):

    # def myFunc(self, s, i, k):
    #     # words = s.split(" ")
    #     # print words
    #     # return len(words)
    #     k = k + 1
    #     print s, i, k
    #     return s, i, k

    def myFunc(self, s):
        l = s.split(',')
        print l[0], l[1]
        return l[0]

    def list_to_csv_str(x):
        output = StringIO.StringIO("")
        csv.writer(output).writerow(x)
        return output.getvalue().strip()

    def run(self):
        data = [(1,'a'), (1,'b'), (1,'c'), (2,'d'), (3,'r'), (4,'a'), (2,'t'), (3,'y'), (1,'f')]
        df = sqlContext.createDataFrame(data, schema= ['uid', 'address_uid'])
        return df

if __name__ == "__main__":
    df_rdd = ParallelBuild().run().map(lambda line: line).persist()
    r = df_rdd.map(ParallelBuild().myFunc)
    r.count()

Above code returns JavaPackage not callable exception.

> Passing functions do not work in classes
> -----------------------------------------
>
>                 Key: SPARK-15679
>                 URL: https://issues.apache.org/jira/browse/SPARK-15679
>             Project: Spark
>          Issue Type: Bug
>            Reporter: nalin garg
>
> from pyspark import SparkContext
> from pyspark.sql import SQLContext, HiveContext
> import csv, io, StringIO
> from pyspark.sql.functions import *
> from pyspark.sql import Row
> from pyspark.sql import *
> from pyspark.sql import functions as F
> from pyspark.sql.functions import asc, desc
>
> sc = SparkContext("local", "Summary Report")
> sqlContext = SQLContext(sc)
>
> class ParallelBuild(object):
>
>     # def myFunc(self, s, i, k):
>     #     # words = s.split(" ")
>     #     # print words
>     #     # return len(words)
>     #     k = k + 1
>     #     print s, i, k
>     #     return s, i, k
>
>     def myFunc(self, s):
>         l = s.split(',')
>         print l[0], l[1]
>         return l[0]
>
>     def list_to_csv_str(x):
>         output = StringIO.StringIO("")
>         csv.writer(output).writerow(x)
>         return output.getvalue().strip()
>
>     def run(self):
>         data = [(1,'a'), (1,'b'), (1,'c'), (2,'d'), (3,'r'), (4,'a'), (2,'t'), (3,'y'), (1,'f')]
>         df = sqlContext.createDataFrame(data, schema= ['uid', 'address_uid'])
>         return df
>
> if __name__ == "__main__":
>     df_rdd = ParallelBuild().run().map(lambda line: line).persist()
>     r = df_rdd.map(ParallelBuild().myFunc)
>     r.count()
>
> Above code returns JavaPackage not callable exception.
[jira] [Created] (SPARK-15679) Passing functions do not work in classes
nalin garg created SPARK-15679:
----------------------------------

             Summary: Passing functions do not work in classes
                 Key: SPARK-15679
                 URL: https://issues.apache.org/jira/browse/SPARK-15679
             Project: Spark
          Issue Type: Bug
            Reporter: nalin garg