[jira] [Created] (SPARK-15787) Display more helpful error messages for several invalid operations

2016-06-06 Thread nalin garg (JIRA)
nalin garg created SPARK-15787:
--

 Summary: Display more helpful error messages for several invalid 
operations
 Key: SPARK-15787
 URL: https://issues.apache.org/jira/browse/SPARK-15787
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.6.1
Reporter: nalin garg
 Fix For: 1.2.1


This references SPARK-5063; the issue described there has reappeared in 1.6.1.
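
For context, SPARK-5063 added explicit error messages for invalid operations such as referencing one RDD (or the SparkContext) from inside a transformation of another RDD. A minimal sketch of that pattern, with illustrative names that are not taken from this report:

{code}
from pyspark import SparkContext

sc = SparkContext("local", "spark-5063-demo")
rdd1 = sc.parallelize([1, 2, 3])
rdd2 = sc.parallelize([4, 5, 6])

# Invalid: transformations run on executors and cannot reference another
# RDD or the SparkContext. With SPARK-5063 in place, Spark raises an
# explicit "RDD transformations and actions can only be invoked by the
# driver"-style error here instead of an obscure serialization failure.
# rdd1.map(lambda x: rdd2.count()).collect()
{code}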






[jira] [Comment Edited] (SPARK-15679) Passing functions do not work in classes

2016-05-31 Thread nalin garg (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-15679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15308425#comment-15308425 ]

nalin garg edited comment on SPARK-15679 at 5/31/16 7:35 PM:
-

Below is the stack trace:

{code}
Traceback (most recent call last):
  File "/Users/nalingarg/PycharmProjects/parallel_check/parallel_check.py", line 90, in <module>
    r.count()
  File "/Users/nalingarg/Documents/spark-1.6.1-bin-hadoop2.6/python/pyspark/rdd.py", line 1004, in count
    return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
  File "/Users/nalingarg/Documents/spark-1.6.1-bin-hadoop2.6/python/pyspark/rdd.py", line 995, in sum
    return self.mapPartitions(lambda x: [sum(x)]).fold(0, operator.add)
  File "/Users/nalingarg/Documents/spark-1.6.1-bin-hadoop2.6/python/pyspark/rdd.py", line 869, in fold
    vals = self.mapPartitions(func).collect()
  File "/Users/nalingarg/Documents/spark-1.6.1-bin-hadoop2.6/python/pyspark/rdd.py", line 771, in collect
    port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
  File "/Users/nalingarg/Documents/spark-1.6.1-bin-hadoop2.6/python/pyspark/rdd.py", line 2379, in _jrdd
    pickled_cmd, bvars, env, includes = _prepare_for_python_RDD(self.ctx, command, self)
  File "/Users/nalingarg/Documents/spark-1.6.1-bin-hadoop2.6/python/pyspark/rdd.py", line 2299, in _prepare_for_python_RDD
    pickled_command = ser.dumps(command)
  File "/Users/nalingarg/Documents/spark-1.6.1-bin-hadoop2.6/python/pyspark/serializers.py", line 428, in dumps
    return cloudpickle.dumps(obj, 2)
  File "/Users/nalingarg/Documents/spark-1.6.1-bin-hadoop2.6/python/pyspark/cloudpickle.py", line 646, in dumps
    cp.dump(obj)
  File "/Users/nalingarg/Documents/spark-1.6.1-bin-hadoop2.6/python/pyspark/cloudpickle.py", line 107, in dump
    return Pickler.dump(self, obj)
  File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 224, in dump
    self.save(obj)
  File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 568, in save_tuple
    save(element)
  File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/Users/nalingarg/Documents/spark-1.6.1-bin-hadoop2.6/python/pyspark/cloudpickle.py", line 199, in save_function
    self.save_function_tuple(obj)
  File "/Users/nalingarg/Documents/spark-1.6.1-bin-hadoop2.6/python/pyspark/cloudpickle.py", line 236, in save_function_tuple
    save((code, closure, base_globals))
  File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 554, in save_tuple
    save(element)
  File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 606, in save_list
    self._batch_appends(iter(obj))
  File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 639, in _batch_appends
    save(x)
  File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/Users/nalingarg/Documents/spark-1.6.1-bin-hadoop2.6/python/pyspark/cloudpickle.py", line 199, in save_function
    self.save_function_tuple(obj)
  File "/Users/nalingarg/Documents/spark-1.6.1-bin-hadoop2.6/python/pyspark/cloudpickle.py", line 236, in save_function_tuple
    save((code, closure, base_globals))
  File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 554, in save_tuple
    save(element)
  File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 606, in save_list
    self._batch_appends(iter(obj))
  File "/usr/local/Cellar/python/2.7.11/
{code}

[jira] [Commented] (SPARK-15679) Passing functions do not work in classes

2016-05-31 Thread nalin garg (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-15679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15308436#comment-15308436 ]

nalin garg commented on SPARK-15679:


The purpose of the code is to compute some logic, based on the "myFunc" method, over an RDD in order to get the benefit of parallelization.

The following lines:

{code}
df_rdd = ParallelBuild().run().map(lambda line: line).persist()
r = df_rdd.map(ParallelBuild().myFunc)
{code}

gave me exit 0. Searching around suggested that Spark evaluates lazily, so an action is needed to trigger the computation, and I added:

r.count(), which gives the stack trace above.

A noticeable thing was that:
r = df_rdd.map(ParallelBuild().myFunc)

returns a PipelinedRDD; I am not sure what that is, but it looks like some kind of transformation?

The interesting part was that when I removed the run method and put:

{code}
data = [(1,'a'), (1,'b'), (1,'c'), (2,'d'), (3,'r'), (4,'a'), (2,'t'), (3,'y'), (1,'f')]
df = sqlContext.createDataFrame(data, schema=['uid', 'address_uid'])
{code}

directly in my main function, things worked just fine. But obviously I lose the modular structure of my code.
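
A workaround in line with the Spark programming guide's general advice on passing functions (a sketch only; my_func is an illustrative name, and it assumes myFunc only needs the row's fields): pass a module-level function instead of a bound method, so nothing from the ParallelBuild instance is captured.

{code}
# Hypothetical rework of the failing lines: a plain top-level function
# keeps the ParallelBuild instance (and anything it references, such as
# the SparkContext) out of the pickled closure.
def my_func(row):
    # DataFrame.map in Spark 1.6 passes Row objects rather than strings,
    # so read the fields directly instead of splitting on ','.
    print row[0], row[1]
    return row[0]

df_rdd = ParallelBuild().run().map(lambda line: line).persist()
r = df_rdd.map(my_func)
print r.count()
{code}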



> Passing functions do not work in classes
> 
>
> Key: SPARK-15679
> URL: https://issues.apache.org/jira/browse/SPARK-15679
> Project: Spark
>  Issue Type: Bug
>Reporter: nalin garg
>
> {code}
> from pyspark import SparkContext
> from pyspark.sql import SQLContext, HiveContext
> import csv, io, StringIO
> from pyspark.sql.functions import *
> from pyspark.sql import Row
> from pyspark.sql import *
> from pyspark.sql import functions as F
> from pyspark.sql.functions import asc, desc
>
> sc = SparkContext("local", "Summary Report")
> sqlContext = SQLContext(sc)
>
> class ParallelBuild(object):
>
>     def myFunc(self, s):
>         l = s.split(',')
>         print l[0], l[1]
>         return l[0]
>
>     def list_to_csv_str(x):
>         output = StringIO.StringIO("")
>         csv.writer(output).writerow(x)
>         return output.getvalue().strip()
>
>     def run(self):
>         data = [(1,'a'), (1,'b'), (1,'c'), (2,'d'), (3,'r'), (4,'a'),
>                 (2,'t'), (3,'y'), (1,'f')]
>         df = sqlContext.createDataFrame(data, schema=['uid', 'address_uid'])
>         return df
>
> if __name__ == "__main__":
>     df_rdd = ParallelBuild().run().map(lambda line: line).persist()
>     r = df_rdd.map(ParallelBuild().myFunc)
>     r.count()
> {code}
> The above code returns a "'JavaPackage' object is not callable" exception.
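
As an aside, a "'JavaPackage' object is not callable" error generally means that Py4J resolved a name to a package placeholder instead of a Java class, typically because the class is not on the JVM classpath. A quick hypothetical check, using the attribute seen in the stack trace above:

{code}
# If the class resolved, sc._jvm.PythonRDD is a py4j JavaClass and is
# callable; if it did not, it is a JavaPackage placeholder and calling
# it raises "'JavaPackage' object is not callable".
print type(sc._jvm.PythonRDD)
{code}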







[jira] [Updated] (SPARK-15679) Passing functions do not work in classes

2016-05-31 Thread nalin garg (JIRA)

 [ https://issues.apache.org/jira/browse/SPARK-15679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

nalin garg updated SPARK-15679:
---
Description: 
{code}
from pyspark import SparkContext
from pyspark.sql import SQLContext, HiveContext
import csv, io, StringIO
from pyspark.sql.functions import *
from pyspark.sql import Row
from pyspark.sql import *
from pyspark.sql import functions as F
from pyspark.sql.functions import asc, desc

sc = SparkContext("local", "Summary Report")
sqlContext = SQLContext(sc)


class ParallelBuild(object):

    def myFunc(self, s):
        l = s.split(',')
        print l[0], l[1]
        return l[0]

    def list_to_csv_str(x):
        output = StringIO.StringIO("")
        csv.writer(output).writerow(x)
        return output.getvalue().strip()

    def run(self):
        data = [(1,'a'), (1,'b'), (1,'c'), (2,'d'), (3,'r'), (4,'a'),
                (2,'t'), (3,'y'), (1,'f')]
        df = sqlContext.createDataFrame(data, schema=['uid', 'address_uid'])
        return df


if __name__ == "__main__":

    df_rdd = ParallelBuild().run().map(lambda line: line).persist()
    r = df_rdd.map(ParallelBuild().myFunc)
    r.count()
{code}


The above code returns a "'JavaPackage' object is not callable" exception.

  was:
{code}
from pyspark import SparkContext
from pyspark.sql import SQLContext, HiveContext
import csv, io, StringIO
from pyspark.sql.functions import *
from pyspark.sql import Row
from pyspark.sql import *
from pyspark.sql import functions as F
from pyspark.sql.functions import asc, desc

sc = SparkContext("local", "Summary Report")
sqlContext = SQLContext(sc)


class ParallelBuild(object):

    # def myFunc(self, s, i, k):
    #     # words = s.split(" ")
    #     # print words
    #     # return len(words)
    #     k = k + 1
    #     print s, i, k
    #     return s, i, k

    def myFunc(self, s):
        l = s.split(',')
        print l[0], l[1]
        return l[0]

    def list_to_csv_str(x):
        output = StringIO.StringIO("")
        csv.writer(output).writerow(x)
        return output.getvalue().strip()

    def run(self):
        data = [(1,'a'), (1,'b'), (1,'c'), (2,'d'), (3,'r'), (4,'a'),
                (2,'t'), (3,'y'), (1,'f')]
        df = sqlContext.createDataFrame(data, schema=['uid', 'address_uid'])
        return df


if __name__ == "__main__":

    df_rdd = ParallelBuild().run().map(lambda line: line).persist()
    r = df_rdd.map(ParallelBuild().myFunc)
    r.count()
{code}


The above code returns a "'JavaPackage' object is not callable" exception.








[jira] [Updated] (SPARK-15679) Passing functions do not work in classes

2016-05-31 Thread nalin garg (JIRA)

 [ https://issues.apache.org/jira/browse/SPARK-15679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

nalin garg updated SPARK-15679:
---
Description: 
{code}
from pyspark import SparkContext
from pyspark.sql import SQLContext, HiveContext
import csv, io, StringIO
from pyspark.sql.functions import *
from pyspark.sql import Row
from pyspark.sql import *
from pyspark.sql import functions as F
from pyspark.sql.functions import asc, desc

sc = SparkContext("local", "Summary Report")
sqlContext = SQLContext(sc)


class ParallelBuild(object):

    # def myFunc(self, s, i, k):
    #     # words = s.split(" ")
    #     # print words
    #     # return len(words)
    #     k = k + 1
    #     print s, i, k
    #     return s, i, k

    def myFunc(self, s):
        l = s.split(',')
        print l[0], l[1]
        return l[0]

    def list_to_csv_str(x):
        output = StringIO.StringIO("")
        csv.writer(output).writerow(x)
        return output.getvalue().strip()

    def run(self):
        data = [(1,'a'), (1,'b'), (1,'c'), (2,'d'), (3,'r'), (4,'a'),
                (2,'t'), (3,'y'), (1,'f')]
        df = sqlContext.createDataFrame(data, schema=['uid', 'address_uid'])
        return df


if __name__ == "__main__":

    df_rdd = ParallelBuild().run().map(lambda line: line).persist()
    r = df_rdd.map(ParallelBuild().myFunc)
    r.count()
{code}

The above code returns a "'JavaPackage' object is not callable" exception.







[jira] [Created] (SPARK-15679) Passing functions do not work in classes

2016-05-31 Thread nalin garg (JIRA)
nalin garg created SPARK-15679:
--

 Summary: Passing functions do not work in classes
 Key: SPARK-15679
 URL: https://issues.apache.org/jira/browse/SPARK-15679
 Project: Spark
  Issue Type: Bug
Reporter: nalin garg






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org