[jira] [Commented] (SPARK-13634) Assigning spark context to variable results in serialization error

2016-07-17 Thread Rahul Palamuttam (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15381236#comment-15381236
 ] 

Rahul Palamuttam commented on SPARK-13634:
--

Understood, and thank you for explaining.
I agree that it is pretty implicit that you can't serialize context-like 
objects, but it's a little strange when the object gets pulled in without the 
user even writing code that explicitly does so (in the shell). I agree with 
your latter point as well and will take it into consideration; the proposed 
doc change may just be too specific to our use case.


> Assigning spark context to variable results in serialization error
> --
>
> Key: SPARK-13634
> URL: https://issues.apache.org/jira/browse/SPARK-13634
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Reporter: Rahul Palamuttam
>Priority: Minor
>
> The following lines of code cause a task serialization error when executed 
> in the spark-shell. Note that the error does not occur when the code is 
> submitted as a batch job via spark-submit.
> val temp = 10
> val newSC = sc
> val newRDD = newSC.parallelize(0 to 100).map(p => p + temp)
> For some reason, when temp is pulled into the referencing environment of the 
> closure, so is the SparkContext.
> We originally hit this issue in the SciSpark project, when referencing a 
> string variable inside of a lambda expression in RDD.map(...).
> Any insight into how this could be resolved would be appreciated.
> While the above code is trivial, SciSpark uses a wrapper around the 
> SparkContext to read from various file formats. We want to keep this class 
> structure and also use it in notebook and shell environments.






[jira] [Commented] (SPARK-13634) Assigning spark context to variable results in serialization error

2016-07-17 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15381169#comment-15381169
 ] 

Sean Owen commented on SPARK-13634:
---

Go ahead, though in general I think it's pretty implicit that you can't 
serialize context-like objects anywhere. This may in fact be just a hack, and 
you need to redesign your code so that objects that are sent around do not 
capture a context object to begin with. Your use case is not normal shell 
usage; you're writing a custom framework. You can suggest doc changes (in a 
PR); just consider what is quite specific to your usage vs what is likely 
widely applicable enough to go in the docs.
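To make that concrete, here's a rough sketch of what I mean (Transforms and 
addTemp are illustrative names, not your code): keep whatever ships to 
executors free of any context reference, and touch the context only at the 
call site.

{code}
// Logic that ships to executors: a plain function, no context captured.
object Transforms {
  def addTemp(temp: Int): Int => Int = p => p + temp
}

// Only the driver touches sc; the closure serialized with the task is the
// small function returned by addTemp, which captures just an Int.
val result = sc.parallelize(0 to 100).map(Transforms.addTemp(10)).collect()
{code}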




[jira] [Commented] (SPARK-13634) Assigning spark context to variable results in serialization error

2016-07-17 Thread Rahul Palamuttam (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15381168#comment-15381168
 ] 

Rahul Palamuttam commented on SPARK-13634:
--

Kai Chen, thank you, and I apologize for not responding sooner. This does 
resolve our issue.
As a little background: we utilize a wrapper class around the SparkContext, 
and marking the SparkContext field inside the class as transient did not 
resolve our issue. Instead, attaching the @transient annotation to the 
instance of the wrapper class resolved it.
Before:
{code}
val SciSc = new SciSparkContext(sc)
{code}
After:
{code}
@transient val SciSc = new SciSparkContext(sc)
{code}
We utilize the wrapper class SciSparkContext to delegate to functions like 
binaryFiles to read file formats like NetCDF, while abstracting away the 
extra details of actually reading data in that format.
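For reference, a simplified sketch of the pattern (the class body below is 
illustrative, not our actual API):

{code}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Simplified stand-in for SciSparkContext: holds a SparkContext and
// delegates to methods like binaryFiles; real decoding happens elsewhere.
class SciSparkContext(val sc: SparkContext) {
  def readBinary(path: String): RDD[(String, Array[Byte])] =
    sc.binaryFiles(path).map { case (name, stream) => (name, stream.toArray) }
}

// Annotating the *instance* keeps it out of the serialized closure
// environment in the shell.
@transient val SciSc = new SciSparkContext(sc)
{code}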

Sean Owen and Chris A. Mattmann - thank you for allowing the JIRA to be 
re-opened.
I would like to resolve the issue, but first I did want to point out that I 
didn't see much, if any, documentation on this issue.
I was looking at the quick start here: 
http://spark.apache.org/docs/latest/quick-start.html#interactive-analysis-with-the-spark-shell
(I may have just missed it elsewhere.)
The spark-shell as a mode of interacting with Spark seems to be becoming more 
common, especially with notebook projects like Zeppelin (which we are using).
I do think this is worth pointing out and mentioning, even if it is really an 
issue with Scala.
If we are in agreement, I would like to change this JIRA to a documentation 
JIRA and submit the patch (I've never submitted a doc patch, and it would be a 
nice experience for me).

I'll also respond sooner next time.







[jira] [Commented] (SPARK-13634) Assigning spark context to variable results in serialization error

2016-05-09 Thread Kai Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15277427#comment-15277427
 ] 

Kai Chen commented on SPARK-13634:
--

[~Rahul Palamuttam] and [~chrismattmann]

Try
{code}
@transient val newSC = sc
{code}

 in the REPL to prevent SparkContext from being dragged into the serialization 
graph.
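With that change, the snippet from the description should run; a minimal 
sketch of the expected shell session:

{code}
val temp = 10
@transient val newSC = sc   // not dragged into serialized closures
val newRDD = newSC.parallelize(0 to 100).map(p => p + temp)
newRDD.count()   // should complete without a Task not serializable error
{code}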

Cheers!




[jira] [Commented] (SPARK-13634) Assigning spark context to variable results in serialization error

2016-03-07 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15184597#comment-15184597
 ] 

Chris A. Mattmann commented on SPARK-13634:
---

Sean, thanks for your reply. We can agree to disagree on the semantics. I've 
been doing open source for a long time, and leaving JIRAs open for longer than 
43 minutes is not damaging by any means. As a former Spark mentor during its 
Incubation, and its Champion, I also disagree; I was involved in Spark from 
its early inception here at the ASF, and I have not always seen this type of 
behavior, which is why it's troubling to me. Your comparison of one end of the 
spectrum (10) to 1000s in size of JIRAs and activity also leaves a sour taste 
in my mouth. I know Spark gets lots of activity. So do many of the projects 
I've helped start and contributed to (Hadoop, Lucene/Solr, Nutch during its 
heyday, etc.). I left JIRAs open for longer than 43 minutes in those projects, 
as did many others wiser than me who have been around a lot longer in open 
source.

Thanks for taking time to think through what may be causing it. I'll choose to 
take the positive away from your reply and try to report back more on our 
workarounds in SciSpark and on our project.

--Chris




[jira] [Commented] (SPARK-13634) Assigning spark context to variable results in serialization error

2016-03-07 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15184563#comment-15184563
 ] 

Sean Owen commented on SPARK-13634:
---

JIRAs can be reopened, and should be if there's a change, like: you have a pull 
request to propose, or a different example or more analysis that suggests it's 
not just a Scala REPL thing. People can still comment on JIRAs too.

All else equal, a reply in 43 minutes is a good thing. While I can appreciate 
that, ideally, we'd always let the reporter explicitly confirm they're done or 
something, that's not feasible in this project. On average a JIRA is opened 
every _hour_, many of which never receive any follow-up. Leaving them open is 
damaging too, since people inevitably parse that as "legitimate issue I should 
work on or wait on". If I see a quite-likely answer, I'd rather reflect it in 
JIRA, and once in a while overturn it, since reopening is a normal lightweight 
operation that can be performed by the reporter.

Further, the reality is that about half of those JIRAs are not real problems, 
or are badly described, poorly researched, etc. (not this one), and actually 
_need_ rapid pushback with pointers to the contribution guide to discourage 
more of that behavior.

This is why some things get resolved fast in general; the intent is to put 
limited time to best use and to get most people some quick feedback. I 
understand it's not how a project with 10 JIRAs a month probably operates, but 
I disagree that my reply was wrong or impolite.

Instead, I'd certainly welcome materially more information and a proposed 
change if you want to pursue and reopen this. For example, off the top of my 
head: does the ClosureCleaner treat {{sc}} specially? It may do so because 
there isn't supposed to be a second context in the application.

However, if this is your real code, I strongly suspect you have a simple 
workaround: refactor the third line into a function on an {{object}} (i.e. 
static). The layer of indirection, or something similar, likely avoids 
tripping on this. This is what I've suggested you pursue next. If that works, 
that's great info to paste here, at least as confirmation; if not, add it here 
anyway to show what else doesn't work.
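Something like this, roughly (Ops and addOffset are illustrative names):

{code}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// The closure passed to map() is defined inside a (static) object, so it
// cannot accidentally capture the shell's line object that holds newSC.
object Ops {
  def addOffset(sc: SparkContext, offset: Int): RDD[Int] =
    sc.parallelize(0 to 100).map(p => p + offset)
}

val newRDD = Ops.addOffset(newSC, temp)
{code}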




[jira] [Commented] (SPARK-13634) Assigning spark context to variable results in serialization error

2016-03-07 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15184536#comment-15184536
 ] 

Chris A. Mattmann commented on SPARK-13634:
---

I'm CC'ed because I'm the PI of the SciSpark project, and I asked Rahul to 
file this issue here. It's not a toy example - it's a real example from our 
system. We have a workaround, but we were wondering if Apache Spark had 
thought of anything better or seen something similar.

Our code is here: 
https://github.com/Scispark/scispark/

The question I was asking was about etiquette. I don't think it's good 
etiquette to close tickets before the reporter has had a chance to weigh back 
in. This one was closed in literally 43 minutes, without even waiting for 
Rahul to chime back in. Is it really that urgent to close an issue a user has 
reported, without hearing back from them to see whether your suggestion helped 
or answered their question?




[jira] [Commented] (SPARK-13634) Assigning spark context to variable results in serialization error

2016-03-07 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15184505#comment-15184505
 ] 

Sean Owen commented on SPARK-13634:
---

Chris, I resolved this as a duplicate of an issue that's "WontFix". I'm not 
suggesting there is a resolution in Spark. The implicit workaround here is, of 
course, to not declare newSC. There may be others, and that may matter, since 
I suspect this is just a toy example; without seeing real code, I couldn't say 
more about other workarounds. I'm not sure why you were CC'ed, but what are 
you taking issue with?




[jira] [Commented] (SPARK-13634) Assigning spark context to variable results in serialization error

2016-03-02 Thread Rahul Palamuttam (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15177093#comment-15177093
 ] 

Rahul Palamuttam commented on SPARK-13634:
--

[~chrismattmann]
