[jira] [Commented] (SPARK-13634) Assigning spark context to variable results in serialization error
[ https://issues.apache.org/jira/browse/SPARK-13634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15184597#comment-15184597 ] Chris A. Mattmann commented on SPARK-13634: --- Sean, thanks for your reply. We can agree to disagree on the semantics. I've been doing open source for a long time, and leaving JIRAs open for longer than 43 minutes is not damaging by any means. As a former Spark mentor during its Incubation, and its Champion, I also disagree; I was involved in Spark from its early inception here at the ASF and have not always seen this type of behavior, which is why it's troubling to me. Your comparison of one end of the spectrum (10) to 1000s in size of JIRAs and activity also leaves a sour taste in my mouth. I know Spark gets lots of activity. So do many of the projects I've helped start and contribute to (Hadoop, Lucene/Solr, Nutch during its heyday, etc.). I left JIRAs open for longer than 43 minutes in those projects, as did many others wiser than me who have been around a lot longer in open source. Thanks for taking time to think through what may be causing it. I'll choose to take the positive away from your reply and try to report back more on our workarounds in SciSpark and on our project. --Chris > Assigning spark context to variable results in serialization error > -- > > Key: SPARK-13634 > URL: https://issues.apache.org/jira/browse/SPARK-13634 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Reporter: Rahul Palamuttam >Priority: Minor > > The following lines of code cause a task serialization error when executed in > the spark-shell. > Note that the error does not occur when submitting the code as a batch job > via spark-submit. > val temp = 10 > val newSC = sc > val newRDD = newSC.parallelize(0 to 100).map(p => p + temp) > For some reason, when temp is pulled into the referencing environment > of the closure, so is the SparkContext. 
> We originally hit this issue in the SciSpark project, when referencing a > string variable inside a lambda expression in RDD.map(...) > Any insight into how this could be resolved would be appreciated. > While the above code is trivial, SciSpark uses a wrapper around the > SparkContext to read from various file formats. We want to keep this class > structure and also use it in notebook and shell environments. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
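To make the reproducer concrete, here is a hedged sketch of the shell session together with the most common workaround for this class of error: copying the needed value into a local val inside a fresh scope, so the closure captures only that value rather than the REPL line object that also holds the SparkContext. This assumes a running spark-shell with its provided `sc`; it is a sketch, not confirmed against this exact issue:

```scala
// Assumes a running spark-shell, where `sc` is the shell-provided SparkContext.
val temp = 10
val newSC = sc   // the REPL line object wrapping this val now holds a SparkContext

// Fails with a task-serialization error in the shell: capturing `temp` drags
// in the enclosing REPL object, which also references the non-serializable sc.
// val newRDD = newSC.parallelize(0 to 100).map(p => p + temp)

// Workaround sketch: copy the value into a local scope before closing over it.
val newRDD = {
  val localTemp = temp   // a plain Int; the closure captures only this
  newSC.parallelize(0 to 100).map(p => p + localTemp)
}
```

The same pattern applies to the string-variable case mentioned below: copy the string into a local val immediately before the `map(...)` call.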
[jira] [Updated] (SPARK-13734) SparkR histogram
[ https://issues.apache.org/jira/browse/SPARK-13734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-13734: -- Priority: Minor (was: Major) [~olarayej] this isn't sufficient for a JIRA. Please fill out title and description properly. > SparkR histogram > > > Key: SPARK-13734 > URL: https://issues.apache.org/jira/browse/SPARK-13734 > Project: Spark > Issue Type: New Feature > Components: SparkR >Reporter: Oscar D. Lara Yejas >Priority: Minor >
[jira] [Commented] (SPARK-13736) Big-Endian platform issues
[ https://issues.apache.org/jira/browse/SPARK-13736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15184583#comment-15184583 ] Sean Owen commented on SPARK-13736: --- Let's not make an umbrella, since it doesn't cover much, and doesn't actually cover other big-endian issues in the past. These are discrete JIRAs underneath. > Big-Endian platform issues > --- > > Key: SPARK-13736 > URL: https://issues.apache.org/jira/browse/SPARK-13736 > Project: Spark > Issue Type: Epic > Components: SQL >Affects Versions: 1.6.0 >Reporter: Luciano Resende >Priority: Critical > > We are starting to see a few issues when building/testing on Big-Endian > platforms. This serves as an umbrella jira to group all platform-specific > issues.
[jira] [Commented] (SPARK-13634) Assigning spark context to variable results in serialization error
[ https://issues.apache.org/jira/browse/SPARK-13634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15184563#comment-15184563 ] Sean Owen commented on SPARK-13634: --- JIRAs can be reopened, and should be if there's a change, like: you have a pull request to propose, or a different example or more analysis that suggests it's not just a Scala REPL thing. People can still comment on JIRAs too. All else equal, a reply in 43 minutes is a good thing. While I can appreciate that, ideally, we'd always let the reporter explicitly confirm they're done or something, that's not feasible in this project. On average a JIRA is opened every _hour_, many of which never receive any follow-up. Leaving them open is damaging too, since people inevitably parse that as "legitimate issue I should work on or wait on". If I see a quite-likely answer, I'd rather reflect it in JIRA, and once in a while be overturned, since reopening is a normal lightweight operation that can be performed by the reporter. Further, the reality is that about half of those JIRAs are not problems, are badly described or poorly researched, etc. (not this one), and actually _need_ rapid pushback with pointers to the contribution guide to discourage more of that behavior. This is why some things get resolved fast in general; the intent is to put limited time to the best use for the most people, and to get most people some quick feedback. I understand it's not how a project with 10 JIRAs a month probably operates, but I disagree that my reply was wrong or impolite. Instead I'd certainly welcome materially more information and a proposed change if you want to pursue and reopen this. For example, off the top of my head: does the ClosureCleaner specially treat {{sc}}? It may do so because there isn't supposed to be a second context in the application. 
However, if this is your real code, I strongly suspect you have a simple workaround in refactoring the third line into a function on an {{object}} (i.e. static). The layer of indirection, or something similar, likely avoids tripping on this. This is what I've suggested you pursue next. If that works, that's great info to paste here, at least as confirmation. Or if not, add it here anyway to show what else doesn't work.
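The suggested refactoring — moving the lambda onto an {{object}} so it is effectively static — could look something like the following hedged sketch. The names `Adder` and `addConst` are illustrative, not from the issue, and this assumes a running shell with `sc`:

```scala
// Hedged sketch: define the mapping logic on an object (i.e. static), so the
// function value shipped to executors references no REPL line object.
object Adder extends Serializable {
  def addConst(c: Int): Int => Int = p => p + c  // the returned function captures only the Int c
}

val temp = 10
// Adder.addConst(temp) is evaluated eagerly on the driver; only the returned
// function — which closes over a plain Int — is serialized to executors.
val rdd = sc.parallelize(0 to 100).map(Adder.addConst(temp))
```

The key point of the indirection is that `temp` is read once on the driver when the function is constructed, rather than being captured by the closure itself.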
[jira] [Commented] (SPARK-13634) Assigning spark context to variable results in serialization error
[ https://issues.apache.org/jira/browse/SPARK-13634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15184536#comment-15184536 ] Chris A. Mattmann commented on SPARK-13634: --- I'm CC'ed b/c I'm the PI of the SciSpark project and I asked Rahul to file this issue here. It's not a toy example - it's a real example from our system. We have a workaround but were wondering if Apache Spark had thought of anything better or seen something similar. Our code is here: https://github.com/Scispark/scispark/ The question I was asking was related to etiquette. I don't think it's good etiquette to close tickets under which the reporter has weighed in. This was closed in literally 43 minutes, without even waiting for Rahul to chime back in. Is it really that urgent to close an issue a user has reported, that quickly, without hearing back from them to see if your suggestion helped or answered their question?
[jira] [Commented] (SPARK-13740) add null check for _verify_type in types.py
[ https://issues.apache.org/jira/browse/SPARK-13740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15184532#comment-15184532 ] Apache Spark commented on SPARK-13740: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/11574 > add null check for _verify_type in types.py > --- > > Key: SPARK-13740 > URL: https://issues.apache.org/jira/browse/SPARK-13740 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan >
[jira] [Assigned] (SPARK-13740) add null check for _verify_type in types.py
[ https://issues.apache.org/jira/browse/SPARK-13740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13740: Assignee: (was: Apache Spark)
[jira] [Assigned] (SPARK-13740) add null check for _verify_type in types.py
[ https://issues.apache.org/jira/browse/SPARK-13740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13740: Assignee: Apache Spark
[jira] [Created] (SPARK-13740) add null check for _verify_type in types.py
Wenchen Fan created SPARK-13740: --- Summary: add null check for _verify_type in types.py Key: SPARK-13740 URL: https://issues.apache.org/jira/browse/SPARK-13740 Project: Spark Issue Type: Improvement Components: SQL Reporter: Wenchen Fan
[jira] [Updated] (SPARK-13231) Make Accumulable.countFailedValues a user facing API.
[ https://issues.apache.org/jira/browse/SPARK-13231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Sharma updated SPARK-13231: Description: Exposing it to user has no disadvantage I can think of, but it can be useful for them. One scenario can be a user defined metric. It also clarifies the fact that, by default we do not include values of failed tasks and behavior can be changed if a user is using it to account for some metrics. (was: Rename Accumulable.countFailedValues to Accumulable.includeValuesOfFailedTasks (or includeFailedTasks) I liked the longer version though. Exposing it to user has no disadvantage I can think of, but it can be useful for them. One scenario can be a user defined metric.) > Make Accumulable.countFailedValues a user facing API. > - > > Key: SPARK-13231 > URL: https://issues.apache.org/jira/browse/SPARK-13231 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: Prashant Sharma >Priority: Minor >
[jira] [Commented] (SPARK-3200) Class defined with reference to external variables crashes in REPL.
[ https://issues.apache.org/jira/browse/SPARK-3200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15184510#comment-15184510 ] Prashant Sharma commented on SPARK-3200: Yes, the issue SPARK-13634 is actually a duplicate. I have linked it as a duplicate. > Class defined with reference to external variables crashes in REPL. > --- > > Key: SPARK-3200 > URL: https://issues.apache.org/jira/browse/SPARK-3200 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 1.1.0 >Reporter: Prashant Sharma >Assignee: Prashant Sharma > > Reproducer: > {noformat} > val a = sc.textFile("README.md").count > case class A(i: Int) { val j = a } > sc.parallelize(1 to 10).map(A(_)).collect() > {noformat} > This will happen only in distributed mode, when one refers something that > refers sc, and not otherwise. > There are many ways to work around this, like directly assigning a constant > value instead of referring the variable.
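The workaround the description alludes to — assigning the value directly instead of referring to the REPL variable — can be sketched as follows. This is a hedged sketch, not a confirmed fix; it assumes a running shell with `sc`, and the two-field `A` is illustrative:

```scala
// Assumes a spark-shell with the provided `sc`.
val a = sc.textFile("README.md").count

// Take j as a constructor parameter instead of `val j = a` in the class body,
// so instances do not retain a reference to the REPL scope that defined `a`.
case class A(i: Int, j: Long)

// Copy `a` into a local scope so the closure captures a plain Long, not the
// REPL line object that (transitively) references sc.
val rdd = {
  val localA = a
  sc.parallelize(1 to 10).map(A(_, localA))
}
rdd.collect()
```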
[jira] [Commented] (SPARK-13634) Assigning spark context to variable results in serialization error
[ https://issues.apache.org/jira/browse/SPARK-13634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15184505#comment-15184505 ] Sean Owen commented on SPARK-13634: --- Chris, I resolved this as a duplicate of an issue that's "Won't Fix". I'm not suggesting there is a resolution in Spark. The implicit workaround here is to not declare newSC, of course. There may be others, and that may matter, since I suspect this is just a toy example. Without seeing real code, I couldn't say more about other workarounds. I'm not sure why you were CCed, but what are you taking issue with?
[jira] [Commented] (SPARK-5581) When writing sorted map output file, avoid open / close between each partition
[ https://issues.apache.org/jira/browse/SPARK-5581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15184506#comment-15184506 ] Josh Rosen commented on SPARK-5581: --- [~sitalke...@gmail.com], I'm not working on this right now so feel free to submit a PR. Before you do, though, you might want to take a peek at https://github.com/apache/spark/pull/11498 to see whether the refactorings in that PR might make things easier. I'd also appreciate help on review of that PR, since it's kind of tricky and needs a bit more work. > When writing sorted map output file, avoid open / close between each partition > -- > > Key: SPARK-5581 > URL: https://issues.apache.org/jira/browse/SPARK-5581 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 1.3.0 >Reporter: Sandy Ryza > > {code} > // Bypassing merge-sort; get an iterator by partition and just write > everything directly. > for ((id, elements) <- this.partitionedIterator) { > if (elements.hasNext) { > val writer = blockManager.getDiskWriter( > blockId, outputFile, ser, fileBufferSize, > context.taskMetrics.shuffleWriteMetrics.get) > for (elem <- elements) { > writer.write(elem) > } > writer.commitAndClose() > val segment = writer.fileSegment() > lengths(id) = segment.length > } > } > {code}
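The {{code}} block above obtains a fresh DiskBlockObjectWriter and commits/closes it for every partition. The improvement the title asks for — keeping one file handle open across partitions and recording per-partition segment lengths — can be sketched generically in plain Scala (no Spark types; the function and parameter names are illustrative, not Spark's actual API):

```scala
import java.io.{BufferedOutputStream, DataOutputStream, File, FileOutputStream}

// Hedged sketch of the "one open file, many segments" pattern: keep a single
// output stream open across all partitions and record each partition's segment
// length, instead of open/commitAndClose per partition as in the snippet above.
def writeAllPartitions(
    outputFile: File,
    partitions: Iterator[(Int, Iterator[Int])],
    numPartitions: Int): Array[Long] = {
  val lengths = new Array[Long](numPartitions)
  val out = new DataOutputStream(
    new BufferedOutputStream(new FileOutputStream(outputFile)))
  try {
    for ((id, elements) <- partitions) {
      val before = out.size()            // bytes written before this partition
      elements.foreach(out.writeInt)     // stand-in for the real serializer
      lengths(id) = out.size() - before  // segment length for partition id
    }
  } finally {
    out.close()                          // single close at the very end
  }
  lengths
}
```

The real change inside Spark would also need to preserve the per-segment flush/commit semantics that commitAndClose provides; this sketch only illustrates the open/close amortization.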
[jira] [Commented] (SPARK-5581) When writing sorted map output file, avoid open / close between each partition
[ https://issues.apache.org/jira/browse/SPARK-5581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15184491#comment-15184491 ] Sital Kedia commented on SPARK-5581: Is anyone working on this issue? If not, I would like to work on it. We are seeing very bad map-side performance when the number of partitions is too large, because of this issue.
[jira] [Commented] (SPARK-3200) Class defined with reference to external variables crashes in REPL.
[ https://issues.apache.org/jira/browse/SPARK-3200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15184486#comment-15184486 ] Chris A. Mattmann commented on SPARK-3200: -- Hi [~prashant_], thanks for your reply. If you look at the linked issue, someone linked the issue we were having in https://issues.apache.org/jira/browse/SPARK-13634 as a duplicate of this, so I just wanted to see if this was a fix for SPARK-13634.
[jira] [Comment Edited] (SPARK-3200) Class defined with reference to external variables crashes in REPL.
[ https://issues.apache.org/jira/browse/SPARK-3200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15184468#comment-15184468 ] Prashant Sharma edited comment on SPARK-3200 at 3/8/16 6:06 AM: Hi [~chrismattmann], it never worked; I have clarified above. But since no one apart from me ever ran into this, and the complexity of the fix was non-trivial, it was "won't fix". We now use the scala repl "as is", without much modification, so fixing it would require a considerably larger change than it did back then, and with that change the maintenance overhead would also be large. However, if the fix is in high demand, one can go ahead and fix it in the scala repl too. It is also possible to work around it. Did you run into this issue? [EDIT] The patch proposed in the Jira can still be merged for the scala 2.10 port of the scala repl that lives in Spark. But that should be done only if this fix is highly critical. was (Author: prashant_): Hi [~chrismattmann], It never worked. I have clarified above. But since no one apart from me ever ran into this and complexity of the fix was non trivial, it was "won't fix". Actually now that we use the scala repl "as is" without much modifications. So if it needs to be fixed, there is a considerably large amount of change than it was required back then. Along with the change the maintenance overhead will also be large. However, if the fix is in high demand. One can go ahead and fix in the scala repl too. It is also possible to work around it. Did you ran into this issue ?
[jira] [Closed] (SPARK-3200) Class defined with reference to external variables crashes in REPL.
[ https://issues.apache.org/jira/browse/SPARK-3200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Sharma closed SPARK-3200. -- Resolution: Won't Fix
[jira] [Reopened] (SPARK-3200) Class defined with reference to external variables crashes in REPL.
[ https://issues.apache.org/jira/browse/SPARK-3200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Sharma reopened SPARK-3200: I am reopening only to close it again with Won't Fix.
[jira] [Commented] (SPARK-3200) Class defined with reference to external variables crashes in REPL.
[ https://issues.apache.org/jira/browse/SPARK-3200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15184468#comment-15184468 ] Prashant Sharma commented on SPARK-3200: Hi [~chrismattmann], it never worked; I have clarified above. But since no one apart from me ever ran into this, and the complexity of the fix was non-trivial, it was "won't fix". We now use the scala repl "as is", without much modification, so fixing it would require a considerably larger change than it did back then, and with that change the maintenance overhead would also be large. However, if the fix is in high demand, one can go ahead and fix it in the scala repl too. It is also possible to work around it. Did you run into this issue?
[jira] [Resolved] (SPARK-13659) Remove returnValues from BlockStore APIs
[ https://issues.apache.org/jira/browse/SPARK-13659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-13659. --- Resolution: Fixed Fix Version/s: 2.0.0 > Remove returnValues from BlockStore APIs > > > Key: SPARK-13659 > URL: https://issues.apache.org/jira/browse/SPARK-13659 > Project: Spark > Issue Type: Improvement > Components: Block Manager >Reporter: Josh Rosen >Assignee: Josh Rosen > Fix For: 2.0.0 > > > In preparation for larger refactorings, I think that we should remove the > confusing returnValues() option from the BlockStore put() APIs: returning the > value is only useful in one place (caching) and in other situations, such as > block replication, it's simpler to put() and then get().
[jira] [Commented] (SPARK-3200) Class defined with reference to external variables crashes in REPL.
[ https://issues.apache.org/jira/browse/SPARK-3200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1518#comment-1518 ] Chris A. Mattmann commented on SPARK-3200: -- How did this get resolved, other than simply saying that in 1.6 it worked? I'm confused.
[jira] [Commented] (SPARK-13634) Assigning spark context to variable results in serialization error
[ https://issues.apache.org/jira/browse/SPARK-13634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15184442#comment-15184442 ] Chris A. Mattmann commented on SPARK-13634: --- Hi [~srowen], it would have been nice to make sure this resolved [~Rahul Palamuttam]'s issue before closing it. Isn't that simply good etiquette?
[jira] [Updated] (SPARK-13231) Make Accumulable.countFailedValues a user facing API.
[ https://issues.apache.org/jira/browse/SPARK-13231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Sharma updated SPARK-13231: Summary: Make Accumulable.countFailedValues a user facing API. (was: Rename Accumulable.countFailedValues to Accumulable.includeValuesOfFailedTasks and make it a user facing API.) > Make Accumulable.countFailedValues a user facing API. > - > > Key: SPARK-13231 > URL: https://issues.apache.org/jira/browse/SPARK-13231 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: Prashant Sharma >Priority: Minor > > Rename Accumulable.countFailedValues to > Accumulable.includeValuesOfFailedTasks (or includeFailedTasks) I liked the > longer version though. > Exposing it to user has no disadvantage I can think of, but it can be useful > for them. One scenario can be a user defined metric. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-13711) Apache Spark driver stopping JVM when master not available
[ https://issues.apache.org/jira/browse/SPARK-13711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-13711. -- Resolution: Fixed Assignee: Shixiong Zhu Fix Version/s: 2.0.0 1.6.2 > Apache Spark driver stopping JVM when master not available > --- > > Key: SPARK-13711 > URL: https://issues.apache.org/jira/browse/SPARK-13711 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.4.1, 1.6.0 >Reporter: Era >Assignee: Shixiong Zhu > Fix For: 1.6.2, 2.0.0 > > > In my application Java spark context is created with an unavailable master > URL (you may assume master is down for a maintenance). When creating Java > spark context it leads to stopping JVM that runs spark driver with JVM exit > code 50. > When I checked the logs I found SparkUncaughtExceptionHandler calling the > System.exit. My program should run forever. > package test.mains; > import org.apache.spark.SparkConf; > import org.apache.spark.api.java.JavaSparkContext; > public class CheckJavaSparkContext { > public static void main(String[] args) { > SparkConf conf = new SparkConf(); > conf.setAppName("test"); > conf.setMaster("spark://sunshinee:7077"); > try { > new JavaSparkContext(conf); > } catch (Throwable e) { > System.out.println("Caught an exception : " + e.getMessage()); > > } > System.out.println("Waiting to complete..."); > while (true) { > } > } > } > Output log > Using Spark's default log4j profile: > org/apache/spark/log4j-defaults.properties > SLF4J: Class path contains multiple SLF4J bindings. > SLF4J: Found binding in > [jar:file:/data/downloads/spark-1.6.0-bin-hadoop2.6/lib/spark-assembly-1.6.0-hadoop2.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class] > SLF4J: Found binding in > [jar:file:/data/downloads/spark-1.6.0-bin-hadoop2.6/lib/spark-examples-1.6.0-hadoop2.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class] > SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an > explanation. 
> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory] > 16/03/04 18:01:15 INFO SparkContext: Running Spark version 1.6.0 > 16/03/04 18:01:17 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > 16/03/04 18:01:17 WARN Utils: Your hostname, pesamara-mobl-vm1 resolves to a > loopback address: 127.0.0.1; using 10.30.9.107 instead (on interface eth0) > 16/03/04 18:01:17 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to > another address > 16/03/04 18:01:18 INFO SecurityManager: Changing view acls to: ps40233 > 16/03/04 18:01:18 INFO SecurityManager: Changing modify acls to: ps40233 > 16/03/04 18:01:18 INFO SecurityManager: SecurityManager: authentication > disabled; ui acls disabled; users with view permissions: Set(ps40233); users > with modify permissions: Set(ps40233) > 16/03/04 18:01:19 INFO Utils: Successfully started service 'sparkDriver' on > port 55309. > 16/03/04 18:01:21 INFO Slf4jLogger: Slf4jLogger started > 16/03/04 18:01:21 INFO Remoting: Starting remoting > 16/03/04 18:01:22 INFO Remoting: Remoting started; listening on addresses > :[akka.tcp://sparkDriverActorSystem@10.30.9.107:52128] > 16/03/04 18:01:22 INFO Utils: Successfully started service > 'sparkDriverActorSystem' on port 52128. > 16/03/04 18:01:22 INFO SparkEnv: Registering MapOutputTracker > 16/03/04 18:01:22 INFO SparkEnv: Registering BlockManagerMaster > 16/03/04 18:01:22 INFO DiskBlockManager: Created local directory at > /tmp/blockmgr-87c20178-357d-4252-a46a-62a755568a98 > 16/03/04 18:01:22 INFO MemoryStore: MemoryStore started with capacity 457.7 MB > 16/03/04 18:01:22 INFO SparkEnv: Registering OutputCommitCoordinator > 16/03/04 18:01:23 INFO Utils: Successfully started service 'SparkUI' on port > 4040. > 16/03/04 18:01:23 INFO SparkUI: Started SparkUI at http://10.30.9.107:4040 > 16/03/04 18:01:24 INFO AppClient$ClientEndpoint: Connecting to master > spark://sunshinee:7077... 
> 16/03/04 18:01:24 WARN AppClient$ClientEndpoint: Failed to connect to master > sunshinee:7077 > java.io.IOException: Failed to connect to sunshinee:7077 > at > org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216) > at > org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216) > at > org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216) > at > org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:167) > at > org.apache.spark.rpc.netty.NettyRpcEnv.
[jira] [Commented] (SPARK-13139) Create native DDL commands
[ https://issues.apache.org/jira/browse/SPARK-13139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15184402#comment-15184402 ] Apache Spark commented on SPARK-13139: -- User 'andrewor14' has created a pull request for this issue: https://github.com/apache/spark/pull/11573 > Create native DDL commands > -- > > Key: SPARK-13139 > URL: https://issues.apache.org/jira/browse/SPARK-13139 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin > > We currently delegate most DDLs directly to Hive, through NativePlaceholder > in HiveQl.scala. In Spark 2.0, we want to provide native implementations for > DDLs for both SQLContext and HiveContext. > The first step is to properly parse these DDLs, and then create logical > commands that encapsulate them. The actual implementation can still delegate > to HiveNativeCommand. As an example, we should define a command for > RenameTable with the proper fields, and just delegate the implementation to > HiveNativeCommand (we might need to track the original sql query in order to > run HiveNativeCommand, but we can remove the sql query in the future once we > do the next step). > Once we flush out the internal persistent catalog API, we can then switch the > implementation of these newly added commands to use the catalog API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12177) Update KafkaDStreams to new Kafka 0.9 Consumer API
[ https://issues.apache.org/jira/browse/SPARK-12177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15184387#comment-15184387 ] Cody Koeninger commented on SPARK-12177: I've been hacking on a simple LRU cache for consumers and preferred locations to take advantage of it. Will update here if it works out. There are some things about the new consumer that make it awkward for this purpose, mentioned on the Kafka dev list but with no real response. > Update KafkaDStreams to new Kafka 0.9 Consumer API > -- > > Key: SPARK-12177 > URL: https://issues.apache.org/jira/browse/SPARK-12177 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.6.0 >Reporter: Nikita Tarasenko > Labels: consumer, kafka > > Kafka 0.9 is already released and it introduces a new consumer API that is not > compatible with the old one. So, I added the new consumer API. I made separate > classes in package org.apache.spark.streaming.kafka.v09 with the changed API. I > didn't remove the old classes, for backward compatibility. Users will not need > to change their old Spark applications when they upgrade to a new Spark version. > Please review my changes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
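A consumer cache of the kind Cody describes can be sketched with `java.util.LinkedHashMap` in access-order mode. This is an illustrative shell only, not Spark's eventual implementation: a real cache would key on topic-partition and close evicted Kafka consumers in `removeEldestEntry`.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal LRU cache sketch: the eldest (least recently used) entry is
// evicted once the cache grows past maxEntries.
public class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    public LruCache(int maxEntries) {
        super(16, 0.75f, true); // accessOrder=true -> iteration order is LRU order
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // In a real consumer cache, eldest.getValue().close() would go here.
        return size() > maxEntries;
    }
}
```

For example, with `maxEntries = 2`, putting a third entry evicts whichever of the first two was touched least recently, which is exactly the behavior wanted for pinning consumers to preferred executor locations.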
[jira] [Updated] (SPARK-13593) improve the `createDataFrame` method to accept data type string and verify the data
[ https://issues.apache.org/jira/browse/SPARK-13593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-13593: Summary: improve the `createDataFrame` method to accept data type string and verify the data (was: improve the `toDF()` method to accept data type string and verify the data) > improve the `createDataFrame` method to accept data type string and verify > the data > --- > > Key: SPARK-13593 > URL: https://issues.apache.org/jira/browse/SPARK-13593 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-13404) Create the variables for input when it's used
[ https://issues.apache.org/jira/browse/SPARK-13404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-13404. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11274 [https://github.com/apache/spark/pull/11274] > Create the variables for input when it's used > - > > Key: SPARK-13404 > URL: https://issues.apache.org/jira/browse/SPARK-13404 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Davies Liu >Assignee: Davies Liu > Fix For: 2.0.0 > > > Right now, we create the variables in the first operator (usually > InputAdapter), they could be wasted if most of rows after filtered out > immediately. > We should defer that until they are used by following operators. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12177) Update KafkaDStreams to new Kafka 0.9 Consumer API
[ https://issues.apache.org/jira/browse/SPARK-12177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15184346#comment-15184346 ] Mansi Shah commented on SPARK-12177: So it looks like we are dealing with two independent issues here: (a) version support and (b) the 0.9.0 plugin design. Should we at least start hashing out the design for the new consumer, and then see where it fits? Do the concerned folks still want to get on a call to figure out the design? > Update KafkaDStreams to new Kafka 0.9 Consumer API > -- > > Key: SPARK-12177 > URL: https://issues.apache.org/jira/browse/SPARK-12177 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.6.0 >Reporter: Nikita Tarasenko > Labels: consumer, kafka > > Kafka 0.9 is already released and it introduces a new consumer API that is not > compatible with the old one. So, I added the new consumer API. I made separate > classes in package org.apache.spark.streaming.kafka.v09 with the changed API. I > didn't remove the old classes, for backward compatibility. Users will not need > to change their old Spark applications when they upgrade to a new Spark version. > Please review my changes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-13737) Add getOrCreate method for HiveContext
[ https://issues.apache.org/jira/browse/SPARK-13737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mao, Wei closed SPARK-13737. Resolution: Won't Fix > Add getOrCreate method for HiveContext > -- > > Key: SPARK-13737 > URL: https://issues.apache.org/jira/browse/SPARK-13737 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Mao, Wei > > There is a "getOrCreate" method in SQLContext, which is useful to recoverable > streaming application with SQL operation. > https://spark.apache.org/docs/latest/streaming-programming-guide.html#dataframe-and-sql-operations > But the corresponding method is missing in HiveContext. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13737) Add getOrCreate method for HiveContext
[ https://issues.apache.org/jira/browse/SPARK-13737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15184263#comment-15184263 ] Mao, Wei commented on SPARK-13737: -- According to Reynold, databricks is actually going to deprecate HiveContext in Spark2.0 because it has been one of the most confusing contexts in Spark. And user can change constructor to just create a SQLContext/SparkSession See more in https://issues.apache.org/jira/browse/SPARK-13485 > Add getOrCreate method for HiveContext > -- > > Key: SPARK-13737 > URL: https://issues.apache.org/jira/browse/SPARK-13737 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Mao, Wei > > There is a "getOrCreate" method in SQLContext, which is useful to recoverable > streaming application with SQL operation. > https://spark.apache.org/docs/latest/streaming-programming-guide.html#dataframe-and-sql-operations > But the corresponding method is missing in HiveContext. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
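The getOrCreate pattern SPARK-13737 asks for can be sketched generically. The `Registry` class and `Supplier`-based factory below are illustrative stand-ins, not HiveContext's or SQLContext's actual API; the point is only the return-existing-else-create-once semantics that makes recoverable streaming applications possible:

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

// Sketch of a getOrCreate registry: the first caller's factory wins,
// every later call returns the same instance.
public class Registry {
    private static final AtomicReference<Object> INSTANCE = new AtomicReference<>();

    public static Object getOrCreate(Supplier<Object> factory) {
        Object existing = INSTANCE.get();
        if (existing != null) {
            return existing;            // fast path: already created
        }
        INSTANCE.compareAndSet(null, factory.get()); // only one CAS can win
        return INSTANCE.get();
    }
}
```

Under contention two factories may run, but only one result is ever published, which matches the single-active-context invariant such a method would need to preserve.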
[jira] [Comment Edited] (SPARK-13731) expression evaluation for NaN in select statement
[ https://issues.apache.org/jira/browse/SPARK-13731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15184034#comment-15184034 ] Ian edited comment on SPARK-13731 at 3/8/16 2:35 AM: - The test case we provided is using simple arithmetic expression like divisions for double, but in fact, many other math functions are having the same behaviors that returns null for NaN/Infinity. For instance, log() and corr(). {code} SELECT log(a/b), corr(a, b) FROM testNan order by a, b {code} was (Author: ianlcsd): The test case we provided is using simple arithmetic expression like divisions for double, but in fact, many other math functions are having the same behaviors that returns null for NaN/Infinity. For instance, log() and corr(). {code} SELECT log(a/b), corn(a, b) FROM testNan order by a, b {code} > expression evaluation for NaN in select statement > - > > Key: SPARK-13731 > URL: https://issues.apache.org/jira/browse/SPARK-13731 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Ian > > We are expecting that arithmetic expression a/b should be: > 1. returning NaN if a=0 and b=0 > 2. returning Infinity if a=1 and b=0 > Is the expectation reasonable? > The following is a simple test case snippet that reads from storage and > evaluates arithmetic expressions in select. 
> It is assuming org.apache.spark.sql.hive.execution.SQLQuerySuite: > {code} > test("Expression should be evaluated to Nan/Infinity in Select") { > withTable("testNan") { > withTempTable("src") { > Seq((1d, 0d), (0d, 0d)).toDF().registerTempTable("src") > sql("CREATE TABLE testNan(a double, b double) STORED AS PARQUET AS > SELECT * FROM src") > } > checkAnswer(sql( > """ > |SELECT a/b FROM testNan > """.stripMargin), > Seq( > Row(Double.PositiveInfinity), > Row(Double.NaN) > ) > ) > } > } > == Physical Plan == > Project [(a#28 / b#29) AS _c0#30] > +- Scan ParquetRelation: default.testnan[a#28,b#29] InputPaths: > file:/private/var/folders/dy/19y6pfm92pj9s40mbs8xd9hmgp/T/warehouse--5b617080-e909-4812-90e8-63d2dd0aef5a/testnan > == Results == > !== Correct Answer - 2 == == Spark Answer - 2 == > ![Infinity] [null] > ![NaN] [null] > > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-13731) expression evaluation for NaN in select statement
[ https://issues.apache.org/jira/browse/SPARK-13731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15184034#comment-15184034 ] Ian edited comment on SPARK-13731 at 3/8/16 2:35 AM: - The test case we provided is using simple arithmetic expression like divisions for double, but in fact, many other math functions are having the same behaviors that returns null for NaN/Infinity. For instance, log() and corr(). {code} SELECT log(a/b), corr(a, b) FROM testNan group by a, b order by a, b {code} was (Author: ianlcsd): The test case we provided is using simple arithmetic expression like divisions for double, but in fact, many other math functions are having the same behaviors that returns null for NaN/Infinity. For instance, log() and corr(). {code} SELECT log(a/b), corr(a, b) FROM testNan order by a, b {code} > expression evaluation for NaN in select statement > - > > Key: SPARK-13731 > URL: https://issues.apache.org/jira/browse/SPARK-13731 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Ian > > We are expecting that arithmetic expression a/b should be: > 1. returning NaN if a=0 and b=0 > 2. returning Infinity if a=1 and b=0 > Is the expectation reasonable? > The following is a simple test case snippet that reads from storage and > evaluates arithmetic expressions in select. 
> It is assuming org.apache.spark.sql.hive.execution.SQLQuerySuite: > {code} > test("Expression should be evaluated to Nan/Infinity in Select") { > withTable("testNan") { > withTempTable("src") { > Seq((1d, 0d), (0d, 0d)).toDF().registerTempTable("src") > sql("CREATE TABLE testNan(a double, b double) STORED AS PARQUET AS > SELECT * FROM src") > } > checkAnswer(sql( > """ > |SELECT a/b FROM testNan > """.stripMargin), > Seq( > Row(Double.PositiveInfinity), > Row(Double.NaN) > ) > ) > } > } > == Physical Plan == > Project [(a#28 / b#29) AS _c0#30] > +- Scan ParquetRelation: default.testnan[a#28,b#29] InputPaths: > file:/private/var/folders/dy/19y6pfm92pj9s40mbs8xd9hmgp/T/warehouse--5b617080-e909-4812-90e8-63d2dd0aef5a/testnan > == Results == > !== Correct Answer - 2 == == Spark Answer - 2 == > ![Infinity] [null] > ![NaN] [null] > > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13739) Predicate Push Down For Window Operator
[ https://issues.apache.org/jira/browse/SPARK-13739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-13739: Description: Push down the predicate through the Window operator. > Predicate Push Down For Window Operator > --- > > Key: SPARK-13739 > URL: https://issues.apache.org/jira/browse/SPARK-13739 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li > > Push down the predicate through the Window operator. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13738) Clean up ResolveDataSource
[ https://issues.apache.org/jira/browse/SPARK-13738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13738: Assignee: Apache Spark (was: Michael Armbrust) > Clean up ResolveDataSource > -- > > Key: SPARK-13738 > URL: https://issues.apache.org/jira/browse/SPARK-13738 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Michael Armbrust >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13738) Clean up ResolveDataSource
[ https://issues.apache.org/jira/browse/SPARK-13738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15184244#comment-15184244 ] Apache Spark commented on SPARK-13738: -- User 'marmbrus' has created a pull request for this issue: https://github.com/apache/spark/pull/11572 > Clean up ResolveDataSource > -- > > Key: SPARK-13738 > URL: https://issues.apache.org/jira/browse/SPARK-13738 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Michael Armbrust >Assignee: Michael Armbrust > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13739) Predicate Push Down Through Window Operator
[ https://issues.apache.org/jira/browse/SPARK-13739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-13739: Summary: Predicate Push Down Through Window Operator (was: Predicate Push Down For Window Operator) > Predicate Push Down Through Window Operator > --- > > Key: SPARK-13739 > URL: https://issues.apache.org/jira/browse/SPARK-13739 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li > > Push down the predicate through the Window operator. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13738) Clean up ResolveDataSource
[ https://issues.apache.org/jira/browse/SPARK-13738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13738: Assignee: Michael Armbrust (was: Apache Spark) > Clean up ResolveDataSource > -- > > Key: SPARK-13738 > URL: https://issues.apache.org/jira/browse/SPARK-13738 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Michael Armbrust >Assignee: Michael Armbrust > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13739) Predicate Push Down Through Window Operator
[ https://issues.apache.org/jira/browse/SPARK-13739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15184245#comment-15184245 ] Xiao Li commented on SPARK-13739: - Thanks! I am working on it. > Predicate Push Down Through Window Operator > --- > > Key: SPARK-13739 > URL: https://issues.apache.org/jira/browse/SPARK-13739 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li > > Push down the predicate through the Window operator. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13739) Predicate Push Down For Window Operator
Xiao Li created SPARK-13739: --- Summary: Predicate Push Down For Window Operator Key: SPARK-13739 URL: https://issues.apache.org/jira/browse/SPARK-13739 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.0.0 Reporter: Xiao Li -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13664) Simplify and Speedup HadoopFSRelation
[ https://issues.apache.org/jira/browse/SPARK-13664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13664: Assignee: Michael Armbrust (was: Apache Spark) > Simplify and Speedup HadoopFSRelation > - > > Key: SPARK-13664 > URL: https://issues.apache.org/jira/browse/SPARK-13664 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Michael Armbrust >Assignee: Michael Armbrust >Priority: Blocker > > A majority of Spark SQL queries likely run though {{HadoopFSRelation}}, > however there are currently several complexity and performance problems with > this code path: > - The class mixes the concerns of file management, schema reconciliation, > scan building, bucketing, partitioning, and writing data. > - For very large tables, we are broadcasting the entire list of files to > every executor. [SPARK-11441] > - For partitioned tables, we always do an extra projection. This results > not only in a copy, but undoes much of the performance gains that we are > going to get from vectorized reads. > This is an umbrella ticket to track a set of improvements to this codepath. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13664) Simplify and Speedup HadoopFSRelation
[ https://issues.apache.org/jira/browse/SPARK-13664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15184238#comment-15184238 ] Apache Spark commented on SPARK-13664: -- User 'marmbrus' has created a pull request for this issue: https://github.com/apache/spark/pull/11572 > Simplify and Speedup HadoopFSRelation > - > > Key: SPARK-13664 > URL: https://issues.apache.org/jira/browse/SPARK-13664 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Michael Armbrust >Assignee: Michael Armbrust >Priority: Blocker > > A majority of Spark SQL queries likely run though {{HadoopFSRelation}}, > however there are currently several complexity and performance problems with > this code path: > - The class mixes the concerns of file management, schema reconciliation, > scan building, bucketing, partitioning, and writing data. > - For very large tables, we are broadcasting the entire list of files to > every executor. [SPARK-11441] > - For partitioned tables, we always do an extra projection. This results > not only in a copy, but undoes much of the performance gains that we are > going to get from vectorized reads. > This is an umbrella ticket to track a set of improvements to this codepath. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13664) Simplify and Speedup HadoopFSRelation
[ https://issues.apache.org/jira/browse/SPARK-13664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13664: Assignee: Apache Spark (was: Michael Armbrust) > Simplify and Speedup HadoopFSRelation > - > > Key: SPARK-13664 > URL: https://issues.apache.org/jira/browse/SPARK-13664 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Michael Armbrust >Assignee: Apache Spark >Priority: Blocker > > A majority of Spark SQL queries likely run though {{HadoopFSRelation}}, > however there are currently several complexity and performance problems with > this code path: > - The class mixes the concerns of file management, schema reconciliation, > scan building, bucketing, partitioning, and writing data. > - For very large tables, we are broadcasting the entire list of files to > every executor. [SPARK-11441] > - For partitioned tables, we always do an extra projection. This results > not only in a copy, but undoes much of the performance gains that we are > going to get from vectorized reads. > This is an umbrella ticket to track a set of improvements to this codepath. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13738) Clean up ResolveDataSource
Michael Armbrust created SPARK-13738: Summary: Clean up ResolveDataSource Key: SPARK-13738 URL: https://issues.apache.org/jira/browse/SPARK-13738 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Michael Armbrust Assignee: Michael Armbrust -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13737) Add getOrCreate method for HiveContext
[ https://issues.apache.org/jira/browse/SPARK-13737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15184221#comment-15184221 ] Mao, Wei commented on SPARK-13737: -- HiveContext is heavily used by many users now, and enen many of them still coupled with old spark version. As this change would be trivial but not constructive, I think there is not conflict with the context combination work. > Add getOrCreate method for HiveContext > -- > > Key: SPARK-13737 > URL: https://issues.apache.org/jira/browse/SPARK-13737 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Mao, Wei > > There is a "getOrCreate" method in SQLContext, which is useful to recoverable > streaming application with SQL operation. > https://spark.apache.org/docs/latest/streaming-programming-guide.html#dataframe-and-sql-operations > But the corresponding method is missing in HiveContext. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-13737) Add getOrCreate method for HiveContext
[ https://issues.apache.org/jira/browse/SPARK-13737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15184221#comment-15184221 ] Mao, Wei edited comment on SPARK-13737 at 3/8/16 2:07 AM: -- HiveContext is heavily used by many users now, and many of them still coupled with old spark version. As this change would be trivial but not constructive, I think there is not conflict with the context combination work. was (Author: mwws): HiveContext is heavily used by many users now, and enen many of them still coupled with old spark version. As this change would be trivial but not constructive, I think there is not conflict with the context combination work. > Add getOrCreate method for HiveContext > -- > > Key: SPARK-13737 > URL: https://issues.apache.org/jira/browse/SPARK-13737 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Mao, Wei > > There is a "getOrCreate" method in SQLContext, which is useful to recoverable > streaming application with SQL operation. > https://spark.apache.org/docs/latest/streaming-programming-guide.html#dataframe-and-sql-operations > But the corresponding method is missing in HiveContext. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13721) Add support for LATERAL VIEW OUTER explode()
[ https://issues.apache.org/jira/browse/SPARK-13721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15184213#comment-15184213 ] Xiao Li commented on SPARK-13721: - That sounds reasonable. Maybe we can wait until DataFrame and DataSet APIs are combined. > Add support for LATERAL VIEW OUTER explode() > > > Key: SPARK-13721 > URL: https://issues.apache.org/jira/browse/SPARK-13721 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Ian Hellstrom > > Hive supports the [LATERAL VIEW > OUTER|https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LateralView#LanguageManualLateralView-OuterLateralViews] > syntax to make sure that when an array is empty, the content from the outer > table is still returned. > Within Spark, this is currently only possible within the HiveContext and > executing HiveQL statements. It would be nice if the standard explode() > DataFrame method allows the same. A possible signature would be: > {code:scala} > explode[A, B](inputColumn: String, outputColumn: String, outer: Boolean = > false) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
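The OUTER semantics requested in SPARK-13721 can be illustrated on plain collections: exploding an empty array still produces one output row, with a null value, instead of dropping the row entirely. This is an analogy only, not the proposed DataFrame API; `explodeOuter` is a hypothetical helper name:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class OuterExplode {
    // "Outer" explode: an empty list yields a single null element
    // rather than an empty stream, so the parent row survives.
    static <T> Stream<T> explodeOuter(List<T> xs) {
        return xs.isEmpty() ? Stream.of((T) null) : xs.stream();
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> rows = new LinkedHashMap<>();
        rows.put("a", Arrays.asList(1, 2));
        rows.put("b", Collections.emptyList()); // would vanish with a plain explode

        List<String> out = rows.entrySet().stream()
            .flatMap(e -> explodeOuter(e.getValue()).map(v -> e.getKey() + ":" + v))
            .collect(Collectors.toList());
        System.out.println(out); // [a:1, a:2, b:null]
    }
}
```

A non-outer explode would drop row "b" entirely; with outer semantics it is preserved with a null, matching Hive's LATERAL VIEW OUTER behavior.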
[jira] [Resolved] (SPARK-13689) Move some methods in CatalystQl to a util object
[ https://issues.apache.org/jira/browse/SPARK-13689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-13689. --- Resolution: Fixed Fix Version/s: 2.0.0 > Move some methods in CatalystQl to a util object > > > Key: SPARK-13689 > URL: https://issues.apache.org/jira/browse/SPARK-13689 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Andrew Or >Assignee: Andrew Or > Fix For: 2.0.0 > > > When we add more DDL parsing logic in the future, SparkQl will become very > big. To keep it smaller, we'll introduce helper "parser objects", e.g. one to > parse alter table commands. However, these parser objects will need to access > some helper methods that exist in CatalystQl. The proposal is to move those > methods to an isolated ParserUtils object. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark
[ https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15184210#comment-15184210 ] Mike Sukmanowsky commented on SPARK-13587: -- [~juliet] I get the concerns relating to Spark supporting a complex virtualenv process. My main objection to only supporting something like --pyspark-python is the difficulty we currently face in locations like Amazon EMR, but really any Spark cluster where nodes are assumed to be added after an application is submitted deals with the issue. We have a bootstrap script which provisions our EMR nodes with required Python dependencies. This approach works alright for a cluster which tends to run very few applications, but if we have multiple tenants, this approach quickly gets unwieldy. Ideally, Spark applications could be submitted from a master node with a user never having to worry about dependency management at the node bootstrapping level. I was thinking that an interesting approach to this problem would be to provide some sort of a --bootstrap option to spark-submit which points to any executable which Spark will run and check for receipt of a 0 exit code before continuing to launch the application itself. This script could obviously execute any code such as creating a virtualenv or conda env and installing requirements. If a non-zero exit code were received, the Spark application would cease to continue. The generalization gets the Spark community away from having to support conda/virtualenv eccentricities. Thoughts? > Support virtualenv in PySpark > - > > Key: SPARK-13587 > URL: https://issues.apache.org/jira/browse/SPARK-13587 > Project: Spark > Issue Type: Improvement > Components: PySpark >Reporter: Jeff Zhang > > Currently, it's not easy for user to add third party python packages in > pyspark. 
> * One way is to use --py-files (suitable for simple dependencies, but not > suitable for complicated dependencies, especially with transitive dependencies) > * Another way is to install packages manually on each node (time-wasting, and > not easy to switch between environments) > Python now has 2 different virtualenv implementations: one is native > virtualenv, the other is through conda. This JIRA is about bringing these 2 > tools to a distributed environment -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13737) Add getOrCreate method for HiveContext
[ https://issues.apache.org/jira/browse/SPARK-13737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13737: Assignee: (was: Apache Spark) > Add getOrCreate method for HiveContext > -- > > Key: SPARK-13737 > URL: https://issues.apache.org/jira/browse/SPARK-13737 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Mao, Wei > > There is a "getOrCreate" method in SQLContext, which is useful to recoverable > streaming application with SQL operation. > https://spark.apache.org/docs/latest/streaming-programming-guide.html#dataframe-and-sql-operations > But the corresponding method is missing in HiveContext. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13737) Add getOrCreate method for HiveContext
[ https://issues.apache.org/jira/browse/SPARK-13737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15184209#comment-15184209 ] Apache Spark commented on SPARK-13737: -- User 'mwws' has created a pull request for this issue: https://github.com/apache/spark/pull/11571 > Add getOrCreate method for HiveContext > -- > > Key: SPARK-13737 > URL: https://issues.apache.org/jira/browse/SPARK-13737 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Mao, Wei > > There is a "getOrCreate" method in SQLContext, which is useful to recoverable > streaming application with SQL operation. > https://spark.apache.org/docs/latest/streaming-programming-guide.html#dataframe-and-sql-operations > But the corresponding method is missing in HiveContext. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13737) Add getOrCreate method for HiveContext
[ https://issues.apache.org/jira/browse/SPARK-13737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13737: Assignee: Apache Spark > Add getOrCreate method for HiveContext > -- > > Key: SPARK-13737 > URL: https://issues.apache.org/jira/browse/SPARK-13737 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Mao, Wei >Assignee: Apache Spark > > There is a "getOrCreate" method in SQLContext, which is useful to recoverable > streaming application with SQL operation. > https://spark.apache.org/docs/latest/streaming-programming-guide.html#dataframe-and-sql-operations > But the corresponding method is missing in HiveContext. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13737) Add getOrCreate method for HiveContext
[ https://issues.apache.org/jira/browse/SPARK-13737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15184208#comment-15184208 ] Xiao Li commented on SPARK-13737: - I think SQLContext and HiveContext are being combined. I'm not sure whether we should wait until the merge is done. > Add getOrCreate method for HiveContext > -- > > Key: SPARK-13737 > URL: https://issues.apache.org/jira/browse/SPARK-13737 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Mao, Wei > > There is a "getOrCreate" method in SQLContext, which is useful for recoverable > streaming applications with SQL operations. > https://spark.apache.org/docs/latest/streaming-programming-guide.html#dataframe-and-sql-operations > But the corresponding method is missing in HiveContext. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-529) Have a single file that controls the environmental variables and spark config options
[ https://issues.apache.org/jira/browse/SPARK-529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15184206#comment-15184206 ] Apache Spark commented on SPARK-529: User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/11570 > Have a single file that controls the environmental variables and spark config > options > - > > Key: SPARK-529 > URL: https://issues.apache.org/jira/browse/SPARK-529 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Reynold Xin > > E.g. multiple places in the code base uses SPARK_MEM and has its own default > set to 512. We need a central place to enforce default values as well as > documenting the variables. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13737) Add getOrCreate method for HiveContext
Mao, Wei created SPARK-13737: Summary: Add getOrCreate method for HiveContext Key: SPARK-13737 URL: https://issues.apache.org/jira/browse/SPARK-13737 Project: Spark Issue Type: New Feature Components: SQL Reporter: Mao, Wei There is a "getOrCreate" method in SQLContext, which is useful for recoverable streaming applications with SQL operations. https://spark.apache.org/docs/latest/streaming-programming-guide.html#dataframe-and-sql-operations But the corresponding method is missing in HiveContext. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
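For readers unfamiliar with the SQLContext pattern being mirrored, here is a minimal sketch of what a `HiveContext.getOrCreate` could look like. This is illustrative only, assuming a simple lazily-created singleton; it is not the actual implementation from the linked pull request:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.hive.HiveContext

// Illustrative singleton holder, mirroring SQLContext.getOrCreate:
// return the existing instance if one was created, otherwise create it.
object HiveContextHolder {
  @volatile private var instance: HiveContext = _

  def getOrCreate(sc: SparkContext): HiveContext = {
    if (instance == null) {
      synchronized {
        // Double-checked locking so concurrent callers share one instance.
        if (instance == null) {
          instance = new HiveContext(sc)
        }
      }
    }
    instance
  }
}
```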
[jira] [Commented] (SPARK-13723) YARN - Change behavior of --num-executors when spark.dynamicAllocation.enabled true
[ https://issues.apache.org/jira/browse/SPARK-13723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15184177#comment-15184177 ] Saisai Shao commented on SPARK-13723: - Yes, I agree with what [~tgraves] described above. Normally the use case is that users want to enable dynamic allocation intentionally but don't change their current submit command (for lack of knowledge of it). In that case dynamic allocation fails to start, which is not expected. > YARN - Change behavior of --num-executors when > spark.dynamicAllocation.enabled true > --- > > Key: SPARK-13723 > URL: https://issues.apache.org/jira/browse/SPARK-13723 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 2.0.0 >Reporter: Thomas Graves >Priority: Minor > > I think we should change the behavior when --num-executors is specified when > dynamic allocation is enabled. Currently if --num-executors is specified, > dynamic allocation is disabled and it just uses a static number of executors. > I would rather see the default behavior changed in the 2.x line. If the dynamic > allocation config is on, then num-executors becomes the max and initial # of > executors. I think this would allow users to easily cap their usage and would > still allow it to free up executors. It would also allow users doing ML to start > out with a # of executors, and if they are actually caching the data the > executors wouldn't be freed up. So you would get very similar behavior to if > dynamic allocation was off. > Part of the reason for this is that using a static number generally wastes > resources, especially with people doing ad hoc things with spark-shell. It > also has a big effect when people are doing MapReduce/ETL type workloads. > The problem is that people are used to specifying num-executors, so if we turn > it on by default in a cluster config it's just overridden. 
> We should also update the spark-submit --help description for --num-executors -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
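Under the proposal, passing `--num-executors 20` with dynamic allocation on would be roughly equivalent to setting the existing configuration keys as below. This is a sketch of the intended mapping, not current Spark behavior (today, specifying `--num-executors` disables dynamic allocation entirely):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  // --num-executors 20 would seed the initial executor count...
  .set("spark.dynamicAllocation.initialExecutors", "20")
  // ...and cap the maximum, instead of turning dynamic allocation off.
  .set("spark.dynamicAllocation.maxExecutors", "20")
```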
[jira] [Updated] (SPARK-12555) Datasets: data is corrupted when input data is reordered
[ https://issues.apache.org/jira/browse/SPARK-12555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luciano Resende updated SPARK-12555: Issue Type: Sub-task (was: Bug) Parent: SPARK-13736 > Datasets: data is corrupted when input data is reordered > > > Key: SPARK-12555 > URL: https://issues.apache.org/jira/browse/SPARK-12555 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 1.6.0 > Environment: ALL platforms on 1.6 >Reporter: Tim Preece > > Testcase > --- > {code} > import org.apache.spark.sql.expressions.Aggregator > import org.apache.spark.{SparkConf, SparkContext} > import org.apache.spark.sql.SQLContext > import org.apache.spark.sql.Dataset > case class people(age: Int, name: String) > object nameAgg extends Aggregator[people, String, String] { > def zero: String = "" > def reduce(b: String, a: people): String = a.name + b > def merge(b1: String, b2: String): String = b1 + b2 > def finish(r: String): String = r > } > object DataSetAgg { > def main(args: Array[String]) { > val conf = new SparkConf().setAppName("DataSetAgg") > val spark = new SparkContext(conf) > val sqlContext = new SQLContext(spark) > import sqlContext.implicits._ > val peopleds: Dataset[people] = sqlContext.sql("SELECT 'Tim Preece' AS > name, 1279869254 AS age").as[people] > peopleds.groupBy(_.age).agg(nameAgg.toColumn).show() > } > } > {code} > Result ( on a Little Endian Platform ) > > {noformat} > +--+--+ > |_1|_2| > +--+--+ > |1279869254|FAILTi| > +--+--+ > {noformat} > Explanation > --- > Internally the String variable in the unsafe row is not updated after an > unsafe row join operation. > The displayed string is corrupted and shows part of the integer ( interpreted > as a string ) along with "Ti" > The column names also look different on a Little Endian platform. 
> Result ( on a Big Endian Platform ) > {noformat} > +--+--+ > | value|nameAgg$(name,age)| > +--+--+ > |1279869254|LIAFTi| > +--+--+ > {noformat} > The following Unit test also fails ( but only explicitly on a Big Endian > platorm ) > org.apache.spark.sql.DatasetAggregatorSuite > - typed aggregation: class input with reordering *** FAILED *** > Results do not match for query: > == Parsed Logical Plan == > Aggregate [value#748], > [value#748,(ClassInputAgg$(b#650,a#651),mode=Complete,isDistinct=false) AS > ClassInputAgg$(b,a)#762] > +- AppendColumns , class[a[0]: int, b[0]: string], > class[value[0]: string], [value#748] > +- Project [one AS b#650,1 AS a#651] > +- OneRowRelation$ > > == Analyzed Logical Plan == > value: string, ClassInputAgg$(b,a): int > Aggregate [value#748], > [value#748,(ClassInputAgg$(b#650,a#651),mode=Complete,isDistinct=false) AS > ClassInputAgg$(b,a)#762] > +- AppendColumns , class[a[0]: int, b[0]: string], > class[value[0]: string], [value#748] > +- Project [one AS b#650,1 AS a#651] > +- OneRowRelation$ > > == Optimized Logical Plan == > Aggregate [value#748], > [value#748,(ClassInputAgg$(b#650,a#651),mode=Complete,isDistinct=false) AS > ClassInputAgg$(b,a)#762] > +- AppendColumns , class[a[0]: int, b[0]: string], > class[value[0]: string], [value#748] > +- Project [one AS b#650,1 AS a#651] > +- OneRowRelation$ > > == Physical Plan == > TungstenAggregate(key=[value#748], > functions=[(ClassInputAgg$(b#650,a#651),mode=Final,isDistinct=false)], > output=[value#748,ClassInputAgg$(b,a)#762]) > +- TungstenExchange hashpartitioning(value#748,5), None > +- TungstenAggregate(key=[value#748], > functions=[(ClassInputAgg$(b#650,a#651),mode=Partial,isDistinct=false)], > output=[value#748,value#758]) > +- !AppendColumns , class[a[0]: int, b[0]: string], > class[value[0]: string], [value#748] >+- Project [one AS b#650,1 AS a#651] > +- Scan OneRowRelation[] > == Results == > !== Correct Answer - 1 == == Spark Answer - 1 == > ![one,1][one,9] 
(QueryTest.scala:127) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12319) ExchangeCoordinatorSuite fails on big-endian platforms
[ https://issues.apache.org/jira/browse/SPARK-12319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luciano Resende updated SPARK-12319: Issue Type: Sub-task (was: Bug) Parent: SPARK-13736 > ExchangeCoordinatorSuite fails on big-endian platforms > -- > > Key: SPARK-12319 > URL: https://issues.apache.org/jira/browse/SPARK-12319 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 1.6.0 > Environment: Problems apparent on BE, LE could be impacted too >Reporter: Adam Roberts >Priority: Critical > > JIRA to cover endian specific problems - since testing 1.6 I've noticed > problems with DataFrames on BE platforms, e.g. > https://issues.apache.org/jira/browse/SPARK-9858 > [~joshrosen] [~yhuai] > Current progress: using com.google.common.io.LittleEndianDataInputStream and > com.google.common.io.LittleEndianDataOutputStream within UnsafeRowSerializer > fixes three test failures in ExchangeCoordinatorSuite but I'm concerned > around performance/wider functional implications > "org.apache.spark.sql.DatasetAggregatorSuite.typed aggregation: class input > with reordering" fails as we expect "one, 1" but instead get "one, 9" - we > believe the issue lies within BitSetMethods.java, specifically around: return > (wi << 6) + subIndex + java.lang.Long.numberOfTrailingZeros(word); -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13736) Big-Endian platform issues
Luciano Resende created SPARK-13736: --- Summary: Big-Endian platform issues Key: SPARK-13736 URL: https://issues.apache.org/jira/browse/SPARK-13736 Project: Spark Issue Type: Epic Components: SQL Affects Versions: 1.6.0 Reporter: Luciano Resende Priority: Critical We are starting to see a few issues when building/testing on Big-Endian platforms. This serves as an umbrella jira to group all platform-specific issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13656) Delete spark.sql.parquet.cacheMetadata
[ https://issues.apache.org/jira/browse/SPARK-13656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15184110#comment-15184110 ] Takeshi Yamamuro commented on SPARK-13656: -- Okay, I'll make a pr in a day. > Delete spark.sql.parquet.cacheMetadata > -- > > Key: SPARK-13656 > URL: https://issues.apache.org/jira/browse/SPARK-13656 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Yin Huai > > Looks like spark.sql.parquet.cacheMetadata is not used anymore. Let's delete > it to avoid any potential confusion. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13734) SparkR histogram
[ https://issues.apache.org/jira/browse/SPARK-13734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13734: Assignee: (was: Apache Spark) > SparkR histogram > > > Key: SPARK-13734 > URL: https://issues.apache.org/jira/browse/SPARK-13734 > Project: Spark > Issue Type: New Feature > Components: SparkR >Reporter: Oscar D. Lara Yejas > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13734) SparkR histogram
[ https://issues.apache.org/jira/browse/SPARK-13734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13734: Assignee: Apache Spark > SparkR histogram > > > Key: SPARK-13734 > URL: https://issues.apache.org/jira/browse/SPARK-13734 > Project: Spark > Issue Type: New Feature > Components: SparkR >Reporter: Oscar D. Lara Yejas >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13734) SparkR histogram
[ https://issues.apache.org/jira/browse/SPARK-13734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15184106#comment-15184106 ] Apache Spark commented on SPARK-13734: -- User 'olarayej' has created a pull request for this issue: https://github.com/apache/spark/pull/11569 > SparkR histogram > > > Key: SPARK-13734 > URL: https://issues.apache.org/jira/browse/SPARK-13734 > Project: Spark > Issue Type: New Feature > Components: SparkR >Reporter: Oscar D. Lara Yejas > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-13418) SQL generation for uncorrelated scalar subqueries
[ https://issues.apache.org/jira/browse/SPARK-13418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-13418. -- Resolution: Duplicate Fix Version/s: 2.0.0 > SQL generation for uncorrelated scalar subqueries > - > > Key: SPARK-13418 > URL: https://issues.apache.org/jira/browse/SPARK-13418 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin > Fix For: 2.0.0 > > > This is pretty difficult right now because SQLBuilder is in the hive package, > whereas the sql function for ScalarSubquery is defined in catalyst package. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13727) SparkConf.contains does not consider deprecated keys
[ https://issues.apache.org/jira/browse/SPARK-13727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15184095#comment-15184095 ] Apache Spark commented on SPARK-13727: -- User 'bomeng' has created a pull request for this issue: https://github.com/apache/spark/pull/11568 > SparkConf.contains does not consider deprecated keys > > > Key: SPARK-13727 > URL: https://issues.apache.org/jira/browse/SPARK-13727 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Marcelo Vanzin >Priority: Minor > > This makes it kinda inconsistent with other SparkConf APIs. For example: > {code} > scala> import org.apache.spark.SparkConf > import org.apache.spark.SparkConf > scala> val conf = new SparkConf().set("spark.io.compression.lz4.block.size", > "12345") > 16/03/07 10:55:17 WARN spark.SparkConf: The configuration key > 'spark.io.compression.lz4.block.size' has been deprecated as of Spark 1.4 and > and may be removed in the future. Please use the new key > 'spark.io.compression.lz4.blockSize' instead. > conf: org.apache.spark.SparkConf = org.apache.spark.SparkConf@221e8982 > scala> conf.get("spark.io.compression.lz4.blockSize") > res0: String = 12345 > scala> conf.contains("spark.io.compression.lz4.blockSize") > res1: Boolean = false > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13727) SparkConf.contains does not consider deprecated keys
[ https://issues.apache.org/jira/browse/SPARK-13727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13727: Assignee: (was: Apache Spark) > SparkConf.contains does not consider deprecated keys > > > Key: SPARK-13727 > URL: https://issues.apache.org/jira/browse/SPARK-13727 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Marcelo Vanzin >Priority: Minor > > This makes it kinda inconsistent with other SparkConf APIs. For example: > {code} > scala> import org.apache.spark.SparkConf > import org.apache.spark.SparkConf > scala> val conf = new SparkConf().set("spark.io.compression.lz4.block.size", > "12345") > 16/03/07 10:55:17 WARN spark.SparkConf: The configuration key > 'spark.io.compression.lz4.block.size' has been deprecated as of Spark 1.4 and > and may be removed in the future. Please use the new key > 'spark.io.compression.lz4.blockSize' instead. > conf: org.apache.spark.SparkConf = org.apache.spark.SparkConf@221e8982 > scala> conf.get("spark.io.compression.lz4.blockSize") > res0: String = 12345 > scala> conf.contains("spark.io.compression.lz4.blockSize") > res1: Boolean = false > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13727) SparkConf.contains does not consider deprecated keys
[ https://issues.apache.org/jira/browse/SPARK-13727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13727: Assignee: Apache Spark > SparkConf.contains does not consider deprecated keys > > > Key: SPARK-13727 > URL: https://issues.apache.org/jira/browse/SPARK-13727 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Marcelo Vanzin >Assignee: Apache Spark >Priority: Minor > > This makes it kinda inconsistent with other SparkConf APIs. For example: > {code} > scala> import org.apache.spark.SparkConf > import org.apache.spark.SparkConf > scala> val conf = new SparkConf().set("spark.io.compression.lz4.block.size", > "12345") > 16/03/07 10:55:17 WARN spark.SparkConf: The configuration key > 'spark.io.compression.lz4.block.size' has been deprecated as of Spark 1.4 and > and may be removed in the future. Please use the new key > 'spark.io.compression.lz4.blockSize' instead. > conf: org.apache.spark.SparkConf = org.apache.spark.SparkConf@221e8982 > scala> conf.get("spark.io.compression.lz4.blockSize") > res0: String = 12345 > scala> conf.contains("spark.io.compression.lz4.blockSize") > res1: Boolean = false > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
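A possible shape of the fix would be for `contains` to consult the same deprecated-key table that `get` already applies. This is a hypothetical sketch: the field name `configsWithAlternatives` and the helper shape are assumptions about SparkConf internals, not the actual change in the linked pull request:

```scala
// Inside SparkConf: also treat a key as present when one of its
// deprecated alternative spellings has been set by the user.
def contains(key: String): Boolean = {
  settings.containsKey(key) ||
    configsWithAlternatives.get(key).toSeq.flatten
      .exists(alt => settings.containsKey(alt.key))
}
```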
[jira] [Created] (SPARK-13735) Log for parquet relation reading files is too verbose
Zhong Wang created SPARK-13735: -- Summary: Log for parquet relation reading files is too verbose Key: SPARK-13735 URL: https://issues.apache.org/jira/browse/SPARK-13735 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.6.0 Reporter: Zhong Wang Priority: Trivial The INFO level logging contains all files read by Parquet Relation, which is way too verbose if the input contains lots of files -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-6666) org.apache.spark.sql.jdbc.JDBCRDD does not escape/quote column names
[ https://issues.apache.org/jira/browse/SPARK-?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luciano Resende closed SPARK-. -- Resolution: Cannot Reproduce I have tried the scenarios above in Spark trunk using both Postgres and DB2, see: https://github.com/lresende/spark-sandbox/blob/master/src/main/scala/com/luck/sql/JDBCApplication.scala And the described issues seems not reproducible anymore, see all results below root |-- Symbol: string (nullable = true) |-- Name: string (nullable = true) |-- Sector: string (nullable = true) |-- Price: double (nullable = true) |-- Dividend Yield: double (nullable = true) |-- Price/Earnings: double (nullable = true) |-- Earnings/Share: double (nullable = true) |-- Book Value: double (nullable = true) |-- 52 week low: double (nullable = true) |-- 52 week high: double (nullable = true) |-- Market Cap: double (nullable = true) |-- EBITDA: double (nullable = true) |-- Price/Sales: double (nullable = true) |-- Price/Book: double (nullable = true) |-- SEC Filings: string (nullable = true) +--+--+--+-+--+--+--+--+---++--+--+---+--+---+ |Symbol| Name|Sector|Price|Dividend Yield|Price/Earnings|Earnings/Share|Book Value|52 week low|52 week high|Market Cap|EBITDA|Price/Sales|Price/Book|SEC Filings| +--+--+--+-+--+--+--+--+---++--+--+---+--+---+ |S1|Name 1| Sec 1| 10.0| 10.0| 10.0| 10.0| 10.0| 10.0|10.0| 10.0| 10.0| 10.0| 10.0| 100| |s2|Name 2| Sec 2| 20.0| 20.0| 20.0| 20.0| 20.0| 20.0|20.0| 20.0| 20.0| 20.0| 20.0| 200| +--+--+--+-+--+--+--+--+---++--+--+---+--+---+ +--+ |AvgCPI| +--+ | 15.0| +--+ +--+--+--+-+--+--+--+--+---++--+--+---+--+---+ |Symbol| Name|Sector|Price|Dividend Yield|Price/Earnings|Earnings/Share|Book Value|52 week low|52 week high|Market Cap|EBITDA|Price/Sales|Price/Book|SEC Filings| +--+--+--+-+--+--+--+--+---++--+--+---+--+---+ |S1|Name 1| Sec 1| 10.0| 10.0| 10.0| 10.0| 10.0| 10.0|10.0| 10.0| 10.0| 10.0| 10.0| 100| |s2|Name 2| Sec 2| 20.0| 20.0| 20.0| 20.0| 20.0| 20.0|20.0| 20.0| 20.0| 20.0| 20.0| 
200| +--+--+--+-+--+--+--+--+---++--+--+---+--+---+ > org.apache.spark.sql.jdbc.JDBCRDD does not escape/quote column names > - > > Key: SPARK- > URL: https://issues.apache.org/jira/browse/SPARK- > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.0 > Environment: >Reporter: John Ferguson >Priority: Critical > > Is there a way to have JDBC DataFrames use quoted/escaped column names? > Right now, it looks like it "sees" the names correctly in the schema created > but does not escape them in the SQL it creates when they are not compliant: > org.apache.spark.sql.jdbc.JDBCRDD > > private val columnList: String = { > val sb = new StringBuilder() > columns.foreach(x => sb.append(",").append(x)) > if (sb.length == 0) "1" else sb.substring(1) > } > If you see value in this, I would take a shot at adding the quoting > (escaping) of column names here. If you don't do it, some drivers... like > postgresql's will simply drop case all names when parsing the query. As you > can see in the TL;DR below that means they won't match the schema I am given. > TL;DR: > > I am able to connect to a Postgres database in the shell (with driver > referenced): >val jdbcDf = > sqlContext.jdbc("jdbc:postgresql://localhost/sparkdemo?user=dbuser", "sp500") > In fact when I run: >jdbcDf.registerTempTable("sp500") >val avgEPSNamed = sqlContext.sql("SELECT AVG(`Earnings/Share`) as AvgCPI > FROM sp500") > and >val avgEPSProg = jsonDf.agg(avg(jsonDf.col("Earnings/Share"))) > The values come back as expected. However, if I try: >jdbcDf.show > Or if I try > >val all = sqlContext.sql("SELECT * FROM sp500") >all.show > I get errors about column names not be
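The quoting the reporter suggests could be sketched as below. This is illustrative only: a real fix would need per-dialect quote characters (backticks for MySQL, double quotes for Postgres, etc.) rather than hard-coded double quotes:

```scala
// Quote each identifier so mixed-case or special-character column
// names such as "Earnings/Share" are preserved by the database.
private val columnList: String = {
  val sb = new StringBuilder()
  columns.foreach(x => sb.append(",\"").append(x).append("\""))
  if (sb.length == 0) "1" else sb.substring(1)
}
```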
[jira] [Commented] (SPARK-13726) Spark 1.6.0 stopping working for HiveThriftServer2 and registerTempTable
[ https://issues.apache.org/jira/browse/SPARK-13726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15184077#comment-15184077 ] Yin Huai commented on SPARK-13726: -- I think that after setting that conf to true, the behavior should be the same as Spark 1.5. You can still run multiple JDBC connections. For running multiple queries concurrently, as long as you set spark.sql.thriftserver.scheduler.pool for different JDBC connections like before, it should work (see http://spark.apache.org/docs/latest/sql-programming-guide.html#scheduling). For long term, I think we will keep the current behavior. When you need to share temp tables across sessions, we have to set spark.sql.hive.thriftServer.singleSession to true. > Spark 1.6.0 stopping working for HiveThriftServer2 and registerTempTable > > > Key: SPARK-13726 > URL: https://issues.apache.org/jira/browse/SPARK-13726 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Michael Nguyen >Priority: Blocker > > In Spark 1.5.2, DataFrame.registerTempTable works and > hiveContext.table(registerTableName) and HiveThriftServer2 see those tables. > In Spark 1.6.0, hiveContext.table(registerTableName) and HiveThriftServer2 do > not see those tables, even though DataFrame.registerTempTable does not return > an error. > Since this feature used to work in Spark 1.5.2, there is existing code that > breaks after upgrading to Spark 1.6.0. so this issue is a blocker and urgent. > Therefore, please have it fixed asap. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10548) Concurrent execution in SQL does not work
[ https://issues.apache.org/jira/browse/SPARK-10548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15184075#comment-15184075 ] nicerobot commented on SPARK-10548: --- Thanks [~zsxwing]. What's the recommended way to accomplish that? > Concurrent execution in SQL does not work > - > > Key: SPARK-10548 > URL: https://issues.apache.org/jira/browse/SPARK-10548 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Andrew Or >Assignee: Andrew Or >Priority: Blocker > Fix For: 1.5.1, 1.6.0 > > > From the mailing list: > {code} > future { df1.count() } > future { df2.count() } > java.lang.IllegalArgumentException: spark.sql.execution.id is already set > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:87) > > at > org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1904) > at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1385) > {code} > === edit === > Simple reproduction: > {code} > (1 to 100).par.foreach { _ => > sc.parallelize(1 to 5).map { i => (i, i) }.toDF("a", "b").count() > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
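The "spark.sql.execution.id is already set" error arises because the execution id travels in a thread-local that child threads inherit, so a pooled or forked thread can see an id it never set. A plain-JVM sketch of that mechanism (no Spark involved; "42" is an arbitrary stand-in for an execution id):

```java
// Demonstrates the JVM behavior underlying the error above: a value placed in
// an InheritableThreadLocal is copied into threads created afterwards, so an
// "execution id" set on one thread appears pre-set on a child thread.
public final class InheritedLocalDemo {
    static final InheritableThreadLocal<String> EXECUTION_ID = new InheritableThreadLocal<>();

    static String observedInChild() {
        EXECUTION_ID.set("42"); // parent thread sets the id
        final String[] seen = new String[1];
        Thread child = new Thread(() -> { seen[0] = EXECUTION_ID.get(); });
        child.start();
        try {
            child.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return seen[0]; // the child inherited "42" without ever setting it
    }
}
```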
[jira] [Updated] (SPARK-13734) SparkR histogram
[ https://issues.apache.org/jira/browse/SPARK-13734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Oscar D. Lara Yejas updated SPARK-13734: Summary: SparkR histogram (was: Histogram) > SparkR histogram > > > Key: SPARK-13734 > URL: https://issues.apache.org/jira/browse/SPARK-13734 > Project: Spark > Issue Type: New Feature > Components: SparkR >Reporter: Oscar D. Lara Yejas > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13726) Spark 1.6.0 stopping working for HiveThriftServer2 and registerTempTable
[ https://issues.apache.org/jira/browse/SPARK-13726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15184062#comment-15184062 ] Michael Nguyen commented on SPARK-13726: That works. Thanks, Yin, for the work-around solution. Does setting spark.sql.hive.thriftServer.singleSession to true 1. Limit Hive to support only one JDBC connection or one query at a time ? Or 2. Performance for running multiple queries from multiple JDBC connections at the same time ? If so, could you provide a long-term solution that do not have these issues ? > Spark 1.6.0 stopping working for HiveThriftServer2 and registerTempTable > > > Key: SPARK-13726 > URL: https://issues.apache.org/jira/browse/SPARK-13726 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Michael Nguyen >Priority: Blocker > > In Spark 1.5.2, DataFrame.registerTempTable works and > hiveContext.table(registerTableName) and HiveThriftServer2 see those tables. > In Spark 1.6.0, hiveContext.table(registerTableName) and HiveThriftServer2 do > not see those tables, even though DataFrame.registerTempTable does not return > an error. > Since this feature used to work in Spark 1.5.2, there is existing code that > breaks after upgrading to Spark 1.6.0. so this issue is a blocker and urgent. > Therefore, please have it fixed asap. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13734) Histogram
Oscar D. Lara Yejas created SPARK-13734: --- Summary: Histogram Key: SPARK-13734 URL: https://issues.apache.org/jira/browse/SPARK-13734 Project: Spark Issue Type: New Feature Components: SparkR Reporter: Oscar D. Lara Yejas -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-13731) expression evaluation for NaN in select statement
[ https://issues.apache.org/jira/browse/SPARK-13731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15183869#comment-15183869 ] Ian edited comment on SPARK-13731 at 3/8/16 12:04 AM: -- The expression in select essentially defines a transformation from data residing on storage or even another RDD. We are seeing that the transformation result is now null for both NaN and Infinity. We saw SPARK-9076, which seemed to address only how NaN values in an RDD are handled (equality, ordering, ...), but our case concerns another, more fundamental aspect: the expressions that might produce NaN. was (Author: ianlcsd): The expression in select essentially defines a transformation from data residing on storage or even another RDD. We are seeing that the transformation result is now null for both NaN and Infinity. We saw SPARK-9076, which seemed addressing only how NaN value in RDD can be handled(comparing,ordering, ...), but our concerned case is more on the another aspect and more fundamental about the expressions that might produce NaN. > expression evaluation for NaN in select statement > - > > Key: SPARK-13731 > URL: https://issues.apache.org/jira/browse/SPARK-13731 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Ian > > We are expecting that arithmetic expression a/b should be: > 1. returning NaN if a=0 and b=0 > 2. returning Infinity if a=1 and b=0 > Is the expectation reasonable? > The following is a simple test case snippet that reads from storage and > evaluates arithmetic expressions in select. 
> It is assuming org.apache.spark.sql.hive.execution.SQLQuerySuite: > {code} > test("Expression should be evaluated to Nan/Infinity in Select") { > withTable("testNan") { > withTempTable("src") { > Seq((1d, 0d), (0d, 0d)).toDF().registerTempTable("src") > sql("CREATE TABLE testNan(a double, b double) STORED AS PARQUET AS > SELECT * FROM src") > } > checkAnswer(sql( > """ > |SELECT a/b FROM testNan > """.stripMargin), > Seq( > Row(Double.PositiveInfinity), > Row(Double.NaN) > ) > ) > } > } > == Physical Plan == > Project [(a#28 / b#29) AS _c0#30] > +- Scan ParquetRelation: default.testnan[a#28,b#29] InputPaths: > file:/private/var/folders/dy/19y6pfm92pj9s40mbs8xd9hmgp/T/warehouse--5b617080-e909-4812-90e8-63d2dd0aef5a/testnan > == Results == > !== Correct Answer - 2 == == Spark Answer - 2 == > ![Infinity] [null] > ![NaN] [null] > > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
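The expectation in the report matches plain JVM (IEEE-754) double semantics, independent of Spark:

```java
// IEEE-754 double division, as the JVM evaluates it:
// 0.0/0.0 -> NaN, 1.0/0.0 -> Infinity (no exception is thrown for doubles).
public final class NanInfinity {
    static double divide(double a, double b) {
        return a / b;
    }
}
```

So returning null for these expressions is a change from what the underlying arithmetic produces.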
[jira] [Commented] (SPARK-13734) Histogram
[ https://issues.apache.org/jira/browse/SPARK-13734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15184057#comment-15184057 ] Oscar D. Lara Yejas commented on SPARK-13734: - I'm working on this one. > Histogram > - > > Key: SPARK-13734 > URL: https://issues.apache.org/jira/browse/SPARK-13734 > Project: Spark > Issue Type: New Feature > Components: SparkR >Reporter: Oscar D. Lara Yejas > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13731) expression evaluation for NaN in select statement
[ https://issues.apache.org/jira/browse/SPARK-13731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ian updated SPARK-13731: Description: We are expecting that arithmetic expression a/b should be: 1. returning NaN if a=0 and b=0 2. returning Infinity if a=1 and b=0 Is the expectation reasonable? The following is a simple test case snippet that reads from storage and evaluates arithmetic expressions in select. It is assuming org.apache.spark.sql.hive.execution.SQLQuerySuite: {code} test("Expression should be evaluated to Nan/Infinity in Select") { withTable("testNan") { withTempTable("src") { Seq((1d, 0d), (0d, 0d)).toDF().registerTempTable("src") sql("CREATE TABLE testNan(a double, b double) STORED AS PARQUET AS SELECT * FROM src") } checkAnswer(sql( """ |SELECT a/b FROM testNan """.stripMargin), Seq( Row(Double.PositiveInfinity), Row(Double.NaN) ) ) } } == Physical Plan == Project [(a#28 / b#29) AS _c0#30] +- Scan ParquetRelation: default.testnan[a#28,b#29] InputPaths: file:/private/var/folders/dy/19y6pfm92pj9s40mbs8xd9hmgp/T/warehouse--5b617080-e909-4812-90e8-63d2dd0aef5a/testnan == Results == !== Correct Answer - 2 == == Spark Answer - 2 == ![Infinity] [null] ![NaN] [null] {code} was: We are expecting that arithmetic expression a/b should be: 1. returning NaN if a=0 and b=0 2. returning Infinity if a=1 and b=0 Is the expectation reasonable? The following is a simple test case snippet that read from storage and evaluate arithmetic in select. 
It si assuming org.apache.spark.sql.hive.execution.SQLQuerySuite: {code} test("Expression should be evaluated to Nan/Infinity in Select") { withTable("testNan") { withTempTable("src") { Seq((1d, 0d), (0d, 0d)).toDF().registerTempTable("src") sql("CREATE TABLE testNan(a double, b double) STORED AS PARQUET AS SELECT * FROM src") } checkAnswer(sql( """ |SELECT a/b FROM testNan """.stripMargin), Seq( Row(Double.PositiveInfinity), Row(Double.NaN) ) ) } } == Physical Plan == Project [(a#28 / b#29) AS _c0#30] +- Scan ParquetRelation: default.testnan[a#28,b#29] InputPaths: file:/private/var/folders/dy/19y6pfm92pj9s40mbs8xd9hmgp/T/warehouse--5b617080-e909-4812-90e8-63d2dd0aef5a/testnan == Results == !== Correct Answer - 2 == == Spark Answer - 2 == ![Infinity] [null] ![NaN] [null] {code} > expression evaluation for NaN in select statement > - > > Key: SPARK-13731 > URL: https://issues.apache.org/jira/browse/SPARK-13731 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Ian > > We are expecting that arithmetic expression a/b should be: > 1. returning NaN if a=0 and b=0 > 2. returning Infinity if a=1 and b=0 > Is the expectation reasonable? > The following is a simple test case snippet that reads from storage and > evaluates arithmetic expressions in select. 
> It is assuming org.apache.spark.sql.hive.execution.SQLQuerySuite: > {code} > test("Expression should be evaluated to Nan/Infinity in Select") { > withTable("testNan") { > withTempTable("src") { > Seq((1d, 0d), (0d, 0d)).toDF().registerTempTable("src") > sql("CREATE TABLE testNan(a double, b double) STORED AS PARQUET AS > SELECT * FROM src") > } > checkAnswer(sql( > """ > |SELECT a/b FROM testNan > """.stripMargin), > Seq( > Row(Double.PositiveInfinity), > Row(Double.NaN) > ) > ) > } > } > == Physical Plan == > Project [(a#28 / b#29) AS _c0#30] > +- Scan ParquetRelation: default.testnan[a#28,b#29] InputPaths: > file:/private/var/folders/dy/19y6pfm92pj9s40mbs8xd9hmgp/T/warehouse--5b617080-e909-4812-90e8-63d2dd0aef5a/testnan > == Results == > !== Correct Answer - 2 == == Spark Answer - 2 == > ![Infinity] [null] > ![NaN] [null] > > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-13648) org.apache.spark.sql.hive.client.VersionsSuite fails NoClassDefFoundError on IBM JDK
[ https://issues.apache.org/jira/browse/SPARK-13648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-13648. -- Resolution: Fixed Fix Version/s: 1.6.1 2.0.0 Issue resolved by pull request 11495 [https://github.com/apache/spark/pull/11495] > org.apache.spark.sql.hive.client.VersionsSuite fails NoClassDefFoundError on > IBM JDK > > > Key: SPARK-13648 > URL: https://issues.apache.org/jira/browse/SPARK-13648 > Project: Spark > Issue Type: Bug > Components: SQL > Environment: Fails on vendor specific JVMs ( e.g IBM JVM ) >Reporter: Tim Preece >Priority: Minor > Fix For: 2.0.0, 1.6.1 > > > When running the standard Spark unit tests on the IBM Java SDK the hive > VersionsSuite fail with the following error. > java.lang.NoClassDefFoundError: org.apache.hadoop.hive.cli.CliSessionState > when creating Hive client using classpath: .. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13648) org.apache.spark.sql.hive.client.VersionsSuite fails NoClassDefFoundError on IBM JDK
[ https://issues.apache.org/jira/browse/SPARK-13648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-13648: - Fix Version/s: (was: 1.6.1) 1.6.2 > org.apache.spark.sql.hive.client.VersionsSuite fails NoClassDefFoundError on > IBM JDK > > > Key: SPARK-13648 > URL: https://issues.apache.org/jira/browse/SPARK-13648 > Project: Spark > Issue Type: Bug > Components: SQL > Environment: Fails on vendor specific JVMs ( e.g IBM JVM ) >Reporter: Tim Preece >Priority: Minor > Fix For: 1.6.2, 2.0.0 > > > When running the standard Spark unit tests on the IBM Java SDK the hive > VersionsSuite fail with the following error. > java.lang.NoClassDefFoundError: org.apache.hadoop.hive.cli.CliSessionState > when creating Hive client using classpath: .. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12458) Add ExpressionDescription to datetime functions
[ https://issues.apache.org/jira/browse/SPARK-12458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15184038#comment-15184038 ] Apache Spark commented on SPARK-12458: -- User 'dilipbiswal' has created a pull request for this issue: https://github.com/apache/spark/pull/10428 > Add ExpressionDescription to datetime functions > --- > > Key: SPARK-12458 > URL: https://issues.apache.org/jira/browse/SPARK-12458 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12458) Add ExpressionDescription to datetime functions
[ https://issues.apache.org/jira/browse/SPARK-12458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12458: Assignee: Apache Spark > Add ExpressionDescription to datetime functions > --- > > Key: SPARK-12458 > URL: https://issues.apache.org/jira/browse/SPARK-12458 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12458) Add ExpressionDescription to datetime functions
[ https://issues.apache.org/jira/browse/SPARK-12458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12458: Assignee: (was: Apache Spark) > Add ExpressionDescription to datetime functions > --- > > Key: SPARK-12458 > URL: https://issues.apache.org/jira/browse/SPARK-12458 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-13731) expression evaluation for NaN in select statement
[ https://issues.apache.org/jira/browse/SPARK-13731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15184034#comment-15184034 ] Ian edited comment on SPARK-13731 at 3/7/16 11:41 PM: -- The test case we provided uses a simple arithmetic expression (division of doubles), but in fact many other math functions show the same behavior, returning null for NaN/Infinity. For instance, log() and corr(). {code} SELECT log(a/b), corr(a, b) FROM testNan order by a, b {code} was (Author: ianlcsd): The test case we provided is using simple arithmetic expression like divisions for double, but in fact, many math other functions are having the same behaviors that returns null for NaN/Infinity. For instance, log() and corr(). {code} SELECT log(a/b), corn(a, b) FROM testNan order by a, b {code} > expression evaluation for NaN in select statement > - > > Key: SPARK-13731 > URL: https://issues.apache.org/jira/browse/SPARK-13731 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Ian > > We are expecting that arithmetic expression a/b should be: > 1. returning NaN if a=0 and b=0 > 2. returning Infinity if a=1 and b=0 > Is the expectation reasonable? > The following is a simple test case snippet that read from storage and > evaluate arithmetic in select. 
> It si assuming org.apache.spark.sql.hive.execution.SQLQuerySuite: > {code} > test("Expression should be evaluated to Nan/Infinity in Select") { > withTable("testNan") { > withTempTable("src") { > Seq((1d, 0d), (0d, 0d)).toDF().registerTempTable("src") > sql("CREATE TABLE testNan(a double, b double) STORED AS PARQUET AS > SELECT * FROM src") > } > checkAnswer(sql( > """ > |SELECT a/b FROM testNan > """.stripMargin), > Seq( > Row(Double.PositiveInfinity), > Row(Double.NaN) > ) > ) > } > } > == Physical Plan == > Project [(a#28 / b#29) AS _c0#30] > +- Scan ParquetRelation: default.testnan[a#28,b#29] InputPaths: > file:/private/var/folders/dy/19y6pfm92pj9s40mbs8xd9hmgp/T/warehouse--5b617080-e909-4812-90e8-63d2dd0aef5a/testnan > == Results == > !== Correct Answer - 2 == == Spark Answer - 2 == > ![Infinity] [null] > ![NaN] [null] > > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13731) expression evaluation for NaN in select statement
[ https://issues.apache.org/jira/browse/SPARK-13731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15184034#comment-15184034 ] Ian commented on SPARK-13731: - The test case we provided uses a simple arithmetic expression (division of doubles), but in fact many other math functions show the same behavior, returning null for NaN/Infinity. For instance, log() and corr(). {code} SELECT log(a/b), corr(a, b) FROM testNan order by a, b {code} > expression evaluation for NaN in select statement > - > > Key: SPARK-13731 > URL: https://issues.apache.org/jira/browse/SPARK-13731 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Ian > > We are expecting that arithmetic expression a/b should be: > 1. returning NaN if a=0 and b=0 > 2. returning Infinity if a=1 and b=0 > Is the expectation reasonable? > The following is a simple test case snippet that reads from storage and > evaluates arithmetic expressions in select. > It is assuming org.apache.spark.sql.hive.execution.SQLQuerySuite: > {code} > test("Expression should be evaluated to Nan/Infinity in Select") { > withTable("testNan") { > withTempTable("src") { > Seq((1d, 0d), (0d, 0d)).toDF().registerTempTable("src") > sql("CREATE TABLE testNan(a double, b double) STORED AS PARQUET AS > SELECT * FROM src") > } > checkAnswer(sql( > """ > |SELECT a/b FROM testNan > """.stripMargin), > Seq( > Row(Double.PositiveInfinity), > Row(Double.NaN) > ) > ) > } > } > == Physical Plan == > Project [(a#28 / b#29) AS _c0#30] > +- Scan ParquetRelation: default.testnan[a#28,b#29] InputPaths: > file:/private/var/folders/dy/19y6pfm92pj9s40mbs8xd9hmgp/T/warehouse--5b617080-e909-4812-90e8-63d2dd0aef5a/testnan > == Results == > !== Correct Answer - 2 == == Spark Answer - 2 == > ![Infinity] [null] > ![NaN] [null] > > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: 
issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13711) Apache Spark driver stopping JVM when master not available
[ https://issues.apache.org/jira/browse/SPARK-13711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13711: Assignee: (was: Apache Spark) > Apache Spark driver stopping JVM when master not available > --- > > Key: SPARK-13711 > URL: https://issues.apache.org/jira/browse/SPARK-13711 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.4.1, 1.6.0 >Reporter: Era > > In my application Java spark context is created with an unavailable master > URL (you may assume master is down for a maintenance). When creating Java > spark context it leads to stopping JVM that runs spark driver with JVM exit > code 50. > When I checked the logs I found SparkUncaughtExceptionHandler calling the > System.exit. My program should run forever. > package test.mains; > import org.apache.spark.SparkConf; > import org.apache.spark.api.java.JavaSparkContext; > public class CheckJavaSparkContext { > public static void main(String[] args) { > SparkConf conf = new SparkConf(); > conf.setAppName("test"); > conf.setMaster("spark://sunshinee:7077"); > try { > new JavaSparkContext(conf); > } catch (Throwable e) { > System.out.println("Caught an exception : " + e.getMessage()); > > } > System.out.println("Waiting to complete..."); > while (true) { > } > } > } > Output log > Using Spark's default log4j profile: > org/apache/spark/log4j-defaults.properties > SLF4J: Class path contains multiple SLF4J bindings. > SLF4J: Found binding in > [jar:file:/data/downloads/spark-1.6.0-bin-hadoop2.6/lib/spark-assembly-1.6.0-hadoop2.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class] > SLF4J: Found binding in > [jar:file:/data/downloads/spark-1.6.0-bin-hadoop2.6/lib/spark-examples-1.6.0-hadoop2.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class] > SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an > explanation. 
> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory] > 16/03/04 18:01:15 INFO SparkContext: Running Spark version 1.6.0 > 16/03/04 18:01:17 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > 16/03/04 18:01:17 WARN Utils: Your hostname, pesamara-mobl-vm1 resolves to a > loopback address: 127.0.0.1; using 10.30.9.107 instead (on interface eth0) > 16/03/04 18:01:17 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to > another address > 16/03/04 18:01:18 INFO SecurityManager: Changing view acls to: ps40233 > 16/03/04 18:01:18 INFO SecurityManager: Changing modify acls to: ps40233 > 16/03/04 18:01:18 INFO SecurityManager: SecurityManager: authentication > disabled; ui acls disabled; users with view permissions: Set(ps40233); users > with modify permissions: Set(ps40233) > 16/03/04 18:01:19 INFO Utils: Successfully started service 'sparkDriver' on > port 55309. > 16/03/04 18:01:21 INFO Slf4jLogger: Slf4jLogger started > 16/03/04 18:01:21 INFO Remoting: Starting remoting > 16/03/04 18:01:22 INFO Remoting: Remoting started; listening on addresses > :[akka.tcp://sparkDriverActorSystem@10.30.9.107:52128] > 16/03/04 18:01:22 INFO Utils: Successfully started service > 'sparkDriverActorSystem' on port 52128. > 16/03/04 18:01:22 INFO SparkEnv: Registering MapOutputTracker > 16/03/04 18:01:22 INFO SparkEnv: Registering BlockManagerMaster > 16/03/04 18:01:22 INFO DiskBlockManager: Created local directory at > /tmp/blockmgr-87c20178-357d-4252-a46a-62a755568a98 > 16/03/04 18:01:22 INFO MemoryStore: MemoryStore started with capacity 457.7 MB > 16/03/04 18:01:22 INFO SparkEnv: Registering OutputCommitCoordinator > 16/03/04 18:01:23 INFO Utils: Successfully started service 'SparkUI' on port > 4040. > 16/03/04 18:01:23 INFO SparkUI: Started SparkUI at http://10.30.9.107:4040 > 16/03/04 18:01:24 INFO AppClient$ClientEndpoint: Connecting to master > spark://sunshinee:7077... 
> 16/03/04 18:01:24 WARN AppClient$ClientEndpoint: Failed to connect to master > sunshinee:7077 > java.io.IOException: Failed to connect to sunshinee:7077 > at > org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216) > at > org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216) > at > org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216) > at > org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:167) > at > org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:200) > at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:187) > at org.apache.
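A plain-JVM sketch of why the driver process dies even though main() catches its own exception: the connect failure is raised on a Spark internal thread, where an uncaught-exception handler runs; the reporter observed SparkUncaughtExceptionHandler calling System.exit. The stand-in handler below only records the message so the sketch stays runnable:

```java
// Sketch: an exception thrown on a background thread never reaches the try/catch
// in main(); it reaches the thread's UncaughtExceptionHandler instead. Spark's
// handler calls System.exit(50) at that point; this stand-in just records it.
public final class UncaughtHandlerDemo {
    static String run() {
        final String[] handled = new String[1];
        Thread worker = new Thread(() -> {
            throw new RuntimeException("Failed to connect to master");
        });
        worker.setUncaughtExceptionHandler((t, e) -> {
            handled[0] = e.getMessage(); // Spark would exit the JVM here
        });
        worker.start();
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return handled[0];
    }
}
```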
[jira] [Assigned] (SPARK-13711) Apache Spark driver stopping JVM when master not available
[ https://issues.apache.org/jira/browse/SPARK-13711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13711: Assignee: Apache Spark > Apache Spark driver stopping JVM when master not available > --- > > Key: SPARK-13711 > URL: https://issues.apache.org/jira/browse/SPARK-13711 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.4.1, 1.6.0 >Reporter: Era >Assignee: Apache Spark > > In my application Java spark context is created with an unavailable master > URL (you may assume master is down for a maintenance). When creating Java > spark context it leads to stopping JVM that runs spark driver with JVM exit > code 50. > When I checked the logs I found SparkUncaughtExceptionHandler calling the > System.exit. My program should run forever. > package test.mains; > import org.apache.spark.SparkConf; > import org.apache.spark.api.java.JavaSparkContext; > public class CheckJavaSparkContext { > public static void main(String[] args) { > SparkConf conf = new SparkConf(); > conf.setAppName("test"); > conf.setMaster("spark://sunshinee:7077"); > try { > new JavaSparkContext(conf); > } catch (Throwable e) { > System.out.println("Caught an exception : " + e.getMessage()); > > } > System.out.println("Waiting to complete..."); > while (true) { > } > } > } > Output log > Using Spark's default log4j profile: > org/apache/spark/log4j-defaults.properties > SLF4J: Class path contains multiple SLF4J bindings. > SLF4J: Found binding in > [jar:file:/data/downloads/spark-1.6.0-bin-hadoop2.6/lib/spark-assembly-1.6.0-hadoop2.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class] > SLF4J: Found binding in > [jar:file:/data/downloads/spark-1.6.0-bin-hadoop2.6/lib/spark-examples-1.6.0-hadoop2.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class] > SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an > explanation. 
> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory] > 16/03/04 18:01:15 INFO SparkContext: Running Spark version 1.6.0 > 16/03/04 18:01:17 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > 16/03/04 18:01:17 WARN Utils: Your hostname, pesamara-mobl-vm1 resolves to a > loopback address: 127.0.0.1; using 10.30.9.107 instead (on interface eth0) > 16/03/04 18:01:17 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to > another address > 16/03/04 18:01:18 INFO SecurityManager: Changing view acls to: ps40233 > 16/03/04 18:01:18 INFO SecurityManager: Changing modify acls to: ps40233 > 16/03/04 18:01:18 INFO SecurityManager: SecurityManager: authentication > disabled; ui acls disabled; users with view permissions: Set(ps40233); users > with modify permissions: Set(ps40233) > 16/03/04 18:01:19 INFO Utils: Successfully started service 'sparkDriver' on > port 55309. > 16/03/04 18:01:21 INFO Slf4jLogger: Slf4jLogger started > 16/03/04 18:01:21 INFO Remoting: Starting remoting > 16/03/04 18:01:22 INFO Remoting: Remoting started; listening on addresses > :[akka.tcp://sparkDriverActorSystem@10.30.9.107:52128] > 16/03/04 18:01:22 INFO Utils: Successfully started service > 'sparkDriverActorSystem' on port 52128. > 16/03/04 18:01:22 INFO SparkEnv: Registering MapOutputTracker > 16/03/04 18:01:22 INFO SparkEnv: Registering BlockManagerMaster > 16/03/04 18:01:22 INFO DiskBlockManager: Created local directory at > /tmp/blockmgr-87c20178-357d-4252-a46a-62a755568a98 > 16/03/04 18:01:22 INFO MemoryStore: MemoryStore started with capacity 457.7 MB > 16/03/04 18:01:22 INFO SparkEnv: Registering OutputCommitCoordinator > 16/03/04 18:01:23 INFO Utils: Successfully started service 'SparkUI' on port > 4040. > 16/03/04 18:01:23 INFO SparkUI: Started SparkUI at http://10.30.9.107:4040 > 16/03/04 18:01:24 INFO AppClient$ClientEndpoint: Connecting to master > spark://sunshinee:7077... 
> 16/03/04 18:01:24 WARN AppClient$ClientEndpoint: Failed to connect to master > sunshinee:7077 > java.io.IOException: Failed to connect to sunshinee:7077 > at > org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216) > at > org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216) > at > org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216) > at > org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:167) > at > org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:200) > at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:187)
[jira] [Commented] (SPARK-13711) Apache Spark driver stopping JVM when master not available
[ https://issues.apache.org/jira/browse/SPARK-13711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15184016#comment-15184016 ] Apache Spark commented on SPARK-13711: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/11566 > Apache Spark driver stopping JVM when master not available > --- > > Key: SPARK-13711 > URL: https://issues.apache.org/jira/browse/SPARK-13711 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.4.1, 1.6.0 >Reporter: Era > > In my application Java spark context is created with an unavailable master > URL (you may assume master is down for a maintenance). When creating Java > spark context it leads to stopping JVM that runs spark driver with JVM exit > code 50. > When I checked the logs I found SparkUncaughtExceptionHandler calling the > System.exit. My program should run forever. > package test.mains; > import org.apache.spark.SparkConf; > import org.apache.spark.api.java.JavaSparkContext; > public class CheckJavaSparkContext { > public static void main(String[] args) { > SparkConf conf = new SparkConf(); > conf.setAppName("test"); > conf.setMaster("spark://sunshinee:7077"); > try { > new JavaSparkContext(conf); > } catch (Throwable e) { > System.out.println("Caught an exception : " + e.getMessage()); > > } > System.out.println("Waiting to complete..."); > while (true) { > } > } > } > Output log > Using Spark's default log4j profile: > org/apache/spark/log4j-defaults.properties > SLF4J: Class path contains multiple SLF4J bindings. 
> SLF4J: Found binding in > [jar:file:/data/downloads/spark-1.6.0-bin-hadoop2.6/lib/spark-assembly-1.6.0-hadoop2.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class] > SLF4J: Found binding in > [jar:file:/data/downloads/spark-1.6.0-bin-hadoop2.6/lib/spark-examples-1.6.0-hadoop2.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class] > SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an > explanation. > SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory] > 16/03/04 18:01:15 INFO SparkContext: Running Spark version 1.6.0 > 16/03/04 18:01:17 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > 16/03/04 18:01:17 WARN Utils: Your hostname, pesamara-mobl-vm1 resolves to a > loopback address: 127.0.0.1; using 10.30.9.107 instead (on interface eth0) > 16/03/04 18:01:17 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to > another address > 16/03/04 18:01:18 INFO SecurityManager: Changing view acls to: ps40233 > 16/03/04 18:01:18 INFO SecurityManager: Changing modify acls to: ps40233 > 16/03/04 18:01:18 INFO SecurityManager: SecurityManager: authentication > disabled; ui acls disabled; users with view permissions: Set(ps40233); users > with modify permissions: Set(ps40233) > 16/03/04 18:01:19 INFO Utils: Successfully started service 'sparkDriver' on > port 55309. > 16/03/04 18:01:21 INFO Slf4jLogger: Slf4jLogger started > 16/03/04 18:01:21 INFO Remoting: Starting remoting > 16/03/04 18:01:22 INFO Remoting: Remoting started; listening on addresses > :[akka.tcp://sparkDriverActorSystem@10.30.9.107:52128] > 16/03/04 18:01:22 INFO Utils: Successfully started service > 'sparkDriverActorSystem' on port 52128. 
> 16/03/04 18:01:22 INFO SparkEnv: Registering MapOutputTracker > 16/03/04 18:01:22 INFO SparkEnv: Registering BlockManagerMaster > 16/03/04 18:01:22 INFO DiskBlockManager: Created local directory at > /tmp/blockmgr-87c20178-357d-4252-a46a-62a755568a98 > 16/03/04 18:01:22 INFO MemoryStore: MemoryStore started with capacity 457.7 MB > 16/03/04 18:01:22 INFO SparkEnv: Registering OutputCommitCoordinator > 16/03/04 18:01:23 INFO Utils: Successfully started service 'SparkUI' on port > 4040. > 16/03/04 18:01:23 INFO SparkUI: Started SparkUI at http://10.30.9.107:4040 > 16/03/04 18:01:24 INFO AppClient$ClientEndpoint: Connecting to master > spark://sunshinee:7077... > 16/03/04 18:01:24 WARN AppClient$ClientEndpoint: Failed to connect to master > sunshinee:7077 > java.io.IOException: Failed to connect to sunshinee:7077 > at > org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216) > at > org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216) > at > org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216) > at > org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:167) > at > org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv
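A note on why the reporter's try/catch cannot keep the driver alive: the connection failure is raised on a Spark internal thread, where Spark's process-wide SparkUncaughtExceptionHandler calls System.exit, so no try/catch in main() can intercept it. The Spark-free Java sketch below (class and message names are made up for illustration) demonstrates the mechanism, using a handler that records the error instead of exiting:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicReference;

public class UncaughtHandlerDemo {
    // Returns the message seen by the process-wide handler; analogous in role to
    // Spark's SparkUncaughtExceptionHandler, except it records instead of exiting.
    static String runDemo() throws InterruptedException {
        AtomicReference<Throwable> seen = new AtomicReference<>();
        CountDownLatch done = new CountDownLatch(1);
        Thread.setDefaultUncaughtExceptionHandler((t, e) -> {
            seen.set(e);          // Spark's handler would call System.exit here
            done.countDown();
        });
        Thread worker = new Thread(() -> {
            // A try/catch in main() cannot intercept this:
            // it escapes on another thread.
            throw new RuntimeException("master unreachable");
        });
        worker.start();
        done.await();
        return seen.get().getMessage();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("driver thread still alive, saw: " + runDemo());
    }
}
```

Because the handler is installed JVM-wide, the only way for an application to survive such failures is for the handler itself not to exit, which is what the linked pull request addresses on Spark's side.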
[jira] [Commented] (SPARK-13699) Spark SQL drops the table in "overwrite" mode while writing into table
[ https://issues.apache.org/jira/browse/SPARK-13699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15183996#comment-15183996 ] Suresh Thalamati commented on SPARK-13699: -- Thank you for providing the reproduction; I was able to reproduce the issue. The problem is that you are trying to overwrite a table that is also being read from in the data frame. This is not allowed and should fail with an error (I noticed that in some cases I do get an error: org.apache.spark.sql.AnalysisException: Cannot overwrite table `t1` that is also being read from). Truncate is an interesting option, especially with the JDBC data source, but it will not address the problem you are running into; it will hit the same problem as Overwrite. {code} scala> tgtFinal.explain == Physical Plan == Union :- WholeStageCodegen : : +- Project [col1#223,col2#224,col3#225,col4#226,batchid#227,currind#228,startdate#229,cast(enddate#230 as string) AS enddate#263,updatedate#231] : : +- Filter (currind#228 = N) : :+- INPUT : +- HiveTableScan [enddate#230,updatedate#231,col2#224,col1#223,batchid#227,col3#225,startdate#229,currind#228,col4#226], MetastoreRelation default, tgt_table, None :- WholeStageCodegen : : +- Project [col1#223,col2#224,col3#225,col4#226,batchid#227,currind#228,startdate#229,cast(enddate#230 as string) AS enddate#264,updatedate#231] : : +- INPUT : +- Except : :- WholeStageCodegen : : : +- Filter (currind#228 = Y) : : : +- INPUT : : +- HiveTableScan [col1#223,col2#224,col3#225,col4#226,batchid#227,currind#228,startdate#229,enddate#230,updatedate#231], MetastoreRelation default, tgt_table, None : +- WholeStageCodegen :: +- Project [col1#223,col2#224,col3#225,col4#226,batchid#227,currind#228,startdate#229,enddate#230,updatedate#231] :: +- BroadcastHashJoin [cast(col1#223 as double)], [cast(col1#219 as double)], Inner, BuildRight, None :::- Filter (currind#228 = Y) ::: +- INPUT ::+- INPUT ::- HiveTableScan 
[col1#223,col2#224,col3#225,col4#226,batchid#227,currind#228,startdate#229,enddate#230,updatedate#231], MetastoreRelation default, tgt_table, None :+- HiveTableScan [col1#219], MetastoreRelation default, src_table, None :- WholeStageCodegen : : +- Project [col1#223,col2#224,col3#225,col4#226,batchid#227,UDF(col1#223) AS currInd#232,startdate#229,2016-03-07 15:12:20.584 AS endDate#265,1457392340584000 AS updateDate#234] : : +- BroadcastHashJoin [cast(col1#223 as double)], [cast(col1#219 as double)], Inner, BuildRight, None : ::- Project [col3#225,startdate#229,col2#224,col1#223,batchid#227,col4#226] : :: +- Filter (currind#228 = Y) : :: +- INPUT : :+- INPUT : :- HiveTableScan [col3#225,startdate#229,col2#224,col1#223,batchid#227,col4#226,currind#228], MetastoreRelation default, tgt_table, None : +- HiveTableScan [col1#219], MetastoreRelation default, src_table, None +- WholeStageCodegen : +- Project [cast(col1#219 as string) AS col1#266,col2#220,col3#221,col4#222,UDF(cast(col1#219 as string)) AS batchId#235,UDF(cast(col1#219 as string)) AS currInd#236,1457392340584000 AS startDate#237,date_format(cast(UDF(cast(col1#219 as string)) as timestamp),-MM-dd HH:mm:ss) AS endDate#238,1457392340584000 AS updateDate#239] : +- INPUT +- HiveTableScan [col1#219,col2#220,col3#221,col4#222], MetastoreRelation default, src_table, None scala> {code} > Spark SQL drops the table in "overwrite" mode while writing into table > -- > > Key: SPARK-13699 > URL: https://issues.apache.org/jira/browse/SPARK-13699 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.0 >Reporter: Dhaval Modi > Attachments: stackTrace.txt > > > Hi, > While writing the dataframe to HIVE table with "SaveMode.Overwrite" option. > E.g. > tgtFinal.write.mode(SaveMode.Overwrite).saveAsTable("tgt_table") > sqlContext drop the table instead of truncating. > This is causing error while overwriting. 
> Adding stacktrace & commands to reproduce the issue, > Thanks & Regards, > Dhaval -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
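As the comment above explains, overwriting a table that the same query lazily reads from cannot work: the plan would scan a table that has already been dropped. The general workaround, sketched below with plain files rather than Spark (all names are illustrative), is to materialize the derived result to a staging location first and only then swap it into place; the analogous Spark-side pattern is saving to a staging table and then replacing the target.

```java
import java.io.IOException;
import java.nio.file.*;

public class SafeOverwrite {
    // Hypothetical helper illustrating the pattern: never overwrite a source
    // you are still reading from. Materialize the new contents elsewhere first,
    // then atomically swap them into place.
    static void rewrite(Path target, String derivedFromTarget) throws IOException {
        Path tmp = target.resolveSibling(target.getFileName() + ".tmp");
        Files.writeString(tmp, derivedFromTarget);           // materialize first
        Files.move(tmp, target, StandardCopyOption.REPLACE_EXISTING,
                   StandardCopyOption.ATOMIC_MOVE);          // then swap
    }

    public static void main(String[] args) throws IOException {
        Path p = Files.createTempFile("tgt_table", ".txt");
        Files.writeString(p, "old rows");
        // "Read" from the target, derive new contents from it...
        String derived = Files.readString(p) + " + new rows";
        // ...and overwrite safely via the staging file.
        rewrite(p, derived);
        System.out.println(Files.readString(p));
    }
}
```

The key property is that the read completes against the original data before the target is touched, which is exactly what a lazy overwrite of the same table violates.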
[jira] [Comment Edited] (SPARK-13726) Spark 1.6.0 stopping working for HiveThriftServer2 and registerTempTable
[ https://issues.apache.org/jira/browse/SPARK-13726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15183991#comment-15183991 ] Yin Huai edited comment on SPARK-13726 at 3/7/16 11:21 PM: --- I took a look at the code. I think this change is caused by Session Management added in 1.6 (https://issues.apache.org/jira/browse/SPARK-10810). Basically, every jdbc session creates its own session. So, it does not see temp table registered through df.registerTempTable, which is registered in another session. You can set {{spark.sql.hive.thriftServer.singleSession}} to {{true}} to change the behavior back. was (Author: yhuai): I took a look at the code. I think this change is caused by Session Management added in 1.6. Basically, every jdbc session creates its own session. So, it does not see temp table registered through df.registerTempTable, which is registered in another session. You can set {{spark.sql.hive.thriftServer.singleSession}} to {{true}} to change the behavior back. > Spark 1.6.0 stopping working for HiveThriftServer2 and registerTempTable > > > Key: SPARK-13726 > URL: https://issues.apache.org/jira/browse/SPARK-13726 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Michael Nguyen >Priority: Blocker > > In Spark 1.5.2, DataFrame.registerTempTable works and > hiveContext.table(registerTableName) and HiveThriftServer2 see those tables. > In Spark 1.6.0, hiveContext.table(registerTableName) and HiveThriftServer2 do > not see those tables, even though DataFrame.registerTempTable does not return > an error. > Since this feature used to work in Spark 1.5.2, there is existing code that > breaks after upgrading to Spark 1.6.0. so this issue is a blocker and urgent. > Therefore, please have it fixed asap.
[jira] [Commented] (SPARK-13726) Spark 1.6.0 stopping working for HiveThriftServer2 and registerTempTable
[ https://issues.apache.org/jira/browse/SPARK-13726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15183991#comment-15183991 ] Yin Huai commented on SPARK-13726: -- I took a look at the code. I think this change is caused by Session Management added in 1.6. Basically, every jdbc session creates its own session. So, it does not see temp table registered through df.registerTempTable, which is registered in another session. You can set {{spark.sql.hive.thriftServer.singleSession}} to {{true}} to change the behavior back. > Spark 1.6.0 stopping working for HiveThriftServer2 and registerTempTable > > > Key: SPARK-13726 > URL: https://issues.apache.org/jira/browse/SPARK-13726 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Michael Nguyen >Priority: Blocker > > In Spark 1.5.2, DataFrame.registerTempTable works and > hiveContext.table(registerTableName) and HiveThriftServer2 see those tables. > In Spark 1.6.0, hiveContext.table(registerTableName) and HiveThriftServer2 do > not see those tables, even though DataFrame.registerTempTable does not return > an error. > Since this feature used to work in Spark 1.5.2, there is existing code that > breaks after upgrading to Spark 1.6.0. so this issue is a blocker and urgent. > Therefore, please have it fixed asap.
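To illustrate the session-isolation behavior described in the comment above, here is a toy Java model (none of these classes are Spark's; they are invented for this sketch) of how a per-session temp-table namespace hides tables registered in another session, and how a single-session flag restores the shared pre-1.6 behavior:

```java
import java.util.HashMap;
import java.util.Map;

public class SessionCatalogDemo {
    // Toy model of the Spark 1.6 thrift server: each session gets its own
    // temp-table namespace unless singleSession is enabled.
    static class Server {
        final boolean singleSession;
        final Map<String, String> shared = new HashMap<>();
        Server(boolean singleSession) { this.singleSession = singleSession; }
        Session newSession() {
            // Shared catalog when singleSession, a fresh private one otherwise.
            return new Session(singleSession ? shared : new HashMap<>());
        }
    }

    static class Session {
        final Map<String, String> tempTables;
        Session(Map<String, String> tempTables) { this.tempTables = tempTables; }
        void registerTempTable(String name, String df) { tempTables.put(name, df); }
        boolean sees(String name) { return tempTables.containsKey(name); }
    }

    public static void main(String[] args) {
        Server isolated = new Server(false);
        Session a = isolated.newSession();
        a.registerTempTable("t", "df");
        System.out.println(isolated.newSession().sees("t")); // separate catalogs

        Server single = new Server(true);
        single.newSession().registerTempTable("t", "df");
        System.out.println(single.newSession().sees("t"));   // shared catalog
    }
}
```

In real deployments the equivalent of the flag is the {{spark.sql.hive.thriftServer.singleSession}} configuration mentioned above.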
[jira] [Updated] (SPARK-13731) expression evaluation for NaN in select statement
[ https://issues.apache.org/jira/browse/SPARK-13731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ian updated SPARK-13731: Description: We are expecting that arithmetic expression a/b should be: 1. returning NaN if a=0 and b=0 2. returning Infinity if a=1 and b=0 Is the expectation reasonable? The following is a simple test case snippet that reads from storage and evaluates arithmetic in select. It is assuming org.apache.spark.sql.hive.execution.SQLQuerySuite: {code} test("Expression should be evaluated to Nan/Infinity in Select") { withTable("testNan") { withTempTable("src") { Seq((1d, 0d), (0d, 0d)).toDF().registerTempTable("src") sql("CREATE TABLE testNan(a double, b double) STORED AS PARQUET AS SELECT * FROM src") } checkAnswer(sql( """ |SELECT a/b FROM testNan """.stripMargin), Seq( Row(Double.PositiveInfinity), Row(Double.NaN) ) ) } } == Physical Plan == Project [(a#28 / b#29) AS _c0#30] +- Scan ParquetRelation: default.testnan[a#28,b#29] InputPaths: file:/private/var/folders/dy/19y6pfm92pj9s40mbs8xd9hmgp/T/warehouse--5b617080-e909-4812-90e8-63d2dd0aef5a/testnan == Results == !== Correct Answer - 2 == == Spark Answer - 2 == ![Infinity] [null] ![NaN] [null] {code} was: We are expecting arithmetic expression a/b should be: 1. returning NaN if a=0 and b=0 2. returning Infinity if a=1 and b=0 Is the expectation reasonable? The following is a simple test case snippet that reads from storage and evaluates arithmetic in select. 
It is assuming org.apache.spark.sql.hive.execution.SQLQuerySuite: {code} test("Expression should be evaluated to Nan/Infinity in Select") { withTable("testNan") { withTempTable("src") { Seq((1d, 0d), (0d, 0d)).toDF().registerTempTable("src") sql("CREATE TABLE testNan(a double, b double) STORED AS PARQUET AS SELECT * FROM src") } checkAnswer(sql( """ |SELECT a/b FROM testNan """.stripMargin), Seq( Row(Double.PositiveInfinity), Row(Double.NaN) ) ) } } == Physical Plan == Project [(a#28 / b#29) AS _c0#30] +- Scan ParquetRelation: default.testnan[a#28,b#29] InputPaths: file:/private/var/folders/dy/19y6pfm92pj9s40mbs8xd9hmgp/T/warehouse--5b617080-e909-4812-90e8-63d2dd0aef5a/testnan == Results == !== Correct Answer - 2 == == Spark Answer - 2 == ![Infinity] [null] ![NaN] [null] {code} > expression evaluation for NaN in select statement > - > > Key: SPARK-13731 > URL: https://issues.apache.org/jira/browse/SPARK-13731 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Ian > > We are expecting that arithmetic expression a/b should be: > 1. returning NaN if a=0 and b=0 > 2. returning Infinity if a=1 and b=0 > Is the expectation reasonable? > The following is a simple test case snippet that reads from storage and > evaluates arithmetic in select. 
> It is assuming org.apache.spark.sql.hive.execution.SQLQuerySuite: > {code} > test("Expression should be evaluated to Nan/Infinity in Select") { > withTable("testNan") { > withTempTable("src") { > Seq((1d, 0d), (0d, 0d)).toDF().registerTempTable("src") > sql("CREATE TABLE testNan(a double, b double) STORED AS PARQUET AS > SELECT * FROM src") > } > checkAnswer(sql( > """ > |SELECT a/b FROM testNan > """.stripMargin), > Seq( > Row(Double.PositiveInfinity), > Row(Double.NaN) > ) > ) > } > } > == Physical Plan == > Project [(a#28 / b#29) AS _c0#30] > +- Scan ParquetRelation: default.testnan[a#28,b#29] InputPaths: > file:/private/var/folders/dy/19y6pfm92pj9s40mbs8xd9hmgp/T/warehouse--5b617080-e909-4812-90e8-63d2dd0aef5a/testnan > == Results == > !== Correct Answer - 2 == == Spark Answer - 2 == > ![Infinity] [null] > ![NaN] [null] > > {code}
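For reference, the reporter's expectation matches plain IEEE 754 double arithmetic, which Java (and therefore the JVM running Spark) implements directly; the bug shown above is that the Parquet-backed query returns null instead. A minimal Java check of the expected values:

```java
public class DivisionSemantics {
    // IEEE 754 double division: 0/0 is NaN, x/0 is signed Infinity for x != 0.
    static double div(double a, double b) { return a / b; }

    public static void main(String[] args) {
        System.out.println(div(0d, 0d)); // NaN
        System.out.println(div(1d, 0d)); // Infinity
    }
}
```

So the expectation in the report is reasonable for floating-point division; integer division by zero, by contrast, would throw, which is one reason SQL engines sometimes choose null here.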
[jira] [Resolved] (SPARK-13665) Initial separation of concerns
[ https://issues.apache.org/jira/browse/SPARK-13665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-13665. - Resolution: Fixed Fix Version/s: 2.0.0 > Initial separation of concerns > -- > > Key: SPARK-13665 > URL: https://issues.apache.org/jira/browse/SPARK-13665 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Michael Armbrust >Assignee: Michael Armbrust >Priority: Blocker > Fix For: 2.0.0 > > > The goal here is to break apart: File Management, code to deal with specific > formats and query planning.
[jira] [Commented] (SPARK-13725) Spark 1.6.0 stopping working for HiveThriftServer2 and registerTempTable
[ https://issues.apache.org/jira/browse/SPARK-13725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15183964#comment-15183964 ] Michael Nguyen commented on SPARK-13725: I typically do not set issues to Blocker. I set this issue to Blocker, because these specified APIs used to work in earlier versions of Spark up to 1.5.2, and there is existing code that relies on them and now fails because of this issue in Spark 1.6.0. > Spark 1.6.0 stopping working for HiveThriftServer2 and registerTempTable > > > Key: SPARK-13725 > URL: https://issues.apache.org/jira/browse/SPARK-13725 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 > Environment: Spark 1.6.0 with DataFrame.registerTempTable and > HiveThriftServer2 >Reporter: Michael Nguyen > > In Spark 1.5.2, DataFrame.registerTempTable API works correctly and > HiveThriftServer2 sees and returns temp tables that are registered via that > API. > In Spark 1.6.0, that stopped working. registerTempTable API does not return > an error so it is a false positive, and HiveThriftServer2 does not see such > tables. And hiveContext.table(registerTableName) indicates it does not see > those tables either. > Is there a temporary work-around solution in Spark 1.6.0 ? When would it be > fixed ? > Thanks.
[jira] [Created] (SPARK-13733) Support initial weight distribution in personalized PageRank
Xiangrui Meng created SPARK-13733: - Summary: Support initial weight distribution in personalized PageRank Key: SPARK-13733 URL: https://issues.apache.org/jira/browse/SPARK-13733 Project: Spark Issue Type: New Feature Components: GraphX Reporter: Xiangrui Meng It would be nice to support personalized PageRank with an initial weight distribution besides a single vertex. It should be easy to modify the current implementation to add this support.
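To make the proposal concrete: personalized PageRank with an initial weight distribution replaces the single reset vertex with a reset vector that the random surfer teleports into. The sketch below is a plain-Java power iteration, not GraphX's Pregel implementation, and the dangling-node and damping conventions shown are one common choice among several:

```java
import java.util.Arrays;

public class PersonalizedPageRank {
    // Personalized PageRank generalized from a single reset vertex to an
    // arbitrary reset distribution. adj[i] lists out-neighbors of vertex i;
    // reset must sum to 1; d is the damping factor.
    static double[] rank(int[][] adj, double[] reset, double d, int iters) {
        int n = adj.length;
        double[] r = reset.clone();               // start from the reset distribution
        for (int it = 0; it < iters; it++) {
            double[] next = new double[n];
            for (int i = 0; i < n; i++) {
                if (adj[i].length == 0) {
                    // Dangling mass is redistributed along the reset distribution.
                    for (int j = 0; j < n; j++) next[j] += d * r[i] * reset[j];
                } else {
                    for (int v : adj[i]) next[v] += d * r[i] / adj[i].length;
                }
            }
            // Teleport step: (1 - d) of the mass returns to the reset vector.
            for (int j = 0; j < n; j++) next[j] += (1 - d) * reset[j];
            r = next;
        }
        return r;
    }

    public static void main(String[] args) {
        int[][] adj = {{1}, {2}, {0}};            // a 3-cycle
        double[] reset = {0.5, 0.5, 0.0};         // weight spread over two vertices
        double[] r = rank(adj, reset, 0.85, 50);
        System.out.println(Arrays.toString(r));   // ranks sum to ~1.0
    }
}
```

With this formulation, the existing single-vertex personalization is just the special case where `reset` is a one-hot vector.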
[jira] [Commented] (SPARK-13726) Spark 1.6.0 stopping working for HiveThriftServer2 and registerTempTable
[ https://issues.apache.org/jira/browse/SPARK-13726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15183953#comment-15183953 ] Michael Nguyen commented on SPARK-13726: Thrift server was started with HiveThriftServer2.startWithContext. However, this issue is not specific to HiveThriftServer2.startWithContext. The underlying cause is that after the table is registered via DataFrame.registerTempTable, hiveContext.table(registerTableName) still fails because it does not see that table as registered. > Spark 1.6.0 stopping working for HiveThriftServer2 and registerTempTable > > > Key: SPARK-13726 > URL: https://issues.apache.org/jira/browse/SPARK-13726 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Michael Nguyen >Priority: Blocker > > In Spark 1.5.2, DataFrame.registerTempTable works and > hiveContext.table(registerTableName) and HiveThriftServer2 see those tables. > In Spark 1.6.0, hiveContext.table(registerTableName) and HiveThriftServer2 do > not see those tables, even though DataFrame.registerTempTable does not return > an error. > Since this feature used to work in Spark 1.5.2, there is existing code that > breaks after upgrading to Spark 1.6.0. so this issue is a blocker and urgent. > Therefore, please have it fixed asap.
[jira] [Commented] (SPARK-13682) Finalize the public API for FileFormat
[ https://issues.apache.org/jira/browse/SPARK-13682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15183949#comment-15183949 ] Reynold Xin commented on SPARK-13682: - Most "trait" should probably become "abstract class". > Finalize the public API for FileFormat > -- > > Key: SPARK-13682 > URL: https://issues.apache.org/jira/browse/SPARK-13682 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Michael Armbrust > > The current file format interface needs to be cleaned up before it's > acceptable for public consumption: > - Have a version that takes Row and does a conversion, hide the internal API. > - Remove bucketing > - Remove RDD and the broadcastedConf > - Remove SQLContext (maybe include SparkSession?) > - Pass a better conf object
[jira] [Resolved] (SPARK-13596) Move misc top-level build files into appropriate subdirs
[ https://issues.apache.org/jira/browse/SPARK-13596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-13596. - Resolution: Fixed Assignee: Sean Owen Fix Version/s: 2.0.0 > Move misc top-level build files into appropriate subdirs > > > Key: SPARK-13596 > URL: https://issues.apache.org/jira/browse/SPARK-13596 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 2.0.0 >Reporter: Sean Owen >Assignee: Sean Owen > Fix For: 2.0.0 > > > I'd like to file away a bunch of misc files that are in the top level of the > project in order to further tidy the build for 2.0.0. See also SPARK-13529, > SPARK-13548. > Some of these may turn out to be difficult or impossible to move. > I'd ideally like to move these files into {{build/}}: > - {{.rat-excludes}} > - {{checkstyle.xml}} > - {{checkstyle-suppressions.xml}} > - {{pylintrc}} > - {{scalastyle-config.xml}} > - {{tox.ini}} > - {{project/}} (or does SBT need this in the root?) > And ideally, these would go under {{dev/}} > - {{make-distribution.sh}} > And remove these > - {{sbt/sbt}} (backwards-compatible location of {{build/sbt}} right?) > Edited to add: apparently this can go in {{.github}} now: > - {{CONTRIBUTING.md}} > Other files in the top level seem to need to be there, like {{README.md}}
[jira] [Updated] (SPARK-13692) Fix trivial Coverity/Checkstyle defects
[ https://issues.apache.org/jira/browse/SPARK-13692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-13692: -- Description: This issue fixes the following potential bugs and Java coding style detected by Coverity and Checkstyle. * Implement both null and type checking in equals functions. * Fix wrong type casting logic in SimpleJavaBean2.equals. * Add `implement Cloneable` to `UTF8String` and `SortedIterator`. * Remove dereferencing before null check in `AbstractBytesToBytesMapSuite`. * Fix coding style: Add '{}' to single `for` statement in mllib examples. * Remove unused imports in `ColumnarBatch` and `JavaKinesisStreamSuite.java`. * Remove unused fields in `ChunkFetchIntegrationSuite`. * Add `stop()` to prevent resource leak. Please note that the last two checkstyle errors exist on newly added commits after [SPARK-13583]. was: This issue fixes the following potential bugs and Java coding style detected by Coverity and Checkstyle. * Implement both null and type checking in equals functions. * Fix wrong type casting logic in SimpleJavaBean2.equals. * Add `implement Cloneable` to `UTF8String` and `SortedIterator`. * Remove dereferencing before null check in `AbstractBytesToBytesMapSuite`. * Fix coding style: Add '{}' to single `for` statement in mllib examples. * Remove unused imports in `ColumnarBatch`. * Remove unused fields in `ChunkFetchIntegrationSuite`. * Add `stop()` to prevent resource leak. Please note that the last two checkstyle errors exist on newly added commits after [SPARK-13583]. > Fix trivial Coverity/Checkstyle defects > --- > > Key: SPARK-13692 > URL: https://issues.apache.org/jira/browse/SPARK-13692 > Project: Spark > Issue Type: Bug > Components: Examples, Spark Core, SQL >Reporter: Dongjoon Hyun >Priority: Trivial > > This issue fixes the following potential bugs and Java coding style detected > by Coverity and Checkstyle. > * Implement both null and type checking in equals functions. 
> * Fix wrong type casting logic in SimpleJavaBean2.equals. > * Add `implement Cloneable` to `UTF8String` and `SortedIterator`. > * Remove dereferencing before null check in `AbstractBytesToBytesMapSuite`. > * Fix coding style: Add '{}' to single `for` statement in mllib examples. > * Remove unused imports in `ColumnarBatch` and > `JavaKinesisStreamSuite.java`. > * Remove unused fields in `ChunkFetchIntegrationSuite`. > * Add `stop()` to prevent resource leak. > Please note that the last two checkstyle errors exist on newly added commits > after [SPARK-13583].
[jira] [Updated] (SPARK-13638) Support for saving with a quote mode
[ https://issues.apache.org/jira/browse/SPARK-13638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-13638: - Description: https://github.com/databricks/spark-csv/pull/254 tobithiel reported this. {quote} I'm dealing with some messy csv files and being able to just quote all fields is very useful, so that other applications don't misunderstand the file because of some sketchy characters {quote} When writing there are several quote modes in apache commons csv. (See https://commons.apache.org/proper/commons-csv/apidocs/org/apache/commons/csv/QuoteMode.html) This might have to be supported. However, it looks like the univocity parser used for writing (currently the only supported library) does not support this quote mode. I think we can drop this backwards compatibility if we are not going to add apache commons csv. This is a reminder that it might break backwards compatibility for the options, {{quoteMode}}. was: https://github.com/databricks/spark-csv/pull/254 tobithiel reported this. {quote} I'm dealing with some messy csv files and being able to just quote all fields is very useful, so that other applications don't misunderstand the file because of some sketchy characters {quote} When writing there are several quote modes in apache commons csv. (See https://commons.apache.org/proper/commons-csv/apidocs/org/apache/commons/csv/QuoteMode.html) This might have to be supported. However, it looks like the univocity parser used for writing (currently the only supported library) does not support this quote mode. I think we can drop this backwards compatibility if we are not going to add apache commons csv. This is a reminder that it will break backwards compatibility for the options, {{quoteMode}}. 
> Support for saving with a quote mode > > > Key: SPARK-13638 > URL: https://issues.apache.org/jira/browse/SPARK-13638 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Hyukjin Kwon >Priority: Minor > > https://github.com/databricks/spark-csv/pull/254 > tobithiel reported this. > {quote} > I'm dealing with some messy csv files and being able to just quote all fields > is very useful, > so that other applications don't misunderstand the file because of some > sketchy characters > {quote} > When writing there are several quote modes in apache commons csv. (See > https://commons.apache.org/proper/commons-csv/apidocs/org/apache/commons/csv/QuoteMode.html) > This might have to be supported. > However, it looks like the univocity parser used for writing (currently > the only supported library) does not support this quote mode. I think we can > drop this backwards compatibility if we are not going to add apache commons > csv. > This is a reminder that it might break backwards compatibility for the > options, {{quoteMode}}.
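For readers unfamiliar with the quote modes under discussion: {{QuoteMode.ALL}} in Apache Commons CSV quotes every field, with embedded quotes doubled per RFC 4180. A minimal hand-rolled Java illustration of that output shape (this is deliberately not the commons-csv or univocity API, just the format they would produce):

```java
public class QuoteAllCsv {
    // Emit one CSV record with every field quoted (the QuoteMode.ALL behavior):
    // embedded double quotes are escaped by doubling them, per RFC 4180.
    static String writeRow(String[] fields) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < fields.length; i++) {
            if (i > 0) sb.append(',');
            sb.append('"')
              .append(fields[i].replace("\"", "\"\""))
              .append('"');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(writeRow(new String[]{"a", "he said \"hi\"", "1,5"}));
    }
}
```

Quoting every field is what makes "messy" values such as embedded commas and quotes unambiguous to downstream consumers, which is the use case the reporter describes.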
[jira] [Assigned] (SPARK-13732) Remove projectList from Windows
[ https://issues.apache.org/jira/browse/SPARK-13732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13732: Assignee: Apache Spark > Remove projectList from Windows > --- > > Key: SPARK-13732 > URL: https://issues.apache.org/jira/browse/SPARK-13732 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li >Assignee: Apache Spark > > projectList is useless. Remove it from the class Window. It simplifies the > code in Analyzer and Optimizer.