[jira] [Updated] (SPARK-2227) Support "dfs" command
[ https://issues.apache.org/jira/browse/SPARK-2227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-2227: --- Assignee: Reynold Xin > Support "dfs" command > - > > Key: SPARK-2227 > URL: https://issues.apache.org/jira/browse/SPARK-2227 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin >Priority: Minor > > Potentially just delegate to Hive. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2227) Support "dfs" command
[ https://issues.apache.org/jira/browse/SPARK-2227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-2227: --- Affects Version/s: (was: 1.0.0) > Support "dfs" command > - > > Key: SPARK-2227 > URL: https://issues.apache.org/jira/browse/SPARK-2227 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Priority: Minor > > Potentially just delegate to Hive. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2227) Support "dfs" command
[ https://issues.apache.org/jira/browse/SPARK-2227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-2227: --- Priority: Minor (was: Major) > Support "dfs" command > - > > Key: SPARK-2227 > URL: https://issues.apache.org/jira/browse/SPARK-2227 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.0.0 >Reporter: Reynold Xin >Priority: Minor > > Potentially just delegate to Hive. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2179) Public API for DataTypes and Schema
[ https://issues.apache.org/jira/browse/SPARK-2179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14039682#comment-14039682 ] Yin Huai commented on SPARK-2179: - Just a note. For every data type we have, I think it is better to have a string name (e.g. "int" for IntegerType). > Public API for DataTypes and Schema > --- > > Key: SPARK-2179 > URL: https://issues.apache.org/jira/browse/SPARK-2179 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Michael Armbrust >Assignee: Yin Huai > > We want something like the following: > * Expose DataType in the SQL package and lock down all the internal details > (TypeTags, etc) > * Programmatic API for viewing the schema of an RDD as a StructType > * Method that creates a schema RDD given (RDD[A], StructType, A => Row) -- This message was sent by Atlassian JIRA (v6.2#6252)
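The proposed public API could look roughly like the following sketch. This is purely illustrative — the type and field names are assumptions based on the issue description (including the string-name suggestion from the comment above), not the final Spark API:

```scala
// Hypothetical sketch of a public DataType hierarchy with internal details
// (TypeTags, etc.) hidden, plus Yin Huai's suggested string names.
sealed trait DataType { def typeName: String }
case object IntegerType extends DataType { val typeName = "int" }
case object StringType  extends DataType { val typeName = "string" }

case class StructField(name: String, dataType: DataType)
// A StructType describes the schema of an RDD viewed as rows.
case class StructType(fields: Seq[StructField]) extends DataType {
  val typeName = "struct"
}
```

A schema could then be built programmatically, e.g. `StructType(Seq(StructField("key", IntegerType), StructField("value", StringType)))`.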
[jira] [Created] (SPARK-2227) Support "dfs" command
Reynold Xin created SPARK-2227: -- Summary: Support "dfs" command Key: SPARK-2227 URL: https://issues.apache.org/jira/browse/SPARK-2227 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.0.0 Reporter: Reynold Xin Potentially just delegate to Hive. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2061) Deprecate `splits` in JavaRDDLike and add `partitions`
[ https://issues.apache.org/jira/browse/SPARK-2061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-2061. Resolution: Fixed Fix Version/s: 1.1.0 Issue resolved by pull request 1062 [https://github.com/apache/spark/pull/1062] > Deprecate `splits` in JavaRDDLike and add `partitions` > -- > > Key: SPARK-2061 > URL: https://issues.apache.org/jira/browse/SPARK-2061 > Project: Spark > Issue Type: Bug > Components: Java API >Reporter: Patrick Wendell >Assignee: Anant Daksh Asthana >Priority: Minor > Labels: starter > Fix For: 1.1.0 > > > Most of Spark has moved over to consistently using `partitions` instead of > `splits`. We should do likewise and add a `partitions` method to JavaRDDLike > and have `splits` just call that. We should also go through all cases where > other APIs (e.g. Python) call `splits` and we should change those to use the > newer API. -- This message was sent by Atlassian JIRA (v6.2#6252)
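The rename-with-delegation pattern described in the issue can be sketched as follows. The signatures are simplified (the real method returns a list of `Partition` objects), so treat this as an illustration of the pattern, not the actual JavaRDDLike code:

```scala
// Sketch of deprecating `splits` in favor of `partitions`:
// the old name survives for source compatibility but just delegates.
trait JavaRDDLike[T] {
  /** Preferred name, consistent with the rest of Spark. */
  def partitions: Seq[Int]

  /** Old name kept for compatibility; delegates to partitions. */
  @deprecated("Use partitions instead", "1.1.0")
  def splits: Seq[Int] = partitions
}
```

Callers of `splits` keep working but see a deprecation warning at compile time, which is exactly the migration path the issue asks for.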
[jira] [Updated] (SPARK-1777) Pass "cached" blocks directly to disk if memory is not large enough
[ https://issues.apache.org/jira/browse/SPARK-1777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-1777: - Attachment: spark-1777-design-doc.pdf > Pass "cached" blocks directly to disk if memory is not large enough > --- > > Key: SPARK-1777 > URL: https://issues.apache.org/jira/browse/SPARK-1777 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Patrick Wendell >Assignee: Andrew Or > Fix For: 1.1.0 > > Attachments: spark-1777-design-doc.pdf > > > Currently in Spark we entirely unroll a partition and then check whether it > will cause us to exceed the storage limit. This has an obvious problem - if > the partition itself is enough to push us over the storage limit (and > eventually over the JVM heap), it will cause an OOM. > This can happen in cases where a single partition is very large or when > someone is running examples locally with a small heap. > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/CacheManager.scala#L106 > We should think a bit about the most elegant way to fix this - it shares some > similarities with the external aggregation code. > A simple idea is to periodically check the size of the buffer as we are > unrolling and see if we are over the memory limit. If we are we could prepend > the existing buffer to the iterator and write that entire thing out to disk. -- This message was sent by Atlassian JIRA (v6.2#6252)
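The "simple idea" at the end of the issue — check the buffer size periodically while unrolling, and on overflow prepend the buffer to the remaining iterator so the whole partition can go to disk — can be sketched as below. Names and the per-element size estimator are hypothetical, not the actual CacheManager code:

```scala
// Unroll a partition under a memory budget. If the budget is exceeded,
// return the already-buffered elements chained onto the rest of the
// iterator so the caller can write the entire partition to disk.
def unrollSafely[T](values: Iterator[T], maxBytes: Long, sizeOf: T => Long)
    : Either[Vector[T], Iterator[T]] = {
  var used = 0L
  var kept = Vector.empty[T]
  while (values.hasNext) {
    val v = values.next()
    used += sizeOf(v)
    kept = kept :+ v
    if (used > maxBytes)
      // Over budget: no OOM, hand everything back for spilling.
      return Right(kept.iterator ++ values)
  }
  Left(kept) // the whole partition fit in memory
}
```

`Left` means the partition was fully cached in memory; `Right` means the caller should stream the returned iterator to disk instead of continuing to unroll.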
[jira] [Resolved] (SPARK-1970) Update unit test in XORShiftRandomSuite to use ChiSquareTest from commons-math3
[ https://issues.apache.org/jira/browse/SPARK-1970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-1970. Resolution: Fixed Fix Version/s: 1.1.0 Issue resolved by pull request 1073 [https://github.com/apache/spark/pull/1073] > Update unit test in XORShiftRandomSuite to use ChiSquareTest from > commons-math3 > --- > > Key: SPARK-1970 > URL: https://issues.apache.org/jira/browse/SPARK-1970 > Project: Spark > Issue Type: Test > Components: Spark Core >Reporter: Doris Xin >Assignee: Doris Xin >Priority: Minor > Fix For: 1.1.0 > > > Since we're adding commons-math3 as a test dependency to spark core, updating > the test for randomness in XORShiftRandomSuite to use an actual chi square > test as opposed to hardcoding everything. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-1902) Spark shell prints error when :4040 port already in use
[ https://issues.apache.org/jira/browse/SPARK-1902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-1902. Resolution: Fixed Fix Version/s: 1.1.0 Issue resolved by pull request 1019 [https://github.com/apache/spark/pull/1019] > Spark shell prints error when :4040 port already in use > --- > > Key: SPARK-1902 > URL: https://issues.apache.org/jira/browse/SPARK-1902 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.0 >Reporter: Andrew Ash >Assignee: Andrew Ash > Fix For: 1.1.0 > > > When running two shells on the same machine, I get the below error. The > issue is that the first shell takes port 4040, then the next tries 4040 and > fails so falls back to 4041, then a third would try 4040 and 4041 before > landing on 4042, etc. > We should catch the error and instead log as "Unable to use port 4041; > already in use. Attempting port 4042..." > {noformat} > 14/05/22 11:31:54 WARN component.AbstractLifeCycle: FAILED > SelectChannelConnector@0.0.0.0:4041: java.net.BindException: Address already > in use > java.net.BindException: Address already in use > at sun.nio.ch.Net.bind0(Native Method) > at sun.nio.ch.Net.bind(Net.java:444) > at sun.nio.ch.Net.bind(Net.java:436) > at > sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:214) > at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74) > at > org.eclipse.jetty.server.nio.SelectChannelConnector.open(SelectChannelConnector.java:187) > at > org.eclipse.jetty.server.AbstractConnector.doStart(AbstractConnector.java:316) > at > org.eclipse.jetty.server.nio.SelectChannelConnector.doStart(SelectChannelConnector.java:265) > at > org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64) > at org.eclipse.jetty.server.Server.doStart(Server.java:293) > at > org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64) > at > 
org.apache.spark.ui.JettyUtils$$anonfun$1.apply$mcV$sp(JettyUtils.scala:192) > at > org.apache.spark.ui.JettyUtils$$anonfun$1.apply(JettyUtils.scala:192) > at > org.apache.spark.ui.JettyUtils$$anonfun$1.apply(JettyUtils.scala:192) > at scala.util.Try$.apply(Try.scala:161) > at org.apache.spark.ui.JettyUtils$.connect$1(JettyUtils.scala:191) > at > org.apache.spark.ui.JettyUtils$.startJettyServer(JettyUtils.scala:205) > at org.apache.spark.ui.WebUI.bind(WebUI.scala:99) > at org.apache.spark.SparkContext.(SparkContext.scala:217) > at > org.apache.spark.repl.SparkILoop.createSparkContext(SparkILoop.scala:957) > at $line3.$read$$iwC$$iwC.(:8) > at $line3.$read$$iwC.(:14) > at $line3.$read.(:16) > at $line3.$read$.(:20) > at $line3.$read$.() > at $line3.$eval$.(:7) > at $line3.$eval$.() > at $line3.$eval.$print() > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:788) > at > org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1056) > at > org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:614) > at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:645) > at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:609) > at > org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:796) > at > org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:841) > at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:753) > at > org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:121) > at > org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:120) > at > 
org.apache.spark.repl.SparkIMain.beQuietDuring(SparkIMain.scala:263) > at > org.apache.spark.repl.SparkILoopInit$class.initializeSpark(SparkILoopInit.scala:120) > at > org.apache.spark.repl.SparkILoop.initializeSpark(SparkILoop.scala:56) > at > org.apache.spark.repl.SparkILoop$$anonfun$process$1$$anonfun$apply$mcZ$sp$5.apply$mcV$sp(SparkILoop.scala:913) > at > org.apache.spark.repl.SparkILoopInit$class.runThunks(SparkILoopInit.scala:142) > at org.apache.spark.repl.Spar
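The behavior SPARK-1902 asks for — catch the bind failure and log a friendly retry message instead of a stack trace — can be sketched with a generic retry helper. This is a hypothetical helper, not the actual `JettyUtils` code:

```scala
import scala.util.{Failure, Success, Try}

// Try successive ports starting at basePort; on a conflict, log the
// suggested "Unable to use port N; already in use" message and retry.
def startOnPort[A](basePort: Int, maxRetries: Int)(bind: Int => A): (A, Int) = {
  var attempt = 0
  while (true) {
    val port = basePort + attempt
    Try(bind(port)) match {
      case Success(server) => return (server, port)
      case Failure(_) if attempt < maxRetries =>
        println(s"Unable to use port $port; already in use. " +
                s"Attempting port ${port + 1}...")
        attempt += 1
      case Failure(e) => throw e // out of retries: surface the real error
    }
  }
  sys.error("unreachable")
}
```

With two shells already holding 4040 and 4041, `startOnPort(4040, 16)(bindJetty)` would log two one-line messages and come up on 4042.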
[jira] [Commented] (SPARK-1652) Fixes and improvements for spark-submit/configs
[ https://issues.apache.org/jira/browse/SPARK-1652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14039565#comment-14039565 ] Nicholas Chammas commented on SPARK-1652: - [~pwendell], is there currently a work-around for running Python driver programs on the cluster? Trying this in 1.0.0 currently yields a succinct: {code} Error: Cannot currently run Python driver programs on cluster {code} I'm not sure if there is a separate issue to track this, or if this is the issue I should watch. > Fixes and improvements for spark-submit/configs > --- > > Key: SPARK-1652 > URL: https://issues.apache.org/jira/browse/SPARK-1652 > Project: Spark > Issue Type: Bug > Components: Spark Core, YARN >Reporter: Patrick Wendell >Assignee: Patrick Wendell >Priority: Blocker > Fix For: 1.1.0 > > > These are almost all a result of my config patch. Unfortunately the changes > were difficult to unit-test and there were several edge cases reported. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1856) Standardize MLlib interfaces
[ https://issues.apache.org/jira/browse/SPARK-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14039495#comment-14039495 ] Erik Erlandson commented on SPARK-1856: --- How does this jira relate (if at all) to the MLI project, part of whose purpose is (or was) to provide a unified and pluggable type system for machine learning models, their inputs, parameters and training? http://www.cs.berkeley.edu/~ameet/mlbase_website/mlbase_website/publications.html > Standardize MLlib interfaces > > > Key: SPARK-1856 > URL: https://issues.apache.org/jira/browse/SPARK-1856 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Critical > Fix For: 1.1.0 > > > Instead of expanding MLlib based on the current class naming scheme > (ProblemWithAlgorithm), we should standardize MLlib's interfaces that > clearly separate datasets, formulations, algorithms, parameter sets, and > models. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2225) Turn HAVING without GROUP BY into WHERE
[ https://issues.apache.org/jira/browse/SPARK-2225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-2225. Resolution: Fixed Fix Version/s: 1.1.0 1.0.1 Assignee: Reynold Xin (was: William Benton) > Turn HAVING without GROUP BY into WHERE > --- > > Key: SPARK-2225 > URL: https://issues.apache.org/jira/browse/SPARK-2225 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.0.0 >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 1.0.1, 1.1.0 > > > See http://msdn.microsoft.com/en-US/library/8hhs5f4e(v=vs.80).aspx > The HAVING clause specifies conditions that determines the groups included in > the query. If the SQL SELECT statement does not contain aggregate functions, > you can use a SQL SELECT statement that contains a HAVING clause without a > GROUP BY clause. -- This message was sent by Atlassian JIRA (v6.2#6252)
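The rewrite that SPARK-2225 implements — a HAVING with no GROUP BY (and no aggregates) is semantically just a WHERE — can be illustrated on a toy plan tree. This is not the Catalyst code; the node types are invented for illustration:

```scala
// Toy logical plan: a HAVING node with an empty GROUP BY list is
// rewritten into a plain Filter (i.e. WHERE); grouped HAVINGs are kept.
sealed trait Plan
case class Scan(table: String) extends Plan
case class Filter(cond: String, child: Plan) extends Plan // WHERE
case class Having(cond: String, groupBy: Seq[String], child: Plan) extends Plan

def rewrite(p: Plan): Plan = p match {
  case Having(cond, Nil, child)  => Filter(cond, rewrite(child)) // the fix
  case Having(cond, keys, child) => Having(cond, keys, rewrite(child))
  case Filter(cond, child)       => Filter(cond, rewrite(child))
  case s: Scan                   => s
}
```

So `SELECT * FROM t HAVING x > 1` plans the same as `SELECT * FROM t WHERE x > 1`, matching the SQL Server semantics cited in the issue.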
[jira] [Updated] (SPARK-2180) Basic HAVING clauses support for HiveQL
[ https://issues.apache.org/jira/browse/SPARK-2180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-2180: --- Assignee: William Benton > Basic HAVING clauses support for HiveQL > --- > > Key: SPARK-2180 > URL: https://issues.apache.org/jira/browse/SPARK-2180 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.0.0 >Reporter: William Benton >Assignee: William Benton >Priority: Minor > > The HiveQL implementation doesn't support HAVING clauses for aggregations. > This prevents some of the TPCDS benchmarks from running. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-704) ConnectionManager sometimes cannot detect loss of sending connections
[ https://issues.apache.org/jira/browse/SPARK-704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14039396#comment-14039396 ] Charles Reiss commented on SPARK-704: - It's been a while since I reported this issue, so it may have been incidentally fixed. But this problem was with a remote node failure _after_ a message (or several messages) was successfully sent to that node but before a response was received. So, there would be no message to send to trigger a failing attempt to write to the channel. If there's a corresponding ReceivingConnection, then the remote node death would be detected via a failed read, but I believe the code in ConnectionManager#removeConnection would not reliably trigger the MessageStatuses. > ConnectionManager sometimes cannot detect loss of sending connections > - > > Key: SPARK-704 > URL: https://issues.apache.org/jira/browse/SPARK-704 > Project: Spark > Issue Type: Bug >Reporter: Charles Reiss >Assignee: Henry Saputra > > ConnectionManager currently does not detect when SendingConnections > disconnect except if it is trying to send through them. As a result, a node > failure just after a connection is initiated but before any acknowledgement > messages can be sent may result in a hang. > ConnectionManager has code intended to detect this case by detecting the > failure of a corresponding ReceivingConnection, but this code assumes that > the remote host:port of the ReceivingConnection is the same as the > ConnectionManagerId, which is almost never true. Additionally, there does not > appear to be any reason to assume a corresponding ReceivingConnection will > exist. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2225) Turn HAVING without GROUP BY into WHERE
[ https://issues.apache.org/jira/browse/SPARK-2225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14039365#comment-14039365 ] Reynold Xin commented on SPARK-2225: Yea I don't think we need to follow the negative tests (we should make sure we follow the positive tests). I will submit a PR and if you can review that it'd be great. > Turn HAVING without GROUP BY into WHERE > --- > > Key: SPARK-2225 > URL: https://issues.apache.org/jira/browse/SPARK-2225 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.0.0 >Reporter: Reynold Xin >Assignee: William Benton > > See http://msdn.microsoft.com/en-US/library/8hhs5f4e(v=vs.80).aspx > The HAVING clause specifies conditions that determines the groups included in > the query. If the SQL SELECT statement does not contain aggregate functions, > you can use a SQL SELECT statement that contains a HAVING clause without a > GROUP BY clause. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2225) Turn HAVING without GROUP BY into WHERE
[ https://issues.apache.org/jira/browse/SPARK-2225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-2225: --- Assignee: William Benton > Turn HAVING without GROUP BY into WHERE > --- > > Key: SPARK-2225 > URL: https://issues.apache.org/jira/browse/SPARK-2225 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.0.0 >Reporter: Reynold Xin >Assignee: William Benton > > See http://msdn.microsoft.com/en-US/library/8hhs5f4e(v=vs.80).aspx > The HAVING clause specifies conditions that determines the groups included in > the query. If the SQL SELECT statement does not contain aggregate functions, > you can use a SQL SELECT statement that contains a HAVING clause without a > GROUP BY clause. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2226) HAVING should be able to contain aggregate expressions that don't appear in the aggregation list.
[ https://issues.apache.org/jira/browse/SPARK-2226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-2226: --- Assignee: William Benton (was: Reynold Xin) > HAVING should be able to contain aggregate expressions that don't appear in > the aggregation list. > -- > > Key: SPARK-2226 > URL: https://issues.apache.org/jira/browse/SPARK-2226 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.0.0 >Reporter: Reynold Xin >Assignee: William Benton > > https://github.com/apache/hive/blob/trunk/ql/src/test/queries/clientpositive/having.q > This test file contains the following query: > {code} > SELECT key FROM src GROUP BY key HAVING max(value) > "val_255"; > {code} > Once we fixed this issue, we should whitelist having.q. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2226) HAVING should be able to contain aggregate expressions that don't appear in the aggregation list.
[ https://issues.apache.org/jira/browse/SPARK-2226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-2226: --- Assignee: William Benton (was: Reynold Xin) > HAVING should be able to contain aggregate expressions that don't appear in > the aggregation list. > -- > > Key: SPARK-2226 > URL: https://issues.apache.org/jira/browse/SPARK-2226 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.0.0 >Reporter: Reynold Xin >Assignee: William Benton > > https://github.com/apache/hive/blob/trunk/ql/src/test/queries/clientpositive/having.q > This test file contains the following query: > {code} > SELECT key FROM src GROUP BY key HAVING max(value) > "val_255"; > {code} > Once we fixed this issue, we should whitelist having.q. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (SPARK-2226) HAVING should be able to contain aggregate expressions that don't appear in the aggregation list.
[ https://issues.apache.org/jira/browse/SPARK-2226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin reassigned SPARK-2226: -- Assignee: Reynold Xin (was: William Benton) > HAVING should be able to contain aggregate expressions that don't appear in > the aggregation list. > -- > > Key: SPARK-2226 > URL: https://issues.apache.org/jira/browse/SPARK-2226 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.0.0 >Reporter: Reynold Xin >Assignee: Reynold Xin > > https://github.com/apache/hive/blob/trunk/ql/src/test/queries/clientpositive/having.q > This test file contains the following query: > {code} > SELECT key FROM src GROUP BY key HAVING max(value) > "val_255"; > {code} > Once we fixed this issue, we should whitelist having.q. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2198) Partition the scala build file so that it is easier to maintain
[ https://issues.apache.org/jira/browse/SPARK-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14039355#comment-14039355 ] Anant Daksh Asthana commented on SPARK-2198: I am in agreement with Helena. > Partition the scala build file so that it is easier to maintain > --- > > Key: SPARK-2198 > URL: https://issues.apache.org/jira/browse/SPARK-2198 > Project: Spark > Issue Type: Task > Components: Build >Reporter: Helena Edelson >Priority: Minor > Original Estimate: 3h > Remaining Estimate: 3h > > Partition to standard Dependencies, Version, Settings, Publish.scala. keeping > the SparkBuild clean to describe the modules and their deps so that changes > in versions, for example, need only be made in Version.scala, settings > changes such as in scalac in Settings.scala, etc. > I'd be happy to do this ([~helena_e]) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2226) HAVING should be able to contain aggregate expressions that don't appear in the aggregation list.
[ https://issues.apache.org/jira/browse/SPARK-2226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-2226: --- Assignee: Reynold Xin > HAVING should be able to contain aggregate expressions that don't appear in > the aggregation list. > -- > > Key: SPARK-2226 > URL: https://issues.apache.org/jira/browse/SPARK-2226 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.0.0 >Reporter: Reynold Xin >Assignee: Reynold Xin > > https://github.com/apache/hive/blob/trunk/ql/src/test/queries/clientpositive/having.q > This test file contains the following query: > {code} > SELECT key FROM src GROUP BY key HAVING max(value) > "val_255"; > {code} > Once we fixed this issue, we should whitelist having.q. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2180) Basic HAVING clauses support for HiveQL
[ https://issues.apache.org/jira/browse/SPARK-2180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14039344#comment-14039344 ] Reynold Xin commented on SPARK-2180: [~willbenton] I added two related issues. Let me know if you would like to work on them. Thanks! > Basic HAVING clauses support for HiveQL > --- > > Key: SPARK-2180 > URL: https://issues.apache.org/jira/browse/SPARK-2180 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.0.0 >Reporter: William Benton >Priority: Minor > > The HiveQL implementation doesn't support HAVING clauses for aggregations. > This prevents some of the TPCDS benchmarks from running. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2226) HAVING should be able to contain aggregate expressions that don't appear in the aggregation list.
Reynold Xin created SPARK-2226: -- Summary: HAVING should be able to contain aggregate expressions that don't appear in the aggregation list. Key: SPARK-2226 URL: https://issues.apache.org/jira/browse/SPARK-2226 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.0.0 Reporter: Reynold Xin https://github.com/apache/hive/blob/trunk/ql/src/test/queries/clientpositive/having.q This test file contains the following query: {code} SELECT key FROM src GROUP BY key HAVING max(value) > "val_255"; {code} Once we fixed this issue, we should whitelist having.q. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2225) Turn HAVING without GROUP BY into WHERE
[ https://issues.apache.org/jira/browse/SPARK-2225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14039335#comment-14039335 ] William Benton commented on SPARK-2225: --- So the Hive test suite treats HAVING without GROUP BY as an error (see having1.q), although it is accepted by some dialects (like SQL Server). I'm happy to take this, though, if this is what we want to do here. > Turn HAVING without GROUP BY into WHERE > --- > > Key: SPARK-2225 > URL: https://issues.apache.org/jira/browse/SPARK-2225 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.0.0 >Reporter: Reynold Xin > > See http://msdn.microsoft.com/en-US/library/8hhs5f4e(v=vs.80).aspx > The HAVING clause specifies conditions that determines the groups included in > the query. If the SQL SELECT statement does not contain aggregate functions, > you can use a SQL SELECT statement that contains a HAVING clause without a > GROUP BY clause. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2225) Turn HAVING without GROUP BY into WHERE
Reynold Xin created SPARK-2225: -- Summary: Turn HAVING without GROUP BY into WHERE Key: SPARK-2225 URL: https://issues.apache.org/jira/browse/SPARK-2225 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.0.0 Reporter: Reynold Xin See http://msdn.microsoft.com/en-US/library/8hhs5f4e(v=vs.80).aspx The HAVING clause specifies conditions that determines the groups included in the query. If the SQL SELECT statement does not contain aggregate functions, you can use a SQL SELECT statement that contains a HAVING clause without a GROUP BY clause. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2180) Basic HAVING clauses support for HiveQL
[ https://issues.apache.org/jira/browse/SPARK-2180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-2180: --- Summary: Basic HAVING clauses support for HiveQL (was: HiveQL doesn't support GROUP BY with HAVING clauses) > Basic HAVING clauses support for HiveQL > --- > > Key: SPARK-2180 > URL: https://issues.apache.org/jira/browse/SPARK-2180 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.0.0 >Reporter: William Benton >Priority: Minor > > The HiveQL implementation doesn't support HAVING clauses for aggregations. > This prevents some of the TPCDS benchmarks from running. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1406) PMML model evaluation support via MLib
[ https://issues.apache.org/jira/browse/SPARK-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14039256#comment-14039256 ] Erik Erlandson commented on SPARK-1406: --- How about a PMML pickler/unpickler, written as an extension to: https://github.com/scala/pickling > PMML model evaluation support via MLib > -- > > Key: SPARK-1406 > URL: https://issues.apache.org/jira/browse/SPARK-1406 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Thomas Darimont > > It would be useful if Spark provided support for the evaluation of PMML > models (http://www.dmg.org/v4-2/GeneralStructure.html). > This would allow the use of analytical models that were created with a > statistical modeling tool like R, SAS, SPSS, etc. with Spark (MLib), which > would perform the actual model evaluation for a given input tuple. The PMML > model would then just contain the "parameterization" of an analytical model. > Other projects like JPMML-Evaluator do a similar thing. > https://github.com/jpmml/jpmml/tree/master/pmml-evaluator -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2089) With YARN, preferredNodeLocalityData isn't honored
[ https://issues.apache.org/jira/browse/SPARK-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14039236#comment-14039236 ] Mridul Muralidharan commented on SPARK-2089: [~pwendell] SplitInfo is not from Hadoop - it gives locality preferences in the context of Spark (see org.apache.spark.scheduler.SplitInfo) in a reasonably API-agnostic way. The default support provided for it is Hadoop-specific, based on dfs blocks - but I don't think there is anything stopping us from expressing other forms (either already currently or with minor modifications as applicable). We actually use that API very heavily - moving 10s or 100s of TB of data tends to be fairly expensive :-) Since we are still stuck on 0.9 + changes, we have not yet faced this issue, so great to see this being addressed. > With YARN, preferredNodeLocalityData isn't honored > --- > > Key: SPARK-2089 > URL: https://issues.apache.org/jira/browse/SPARK-2089 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.0.0 >Reporter: Sandy Ryza >Assignee: Sandy Ryza >Priority: Critical > > When running in YARN cluster mode, apps can pass preferred locality data when > constructing a Spark context that will dictate where to request executor > containers. > This is currently broken because of a race condition. The Spark-YARN code > runs the user class and waits for it to start up a SparkContext. During its > initialization, the SparkContext will create a YarnClusterScheduler, which > notifies a monitor in the Spark-YARN code that it has started. The Spark-YARN code then > immediately fetches the preferredNodeLocationData from the SparkContext and > uses it to start requesting containers. > But in the SparkContext constructor that takes the preferredNodeLocationData, > setting preferredNodeLocationData comes after the rest of the initialization, > so, if the Spark-YARN code comes around quickly enough after being notified, > the data that's fetched is the empty unset version. 
This occurred during all > of my runs. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2223) Building and running tests with maven is extremely slow
[ https://issues.apache.org/jira/browse/SPARK-2223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14039217#comment-14039217 ] Mridul Muralidharan commented on SPARK-2223: [~tgraves] You could try running zinc - it speeds up the maven build quite a bit. I find that I need to shut it down and restart it at times though .. but otherwise, it works fine. > Building and running tests with maven is extremely slow > --- > > Key: SPARK-2223 > URL: https://issues.apache.org/jira/browse/SPARK-2223 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 1.0.0 >Reporter: Thomas Graves > > For some reason using maven with Spark is extremely slow. Building and > running tests takes way longer than other projects I have used that use > maven. We should investigate to see why. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1528) Spark on Yarn: Add option for user to specify additional namenodes to get tokens from
[ https://issues.apache.org/jira/browse/SPARK-1528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14039186#comment-14039186 ] Thomas Graves commented on SPARK-1528: -- https://github.com/apache/spark/pull/1159 > Spark on Yarn: Add option for user to specify additional namenodes to get > tokens from > - > > Key: SPARK-1528 > URL: https://issues.apache.org/jira/browse/SPARK-1528 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 1.0.0 >Reporter: Thomas Graves >Assignee: Thomas Graves > > Some users running spark on yarn may wish to contact other HDFS clusters than > the one they are running on. We should add an option for them to specify > those namenodes so that we can get the credentials needed for the application > to contact them. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2204) Scheduler for Mesos in fine-grained mode launches tasks on wrong executors
[ https://issues.apache.org/jira/browse/SPARK-2204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastien Rainville updated SPARK-2204: --- Description: MesosSchedulerBackend.resourceOffers(SchedulerDriver, List[Offer]) is assuming that TaskSchedulerImpl.resourceOffers(Seq[WorkerOffer]) is returning task lists in the same order as the offers it was passed, but in the current implementation TaskSchedulerImpl.resourceOffers shuffles the offers to avoid assigning the tasks always to the same executors. The result is that the tasks are launched on the wrong executors. (was: MesosSchedulerBackend.resourceOffers(SchedulerDriver, List[Offer]) is assuming that TaskSchedulerImpl.resourceOffers(Seq[WorkerOffer]) is returning task lists in the same order as the offers it was passed, but in the current implementation TaskSchedulerImpl.resourceOffers shuffles the offers to avoid assigning the tasks always to the same executors. The result is that the tasks are launched on random executors.) > Scheduler for Mesos in fine-grained mode launches tasks on wrong executors > -- > > Key: SPARK-2204 > URL: https://issues.apache.org/jira/browse/SPARK-2204 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 1.0.0 >Reporter: Sebastien Rainville >Priority: Blocker > > MesosSchedulerBackend.resourceOffers(SchedulerDriver, List[Offer]) is > assuming that TaskSchedulerImpl.resourceOffers(Seq[WorkerOffer]) is returning > task lists in the same order as the offers it was passed, but in the current > implementation TaskSchedulerImpl.resourceOffers shuffles the offers to avoid > assigning the tasks always to the same executors. The result is that the > tasks are launched on the wrong executors. -- This message was sent by Atlassian JIRA (v6.2#6252)
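The ordering mismatch described above can be reproduced in a few lines of plain Scala, with no Spark or Mesos dependency; `Offer`, `scheduledTasks`, and `launch` below are simplified stand-ins for the real classes and methods, not Spark's actual API.

```scala
object OfferOrderBug {
  case class Offer(executorId: String)

  // Stand-in for TaskSchedulerImpl.resourceOffers: it sees a *shuffled* copy
  // of the offers and returns one task list per offer, in the shuffled order.
  def scheduledTasks(shuffled: Seq[Offer]): Seq[String] =
    shuffled.map(o => s"task-for-${o.executorId}")

  // Stand-in for the Mesos backend, which pairs task lists with its own,
  // unshuffled view of the offers by index.
  def launch(offers: Seq[Offer], tasks: Seq[String]): Seq[(String, String)] =
    offers.map(_.executorId).zip(tasks)
}
```

With offers `e1, e2` and the scheduler seeing them shuffled as `e2, e1`, `launch` pairs `e1` with `task-for-e2` — exactly the mismatch the ticket describes.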
[jira] [Assigned] (SPARK-1528) Spark on Yarn: Add option for user to specify additional namenodes to get tokens from
[ https://issues.apache.org/jira/browse/SPARK-1528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves reassigned SPARK-1528: Assignee: Thomas Graves > Spark on Yarn: Add option for user to specify additional namenodes to get > tokens from > - > > Key: SPARK-1528 > URL: https://issues.apache.org/jira/browse/SPARK-1528 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 1.0.0 >Reporter: Thomas Graves >Assignee: Thomas Graves > > Some users running spark on yarn may wish to contact other HDFS clusters than > the one they are running on. We should add an option for them to specify > those namenodes so that we can get the credentials needed for the application > to contact them. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-1868) Users should be allowed to cogroup at least 4 RDDs
[ https://issues.apache.org/jira/browse/SPARK-1868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-1868. Resolution: Fixed Assignee: Allan Douglas R. de Oliveira Fixed by this PR: https://github.com/apache/spark/pull/813 > Users should be allowed to cogroup at least 4 RDDs > -- > > Key: SPARK-1868 > URL: https://issues.apache.org/jira/browse/SPARK-1868 > Project: Spark > Issue Type: Improvement > Components: Java API, Spark Core >Reporter: Allan Douglas R. de Oliveira >Assignee: Allan Douglas R. de Oliveira > > cogroup client api currently allows up to 3 RDDs to be cogrouped. > It's convenient to allow more than this as cogroup is a very fundamental > operation and in the real word we need to group many RDDs together. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1868) Users should be allowed to cogroup at least 4 RDDs
[ https://issues.apache.org/jira/browse/SPARK-1868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-1868: --- Fix Version/s: 1.1.0 > Users should be allowed to cogroup at least 4 RDDs > -- > > Key: SPARK-1868 > URL: https://issues.apache.org/jira/browse/SPARK-1868 > Project: Spark > Issue Type: Improvement > Components: Java API, Spark Core >Reporter: Allan Douglas R. de Oliveira >Assignee: Allan Douglas R. de Oliveira > Fix For: 1.1.0 > > > cogroup client api currently allows up to 3 RDDs to be cogrouped. > It's convenient to allow more than this as cogroup is a very fundamental > operation and in the real word we need to group many RDDs together. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2224) allow running tests for one sub module
[ https://issues.apache.org/jira/browse/SPARK-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14039081#comment-14039081 ] Thomas Graves commented on SPARK-2224: -- Well, that worked in the sense that it actually ran in the yarn submodule; unfortunately, it didn't actually run any tests (Tests: succeeded 0, failed 0, canceled 0, ignored 0, pending 0). Closing this one since it works for other modules; I'll open another one if needed for yarn. > allow running tests for one sub module > -- > > Key: SPARK-2224 > URL: https://issues.apache.org/jira/browse/SPARK-2224 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 1.0.0 >Reporter: Thomas Graves > > We should have a way to run just the unit tests in a submodule (like core or > yarn, etc.). > One way would be to support changing directories into the submodule and > running mvn test from there. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2224) allow running tests for one sub module
[ https://issues.apache.org/jira/browse/SPARK-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves resolved SPARK-2224. -- Resolution: Invalid > allow running tests for one sub module > -- > > Key: SPARK-2224 > URL: https://issues.apache.org/jira/browse/SPARK-2224 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 1.0.0 >Reporter: Thomas Graves > > We should have a way to run just the unit tests in a submodule (like core or > yarn, etc.). > One way would be to support changing directories into the submodule and > running mvn test from there. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2223) Building and running tests with maven is extremely slow
[ https://issues.apache.org/jira/browse/SPARK-2223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14039076#comment-14039076 ] Thomas Graves commented on SPARK-2223: -- [~srowen] do you know how long it takes you to do a from-scratch build (just as a comparison)? Maybe there is something messed up in our infrastructure if it isn't slow for others. > Building and running tests with maven is extremely slow > --- > > Key: SPARK-2223 > URL: https://issues.apache.org/jira/browse/SPARK-2223 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 1.0.0 >Reporter: Thomas Graves > > For some reason using maven with Spark is extremely slow. Building and > running tests takes way longer than other projects I have used that use > maven. We should investigate to see why. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2202) saveAsTextFile hangs on final 2 tasks
[ https://issues.apache.org/jira/browse/SPARK-2202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Suren Hiraman updated SPARK-2202: - Attachment: spark_trace.2.txt spark_trace.1.txt Here are two stack traces from the same server. This time around, we ended up with 5 stuck threads on 2 different servers. As a reminder, this is with re-enabling the various custom settings listed above. > saveAsTextFile hangs on final 2 tasks > - > > Key: SPARK-2202 > URL: https://issues.apache.org/jira/browse/SPARK-2202 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.0 > Environment: CentOS 5.7 > 16 nodes, 24 cores per node, 14g RAM per executor >Reporter: Suren Hiraman > Attachments: spark_trace.1.txt, spark_trace.2.txt > > > I have a flow that takes in about 10 GB of data and writes out about 10 GB of > data. > The final step is saveAsTextFile() to HDFS. This seems to hang on 2 remaining > tasks, always on the same node. > It seems that the 2 tasks are waiting for data from a remote task/RDD > partition. > After about 2 hours or so, the stuck tasks get a closed connection exception > and you can see the remote side logging that as well. Log lines are below. 
> My custom settings are: > conf.set("spark.executor.memory", "14g") // TODO make this > configurable > > // shuffle configs > conf.set("spark.default.parallelism", "320") > conf.set("spark.shuffle.file.buffer.kb", "200") > conf.set("spark.reducer.maxMbInFlight", "96") > > conf.set("spark.rdd.compress","true") > > conf.set("spark.worker.timeout","180") > > // akka settings > conf.set("spark.akka.threads", "300") > conf.set("spark.akka.timeout", "180") > conf.set("spark.akka.frameSize", "100") > conf.set("spark.akka.batchSize", "30") > conf.set("spark.akka.askTimeout", "30") > > // block manager > conf.set("spark.storage.blockManagerTimeoutIntervalMs", "18") > conf.set("spark.blockManagerHeartBeatMs", "8") > "STUCK" WORKER > 14/06/18 19:41:18 WARN network.ReceivingConnection: Error reading from > connection to ConnectionManagerId(172.16.25.103,57626) > java.io.IOException: Connection reset by peer > at sun.nio.ch.FileDispatcher.read0(Native Method) > at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) > at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:251) > at sun.nio.ch.IOUtil.read(IOUtil.java:224) > at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:254) > at org.apache.spark.network.ReceivingConnection.read(Connection.scala:496) > REMOTE WORKER > 14/06/18 19:41:18 INFO network.ConnectionManager: Removing > ReceivingConnection to ConnectionManagerId(172.16.25.124,55610) > 14/06/18 19:41:18 ERROR network.ConnectionManager: Corresponding > SendingConnectionManagerId not found -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1882) Support dynamic memory sharing in Mesos
[ https://issues.apache.org/jira/browse/SPARK-1882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14038987#comment-14038987 ] Andrew Ash commented on SPARK-1882: --- Yeah, for homogeneous environments, I think you can get full utilization of both CPU and memory across the cluster. It could work like this: Suppose each machine has 16 cores and 256GB memory, which is 16GB per core. Leave {{spark.task.cpus}} at the default of 1 and set {{spark.executor.memory}} to 16GB. Now each task launched grabs one core and 16GB. Once they're all taken on a machine, it's fully maxed out in both memory and CPU. But if we have heterogeneous machines with different CPU:memory ratios, I think we'd be in trouble. We couldn't pick a ratio that fully utilizes all machines, so we'd have either under-utilized CPUs or under-utilized memory for machines with low CPU:memory vs high CPU:memory ratios, respectively. The suggestion of un-coupling cores and memory is a good one -- if each task accepted an amount of memory proportional to the remaining memory on the machine, then I think you could get good utilization even across heterogeneous environments > Support dynamic memory sharing in Mesos > --- > > Key: SPARK-1882 > URL: https://issues.apache.org/jira/browse/SPARK-1882 > Project: Spark > Issue Type: Improvement > Components: Mesos >Affects Versions: 1.0.0 >Reporter: Andrew Ash > > Fine grained mode Mesos currently supports sharing CPUs very well, but > requires that memory be pre-partitioned according to the executor memory > parameter. Mesos supports dynamic memory allocation in addition to dynamic > CPU allocation, so we should utilize this feature in Spark. > See below where when the Mesos backend accepts a resource offer it only > checks that there's enough memory to cover sc.executorMemory, and doesn't > ever take a fraction of the memory available. The memory offer is accepted > all or nothing from a pre-defined parameter. 
> Coarse mode: > https://github.com/apache/spark/blob/3ce526b168050c572a1feee8e0121e1426f7d9ee/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/CoarseMesosSchedulerBackend.scala#L208 > Fine mode: > https://github.com/apache/spark/blob/a5150d199ca97ab2992bc2bb221a3ebf3d3450ba/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerBackend.scala#L114 -- This message was sent by Atlassian JIRA (v6.2#6252)
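The utilization argument above can be checked with some simple arithmetic; the numbers (1 core and 16 GB per task) are the ones from the comment, and `pack` is an illustration of the packing math rather than anything in the Mesos backend.

```scala
object MesosUtilization {
  /** Tasks that fit on one machine, plus the stranded cores and stranded
    * memory (GB) left over, for a fixed per-task resource grant. */
  def pack(cores: Int, memGb: Int, taskCores: Int, taskMemGb: Int): (Int, Int, Int) = {
    val tasks = math.min(cores / taskCores, memGb / taskMemGb)
    (tasks, cores - tasks * taskCores, memGb - tasks * taskMemGb)
  }
}
```

A 16-core/256 GB machine packs 16 tasks with nothing stranded, while at the same 1-core/16 GB grant a 16-core/128 GB machine strands 8 cores and an 8-core/256 GB machine strands 128 GB — the heterogeneous case the comment warns about.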
[jira] [Commented] (SPARK-2224) allow running tests for one sub module
[ https://issues.apache.org/jira/browse/SPARK-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14038981#comment-14038981 ] Thomas Graves commented on SPARK-2224: -- Oh, I bet I only ran with package and not install. I'll try that, and if it works we can close this. > allow running tests for one sub module > -- > > Key: SPARK-2224 > URL: https://issues.apache.org/jira/browse/SPARK-2224 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 1.0.0 >Reporter: Thomas Graves > > We should have a way to run just the unit tests in a submodule (like core or > yarn, etc.). > One way would be to support changing directories into the submodule and > running mvn test from there. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2223) Building and running tests with maven is extremely slow
[ https://issues.apache.org/jira/browse/SPARK-2223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14038976#comment-14038976 ] Thomas Graves commented on SPARK-2223: -- Some of our nightly builds are taking 50 minutes without running the tests. Many of them do update to the latest, but it's still not a totally from-scratch build. I also just tried to run a single test inside of yarn, and it took over 20 minutes to even get to that module to try to build it, and this was after already having built with -DskipTests. When I do this a second time it's very quick, but the first time took forever. > Building and running tests with maven is extremely slow > --- > > Key: SPARK-2223 > URL: https://issues.apache.org/jira/browse/SPARK-2223 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 1.0.0 >Reporter: Thomas Graves > > For some reason using maven with Spark is extremely slow. Building and > running tests takes way longer than other projects I have used that use > maven. We should investigate to see why. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2224) allow running tests for one sub module
[ https://issues.apache.org/jira/browse/SPARK-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14038974#comment-14038974 ] Sean Owen commented on SPARK-2224: -- Ah yes, you still need to have built the packages: run "mvn -DskipTests install" first. > allow running tests for one sub module > -- > > Key: SPARK-2224 > URL: https://issues.apache.org/jira/browse/SPARK-2224 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 1.0.0 >Reporter: Thomas Graves > > We should have a way to run just the unit tests in a submodule (like core or > yarn, etc.). > One way would be to support changing directories into the submodule and > running mvn test from there. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2224) allow running tests for one sub module
[ https://issues.apache.org/jira/browse/SPARK-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14038971#comment-14038971 ] Thomas Graves commented on SPARK-2224: -- Thanks, I wasn't aware of -pl. It doesn't seem to work for yarn though; I get this error: [WARNING] The POM for org.apache.spark:spark-core_2.10:jar:1.1.0-SNAPSHOT is missing, no dependency information available. -DwildcardSuites still loops through all the modules. > allow running tests for one sub module > -- > > Key: SPARK-2224 > URL: https://issues.apache.org/jira/browse/SPARK-2224 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 1.0.0 >Reporter: Thomas Graves > > We should have a way to run just the unit tests in a submodule (like core or > yarn, etc.). > One way would be to support changing directories into the submodule and > running mvn test from there. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2224) allow running tests for one sub module
[ https://issues.apache.org/jira/browse/SPARK-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14038964#comment-14038964 ] Guoqiang Li commented on SPARK-2224: {{mvn -Phadoop-2.3 -Pyarn -DwildcardSuites=org.apache.spark.deploy.yarn.* test}} ? > allow running tests for one sub module > -- > > Key: SPARK-2224 > URL: https://issues.apache.org/jira/browse/SPARK-2224 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 1.0.0 >Reporter: Thomas Graves > > We should have a way to run just the unit tests in a submodule (like core or > yarn, etc.). > One way would be to support changing directories into the submodule and > running mvn test from there. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2224) allow running tests for one sub module
[ https://issues.apache.org/jira/browse/SPARK-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14038956#comment-14038956 ] Sean Owen commented on SPARK-2224: -- mvn test -pl [module], no? > allow running tests for one sub module > -- > > Key: SPARK-2224 > URL: https://issues.apache.org/jira/browse/SPARK-2224 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 1.0.0 >Reporter: Thomas Graves > > We should have a way to run just the unit tests in a submodule (like core or > yarn, etc.). > One way would be to support changing directories into the submodule and > running mvn test from there. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2223) Building and running tests with maven is extremely slow
[ https://issues.apache.org/jira/browse/SPARK-2223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14038944#comment-14038944 ] Sean Owen commented on SPARK-2223: -- Is it not just because there are a lot of tests, and they generally can't be run in parallel? I agree, an hour for tests is really long, but I'm not sure if there is a problem except having lots of tests. You can run subsets of tests with Maven though to test targeted changes. > Building and running tests with maven is extremely slow > --- > > Key: SPARK-2223 > URL: https://issues.apache.org/jira/browse/SPARK-2223 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 1.0.0 >Reporter: Thomas Graves > > For some reason using maven with Spark is extremely slow. Building and > running tests takes way longer than other projects I have used that use > maven. We should investigate to see why. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2224) allow running tests for one sub module
Thomas Graves created SPARK-2224: Summary: allow running tests for one sub module Key: SPARK-2224 URL: https://issues.apache.org/jira/browse/SPARK-2224 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 1.0.0 Reporter: Thomas Graves We should have a way to run just the unit tests in a submodule (like core or yarn, etc.). One way would be to support changing directories into the submodule and running mvn test from there. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2163) Set ``setConvergenceTol'' with a parameter of type Double instead of Int
[ https://issues.apache.org/jira/browse/SPARK-2163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-2163: - Fix Version/s: 1.1.0 1.0.1 > Set ``setConvergenceTol'' with a parameter of type Double instead of Int > > > Key: SPARK-2163 > URL: https://issues.apache.org/jira/browse/SPARK-2163 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.0.0 >Reporter: Gang Bai > Fix For: 1.0.1, 1.1.0 > > > The class LBFGS in mllib.optimization currently provides a > {{setConvergenceTol(tolerance: Int)}} method for setting the convergence > tolerance. The tolerance parameter is of type {{Int}}. The specified > tolerance is then used as parameter in calling {{LBFGS.runLBFGS}}, where the > parameter {{convergenceTol}} is of type {{Double}}. > The Int parameter may cause problem when one creates an optimizer and sets a > Double-valued tolerance. e.g: > {code:borderStyle=solid} > override val optimizer = new LBFGS(gradient, updater) > .setNumCorrections(9) > .setConvergenceTol(1e-4) // *type mismatch here* > .setMaxNumIterations(100) > .setRegParam(1.0) > {code} > IMHO there is no need to make the tolerance of type Int. Let's change it into > a Double parameter and eliminate the type mismatch problem. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2223) Building and running tests with maven is extremely slow
Thomas Graves created SPARK-2223: Summary: Building and running tests with maven is extremely slow Key: SPARK-2223 URL: https://issues.apache.org/jira/browse/SPARK-2223 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.0.0 Reporter: Thomas Graves For some reason using maven with Spark is extremely slow. Building and running tests takes way longer than other projects I have used that use maven. We should investigate to see why. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1493) Apache RAT excludes don't work with file path (instead of file name)
[ https://issues.apache.org/jira/browse/SPARK-1493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14038921#comment-14038921 ] Erik Erlandson commented on SPARK-1493: --- I submitted a proposal patch for RAT-161, which allows one to request path-spanning patterns by including a leading '/'. If the '--dir' argument is /path/to/repo, and the contents of the '-E' file include: /subpath/to/.*ext then the pattern induced is: /path/to/repo + /subpath/to/.*ext --> /path/to/repo/subpath/to/.*ext > Apache RAT excludes don't work with file path (instead of file name) > > > Key: SPARK-1493 > URL: https://issues.apache.org/jira/browse/SPARK-1493 > Project: Spark > Issue Type: Bug > Components: Project Infra >Reporter: Patrick Wendell > Labels: starter > > Right now the way we do RAT checks, it doesn't work if you try to exclude: > /path/to/file.ext > you have to just exclude > file.ext -- This message was sent by Atlassian JIRA (v6.2#6252)
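The joining rule described in the comment above — a leading '/' anchors the exclude entry to the '--dir' root — can be sketched in plain Scala. This is an illustration of the proposed RAT-161 behavior, not the actual RAT code, and the ".*/" prefix used for bare file names is an assumption about how name-only matching could be expressed.

```scala
object RatExcludes {
  /** Turn an exclude entry into a full pattern: entries with a leading '/'
    * are anchored to the --dir root; others match the file name anywhere
    * (hypothetical encoding, for illustration only). */
  def toPattern(dir: String, entry: String): String =
    if (entry.startsWith("/")) dir + entry else ".*/" + entry
}
```

So `toPattern("/path/to/repo", "/subpath/to/.*ext")` yields the anchored pattern from the comment, while a bare `file.ext` entry still matches anywhere in the tree.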
[jira] [Created] (SPARK-2222) Add multiclass evaluation metrics
Alexander Ulanov created SPARK-2222: --- Summary: Add multiclass evaluation metrics Key: SPARK-2222 URL: https://issues.apache.org/jira/browse/SPARK-2222 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.0.0 Reporter: Alexander Ulanov There is no class in Spark MLlib for measuring the performance of multiclass classifiers. This task involves adding such a class and unit tests. The following measures are to be implemented: per class, micro averaged and weighted averaged Precision, Recall and F1-Measure. -- This message was sent by Atlassian JIRA (v6.2#6252)
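The per-class measures listed above can be sketched in a few lines of plain Scala; this only illustrates the definitions (precision, recall, F1 from (prediction, label) pairs) and is not the API the proposed MLlib class would expose.

```scala
object MulticlassSketch {
  /** Per-class (precision, recall, F1) computed from (prediction, label) pairs. */
  def perClass(pairs: Seq[(Int, Int)]): Map[Int, (Double, Double, Double)] = {
    val classes = pairs.flatMap(p => Seq(p._1, p._2)).distinct
    classes.map { c =>
      val tp        = pairs.count { case (pred, lab) => pred == c && lab == c }
      val predicted = pairs.count(_._1 == c)  // predicted positives for class c
      val actual    = pairs.count(_._2 == c)  // actual members of class c
      val prec = if (predicted == 0) 0.0 else tp.toDouble / predicted
      val rec  = if (actual == 0) 0.0 else tp.toDouble / actual
      val f1   = if (prec + rec == 0) 0.0 else 2 * prec * rec / (prec + rec)
      c -> (prec, rec, f1)
    }.toMap
  }
}
```

Micro averaging would pool the tp/predicted/actual counts over all classes before dividing; weighted averaging would weight each class's score by its `actual` count.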
[jira] [Created] (SPARK-2221) Spark fails on windows YARN (HortonWorks 2.4)
Vlad Rabenok created SPARK-2221: --- Summary: Spark fails on windows YARN (HortonWorks 2.4) Key: SPARK-2221 URL: https://issues.apache.org/jira/browse/SPARK-2221 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.0.0 Environment: Windows 2008 R2 HortonWorks Hadoop cluster 2.4 Spark 1.0 Reporter: Vlad Rabenok A Windows-based Yarn cluster fails to execute a submitted Spark job. It can be reproduced by submitting JavaSparkPi on a Yarn cluster. C:\hdp\spark-1.0.0-bin-hadoop2\bin>spark-submit --class org.apache.spark.examples.JavaSparkPi ./../lib/spark-examples-1.0.0-hadoop2.2.0.jar --master yarn-cluster --deploy-mode cluster The origin of the problem is org.apache.spark.deploy.yarn.ExecutorRunnableUtil.scala. JavaOpts parameters that contain environment variables need to be quoted. - '%' should be escaped, like "kill %%p", though it still doesn't make sense on Windows. Ideally it should be passed through "spark.executor.extraJavaOptions". - javaOpts += "-Djava.io.tmpdir=\"" + new Path(Environment.PWD.$(), YarnConfiguration.DEFAULT_CONTAINER_TEMP_DIR) + "\"" -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2112) ParquetTypesConverter should not create its own conf
[ https://issues.apache.org/jira/browse/SPARK-2112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14038666#comment-14038666 ] Andre Schumacher commented on SPARK-2112: - Since commit https://github.com/apache/spark/commit/f479cf3743e416ee08e62806e1b34aff5998ac22 the SparkContext's Hadoop configuration should be used when reading metadata from the file source. I wasn't yet able to test this with, say, S3 bucket names. Are the S3 credentials copied from the SparkConf to its Hadoop configuration? If someone could confirm this is working, we could close this issue. > ParquetTypesConverter should not create its own conf > > > Key: SPARK-2112 > URL: https://issues.apache.org/jira/browse/SPARK-2112 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.0.0 >Reporter: Michael Armbrust > > [~adav]: "this actually makes it so that we can't use S3 credentials set in > the SparkContext, or add new FileSystems at runtime, for instance." -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2195) Parquet extraMetadata can contain key information
[ https://issues.apache.org/jira/browse/SPARK-2195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14038661#comment-14038661 ] Andre Schumacher commented on SPARK-2195: - Since commit https://github.com/apache/spark/commit/f479cf3743e416ee08e62806e1b34aff5998ac22 the path is no longer stored in the extraMetadata. So I guess this issue can be closed? > Parquet extraMetadata can contain key information > - > > Key: SPARK-2195 > URL: https://issues.apache.org/jira/browse/SPARK-2195 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.0.0 >Reporter: Michael Armbrust >Priority: Blocker > > {code} > 14/06/19 01:52:05 INFO NewHadoopRDD: Input split: ParquetInputSplit{part: > file:/Users/pat/Projects/spark-summit-training-2014/usb/data/wiki-parquet/part-r-1.parquet > start: 0 length: 24971040 hosts: [localhost] blocks: 1 requestedSchema: same > as file fileSchema: message root { > optional int32 id; > optional binary title; > optional int64 modified; > optional binary text; > optional binary username; > } > extraMetadata: > {org.apache.spark.sql.parquet.row.metadata=StructType(List(StructField(id,IntegerType,true), > StructField(title,StringType,true), StructField(modified,LongType,true), > StructField(text,StringType,true), StructField(username,StringType,true))), > path= MY AWS KEYS!!! } > readSupportMetadata: > {org.apache.spark.sql.parquet.row.metadata=StructType(List(StructField(id,IntegerType,true), > StructField(title,StringType,true), StructField(modified,LongType,true), > StructField(text,StringType,true), StructField(username,StringType,true))), > path= MY AWS KEYS > ***}} > {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2163) Set ``setConvergenceTol'' with a parameter of type Double instead of Int
[ https://issues.apache.org/jira/browse/SPARK-2163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gang Bai resolved SPARK-2163. - Resolution: Implemented > Set ``setConvergenceTol'' with a parameter of type Double instead of Int > > > Key: SPARK-2163 > URL: https://issues.apache.org/jira/browse/SPARK-2163 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.0.0 >Reporter: Gang Bai > > The class LBFGS in mllib.optimization currently provides a > {{setConvergenceTol(tolerance: Int)}} method for setting the convergence > tolerance. The tolerance parameter is of type {{Int}}. The specified > tolerance is then used as parameter in calling {{LBFGS.runLBFGS}}, where the > parameter {{convergenceTol}} is of type {{Double}}. > The Int parameter may cause problem when one creates an optimizer and sets a > Double-valued tolerance. e.g: > {code:borderStyle=solid} > override val optimizer = new LBFGS(gradient, updater) > .setNumCorrections(9) > .setConvergenceTol(1e-4) // *type mismatch here* > .setMaxNumIterations(100) > .setRegParam(1.0) > {code} > IMHO there is no need to make the tolerance of type Int. Let's change it into > a Double parameter and eliminate the type mismatch problem. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2220) Fix remaining Hive Commands
Michael Armbrust created SPARK-2220: --- Summary: Fix remaining Hive Commands Key: SPARK-2220 URL: https://issues.apache.org/jira/browse/SPARK-2220 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Fix For: 1.1.0 None of the following have an execution plan: {code} private[hive] case class DfsCommand(cmd: String) extends Command private[hive] case class ShellCommand(cmd: String) extends Command private[hive] case class SourceCommand(filePath: String) extends Command private[hive] case class AddFile(filePath: String) extends Command {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2219) AddJar doesn't work
Michael Armbrust created SPARK-2219: --- Summary: AddJar doesn't work Key: SPARK-2219 URL: https://issues.apache.org/jira/browse/SPARK-2219 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Michael Armbrust -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2089) With YARN, preferredNodeLocalityData isn't honored
[ https://issues.apache.org/jira/browse/SPARK-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14038569#comment-14038569 ] Patrick Wendell commented on SPARK-2089: Hey Sandy, I think I removed this from the primary constructor because it was really unclear to me what it was for or how we wanted users to use it, and it wasn't documented anywhere. So I figured it probably didn't belong in the parent constructor that we'll be supporting for the next 2 years :) My bad for causing this bug though. I'd propose fixing this by just cleaning up the interface in the first place and having it so we can encode it in a SparkConf. For one thing, there is no reason we should accept this as a Map[hostname, SplitInfo]. SplitInfo is a hadoop specific class... this means people can't express locality constraints for a YARN spark job if they are e.g. reading data from other storage systems, or they want to express any other type of policy. I think we should instead just have something like `spark.yarn.preferredNodes` which is a comma-separated list of nodes. If we wanted to be a bit fancier, we could accept a weight or count for each node. Then we could offer users a utility function somewhere that, given a Map[hostname, SplitInfo], could compute the correct value for the SparkConf setting. But users who are operating with non-hadoop storage systems, or who just wanted to express some locality constraints unrelated to storage (e.g., put all of my Spark containers on the same rack), could do it. One thing is - I actually think this means the current constructor would just be broken, so I'm not sure whether to remove it (and break compatibility) or to just have it log an ERROR or something telling users to use the new way. 
> With YARN, preferredNodeLocalityData isn't honored > --- > > Key: SPARK-2089 > URL: https://issues.apache.org/jira/browse/SPARK-2089 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.0.0 >Reporter: Sandy Ryza >Assignee: Sandy Ryza >Priority: Critical > > When running in YARN cluster mode, apps can pass preferred locality data when > constructing a Spark context that will dictate where to request executor > containers. > This is currently broken because of a race condition. The Spark-YARN code > runs the user class and waits for it to start up a SparkContext. During its > initialization, the SparkContext will create a YarnClusterScheduler, which > notifies a monitor in the Spark-YARN code that the SparkContext is ready. The Spark-YARN code then > immediately fetches the preferredNodeLocationData from the SparkContext and > uses it to start requesting containers. > But in the SparkContext constructor that takes the preferredNodeLocationData, > setting preferredNodeLocationData comes after the rest of the initialization, > so, if the Spark-YARN code comes around quickly enough after being notified, > the data that's fetched is the empty unset version. This occurred during all > of my runs. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2218) rename Equals to EqualTo in Spark SQL expressions
[ https://issues.apache.org/jira/browse/SPARK-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-2218. Resolution: Fixed Fix Version/s: 1.1.0 1.0.1 > rename Equals to EqualTo in Spark SQL expressions > - > > Key: SPARK-2218 > URL: https://issues.apache.org/jira/browse/SPARK-2218 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.0.0 >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 1.0.1, 1.1.0 > > > The class name Equals is very error prone because there exists scala.Equals. > I just wasted a bunch of time debugging the optimizer because of this. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2114) groupByKey and joins on raw data
[ https://issues.apache.org/jira/browse/SPARK-2114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14038558#comment-14038558 ] Patrick Wendell commented on SPARK-2114: Hey Sandy, One thing that would be really helpful is if you could produce a micro benchmark that can reproduce the problems you are seeing. That way we could profile this and look at the relative benefits of various optimizations. For instance, it's possible that simpler things like keeping only the values serialized would yield substantial benefit with less complexity than what is proposed here. A second general note is that using a sort-based shuffle would alleviate a lot of these issues. At least my guess is that in the MR shuffle, objects are sorted in serialized form and only lazily deserialized at the last possible second. > groupByKey and joins on raw data > > > Key: SPARK-2114 > URL: https://issues.apache.org/jira/browse/SPARK-2114 > Project: Spark > Issue Type: New Feature > Components: Shuffle, Spark Core >Affects Versions: 1.0.0 >Reporter: Sandy Ryza > > For groupByKey and join transformations, Spark tasks on the reduce side > deserialize every record into a Java object before calling any user function. > > This causes all kinds of problems for garbage collection - when aggregating > enough data, objects can escape the young gen and trigger full GCs down the > line. Additionally, when records are spilled, they must be serialized and > deserialized multiple times. > It would be helpful to allow aggregations on serialized data - using some > sort of RawHasher interface that could implement hashCode and equals for > serialized records. This would also require encoding record boundaries in > the serialized format, which I'm not sure we currently do. -- This message was sent by Atlassian JIRA (v6.2#6252)
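The RawHasher idea in the ticket (hashCode/equals on serialized records, so reduce-side aggregation never materializes Java objects) can be illustrated with a small Python sketch. This is not Spark's implementation; it assumes the serializer is deterministic for equal keys (true for simple pickled types, not in general), and leans on the fact that `bytes` already provides hashing and equality over the raw form:

```python
import pickle

def group_by_key_raw(records, serialize=pickle.dumps):
    """Group serialized values under serialized keys. The dict's
    hashing and equality operate on the key's bytes, so neither
    keys nor values are deserialized during aggregation -- the
    "aggregate on raw data" idea from the ticket."""
    groups = {}
    for key, value in records:
        raw_key = serialize(key)  # bytes: hashable, comparable in serialized form
        groups.setdefault(raw_key, []).append(serialize(value))
    return groups

def read_group(groups, key, serialize=pickle.dumps):
    """Deserialize one group's values only when the user asks for them."""
    return [pickle.loads(raw) for raw in groups[serialize(key)]]

grouped = group_by_key_raw([("a", 1), ("a", 2), ("b", 3)])
print(read_group(grouped, "a"))  # values for "a", deserialized lazily
```

The GC benefit in the real system would come from keeping the serialized records in large flat buffers (hence the ticket's note about encoding record boundaries), rather than in per-record Python objects as this toy version does.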
[jira] [Resolved] (SPARK-2209) Cast shouldn't do null check twice
[ https://issues.apache.org/jira/browse/SPARK-2209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-2209. Resolution: Fixed > Cast shouldn't do null check twice > -- > > Key: SPARK-2209 > URL: https://issues.apache.org/jira/browse/SPARK-2209 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.0.0 >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 1.0.1, 1.1.0 > > > Cast does two null checks, one in eval and another one in the function > returned by nullOrCast. It's best to get rid of the one in nullOrCast (since > eval will be the more common code path). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2196) Fix nullability of CaseWhen.
[ https://issues.apache.org/jira/browse/SPARK-2196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-2196. Resolution: Fixed Fix Version/s: 1.1.0 1.0.1 Assignee: Takuya Ueshin > Fix nullability of CaseWhen. > > > Key: SPARK-2196 > URL: https://issues.apache.org/jira/browse/SPARK-2196 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin > Fix For: 1.0.1, 1.1.0 > > > {{CaseWhen}} should use {{branches.length}} to check if {{elseValue}} is > provided or not. -- This message was sent by Atlassian JIRA (v6.2#6252)
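The one-line description above is terse; the point is that CaseWhen's branches are a flat alternating sequence `[cond1, val1, cond2, val2, ..., elseValue?]`, so whether an else branch exists must be read off `branches.length` (odd means present). A hedged Python sketch of the corrected nullability rule, using plain dicts with a `nullable` flag in place of Catalyst expressions:

```python
def casewhen_nullable(branches):
    """Nullability of a CASE WHEN over a flat branch list
    [cond1, val1, cond2, val2, ..., (elseValue)]. The result is
    nullable if any THEN value is nullable, if the else value is
    nullable, or if there is no else branch at all (the case where
    no condition matches yields NULL)."""
    has_else = len(branches) % 2 == 1      # the check the fix is about
    then_values = branches[1:len(branches) - (1 if has_else else 0):2]
    nullable = any(v["nullable"] for v in then_values)
    if has_else:
        nullable = nullable or branches[-1]["nullable"]
    else:
        nullable = True                    # missing else -> can produce NULL
    return nullable

NN, N = {"nullable": False}, {"nullable": True}
print(casewhen_nullable([NN, NN]))        # no else branch -> nullable
print(casewhen_nullable([NN, NN, NN]))    # non-null values + else -> not nullable
```

The flat-list encoding here mirrors how Catalyst's CaseWhen stored branches at the time; treat the dict representation as illustrative only.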
[jira] [Resolved] (SPARK-2203) PySpark does not infer default numPartitions in same way as Spark
[ https://issues.apache.org/jira/browse/SPARK-2203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-2203. Resolution: Fixed Fix Version/s: 1.1.0 > PySpark does not infer default numPartitions in same way as Spark > - > > Key: SPARK-2203 > URL: https://issues.apache.org/jira/browse/SPARK-2203 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.0.0 >Reporter: Aaron Davidson >Assignee: Aaron Davidson > Fix For: 1.1.0 > > > For shuffle-based operators, such as rdd.groupBy() or rdd.sortByKey(), > PySpark will always assume that the default parallelism to use for the reduce > side is ctx.defaultParallelism, which is a constant typically determined by > the number of cores in cluster. > In contrast, Spark's Partitioner#defaultPartitioner will use the same number > of reduce partitions as map partitions unless the defaultParallelism config > is explicitly set. This tends to be a better default in order to avoid OOMs, > and should also be the behavior of PySpark. -- This message was sent by Atlassian JIRA (v6.2#6252)
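The sizing rule described in SPARK-2203 can be summarized in a few lines. This is a simplified sketch of Partitioner.defaultPartitioner's partition-count choice (not its full logic, which also reuses an existing partitioner when one is available): honor `spark.default.parallelism` when explicitly set, otherwise fall back to the largest partition count among the input RDDs:

```python
def default_num_partitions(rdd_partition_counts, default_parallelism=None):
    """Simplified mirror of Spark's defaultPartitioner sizing rule.
    rdd_partition_counts: partition counts of the RDDs involved in
    the shuffle. default_parallelism: the explicitly configured
    spark.default.parallelism, or None when unset."""
    if default_parallelism is not None:
        return default_parallelism          # explicit config wins
    return max(rdd_partition_counts)        # else match the widest input

# 8-partition input, no explicit config -> keep 8 reduce partitions
print(default_num_partitions([4, 8]))
```

PySpark's bug was effectively always taking the first branch with `ctx.defaultParallelism` (a cluster-core-derived constant), which can shrink the reduce side of a wide shuffle and cause OOMs.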
[jira] [Resolved] (SPARK-2210) cast to boolean on boolean value gets turned into NOT((boolean_condition) = 0)
[ https://issues.apache.org/jira/browse/SPARK-2210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-2210. Resolution: Fixed Fix Version/s: 1.1.0 1.0.1 > cast to boolean on boolean value gets turned into NOT((boolean_condition) = 0) > -- > > Key: SPARK-2210 > URL: https://issues.apache.org/jira/browse/SPARK-2210 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.0.0 >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 1.0.1, 1.1.0 > > > {code} > explain select cast(cast(key=0 as boolean) as boolean) aaa from src > {code} > should be > {code} > [Physical execution plan:] > [Project [(key#10:0 = 0) AS aaa#7]] > [ HiveTableScan [key#10], (MetastoreRelation default, src, None), None] > {code} > However, it is currently > {code} > [Physical execution plan:] > [Project [NOT((key#10=0) = 0) AS aaa#7]] > [ HiveTableScan [key#10], (MetastoreRelation default, src, None), None] > {code} -- This message was sent by Atlassian JIRA (v6.2#6252)