[jira] [Comment Edited] (SPARK-18278) Support native submission of spark jobs to a kubernetes cluster
[ https://issues.apache.org/jira/browse/SPARK-18278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15642858#comment-15642858 ] William Benton edited comment on SPARK-18278 at 11/7/16 2:27 AM: - [~srowen] Currently {{ExternalClusterManager}} is Spark-private, so there isn't a great way to implement a new scheduler backend outside of Spark proper. I think it would be great if an extension API for new cluster managers were public and developers could work with it with some expectation of stability! But even if this API were exposed and (relatively) stable, I think there's a good argument that if any cluster managers besides standalone are to live in Spark proper that a Kubernetes scheduler should be there, too. (Why draw the line just past Mesos and YARN when Kubernetes also enjoys a large community and many deployments? And if the Spark community is to draw the line somewhere, why not draw it around the standalone scheduler? If we're using the messaging connectors in Bahir as an example, then historical precedent isn't an argument for keeping existing schedulers in Spark proper.) was (Author: willbenton): [~srowen] Currently {{ExternalClusterManager}} is Spark-private, so there isn't a great way to implement a new scheduler backend outside of Spark proper. I think it would be great if an extension API for new cluster managers were public and developers could work with it with some expectation of stability! But even if this API were exposed and (relatively) stable, I I think there's a good argument that if any cluster managers besides standalone are to live in Spark proper that a Kubernetes scheduler should be there, too. (Why draw the line just past Mesos and YARN when Kubernetes also enjoys a large community and many deployments? And if the Spark community is to draw the line somewhere, why not draw it around the standalone scheduler? If we're using the messaging connectors in Bahir as an example, then historical precedent isn't an argument for keeping existing schedulers in Spark proper.) > Support native submission of spark jobs to a kubernetes cluster > --- > > Key: SPARK-18278 > URL: https://issues.apache.org/jira/browse/SPARK-18278 > Project: Spark > Issue Type: Umbrella > Components: Build, Deploy, Documentation, Scheduler, Spark Core >Affects Versions: 2.2.0 >Reporter: Erik Erlandson > > A new Apache Spark sub-project that enables native support for submitting > Spark applications to a kubernetes cluster. The submitted application runs > in a driver executing on a kubernetes pod, and executors lifecycles are also > managed as pods. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18278) Support native submission of spark jobs to a kubernetes cluster
[ https://issues.apache.org/jira/browse/SPARK-18278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15642858#comment-15642858 ] William Benton commented on SPARK-18278: [~srowen] Currently {{ExternalClusterManager}} is Spark-private, so there isn't a great way to implement a new scheduler backend outside of Spark proper. I think it would be great if an extension API for new cluster managers were public and developers could work with it with some expectation of stability! But even if this API were exposed and (relatively) stable, I think there's a good argument that if any cluster managers besides standalone are to live in Spark proper that a Kubernetes scheduler should be there, too. (Why draw the line just past Mesos and YARN when Kubernetes also enjoys a large community and many deployments? And if the Spark community is to draw the line somewhere, why not draw it around the standalone scheduler? If we're using the messaging connectors in Bahir as an example, then historical precedent isn't an argument for keeping existing schedulers in Spark proper.) > Support native submission of spark jobs to a kubernetes cluster > --- > > Key: SPARK-18278 > URL: https://issues.apache.org/jira/browse/SPARK-18278 > Project: Spark > Issue Type: Umbrella > Components: Build, Deploy, Documentation, Scheduler, Spark Core >Affects Versions: 2.2.0 >Reporter: Erik Erlandson > > A new Apache Spark sub-project that enables native support for submitting > Spark applications to a kubernetes cluster. The submitted application runs > in a driver executing on a kubernetes pod, and executors lifecycles are also > managed as pods. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17595) Inefficient selection in Word2VecModel.findSynonyms
William Benton created SPARK-17595: -- Summary: Inefficient selection in Word2VecModel.findSynonyms Key: SPARK-17595 URL: https://issues.apache.org/jira/browse/SPARK-17595 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 2.0.0 Reporter: William Benton Priority: Minor The code in `Word2VecModel.findSynonyms` to choose the vocabulary elements with the highest similarity to the query vector currently sorts the similarities for every vocabulary element. This involves making multiple copies of the collection of similarities while doing a (relatively) expensive sort. It would be more efficient to find the best matches by maintaining a bounded priority queue and populating it with a single pass over the vocabulary. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
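A single pass with a size-bounded heap is straightforward to sketch outside of MLlib. The following is a minimal, self-contained illustration of the idea described above (it assumes the similarities are already available as an iterator of (word, score) pairs; it is not the Word2VecModel code itself):
{code}
import scala.collection.mutable

// Keep the k highest-scoring pairs in one pass with a size-bounded min-heap,
// instead of sorting the entire vocabulary by similarity.
def topK(similarities: Iterator[(String, Double)], k: Int): Array[(String, Double)] = {
  // Reverse ordering so that the weakest candidate sits at the head of the queue.
  val ord = Ordering.by[(String, Double), Double](_._2)
  val heap = mutable.PriorityQueue.empty[(String, Double)](ord.reverse)
  similarities.foreach { pair =>
    if (heap.size < k) heap.enqueue(pair)
    else if (pair._2 > heap.head._2) { heap.dequeue(); heap.enqueue(pair) }
  }
  heap.dequeueAll.reverse.toArray // best match first
}
{code}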
[jira] [Created] (SPARK-17548) Word2VecModel.findSynonyms can spuriously reject the best match when invoked with a vector
William Benton created SPARK-17548: -- Summary: Word2VecModel.findSynonyms can spuriously reject the best match when invoked with a vector Key: SPARK-17548 URL: https://issues.apache.org/jira/browse/SPARK-17548 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 2.0.0, 1.6.2, 1.5.2, 1.4.1 Environment: any Reporter: William Benton Priority: Minor The `findSynonyms` method in `Word2VecModel` currently rejects the best match a priori. When `findSynonyms` is invoked with a word, the best match is almost certain to be that word, but `findSynonyms` can also be invoked with a vector, which might not correspond to any of the words in the model's vocabulary. In the latter case, rejecting the best match is spurious. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5281) Registering table on RDD is giving MissingRequirementError
[ https://issues.apache.org/jira/browse/SPARK-5281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14482291#comment-14482291 ] William Benton commented on SPARK-5281: --- As [~marmbrus] recently pointed out on the user list, this happens when you don't have all of the dependencies for Scala reflection loaded by the primordial classloader. For running apps from sbt, setting {{fork := true}} should do the trick. For running a REPL from sbt, try [this workaround|http://chapeau.freevariable.com/2015/04/spark-sql-repl.html]. (Sorry to not have a solution for Eclipse.) Registering table on RDD is giving MissingRequirementError -- Key: SPARK-5281 URL: https://issues.apache.org/jira/browse/SPARK-5281 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: sarsol Priority: Critical Application crashes on this line {{rdd.registerTempTable(temp)}} in 1.2 version when using sbt or Eclipse SCALA IDE Stacktrace: {code} Exception in thread main scala.reflect.internal.MissingRequirementError: class org.apache.spark.sql.catalyst.ScalaReflection in JavaMirror with primordial classloader with boot classpath [C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-library.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-reflect.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-actor.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-swing.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-compiler.jar;C:\Program Files\Java\jre7\lib\resources.jar;C:\Program Files\Java\jre7\lib\rt.jar;C:\Program Files\Java\jre7\lib\sunrsasign.jar;C:\Program Files\Java\jre7\lib\jsse.jar;C:\Program Files\Java\jre7\lib\jce.jar;C:\Program Files\Java\jre7\lib\charsets.jar;C:\Program Files\Java\jre7\lib\jfr.jar;C:\Program Files\Java\jre7\classes] not found. 
at scala.reflect.internal.MissingRequirementError$.signal(MissingRequirementError.scala:16) at scala.reflect.internal.MissingRequirementError$.notFound(MissingRequirementError.scala:17) at scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:48) at scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:61) at scala.reflect.internal.Mirrors$RootsBase.staticModuleOrClass(Mirrors.scala:72) at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:119) at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:21) at org.apache.spark.sql.catalyst.ScalaReflection$$typecreator1$1.apply(ScalaReflection.scala:115) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:231) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:231) at scala.reflect.api.TypeTags$class.typeOf(TypeTags.scala:335) at scala.reflect.api.Universe.typeOf(Universe.scala:59) at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:115) at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:33) at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:100) at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:33) at org.apache.spark.sql.catalyst.ScalaReflection$class.attributesFor(ScalaReflection.scala:94) at org.apache.spark.sql.catalyst.ScalaReflection$.attributesFor(ScalaReflection.scala:33) at org.apache.spark.sql.SQLContext.createSchemaRDD(SQLContext.scala:111) at com.sar.spark.dq.poc.SparkPOC$delayedInit$body.apply(SparkPOC.scala:43) at scala.Function0$class.apply$mcV$sp(Function0.scala:40) at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12) at scala.App$$anonfun$main$1.apply(App.scala:71) at scala.App$$anonfun$main$1.apply(App.scala:71) at scala.collection.immutable.List.foreach(List.scala:318) at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:32) at scala.App$class.main(App.scala:71) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
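For sbt users, the workaround mentioned in the comment above is a one-line setting; a minimal build.sbt sketch:
{code}
// build.sbt — run the application in a forked JVM so that scala-reflect and
// friends are loaded by a normal application classloader rather than the
// primordial one, avoiding the MissingRequirementError above.
fork := true
{code}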
[jira] [Commented] (SPARK-4867) UDF clean up
[ https://issues.apache.org/jira/browse/SPARK-4867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14253579#comment-14253579 ] William Benton commented on SPARK-4867: --- [~marmbrus] I actually think exposing an interface that looks something like overloading might be the right approach. (To be clear, I think polymorphism poses a far greater difficulty with implicit coercion than without it, but it might be possible to solve the ambiguity there by letting users register functions in a priority order.) UDF clean up Key: SPARK-4867 URL: https://issues.apache.org/jira/browse/SPARK-4867 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Priority: Blocker Right now our support and internal implementation of many functions has a few issues. Specifically: - UDFS don't know their input types and thus don't do type coercion. - We hard code a bunch of built in functions into the parser. This is bad because in SQL it creates new reserved words for things that aren't actually keywords. Also it means that for each function we need to add support to both SQLContext and HiveContext separately. For this JIRA I propose we do the following: - Change the interfaces for registerFunction and ScalaUdf to include types for the input arguments as well as the output type. - Add a rule to analysis that does type coercion for UDFs. - Add a parse rule for functions to SQLParser. - Rewrite all the UDFs that are currently hacked into the various parsers using this new functionality. Depending on how big this refactoring becomes we could split parts 12 from part 3 above. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4867) UDF clean up
[ https://issues.apache.org/jira/browse/SPARK-4867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14251822#comment-14251822 ] William Benton commented on SPARK-4867: --- I think in general it's a great idea to declare or register SQL functions with type signatures. I also think that the more we can lean on Scala's type system here, the better. The reason why I didn't make declaring signatures a requirement for all functions in my PR for SPARK-2863 is that it seems like the interface gets hairy pretty quickly if it's going to be general enough to work for the use cases we can reasonably expect. The simplest example of where things get complicated is numeric types: many functions in existing systems are polymorphic, taking either Doubles or Decimals, and then returning the type they took. (In practice, it seems like Hive in particular doesn't do much that keeps the precision of Decimals intact, but that's another matter.) So we'd need a type signature interface that supports type variables and constraints, so that the expected signature for addition could look something like (A, B) => C, with annotations indicating that A is either Double or Decimal, B is either Double or Decimal, and C is the least upper bound of A and B. (It's certainly possible to special-case numeric coercions, but that's sort of a bummer -- and this class of problem is one we'd want to solve for UDFs anyway.) In any case, I'm definitely interested in working with you on both design and implementation! UDF clean up Key: SPARK-4867 URL: https://issues.apache.org/jira/browse/SPARK-4867 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Priority: Blocker Right now our support and internal implementation of many functions has a few issues. Specifically: - UDFS don't know their input types and thus don't do type coercion. - We hard code a bunch of built in functions into the parser. This is bad because in SQL it creates new reserved words for things that aren't actually keywords. Also it means that for each function we need to add support to both SQLContext and HiveContext separately. For this JIRA I propose we do the following: - Change the interfaces for registerFunction and ScalaUdf to include types for the input arguments as well as the output type. - Add a rule to analysis that does type coercion for UDFs. - Add a parse rule for functions to SQLParser. - Rewrite all the UDFs that are currently hacked into the various parsers using this new functionality. Depending on how big this refactoring becomes we could split parts 12 from part 3 above. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
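To make the polymorphic-signature concern concrete, here is a toy, self-contained sketch (hypothetical names, not a proposed Spark API) of a signature whose result type is the least upper bound of its Double/Decimal arguments:
{code}
// Toy model only: two numeric SQL types and a signature (A, B) => C where
// A and B are each Double or Decimal and C = lub(A, B).
sealed trait SqlType
case object DoubleType extends SqlType
case object DecimalType extends SqlType

// In this sketch, mixing Double and Decimal widens to Double (precision is not kept).
def lub(a: SqlType, b: SqlType): SqlType =
  if (a == DecimalType && b == DecimalType) DecimalType else DoubleType

case class NumericSignature(accepted: Set[SqlType]) {
  def resultType(a: SqlType, b: SqlType): Option[SqlType] =
    if (accepted(a) && accepted(b)) Some(lub(a, b)) else None
}

val plus = NumericSignature(Set(DoubleType, DecimalType))
// plus.resultType(DoubleType, DecimalType) == Some(DoubleType)
{code}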
[jira] [Commented] (SPARK-4185) JSON schema inference failed when dealing with type conflicts in arrays
[ https://issues.apache.org/jira/browse/SPARK-4185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14193491#comment-14193491 ] William Benton commented on SPARK-4185: --- I'm actually not sure this is a bug! My main concern in this case is that inferring any typing for this collection of objects makes it very difficult to write meaningful queries. In the fedmsg case, the problem was that the source data overloaded the meaning of a field name, so I was able to preprocess the fields to do the renaming. I was thinking that maybe a good solution might be to have Spark SQL automatically rename fields with conflicting types in different records (e.g. to “branches_1” and “branches_2” in this case). JSON schema inference failed when dealing with type conflicts in arrays --- Key: SPARK-4185 URL: https://issues.apache.org/jira/browse/SPARK-4185 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Yin Huai Assignee: Yin Huai
{code}
val sqlContext = new org.apache.spark.sql.SQLContext(sparkContext)
val diverging = sparkContext.parallelize(List("""{"branches": ["foo"]}""", """{"branches": [{"foo": 42}]}"""))
sqlContext.jsonRDD(diverging) // throws a MatchError
{code}
The case is from http://chapeau.freevariable.com/2014/10/fedmsg-and-spark.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4190) Allow users to provide transformation rules at JSON ingest
William Benton created SPARK-4190: - Summary: Allow users to provide transformation rules at JSON ingest Key: SPARK-4190 URL: https://issues.apache.org/jira/browse/SPARK-4190 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.1.0, 1.2.0 Reporter: William Benton It would be great if it were possible to provide transformation rules (to be executed within jsonRDD or jsonFile) so that users could (1) deal with JSON files that confound schema inference or are otherwise insufficiently disciplined, or (2) simply perform arbitrary object transformations at ingest before a schema is inferred. json4s, which Spark already uses, has nice interfaces for specifying transformations as partial functions on objects and accessing nested structures via path expressions. (We might want to introduce an abstraction atop json4s for a public API, but the json4s API seems like a good first step.) There are some examples of these transformations at https://github.com/json4s/json4s and at http://chapeau.freevariable.com/2014/10/fedmsg-and-spark.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
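For reference, the kind of json4s transformation the description has in mind looks roughly like this (a small standalone sketch; the field names come from the fedmsg example linked above):
{code}
import org.json4s._
import org.json4s.jackson.JsonMethods._

// Rename the string-array variant of "branches" before schema inference so it
// no longer collides with the object-array variant found in other records.
val record = parse("""{"branches": ["foo"]}""")
val renamed = record.transformField {
  case ("branches", v @ JArray(JString(_) :: _)) => ("branches_str", v)
}
// compact(render(renamed)) == {"branches_str":["foo"]}
{code}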
[jira] [Commented] (SPARK-4190) Allow users to provide transformation rules at JSON ingest
[ https://issues.apache.org/jira/browse/SPARK-4190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14193573#comment-14193573 ] William Benton commented on SPARK-4190: --- I'll take this, since I'm interested in working on it and it seems like a quick fix. [~yhuai], will you be willing to review a WIP PR sometime soon? Allow users to provide transformation rules at JSON ingest -- Key: SPARK-4190 URL: https://issues.apache.org/jira/browse/SPARK-4190 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.1.0, 1.2.0 Reporter: William Benton It would be great if it were possible to provide transformation rules (to be executed within jsonRDD or jsonFile) so that users could (1) deal with JSON files that confound schema inference or are otherwise insufficiently disciplined, or (2) simply perform arbitrary object transformations at ingest before a schema is inferred. json4s, which Spark already uses, has nice interfaces for specifying transformations as partial functions on objects and accessing nested structures via path expressions. (We might want to introduce an abstraction atop json4s for a public API, but the json4s API seems like a good first step.) There are some examples of these transformations at https://github.com/json4s/json4s and at http://chapeau.freevariable.com/2014/10/fedmsg-and-spark.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2863) Emulate Hive type coercion in native reimplementations of Hive functions
[ https://issues.apache.org/jira/browse/SPARK-2863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14168403#comment-14168403 ] William Benton commented on SPARK-2863: --- I submitted a PR for this issue this morning: https://github.com/apache/spark/pull/2768 (I would have liked to have a solution that leaned on the type system a little more -- in particular, even just representing function signatures as tuples rather than as lists -- but the approach I took in the PR is simple, easy to understand, and easy to validate.) Emulate Hive type coercion in native reimplementations of Hive functions Key: SPARK-2863 URL: https://issues.apache.org/jira/browse/SPARK-2863 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.0 Reporter: William Benton Assignee: William Benton Native reimplementations of Hive functions no longer have the same type-coercion behavior as they would if executed via Hive. As [Michael Armbrust points out|https://github.com/apache/spark/pull/1750#discussion_r15790970], queries like {{SELECT SQRT(2) FROM src LIMIT 1}} succeed in Hive but fail if {{SQRT}} is implemented natively. Spark SQL should have Hive-compatible type coercions for arguments to natively-implemented functions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3699) sbt console tasks don't clean up SparkContext
William Benton created SPARK-3699: - Summary: sbt console tasks don't clean up SparkContext Key: SPARK-3699 URL: https://issues.apache.org/jira/browse/SPARK-3699 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: William Benton Priority: Minor Fix For: 1.1.1 Because the sbt console tasks for the hive and sql projects don't stop the SparkContext upon exit, users are faced with an ugly stack trace. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
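One lightweight way to express the fix in sbt (a sketch only, assuming the console task creates a context named sc via initialCommands) is to pair it with a cleanupCommands setting:
{code}
// build.sbt sketch: create a context when the console starts and stop it on
// exit, so quitting the REPL does not leave a live SparkContext behind.
initialCommands in console :=
  """val sc = new org.apache.spark.SparkContext("local", "console")"""
cleanupCommands in console := "sc.stop()"
{code}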
[jira] [Created] (SPARK-3423) Implement BETWEEN support for regular SQL parser
William Benton created SPARK-3423: - Summary: Implement BETWEEN support for regular SQL parser Key: SPARK-3423 URL: https://issues.apache.org/jira/browse/SPARK-3423 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.1.0 Reporter: William Benton Priority: Minor The HQL parser supports BETWEEN but the SQLParser currently does not. It would be great if it did. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3423) Implement BETWEEN support for regular SQL parser
[ https://issues.apache.org/jira/browse/SPARK-3423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14123814#comment-14123814 ] William Benton commented on SPARK-3423: --- (PR is here: https://github.com/apache/spark/pull/2295 ) Implement BETWEEN support for regular SQL parser Key: SPARK-3423 URL: https://issues.apache.org/jira/browse/SPARK-3423 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.1.0 Reporter: William Benton Assignee: William Benton Priority: Minor The HQL parser supports BETWEEN but the SQLParser currently does not. It would be great if it did. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3329) HiveQuerySuite SET tests depend on map orderings
William Benton created SPARK-3329: - Summary: HiveQuerySuite SET tests depend on map orderings Key: SPARK-3329 URL: https://issues.apache.org/jira/browse/SPARK-3329 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.2, 1.1.0 Reporter: William Benton Priority: Trivial The SET tests in HiveQuerySuite that return multiple values depend on the ordering in which map pairs are returned from Hive and can fail spuriously if this changes due to environment or library changes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
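A sketch of the order-insensitive check this calls for, assuming the query result is available as a sequence of key=value strings (not the actual HiveQuerySuite code):
{code}
// Compare multi-pair SET output as a set of "key=value" strings rather than as
// an ordered sequence, so the test no longer depends on Hive's map ordering.
def assertPairsEqual(actual: Seq[String], expected: Seq[String]): Unit = {
  assert(actual.map(_.trim).toSet == expected.map(_.trim).toSet)
}
{code}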
[jira] [Commented] (SPARK-2863) Emulate Hive type coercion in native reimplementations of Hive functions
[ https://issues.apache.org/jira/browse/SPARK-2863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14105537#comment-14105537 ] William Benton commented on SPARK-2863: --- I wrote up how Hive handles type coercions in a blog post: http://chapeau.freevariable.com/2014/08/existing-system-coercion.html The short version is that strings can be coerced to doubles or decimals and (in Hive 0.13) decimals can be coerced to doubles for numeric functions. As a first pass, I propose extending the numeric function helpers to handle strings. Emulate Hive type coercion in native reimplementations of Hive functions Key: SPARK-2863 URL: https://issues.apache.org/jira/browse/SPARK-2863 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.0 Reporter: William Benton Assignee: William Benton Native reimplementations of Hive functions no longer have the same type-coercion behavior as they would if executed via Hive. As [Michael Armbrust points out|https://github.com/apache/spark/pull/1750#discussion_r15790970], queries like {{SELECT SQRT(2) FROM src LIMIT 1}} succeed in Hive but fail if {{SQRT}} is implemented natively. Spark SQL should have Hive-compatible type coercions for arguments to natively-implemented functions. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
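As an illustration of the first-pass proposal (a standalone sketch, not the Spark SQL helper itself), coercing string arguments for a numeric builtin might look like:
{code}
import scala.util.Try

// Hive-style implicit coercion for numeric functions: strings (and decimals)
// are accepted wherever a double is expected.
def coerceToDouble(arg: Any): Option[Double] = arg match {
  case d: Double               => Some(d)
  case b: java.math.BigDecimal => Some(b.doubleValue)
  case s: String               => Try(s.toDouble).toOption // "2" => Some(2.0)
  case _                       => None
}

// coerceToDouble("2").map(math.sqrt) == Some(1.4142135623730951)
{code}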
[jira] [Created] (SPARK-2863) Emulate Hive type coercion in native reimplementations of Hive UDFs
William Benton created SPARK-2863: - Summary: Emulate Hive type coercion in native reimplementations of Hive UDFs Key: SPARK-2863 URL: https://issues.apache.org/jira/browse/SPARK-2863 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.0 Reporter: William Benton Native reimplementations of Hive functions no longer have the same type-coercion behavior as they would if executed via Hive. As [Michael Armbrust points out|https://github.com/apache/spark/pull/1750#discussion_r15790970], queries like {{SELECT SQRT(2) FROM src LIMIT 1}} succeed in Hive but fail if {{SQRT}} is implemented natively. Spark SQL should have Hive-compatible type coercions for arguments to natively-implemented functions. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-2813) Implement SQRT() directly in Catalyst
William Benton created SPARK-2813: - Summary: Implement SQRT() directly in Catalyst Key: SPARK-2813 URL: https://issues.apache.org/jira/browse/SPARK-2813 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.0.0 Reporter: William Benton Priority: Minor Fix For: 1.1.0 Instead of delegating square root computation to a Hive UDF, Spark should implement SQL SQRT() directly. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2813) Implement SQRT() directly in Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-2813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] William Benton updated SPARK-2813: -- Summary: Implement SQRT() directly in Spark SQL (was: Implement SQRT() directly in Spark SQP) Implement SQRT() directly in Spark SQL -- Key: SPARK-2813 URL: https://issues.apache.org/jira/browse/SPARK-2813 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.0.0 Reporter: William Benton Priority: Minor Fix For: 1.1.0 Instead of delegating square root computation to a Hive UDF, Spark should implement SQL SQRT() directly. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2813) Implement SQRT() directly in Spark SQP
[ https://issues.apache.org/jira/browse/SPARK-2813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] William Benton updated SPARK-2813: -- Summary: Implement SQRT() directly in Spark SQP (was: Implement SQRT() directly in Catalyst) Implement SQRT() directly in Spark SQP -- Key: SPARK-2813 URL: https://issues.apache.org/jira/browse/SPARK-2813 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.0.0 Reporter: William Benton Priority: Minor Fix For: 1.1.0 Instead of delegating square root computation to a Hive UDF, Spark should implement SQL SQRT() directly. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2226) HAVING should be able to contain aggregate expressions that don't appear in the aggregation list.
[ https://issues.apache.org/jira/browse/SPARK-2226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14067503#comment-14067503 ] William Benton commented on SPARK-2226: --- [~rxin], yes, and I'm mostly done. I'll post a PR soon! HAVING should be able to contain aggregate expressions that don't appear in the aggregation list. -- Key: SPARK-2226 URL: https://issues.apache.org/jira/browse/SPARK-2226 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.0.0 Reporter: Reynold Xin Assignee: William Benton https://github.com/apache/hive/blob/trunk/ql/src/test/queries/clientpositive/having.q This test file contains the following query: {code} SELECT key FROM src GROUP BY key HAVING max(value) > "val_255"; {code} Once we fixed this issue, we should whitelist having.q. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2486) Utils.getCallSite can crash under JVMTI profilers
William Benton created SPARK-2486: - Summary: Utils.getCallSite can crash under JVMTI profilers Key: SPARK-2486 URL: https://issues.apache.org/jira/browse/SPARK-2486 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.1 Environment: running under profilers (observed on OS X under YourKit with CPU profiling and/or object allocation site tracking enabled) Reporter: William Benton Priority: Minor When running under an instrumenting profiler, Utils.getCallSite sometimes crashes with an NPE while examining stack trace elements. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2486) Utils.getCallSite can crash under JVMTI profilers
[ https://issues.apache.org/jira/browse/SPARK-2486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14061625#comment-14061625 ] William Benton commented on SPARK-2486: --- A (trivial but functional) workaround is here: https://github.com/apache/spark/pull/1413 Utils.getCallSite can crash under JVMTI profilers - Key: SPARK-2486 URL: https://issues.apache.org/jira/browse/SPARK-2486 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.1 Environment: running under profilers (observed on OS X under YourKit with CPU profiling and/or object allocation site tracking enabled) Reporter: William Benton Priority: Minor When running under an instrumenting profiler, Utils.getCallSite sometimes crashes with an NPE while examining stack trace elements. -- This message was sent by Atlassian JIRA (v6.2#6252)
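A minimal sketch of the defensive style such a workaround takes (assumed shape only; see the PR above for the actual change): tolerate stack-trace elements with null fields, which profiler-injected frames can produce.
{code}
// Find the first frame outside Spark without assuming every element (or its
// class name) is non-null, which is not guaranteed under a JVMTI agent.
def firstNonSparkFrame(trace: Array[StackTraceElement]): Option[StackTraceElement] =
  trace.find { el =>
    el != null && el.getClassName != null && !el.getClassName.startsWith("org.apache.spark.")
  }
{code}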
[jira] [Commented] (SPARK-2407) Implement SQL SUBSTR() directly in Catalyst
[ https://issues.apache.org/jira/browse/SPARK-2407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14057725#comment-14057725 ] William Benton commented on SPARK-2407: --- Here's the PR: https://github.com/apache/spark/pull/1359 Implement SQL SUBSTR() directly in Catalyst --- Key: SPARK-2407 URL: https://issues.apache.org/jira/browse/SPARK-2407 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.0 Reporter: William Benton Assignee: William Benton Currently SQL SUBSTR/SUBSTRING() is delegated to Hive. It would be nice to implement this directly. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2407) Implement SQL SUBSTR() directly in Catalyst
William Benton created SPARK-2407: - Summary: Implement SQL SUBSTR() directly in Catalyst Key: SPARK-2407 URL: https://issues.apache.org/jira/browse/SPARK-2407 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.0 Reporter: William Benton Currently SQL SUBSTR/SUBSTRING() is delegated to Hive. It would be nice to implement this directly. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2407) Implement SQL SUBSTR() directly in Catalyst
[ https://issues.apache.org/jira/browse/SPARK-2407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14055232#comment-14055232 ] William Benton commented on SPARK-2407: --- I have this on a branch and will submit a PR as soon as I'm done running the test suite locally. Implement SQL SUBSTR() directly in Catalyst --- Key: SPARK-2407 URL: https://issues.apache.org/jira/browse/SPARK-2407 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.0 Reporter: William Benton Currently SQL SUBSTR/SUBSTRING() is delegated to Hive. It would be nice to implement this directly. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2226) HAVING should be able to contain aggregate expressions that don't appear in the aggregation list.
[ https://issues.apache.org/jira/browse/SPARK-2226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14055499#comment-14055499 ] William Benton commented on SPARK-2226: --- Thanks, [~marmbrus], this makes sense! I'll ping you if I get stuck. HAVING should be able to contain aggregate expressions that don't appear in the aggregation list. -- Key: SPARK-2226 URL: https://issues.apache.org/jira/browse/SPARK-2226 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.0.0 Reporter: Reynold Xin Assignee: William Benton https://github.com/apache/hive/blob/trunk/ql/src/test/queries/clientpositive/having.q This test file contains the following query: {code} SELECT key FROM src GROUP BY key HAVING max(value) > "val_255"; {code} Once we fixed this issue, we should whitelist having.q. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2225) Turn HAVING without GROUP BY into WHERE
[ https://issues.apache.org/jira/browse/SPARK-2225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14039335#comment-14039335 ] William Benton commented on SPARK-2225: --- So the Hive test suite treats HAVING without GROUP BY as an error (see having1.q), although it is accepted by some dialects (like SQL Server). I'm happy to take this, though, if this is what we want to do here. Turn HAVING without GROUP BY into WHERE --- Key: SPARK-2225 URL: https://issues.apache.org/jira/browse/SPARK-2225 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.0.0 Reporter: Reynold Xin See http://msdn.microsoft.com/en-US/library/8hhs5f4e(v=vs.80).aspx The HAVING clause specifies conditions that determines the groups included in the query. If the SQL SELECT statement does not contain aggregate functions, you can use a SQL SELECT statement that contains a HAVING clause without a GROUP BY clause. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2180) HiveQL doesn't support GROUP BY with HAVING clauses
[ https://issues.apache.org/jira/browse/SPARK-2180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14037697#comment-14037697 ] William Benton commented on SPARK-2180: --- PR is here: https://github.com/apache/spark/pull/1136 HiveQL doesn't support GROUP BY with HAVING clauses --- Key: SPARK-2180 URL: https://issues.apache.org/jira/browse/SPARK-2180 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.0 Reporter: William Benton Priority: Minor The HiveQL implementation doesn't support HAVING clauses for aggregations. This prevents some of the TPCDS benchmarks from running. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2180) HiveQL doesn't support GROUP BY with HAVING clauses
[ https://issues.apache.org/jira/browse/SPARK-2180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14035907#comment-14035907 ] William Benton commented on SPARK-2180: --- (I'm working on a fix and will submit a PR soon.) HiveQL doesn't support GROUP BY with HAVING clauses --- Key: SPARK-2180 URL: https://issues.apache.org/jira/browse/SPARK-2180 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.0 Reporter: William Benton Priority: Minor The HiveQL implementation doesn't support HAVING clauses for aggregations. This prevents some of the TPCDS benchmarks from running. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-571) Forbid return statements when cleaning closures
[ https://issues.apache.org/jira/browse/SPARK-571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13993919#comment-13993919 ] William Benton commented on SPARK-571: -- I have a patch for this and will submit a PR later today; can someone please assign this to me? Forbid return statements when cleaning closures --- Key: SPARK-571 URL: https://issues.apache.org/jira/browse/SPARK-571 Project: Spark Issue Type: Improvement Reporter: tjhunter By mistake, I wrote some code like this:
{code}
object Foo {
  def main() {
    val sc = new SparkContext(...)
    sc.parallelize(0 to 10,10).map({ ... return 1 ... }).collect
  }
}
{code}
This compiles fine and actually runs using the local scheduler. However, using the mesos scheduler throws a NotSerializableException in the CollectTask. I agree the result of the program above should be undefined or it should be an error. Would it be possible to have more explicit messages? -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1807) Modify SPARK_EXECUTOR_URI to allow for script execution in Mesos.
[ https://issues.apache.org/jira/browse/SPARK-1807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995469#comment-13995469 ] William Benton commented on SPARK-1807: --- I'll be working on some related things this week; could someone assign this to me? Modify SPARK_EXECUTOR_URI to allow for script execution in Mesos. - Key: SPARK-1807 URL: https://issues.apache.org/jira/browse/SPARK-1807 Project: Spark Issue Type: Improvement Components: Mesos Affects Versions: 0.9.0 Reporter: Timothy St. Clair Modify Mesos Scheduler integration to allow SPARK_EXECUTOR_URI to be an executable script. This allows admins to launch spark in any fashion they desire, vs. just tarball fetching + implied context. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1789) Multiple versions of Netty dependencies cause FlumeStreamSuite failure
[ https://issues.apache.org/jira/browse/SPARK-1789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995560#comment-13995560 ] William Benton commented on SPARK-1789: --- Sean, we're currently building against Akka 2.3.0 in Fedora (it's a trivial source patch against 0.9.1; I haven't investigated the delta against 1.0 yet). Are there reasons why Akka 2.3.0 is a bad idea for Spark in general? If not, I'm happy to file a JIRA for updating the dependency and contribute my patch upstream. Multiple versions of Netty dependencies cause FlumeStreamSuite failure -- Key: SPARK-1789 URL: https://issues.apache.org/jira/browse/SPARK-1789 Project: Spark Issue Type: Bug Components: Build Affects Versions: 0.9.1 Reporter: Sean Owen Assignee: Sean Owen Labels: flume, netty, test Fix For: 1.0.0 TL;DR is there is a bit of JAR hell trouble with Netty, that can be mostly resolved and will resolve a test failure. I hit the error described at http://apache-spark-user-list.1001560.n3.nabble.com/SparkContext-startup-time-out-td1753.html while running FlumeStreamingSuite, and have for a short while (is it just me?) velvia notes: I have found a workaround. If you add akka 2.2.4 to your dependencies, then everything works, probably because akka 2.2.4 brings in newer version of Jetty. There are at least 3 versions of Netty in play in the build: - the new Flume 1.4.0 dependency brings in io.netty:netty:3.4.0.Final, and that is the immediate problem - the custom version of akka 2.2.3 depends on io.netty:netty:3.6.6. - but, Spark Core directly uses io.netty:netty-all:4.0.17.Final The POMs try to exclude other versions of netty, but are excluding org.jboss.netty:netty, when in fact older versions of io.netty:netty (not netty-all) are also an issue. The org.jboss.netty:netty excludes are largely unnecessary. I replaced many of them with io.netty:netty exclusions until everything agreed on io.netty:netty-all:4.0.17.Final. But this didn't work, since Akka 2.2.3 doesn't work with Netty 4.x. Down-grading to 3.6.6.Final across the board made some Spark code not compile. If the build *keeps* io.netty:netty:3.6.6.Final as well, everything seems to work. Part of the reason seems to be that Netty 3.x used the old `org.jboss.netty` packages. This is less than ideal, but is no worse than the current situation. So this PR resolves the issue and improves the JAR hell, even if it leaves the existing theoretical Netty 3-vs-4 conflict: - Remove org.jboss.netty excludes where possible, for clarity; they're not needed except with Hadoop artifacts - Add io.netty:netty excludes where needed -- except, let akka keep its io.netty:netty - Change a bit of test code that actually depended on Netty 3.x, to use 4.x equivalent - Update SBT build accordingly A better change would be to update Akka far enough such that it agrees on Netty 4.x, but I don't know if that's feasible. -- This message was sent by Atlassian JIRA (v6.2#6252)
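In sbt terms, the io.netty:netty exclusions described above look roughly like this (artifact coordinates shown for illustration; the real change touches the Maven POMs and the SBT build together):
{code}
// Exclude the old io.netty:netty 3.4.0.Final that flume-ng-sdk 1.4.0 brings in,
// while leaving akka's own io.netty:netty dependency alone.
libraryDependencies += ("org.apache.flume" % "flume-ng-sdk" % "1.4.0")
  .exclude("io.netty", "netty")
{code}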
[jira] [Commented] (SPARK-1781) Generalized validity checking for configuration parameters
[ https://issues.apache.org/jira/browse/SPARK-1781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13993893#comment-13993893 ] William Benton commented on SPARK-1781: --- Could someone assign this issue to me? Generalized validity checking for configuration parameters -- Key: SPARK-1781 URL: https://issues.apache.org/jira/browse/SPARK-1781 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Reporter: William Benton Priority: Minor Issues like SPARK-1779 could be handled easily by a general mechanism for specifying whether or not a configuration parameter value is valid or not (and then excepting or warning and switching to a default value if it is not). I think it's possible to do this in a fairly lightweight fashion. -- This message was sent by Atlassian JIRA (v6.2#6252)
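The lightweight mechanism could be as simple as a validate-or-default helper; a sketch (hypothetical signature, not part of SparkConf):
{code}
import scala.util.Try

// Parse a raw configuration value, check it against a validity predicate, and
// fall back to a default (with a warning) if it is missing or invalid.
def getValidated[T](name: String, raw: Option[String], default: T)
                   (parse: String => T)(valid: T => Boolean): T =
  raw.flatMap(s => Try(parse(s)).toOption).filter(valid).getOrElse {
    System.err.println(s"Invalid or missing value for $name; using default $default")
    default
  }

// getValidated("spark.task.cpus", Some("-1"), 1)(_.toInt)(_ > 0) == 1
{code}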
[jira] [Commented] (SPARK-729) Closures not always serialized at capture time
[ https://issues.apache.org/jira/browse/SPARK-729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13990729#comment-13990729 ] William Benton commented on SPARK-729: -- So the straightforward approach (immediately serializing and deserializing a closure in ClosureCleaner.clean) causes a couple of problems in Spark 1.0 that weren't obvious from the 0.9.1 test suite (that is, they existed but weren't exposed by the suite). Most notably, if we serialize closures immediately, we might replace the only reference to a broadcast variable object with a serialized copy of that object, and the original could then be cleaned up by ContextCleaner before the closure has a chance to execute. I've been looking at ways to solve this but thought I'd provide a status update here in the meantime. Closures not always serialized at capture time -- Key: SPARK-729 URL: https://issues.apache.org/jira/browse/SPARK-729 Project: Spark Issue Type: Bug Affects Versions: 0.7.0, 0.7.1 Reporter: Matei Zaharia Assignee: William Benton As seen in https://groups.google.com/forum/?fromgroups=#!topic/spark-users/8pTchwuP2Kk and its corresponding fix on https://github.com/mesos/spark/commit/adba773fab6294b5764d101d248815a7d3cb3558, it is possible for a closure referencing a var to see the latest version of that var, instead of the version that was there when the closure was passed to Spark. This is not good when failures or recomputations happen. We need to serialize the closures on capture if possible, perhaps as part of ClosureCleaner.clean. -- This message was sent by Atlassian JIRA (v6.2#6252)
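For context, "serializing at capture time" amounts to round-tripping the closure through the serializer as soon as it is handed to Spark; a standalone sketch using plain Java serialization (not ClosureCleaner's actual mechanics):
{code}
import java.io._

// Round-trip a closure through serialization at capture time, so later
// mutation of captured vars cannot change what the job eventually runs.
def captureNow[T <: Serializable](closure: T): T = {
  val buffer = new ByteArrayOutputStream()
  val out = new ObjectOutputStream(buffer)
  out.writeObject(closure)
  out.close()
  val in = new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray))
  in.readObject().asInstanceOf[T]
}
{code}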
[jira] [Commented] (SPARK-1501) Assertions in Graph.apply test are never executed
[ https://issues.apache.org/jira/browse/SPARK-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13969652#comment-13969652 ] William Benton commented on SPARK-1501: --- Here's the PR: https://github.com/apache/spark/pull/415 Assertions in Graph.apply test are never executed - Key: SPARK-1501 URL: https://issues.apache.org/jira/browse/SPARK-1501 Project: Spark Issue Type: Test Components: GraphX Affects Versions: 1.0.0 Reporter: William Benton Priority: Minor Labels: test The current Graph.apply test in GraphSuite contains assertions within an RDD transformation. These never execute because the transformation never executes. I have a (trivial) patch to fix this by collecting the graph triplets first. -- This message was sent by Atlassian JIRA (v6.2#6252)
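The pitfall generalizes beyond GraphX; a minimal RDD-based sketch (assuming an existing SparkContext named sc):
{code}
val rdd = sc.parallelize(1 to 10)

// Never runs the assertion: map is lazy and no action forces evaluation.
rdd.map { x => assert(x > 0); x }

// Runs the assertions on the driver after materializing the data.
rdd.collect().foreach { x => assert(x > 0) }
{code}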