[jira] [Comment Edited] (SPARK-18278) Support native submission of spark jobs to a kubernetes cluster

2016-11-06 Thread William Benton (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15642858#comment-15642858
 ] 

William Benton edited comment on SPARK-18278 at 11/7/16 2:27 AM:
-

[~srowen] Currently {{ExternalClusterManager}} is Spark-private, so there isn't 
a great way to implement a new scheduler backend outside of Spark proper.  I 
think it would be great if an extension API for new cluster managers were 
public and developers could work with it with some expectation of stability!  
But even if this API were exposed and (relatively) stable, I think there's a 
good argument that if any cluster managers besides standalone are to live in 
Spark proper that a Kubernetes scheduler should be there, too.  

(Why draw the line just past Mesos and YARN when Kubernetes also enjoys a large 
community and many deployments?  And if the Spark community is to draw the line 
somewhere, why not draw it around the standalone scheduler?  If we're using the 
messaging connectors in Bahir as an example, then historical precedent isn't an 
argument for keeping existing schedulers in Spark proper.)


was (Author: willbenton):
[~srowen] Currently {{ExternalClusterManager}} is Spark-private, so there isn't 
a great way to implement a new scheduler backend outside of Spark proper.  I 
think it would be great if an extension API for new cluster managers were 
public and developers could work with it with some expectation of stability!  
But even if this API were exposed and (relatively) stable, I I think there's a 
good argument that if any cluster managers besides standalone are to live in 
Spark proper that a Kubernetes scheduler should be there, too.  

(Why draw the line just past Mesos and YARN when Kubernetes also enjoys a large 
community and many deployments?  And if the Spark community is to draw the line 
somewhere, why not draw it around the standalone scheduler?  If we're using the 
messaging connectors in Bahir as an example, then historical precedent isn't an 
argument for keeping existing schedulers in Spark proper.)

> Support native submission of spark jobs to a kubernetes cluster
> ---
>
> Key: SPARK-18278
> URL: https://issues.apache.org/jira/browse/SPARK-18278
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build, Deploy, Documentation, Scheduler, Spark Core
>Affects Versions: 2.2.0
>Reporter: Erik Erlandson
>
> A new Apache Spark sub-project that enables native support for submitting 
> Spark applications to a Kubernetes cluster. The submitted application runs 
> in a driver executing on a Kubernetes pod, and executor lifecycles are also 
> managed as pods.






[jira] [Commented] (SPARK-18278) Support native submission of spark jobs to a kubernetes cluster

2016-11-06 Thread William Benton (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15642858#comment-15642858
 ] 

William Benton commented on SPARK-18278:


[~srowen] Currently {{ExternalClusterManager}} is Spark-private, so there isn't 
a great way to implement a new scheduler backend outside of Spark proper.  I 
think it would be great if an extension API for new cluster managers were 
public and developers could work with it with some expectation of stability!  
But even if this API were exposed and (relatively) stable, I think there's a 
good argument that if any cluster managers besides standalone are to live in 
Spark proper that a Kubernetes scheduler should be there, too.  

(Why draw the line just past Mesos and YARN when Kubernetes also enjoys a large 
community and many deployments?  And if the Spark community is to draw the line 
somewhere, why not draw it around the standalone scheduler?  If we're using the 
messaging connectors in Bahir as an example, then historical precedent isn't an 
argument for keeping existing schedulers in Spark proper.)
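
For context, here is a rough sketch of what a third-party cluster manager has to provide, assuming the shape of the {{ExternalClusterManager}} trait as it exists (Spark-private) in Spark 2.x; method names are approximate, and the Kubernetes-specific pieces are invented for illustration. Precisely because the trait is {{private[spark]}}, a class like this can only live inside Spark proper today, which is the point above.

{code}
package org.apache.spark.scheduler.k8s  // hypothetical package

import org.apache.spark.SparkContext
import org.apache.spark.scheduler.{ExternalClusterManager, SchedulerBackend, TaskScheduler, TaskSchedulerImpl}

// Hypothetical skeleton, selected when the master URL uses a k8s:// prefix.
private[spark] class KubernetesClusterManager extends ExternalClusterManager {

  override def canCreate(masterURL: String): Boolean =
    masterURL.startsWith("k8s://")

  override def createTaskScheduler(sc: SparkContext, masterURL: String): TaskScheduler =
    new TaskSchedulerImpl(sc)

  override def createSchedulerBackend(
      sc: SparkContext,
      masterURL: String,
      scheduler: TaskScheduler): SchedulerBackend = {
    // A real backend would talk to the Kubernetes API server to request,
    // watch, and reap executor pods; omitted here.
    ???
  }

  override def initialize(scheduler: TaskScheduler, backend: SchedulerBackend): Unit =
    scheduler.asInstanceOf[TaskSchedulerImpl].initialize(backend)
}
{code}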

> Support native submission of spark jobs to a kubernetes cluster
> ---
>
> Key: SPARK-18278
> URL: https://issues.apache.org/jira/browse/SPARK-18278
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build, Deploy, Documentation, Scheduler, Spark Core
>Affects Versions: 2.2.0
>Reporter: Erik Erlandson
>
> A new Apache Spark sub-project that enables native support for submitting 
> Spark applications to a Kubernetes cluster. The submitted application runs 
> in a driver executing on a Kubernetes pod, and executor lifecycles are also 
> managed as pods.






[jira] [Created] (SPARK-17595) Inefficient selection in Word2VecModel.findSynonyms

2016-09-19 Thread William Benton (JIRA)
William Benton created SPARK-17595:
--

 Summary: Inefficient selection in Word2VecModel.findSynonyms
 Key: SPARK-17595
 URL: https://issues.apache.org/jira/browse/SPARK-17595
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 2.0.0
Reporter: William Benton
Priority: Minor


The code in `Word2VecModel.findSynonyms` to choose the vocabulary elements with 
the highest similarity to the query vector currently sorts the similarities for 
every vocabulary element.  This involves making multiple copies of the 
collection of similarities while doing a (relatively) expensive sort.  It would 
be more efficient to find the best matches by maintaining a bounded priority 
queue and populating it with a single pass over the vocabulary.
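
For illustration, a minimal sketch (not the MLlib code) of selecting the top-k matches with a bounded priority queue in a single pass, rather than sorting every similarity; the names here are invented for the example.

{code}
import scala.collection.mutable

/** Return the indices of the k largest similarities, best match first. */
def topK(similarities: Array[Double], k: Int): Array[Int] = {
  // Reverse the ordering so the *smallest* retained similarity sits at the
  // head of the queue and is evicted when a better candidate arrives.
  val queue = mutable.PriorityQueue.empty[(Double, Int)](
    Ordering.by[(Double, Int), Double](_._1).reverse)

  var i = 0
  while (i < similarities.length) {
    if (queue.size < k) {
      queue.enqueue((similarities(i), i))
    } else if (similarities(i) > queue.head._1) {
      queue.dequeue()
      queue.enqueue((similarities(i), i))
    }
    i += 1
  }
  queue.dequeueAll.reverse.map(_._2).toArray
}
{code}

This keeps memory bounded by k and avoids copying and sorting the full similarity collection.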






[jira] [Created] (SPARK-17548) Word2VecModel.findSynonyms can spuriously reject the best match when invoked with a vector

2016-09-14 Thread William Benton (JIRA)
William Benton created SPARK-17548:
--

 Summary: Word2VecModel.findSynonyms can spuriously reject the best 
match when invoked with a vector
 Key: SPARK-17548
 URL: https://issues.apache.org/jira/browse/SPARK-17548
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 2.0.0, 1.6.2, 1.5.2, 1.4.1
 Environment: any
Reporter: William Benton
Priority: Minor


The `findSynonyms` method in `Word2VecModel` currently rejects the best match a 
priori. When `findSynonyms` is invoked with a word, the best match is almost 
certain to be that word, but `findSynonyms` can also be invoked with a vector, 
which might not correspond to any of the words in the model's vocabulary.  In 
the latter case, rejecting the best match is spurious.
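
A toy sketch of the distinction being drawn (hypothetical names, not the MLlib implementation): the word-based overload should drop its own top hit, but the vector-based overload should not.

{code}
// Tiny vocabulary of word vectors and a cosine-similarity ranking.
val vocab: Map[String, Array[Double]] = Map(
  "king"  -> Array(1.0, 0.0),
  "queen" -> Array(0.9, 0.1),
  "apple" -> Array(0.0, 1.0))

def cosine(a: Array[Double], b: Array[Double]): Double = {
  val dot = a.zip(b).map { case (x, y) => x * y }.sum
  dot / (math.sqrt(a.map(x => x * x).sum) * math.sqrt(b.map(x => x * x).sum))
}

def rankAll(v: Array[Double]): Seq[(String, Double)] =
  vocab.toSeq.map { case (w, u) => (w, cosine(v, u)) }.sortBy(-_._2)

// Word query: the best match is almost certainly the query word itself, so drop it.
def synonymsOfWord(word: String, num: Int): Seq[(String, Double)] =
  rankAll(vocab(word)).filter(_._1 != word).take(num)

// Vector query: the vector may not correspond to any vocabulary word, so
// rejecting the best match a priori is spurious.
def synonymsOfVector(v: Array[Double], num: Int): Seq[(String, Double)] =
  rankAll(v).take(num)
{code}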






[jira] [Commented] (SPARK-5281) Registering table on RDD is giving MissingRequirementError

2015-04-06 Thread William Benton (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14482291#comment-14482291
 ] 

William Benton commented on SPARK-5281:
---

As [~marmbrus] recently pointed out on the user list, this happens when you 
don't have all of the dependencies for Scala reflection loaded by the 
primordial classloader.  For running apps from sbt, setting {{fork := true}} 
should do the trick.  For running a REPL from sbt, try [this 
workaround|http://chapeau.freevariable.com/2015/04/spark-sql-repl.html].  
(Sorry to not have a solution for Eclipse.)
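
For reference, the sbt settings in question might look like this (a sketch; {{fork}} is a standard sbt key, and the memory option is only an example):

{code}
// build.sbt: run the application in a forked JVM so scala-reflect and its
// dependencies are loaded by a normal classloader rather than sbt's layered one.
fork := true

// Optionally pass JVM options to the forked process.
javaOptions in run ++= Seq("-Xmx2g")
{code}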

 Registering table on RDD is giving MissingRequirementError
 --

 Key: SPARK-5281
 URL: https://issues.apache.org/jira/browse/SPARK-5281
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
Reporter: sarsol
Priority: Critical

 Application crashes on the line {{rdd.registerTempTable("temp")}} in version 1.2 
 when using sbt or the Eclipse Scala IDE.
 Stacktrace:
 {code}
 Exception in thread "main" scala.reflect.internal.MissingRequirementError: 
 class org.apache.spark.sql.catalyst.ScalaReflection in JavaMirror with 
 primordial classloader with boot classpath 
 [C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-library.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-reflect.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-actor.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-swing.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-compiler.jar;C:\Program
  Files\Java\jre7\lib\resources.jar;C:\Program 
 Files\Java\jre7\lib\rt.jar;C:\Program 
 Files\Java\jre7\lib\sunrsasign.jar;C:\Program 
 Files\Java\jre7\lib\jsse.jar;C:\Program 
 Files\Java\jre7\lib\jce.jar;C:\Program 
 Files\Java\jre7\lib\charsets.jar;C:\Program 
 Files\Java\jre7\lib\jfr.jar;C:\Program Files\Java\jre7\classes] not found.
   at 
 scala.reflect.internal.MissingRequirementError$.signal(MissingRequirementError.scala:16)
   at 
 scala.reflect.internal.MissingRequirementError$.notFound(MissingRequirementError.scala:17)
   at 
 scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:48)
   at 
 scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:61)
   at 
 scala.reflect.internal.Mirrors$RootsBase.staticModuleOrClass(Mirrors.scala:72)
   at 
 scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:119)
   at 
 scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:21)
   at 
 org.apache.spark.sql.catalyst.ScalaReflection$$typecreator1$1.apply(ScalaReflection.scala:115)
   at 
 scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:231)
   at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:231)
   at scala.reflect.api.TypeTags$class.typeOf(TypeTags.scala:335)
   at scala.reflect.api.Universe.typeOf(Universe.scala:59)
   at 
 org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:115)
   at 
 org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:33)
   at 
 org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:100)
   at 
 org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:33)
   at 
 org.apache.spark.sql.catalyst.ScalaReflection$class.attributesFor(ScalaReflection.scala:94)
   at 
 org.apache.spark.sql.catalyst.ScalaReflection$.attributesFor(ScalaReflection.scala:33)
   at org.apache.spark.sql.SQLContext.createSchemaRDD(SQLContext.scala:111)
   at 
 com.sar.spark.dq.poc.SparkPOC$delayedInit$body.apply(SparkPOC.scala:43)
   at scala.Function0$class.apply$mcV$sp(Function0.scala:40)
   at 
 scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
   at scala.App$$anonfun$main$1.apply(App.scala:71)
   at scala.App$$anonfun$main$1.apply(App.scala:71)
   at scala.collection.immutable.List.foreach(List.scala:318)
   at 
 scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:32)
   at scala.App$class.main(App.scala:71)
 {code}






[jira] [Commented] (SPARK-4867) UDF clean up

2014-12-19 Thread William Benton (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14253579#comment-14253579
 ] 

William Benton commented on SPARK-4867:
---

[~marmbrus] I actually think exposing an interface that looks something like 
overloading might be the right approach.  (To be clear, I think polymorphism 
poses a far greater difficulty with implicit coercion than without it, but it 
might be possible to solve the ambiguity there by letting users register 
functions in a priority order.)
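
A minimal sketch of what "register functions in a priority order" could mean (all names hypothetical, with argument types given as boxed classes for simplicity): the first registered signature that accepts the arguments wins.

{code}
// Toy registry that resolves overloaded UDF names by registration order.
case class Sig(params: Seq[Class[_]], impl: Seq[Any] => Any)

class UdfRegistry {
  private var table = Map.empty[String, Vector[Sig]].withDefaultValue(Vector.empty)

  // Earlier registrations take priority over later ones.
  def register(name: String, sig: Sig): Unit =
    table += name -> (table(name) :+ sig)

  def invoke(name: String, args: Seq[Any]): Any =
    table(name)
      .find(s => s.params.length == args.length &&
                 s.params.zip(args).forall { case (p, a) => p.isInstance(a) })
      .map(_.impl(args))
      .getOrElse(sys.error(s"no applicable overload of $name"))
}
{code}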

 UDF clean up
 

 Key: SPARK-4867
 URL: https://issues.apache.org/jira/browse/SPARK-4867
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
Priority: Blocker

 Right now our support and internal implementation of many functions has a few 
 issues.  Specifically:
  - UDFs don't know their input types and thus don't do type coercion.
  - We hard-code a bunch of built-in functions into the parser.  This is bad 
 because in SQL it creates new reserved words for things that aren't actually 
 keywords.  Also it means that for each function we need to add support to 
 both SQLContext and HiveContext separately.
 For this JIRA I propose we do the following:
  - Change the interfaces for registerFunction and ScalaUdf to include types 
 for the input arguments as well as the output type.
  - Add a rule to analysis that does type coercion for UDFs.
  - Add a parse rule for functions to SQLParser.
  - Rewrite all the UDFs that are currently hacked into the various parsers 
 using this new functionality.
 Depending on how big this refactoring becomes we could split parts 1 and 2 from 
 part 3 above.






[jira] [Commented] (SPARK-4867) UDF clean up

2014-12-18 Thread William Benton (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14251822#comment-14251822
 ] 

William Benton commented on SPARK-4867:
---

I think in general it's a great idea to declare or register SQL functions with 
type signatures.  I also think that the more we can lean on Scala's type system 
here, the better.  The reason why I didn't make declaring signatures a 
requirement for all functions in my PR for SPARK-2863 is that it seems like the 
interface gets hairy pretty quickly if it's going to be general enough to work 
for the use cases we can reasonably expect.

The simplest example of where things get complicated is numeric types:  many 
functions in existing systems are polymorphic, taking either Doubles or 
Decimals, and then returning the type they took.  (In practice, it seems like 
Hive in particular doesn't do much that keeps the precision of Decimals intact, 
but that's another matter.)  So we'd need a type signature interface that 
supports type variables and constraints, so that the expected signature for 
addition could look something like (A, B) => C, with annotations indicating 
that A is either Double or Decimal, B is either Double or Decimal, and C is the 
least upper bound of A and B.  (It's certainly possible to special-case numeric 
coercions, but that's sort of a bummer -- and this class of problem is one we'd 
want to solve for UDFs anyway.)
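
A rough sketch of one way such a signature could be encoded, as a small data type rather than via Scala's type system; everything here is hypothetical and only meant to make the (A, B) => C example concrete.

{code}
// Hypothetical encoding of UDF signatures with type variables and constraints.
sealed trait SqlType
case object DoubleT  extends SqlType
case object DecimalT extends SqlType
case object StringT  extends SqlType

sealed trait TypeExpr
case class Concrete(t: SqlType)                     extends TypeExpr
case class Var(name: String, allowed: Set[SqlType]) extends TypeExpr
case class Lub(a: TypeExpr, b: TypeExpr)            extends TypeExpr

case class Signature(args: Seq[TypeExpr], result: TypeExpr)

// Addition: (A, B) => lub(A, B), where A and B range over Double and Decimal.
val numeric = Set[SqlType](DoubleT, DecimalT)
val plusSig = Signature(
  args   = Seq(Var("A", numeric), Var("B", numeric)),
  result = Lub(Var("A", numeric), Var("B", numeric)))
{code}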

In any case, I'm definitely interested in working with you on both design and 
implementation!

 UDF clean up
 

 Key: SPARK-4867
 URL: https://issues.apache.org/jira/browse/SPARK-4867
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
Priority: Blocker

 Right now our support and internal implementation of many functions has a few 
 issues.  Specifically:
  - UDFs don't know their input types and thus don't do type coercion.
  - We hard-code a bunch of built-in functions into the parser.  This is bad 
 because in SQL it creates new reserved words for things that aren't actually 
 keywords.  Also it means that for each function we need to add support to 
 both SQLContext and HiveContext separately.
 For this JIRA I propose we do the following:
  - Change the interfaces for registerFunction and ScalaUdf to include types 
 for the input arguments as well as the output type.
  - Add a rule to analysis that does type coercion for UDFs.
  - Add a parse rule for functions to SQLParser.
  - Rewrite all the UDFs that are currently hacked into the various parsers 
 using this new functionality.
 Depending on how big this refactoring becomes we could split parts 1 and 2 from 
 part 3 above.






[jira] [Commented] (SPARK-4185) JSON schema inference failed when dealing with type conflicts in arrays

2014-11-01 Thread William Benton (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14193491#comment-14193491
 ] 

William Benton commented on SPARK-4185:
---

I'm actually not sure this is a bug!  My main concern in this case is that 
inferring any typing for this collection of objects makes it very difficult to 
write meaningful queries.  In the fedmsg case, the problem was that the source 
data overloaded the meaning of a field name, so I was able to preprocess the 
fields to do the renaming.  I was thinking that maybe a good solution might be 
to have Spark SQL automatically rename fields with conflicting types in 
different records (e.g. to “branches_1” and “branches_2” in this case).

 JSON schema inference failed when dealing with type conflicts in arrays
 ---

 Key: SPARK-4185
 URL: https://issues.apache.org/jira/browse/SPARK-4185
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: Yin Huai
Assignee: Yin Huai

 {code}
 val sqlContext = new org.apache.spark.sql.SQLContext(sparkContext)
 val diverging = sparkContext.parallelize(List("""{"branches": ["foo"]}""", 
 """{"branches": [{"foo": 42}]}"""))
 sqlContext.jsonRDD(diverging)  // throws a MatchError
 {code}
 The case is from http://chapeau.freevariable.com/2014/10/fedmsg-and-spark.html






[jira] [Created] (SPARK-4190) Allow users to provide transformation rules at JSON ingest

2014-11-01 Thread William Benton (JIRA)
William Benton created SPARK-4190:
-

 Summary: Allow users to provide transformation rules at JSON ingest
 Key: SPARK-4190
 URL: https://issues.apache.org/jira/browse/SPARK-4190
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.1.0, 1.2.0
Reporter: William Benton


It would be great if it were possible to provide transformation rules (to be 
executed within jsonRDD or jsonFile) so that users could 

   (1) deal with JSON files that confound schema inference or are otherwise 
insufficiently disciplined, or
   (2) simply perform arbitrary object transformations at ingest before a 
schema is inferred.

json4s, which Spark already uses, has nice interfaces for specifying 
transformations as partial functions on objects and accessing nested structures 
via path expressions.  (We might want to introduce an abstraction atop json4s 
for a public API, but the json4s API seems like a good first step.)  There are 
some examples of these transformations at https://github.com/json4s/json4s and 
at http://chapeau.freevariable.com/2014/10/fedmsg-and-spark.html
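
For example, a json4s rule of the kind described might look like this (a sketch: the renamed field name is invented, and in practice such a rule would run inside jsonRDD/jsonFile before schema inference):

{code}
import org.json4s._
import org.json4s.native.JsonMethods._

// Rename "branches" when its value is an array of objects, so records with
// conflicting types for the same field name no longer collide.
val fixup: PartialFunction[JField, JField] = {
  case ("branches", v @ JArray(JObject(_) :: _)) => ("branch_objects", v)
}

val cleaned = parse("""{"branches": [{"foo": 42}]}""") transformField fixup
println(compact(render(cleaned)))   // {"branch_objects":[{"foo":42}]}
{code}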






[jira] [Commented] (SPARK-4190) Allow users to provide transformation rules at JSON ingest

2014-11-01 Thread William Benton (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14193573#comment-14193573
 ] 

William Benton commented on SPARK-4190:
---

I'll take this, since I'm interested in working on it and it seems like a quick 
fix.  [~yhuai], will you be willing to review a WIP PR sometime soon?

 Allow users to provide transformation rules at JSON ingest
 --

 Key: SPARK-4190
 URL: https://issues.apache.org/jira/browse/SPARK-4190
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.1.0, 1.2.0
Reporter: William Benton

 It would be great if it were possible to provide transformation rules (to be 
 executed within jsonRDD or jsonFile) so that users could 
(1) deal with JSON files that confound schema inference or are otherwise 
 insufficiently disciplined, or
(2) simply perform arbitrary object transformations at ingest before a 
 schema is inferred.
 json4s, which Spark already uses, has nice interfaces for specifying 
 transformations as partial functions on objects and accessing nested 
 structures via path expressions.  (We might want to introduce an abstraction 
 atop json4s for a public API, but the json4s API seems like a good first 
 step.)  There are some examples of these transformations at 
 https://github.com/json4s/json4s and at 
 http://chapeau.freevariable.com/2014/10/fedmsg-and-spark.html






[jira] [Commented] (SPARK-2863) Emulate Hive type coercion in native reimplementations of Hive functions

2014-10-11 Thread William Benton (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14168403#comment-14168403
 ] 

William Benton commented on SPARK-2863:
---

I submitted a PR for this issue this morning:  
https://github.com/apache/spark/pull/2768

(I would have liked to have a solution that leaned on the type system a little 
more -- in particular, even just representing function signatures as tuples 
rather than as lists -- but the approach I took in the PR is simple, easy to 
understand, and easy to validate.)

 Emulate Hive type coercion in native reimplementations of Hive functions
 

 Key: SPARK-2863
 URL: https://issues.apache.org/jira/browse/SPARK-2863
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.0
Reporter: William Benton
Assignee: William Benton

 Native reimplementations of Hive functions no longer have the same 
 type-coercion behavior as they would if executed via Hive.  As [Michael 
 Armbrust points 
 out|https://github.com/apache/spark/pull/1750#discussion_r15790970], queries 
 like {{SELECT SQRT(2) FROM src LIMIT 1}} succeed in Hive but fail if 
 {{SQRT}} is implemented natively.
 Spark SQL should have Hive-compatible type coercions for arguments to 
 natively-implemented functions.






[jira] [Created] (SPARK-3699) sbt console tasks don't clean up SparkContext

2014-09-26 Thread William Benton (JIRA)
William Benton created SPARK-3699:
-

 Summary: sbt console tasks don't clean up SparkContext
 Key: SPARK-3699
 URL: https://issues.apache.org/jira/browse/SPARK-3699
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: William Benton
Priority: Minor
 Fix For: 1.1.1


Because the sbt console tasks for the hive and sql projects don't stop the 
SparkContext upon exit, users are faced with an ugly stack trace.
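
A sketch of the kind of sbt settings that address this, assuming the console brings up a SparkContext named {{sc}} via initialCommands (names and options here are illustrative):

{code}
// build.sbt: stop the SparkContext when the console session exits, so that
// quitting the REPL doesn't print a stack trace from the still-running context.
initialCommands in console :=
  """import org.apache.spark.{SparkConf, SparkContext}
    |val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("console"))
    |""".stripMargin

cleanupCommands in console := "sc.stop()"
{code}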






[jira] [Created] (SPARK-3423) Implement BETWEEN support for regular SQL parser

2014-09-05 Thread William Benton (JIRA)
William Benton created SPARK-3423:
-

 Summary: Implement BETWEEN support for regular SQL parser
 Key: SPARK-3423
 URL: https://issues.apache.org/jira/browse/SPARK-3423
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.1.0
Reporter: William Benton
Priority: Minor


The HQL parser supports BETWEEN but the SQLParser currently does not.  It would 
be great if it did.






[jira] [Commented] (SPARK-3423) Implement BETWEEN support for regular SQL parser

2014-09-05 Thread William Benton (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14123814#comment-14123814
 ] 

William Benton commented on SPARK-3423:
---

(PR is here:  https://github.com/apache/spark/pull/2295 ) 

 Implement BETWEEN support for regular SQL parser
 

 Key: SPARK-3423
 URL: https://issues.apache.org/jira/browse/SPARK-3423
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.1.0
Reporter: William Benton
Assignee: William Benton
Priority: Minor

 The HQL parser supports BETWEEN but the SQLParser currently does not.  It 
 would be great if it did.






[jira] [Created] (SPARK-3329) HiveQuerySuite SET tests depend on map orderings

2014-08-31 Thread William Benton (JIRA)
William Benton created SPARK-3329:
-

 Summary: HiveQuerySuite SET tests depend on map orderings
 Key: SPARK-3329
 URL: https://issues.apache.org/jira/browse/SPARK-3329
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.2, 1.1.0
Reporter: William Benton
Priority: Trivial


The SET tests in HiveQuerySuite that return multiple values depend on the 
ordering in which map pairs are returned from Hive and can fail spuriously if 
this changes due to environment or library changes.






[jira] [Commented] (SPARK-2863) Emulate Hive type coercion in native reimplementations of Hive functions

2014-08-21 Thread William Benton (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14105537#comment-14105537
 ] 

William Benton commented on SPARK-2863:
---

I wrote up how Hive handles type coercions in a blog post:

http://chapeau.freevariable.com/2014/08/existing-system-coercion.html

The short version is that strings can be coerced to doubles or decimals and (in 
Hive 0.13) decimals can be coerced to doubles for numeric functions.  As a 
first pass, I propose extending the numeric function helpers to handle strings.
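
As a sketch of what "extending the numeric function helpers to handle strings" could look like (illustrative only, not the Catalyst implementation):

{code}
// Coerce an argument to Double before applying a natively implemented numeric
// function, mirroring Hive's implicit string -> double (and decimal -> double)
// coercions.
def coerceToDouble(v: Any): Option[Double] = v match {
  case d: Double               => Some(d)
  case d: java.math.BigDecimal => Some(d.doubleValue)
  case s: String               =>
    try Some(s.toDouble) catch { case _: NumberFormatException => None }
  case _                       => None
}

def nativeSqrt(arg: Any): Any =
  coerceToDouble(arg).map(math.sqrt).getOrElse(null)   // e.g. SQRT("2") behaves as in Hive
{code}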

 Emulate Hive type coercion in native reimplementations of Hive functions
 

 Key: SPARK-2863
 URL: https://issues.apache.org/jira/browse/SPARK-2863
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.0
Reporter: William Benton
Assignee: William Benton

 Native reimplementations of Hive functions no longer have the same 
 type-coercion behavior as they would if executed via Hive.  As [Michael 
 Armbrust points 
 out|https://github.com/apache/spark/pull/1750#discussion_r15790970], queries 
 like {{SELECT SQRT(2) FROM src LIMIT 1}} succeed in Hive but fail if 
 {{SQRT}} is implemented natively.
 Spark SQL should have Hive-compatible type coercions for arguments to 
 natively-implemented functions.






[jira] [Created] (SPARK-2863) Emulate Hive type coercion in native reimplementations of Hive UDFs

2014-08-05 Thread William Benton (JIRA)
William Benton created SPARK-2863:
-

 Summary: Emulate Hive type coercion in native reimplementations of 
Hive UDFs
 Key: SPARK-2863
 URL: https://issues.apache.org/jira/browse/SPARK-2863
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.0
Reporter: William Benton


Native reimplementations of Hive functions no longer have the same 
type-coercion behavior as they would if executed via Hive.  As [Michael 
Armbrust points out|https://github.com/apache/spark/pull/1750#discussion_r15790970], 
queries like {{SELECT SQRT(2) FROM src LIMIT 1}} 
succeed in Hive but fail if {{SQRT}} is implemented natively.

Spark SQL should have Hive-compatible type coercions for arguments to 
natively-implemented functions.






[jira] [Created] (SPARK-2813) Implement SQRT() directly in Catalyst

2014-08-02 Thread William Benton (JIRA)
William Benton created SPARK-2813:
-

 Summary: Implement SQRT() directly in Catalyst
 Key: SPARK-2813
 URL: https://issues.apache.org/jira/browse/SPARK-2813
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.0.0
Reporter: William Benton
Priority: Minor
 Fix For: 1.1.0


Instead of delegating square root computation to a Hive UDF, Spark should 
implement SQL SQRT() directly.






[jira] [Updated] (SPARK-2813) Implement SQRT() directly in Spark SQL

2014-08-02 Thread William Benton (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

William Benton updated SPARK-2813:
--

Summary: Implement SQRT() directly in Spark SQL  (was: Implement SQRT() 
directly in Spark SQP)

 Implement SQRT() directly in Spark SQL
 --

 Key: SPARK-2813
 URL: https://issues.apache.org/jira/browse/SPARK-2813
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.0.0
Reporter: William Benton
Priority: Minor
 Fix For: 1.1.0


 Instead of delegating square root computation to a Hive UDF, Spark should 
 implement SQL SQRT() directly.






[jira] [Updated] (SPARK-2813) Implement SQRT() directly in Spark SQP

2014-08-02 Thread William Benton (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

William Benton updated SPARK-2813:
--

Summary: Implement SQRT() directly in Spark SQP  (was: Implement SQRT() 
directly in Catalyst)

 Implement SQRT() directly in Spark SQP
 --

 Key: SPARK-2813
 URL: https://issues.apache.org/jira/browse/SPARK-2813
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.0.0
Reporter: William Benton
Priority: Minor
 Fix For: 1.1.0


 Instead of delegating square root computation to a Hive UDF, Spark should 
 implement SQL SQRT() directly.






[jira] [Commented] (SPARK-2226) HAVING should be able to contain aggregate expressions that don't appear in the aggregation list.

2014-07-19 Thread William Benton (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14067503#comment-14067503
 ] 

William Benton commented on SPARK-2226:
---

[~rxin], yes, and I'm mostly done.  I'll post a PR soon!

 HAVING should be able to contain aggregate expressions that don't appear in 
 the aggregation list. 
 --

 Key: SPARK-2226
 URL: https://issues.apache.org/jira/browse/SPARK-2226
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.0.0
Reporter: Reynold Xin
Assignee: William Benton

 https://github.com/apache/hive/blob/trunk/ql/src/test/queries/clientpositive/having.q
 This test file contains the following query:
 {code}
 SELECT key FROM src GROUP BY key HAVING max(value) > "val_255";
 {code}
 Once we fixed this issue, we should whitelist having.q.





[jira] [Created] (SPARK-2486) Utils.getCallSite can crash under JVMTI profilers

2014-07-14 Thread William Benton (JIRA)
William Benton created SPARK-2486:
-

 Summary: Utils.getCallSite can crash under JVMTI profilers
 Key: SPARK-2486
 URL: https://issues.apache.org/jira/browse/SPARK-2486
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.1
 Environment: running under profilers (observed on OS X under YourKit 
with CPU profiling and/or object allocation site tracking enabled)
Reporter: William Benton
Priority: Minor


When running under an instrumenting profiler, Utils.getCallSite sometimes 
crashes with an NPE while examining stack trace elements.





[jira] [Commented] (SPARK-2486) Utils.getCallSite can crash under JVMTI profilers

2014-07-14 Thread William Benton (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14061625#comment-14061625
 ] 

William Benton commented on SPARK-2486:
---

A (trivial but functional) workaround is here:  
https://github.com/apache/spark/pull/1413

 Utils.getCallSite can crash under JVMTI profilers
 -

 Key: SPARK-2486
 URL: https://issues.apache.org/jira/browse/SPARK-2486
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.1
 Environment: running under profilers (observed on OS X under YourKit 
 with CPU profiling and/or object allocation site tracking enabled)
Reporter: William Benton
Priority: Minor

 When running under an instrumenting profiler, Utils.getCallSite sometimes 
 crashes with an NPE while examining stack trace elements.





[jira] [Commented] (SPARK-2407) Implement SQL SUBSTR() directly in Catalyst

2014-07-10 Thread William Benton (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14057725#comment-14057725
 ] 

William Benton commented on SPARK-2407:
---

Here's the PR: https://github.com/apache/spark/pull/1359


 Implement SQL SUBSTR() directly in Catalyst
 ---

 Key: SPARK-2407
 URL: https://issues.apache.org/jira/browse/SPARK-2407
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.0
Reporter: William Benton
Assignee: William Benton

 Currently SQL SUBSTR/SUBSTRING() is delegated to Hive.  It would be nice to 
 implement this directly.





[jira] [Created] (SPARK-2407) Implement SQL SUBSTR() directly in Catalyst

2014-07-08 Thread William Benton (JIRA)
William Benton created SPARK-2407:
-

 Summary: Implement SQL SUBSTR() directly in Catalyst
 Key: SPARK-2407
 URL: https://issues.apache.org/jira/browse/SPARK-2407
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.0
Reporter: William Benton


Currently SQL SUBSTR/SUBSTRING() is delegated to Hive.  It would be nice to 
implement this directly.





[jira] [Commented] (SPARK-2407) Implement SQL SUBSTR() directly in Catalyst

2014-07-08 Thread William Benton (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14055232#comment-14055232
 ] 

William Benton commented on SPARK-2407:
---

I have this on a branch and will submit a PR as soon as I'm done running the 
test suite locally.

 Implement SQL SUBSTR() directly in Catalyst
 ---

 Key: SPARK-2407
 URL: https://issues.apache.org/jira/browse/SPARK-2407
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.0
Reporter: William Benton

 Currently SQL SUBSTR/SUBSTRING() is delegated to Hive.  It would be nice to 
 implement this directly.





[jira] [Commented] (SPARK-2226) HAVING should be able to contain aggregate expressions that don't appear in the aggregation list.

2014-07-08 Thread William Benton (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14055499#comment-14055499
 ] 

William Benton commented on SPARK-2226:
---

Thanks, [~marmbrus], this makes sense!  I'll ping you if I get stuck.

 HAVING should be able to contain aggregate expressions that don't appear in 
 the aggregation list. 
 --

 Key: SPARK-2226
 URL: https://issues.apache.org/jira/browse/SPARK-2226
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.0.0
Reporter: Reynold Xin
Assignee: William Benton

 https://github.com/apache/hive/blob/trunk/ql/src/test/queries/clientpositive/having.q
 This test file contains the following query:
 {code}
 SELECT key FROM src GROUP BY key HAVING max(value) > "val_255";
 {code}
 Once we fixed this issue, we should whitelist having.q.





[jira] [Commented] (SPARK-2225) Turn HAVING without GROUP BY into WHERE

2014-06-20 Thread William Benton (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14039335#comment-14039335
 ] 

William Benton commented on SPARK-2225:
---

So the Hive test suite treats HAVING without GROUP BY as an error (see 
having1.q), although it is accepted by some dialects (like SQL Server).  I'm 
happy to take this, though, if this is what we want to do here.

 Turn HAVING without GROUP BY into WHERE
 ---

 Key: SPARK-2225
 URL: https://issues.apache.org/jira/browse/SPARK-2225
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.0.0
Reporter: Reynold Xin

 See http://msdn.microsoft.com/en-US/library/8hhs5f4e(v=vs.80).aspx
 The HAVING clause specifies conditions that determine the groups included in 
 the query. If the SQL SELECT statement does not contain aggregate functions, 
 you can use a SQL SELECT statement that contains a HAVING clause without a 
 GROUP BY clause.





[jira] [Commented] (SPARK-2180) HiveQL doesn't support GROUP BY with HAVING clauses

2014-06-19 Thread William Benton (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14037697#comment-14037697
 ] 

William Benton commented on SPARK-2180:
---

PR is here:  https://github.com/apache/spark/pull/1136

 HiveQL doesn't support GROUP BY with HAVING clauses
 ---

 Key: SPARK-2180
 URL: https://issues.apache.org/jira/browse/SPARK-2180
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.0
Reporter: William Benton
Priority: Minor

 The HiveQL implementation doesn't support HAVING clauses for aggregations.  
 This prevents some of the TPCDS benchmarks from running.





[jira] [Commented] (SPARK-2180) HiveQL doesn't support GROUP BY with HAVING clauses

2014-06-18 Thread William Benton (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14035907#comment-14035907
 ] 

William Benton commented on SPARK-2180:
---

(I'm working on a fix and will submit a PR soon.)

 HiveQL doesn't support GROUP BY with HAVING clauses
 ---

 Key: SPARK-2180
 URL: https://issues.apache.org/jira/browse/SPARK-2180
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.0
Reporter: William Benton
Priority: Minor

 The HiveQL implementation doesn't support HAVING clauses for aggregations.  
 This prevents some of the TPCDS benchmarks from running.





[jira] [Commented] (SPARK-571) Forbid return statements when cleaning closures

2014-05-14 Thread William Benton (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13993919#comment-13993919
 ] 

William Benton commented on SPARK-571:
--

I have a patch for this and will submit a PR later today; can someone please 
assign this to me?

 Forbid return statements when cleaning closures
 ---

 Key: SPARK-571
 URL: https://issues.apache.org/jira/browse/SPARK-571
 Project: Spark
  Issue Type: Improvement
Reporter: tjhunter

 By mistake, I wrote some code like this:
 {code}
 object Foo {
   def main() {
     val sc = new SparkContext(...)
     sc.parallelize(0 to 10, 10).map({ ... return 1 ... }).collect
   }
 }
 {code}
 This compiles fine and actually runs using the local scheduler. However, 
 using the Mesos scheduler throws a NotSerializableException in the 
 CollectTask. I agree the result of the program above should be undefined or 
 it should be an error. Would it be possible to have more explicit messages?





[jira] [Commented] (SPARK-1807) Modify SPARK_EXECUTOR_URI to allow for script execution in Mesos.

2014-05-14 Thread William Benton (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995469#comment-13995469
 ] 

William Benton commented on SPARK-1807:
---

I'll be working on some related things this week; could someone assign this to 
me?

 Modify SPARK_EXECUTOR_URI to allow for script execution in Mesos.
 -

 Key: SPARK-1807
 URL: https://issues.apache.org/jira/browse/SPARK-1807
 Project: Spark
  Issue Type: Improvement
  Components: Mesos
Affects Versions: 0.9.0
Reporter: Timothy St. Clair

 Modify Mesos Scheduler integration to allow SPARK_EXECUTOR_URI to be an 
 executable script.  This allows admins to launch spark in any fashion they 
 desire, vs. just tarball fetching + implied context.   





[jira] [Commented] (SPARK-1789) Multiple versions of Netty dependencies cause FlumeStreamSuite failure

2014-05-13 Thread William Benton (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13995560#comment-13995560
 ] 

William Benton commented on SPARK-1789:
---

Sean, we're currently building against Akka 2.3.0 in Fedora (it's a trivial 
source patch against 0.9.1; I haven't investigated the delta against 1.0 yet).  
Are there reasons why Akka 2.3.0 is a bad idea for Spark in general?  If not, 
I'm happy to file a JIRA for updating the dependency and contribute my patch 
upstream.

 Multiple versions of Netty dependencies cause FlumeStreamSuite failure
 --

 Key: SPARK-1789
 URL: https://issues.apache.org/jira/browse/SPARK-1789
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 0.9.1
Reporter: Sean Owen
Assignee: Sean Owen
  Labels: flume, netty, test
 Fix For: 1.0.0


 TL;DR is there is a bit of JAR hell trouble with Netty, that can be mostly 
 resolved and will resolve a test failure.
 I hit the error described at 
 http://apache-spark-user-list.1001560.n3.nabble.com/SparkContext-startup-time-out-td1753.html
  while running FlumeStreamingSuite, and have for a short while (is it just 
 me?)
 velvia notes:
 I have found a workaround.  If you add akka 2.2.4 to your dependencies, then 
 everything works, probably because akka 2.2.4 brings in a newer version of 
 Jetty. 
 There are at least 3 versions of Netty in play in the build:
 - the new Flume 1.4.0 dependency brings in io.netty:netty:3.4.0.Final, and 
 that is the immediate problem
 - the custom version of akka 2.2.3 depends on io.netty:netty:3.6.6.
 - but, Spark Core directly uses io.netty:netty-all:4.0.17.Final
 The POMs try to exclude other versions of netty, but are excluding 
 org.jboss.netty:netty, when in fact older versions of io.netty:netty (not 
 netty-all) are also an issue.
 The org.jboss.netty:netty excludes are largely unnecessary. I replaced many 
 of them with io.netty:netty exclusions until everything agreed on 
 io.netty:netty-all:4.0.17.Final.
 But this didn't work, since Akka 2.2.3 doesn't work with Netty 4.x. 
 Down-grading to 3.6.6.Final across the board made some Spark code not compile.
 If the build *keeps* io.netty:netty:3.6.6.Final as well, everything seems to 
 work. Part of the reason seems to be that Netty 3.x used the old 
 `org.jboss.netty` packages. This is less than ideal, but is no worse than the 
 current situation. 
 So this PR resolves the issue and improves the JAR hell, even if it leaves 
 the existing theoretical Netty 3-vs-4 conflict:
 - Remove org.jboss.netty excludes where possible, for clarity; they're not 
 needed except with Hadoop artifacts
 - Add io.netty:netty excludes where needed -- except, let akka keep its 
 io.netty:netty
 - Change a bit of test code that actually depended on Netty 3.x, to use 4.x 
 equivalent
 - Update SBT build accordingly
 A better change would be to update Akka far enough such that it agrees on 
 Netty 4.x, but I don't know if that's feasible.





[jira] [Commented] (SPARK-1781) Generalized validity checking for configuration parameters

2014-05-12 Thread William Benton (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13993893#comment-13993893
 ] 

William Benton commented on SPARK-1781:
---

Could someone assign this issue to me?

 Generalized validity checking for configuration parameters
 --

 Key: SPARK-1781
 URL: https://issues.apache.org/jira/browse/SPARK-1781
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: William Benton
Priority: Minor

 Issues like SPARK-1779 could be handled easily by a general mechanism for 
 specifying whether or not a configuration parameter value is valid or not 
 (and then excepting or warning and switching to a default value if it is 
 not).  I think it's possible to do this in a fairly lightweight fashion.





[jira] [Commented] (SPARK-729) Closures not always serialized at capture time

2014-05-06 Thread William Benton (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13990729#comment-13990729
 ] 

William Benton commented on SPARK-729:
--

So the straightforward approach (immediately serializing and deserializing a 
closure in ClosureCleaner.clean) causes a couple of problems in Spark 1.0 that 
weren't obvious from the 0.9.1 test suite (that is, they existed but weren't 
exposed by the suite).  

Most notably, if we serialize closures immediately, we might replace the only 
reference to a broadcast variable object with a serialized copy of that object, 
so the original could be cleaned up by ContextCleaner before the closure has a 
chance to execute.  

I've been looking at ways to solve this but thought I'd provide a status update 
here in the meantime.
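
To make the hazard concrete, a sketch (not actual Spark internals; assumes an existing SparkContext {{sc}}):

{code}
// If clean() eagerly serialized and deserialized the closure, the cleaned copy
// would hold a *deserialized* Broadcast handle, and the original below might be
// the only strong reference to the real broadcast object.
val lookup = sc.broadcast(Map(1 -> "a", 2 -> "b"))
val f = (x: Int) => lookup.value.getOrElse(x, "?")

// Roughly what an eager ClosureCleaner.clean would do:
//   val cleaned = serializer.deserialize(serializer.serialize(f))
// If nothing else references `lookup`, ContextCleaner may remove the
// broadcast's blocks before rdd.map(cleaned) ever runs on the executors.
{code}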

 Closures not always serialized at capture time
 --

 Key: SPARK-729
 URL: https://issues.apache.org/jira/browse/SPARK-729
 Project: Spark
  Issue Type: Bug
Affects Versions: 0.7.0, 0.7.1
Reporter: Matei Zaharia
Assignee: William Benton

 As seen in 
 https://groups.google.com/forum/?fromgroups=#!topic/spark-users/8pTchwuP2Kk 
 and its corresponding fix on 
 https://github.com/mesos/spark/commit/adba773fab6294b5764d101d248815a7d3cb3558,
  it is possible for a closure referencing a var to see the latest version of 
 that var, instead of the version that was there when the closure was passed 
 to Spark. This is not good when failures or recomputations happen. We need to 
 serialize the closures on capture if possible, perhaps as part of 
 ClosureCleaner.clean.





[jira] [Commented] (SPARK-1501) Assertions in Graph.apply test are never executed

2014-04-15 Thread William Benton (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13969652#comment-13969652
 ] 

William Benton commented on SPARK-1501:
---

Here's the PR:  https://github.com/apache/spark/pull/415

 Assertions in Graph.apply test are never executed
 -

 Key: SPARK-1501
 URL: https://issues.apache.org/jira/browse/SPARK-1501
 Project: Spark
  Issue Type: Test
  Components: GraphX
Affects Versions: 1.0.0
Reporter: William Benton
Priority: Minor
  Labels: test

 The current Graph.apply test in GraphSuite contains assertions within an RDD 
 transformation.  These never execute because the transformation never 
 executes.  I have a (trivial) patch to fix this by collecting the graph 
 triplets first.
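
 A sketch of the general pattern (not the actual GraphSuite code; assumes an existing SparkContext {{sc}}):

 {code}
 val triplets = sc.parallelize(Seq((1, 2), (2, 3)))

 // Wrong: map is lazy, so without an action these assertions never run and
 // the test passes vacuously.
 triplets.map { case (src, dst) => assert(src < dst); (src, dst) }

 // Right: materialize the data first, then assert on the driver.
 triplets.collect().foreach { case (src, dst) => assert(src < dst) }
 {code}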


