[jira] [Commented] (SPARK-7042) Spark version of akka-actor_2.11 is not compatible with the official akka-actor_2.11 2.3.x

2015-05-26 Thread Konstantin Shaposhnikov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14560180#comment-14560180
 ] 

Konstantin Shaposhnikov commented on SPARK-7042:


It looks like akka-zeromq_2.11 is only available for versions 2.3.7+, though 
the rest of the akka libraries are available for 2.3.4.

I wonder if the Akka version can just be updated to the latest, 2.3.11?

 Spark version of akka-actor_2.11 is not compatible with the official 
 akka-actor_2.11 2.3.x
 --

 Key: SPARK-7042
 URL: https://issues.apache.org/jira/browse/SPARK-7042
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.3.1
Reporter: Konstantin Shaposhnikov
Assignee: Konstantin Shaposhnikov
Priority: Minor
 Fix For: 1.5.0


 When connecting to a remote Spark cluster (that runs Spark branch-1.3 built 
 with Scala 2.11) from an application that uses akka 2.3.9 I get the following 
 error:
 {noformat}
 2015-04-22 09:01:38,924 - [WARN] - [akka.remote.ReliableDeliverySupervisor] 
 [sparkDriver-akka.actor.default-dispatcher-5] -
 Association with remote system [akka.tcp://sparkExecutor@server:59007] has 
 failed, address is now gated for [5000] ms.
 Reason is: [akka.actor.Identify; local class incompatible: stream classdesc 
 serialVersionUID = -213377755528332889, local class serialVersionUID = 1].
 {noformat}
 It looks like akka-actor_2.11 2.3.4-spark, which is used by Spark, was built 
 using Scala compiler 2.11.0, which ignores SerialVersionUID annotations 
 (see https://issues.scala-lang.org/browse/SI-8549).
 The following steps can resolve the issue:
 - re-build the custom Akka library used by Spark with a more recent version of 
 the Scala compiler (e.g. 2.11.6) 
 - deploy a new version (e.g. 2.3.4.1-spark) to a Maven repo
 - update the version of Akka used by Spark (master and the 1.3 branch)
 I would also suggest upgrading to the latest Akka version, 2.3.9 (or 2.3.10, 
 which should be released soon).
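 For context, a minimal hedged Scala sketch of the annotation involved 
 (illustrative only, not the actual Akka source):
 {noformat}
 // SI-8549: scalac 2.11.0 ignored explicit @SerialVersionUID values, so the JVM
 // fell back to a computed UID. A class like this, compiled with 2.11.0 on one
 // side and a later compiler on the other, then fails remote deserialization
 // with "local class incompatible" as in the log above.
 @SerialVersionUID(1L)
 case class IdentifyLike(messageId: Any) extends Serializable
 {noformat}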



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7104) Support model save/load in Python's Word2Vec

2015-05-26 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-7104:
-
Target Version/s: 1.5.0  (was: 1.4.0)

 Support model save/load in Python's Word2Vec
 

 Key: SPARK-7104
 URL: https://issues.apache.org/jira/browse/SPARK-7104
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib, PySpark
Reporter: Joseph K. Bradley
Priority: Minor





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7605) Python API for ElementwiseProduct

2015-05-26 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-7605:
-
Target Version/s: 1.5.0
Assignee: Manoj Kumar

 Python API for ElementwiseProduct
 -

 Key: SPARK-7605
 URL: https://issues.apache.org/jira/browse/SPARK-7605
 Project: Spark
  Issue Type: New Feature
  Components: MLlib, PySpark
Affects Versions: 1.4.0
Reporter: Yanbo Liang
Assignee: Manoj Kumar

 Python API for org.apache.spark.mllib.feature.ElementwiseProduct
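 For reference, a hedged sketch of the existing Scala API that the requested 
 Python wrapper would expose (assuming the Spark 1.4 mllib.feature package):
 {noformat}
 import org.apache.spark.mllib.feature.ElementwiseProduct
 import org.apache.spark.mllib.linalg.Vectors

 // Hadamard (element-wise) product of each input vector with a fixed scaling vector
 val scalingVec = Vectors.dense(2.0, 0.5, 1.0)
 val transformer = new ElementwiseProduct(scalingVec)

 val scaled = transformer.transform(Vectors.dense(1.0, 2.0, 3.0))  // -> [2.0, 1.0, 3.0]
 {noformat}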



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6263) Python MLlib API missing items: Utils

2015-05-26 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-6263:
-
Target Version/s: 1.5.0  (was: 1.4.0)

 Python MLlib API missing items: Utils
 -

 Key: SPARK-6263
 URL: https://issues.apache.org/jira/browse/SPARK-6263
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib, PySpark
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Assignee: Kai Sasaki

 This JIRA lists items missing in the Python API for this sub-package of MLlib.
 This list may be incomplete, so please check again when sending a PR to add 
 these features to the Python API.
 Also, please check for major disparities between documentation; some parts of 
 the Python API are less well-documented than their Scala counterparts.  Some 
 items may be listed in the umbrella JIRA linked to this task.
 MLUtils
 * appendBias
 * kFold
 * loadVectors
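 For reference, a hedged Scala sketch of the corresponding MLUtils calls that the 
 Python API still lacks (sc and data are assumed to be defined elsewhere):
 {noformat}
 import org.apache.spark.mllib.util.MLUtils
 import org.apache.spark.mllib.linalg.Vectors

 // appendBias: append a constant 1.0 entry to a feature vector
 val withBias = MLUtils.appendBias(Vectors.dense(1.0, 2.0))

 // kFold: split an RDD into (training, validation) pairs for cross-validation
 val folds = MLUtils.kFold(data, 3, 42)   // 3 folds, seed 42

 // loadVectors: load vectors saved in MLlib's text format
 val vectors = MLUtils.loadVectors(sc, "path/to/vectors")
 {noformat}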



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7605) Python API for ElementwiseProduct

2015-05-26 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14560212#comment-14560212
 ] 

Joseph K. Bradley commented on SPARK-7605:
--

Updated for 1.5. I don't think we'll be able to merge more features into 1.4. 
But we can see about merging this soon after 1.4 QA is done.

 Python API for ElementwiseProduct
 -

 Key: SPARK-7605
 URL: https://issues.apache.org/jira/browse/SPARK-7605
 Project: Spark
  Issue Type: New Feature
  Components: MLlib, PySpark
Affects Versions: 1.4.0
Reporter: Yanbo Liang
Assignee: Manoj Kumar

 Python API for org.apache.spark.mllib.feature.ElementwiseProduct



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6263) Python MLlib API missing items: Utils

2015-05-26 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14560214#comment-14560214
 ] 

Joseph K. Bradley commented on SPARK-6263:
--

Updated for 1.5. I don't think we'll be able to merge more features into 1.4. 
But we can see about merging this soon after 1.4 QA is done.

 Python MLlib API missing items: Utils
 -

 Key: SPARK-6263
 URL: https://issues.apache.org/jira/browse/SPARK-6263
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib, PySpark
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Assignee: Kai Sasaki

 This JIRA lists items missing in the Python API for this sub-package of MLlib.
 This list may be incomplete, so please check again when sending a PR to add 
 these features to the Python API.
 Also, please check for major disparities between documentation; some parts of 
 the Python API are less well-documented than their Scala counterparts.  Some 
 items may be listed in the umbrella JIRA linked to this task.
 MLUtils
 * appendBias
 * kFold
 * loadVectors



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6099) Stabilize mllib ClassificationModel, RegressionModel APIs

2015-05-26 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-6099:
-
Target Version/s: 1.5.0  (was: 1.4.0)

 Stabilize mllib ClassificationModel, RegressionModel APIs
 -

 Key: SPARK-6099
 URL: https://issues.apache.org/jira/browse/SPARK-6099
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley

 The abstractions spark.mllib.classification.ClassificationModel and 
 spark.mllib.regression.RegressionModel have been Experimental for a while.  
 This is a problem since some of the implementing classes are not Experimental 
 (e.g., LogisticRegressionModel).
 We should finalize the API and make it non-Experimental ASAP.
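 For reference, a simplified sketch of the two abstractions being stabilized 
 (method set abridged; see the spark.mllib sources for the full definitions):
 {noformat}
 import org.apache.spark.mllib.linalg.Vector
 import org.apache.spark.rdd.RDD

 trait ClassificationModel extends Serializable {
   def predict(testData: RDD[Vector]): RDD[Double]
   def predict(testData: Vector): Double
 }

 trait RegressionModel extends Serializable {
   def predict(testData: RDD[Vector]): RDD[Double]
   def predict(testData: Vector): Double
 }
 {noformat}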



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6099) Stabilize mllib ClassificationModel, RegressionModel APIs

2015-05-26 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-6099:
-
Target Version/s: 1.4.0  (was: 1.5.0)

 Stabilize mllib ClassificationModel, RegressionModel APIs
 -

 Key: SPARK-6099
 URL: https://issues.apache.org/jira/browse/SPARK-6099
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley

 The abstractions spark.mllib.classification.ClassificationModel and 
 spark.mllib.regression.RegressionModel have been Experimental for a while.  
 This is a problem since some of the implementing classes are not Experimental 
 (e.g., LogisticRegressionModel).
 We should finalize the API and make it non-Experimental ASAP.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7883) Fixing broken trainImplicit example in MLlib Collaborative Filtering documentation.

2015-05-26 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-7883.
--
   Resolution: Fixed
Fix Version/s: 1.4.0
   1.0.3
   1.6.0
   1.2.3
   1.1.2
   1.3.2

Issue resolved by pull request 6422
[https://github.com/apache/spark/pull/6422]

 Fixing broken trainImplicit example in MLlib Collaborative Filtering 
 documentation.
 ---

 Key: SPARK-7883
 URL: https://issues.apache.org/jira/browse/SPARK-7883
 Project: Spark
  Issue Type: Bug
  Components: Documentation, MLlib
Affects Versions: 1.3.1, 1.4.0
Reporter: Mike Dusenberry
Priority: Trivial
 Fix For: 1.3.2, 1.1.2, 1.2.3, 1.6.0, 1.0.3, 1.4.0


 The trainImplicit Scala example near the end of the MLlib Collaborative 
 Filtering documentation refers to an ALS.trainImplicit function signature 
 that does not exist.  Rather than add an extra function, let's just fix the 
 example.
 Currently, the example refers to a function that would have the following 
 signature: 
 def trainImplicit(ratings: RDD[Rating], rank: Int, iterations: Int, alpha: 
 Double) : MatrixFactorizationModel
 Instead, let's change the example to refer to this function, which does exist 
 (notice the addition of the lambda parameter):
 def trainImplicit(ratings: RDD[Rating], rank: Int, iterations: Int, lambda: 
 Double, alpha: Double) : MatrixFactorizationModel
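 A minimal usage sketch of that five-argument overload, as the example should call 
 it (parameter values here are illustrative only; ratings is assumed to be defined):
 {noformat}
 import org.apache.spark.mllib.recommendation.{ALS, Rating}

 val rank = 10
 val numIterations = 10
 val lambda = 0.01   // regularization parameter (the one missing from the doc example)
 val alpha = 0.01    // confidence parameter for implicit feedback

 // ratings: RDD[Rating]
 val model = ALS.trainImplicit(ratings, rank, numIterations, lambda, alpha)
 {noformat}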



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7883) Fixing broken trainImplicit example in MLlib Collaborative Filtering documentation.

2015-05-26 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-7883:
-
Fix Version/s: (was: 1.6.0)

 Fixing broken trainImplicit example in MLlib Collaborative Filtering 
 documentation.
 ---

 Key: SPARK-7883
 URL: https://issues.apache.org/jira/browse/SPARK-7883
 Project: Spark
  Issue Type: Bug
  Components: Documentation, MLlib
Affects Versions: 1.3.1, 1.4.0
Reporter: Mike Dusenberry
Priority: Trivial
 Fix For: 1.0.3, 1.1.2, 1.2.3, 1.3.2, 1.4.0


 The trainImplicit Scala example near the end of the MLlib Collaborative 
 Filtering documentation refers to an ALS.trainImplicit function signature 
 that does not exist.  Rather than add an extra function, let's just fix the 
 example.
 Currently, the example refers to a function that would have the following 
 signature: 
 def trainImplicit(ratings: RDD[Rating], rank: Int, iterations: Int, alpha: 
 Double) : MatrixFactorizationModel
 Instead, let's change the example to refer to this function, which does exist 
 (notice the addition of the lambda parameter):
 def trainImplicit(ratings: RDD[Rating], rank: Int, iterations: Int, lambda: 
 Double, alpha: Double) : MatrixFactorizationModel



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7865) Hadoop Filesystem for eventlog closed before sparkContext stopped

2015-05-26 Thread Zhang, Liye (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14560266#comment-14560266
 ] 

Zhang, Liye commented on SPARK-7865:


Thanks for [~vanzin]'s reply. I was just wondering why the Hadoop filesystem is 
closed before the Spark JVM stops. I think the reason can be found in the 
description of PR [#5560|https://github.com/apache/spark/pull/5560]. 
[~srowen], sorry for opening one more duplicate JIRA; I'll take care of it 
next time.
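For illustration, shutdown ordering on the Hadoop side is controlled by hook 
priorities; a hedged sketch of that mechanism (not necessarily the fix taken in 
the PR) looks like this:
{noformat}
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.util.ShutdownHookManager

// A hook registered with a priority above FileSystem.SHUTDOWN_HOOK_PRIORITY (10)
// runs before Hadoop closes its cached FileSystem instances.
ShutdownHookManager.get().addShutdownHook(new Runnable {
  override def run(): Unit = {
    // flush / stop the event log here, while the FileSystem is still open
  }
}, FileSystem.SHUTDOWN_HOOK_PRIORITY + 30)
{noformat}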

 Hadoop Filesystem for eventlog closed before sparkContext stopped
 -

 Key: SPARK-7865
 URL: https://issues.apache.org/jira/browse/SPARK-7865
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.0
Reporter: Zhang, Liye

 After [SPARK-3090|https://issues.apache.org/jira/browse/SPARK-3090] (patch 
 [#5969|https://github.com/apache/spark/pull/5696]), SparkContext will be 
 stopped automatically if the user forgets to stop it.
 However, when the shutdown hook is called, the event log gives the following 
 exception while flushing its content:
 {noformat}
 15/05/26 17:40:38 INFO spark.SparkContext: Invoking stop() from shutdown hook
 15/05/26 17:40:38 ERROR scheduler.LiveListenerBus: Listener 
 EventLoggingListener threw an exception
 java.lang.reflect.InvocationTargetException
 at sun.reflect.GeneratedMethodAccessor20.invoke(Unknown Source)
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:497)
 at 
 org.apache.spark.scheduler.EventLoggingListener$$anonfun$logEvent$3.apply(EventLoggingListener.scala:144)
 at 
 org.apache.spark.scheduler.EventLoggingListener$$anonfun$logEvent$3.apply(EventLoggingListener.scala:144)
 at scala.Option.foreach(Option.scala:236)
 at 
 org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:144)
 at 
 org.apache.spark.scheduler.EventLoggingListener.onApplicationEnd(EventLoggingListener.scala:188)
 at 
 org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:54)
 at 
 org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
 at 
 org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
 at 
 org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:56)
 at 
 org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:37)
 at 
 org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(AsynchronousListenerBus.scala:79)
 at 
 org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1180)
 at 
 org.apache.spark.util.AsynchronousListenerBus$$anon$1.run(AsynchronousListenerBus.scala:63)
 Caused by: java.io.IOException: Filesystem closed
 at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:323)
 at org.apache.hadoop.hdfs.DFSClient.access$1200(DFSClient.java:78)
 at 
 org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.sync(DFSClient.java:3877)
 at 
 org.apache.hadoop.fs.FSDataOutputStream.sync(FSDataOutputStream.java:97)
 ... 16 more
 {noformat}
 And the following exception while stopping:
 {noformat}
 15/05/26 17:40:39 INFO cluster.SparkDeploySchedulerBackend: Asking each 
 executor to shut down
 15/05/26 17:40:39 ERROR util.Utils: Uncaught exception in thread Spark 
 Shutdown Hook
 java.io.IOException: Filesystem closed
 at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:323)
 at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1057)
 at 
 org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:554)
 at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:788)
 at 
 org.apache.spark.scheduler.EventLoggingListener.stop(EventLoggingListener.scala:209)
 at 
 org.apache.spark.SparkContext$$anonfun$stop$5.apply(SparkContext.scala:1515)
 at 
 org.apache.spark.SparkContext$$anonfun$stop$5.apply(SparkContext.scala:1515)
 at scala.Option.foreach(Option.scala:236)
 at org.apache.spark.SparkContext.stop(SparkContext.scala:1515)
 at 
 org.apache.spark.SparkContext$$anonfun$3.apply$mcV$sp(SparkContext.scala:527)
 at org.apache.spark.util.SparkShutdownHook.run(Utils.scala:2211)
 at 
 org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(Utils.scala:2181)
 at 
 org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(Utils.scala:2181)
 at 
 org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(Utils.scala:2181)
 at 
 

[jira] [Commented] (SPARK-7555) User guide update for ElasticNet

2015-05-26 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14560171#comment-14560171
 ] 

Joseph K. Bradley commented on SPARK-7555:
--

Thanks!

 User guide update for ElasticNet
 

 Key: SPARK-7555
 URL: https://issues.apache.org/jira/browse/SPARK-7555
 Project: Spark
  Issue Type: Documentation
  Components: ML
Reporter: Joseph K. Bradley
Assignee: DB Tsai

 Copied from [SPARK-7443]:
 {quote}
 Now that we have algorithms in spark.ml which are not in spark.mllib, we 
 should start making subsections for the spark.ml API as needed. We can follow 
 the structure of the spark.mllib user guide.
 * The spark.ml user guide can provide: (a) code examples and (b) info on 
 algorithms which do not exist in spark.mllib.
 * We should not duplicate info in the spark.ml guides. Since spark.mllib is 
 still the primary API, we should provide links to the corresponding 
 algorithms in the spark.mllib user guide for more info.
 {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7576) User guide update for spark.ml ElementwiseProduct

2015-05-26 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14560170#comment-14560170
 ] 

Joseph K. Bradley commented on SPARK-7576:
--

Thanks!

 User guide update for spark.ml ElementwiseProduct
 -

 Key: SPARK-7576
 URL: https://issues.apache.org/jira/browse/SPARK-7576
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, ML
Reporter: Joseph K. Bradley
Assignee: Octavian Geagla

 Copied from [SPARK-7443]:
 {quote}
 Now that we have algorithms in spark.ml which are not in spark.mllib, we 
 should start making subsections for the spark.ml API as needed. We can follow 
 the structure of the spark.mllib user guide.
 * The spark.ml user guide can provide: (a) code examples and (b) info on 
 algorithms which do not exist in spark.mllib.
 * We should not duplicate info in the spark.ml guides. Since spark.mllib is 
 still the primary API, we should provide links to the corresponding 
 algorithms in the spark.mllib user guide for more info.
 {quote}
 Note: I created a new subsection for links to spark.ml-specific guides in 
 this JIRA's PR: [SPARK-7557]. This transformer can go within the new 
 subsection. I'll try to get that PR merged ASAP.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7856) Scalable PCA implementation for tall and fat matrices

2015-05-26 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-7856:
-
Issue Type: Improvement  (was: Bug)

 Scalable PCA implementation for tall and fat matrices
 -

 Key: SPARK-7856
 URL: https://issues.apache.org/jira/browse/SPARK-7856
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Tarek Elgamal

 Currently the PCA implementation has the limitation of fitting d^2 
 covariance/Gramian matrix entries in memory (d is the number of 
 columns/dimensions of the matrix). We often need only the largest k principal 
 components. To make PCA really scalable, I suggest an implementation where 
 the memory usage is proportional to the number of principal components k 
 rather than the full dimensionality d. 
 I suggest adopting the solution described in this paper published in 
 SIGMOD 2015 (http://ds.qcri.org/images/profile/tarek_elgamal/sigmod2015.pdf). 
 The paper offers an implementation of Probabilistic PCA (PPCA) which has 
 lower memory and time complexity and could potentially scale to tall and fat 
 matrices, rather than only the tall and skinny matrices supported by the 
 current PCA implementation. 
 Probabilistic PCA could potentially be added to the set of algorithms 
 supported by MLlib; it does not necessarily replace the old PCA 
 implementation.
 A PPCA implementation is included in MATLAB's Statistics and Machine Learning 
 Toolbox (http://www.mathworks.com/help/stats/ppca.html)
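 For context, a hedged sketch of the current approach, which materializes a d x d 
 Gramian on the driver and therefore needs O(d^2) memory (rows: RDD[Vector] is 
 assumed to be defined elsewhere):
 {noformat}
 import org.apache.spark.mllib.linalg.distributed.RowMatrix

 val mat = new RowMatrix(rows)
 val k = 10
 val pc = mat.computePrincipalComponents(k)   // d x k matrix of the top-k components
 val projected = mat.multiply(pc)             // project the data onto those components
 {noformat}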



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7042) Spark version of akka-actor_2.11 is not compatible with the official akka-actor_2.11 2.3.x

2015-05-26 Thread Konstantin Shaposhnikov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14560181#comment-14560181
 ] 

Konstantin Shaposhnikov commented on SPARK-7042:


It looks like akka-zeromq_2.11 is only available for versions 2.3.7+, though 
the rest of the akka libraries are available for 2.3.4.

I wonder if the Akka version can just be updated to the latest, 2.3.11?

 Spark version of akka-actor_2.11 is not compatible with the official 
 akka-actor_2.11 2.3.x
 --

 Key: SPARK-7042
 URL: https://issues.apache.org/jira/browse/SPARK-7042
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.3.1
Reporter: Konstantin Shaposhnikov
Assignee: Konstantin Shaposhnikov
Priority: Minor
 Fix For: 1.5.0


 When connecting to a remote Spark cluster (that runs Spark branch-1.3 built 
 with Scala 2.11) from an application that uses akka 2.3.9 I get the following 
 error:
 {noformat}
 2015-04-22 09:01:38,924 - [WARN] - [akka.remote.ReliableDeliverySupervisor] 
 [sparkDriver-akka.actor.default-dispatcher-5] -
 Association with remote system [akka.tcp://sparkExecutor@server:59007] has 
 failed, address is now gated for [5000] ms.
 Reason is: [akka.actor.Identify; local class incompatible: stream classdesc 
 serialVersionUID = -213377755528332889, local class serialVersionUID = 1].
 {noformat}
 It looks like akka-actor_2.11 2.3.4-spark, which is used by Spark, was built 
 using Scala compiler 2.11.0, which ignores SerialVersionUID annotations 
 (see https://issues.scala-lang.org/browse/SI-8549).
 The following steps can resolve the issue:
 - re-build the custom Akka library used by Spark with a more recent version of 
 the Scala compiler (e.g. 2.11.6) 
 - deploy a new version (e.g. 2.3.4.1-spark) to a Maven repo
 - update the version of Akka used by Spark (master and the 1.3 branch)
 I would also suggest upgrading to the latest Akka version, 2.3.9 (or 2.3.10, 
 which should be released soon).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-7042) Spark version of akka-actor_2.11 is not compatible with the official akka-actor_2.11 2.3.x

2015-05-26 Thread Konstantin Shaposhnikov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Konstantin Shaposhnikov updated SPARK-7042:
---
Comment: was deleted

(was: It looks like akka-zeromq_2.11 is only available for versions 2.3.7+, 
though the rest of the akka libraries are available for 2.3.4.

I wonder if akka version can just be updated to the latest 2.3.11?)

 Spark version of akka-actor_2.11 is not compatible with the official 
 akka-actor_2.11 2.3.x
 --

 Key: SPARK-7042
 URL: https://issues.apache.org/jira/browse/SPARK-7042
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.3.1
Reporter: Konstantin Shaposhnikov
Assignee: Konstantin Shaposhnikov
Priority: Minor
 Fix For: 1.5.0


 When connecting to a remote Spark cluster (that runs Spark branch-1.3 built 
 with Scala 2.11) from an application that uses akka 2.3.9 I get the following 
 error:
 {noformat}
 2015-04-22 09:01:38,924 - [WARN] - [akka.remote.ReliableDeliverySupervisor] 
 [sparkDriver-akka.actor.default-dispatcher-5] -
 Association with remote system [akka.tcp://sparkExecutor@server:59007] has 
 failed, address is now gated for [5000] ms.
 Reason is: [akka.actor.Identify; local class incompatible: stream classdesc 
 serialVersionUID = -213377755528332889, local class serialVersionUID = 1].
 {noformat}
 It looks like akka-actor_2.11 2.3.4-spark, which is used by Spark, was built 
 using Scala compiler 2.11.0, which ignores SerialVersionUID annotations 
 (see https://issues.scala-lang.org/browse/SI-8549).
 The following steps can resolve the issue:
 - re-build the custom Akka library used by Spark with a more recent version of 
 the Scala compiler (e.g. 2.11.6) 
 - deploy a new version (e.g. 2.3.4.1-spark) to a Maven repo
 - update the version of Akka used by Spark (master and the 1.3 branch)
 I would also suggest upgrading to the latest Akka version, 2.3.9 (or 2.3.10, 
 which should be released soon).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7604) Python API for PCA and PCAModel

2015-05-26 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14560209#comment-14560209
 ] 

Joseph K. Bradley commented on SPARK-7604:
--

Updated for 1.5.  I don't think we'll be able to merge more features into 1.4.  
But we can see about merging this soon after 1.4 QA is done.

 Python API for PCA and PCAModel
 ---

 Key: SPARK-7604
 URL: https://issues.apache.org/jira/browse/SPARK-7604
 Project: Spark
  Issue Type: New Feature
  Components: MLlib, PySpark
Affects Versions: 1.4.0
Reporter: Yanbo Liang
Assignee: Yanbo Liang

 Python API for org.apache.spark.mllib.feature.PCA and 
 org.apache.spark.mllib.feature.PCAModel
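 For reference, a hedged sketch of the existing Scala classes the Python API would 
 wrap (data: RDD[Vector] is assumed to be defined elsewhere):
 {noformat}
 import org.apache.spark.mllib.feature.PCA

 val pcaModel = new PCA(5).fit(data)          // keep the top 5 principal components
 val projected = data.map(pcaModel.transform) // project each vector into the PCA space
 {noformat}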



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7884) Allow Spark shuffle APIs to be more customizable

2015-05-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7884:
---

Assignee: Apache Spark

 Allow Spark shuffle APIs to be more customizable
 

 Key: SPARK-7884
 URL: https://issues.apache.org/jira/browse/SPARK-7884
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Matt Massie
Assignee: Apache Spark

 The current Spark shuffle has some hard-coded assumptions about how shuffle 
 managers will read and write data.
 The FileShuffleBlockResolver.forMapTask method creates disk writers by 
 calling BlockManager.getDiskWriter. This forces all shuffle managers to store 
 data using DiskBlockObjectWriter, which reads/writes data in a record-oriented 
 fashion (preventing column-oriented record writing).
 The BlockStoreShuffleFetcher.fetch method relies on the 
 ShuffleBlockFetcherIterator that assumes shuffle data is written using the 
 BlockManager.getDiskWriter method and doesn't allow for customization.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7884) Allow Spark shuffle APIs to be more customizable

2015-05-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7884:
---

Assignee: (was: Apache Spark)

 Allow Spark shuffle APIs to be more customizable
 

 Key: SPARK-7884
 URL: https://issues.apache.org/jira/browse/SPARK-7884
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Matt Massie

 The current Spark shuffle has some hard-coded assumptions about how shuffle 
 managers will read and write data.
 The FileShuffleBlockResolver.forMapTask method creates disk writers by 
 calling BlockManager.getDiskWriter. This forces all shuffle managers to store 
 data using DiskBlockObjectWriter, which reads/writes data in a record-oriented 
 fashion (preventing column-oriented record writing).
 The BlockStoreShuffleFetcher.fetch method relies on the 
 ShuffleBlockFetcherIterator that assumes shuffle data is written using the 
 BlockManager.getDiskWriter method and doesn't allow for customization.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7884) Allow Spark shuffle APIs to be more customizable

2015-05-26 Thread Matt Massie (JIRA)
Matt Massie created SPARK-7884:
--

 Summary: Allow Spark shuffle APIs to be more customizable
 Key: SPARK-7884
 URL: https://issues.apache.org/jira/browse/SPARK-7884
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Matt Massie


The current Spark shuffle has some hard-coded assumptions about how shuffle 
managers will read and write data.

The FileShuffleBlockResolver.forMapTask method creates disk writers by calling 
BlockManager.getDiskWriter. This forces all shuffle managers to store data 
using DiskBlockObjectWriter, which reads/writes data in a record-oriented 
fashion (preventing column-oriented record writing).

The BlockStoreShuffleFetcher.fetch method relies on the 
ShuffleBlockFetcherIterator that assumes shuffle data is written using the 
BlockManager.getDiskWriter method and doesn't allow for customization.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7884) Allow Spark shuffle APIs to be more customizable

2015-05-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14560241#comment-14560241
 ] 

Apache Spark commented on SPARK-7884:
-

User 'massie' has created a pull request for this issue:
https://github.com/apache/spark/pull/6423

 Allow Spark shuffle APIs to be more customizable
 

 Key: SPARK-7884
 URL: https://issues.apache.org/jira/browse/SPARK-7884
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Matt Massie

 The current Spark shuffle has some hard-coded assumptions about how shuffle 
 managers will read and write data.
 The FileShuffleBlockResolver.forMapTask method creates disk writers by 
 calling BlockManager.getDiskWriter. This forces all shuffle managers to store 
 data using DiskBlockObjectWriter, which reads/writes data in a record-oriented 
 fashion (preventing column-oriented record writing).
 The BlockStoreShuffleFetcher.fetch method relies on the 
 ShuffleBlockFetcherIterator that assumes shuffle data is written using the 
 BlockManager.getDiskWriter method and doesn't allow for customization.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3674) Add support for launching YARN clusters in spark-ec2

2015-05-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14560282#comment-14560282
 ] 

Apache Spark commented on SPARK-3674:
-

User 'shivaram' has created a pull request for this issue:
https://github.com/apache/spark/pull/6424

 Add support for launching YARN clusters in spark-ec2
 

 Key: SPARK-3674
 URL: https://issues.apache.org/jira/browse/SPARK-3674
 Project: Spark
  Issue Type: Improvement
  Components: EC2
Reporter: Shivaram Venkataraman
Assignee: Shivaram Venkataraman
 Fix For: 1.4.0


 Right now spark-ec2 only supports launching Spark Standalone clusters. While 
 this is sufficient for basic usage, it is hard to test features or do 
 performance benchmarking on YARN. It would be good to add support for 
 installing and configuring an Apache YARN cluster at a fixed version -- say 
 the latest stable version, 2.4.0



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7824) Collapsing operator reordering and constant folding into a single batch to push down the single side.

2015-05-26 Thread Zhongshuai Pei (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhongshuai Pei updated SPARK-7824:
--
Summary: Collapsing operator reordering and constant folding into a single 
batch to push down the single side.  (was: Extracting and/or condition 
optimizer from BooleanSimplification optimizer and put it before  
PushPredicateThroughJoin optimizer  to push down the single side.)

 Collapsing operator reordering and constant folding into a single batch to 
 push down the single side.
 -

 Key: SPARK-7824
 URL: https://issues.apache.org/jira/browse/SPARK-7824
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.1
Reporter: Zhongshuai Pei

 SQL:
 {noformat}
 select * from tableA join tableB on (a > 3 and b = d) or (a > 3 and b = e)
 {noformat}
 Plan before modify
 {noformat}
 == Optimized Logical Plan ==
 Project [a#293,b#294,c#295,d#296,e#297]
  Join Inner, Some(((a#293 > 3) && ((b#294 = d#296) || (b#294 = e#297))))
   MetastoreRelation default, tablea, None
   MetastoreRelation default, tableb, None
 {noformat}
 Plan after modify
 {noformat}
 == Optimized Logical Plan ==
 Project [a#293,b#294,c#295,d#296,e#297]
  Join Inner, Some(((b#294 = d#296) || (b#294 = e#297)))
   Filter (a#293 > 3)
    MetastoreRelation default, tablea, None
   MetastoreRelation default, tableb, None
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7852) Use continuation when GLMs are run with multiple regParams

2015-05-26 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-7852:
-
Description: 
Per the discussion from 
https://github.com/apache/spark/pull/6386#discussion_r30964263, once we have 
support for specifying the initial weights, we can use this to speed up our 
training.

keywords: continuation, warm start, homotopy (related)

  was:Per the discussion from 
https://github.com/apache/spark/pull/6386#discussion_r30964263 once we have 
support for specifying the initial weights we can use this to create speed up 
our training.


 Use continuation when GLMs are run with multiple regParams
 --

 Key: SPARK-7852
 URL: https://issues.apache.org/jira/browse/SPARK-7852
 Project: Spark
  Issue Type: Bug
  Components: ML
Reporter: holdenk
Priority: Minor

 Per the discussion from 
 https://github.com/apache/spark/pull/6386#discussion_r30964263, once we have 
 support for specifying the initial weights, we can use this to speed up our 
 training.
 keywords: continuation, warm start, homotopy (related)
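 A hedged Scala illustration of the continuation / warm-start idea (not the actual 
 implementation; training: RDD[LabeledPoint] and numFeatures are assumed):
 {noformat}
 import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
 import org.apache.spark.mllib.linalg.Vectors

 // Sweep regParam from large to small, seeding each run with the previous solution.
 val regParams = Seq(1.0, 0.1, 0.01, 0.001)
 var weights = Vectors.zeros(numFeatures)

 for (reg <- regParams) {
   val lr = new LogisticRegressionWithLBFGS()
   lr.optimizer.setRegParam(reg)
   val model = lr.run(training, weights)   // warm start from the previous weights
   weights = model.weights
 }
 {noformat}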



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7880) Silent failure if assembly jar is corrupted

2015-05-26 Thread Andrew Or (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14560167#comment-14560167
 ] 

Andrew Or commented on SPARK-7880:
--

I was testing out RC2 for 1.4 and somehow ended up with a corrupted jar. I 
thought it was a Java 6 / Java 7 incompatibility issue, but it turns out 
something was just wrong with the way I downloaded it (?). Either way, we 
should not hide the error message.

 Silent failure if assembly jar is corrupted
 ---

 Key: SPARK-7880
 URL: https://issues.apache.org/jira/browse/SPARK-7880
 Project: Spark
  Issue Type: Bug
  Components: Spark Submit
Affects Versions: 1.3.0
Reporter: Andrew Or

 If you try to run `bin/spark-submit` with a corrupted jar, you get no output 
 and your application does not run. We should have an informative message that 
 indicates the failure to open the jar instead of silently swallowing it.
 This is caused by this line:
 https://github.com/apache/spark/blob/61664732b25b35f94be35a42cde651cbfd0e02b7/bin/spark-class#L75



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7852) Use continuation when GLMs are run with multiple regParams

2015-05-26 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-7852:
-
Summary: Use continuation when GLMs are run with multiple regParams  (was: 
Add support for re-using weights when training with multiple lambdas)

 Use continuation when GLMs are run with multiple regParams
 --

 Key: SPARK-7852
 URL: https://issues.apache.org/jira/browse/SPARK-7852
 Project: Spark
  Issue Type: Bug
  Components: ML
Reporter: holdenk
Priority: Minor

 Per the discussion from 
 https://github.com/apache/spark/pull/6386#discussion_r30964263, once we have 
 support for specifying the initial weights, we can use this to speed up our 
 training.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6915) VectorIndexer improvements

2015-05-26 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-6915:
-
Target Version/s:   (was: 1.4.0)

 VectorIndexer improvements
 --

 Key: SPARK-6915
 URL: https://issues.apache.org/jira/browse/SPARK-6915
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 1.4.0
Reporter: Joseph K. Bradley
Priority: Minor

 This covers several improvements to VectorIndexer.  They could be handled 
 separately or in 1 PR.
 *Preserving metadata*
 Currently, it preserves non-ML metadata.  This is different from 
 StringIndexer.  We should change it so it does not maintain non-ML metadata.
 Currently, it does not preserve ML-specific input metadata in the output 
 column.  If a feature is already marked as categorical or continuous, we 
 should preserve that metadata (rather than recomputing it).  We should also 
 check that the input data is valid for that metadata.
 *Allow unknown categories*
 Add option for allowing unknown categories, probably via a parameter like 
 allowUnknownCategories.
 If true, then handle unknown categories during transform by assigning them to 
 an extra category index.
 *Index particular features*
 Add option for limiting indexing to particular features.
 This could be specified by an option, or we could handle it via the Preserve 
 metadata task above, where users would denote features as continuous in 
 order to have VectorIndexer ignore them.
 *Performance optimizations*
 See the TODO items within VectorIndexer.scala
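 For context, a hedged sketch of current VectorIndexer usage (data: DataFrame with 
 a vector "features" column is assumed); the parameters proposed above, such as 
 allowUnknownCategories, are hypothetical and do not exist yet:
 {noformat}
 import org.apache.spark.ml.feature.VectorIndexer

 val indexer = new VectorIndexer()
   .setInputCol("features")
   .setOutputCol("indexedFeatures")
   .setMaxCategories(10)   // features with more than 10 distinct values stay continuous

 val indexerModel = indexer.fit(data)
 val indexed = indexerModel.transform(data)   // adds the "indexedFeatures" column
 {noformat}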



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6634) Allow replacing columns in Transformers

2015-05-26 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-6634:
-
Target Version/s: 1.5.0  (was: 1.4.0)

 Allow replacing columns in Transformers
 ---

 Key: SPARK-6634
 URL: https://issues.apache.org/jira/browse/SPARK-6634
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Priority: Minor

 Currently, Transformers do not allow input and output columns to share the 
 same name.  (In fact, this is not allowed but also not even checked.)
 Short-term proposal: Disallow input and output columns with the same name, 
 and add a check in transformSchema.
 Long-term proposal: Allow input & output columns with the same name, where 
 the behavior is that the output columns replace the input columns with the 
 same name.
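 A minimal hedged sketch of the short-term check (hypothetical helper, not the 
 actual spark.ml code):
 {noformat}
 import org.apache.spark.sql.types.StructType

 def validateColumns(schema: StructType, inputCol: String, outputCol: String): Unit = {
   require(inputCol != outputCol,
     s"Input column '$inputCol' and output column '$outputCol' must differ")
   require(schema.fieldNames.contains(inputCol), s"Input column '$inputCol' not found")
   require(!schema.fieldNames.contains(outputCol), s"Output column '$outputCol' already exists")
 }
 {noformat}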



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6295) spark.ml.Evaluator should have evaluate method not taking ParamMap

2015-05-26 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-6295.
--
   Resolution: Fixed
Fix Version/s: 1.4.0
 Assignee: Xiangrui Meng  (was: Joseph K. Bradley)

I'm pretty sure [~mengxr] fixed this in some PR, but I'm having trouble finding 
which one.  But it's fixed for 1.4.0.

 spark.ml.Evaluator should have evaluate method not taking ParamMap
 --

 Key: SPARK-6295
 URL: https://issues.apache.org/jira/browse/SPARK-6295
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Assignee: Xiangrui Meng
Priority: Minor
 Fix For: 1.4.0


 spark.ml.Evaluator requires that the user pass a ParamMap, but it is not 
 always necessary.  It should have a default implementation with no ParamMap 
 (similar to fit() and transform() in Estimator and Transformer).
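 A minimal sketch of the proposed default (shape only, not the actual spark.ml 
 source):
 {noformat}
 import org.apache.spark.ml.param.ParamMap
 import org.apache.spark.sql.DataFrame

 abstract class SimpleEvaluator {
   def evaluate(dataset: DataFrame, paramMap: ParamMap): Double

   // default overload with no ParamMap, mirroring fit()/transform()
   def evaluate(dataset: DataFrame): Double = evaluate(dataset, ParamMap.empty)
 }
 {noformat}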



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7604) Python API for PCA and PCAModel

2015-05-26 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-7604:
-
Target Version/s: 1.5.0
Assignee: Yanbo Liang

 Python API for PCA and PCAModel
 ---

 Key: SPARK-7604
 URL: https://issues.apache.org/jira/browse/SPARK-7604
 Project: Spark
  Issue Type: New Feature
  Components: MLlib, PySpark
Affects Versions: 1.4.0
Reporter: Yanbo Liang
Assignee: Yanbo Liang

 Python API for org.apache.spark.mllib.feature.PCA and 
 org.apache.spark.mllib.feature.PCAModel



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7461) Remove spark.ml Model, and have all Transformers have parent

2015-05-26 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-7461:
-
Target Version/s: 1.5.0  (was: 1.4.0)

 Remove spark.ml Model, and have all Transformers have parent
 

 Key: SPARK-7461
 URL: https://issues.apache.org/jira/browse/SPARK-7461
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Reporter: Joseph K. Bradley

 A recent PR [https://github.com/apache/spark/pull/5980] brought up an issue 
 with the Model abstraction: There are transformers which could be 
 Transformers (created by a user) or Models (created by an Estimator).  This 
 is the first instance, but there will be more such transformers in the future.
 Some possible fixes are:
 * Create 2 separate classes, 1 extending Transformer and 1 extending Model.  
 These would be essentially the same, and they could share code (or have 1 
 wrap the other).  This would bloat the API.
 * Just use Model, with a possibly null parent class.  There is precedent 
 (meta-algorithms like RandomForest producing weak hypothesis Models with no 
 parent).
 * Change Transformer to have a parent which may be null.
 ** *-- Unless there is strong disagreement, I think we should go with this 
 last option.*



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6261) Python MLlib API missing items: Feature

2015-05-26 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-6261:
-
Target Version/s:   (was: 1.4.0)

 Python MLlib API missing items: Feature
 ---

 Key: SPARK-6261
 URL: https://issues.apache.org/jira/browse/SPARK-6261
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib, PySpark
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley

 This JIRA lists items missing in the Python API for this sub-package of MLlib.
 This list may be incomplete, so please check again when sending a PR to add 
 these features to the Python API.
 Also, please check for major disparities between documentation; some parts of 
 the Python API are less well-documented than their Scala counterparts.  Some 
 items may be listed in the umbrella JIRA linked to this task.
 StandardScalerModel
 * All functionality except predict() is missing.
 IDFModel
 * idf
 Word2Vec
 * setMinCount
 Word2VecModel
 * getVectors
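 For reference, a hedged Scala sketch of some of the members listed above (corpus: 
 RDD[Seq[String]] and a fitted idfModel are assumed to exist):
 {noformat}
 import org.apache.spark.mllib.feature.Word2Vec

 val word2vec = new Word2Vec().setMinCount(5)   // setMinCount: missing from the Python API
 val w2vModel = word2vec.fit(corpus)
 val vectors: Map[String, Array[Float]] = w2vModel.getVectors  // getVectors: also missing

 // IDFModel.idf (missing from Python): the inverse-document-frequency vector
 // val idfVector = idfModel.idf
 {noformat}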



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-7577) User guide update for Bucketizer

2015-05-26 Thread Xusen Yin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xusen Yin updated SPARK-7577:
-
Comment: was deleted

(was: Oh, yes. Sorry, I forget it. Thanks for the reminder.)

 User guide update for Bucketizer
 

 Key: SPARK-7577
 URL: https://issues.apache.org/jira/browse/SPARK-7577
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, ML
Reporter: Joseph K. Bradley
Assignee: Xusen Yin

 Copied from [SPARK-7443]:
 {quote}
 Now that we have algorithms in spark.ml which are not in spark.mllib, we 
 should start making subsections for the spark.ml API as needed. We can follow 
 the structure of the spark.mllib user guide.
 * The spark.ml user guide can provide: (a) code examples and (b) info on 
 algorithms which do not exist in spark.mllib.
 * We should not duplicate info in the spark.ml guides. Since spark.mllib is 
 still the primary API, we should provide links to the corresponding 
 algorithms in the spark.mllib user guide for more info.
 {quote}
 Note: I created a new subsection for links to spark.ml-specific guides in 
 this JIRA's PR: [SPARK-7557]. This transformer can go within the new 
 subsection. I'll try to get that PR merged ASAP.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7042) Spark version of akka-actor_2.11 is not compatible with the official akka-actor_2.11 2.3.x

2015-05-26 Thread Konstantin Shaposhnikov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14560166#comment-14560166
 ] 

Konstantin Shaposhnikov commented on SPARK-7042:


That is not true: http://search.maven.org/#browse%7C-1552622333 
(http://search.maven.org/#artifactdetails%7Ccom.typesafe.akka%7Cakka-actor_2.11%7C2.3.4%7Cjar)

What exactly was broken in the Scala 2.11 build?

 Spark version of akka-actor_2.11 is not compatible with the official 
 akka-actor_2.11 2.3.x
 --

 Key: SPARK-7042
 URL: https://issues.apache.org/jira/browse/SPARK-7042
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.3.1
Reporter: Konstantin Shaposhnikov
Assignee: Konstantin Shaposhnikov
Priority: Minor
 Fix For: 1.5.0


 When connecting to a remote Spark cluster (that runs Spark branch-1.3 built 
 with Scala 2.11) from an application that uses akka 2.3.9 I get the following 
 error:
 {noformat}
 2015-04-22 09:01:38,924 - [WARN] - [akka.remote.ReliableDeliverySupervisor] 
 [sparkDriver-akka.actor.default-dispatcher-5] -
 Association with remote system [akka.tcp://sparkExecutor@server:59007] has 
 failed, address is now gated for [5000] ms.
 Reason is: [akka.actor.Identify; local class incompatible: stream classdesc 
 serialVersionUID = -213377755528332889, local class serialVersionUID = 1].
 {noformat}
 It looks like akka-actor_2.11 2.3.4-spark, which is used by Spark, was built 
 using Scala compiler 2.11.0, which ignores SerialVersionUID annotations 
 (see https://issues.scala-lang.org/browse/SI-8549).
 The following steps can resolve the issue:
 - re-build the custom Akka library used by Spark with a more recent version of 
 the Scala compiler (e.g. 2.11.6) 
 - deploy a new version (e.g. 2.3.4.1-spark) to a Maven repo
 - update the version of Akka used by Spark (master and the 1.3 branch)
 I would also suggest upgrading to the latest Akka version, 2.3.9 (or 2.3.10, 
 which should be released soon).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7461) Remove spark.ml Model, and have all Transformers have parent

2015-05-26 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14560205#comment-14560205
 ] 

Joseph K. Bradley commented on SPARK-7461:
--

After speaking with [~mengxr], we're going to delay this decision.  It may no 
longer be a good idea, since there is discussion of ML models including more 
model-specific functionality, such as transient references to the training data 
and results [SPARK-7674].

 Remove spark.ml Model, and have all Transformers have parent
 

 Key: SPARK-7461
 URL: https://issues.apache.org/jira/browse/SPARK-7461
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Reporter: Joseph K. Bradley

 A recent PR [https://github.com/apache/spark/pull/5980] brought up an issue 
 with the Model abstraction: There are transformers which could be 
 Transformers (created by a user) or Models (created by an Estimator).  This 
 is the first instance, but there will be more such transformers in the future.
 Some possible fixes are:
 * Create 2 separate classes, 1 extending Transformer and 1 extending Model.  
 These would be essentially the same, and they could share code (or have 1 
 wrap the other).  This would bloat the API.
 * Just use Model, with a possibly null parent class.  There is precedent 
 (meta-algorithms like RandomForest producing weak hypothesis Models with no 
 parent).
 * Change Transformer to have a parent which may be null.
 ** *-- Unless there is strong disagreement, I think we should go with this 
 last option.*



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7699) Number of executors can be reduced from initial before work is scheduled

2015-05-26 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14560227#comment-14560227
 ] 

Sandy Ryza commented on SPARK-7699:
---

I think tying this to the AM-RM heartbeat would just make things more confusing, 
especially now that the heartbeat interval is variable.  Whether we've had a 
fair chance to allocate resources also depends on internal YARN configurations, 
like the NM-RM heartbeat interval or whether continuous scheduling is enabled.  
I don't think there's any easy notion of fair chance that doesn't rely on a 
timeout.

Another option would be to avoid adjusting targetNumExecutors down before the 
first job is submitted.



 Number of executors can be reduced from initial before work is scheduled
 

 Key: SPARK-7699
 URL: https://issues.apache.org/jira/browse/SPARK-7699
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: meiyoula
Priority: Minor

 spark.dynamicAllocation.minExecutors 2
 spark.dynamicAllocation.initialExecutors  3
 spark.dynamicAllocation.maxExecutors 4
 Just run spark-shell with the above configurations; the initial executor 
 number is 2.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7867) Support revoke role ...

2015-05-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7867:
---

Assignee: (was: Apache Spark)

 Support revoke role ...
 -

 Key: SPARK-7867
 URL: https://issues.apache.org/jira/browse/SPARK-7867
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.1
Reporter: Zhongshuai Pei
Priority: Minor

 sql like 
 {noformat}
 revoke role role_a from user user1;
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7867) Support revoke role ...

2015-05-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7867:
---

Assignee: Apache Spark

 Support revoke role ...
 -

 Key: SPARK-7867
 URL: https://issues.apache.org/jira/browse/SPARK-7867
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.1
Reporter: Zhongshuai Pei
Assignee: Apache Spark
Priority: Minor

 sql like 
 {noformat}
 revoke role role_a from user user1;
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7883) Fixing broken trainImplicit example in MLlib Collaborative Filtering documentation.

2015-05-26 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-7883:
-
Assignee: Mike Dusenberry

 Fixing broken trainImplicit example in MLlib Collaborative Filtering 
 documentation.
 ---

 Key: SPARK-7883
 URL: https://issues.apache.org/jira/browse/SPARK-7883
 Project: Spark
  Issue Type: Bug
  Components: Documentation, MLlib
Affects Versions: 1.0.2, 1.1.1, 1.2.2, 1.3.1, 1.4.0
Reporter: Mike Dusenberry
Assignee: Mike Dusenberry
Priority: Trivial
 Fix For: 1.0.3, 1.1.2, 1.2.3, 1.3.2, 1.4.0


 The trainImplicit Scala example near the end of the MLlib Collaborative 
 Filtering documentation refers to an ALS.trainImplicit function signature 
 that does not exist.  Rather than add an extra function, let's just fix the 
 example.
 Currently, the example refers to a function that would have the following 
 signature: 
 def trainImplicit(ratings: RDD[Rating], rank: Int, iterations: Int, alpha: 
 Double) : MatrixFactorizationModel
 Instead, let's change the example to refer to this function, which does exist 
 (notice the addition of the lambda parameter):
 def trainImplicit(ratings: RDD[Rating], rank: Int, iterations: Int, lambda: 
 Double, alpha: Double) : MatrixFactorizationModel
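 For example, the corrected documentation snippet could call the existing 
 five-argument overload like this (the rank, iteration count, lambda and alpha 
 values below are illustrative):
{code}
import org.apache.spark.mllib.recommendation.{ALS, Rating}
import org.apache.spark.rdd.RDD

// `ratings` is an RDD[Rating] built from implicit feedback (e.g. view or purchase counts).
def buildModel(ratings: RDD[Rating]) = {
  val rank = 10          // number of latent factors
  val numIterations = 10
  val lambda = 0.01      // regularization parameter
  val alpha = 1.0        // confidence parameter for implicit feedback
  ALS.trainImplicit(ratings, rank, numIterations, lambda, alpha)
}
{code}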



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7883) Fixing broken trainImplicit example in MLlib Collaborative Filtering documentation.

2015-05-26 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-7883:
-
 Target Version/s: 1.0.3, 1.1.2, 1.2.3, 1.3.2, 1.4.0
Affects Version/s: 1.0.2
   1.1.1
   1.2.2

 Fixing broken trainImplicit example in MLlib Collaborative Filtering 
 documentation.
 ---

 Key: SPARK-7883
 URL: https://issues.apache.org/jira/browse/SPARK-7883
 Project: Spark
  Issue Type: Bug
  Components: Documentation, MLlib
Affects Versions: 1.0.2, 1.1.1, 1.2.2, 1.3.1, 1.4.0
Reporter: Mike Dusenberry
Assignee: Mike Dusenberry
Priority: Trivial
 Fix For: 1.0.3, 1.1.2, 1.2.3, 1.3.2, 1.4.0


 The trainImplicit Scala example near the end of the MLlib Collaborative 
 Filtering documentation refers to an ALS.trainImplicit function signature 
 that does not exist.  Rather than add an extra function, let's just fix the 
 example.
 Currently, the example refers to a function that would have the following 
 signature: 
 def trainImplicit(ratings: RDD[Rating], rank: Int, iterations: Int, alpha: 
 Double) : MatrixFactorizationModel
 Instead, let's change the example to refer to this function, which does exist 
 (notice the addition of the lambda parameter):
 def trainImplicit(ratings: RDD[Rating], rank: Int, iterations: Int, lambda: 
 Double, alpha: Double) : MatrixFactorizationModel



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7637) StructType.merge slow with large denormalised tables O(N^2)

2015-05-26 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-7637.
-
   Resolution: Fixed
Fix Version/s: 1.5.0

 StructType.merge slow with large denormalised tables O(N^2)
 --

 Key: SPARK-7637
 URL: https://issues.apache.org/jira/browse/SPARK-7637
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.1
Reporter: Rowan Chattaway
Priority: Minor
 Fix For: 1.5.0

   Original Estimate: 24h
  Remaining Estimate: 24h

 StructType.merge does a linear scan through the left schema and, for each 
 element, scans the right schema. This results in an O(N^2) algorithm. 
 I have found this to be very slow when dealing with large denormalised 
 parquet files.
 I would like to make a small change to this function to map the fields of 
 both the left and right schemas, resulting in O(N).
 This gives a sizable increase in performance for large denormalised schemas.
 1x1 column merge: 
 2891ms with the original approach, 
 32ms with the mapped-field approach.
 This merge can be called many times depending upon the number of files whose 
 schemas you need to merge, compounding the cost.
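 A rough sketch of the mapped-field idea (it deliberately ignores the recursive 
 type merging and conflict handling the real StructType.merge performs):
{code}
import org.apache.spark.sql.types.{StructField, StructType}

// Index the right schema's fields by name once, then walk the left schema,
// instead of rescanning the right schema for every left field: O(N) instead of O(N^2).
def mergeByName(left: StructType, right: StructType): StructType = {
  val rightByName: Map[String, StructField] = right.fields.map(f => f.name -> f).toMap
  val mergedLeft = left.fields.map { lf =>
    rightByName.get(lf.name)
      .map(rf => lf.copy(nullable = lf.nullable || rf.nullable))  // simplified merge rule
      .getOrElse(lf)
  }
  val leftNames = left.fields.map(_.name).toSet
  StructType(mergedLeft ++ right.fields.filterNot(f => leftNames.contains(f.name)))
}
{code}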



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7577) User guide update for Bucketizer

2015-05-26 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14560278#comment-14560278
 ] 

Xusen Yin commented on SPARK-7577:
--

Oh, yes. Sorry, I forgot it. Thanks for the reminder.

 User guide update for Bucketizer
 

 Key: SPARK-7577
 URL: https://issues.apache.org/jira/browse/SPARK-7577
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, ML
Reporter: Joseph K. Bradley
Assignee: Xusen Yin

 Copied from [SPARK-7443]:
 {quote}
 Now that we have algorithms in spark.ml which are not in spark.mllib, we 
 should start making subsections for the spark.ml API as needed. We can follow 
 the structure of the spark.mllib user guide.
 * The spark.ml user guide can provide: (a) code examples and (b) info on 
 algorithms which do not exist in spark.mllib.
 * We should not duplicate info in the spark.ml guides. Since spark.mllib is 
 still the primary API, we should provide links to the corresponding 
 algorithms in the spark.mllib user guide for more info.
 {quote}
 Note: I created a new subsection for links to spark.ml-specific guides in 
 this JIRA's PR: [SPARK-7557]. This transformer can go within the new 
 subsection. I'll try to get that PR merged ASAP.
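 For example, the new subsection could carry a short snippet along these lines 
 (the column names and split points are illustrative, and {{df}} is assumed to 
 be a DataFrame with a numeric input column):
{code}
import org.apache.spark.ml.feature.Bucketizer

// Bucket a continuous column into 4 ranges defined by the split points.
val bucketizer = new Bucketizer()
  .setInputCol("features")
  .setOutputCol("bucketedFeatures")
  .setSplits(Array(Double.NegativeInfinity, -0.5, 0.0, 0.5, Double.PositiveInfinity))

val bucketed = bucketizer.transform(df)
{code}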



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7042) Spark version of akka-actor_2.11 is not compatible with the official akka-actor_2.11 2.3.x

2015-05-26 Thread Konstantin Shaposhnikov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14560283#comment-14560283
 ] 

Konstantin Shaposhnikov commented on SPARK-7042:


It looks like the Spark-specific akka-zeromq version (2.3.4-spark) has been 
modified to work with Scala 2.11.

In fact, the standard build of akka-zeromq_2.11 (that is available for versions 
2.3.7+) depends on the zeromq Scala bindings created by the Spark project 
(org.spark-project.zeromq:zeromq-scala-binding_2.11:0.0.7-spark).


 Spark version of akka-actor_2.11 is not compatible with the official 
 akka-actor_2.11 2.3.x
 --

 Key: SPARK-7042
 URL: https://issues.apache.org/jira/browse/SPARK-7042
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.3.1
Reporter: Konstantin Shaposhnikov
Assignee: Konstantin Shaposhnikov
Priority: Minor
 Fix For: 1.5.0


 When connecting to a remote Spark cluster (that runs Spark branch-1.3 built 
 with Scala 2.11) from an application that uses akka 2.3.9 I get the following 
 error:
 {noformat}
 2015-04-22 09:01:38,924 - [WARN] - [akka.remote.ReliableDeliverySupervisor] 
 [sparkDriver-akka.actor.default-dispatcher-5] -
 Association with remote system [akka.tcp://sparkExecutor@server:59007] has 
 failed, address is now gated for [5000] ms.
 Reason is: [akka.actor.Identify; local class incompatible: stream classdesc 
 serialVersionUID = -213377755528332889, local class serialVersionUID = 1].
 {noformat}
 It looks like akka-actor_2.11 2.3.4-spark that is used by Spark has been 
 built using Scala compiler 2.11.0 that ignores SerialVersionUID annotations 
 (see https://issues.scala-lang.org/browse/SI-8549).
 The following steps can resolve the issue:
 - re-build the custom akka library that is used by Spark with the more recent 
 version of Scala compiler (e.g. 2.11.6) 
 - deploy a new version (e.g. 2.3.4.1-spark) to a maven repo
 - update version of akka used by spark (master and 1.3 branch)
 I would also suggest to upgrade to the latest version of akka 2.3.9 (or 
 2.3.10 that should be released soon).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7885) add config to control map aggregation in spark sql

2015-05-26 Thread jeanlyn (JIRA)
jeanlyn created SPARK-7885:
--

 Summary: add config to control map aggregation in spark sql
 Key: SPARK-7885
 URL: https://issues.apache.org/jira/browse/SPARK-7885
 Project: Spark
  Issue Type: Improvement
Affects Versions: 1.3.1, 1.2.2, 1.2.0
Reporter: jeanlyn


For now, *execution.HashAggregation* adds map-side aggregation in order to 
decrease the shuffle data. However, we found a GC problem when we use this 
optimization, and eventually the executor crashes. For example,
{noformat} 
select sale_ord_id as order_id,
  coalesce(sum(sku_offer_amount),0.0) as sku_offer_amount,
  coalesce(sum(suit_offer_amount),0.0) as suit_offer_amount,
  coalesce(sum(flash_gp_offer_amount),0.0) + 
coalesce(sum(gp_offer_amount),0.0) as gp_offer_amount,
  coalesce(sum(flash_gp_offer_amount),0.0) as flash_gp_offer_amount,
  coalesce(sum(full_minus_offer_amount),0.0) as full_rebate_offer_amount,
  0.0 as telecom_point_offer_amount,
  coalesce(sum(coupon_pay_amount),0.0) as dq_and_jq_pay_amount,
  coalesce(sum(jq_pay_amount),0.0) + 
coalesce(sum(pop_shop_jq_pay_amount),0.0) + 
coalesce(sum(lim_cate_jq_pay_amount),0.0) as jq_pay_amount,
  coalesce(sum(dq_pay_amount),0.0) + 
coalesce(sum(pop_shop_dq_pay_amount),0.0) + 
coalesce(sum(lim_cate_dq_pay_amount),0.0) as dq_pay_amount,
  coalesce(sum(gift_cps_pay_amount),0.0) as gift_cps_pay_amount ,
  coalesce(sum(mobile_red_packet_pay_amount),0.0) as 
mobile_red_packet_pay_amount,
  coalesce(sum(acct_bal_pay_amount),0.0) as acct_bal_pay_amount,
  coalesce(sum(jbean_pay_amount),0.0) as jbean_pay_amount,
  coalesce(sum(sku_rebate_amount),0.0) as sku_rebate_amount,
  coalesce(sum(yixun_point_pay_amount),0.0) as yixun_point_pay_amount,
  coalesce(sum(sku_freight_coupon_amount),0.0) as freight_coupon_amount
from    ord_at_det_di
where   ds = '2015-05-20'
group  by   sale_ord_id
{noformat}
The SQL scans two text files, each 360MB. We use 6 executors, each with 8GB 
of memory and 2 CPUs.
We can add a config to control map-side aggregation to avoid this. 
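A sketch of how such a switch might look from the user side, assuming the 
{{sqlContext}} available in spark-shell/spark-sql; the property name below is 
hypothetical, shown only to illustrate the proposal, not an existing Spark SQL setting:
{code}
// Hypothetical property name: turn off map-side (partial) aggregation for this session.
sqlContext.setConf("spark.sql.execution.partialAggregation.enabled", "false")

// ... then run the query above without the map-side aggregation step.
{code}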



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-7866) print the format string in dataframe explain

2015-05-26 Thread Fei Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fei Wang closed SPARK-7866.
---
Resolution: Won't Fix

The output only looked wrong in IntelliJ IDEA; the behaviour is actually correct, so this is not a problem.

 print the format string in dataframe explain
 

 Key: SPARK-7866
 URL: https://issues.apache.org/jira/browse/SPARK-7866
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.1
Reporter: Fei Wang
Priority: Trivial

 QueryExecution.toString gives a formatted and clear string, so we should print it in 
 the DataFrame.explain method
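 For reference, a quick usage sketch (assuming an existing DataFrame {{df}}):
{code}
// Prints the physical plan; pass true for the extended output
// that also includes the parsed, analyzed and optimized logical plans.
df.explain()
df.explain(true)
{code}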



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7885) add config to control map aggregation in spark sql

2015-05-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7885:
---

Assignee: (was: Apache Spark)

 add config to control map aggregation in spark sql
 --

 Key: SPARK-7885
 URL: https://issues.apache.org/jira/browse/SPARK-7885
 Project: Spark
  Issue Type: Improvement
Affects Versions: 1.2.0, 1.2.2, 1.3.1
Reporter: jeanlyn

 For now, *execution.HashAggregation* adds map-side aggregation in order to 
 decrease the shuffle data. However, we found a GC problem when we use this 
 optimization, and eventually the executor crashes. For example,
 {noformat} 
 select sale_ord_id as order_id,
   coalesce(sum(sku_offer_amount),0.0) as sku_offer_amount,
   coalesce(sum(suit_offer_amount),0.0) as suit_offer_amount,
   coalesce(sum(flash_gp_offer_amount),0.0) + 
 coalesce(sum(gp_offer_amount),0.0) as gp_offer_amount,
   coalesce(sum(flash_gp_offer_amount),0.0) as flash_gp_offer_amount,
   coalesce(sum(full_minus_offer_amount),0.0) as full_rebate_offer_amount,
   0.0 as telecom_point_offer_amount,
   coalesce(sum(coupon_pay_amount),0.0) as dq_and_jq_pay_amount,
   coalesce(sum(jq_pay_amount),0.0) + 
 coalesce(sum(pop_shop_jq_pay_amount),0.0) + 
 coalesce(sum(lim_cate_jq_pay_amount),0.0) as jq_pay_amount,
   coalesce(sum(dq_pay_amount),0.0) + 
 coalesce(sum(pop_shop_dq_pay_amount),0.0) + 
 coalesce(sum(lim_cate_dq_pay_amount),0.0) as dq_pay_amount,
   coalesce(sum(gift_cps_pay_amount),0.0) as gift_cps_pay_amount ,
   coalesce(sum(mobile_red_packet_pay_amount),0.0) as 
 mobile_red_packet_pay_amount,
   coalesce(sum(acct_bal_pay_amount),0.0) as acct_bal_pay_amount,
   coalesce(sum(jbean_pay_amount),0.0) as jbean_pay_amount,
   coalesce(sum(sku_rebate_amount),0.0) as sku_rebate_amount,
   coalesce(sum(yixun_point_pay_amount),0.0) as yixun_point_pay_amount,
   coalesce(sum(sku_freight_coupon_amount),0.0) as freight_coupon_amount
 from    ord_at_det_di
 where   ds = '2015-05-20'
 group  by   sale_ord_id
 {noformat}
 The SQL scans two text files, each 360MB. We use 6 executors, each with 8GB 
 of memory and 2 CPUs.
 We can add a config to control map-side aggregation to avoid this. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7885) add config to control map aggregation in spark sql

2015-05-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7885:
---

Assignee: Apache Spark

 add config to control map aggregation in spark sql
 --

 Key: SPARK-7885
 URL: https://issues.apache.org/jira/browse/SPARK-7885
 Project: Spark
  Issue Type: Improvement
Affects Versions: 1.2.0, 1.2.2, 1.3.1
Reporter: jeanlyn
Assignee: Apache Spark

 For now, *execution.HashAggregation* adds map-side aggregation in order to 
 decrease the shuffle data. However, we found a GC problem when we use this 
 optimization, and eventually the executor crashes. For example,
 {noformat} 
 select sale_ord_id as order_id,
   coalesce(sum(sku_offer_amount),0.0) as sku_offer_amount,
   coalesce(sum(suit_offer_amount),0.0) as suit_offer_amount,
   coalesce(sum(flash_gp_offer_amount),0.0) + 
 coalesce(sum(gp_offer_amount),0.0) as gp_offer_amount,
   coalesce(sum(flash_gp_offer_amount),0.0) as flash_gp_offer_amount,
   coalesce(sum(full_minus_offer_amount),0.0) as full_rebate_offer_amount,
   0.0 as telecom_point_offer_amount,
   coalesce(sum(coupon_pay_amount),0.0) as dq_and_jq_pay_amount,
   coalesce(sum(jq_pay_amount),0.0) + 
 coalesce(sum(pop_shop_jq_pay_amount),0.0) + 
 coalesce(sum(lim_cate_jq_pay_amount),0.0) as jq_pay_amount,
   coalesce(sum(dq_pay_amount),0.0) + 
 coalesce(sum(pop_shop_dq_pay_amount),0.0) + 
 coalesce(sum(lim_cate_dq_pay_amount),0.0) as dq_pay_amount,
   coalesce(sum(gift_cps_pay_amount),0.0) as gift_cps_pay_amount ,
   coalesce(sum(mobile_red_packet_pay_amount),0.0) as 
 mobile_red_packet_pay_amount,
   coalesce(sum(acct_bal_pay_amount),0.0) as acct_bal_pay_amount,
   coalesce(sum(jbean_pay_amount),0.0) as jbean_pay_amount,
   coalesce(sum(sku_rebate_amount),0.0) as sku_rebate_amount,
   coalesce(sum(yixun_point_pay_amount),0.0) as yixun_point_pay_amount,
   coalesce(sum(sku_freight_coupon_amount),0.0) as freight_coupon_amount
 from    ord_at_det_di
 where   ds = '2015-05-20'
 group  by   sale_ord_id
 {noformat}
 The SQL scans two text files, each 360MB. We use 6 executors, each with 8GB 
 of memory and 2 CPUs.
 We can add a config to control map-side aggregation to avoid this. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7885) add config to control map aggregation in spark sql

2015-05-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14560344#comment-14560344
 ] 

Apache Spark commented on SPARK-7885:
-

User 'jeanlyn' has created a pull request for this issue:
https://github.com/apache/spark/pull/6426

 add config to control map aggregation in spark sql
 --

 Key: SPARK-7885
 URL: https://issues.apache.org/jira/browse/SPARK-7885
 Project: Spark
  Issue Type: Improvement
Affects Versions: 1.2.0, 1.2.2, 1.3.1
Reporter: jeanlyn

 For now, *execution.HashAggregation* adds map-side aggregation in order to 
 decrease the shuffle data. However, we found a GC problem when we use this 
 optimization, and eventually the executor crashes. For example,
 {noformat} 
 select sale_ord_id as order_id,
   coalesce(sum(sku_offer_amount),0.0) as sku_offer_amount,
   coalesce(sum(suit_offer_amount),0.0) as suit_offer_amount,
   coalesce(sum(flash_gp_offer_amount),0.0) + 
 coalesce(sum(gp_offer_amount),0.0) as gp_offer_amount,
   coalesce(sum(flash_gp_offer_amount),0.0) as flash_gp_offer_amount,
   coalesce(sum(full_minus_offer_amount),0.0) as full_rebate_offer_amount,
   0.0 as telecom_point_offer_amount,
   coalesce(sum(coupon_pay_amount),0.0) as dq_and_jq_pay_amount,
   coalesce(sum(jq_pay_amount),0.0) + 
 coalesce(sum(pop_shop_jq_pay_amount),0.0) + 
 coalesce(sum(lim_cate_jq_pay_amount),0.0) as jq_pay_amount,
   coalesce(sum(dq_pay_amount),0.0) + 
 coalesce(sum(pop_shop_dq_pay_amount),0.0) + 
 coalesce(sum(lim_cate_dq_pay_amount),0.0) as dq_pay_amount,
   coalesce(sum(gift_cps_pay_amount),0.0) as gift_cps_pay_amount ,
   coalesce(sum(mobile_red_packet_pay_amount),0.0) as 
 mobile_red_packet_pay_amount,
   coalesce(sum(acct_bal_pay_amount),0.0) as acct_bal_pay_amount,
   coalesce(sum(jbean_pay_amount),0.0) as jbean_pay_amount,
   coalesce(sum(sku_rebate_amount),0.0) as sku_rebate_amount,
   coalesce(sum(yixun_point_pay_amount),0.0) as yixun_point_pay_amount,
   coalesce(sum(sku_freight_coupon_amount),0.0) as freight_coupon_amount
 from    ord_at_det_di
 where   ds = '2015-05-20'
 group  by   sale_ord_id
 {noformat}
 The SQL scans two text files, each 360MB. We use 6 executors, each with 8GB 
 of memory and 2 CPUs.
 We can add a config to control map-side aggregation to avoid this. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-7734) make explode support struct type

2015-05-26 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin closed SPARK-7734.
--
Resolution: Not A Problem

 make explode support struct type
 

 Key: SPARK-7734
 URL: https://issues.apache.org/jira/browse/SPARK-7734
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Wenchen Fan





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-7742) Figure out what to do with insertInto w.r.t. DataFrameWriter API

2015-05-26 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin closed SPARK-7742.
--
   Resolution: Fixed
Fix Version/s: 1.4.0
 Assignee: Yin Huai

We decided to add insertInto to the write API.


 Figure out what to do with insertInto w.r.t. DataFrameWriter API
 

 Key: SPARK-7742
 URL: https://issues.apache.org/jira/browse/SPARK-7742
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Yin Huai
Priority: Critical
 Fix For: 1.4.0


 See https://github.com/apache/spark/pull/6216



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7853) ClassNotFoundException for SparkSQL

2015-05-26 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-7853:
--
Description: 
Reproduce steps:
{code}
bin/spark-sql --jars ./sql/hive/src/test/resources/hive-hcatalog-core-0.13.1.jar
CREATE TABLE t1(a string, b string) ROW FORMAT SERDE 
'org.apache.hive.hcatalog.data.JsonSerDe';
{code}

Throws Exception like:
{noformat}
15/05/26 00:16:33 ERROR SparkSQLDriver: Failed in [CREATE TABLE t1(a string, b 
string) ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe']
org.apache.spark.sql.execution.QueryExecutionException: FAILED: Execution 
Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Cannot 
validate serde: org.apache.hive.hcatalog.data.JsonSerDe
at 
org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$runHive$1.apply(ClientWrapper.scala:333)
at 
org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$runHive$1.apply(ClientWrapper.scala:310)
at 
org.apache.spark.sql.hive.client.ClientWrapper.withHiveState(ClientWrapper.scala:139)
at 
org.apache.spark.sql.hive.client.ClientWrapper.runHive(ClientWrapper.scala:310)
at 
org.apache.spark.sql.hive.client.ClientWrapper.runSqlHive(ClientWrapper.scala:300)
at 
org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:457)
at 
org.apache.spark.sql.hive.execution.HiveNativeCommand.run(HiveNativeCommand.scala:33)
at 
org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:57)
at 
org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:57)
at 
org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:68)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:88)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:88)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:148)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:87)
at 
org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:922)
at 
org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:922)
at org.apache.spark.sql.DataFrame.init(DataFrame.scala:147)
at org.apache.spark.sql.DataFrame.init(DataFrame.scala:131)
at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:51)
at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:727)
at 
org.apache.spark.sql.hive.thriftserver.AbstractSparkSQLDriver.run(AbstractSparkSQLDriver.scala:57)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:283)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:423)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:218)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:664)
at 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:169)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:192)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:111)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
{noformat}

  was:
Reproduce steps:
{code}
bin/spark-sql --jars ./sql/hive/src/test/resources/hive-hcatalog-core-0.13.1.jar
CREATE TABLE t1(a string, b string) ROW FORMAT SERDE 
'org.apache.hive.hcatalog.data.JsonSerDe';
{code}

Throws Exception like:
{panel}
15/05/26 00:16:33 ERROR SparkSQLDriver: Failed in [CREATE TABLE t1(a string, b 
string) ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe']
org.apache.spark.sql.execution.QueryExecutionException: FAILED: Execution 
Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Cannot 
validate serde: org.apache.hive.hcatalog.data.JsonSerDe
at 
org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$runHive$1.apply(ClientWrapper.scala:333)
at 
org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$runHive$1.apply(ClientWrapper.scala:310)
at 
org.apache.spark.sql.hive.client.ClientWrapper.withHiveState(ClientWrapper.scala:139)
at 
org.apache.spark.sql.hive.client.ClientWrapper.runHive(ClientWrapper.scala:310)
at 
org.apache.spark.sql.hive.client.ClientWrapper.runSqlHive(ClientWrapper.scala:300)
at 

[jira] [Commented] (SPARK-4867) UDF clean up

2015-05-26 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14560434#comment-14560434
 ] 

Reynold Xin commented on SPARK-4867:


That's a good idea. I created SPARK-7886 for that. Can you submit a pull 
request for it?

It'd also be good to look into what expressions cannot be constructed this way. 
Ideally all functions should just go through the function registry without being 
hardcoded into the parsers.

 UDF clean up
 

 Key: SPARK-4867
 URL: https://issues.apache.org/jira/browse/SPARK-4867
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Michael Armbrust
Priority: Blocker

 Right now our support and internal implementation of many functions have a few 
 issues.  Specifically:
  - UDFs don't know their input types and thus don't do type coercion.
  - We hard-code a bunch of built-in functions into the parser.  This is bad 
 because in SQL it creates new reserved words for things that aren't actually 
 keywords.  Also it means that for each function we need to add support to 
 both SQLContext and HiveContext separately.
 For this JIRA I propose we do the following:
  - Change the interfaces for registerFunction and ScalaUdf to include types 
 for the input arguments as well as the output type.
  - Add a rule to analysis that does type coercion for UDFs.
  - Add a parse rule for functions to SQLParser.
  - Rewrite all the UDFs that are currently hacked into the various parsers 
 using this new functionality.
 Depending on how big this refactoring becomes we could split parts 1 & 2 from 
 part 3 above.
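 A rough sketch of what a type-aware registration could look like; the names and 
 signatures below are illustrative, not the actual Spark API:
{code}
import org.apache.spark.sql.types.DataType

// Hypothetical registry entry: carries argument and return types so an analysis
// rule can insert the casts (type coercion) a UDF call needs.
case class TypedUdf(
    name: String,
    inputTypes: Seq[DataType],
    returnType: DataType,
    func: Seq[Any] => Any)

class UdfRegistry {
  private val udfs = scala.collection.mutable.Map.empty[String, TypedUdf]
  def register(udf: TypedUdf): Unit = udfs(udf.name.toLowerCase) = udf
  def lookup(name: String): Option[TypedUdf] = udfs.get(name.toLowerCase)
}
{code}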



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7886) Add built-in expressions to FunctionRegistry

2015-05-26 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-7886:
--

 Summary: Add built-in expressions to FunctionRegistry
 Key: SPARK-7886
 URL: https://issues.apache.org/jira/browse/SPARK-7886
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Reynold Xin
Priority: Blocker


Once we do this, we no longer need to hardcode expressions into the parser 
(both for internal SQL and Hive QL).
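
A minimal sketch of the idea, with illustrative names rather than the eventual Spark classes: the parser only recognizes a generic function call, and the registry maps the name to an expression builder.
{code}
// Sketch only: Expression stands in for Catalyst's expression trees.
trait Expression

class SimpleRegistry {
  private val builders =
    scala.collection.mutable.Map.empty[String, Seq[Expression] => Expression]

  // A builder turns the parsed argument expressions into a concrete built-in expression.
  def registerFunction(name: String, builder: Seq[Expression] => Expression): Unit =
    builders(name.toLowerCase) = builder

  def lookupFunction(name: String, args: Seq[Expression]): Expression =
    builders.getOrElse(name.toLowerCase,
      sys.error(s"undefined function $name")).apply(args)
}
{code}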






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7550) Support setting the right schema & serde when writing to Hive metastore

2015-05-26 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14560444#comment-14560444
 ] 

Reynold Xin commented on SPARK-7550:


[~chenghao] can you work on this one? 

 Support setting the right schema & serde when writing to Hive metastore
 ---

 Key: SPARK-7550
 URL: https://issues.apache.org/jira/browse/SPARK-7550
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.4.0
Reporter: Reynold Xin

 As of 1.4, Spark SQL does not properly set the table schema and serde when 
 writing a table to Hive's metastore. Would be great to do that properly so 
 users can use non-Spark SQL systems to read those tables.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7858) DataSourceStrategy.createPhysicalRDD should use output schema when performing row conversions, not relation schema

2015-05-26 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-7858.
-
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 6400
[https://github.com/apache/spark/pull/6400]

 DataSourceStrategy.createPhysicalRDD should use output schema when performing 
 row conversions, not relation schema
 --

 Key: SPARK-7858
 URL: https://issues.apache.org/jira/browse/SPARK-7858
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0
Reporter: Josh Rosen
Assignee: Josh Rosen
Priority: Critical
 Fix For: 1.4.0


 In {{DataSourceStrategy.createPhysicalRDD}}, we use the relation schema as 
 the target schema for converting incoming rows into Catalyst rows.  However, 
 we should be using the output schema instead, since our scan might return a 
 subset of the relation's columns.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7868) Ignores _temporary directories while listing files in HadoopFsRelation

2015-05-26 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-7868.
-

Issue resolved by pull request 6411
[https://github.com/apache/spark/pull/6411]

 Ignores _temporary directories while listing files in HadoopFsRelation
 

 Key: SPARK-7868
 URL: https://issues.apache.org/jira/browse/SPARK-7868
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0
Reporter: Cheng Lian
Assignee: Cheng Lian
 Fix For: 1.4.0


 In some cases, failed tasks/jobs may leave uncommitted partial/corrupted data 
 in the {{_temporary}} directory. These files should not be counted as input files of 
 a {{HadoopFsRelation}}.
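 A sketch of the kind of filter the fix implies (illustrative only, not the actual patch):
{code}
import org.apache.hadoop.fs.{FileStatus, Path}

// A file is "temporary" if any ancestor directory is named _temporary.
def underTemporaryDir(path: Path): Boolean =
  Iterator.iterate(path)(_.getParent).takeWhile(_ != null).exists(_.getName == "_temporary")

def listInputFiles(leafFiles: Seq[FileStatus]): Seq[FileStatus] =
  leafFiles.filterNot(f => underTemporaryDir(f.getPath))
{code}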



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6012) Deadlock when asking for partitions from CoalescedRDD on top of a TakeOrdered operator

2015-05-26 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-6012.
-
Resolution: Not A Problem

I think we do not have this issue after 1.3. I am going to resolve it as Not A 
Problem.

 Deadlock when asking for partitions from CoalescedRDD on top of a TakeOrdered 
 operator
 --

 Key: SPARK-6012
 URL: https://issues.apache.org/jira/browse/SPARK-6012
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.1
Reporter: Max Seiden
Priority: Critical

 h3. Summary
 I've found that a deadlock occurs when asking for the partitions from a 
 SchemaRDD that has a TakeOrdered as its terminal operator. The problem occurs 
 when a child RDD asks the DAGScheduler for preferred partition locations 
 (which locks the scheduler) and eventually hits the #execute() of the 
 TakeOrdered operator, which submits tasks but is blocked when it also tries 
 to get preferred locations (in a separate thread). It seems like the 
 TakeOrdered op's #execute() method should not actually submit a task (it is 
 calling #executeCollect() and creating a new RDD) and should instead stay 
 more true to the comment "logically apply a Limit on top of a Sort". 
 In my particular case, I am forcing a repartition of a SchemaRDD with a 
 terminal Limit(..., Sort(...)), which is where the CoalescedRDD comes into 
 play.
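 For illustration, a sketch of the non-collecting approach alluded to above: 
 keep the k smallest rows per partition, then combine, without ever collecting 
 to the driver inside #execute(). Names are illustrative and this is not the 
 actual Spark operator.
{code}
import scala.collection.mutable
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// Keep the k smallest elements per partition with a bounded max-heap,
// then repeat on a single partition to get the global top k, still as an RDD.
def takeOrderedAsRDD[T: Ordering : ClassTag](rdd: RDD[T], k: Int): RDD[T] = {
  val ord = implicitly[Ordering[T]]
  def localTopK(it: Iterator[T]): Iterator[T] = {
    val heap = mutable.PriorityQueue.empty[T](ord)  // largest element on top
    it.foreach { x =>
      heap += x
      if (heap.size > k) heap.dequeue()             // evict the current largest
    }
    heap.iterator
  }
  rdd.mapPartitions(localTopK).repartition(1).mapPartitions(localTopK)
}
{code}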
 h3. Stack Traces
 h4. Task Submission
 {noformat}
 main prio=5 tid=0x7f8e7280 nid=0x1303 in Object.wait() 
 [0x00010ed5e000]
java.lang.Thread.State: WAITING (on object monitor)
 at java.lang.Object.wait(Native Method)
 - waiting on 0x0007c4c239b8 (a 
 org.apache.spark.scheduler.JobWaiter)
 at java.lang.Object.wait(Object.java:503)
 at 
 org.apache.spark.scheduler.JobWaiter.awaitResult(JobWaiter.scala:73)
 - locked 0x0007c4c239b8 (a org.apache.spark.scheduler.JobWaiter)
 at 
 org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:514)
 at org.apache.spark.SparkContext.runJob(SparkContext.scala:1321)
 at org.apache.spark.SparkContext.runJob(SparkContext.scala:1390)
 at org.apache.spark.rdd.RDD.reduce(RDD.scala:884)
 at org.apache.spark.rdd.RDD.takeOrdered(RDD.scala:1161)
 at 
 org.apache.spark.sql.execution.TakeOrdered.executeCollect(basicOperators.scala:183)
 at 
 org.apache.spark.sql.execution.TakeOrdered.execute(basicOperators.scala:188)
 at 
 org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:425)
 - locked 0x0007c36ce038 (a 
 org.apache.spark.sql.hive.HiveContext$$anon$7)
 at 
 org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:425)
 at org.apache.spark.sql.SchemaRDD.getDependencies(SchemaRDD.scala:127)
 at 
 org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:209)
 at 
 org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:207)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.dependencies(RDD.scala:207)
 at org.apache.spark.rdd.RDD.firstParent(RDD.scala:1278)
 at org.apache.spark.sql.SchemaRDD.getPartitions(SchemaRDD.scala:122)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:222)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:220)
 at 
 org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:222)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:220)
 at org.apache.spark.ShuffleDependency.init(Dependency.scala:79)
 at 
 org.apache.spark.rdd.ShuffledRDD.getDependencies(ShuffledRDD.scala:80)
 at 
 org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:209)
 at 
 org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:207)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.dependencies(RDD.scala:207)
 at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1333)
 at 
 org.apache.spark.scheduler.DAGScheduler.getPreferredLocs(DAGScheduler.scala:1304)
 - locked 0x0007f55c2238 (a 
 org.apache.spark.scheduler.DAGScheduler)
 at 
 org.apache.spark.SparkContext.getPreferredLocs(SparkContext.scala:1148)
 at 
 

[jira] [Updated] (SPARK-6012) Deadlock when asking for partitions from CoalescedRDD on top of a TakeOrdered operator

2015-05-26 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-6012:

Target Version/s:   (was: 1.4.0)

 Deadlock when asking for partitions from CoalescedRDD on top of a TakeOrdered 
 operator
 --

 Key: SPARK-6012
 URL: https://issues.apache.org/jira/browse/SPARK-6012
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.1
Reporter: Max Seiden
Priority: Critical

 h3. Summary
 I've found that a deadlock occurs when asking for the partitions from a 
 SchemaRDD that has a TakeOrdered as its terminal operator. The problem occurs 
 when a child RDD asks the DAGScheduler for preferred partition locations 
 (which locks the scheduler) and eventually hits the #execute() of the 
 TakeOrdered operator, which submits tasks but is blocked when it also tries 
 to get preferred locations (in a separate thread). It seems like the 
 TakeOrdered op's #execute() method should not actually submit a task (it is 
 calling #executeCollect() and creating a new RDD) and should instead stay 
 more true to the comment "logically apply a Limit on top of a Sort". 
 In my particular case, I am forcing a repartition of a SchemaRDD with a 
 terminal Limit(..., Sort(...)), which is where the CoalescedRDD comes into 
 play.
 h3. Stack Traces
 h4. Task Submission
 {noformat}
 main prio=5 tid=0x7f8e7280 nid=0x1303 in Object.wait() 
 [0x00010ed5e000]
java.lang.Thread.State: WAITING (on object monitor)
 at java.lang.Object.wait(Native Method)
 - waiting on 0x0007c4c239b8 (a 
 org.apache.spark.scheduler.JobWaiter)
 at java.lang.Object.wait(Object.java:503)
 at 
 org.apache.spark.scheduler.JobWaiter.awaitResult(JobWaiter.scala:73)
 - locked 0x0007c4c239b8 (a org.apache.spark.scheduler.JobWaiter)
 at 
 org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:514)
 at org.apache.spark.SparkContext.runJob(SparkContext.scala:1321)
 at org.apache.spark.SparkContext.runJob(SparkContext.scala:1390)
 at org.apache.spark.rdd.RDD.reduce(RDD.scala:884)
 at org.apache.spark.rdd.RDD.takeOrdered(RDD.scala:1161)
 at 
 org.apache.spark.sql.execution.TakeOrdered.executeCollect(basicOperators.scala:183)
 at 
 org.apache.spark.sql.execution.TakeOrdered.execute(basicOperators.scala:188)
 at 
 org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:425)
 - locked 0x0007c36ce038 (a 
 org.apache.spark.sql.hive.HiveContext$$anon$7)
 at 
 org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:425)
 at org.apache.spark.sql.SchemaRDD.getDependencies(SchemaRDD.scala:127)
 at 
 org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:209)
 at 
 org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:207)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.dependencies(RDD.scala:207)
 at org.apache.spark.rdd.RDD.firstParent(RDD.scala:1278)
 at org.apache.spark.sql.SchemaRDD.getPartitions(SchemaRDD.scala:122)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:222)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:220)
 at 
 org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:222)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:220)
 at org.apache.spark.ShuffleDependency.init(Dependency.scala:79)
 at 
 org.apache.spark.rdd.ShuffledRDD.getDependencies(ShuffledRDD.scala:80)
 at 
 org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:209)
 at 
 org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:207)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.dependencies(RDD.scala:207)
 at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1333)
 at 
 org.apache.spark.scheduler.DAGScheduler.getPreferredLocs(DAGScheduler.scala:1304)
 - locked 0x0007f55c2238 (a 
 org.apache.spark.scheduler.DAGScheduler)
 at 
 org.apache.spark.SparkContext.getPreferredLocs(SparkContext.scala:1148)
 at 
 org.apache.spark.rdd.PartitionCoalescer.currPrefLocs(CoalescedRDD.scala:175)
 

[jira] [Commented] (SPARK-7853) ClassNotFoundException for SparkSQL

2015-05-26 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14560429#comment-14560429
 ] 

Cheng Lian commented on SPARK-7853:
---

OT: [~chenghao] Just edited the JIRA description. When pasting an exception stack 
trace, {{noformat}} can be preferable to {{panel}} since it uses a 
monospace font :)

 ClassNotFoundException for SparkSQL
 ---

 Key: SPARK-7853
 URL: https://issues.apache.org/jira/browse/SPARK-7853
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0
Reporter: Cheng Hao
Priority: Blocker

 Reproduce steps:
 {code}
 bin/spark-sql --jars 
 ./sql/hive/src/test/resources/hive-hcatalog-core-0.13.1.jar
 CREATE TABLE t1(a string, b string) ROW FORMAT SERDE 
 'org.apache.hive.hcatalog.data.JsonSerDe';
 {code}
 Throws Exception like:
 {noformat}
 15/05/26 00:16:33 ERROR SparkSQLDriver: Failed in [CREATE TABLE t1(a string, 
 b string) ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe']
 org.apache.spark.sql.execution.QueryExecutionException: FAILED: Execution 
 Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Cannot 
 validate serde: org.apache.hive.hcatalog.data.JsonSerDe
   at 
 org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$runHive$1.apply(ClientWrapper.scala:333)
   at 
 org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$runHive$1.apply(ClientWrapper.scala:310)
   at 
 org.apache.spark.sql.hive.client.ClientWrapper.withHiveState(ClientWrapper.scala:139)
   at 
 org.apache.spark.sql.hive.client.ClientWrapper.runHive(ClientWrapper.scala:310)
   at 
 org.apache.spark.sql.hive.client.ClientWrapper.runSqlHive(ClientWrapper.scala:300)
   at 
 org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:457)
   at 
 org.apache.spark.sql.hive.execution.HiveNativeCommand.run(HiveNativeCommand.scala:33)
   at 
 org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:57)
   at 
 org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:57)
   at 
 org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:68)
   at 
 org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:88)
   at 
 org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:88)
   at 
 org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:148)
   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:87)
   at 
 org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:922)
   at 
 org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:922)
   at org.apache.spark.sql.DataFrame.init(DataFrame.scala:147)
   at org.apache.spark.sql.DataFrame.init(DataFrame.scala:131)
   at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:51)
   at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:727)
   at 
 org.apache.spark.sql.hive.thriftserver.AbstractSparkSQLDriver.run(AbstractSparkSQLDriver.scala:57)
   at 
 org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:283)
   at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:423)
   at 
 org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:218)
   at 
 org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at 
 org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:664)
   at 
 org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:169)
   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:192)
   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:111)
   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7886) Add built-in expressions to FunctionRegistry

2015-05-26 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-7886:
---
Target Version/s: 1.5.0

 Add built-in expressions to FunctionRegistry
 

 Key: SPARK-7886
 URL: https://issues.apache.org/jira/browse/SPARK-7886
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Reynold Xin
Priority: Blocker

 Once we do this, we no longer need to hardcode expressions into the parser 
 (both for internal SQL and Hive QL).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7550) Support setting the right schema & serde when writing to Hive metastore

2015-05-26 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14560455#comment-14560455
 ] 

Reynold Xin commented on SPARK-7550:


cc [~yhuai]. I think the proposed design is to write the schema out according 
to Hive's format, for data types that Hive supports. For UDTs, just write them 
out the way we do right now.
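
A sketch of the type-mapping part of that idea (illustrative only; the real change would live in the Hive metastore interaction code):
{code}
import org.apache.spark.sql.types._

// Render a Catalyst type as a Hive column type string when Hive supports it;
// return None (i.e. keep today's behavior) for anything else, e.g. UDTs.
def toHiveTypeString(dt: DataType): Option[String] = dt match {
  case IntegerType      => Some("int")
  case LongType         => Some("bigint")
  case DoubleType       => Some("double")
  case StringType       => Some("string")
  case BooleanType      => Some("boolean")
  case ArrayType(et, _) => toHiveTypeString(et).map(e => s"array<$e>")
  case _                => None
}
{code}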


 Support setting the right schema & serde when writing to Hive metastore
 ---

 Key: SPARK-7550
 URL: https://issues.apache.org/jira/browse/SPARK-7550
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.4.0
Reporter: Reynold Xin

 As of 1.4, Spark SQL does not properly set the table schema and serde when 
 writing a table to Hive's metastore. Would be great to do that properly so 
 users can use non-Spark SQL systems to read those tables.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6923) Spark SQL CLI does not read Data Source schema correctly

2015-05-26 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-6923:
---
Target Version/s: 1.5.0  (was: 1.4.0)

 Spark SQL CLI does not read Data Source schema correctly
 

 Key: SPARK-6923
 URL: https://issues.apache.org/jira/browse/SPARK-6923
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: pin_zhang
Priority: Critical

  {code:java}
  HiveContext hctx = new HiveContext(sc);
  List<String> sample = new ArrayList<String>();
  sample.add("{\"id\": \"id_1\", \"age\": 1}");
  RDD<String> sampleRDD = new JavaSparkContext(sc).parallelize(sample).rdd();
  DataFrame df = hctx.jsonRDD(sampleRDD);
  String table = "test";
  df.saveAsTable(table, "json", SaveMode.Overwrite);
  Table t = hctx.catalog().client().getTable(table);
  System.out.println(t.getCols());
  {code}
 --
  With the code above to save a DataFrame to a Hive table,
  getting the table cols returns one column named 'col':
  [FieldSchema(name:col, type:array<string>, comment:from deserializer)]
  The expected return is the fields schema id, age.
  As a result, the JDBC API cannot retrieve the table columns via ResultSet 
  DatabaseMetaData.getColumns(String catalog, String schemaPattern, String 
  tableNamePattern, String columnNamePattern).
  But the ResultSet metadata for the query "select * from test" contains fields id, 
  age.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7887) Remove EvaluatedType from SQL Expression

2015-05-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7887:
---

Assignee: Apache Spark  (was: Reynold Xin)

 Remove EvaluatedType from SQL Expression
 

 Key: SPARK-7887
 URL: https://issues.apache.org/jira/browse/SPARK-7887
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Reynold Xin
Assignee: Apache Spark

 It's not a very useful type to use. We can just remove it to simplify 
 expressions slightly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7887) Remove EvaluatedType from SQL Expression

2015-05-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14560462#comment-14560462
 ] 

Apache Spark commented on SPARK-7887:
-

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/6427

 Remove EvaluatedType from SQL Expression
 

 Key: SPARK-7887
 URL: https://issues.apache.org/jira/browse/SPARK-7887
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin

 It's not a very useful type to use. We can just remove it to simplify 
 expressions slightly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7887) Remove EvaluatedType from SQL Expression

2015-05-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7887:
---

Assignee: Reynold Xin  (was: Apache Spark)

 Remove EvaluatedType from SQL Expression
 

 Key: SPARK-7887
 URL: https://issues.apache.org/jira/browse/SPARK-7887
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin

 It's not a very useful type to use. We can just remove it to simplify 
 expressions slightly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7550) Support setting the right schema & serde when writing to Hive metastore

2015-05-26 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14560450#comment-14560450
 ] 

Cheng Hao commented on SPARK-7550:
--

Similar issue with SPARK-6923 

 Support setting the right schema & serde when writing to Hive metastore
 ---

 Key: SPARK-7550
 URL: https://issues.apache.org/jira/browse/SPARK-7550
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.4.0
Reporter: Reynold Xin

 As of 1.4, Spark SQL does not properly set the table schema and serde when 
 writing a table to Hive's metastore. Would be great to do that properly so 
 users can use non-Spark SQL systems to read those tables.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-7550) Support setting the right schema & serde when writing to Hive metastore

2015-05-26 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14560450#comment-14560450
 ] 

Cheng Hao edited comment on SPARK-7550 at 5/27/15 5:48 AM:
---

Similar issue with SPARK-6923 ?


was (Author: chenghao):
Similar issue with SPARK-6923 

 Support setting the right schema & serde when writing to Hive metastore
 ---

 Key: SPARK-7550
 URL: https://issues.apache.org/jira/browse/SPARK-7550
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.4.0
Reporter: Reynold Xin

 As of 1.4, Spark SQL does not properly set the table schema and serde when 
 writing a table to Hive's metastore. Would be great to do that properly so 
 users can use non-Spark SQL systems to read those tables.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7887) Remove EvaluatedType from SQL Expression

2015-05-26 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-7887:
--

 Summary: Remove EvaluatedType from SQL Expression
 Key: SPARK-7887
 URL: https://issues.apache.org/jira/browse/SPARK-7887
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin


It's not a very useful type to use. We can just remove it to simplify 
expressions slightly.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7699) Number of executors can be reduced from initial before work is scheduled

2015-05-26 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14558748#comment-14558748
 ] 

Sandy Ryza commented on SPARK-7699:
---

We can't wait only on the initial allocation being made because YARN might not 
be able to fully satisfy it in any finite amount of time.

 Number of executors can be reduced from initial before work is scheduled
 

 Key: SPARK-7699
 URL: https://issues.apache.org/jira/browse/SPARK-7699
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: meiyoula
Priority: Minor

 spark.dynamicAllocation.minExecutors 2
 spark.dynamicAllocation.initialExecutors  3
 spark.dynamicAllocation.maxExecutors 4
 Just run the spark-shell with above configurations, the initial executor 
 number is 2.
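 For reference, a minimal sketch of setting these properties programmatically instead of via spark-defaults.conf; the app name is illustrative, and spark.dynamicAllocation.enabled plus spark.shuffle.service.enabled are assumed here because dynamic allocation on YARN requires them, even though they are not listed in the report:
{code}
import org.apache.spark.{SparkConf, SparkContext}

object DynamicAllocationSketch {
  def main(args: Array[String]): Unit = {
    // The three settings from the report, plus the two that dynamic allocation
    // itself needs (assumed, not part of the original report).
    val conf = new SparkConf()
      .setAppName("dynamic-allocation-sketch")
      .set("spark.dynamicAllocation.enabled", "true")
      .set("spark.shuffle.service.enabled", "true")
      .set("spark.dynamicAllocation.minExecutors", "2")
      .set("spark.dynamicAllocation.initialExecutors", "3")
      .set("spark.dynamicAllocation.maxExecutors", "4")
    val sc = new SparkContext(conf)
    println(s"initial executors requested: ${conf.get("spark.dynamicAllocation.initialExecutors")}")
    sc.stop()
  }
}
{code}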



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7042) Spark version of akka-actor_2.11 is not compatible with the official akka-actor_2.11 2.3.x

2015-05-26 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-7042.
--
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 6341
[https://github.com/apache/spark/pull/6341]

 Spark version of akka-actor_2.11 is not compatible with the official 
 akka-actor_2.11 2.3.x
 --

 Key: SPARK-7042
 URL: https://issues.apache.org/jira/browse/SPARK-7042
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.3.1
Reporter: Konstantin Shaposhnikov
Priority: Minor
 Fix For: 1.5.0


 When connecting to a remote Spark cluster (that runs Spark branch-1.3 built 
 with Scala 2.11) from an application that uses akka 2.3.9 I get the following 
 error:
 {noformat}
 2015-04-22 09:01:38,924 - [WARN] - [akka.remote.ReliableDeliverySupervisor] 
 [sparkDriver-akka.actor.default-dispatcher-5] -
 Association with remote system [akka.tcp://sparkExecutor@server:59007] has 
 failed, address is now gated for [5000] ms.
 Reason is: [akka.actor.Identify; local class incompatible: stream classdesc 
 serialVersionUID = -213377755528332889, local class serialVersionUID = 1].
 {noformat}
 It looks like akka-actor_2.11 2.3.4-spark that is used by Spark has been 
 built using Scala compiler 2.11.0 that ignores SerialVersionUID annotations 
 (see https://issues.scala-lang.org/browse/SI-8549).
 The following steps can resolve the issue:
 - re-build the custom akka library that is used by Spark with the more recent 
 version of Scala compiler (e.g. 2.11.6) 
 - deploy a new version (e.g. 2.3.4.1-spark) to a maven repo
 - update version of akka used by spark (master and 1.3 branch)
 I would also suggest to upgrade to the latest version of akka 2.3.9 (or 
 2.3.10 that should be released soon).
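 For context on the annotation involved, a minimal sketch of how @SerialVersionUID normally pins a class's serialization ID; the class here is purely illustrative and unrelated to Akka's internals:
{code}
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// Sketch: @SerialVersionUID fixes the serialVersionUID written into the Java
// serialization stream. Under SI-8549, classes compiled with Scala 2.11.0
// silently ignore the annotation and get a computed ID instead, which is what
// produces the "local class incompatible" error quoted above.
@SerialVersionUID(1L)
case class Ping(id: Long)

object SerialVersionUidSketch {
  def main(args: Array[String]): Unit = {
    val buffer = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buffer)
    out.writeObject(Ping(42L))
    out.close()
    println(s"serialized Ping in ${buffer.size()} bytes")
  }
}
{code}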



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7042) Spark version of akka-actor_2.11 is not compatible with the official akka-actor_2.11 2.3.x

2015-05-26 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-7042:
-
Assignee: Konstantin Shaposhnikov

 Spark version of akka-actor_2.11 is not compatible with the official 
 akka-actor_2.11 2.3.x
 --

 Key: SPARK-7042
 URL: https://issues.apache.org/jira/browse/SPARK-7042
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.3.1
Reporter: Konstantin Shaposhnikov
Assignee: Konstantin Shaposhnikov
Priority: Minor
 Fix For: 1.5.0


 When connecting to a remote Spark cluster (that runs Spark branch-1.3 built 
 with Scala 2.11) from an application that uses akka 2.3.9 I get the following 
 error:
 {noformat}
 2015-04-22 09:01:38,924 - [WARN] - [akka.remote.ReliableDeliverySupervisor] 
 [sparkDriver-akka.actor.default-dispatcher-5] -
 Association with remote system [akka.tcp://sparkExecutor@server:59007] has 
 failed, address is now gated for [5000] ms.
 Reason is: [akka.actor.Identify; local class incompatible: stream classdesc 
 serialVersionUID = -213377755528332889, local class serialVersionUID = 1].
 {noformat}
 It looks like akka-actor_2.11 2.3.4-spark that is used by Spark has been 
 built using Scala compiler 2.11.0 that ignores SerialVersionUID annotations 
 (see https://issues.scala-lang.org/browse/SI-8549).
 The following steps can resolve the issue:
 - re-build the custom akka library that is used by Spark with the more recent 
 version of Scala compiler (e.g. 2.11.6) 
 - deploy a new version (e.g. 2.3.4.1-spark) to a maven repo
 - update version of akka used by spark (master and 1.3 branch)
 I would also suggest to upgrade to the latest version of akka 2.3.9 (or 
 2.3.10 that should be released soon).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7699) Number of executors can be reduced from initial before work is scheduled

2015-05-26 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14558752#comment-14558752
 ] 

Sean Owen commented on SPARK-7699:
--

Hm, waiting for x seconds also seems suboptimal since, in the case where the 
executors aren't needed, you're just delaying releasing them. Does it make sense 
to wait 1 heartbeat, or x heartbeats, before starting to change the allocation?

 Number of executors can be reduced from initial before work is scheduled
 

 Key: SPARK-7699
 URL: https://issues.apache.org/jira/browse/SPARK-7699
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: meiyoula
Priority: Minor

 spark.dynamicAllocation.minExecutors 2
 spark.dynamicAllocation.initialExecutors  3
 spark.dynamicAllocation.maxExecutors 4
 Just run the spark-shell with above configurations, the initial executor 
 number is 2.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7110) when use saveAsNewAPIHadoopFile, sometimes it throws Delegation Token can be issued only with kerberos or web authentication

2015-05-26 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-7110.
--
Resolution: Duplicate

 when use saveAsNewAPIHadoopFile, sometimes it throws Delegation Token can be 
 issued only with kerberos or web authentication
 --

 Key: SPARK-7110
 URL: https://issues.apache.org/jira/browse/SPARK-7110
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.1.0
Reporter: gu-chi
Assignee: Sean Owen

 Under yarn-client mode, this issue occurs randomly. The authentication method 
 is set to kerberos, and saveAsNewAPIHadoopFile in PairRDDFunctions is used to 
 save data to HDFS; the exception then comes as:
 org.apache.hadoop.ipc.RemoteException(java.io.IOException): Delegation Token 
 can be issued only with kerberos or web authentication



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7727) Avoid inner classes in RuleExecutor

2015-05-26 Thread Santiago M. Mola (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14558758#comment-14558758
 ] 

Santiago M. Mola commented on SPARK-7727:
-

[~chenghao] I think that is a good idea. Analyzer could be converted into a 
trait, moving the current Analyzer to DefaultAnalyzer. It is probably a good 
idea to use a separate JIRA and pull request for that, though.
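
A rough sketch of the shape being suggested, using simplified placeholder types rather than the real catalyst classes; the details are assumptions for illustration, not the actual change:
{code}
// Placeholder stand-ins for the real catalyst types, just to show the shape.
trait LogicalPlan
trait Rule { def apply(plan: LogicalPlan): LogicalPlan }

// Analyzer becomes a trait carrying the shared resolution behaviour...
trait Analyzer {
  def rules: Seq[Rule]
  def execute(plan: LogicalPlan): LogicalPlan =
    rules.foldLeft(plan)((p, r) => r(p))
}

// ...and the current concrete Analyzer moves to a DefaultAnalyzer object that
// alternative analyzers can reuse or override.
object DefaultAnalyzer extends Analyzer {
  override val rules: Seq[Rule] = Seq(new Rule {
    def apply(plan: LogicalPlan): LogicalPlan = plan // no-op placeholder rule
  })
}
{code}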

 Avoid inner classes in RuleExecutor
 ---

 Key: SPARK-7727
 URL: https://issues.apache.org/jira/browse/SPARK-7727
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.1
Reporter: Santiago M. Mola
  Labels: easyfix, starter

 In RuleExecutor, the following classes and objects are defined as inner 
 classes or objects: Strategy, Once, FixedPoint, Batch.
 This does not seem to accomplish anything in this case, but makes 
 extensibility harder. For example, if I want to define a new Optimizer that 
 uses all batches from the DefaultOptimizer plus some more, I would do 
 something like:
 {code}
 new Optimizer {
 override protected val batches: Seq[Batch] =
   DefaultOptimizer.batches ++ myBatches
  }
 {code}
 But this will give a typing error because batches in DefaultOptimizer are of 
 type DefaultOptimizer#Batch while myBatches are this#Batch.
 Workarounds include either copying the list of batches from DefaultOptimizer 
 or using a method like this:
 {code}
 private def transformBatchType(b: DefaultOptimizer.Batch): Batch = {
   val strategy = b.strategy.maxIterations match {
 case 1 => Once
 case n => FixedPoint(n)
   }
   Batch(b.name, strategy, b.rules)
 }
 {code}
 However, making these classes outer would solve the problem.
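 A rough sketch, with simplified placeholder types and visibility, of what moving these definitions out of the class body looks like and why the override then type-checks; this illustrates the proposal only, not the actual patch:
{code}
// Placeholder types (not the real catalyst classes; visibility is simplified):
// once Strategy, Once, FixedPoint and Batch live outside RuleExecutor, batches
// built by different executors share one Batch type and concatenate directly.
trait Rule
sealed trait Strategy { def maxIterations: Int }
case object Once extends Strategy { val maxIterations = 1 }
case class FixedPoint(maxIterations: Int) extends Strategy
case class Batch(name: String, strategy: Strategy, rules: Seq[Rule])

abstract class Optimizer {
  def batches: Seq[Batch]
}

object DefaultOptimizer extends Optimizer {
  override val batches: Seq[Batch] =
    Seq(Batch("ConstantFolding", FixedPoint(100), Nil))
}

object MyCustomOptimizer extends Optimizer {
  private val myBatches = Seq(Batch("MyBatch", Once, Nil))
  // Type-checks now: both operands are plain Seq[Batch] values.
  override val batches: Seq[Batch] = DefaultOptimizer.batches ++ myBatches
}
{code}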



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7699) Number of executors can be reduced from initial before work is scheduled

2015-05-26 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14558757#comment-14558757
 ] 

Sandy Ryza commented on SPARK-7699:
---

I think delaying releasing them is exactly the point of the property.  If we 
don't want to do that, what's it there for? 

 Number of executors can be reduced from initial before work is scheduled
 

 Key: SPARK-7699
 URL: https://issues.apache.org/jira/browse/SPARK-7699
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: meiyoula
Priority: Minor

 spark.dynamicAllocation.minExecutors 2
 spark.dynamicAllocation.initialExecutors  3
 spark.dynamicAllocation.maxExecutors 4
 Just run the spark-shell with above configurations, the initial executor 
 number is 2.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7699) Number of executors can be reduced from initial before work is scheduled

2015-05-26 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14558764#comment-14558764
 ] 

Sean Owen commented on SPARK-7699:
--

initialExecutors? I think it's there for the ramp-up case, really. If load will 
start soon, but your minimum is 1 because load is variable, it's best not to 
have to ramp up through 1, 2, 4, 8 executors when you need 100.

The problem is evaluating load before any load has had a chance to schedule. 
Ramping down at all is bad if, actually, load is applied right away. 

I'd rather not add another lever here, but would it be principled to wait for 
some multiple of the RM heartbeat, so that the allocation isn't changed until 
the RM has had a fair chance to allocate resources? Sure, bets are off if there 
is a delay in scheduling, but what can you do? Nothing breaks in that case; it's 
just suboptimal.

 Number of executors can be reduced from initial before work is scheduled
 

 Key: SPARK-7699
 URL: https://issues.apache.org/jira/browse/SPARK-7699
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: meiyoula
Priority: Minor

 spark.dynamicAllocation.minExecutors 2
 spark.dynamicAllocation.initialExecutors  3
 spark.dynamicAllocation.maxExecutors 4
 Just run the spark-shell with above configurations, the initial executor 
 number is 2.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-7727) Avoid inner classes in RuleExecutor

2015-05-26 Thread Santiago M. Mola (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santiago M. Mola updated SPARK-7727:

Comment: was deleted

(was: [~evacchi] I'm sorry I opened this duplicate for: 
https://issues.apache.org/jira/browse/SPARK-7823

Not sure which one to mark as duplicate since both have pull requests.)

 Avoid inner classes in RuleExecutor
 ---

 Key: SPARK-7727
 URL: https://issues.apache.org/jira/browse/SPARK-7727
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.1
Reporter: Santiago M. Mola
  Labels: easyfix, starter

 In RuleExecutor, the following classes and objects are defined as inner 
 classes or objects: Strategy, Once, FixedPoint, Batch.
 This does not seem to accomplish anything in this case, but makes 
 extensibility harder. For example, if I want to define a new Optimizer that 
 uses all batches from the DefaultOptimizer plus some more, I would do 
 something like:
 {code}
 new Optimizer {
 override protected val batches: Seq[Batch] =
   DefaultOptimizer.batches ++ myBatches
  }
 {code}
 But this will give a typing error because batches in DefaultOptimizer are of 
 type DefaultOptimizer#Batch while myBatches are this#Batch.
 Workarounds include either copying the list of batches from DefaultOptimizer 
 or using a method like this:
 {code}
 private def transformBatchType(b: DefaultOptimizer.Batch): Batch = {
   val strategy = b.strategy.maxIterations match {
 case 1 => Once
 case n => FixedPoint(n)
   }
   Batch(b.name, strategy, b.rules)
 }
 {code}
 However, making these classes outer would solve the problem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7823) [SQL] Batch, FixedPoint, Strategy should not be inner classes of class RuleExecutor

2015-05-26 Thread Santiago M. Mola (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santiago M. Mola resolved SPARK-7823.
-
Resolution: Duplicate

This is a duplicate of https://issues.apache.org/jira/browse/SPARK-7727

 [SQL] Batch, FixedPoint, Strategy should not be inner classes of class 
 RuleExecutor
 ---

 Key: SPARK-7823
 URL: https://issues.apache.org/jira/browse/SPARK-7823
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.1, 1.4.0
Reporter: Edoardo Vacchi
Priority: Minor

 Batch, FixedPoint, Strategy, Once, are defined within the class 
 RuleExecutor[TreeType]. This makes unnecessarily complicated to reuse batches 
 of rules within custom optimizers. E.g:
 {code:java}
 object DefaultOptimizer extends Optimizer {
   override val batches = /* batches defined here */
 }
 object MyCustomOptimizer extends Optimizer {
   override val batches = 
 Batch("my custom batch", ...) ::
 DefaultOptimizer.batches
 }
 {code}
 MyCustomOptimizer won't compile, because DefaultOptimizer.batches has type 
 Seq[DefaultOptimizer.this.Batch]. 
 Solution: Batch, FixedPoint, etc. should be moved *outside* the 
 RuleExecutor[T] class body, either in a companion object or right in the 
 `rules` package.
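 A minimal sketch of the companion-object variant, again with placeholder types; an illustration of the idea, not the actual change:
{code}
// Placeholder types; the rule type is elided to a String in this sketch.
object RuleExecutor {
  sealed trait Strategy { def maxIterations: Int }
  case object Once extends Strategy { val maxIterations = 1 }
  case class FixedPoint(maxIterations: Int) extends Strategy
  case class Batch(name: String, strategy: Strategy, rules: Seq[String])
}

abstract class RuleExecutor {
  def batches: Seq[RuleExecutor.Batch]
}

object DefaultOptimizer extends RuleExecutor {
  import RuleExecutor._
  override val batches = Seq(Batch("ConstantFolding", FixedPoint(100), Nil))
}

object MyCustomOptimizer extends RuleExecutor {
  import RuleExecutor._
  // Reuse works because RuleExecutor.Batch is one shared type, not a
  // path-dependent inner class.
  override val batches = Batch("MyBatch", Once, Nil) +: DefaultOptimizer.batches
}
{code}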



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-7727) Avoid inner classes in RuleExecutor

2015-05-26 Thread Edoardo Vacchi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14558796#comment-14558796
 ] 

Edoardo Vacchi edited comment on SPARK-7727 at 5/26/15 7:44 AM:


[~smolav] (about the duplicate) that's fine, since I opened my PR later (I 
didn't see the other). My PR wraps the case classes in a companion object, 
though. Don't know which solution would be best

about trait v. object. Object is currently fine, since batches can be reused 
through `val batches = DefaultOptimizer.batches`. If we go with traits, though 
(which I am in favor of) I would turn into traits also SparkPlanner, for 
symmetry


was (Author: evacchi):
[~smolav] (about the duplicate) that's fine, since I opened my PR later (I 
didn't see the other). My PR wraps the case classes in a companion object, 
though. Don't know which solution would be best

 Avoid inner classes in RuleExecutor
 ---

 Key: SPARK-7727
 URL: https://issues.apache.org/jira/browse/SPARK-7727
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.1
Reporter: Santiago M. Mola
  Labels: easyfix, starter

 In RuleExecutor, the following classes and objects are defined as inner 
 classes or objects: Strategy, Once, FixedPoint, Batch.
 This does not seem to accomplish anything in this case, but makes 
 extensibility harder. For example, if I want to define a new Optimizer that 
 uses all batches from the DefaultOptimizer plus some more, I would do 
 something like:
 {code}
 new Optimizer {
 override protected val batches: Seq[Batch] =
   DefaultOptimizer.batches ++ myBatches
  }
 {code}
 But this will give a typing error because batches in DefaultOptimizer are of 
 type DefaultOptimizer#Batch while myBatches are this#Batch.
 Workarounds include either copying the list of batches from DefaultOptimizer 
 or using a method like this:
 {code}
 private def transformBatchType(b: DefaultOptimizer.Batch): Batch = {
   val strategy = b.strategy.maxIterations match {
 case 1 => Once
 case n => FixedPoint(n)
   }
   Batch(b.name, strategy, b.rules)
 }
 {code}
 However, making these classes outer would solve the problem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3846) KryoException when doing joins in SparkSQL

2015-05-26 Thread Santiago M. Mola (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14558808#comment-14558808
 ] 

Santiago M. Mola commented on SPARK-3846:
-

[~huangjs]  Would you mind adding a test case here (an example of data and 
exact code used to produce the exception)?

 KryoException when doing joins in SparkSQL 
 ---

 Key: SPARK-3846
 URL: https://issues.apache.org/jira/browse/SPARK-3846
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0, 1.2.0
Reporter: Jianshi Huang

 The error is reproducible when I join two tables manually. The error message 
 is like follows.
 org.apache.spark.SparkException: Job aborted due to stage failure: Task 645 
 in stage 3.0 failed 4 times, most recent failure: Lost task 645.3 in stage 
 3.0 (TID 3802, ...): com.esotericsoftware.kryo.KryoException:
 Unable to find class: 
 __wrapper$1$18e31777385a452ba0bc030e899bf5d1.__wrapper$1$18e31777385a452ba0bc030e899bf5d1$SpecificRow$1
 
 com.esotericsoftware.kryo.util.DefaultClassResolver.readName(DefaultClassResolver.java:138)
 
 com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:115)
 com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:610)
 com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:721)
 com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:42)
 com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:34)
 com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
 
 org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:133)
 
 org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133)
 org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
 scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
 
 org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30)
 
 org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
 scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
 scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
 
 org.apache.spark.sql.execution.HashJoin$$anon$1.hasNext(joins.scala:101)
 scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
 
 org.apache.spark.sql.execution.GeneratedAggregate$$anonfun$8.apply(GeneratedAggregate.scala:198)
 
 org.apache.spark.sql.execution.GeneratedAggregate$$anonfun$8.apply(GeneratedAggregate.scala:165)
 org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:599)
 org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:599)
 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
 org.apache.spark.scheduler.Task.run(Task.scala:56)
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:181)
 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 java.lang.Thread.run(Thread.java:724)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-7727) Avoid inner classes in RuleExecutor

2015-05-26 Thread Edoardo Vacchi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14558796#comment-14558796
 ] 

Edoardo Vacchi edited comment on SPARK-7727 at 5/26/15 7:41 AM:


[~smolav] (about the duplicate) that's fine, since I opened my PR later (I 
didn't see the other). My PR wraps the case classes in a companion object, 
though. Don't know which solution would be best


was (Author: evacchi):
[~smolav] that's fine, since I opened my PR later (I didn't see the other). My 
PR wraps the case classes in a companion object, though. Don't know which 
solution would be best

 Avoid inner classes in RuleExecutor
 ---

 Key: SPARK-7727
 URL: https://issues.apache.org/jira/browse/SPARK-7727
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.1
Reporter: Santiago M. Mola
  Labels: easyfix, starter

 In RuleExecutor, the following classes and objects are defined as inner 
 classes or objects: Strategy, Once, FixedPoint, Batch.
 This does not seem to accomplish anything in this case, but makes 
 extensibility harder. For example, if I want to define a new Optimizer that 
 uses all batches from the DefaultOptimizer plus some more, I would do 
 something like:
 {code}
 new Optimizer {
 override protected val batches: Seq[Batch] =
   DefaultOptimizer.batches ++ myBatches
  }
 {code}
 But this will give a typing error because batches in DefaultOptimizer are of 
 type DefaultOptimizer#Batch while myBatches are this#Batch.
 Workarounds include either copying the list of batches from DefaultOptimizer 
 or using a method like this:
 {code}
 private def transformBatchType(b: DefaultOptimizer.Batch): Batch = {
   val strategy = b.strategy.maxIterations match {
 case 1 => Once
 case n => FixedPoint(n)
   }
   Batch(b.name, strategy, b.rules)
 }
 {code}
 However, making these classes outer would solve the problem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7862) Query would hang when the using script has error output in SparkSQL

2015-05-26 Thread zhichao-li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhichao-li updated SPARK-7862:
--
Description: 
Steps to reproduce:

val data = (1 to 10).map { i => (i, i, i) }
data.toDF("d1", "d2", "d3").registerTempTable("script_trans")
sql("SELECT TRANSFORM (d1, d2, d3) USING 'cat 1>&2' AS (a,b,c) FROM script_trans")

 Query would hang when the using script has error output in SparkSQL
 ---

 Key: SPARK-7862
 URL: https://issues.apache.org/jira/browse/SPARK-7862
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: zhichao-li

 Steps to reproduce:
 val data = (1 to 10).map { i => (i, i, i) }
 data.toDF("d1", "d2", "d3").registerTempTable("script_trans")
 sql("SELECT TRANSFORM (d1, d2, d3) USING 'cat 1>&2' AS (a,b,c) FROM script_trans")



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-7727) Avoid inner classes in RuleExecutor

2015-05-26 Thread Edoardo Vacchi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14558796#comment-14558796
 ] 

Edoardo Vacchi edited comment on SPARK-7727 at 5/26/15 7:44 AM:


[~smolav] (about the duplicate) that's fine, since I opened my PR later (I 
didn't see the other). My PR wraps the case classes in a companion object, 
though. Don't know which solution would be best

about trait v. object. Object is currently fine, since batches can be reused 
through `val batches = DefaultOptimizer.batches`. If we go with traits, though 
(which I am in favor of) I would turn into traits also SparkPlanner, for 
symmetry (see also SPARK-6981)


was (Author: evacchi):
[~smolav] (about the duplicate) that's fine, since I opened my PR later (I 
didn't see the other). My PR wraps the case classes in a companion object, 
though. Don't know which solution would be best

about trait v. object. Object is currently fine, since batches can be reused 
through `val batches = DefaultOptimizer.batches`. If we go with traits, though 
(which I am in favor of) I would turn into traits also SparkPlanner, for 
symmetry

 Avoid inner classes in RuleExecutor
 ---

 Key: SPARK-7727
 URL: https://issues.apache.org/jira/browse/SPARK-7727
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.1
Reporter: Santiago M. Mola
  Labels: easyfix, starter

 In RuleExecutor, the following classes and objects are defined as inner 
 classes or objects: Strategy, Once, FixedPoint, Batch.
 This does not seem to accomplish anything in this case, but makes 
 extensibility harder. For example, if I want to define a new Optimizer that 
 uses all batches from the DefaultOptimizer plus some more, I would do 
 something like:
 {code}
 new Optimizer {
 override protected val batches: Seq[Batch] =
   DefaultOptimizer.batches ++ myBatches
  }
 {code}
 But this will give a typing error because batches in DefaultOptimizer are of 
 type DefaultOptimizer#Batch while myBatches are this#Batch.
 Workarounds include either copying the list of batches from DefaultOptimizer 
 or using a method like this:
 {code}
 private def transformBatchType(b: DefaultOptimizer.Batch): Batch = {
   val strategy = b.strategy.maxIterations match {
 case 1 => Once
 case n => FixedPoint(n)
   }
   Batch(b.name, strategy, b.rules)
 }
 {code}
 However, making these classes outer would solve the problem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3846) KryoException when doing joins in SparkSQL

2015-05-26 Thread Santiago M. Mola (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santiago M. Mola updated SPARK-3846:

Priority: Blocker  (was: Major)

 KryoException when doing joins in SparkSQL 
 ---

 Key: SPARK-3846
 URL: https://issues.apache.org/jira/browse/SPARK-3846
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0, 1.2.0
Reporter: Jianshi Huang
Priority: Blocker

 The error is reproducible when I join two tables manually. The error message 
 is like follows.
 org.apache.spark.SparkException: Job aborted due to stage failure: Task 645 
 in stage 3.0 failed 4 times, most recent failure: Lost task 645.3 in stage 
 3.0 (TID 3802, ...): com.esotericsoftware.kryo.KryoException:
 Unable to find class: 
 __wrapper$1$18e31777385a452ba0bc030e899bf5d1.__wrapper$1$18e31777385a452ba0bc030e899bf5d1$SpecificRow$1
 
 com.esotericsoftware.kryo.util.DefaultClassResolver.readName(DefaultClassResolver.java:138)
 
 com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:115)
 com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:610)
 com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:721)
 com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:42)
 com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:34)
 com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
 
 org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:133)
 
 org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133)
 org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
 scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
 
 org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30)
 
 org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
 scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
 scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
 
 org.apache.spark.sql.execution.HashJoin$$anon$1.hasNext(joins.scala:101)
 scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
 
 org.apache.spark.sql.execution.GeneratedAggregate$$anonfun$8.apply(GeneratedAggregate.scala:198)
 
 org.apache.spark.sql.execution.GeneratedAggregate$$anonfun$8.apply(GeneratedAggregate.scala:165)
 org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:599)
 org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:599)
 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
 org.apache.spark.scheduler.Task.run(Task.scala:56)
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:181)
 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 java.lang.Thread.run(Thread.java:724)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3846) [SQL] Serialization exception (Kryo) on joins when enabling codegen

2015-05-26 Thread Santiago M. Mola (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santiago M. Mola updated SPARK-3846:

Summary: [SQL] Serialization exception (Kryo) on joins when enabling 
codegen   (was: [SQL] Serialization exception (Kryo and Java) on joins when 
enabling codegen )

 [SQL] Serialization exception (Kryo) on joins when enabling codegen 
 

 Key: SPARK-3846
 URL: https://issues.apache.org/jira/browse/SPARK-3846
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0, 1.2.0
Reporter: Jianshi Huang
Priority: Blocker

 The error is reproducible when I join two tables manually. The error message 
 is like follows.
 org.apache.spark.SparkException: Job aborted due to stage failure: Task 645 
 in stage 3.0 failed 4 times, most recent failure: Lost task 645.3 in stage 
 3.0 (TID 3802, ...): com.esotericsoftware.kryo.KryoException:
 Unable to find class: 
 __wrapper$1$18e31777385a452ba0bc030e899bf5d1.__wrapper$1$18e31777385a452ba0bc030e899bf5d1$SpecificRow$1
 
 com.esotericsoftware.kryo.util.DefaultClassResolver.readName(DefaultClassResolver.java:138)
 
 com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:115)
 com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:610)
 com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:721)
 com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:42)
 com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:34)
 com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
 
 org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:133)
 
 org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133)
 org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
 scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
 
 org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30)
 
 org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
 scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
 scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
 
 org.apache.spark.sql.execution.HashJoin$$anon$1.hasNext(joins.scala:101)
 scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
 
 org.apache.spark.sql.execution.GeneratedAggregate$$anonfun$8.apply(GeneratedAggregate.scala:198)
 
 org.apache.spark.sql.execution.GeneratedAggregate$$anonfun$8.apply(GeneratedAggregate.scala:165)
 org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:599)
 org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:599)
 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
 org.apache.spark.scheduler.Task.run(Task.scala:56)
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:181)
 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 java.lang.Thread.run(Thread.java:724)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-3846) [SQL] Serialization exception (Kryo) on joins when enabling codegen

2015-05-26 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin closed SPARK-3846.
--
Resolution: Duplicate

 [SQL] Serialization exception (Kryo) on joins when enabling codegen 
 

 Key: SPARK-3846
 URL: https://issues.apache.org/jira/browse/SPARK-3846
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0, 1.2.0
Reporter: Jianshi Huang
Priority: Blocker

 The error is reproducible when I join two tables manually. The error message 
 is like follows.
 org.apache.spark.SparkException: Job aborted due to stage failure: Task 645 
 in stage 3.0 failed 4 times, most recent failure: Lost task 645.3 in stage 
 3.0 (TID 3802, ...): com.esotericsoftware.kryo.KryoException:
 Unable to find class: 
 __wrapper$1$18e31777385a452ba0bc030e899bf5d1.__wrapper$1$18e31777385a452ba0bc030e899bf5d1$SpecificRow$1
 
 com.esotericsoftware.kryo.util.DefaultClassResolver.readName(DefaultClassResolver.java:138)
 
 com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:115)
 com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:610)
 com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:721)
 com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:42)
 com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:34)
 com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
 
 org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:133)
 
 org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133)
 org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
 scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
 
 org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30)
 
 org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
 scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
 scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
 
 org.apache.spark.sql.execution.HashJoin$$anon$1.hasNext(joins.scala:101)
 scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
 
 org.apache.spark.sql.execution.GeneratedAggregate$$anonfun$8.apply(GeneratedAggregate.scala:198)
 
 org.apache.spark.sql.execution.GeneratedAggregate$$anonfun$8.apply(GeneratedAggregate.scala:165)
 org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:599)
 org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:599)
 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
 org.apache.spark.scheduler.Task.run(Task.scala:56)
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:181)
 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 java.lang.Thread.run(Thread.java:724)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7864) Clicking a job's DAG graph on Web UI kills the job as the link is broken

2015-05-26 Thread Carson Wang (JIRA)
Carson Wang created SPARK-7864:
--

 Summary: Clicking a job's DAG graph on Web UI kills the job as the 
link is broken
 Key: SPARK-7864
 URL: https://issues.apache.org/jira/browse/SPARK-7864
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 1.4.0
Reporter: Carson Wang


When clicking a job's DAG graph on the Web UI, the user is expected to be 
redirected to the corresponding stage page. The link is obtained from the stage 
table by selecting the first link in the row, but each row contains two links: 
the first is the killLink and the second is the nameLink.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7727) Avoid inner classes in RuleExecutor

2015-05-26 Thread Santiago M. Mola (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14558777#comment-14558777
 ] 

Santiago M. Mola commented on SPARK-7727:
-

[~evacchi] I'm sorry I opened this duplicate for: 
https://issues.apache.org/jira/browse/SPARK-7823

Not sure which one to mark as duplicate since both have pull requests.

 Avoid inner classes in RuleExecutor
 ---

 Key: SPARK-7727
 URL: https://issues.apache.org/jira/browse/SPARK-7727
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.1
Reporter: Santiago M. Mola
  Labels: easyfix, starter

 In RuleExecutor, the following classes and objects are defined as inner 
 classes or objects: Strategy, Once, FixedPoint, Batch.
 This does not seem to accomplish anything in this case, but makes 
 extensibility harder. For example, if I want to define a new Optimizer that 
 uses all batches from the DefaultOptimizer plus some more, I would do 
 something like:
 {code}
 new Optimizer {
 override protected val batches: Seq[Batch] =
   DefaultOptimizer.batches ++ myBatches
  }
 {code}
 But this will give a typing error because batches in DefaultOptimizer are of 
 type DefaultOptimizer#Batch while myBatches are this#Batch.
 Workarounds include either copying the list of batches from DefaultOptimizer 
 or using a method like this:
 {code}
 private def transformBatchType(b: DefaultOptimizer.Batch): Batch = {
   val strategy = b.strategy.maxIterations match {
 case 1 => Once
 case n => FixedPoint(n)
   }
   Batch(b.name, strategy, b.rules)
 }
 {code}
 However, making these classes outer would solve the problem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7862) Query would hang when the using script has error output in SparkSQL

2015-05-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7862:
---

Assignee: Apache Spark

 Query would hang when the using script has error output in SparkSQL
 ---

 Key: SPARK-7862
 URL: https://issues.apache.org/jira/browse/SPARK-7862
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: zhichao-li
Assignee: Apache Spark





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7562) Improve error reporting for expression data type mismatch

2015-05-26 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14558836#comment-14558836
 ] 

Wenchen Fan commented on SPARK-7562:


After thinking about it, I think we can't just use the ExpectsInputTypes 
interface. In some cases we don't know the exact required input types; for Add, 
for example, we only need the left and right expressions to have the same 
numeric data type.
I have sent a PR to add a `TypeConstraint` interface, which defines when an 
Expression has correct input data types and what error message should be 
generated on a type mismatch.
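
A hypothetical sketch of what such an interface could look like; the names and signatures here (TypeCheckResult, checkInputDataTypes) are assumptions for illustration, not the actual PR:
{code}
// Simplified placeholder types; only meant to show the idea that an expression
// describes its own type constraint and error message, rather than relying on
// a fixed list of expected child types.
sealed trait DataType
case object IntegerType extends DataType
case object DoubleType extends DataType

trait Expression { def dataType: DataType }

sealed trait TypeCheckResult
case object TypeCheckSuccess extends TypeCheckResult
case class TypeCheckFailure(message: String) extends TypeCheckResult

trait TypeConstraint { self: Expression =>
  def checkInputDataTypes(): TypeCheckResult
}

// An Add-like expression that only requires both children to share a numeric type.
case class Add(left: Expression, right: Expression) extends Expression with TypeConstraint {
  private val numericTypes: Set[DataType] = Set(IntegerType, DoubleType)
  val dataType: DataType = left.dataType
  def checkInputDataTypes(): TypeCheckResult =
    if (left.dataType == right.dataType && numericTypes.contains(left.dataType))
      TypeCheckSuccess
    else
      TypeCheckFailure(
        s"Add expects two numeric children of the same type, got ${left.dataType} and ${right.dataType}")
}
{code}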

 Improve error reporting for expression data type mismatch
 -

 Key: SPARK-7562
 URL: https://issues.apache.org/jira/browse/SPARK-7562
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Reynold Xin

 There is currently no error reporting for expression data types in analysis 
 (we rely on resolved for that, which doesn't provide great error messages 
 for types). It would be great to have that in checkAnalysis.
 Ideally, it should be the responsibility of each Expression itself to specify 
 the types it requires, and report errors that way. We would need to define a 
 simple interface for that so each Expression can implement. The default 
 implementation can just use the information provided by 
 ExpectsInputTypes.expectedChildTypes. 
 cc [~marmbrus] what we discussed offline today.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7562) Improve error reporting for expression data type mismatch

2015-05-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7562:
---

Assignee: (was: Apache Spark)

 Improve error reporting for expression data type mismatch
 -

 Key: SPARK-7562
 URL: https://issues.apache.org/jira/browse/SPARK-7562
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Reynold Xin

 There is currently no error reporting for expression data types in analysis 
 (we rely on resolved for that, which doesn't provide great error messages 
 for types). It would be great to have that in checkAnalysis.
 Ideally, it should be the responsibility of each Expression itself to specify 
 the types it requires, and report errors that way. We would need to define a 
 simple interface for that so each Expression can implement. The default 
 implementation can just use the information provided by 
 ExpectsInputTypes.expectedChildTypes. 
 cc [~marmbrus] what we discussed offline today.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7863) SimpleDateParam should not use SimpleDateFormat in multiple threads because SimpleDateFormat is not thread-safe

2015-05-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7863:
---

Assignee: (was: Apache Spark)

 SimpleDateParam should not use SimpleDateFormat in multiple threads because 
 SimpleDateFormat is not thread-safe
 ---

 Key: SPARK-7863
 URL: https://issues.apache.org/jira/browse/SPARK-7863
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Shixiong Zhu





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7863) SimpleDateParam should not use SimpleDateFormat in multiple threads because SimpleDateFormat is not thread-safe

2015-05-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14558841#comment-14558841
 ] 

Apache Spark commented on SPARK-7863:
-

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/6406
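
Independent of what the pull request above does, a sketch of one common mitigation: give each thread its own SimpleDateFormat, since the class keeps mutable state while parsing and formatting. The object names and the date pattern are illustrative only:
{code}
import java.text.SimpleDateFormat
import java.util.{Date, Locale}

// A shared SimpleDateFormat used from several threads can return corrupted
// results; a ThreadLocal gives each thread its own formatter behind one
// access point.
object SafeDateFormat {
  private val fmt = new ThreadLocal[SimpleDateFormat] {
    override def initialValue(): SimpleDateFormat =
      new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSz", Locale.US)
  }

  def format(d: Date): String = fmt.get().format(d)
  def parse(s: String): Date = fmt.get().parse(s)
}

object SafeDateFormatDemo {
  def main(args: Array[String]): Unit = {
    val threads = (1 to 4).map { i =>
      new Thread(new Runnable {
        def run(): Unit = println(s"thread $i: ${SafeDateFormat.format(new Date())}")
      })
    }
    threads.foreach(_.start())
    threads.foreach(_.join())
  }
}
{code}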

 SimpleDateParam should not use SimpleDateFormat in multiple threads because 
 SimpleDateFormat is not thread-safe
 ---

 Key: SPARK-7863
 URL: https://issues.apache.org/jira/browse/SPARK-7863
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Shixiong Zhu





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7862) Query would hang when the using script has error output in SparkSQL

2015-05-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7862:
---

Assignee: (was: Apache Spark)

 Query would hang when the using script has error output in SparkSQL
 ---

 Key: SPARK-7862
 URL: https://issues.apache.org/jira/browse/SPARK-7862
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: zhichao-li





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7862) Query would hang when the using script has error output in SparkSQL

2015-05-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14558797#comment-14558797
 ] 

Apache Spark commented on SPARK-7862:
-

User 'zhichao-li' has created a pull request for this issue:
https://github.com/apache/spark/pull/6404

 Query would hang when the using script has error output in SparkSQL
 ---

 Key: SPARK-7862
 URL: https://issues.apache.org/jira/browse/SPARK-7862
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: zhichao-li





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7727) Avoid inner classes in RuleExecutor

2015-05-26 Thread Edoardo Vacchi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14558796#comment-14558796
 ] 

Edoardo Vacchi commented on SPARK-7727:
---

[~smolav] that's fine, since I opened my PR later (I didn't see the other). My 
PR wraps the case classes in a companion object, though. Don't know which 
solution would be best

 Avoid inner classes in RuleExecutor
 ---

 Key: SPARK-7727
 URL: https://issues.apache.org/jira/browse/SPARK-7727
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.1
Reporter: Santiago M. Mola
  Labels: easyfix, starter

 In RuleExecutor, the following classes and objects are defined as inner 
 classes or objects: Strategy, Once, FixedPoint, Batch.
 This does not seem to accomplish anything in this case, but makes 
 extensibility harder. For example, if I want to define a new Optimizer that 
 uses all batches from the DefaultOptimizer plus some more, I would do 
 something like:
 {code}
 new Optimizer {
 override protected val batches: Seq[Batch] =
   DefaultOptimizer.batches ++ myBatches
  }
 {code}
 But this will give a typing error because batches in DefaultOptimizer are of 
 type DefaultOptimizer#Batch while myBatches are this#Batch.
 Workarounds include either copying the list of batches from DefaultOptimizer 
 or using a method like this:
 {code}
 private def transformBatchType(b: DefaultOptimizer.Batch): Batch = {
   val strategy = b.strategy.maxIterations match {
 case 1 => Once
 case n => FixedPoint(n)
   }
   Batch(b.name, strategy, b.rules)
 }
 {code}
 However, making these classes outer would solve the problem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5707) Enabling spark.sql.codegen throws ClassNotFound exception

2015-05-26 Thread Santiago M. Mola (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14558823#comment-14558823
 ] 

Santiago M. Mola commented on SPARK-5707:
-

This is probably a duplicate of SPARK-3846.

 Enabling spark.sql.codegen throws ClassNotFound exception
 -

 Key: SPARK-5707
 URL: https://issues.apache.org/jira/browse/SPARK-5707
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0, 1.3.1
 Environment: yarn-client mode, spark.sql.codegen=true
Reporter: Yi Yao
Assignee: Ram Sriharsha
Priority: Blocker

 Exception thrown:
 {noformat}
 org.apache.spark.SparkException: Job aborted due to stage failure: Task 13 in 
 stage 133.0 failed 4 times, most recent failure: Lost task 13.3 in stage 
 133.0 (TID 3066, cdh52-node2): java.io.IOException: 
 com.esotericsoftware.kryo.KryoException: Unable to find class: 
 __wrapper$1$81257352e1c844aebf09cb84fe9e7459.__wrapper$1$81257352e1c844aebf09cb84fe9e7459$SpecificRow$1
 Serialization trace:
 hashTable (org.apache.spark.sql.execution.joins.UniqueKeyHashedRelation)
 at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1011)
 at 
 org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:164)
 at 
 org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64)
 at 
 org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64)
 at 
 org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:87)
 at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
 at 
 org.apache.spark.sql.execution.joins.BroadcastHashJoin$$anonfun$3.apply(BroadcastHashJoin.scala:62)
 at 
 org.apache.spark.sql.execution.joins.BroadcastHashJoin$$anonfun$3.apply(BroadcastHashJoin.scala:61)
 at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601)
 at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601)
 at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
 at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
 at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
 at org.apache.spark.rdd.CartesianRDD.compute(CartesianRDD.scala:75)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
 at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
 at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
 at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
 at org.apache.spark.rdd.CartesianRDD.compute(CartesianRDD.scala:75)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
 at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
 at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
 at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
 at org.apache.spark.scheduler.Task.run(Task.scala:56)
 at 
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
 

[jira] [Created] (SPARK-7863) SimpleDateParam should not use SimpleDateFormat in multiple threads because SimpleDateFormat is not thread-safe

2015-05-26 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-7863:
---

 Summary: SimpleDateParam should not use SimpleDateFormat in 
multiple threads because SimpleDateFormat is not thread-safe
 Key: SPARK-7863
 URL: https://issues.apache.org/jira/browse/SPARK-7863
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Shixiong Zhu






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7864) Clicking a job's DAG graph on Web UI kills the job as the link is broken

2015-05-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7864:
---

Assignee: Apache Spark

 Clicking a job's DAG graph on Web UI kills the job as the link is broken
 

 Key: SPARK-7864
 URL: https://issues.apache.org/jira/browse/SPARK-7864
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 1.4.0
Reporter: Carson Wang
Assignee: Apache Spark

 When clicking a job's DAG graph on the Web UI, the user is expected to be 
 redirected to the corresponding stage page. The link is obtained from the stage 
 table by selecting the first link in the row, but each row contains two links: 
 the first is the killLink and the second is the nameLink.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


