[jira] [Commented] (SPARK-5991) Python API for ML model import/export

2015-02-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339853#comment-14339853
 ] 

Apache Spark commented on SPARK-5991:
-

User 'mengxr' has created a pull request for this issue:
https://github.com/apache/spark/pull/4811

> Python API for ML model import/export
> -
>
> Key: SPARK-5991
> URL: https://issues.apache.org/jira/browse/SPARK-5991
> Project: Spark
>  Issue Type: Umbrella
>  Components: MLlib, PySpark
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Priority: Critical
>
> Many ML models support save/load in Scala and Java.  The Python API needs 
> this.  It should mostly be a simple matter of calling the JVM methods for 
> save/load, except for models which are stored in Python (e.g., linear models).






[jira] [Created] (SPARK-6056) Unlimited off-heap memory use causes the RM to kill the container

2015-02-26 Thread SaintBacchus (JIRA)
SaintBacchus created SPARK-6056:
---

 Summary: Unlimited off-heap memory use causes the RM to kill the container
 Key: SPARK-6056
 URL: https://issues.apache.org/jira/browse/SPARK-6056
 Project: Spark
  Issue Type: Bug
  Components: Shuffle, Spark Core
Affects Versions: 1.2.1
Reporter: SaintBacchus









[jira] [Commented] (SPARK-6055) memory leak in pyspark sql

2015-02-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339850#comment-14339850
 ] 

Apache Spark commented on SPARK-6055:
-

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/4810

> memory leak in pyspark sql
> --
>
> Key: SPARK-6055
> URL: https://issues.apache.org/jira/browse/SPARK-6055
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.1.1, 1.3.0, 1.2.1
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Blocker
>
> The __eq__ of DataType is not correct and the class cache is not used correctly 
> (a created class cannot be found by its DataType), so lots of classes are created 
> (saved in _cached_cls) and never released.
> Also, all instances of the same DataType share the same hash code, so many objects 
> end up in a dict under the same hash code (effectively a hash-collision attack), and 
> accessing this dict becomes very slow (depending on the CPython implementation).






[jira] [Commented] (SPARK-5556) Latent Dirichlet Allocation (LDA) using Gibbs sampler

2015-02-26 Thread Pedro Rodriguez (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339849#comment-14339849
 ] 

Pedro Rodriguez commented on SPARK-5556:


Based on initial testing, I recall FastLDA being O(1) in practice; I should be 
able to confirm that with a larger-scale test soon. LightLDA is definitely worth 
looking into, I think, but at this point my focus is on getting the FastLDA 
Gibbs sampler to a mergeable state (tests pass, the refactoring/API for LDA is good, 
and it performs at scale as well as or better than EM).

> Latent Dirichlet Allocation (LDA) using Gibbs sampler 
> --
>
> Key: SPARK-5556
> URL: https://issues.apache.org/jira/browse/SPARK-5556
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Guoqiang Li
>







[jira] [Commented] (SPARK-5556) Latent Dirichlet Allocation (LDA) using Gibbs sampler

2015-02-26 Thread Guoqiang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339847#comment-14339847
 ] 

Guoqiang Li commented on SPARK-5556:


[This branch|https://github.com/witgo/spark/tree/lda_Gibbs]'s computational 
complexity is O(N_dk), where N_dk is the number of unique topics in document d.

> Latent Dirichlet Allocation (LDA) using Gibbs sampler 
> --
>
> Key: SPARK-5556
> URL: https://issues.apache.org/jira/browse/SPARK-5556
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Guoqiang Li
>







[jira] [Commented] (SPARK-6055) memory leak in pyspark sql

2015-02-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339846#comment-14339846
 ] 

Apache Spark commented on SPARK-6055:
-

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/4809

> memory leak in pyspark sql
> --
>
> Key: SPARK-6055
> URL: https://issues.apache.org/jira/browse/SPARK-6055
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.1.1, 1.3.0, 1.2.1
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Blocker
>
> The __eq__ of DataType is not correct and the class cache is not used correctly 
> (a created class cannot be found by its DataType), so lots of classes are created 
> (saved in _cached_cls) and never released.
> Also, all instances of the same DataType share the same hash code, so many objects 
> end up in a dict under the same hash code (effectively a hash-collision attack), and 
> accessing this dict becomes very slow (depending on the CPython implementation).






[jira] [Commented] (SPARK-6055) memory leak in pyspark sql

2015-02-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339840#comment-14339840
 ] 

Apache Spark commented on SPARK-6055:
-

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/4808

> memory leak in pyspark sql
> --
>
> Key: SPARK-6055
> URL: https://issues.apache.org/jira/browse/SPARK-6055
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.1.1, 1.3.0, 1.2.1
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Blocker
>
> The __eq__ of DataType is not correct and the class cache is not used correctly 
> (a created class cannot be found by its DataType), so lots of classes are created 
> (saved in _cached_cls) and never released.
> Also, all instances of the same DataType share the same hash code, so many objects 
> end up in a dict under the same hash code (effectively a hash-collision attack), and 
> accessing this dict becomes very slow (depending on the CPython implementation).






[jira] [Created] (SPARK-6054) SQL UDF returning object of case class; regression from 1.2.0

2015-02-26 Thread Spiro Michaylov (JIRA)
Spiro Michaylov created SPARK-6054:
--

 Summary: SQL UDF returning object of case class; regression from 
1.2.0
 Key: SPARK-6054
 URL: https://issues.apache.org/jira/browse/SPARK-6054
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.3.0
 Environment: Windows 8, Scala 2.11.2, Spark 1.3.0 RC1
Reporter: Spiro Michaylov


The following code fails with a stack trace beginning with:

15/02/26 23:21:32 ERROR Executor: Exception in task 2.0 in stage 7.0 (TID 422)
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: makeCopy, tree: 
scalaUDF(sales#2,discounts#3)
at 
org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:47)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:309)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformChildrenDown(TreeNode.scala:237)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:192)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:207)

Here is the 1.3.0 version of the code:

case class SalesDisc(sales: Double, discounts: Double)
def makeStruct(sales: Double, disc: Double) = SalesDisc(sales, disc)
sqlContext.udf.register("makeStruct", makeStruct _)
val withStruct =
  sqlContext.sql("SELECT id, sd.sales FROM (SELECT id, makeStruct(sales, discounts) AS sd FROM customerTable) AS d")
withStruct.foreach(println)

This used to work in 1.2.0. Interestingly, the following simplified version 
fails similarly, even though it seems to me to be VERY similar to the last test 
in the UDFSuite:

SELECT makeStruct(sales, discounts) AS sd FROM customerTable

The data table is defined thus:

val custs = Seq(
  Cust(1, "Widget Co", 12.00, 0.00, "AZ"),
  Cust(2, "Acme Widgets", 410500.00, 500.00, "CA"),
  Cust(3, "Widgetry", 410500.00, 200.00, "CA"),
  Cust(4, "Widgets R Us", 410500.00, 0.0, "CA"),
  Cust(5, "Ye Olde Widgete", 500.00, 0.0, "MA")
)
val customerTable = sc.parallelize(custs, 4).toDF()

customerTable.registerTempTable("customerTable")
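
For reference when trying to reproduce this, a Cust definition consistent with the sample rows above could be the following (an assumption; the reporter's actual case class was not included in the report):

{code}
// Hypothetical reconstruction inferred from the sample data above; field names are assumed.
case class Cust(id: Int, name: String, sales: Double, discounts: Double, state: String)
{code}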






[jira] [Created] (SPARK-6055) memory leak in pyspark sql

2015-02-26 Thread Davies Liu (JIRA)
Davies Liu created SPARK-6055:
-

 Summary: memory leak in pyspark sql
 Key: SPARK-6055
 URL: https://issues.apache.org/jira/browse/SPARK-6055
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 1.2.1, 1.1.1, 1.3.0
Reporter: Davies Liu
Assignee: Davies Liu
Priority: Blocker


The __eq__ of DataType is not correct and the class cache is not used correctly 
(a created class cannot be found by its DataType), so lots of classes are created 
(saved in _cached_cls) and never released.

Also, all instances of the same DataType share the same hash code, so many objects 
end up in a dict under the same hash code (effectively a hash-collision attack), and 
accessing this dict becomes very slow (depending on the CPython implementation).
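
To illustrate the second point in isolation (a generic Scala sketch, not the PySpark code in question): when every key reports the same hash code, a hash map degenerates into a single bucket and each lookup becomes a linear scan.

{code}
// Generic illustration of the hash-collision slowdown described above (not Spark code).
final case class BadKey(n: Int) {
  override def hashCode: Int = 42   // every key lands in the same bucket
}

object HashCollisionDemo extends App {
  val m = scala.collection.mutable.HashMap.empty[BadKey, Int]
  (1 to 20000).foreach(i => m(BadKey(i)) = i)   // inserts keep rescanning one bucket
  println(m.get(BadKey(19999)))                 // each lookup is O(n) instead of O(1)
}
{code}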






[jira] [Updated] (SPARK-6036) EventLog process logic has race condition with Akka actor system

2015-02-26 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-6036:
-
Labels: backport-needed  (was: )

> EventLog process logic has race condition with Akka actor system
> 
>
> Key: SPARK-6036
> URL: https://issues.apache.org/jira/browse/SPARK-6036
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 1.3.0
>Reporter: Zhang, Liye
>Assignee: Zhang, Liye
>  Labels: backport-needed
> Fix For: 1.4.0
>
>
> When the application finishes, the Akka actor system triggers a disassociated 
> event and the Master rebuilds the SparkUI on the web, in the process checking 
> whether the event log files are still in progress. The current logic in 
> SparkContext stops the actor system first and then stops the eventLogListener. 
> As a result, the eventLogListener has not finished renaming the 
> event log dir (from app-.inprogress to app-) when the Spark Master 
> tries to access the dir.
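
A minimal sketch of the ordering the description implies (the names below are hypothetical stubs, not SparkContext's actual API): the event log listener has to finish renaming the log directory before the actor system goes down, because the disassociation event is what makes the Master read that directory.

{code}
// Illustrative ordering only; both methods are hypothetical stand-ins.
object ShutdownOrderSketch {
  def stopEventLogListener(): Unit = ()  // would flush events and rename app-<id>.inprogress to app-<id>
  def stopActorSystem(): Unit = ()       // would trigger the disassociation the Master reacts to

  def stop(): Unit = {
    stopEventLogListener()  // rename happens first
    stopActorSystem()       // the Master's SparkUI rebuild then sees the finished directory name
  }
}
{code}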






[jira] [Updated] (SPARK-6036) EventLog process logic has race condition with Akka actor system

2015-02-26 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-6036:
-
Target Version/s: 1.4.0, 1.3.1
   Fix Version/s: 1.4.0
Assignee: Zhang, Liye

> EventLog process logic has race condition with Akka actor system
> 
>
> Key: SPARK-6036
> URL: https://issues.apache.org/jira/browse/SPARK-6036
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 1.3.0
>Reporter: Zhang, Liye
>Assignee: Zhang, Liye
>  Labels: backport-needed
> Fix For: 1.4.0
>
>
> When the application finishes, the Akka actor system triggers a disassociated 
> event and the Master rebuilds the SparkUI on the web, in the process checking 
> whether the event log files are still in progress. The current logic in 
> SparkContext stops the actor system first and then stops the eventLogListener. 
> As a result, the eventLogListener has not finished renaming the 
> event log dir (from app-.inprogress to app-) when the Spark Master 
> tries to access the dir.






[jira] [Commented] (SPARK-6050) Spark on YARN does not work when --executor-cores is specified

2015-02-26 Thread Mridul Muralidharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339798#comment-14339798
 ] 

Mridul Muralidharan commented on SPARK-6050:


With more verbose debugging added, the problem surfaces.
At least with hadoop 2.5, the returned response always has vCores == 1 (and at 
the RM, it is treated as vCores == 1 too ... sigh, unimplemented?).


So in effect, we must not set executorCores while creating the "resource" in 
YarnAllocator.

See the log snippet below:


15/02/27 06:37:33 INFO YarnAllocator: Will request 1 executor containers, each 
with 2 cores and 32870 MB memory including 2150 MB overhead
15/02/27 06:37:33 DEBUG AMRMClientImpl: Added priority=1
15/02/27 06:37:33 DEBUG AMRMClientImpl: addResourceRequest: applicationId= 
priority=1 resourceName=* numContainers=1 #asks=1
15/02/27 06:37:33 INFO YarnAllocator: Container request (host: Any, capability: 
)
15/02/27 06:37:33 INFO YarnAllocator: missing = 0, targetNumExecutors = 1, 
numPendingAllocate = 1, numExecutorsRunning = 0
15/02/27 06:37:33 INFO AMRMClientImpl: Received new token for : :8041
15/02/27 06:37:33 DEBUG YarnAllocator: Allocated containers: 1. Current 
executor count: 0. Cluster resources: .
15/02/27 06:37:33 INFO YarnAllocator: allocatedContainer = Container: 
[ContainerId: , NodeId: :8041, NodeHttpAddress: 
:8042, Resource: , Priority: 1, Token: Token { 
kind: ContainerToken, service: :8041 }, ], location = 
15/02/27 06:37:33 INFO YarnAllocator: allocatedContainer = Container: 
[ContainerId: , NodeId: :8041, NodeHttpAddress: 
:8042, Resource: , Priority: 1, Token: Token { 
kind: ContainerToken, service: :8041 }, ], location = /
15/02/27 06:37:33 INFO YarnAllocator: allocatedContainer = Container: 
[ContainerId: , NodeId: :8041, NodeHttpAddress: 
:8042, Resource: , Priority: 1, Token: Token { 
kind: ContainerToken, service: :8041 }, ], location = *
15/02/27 06:37:33 DEBUG YarnAllocator: Releasing 1 unneeded containers that 
were allocated to us
15/02/27 06:37:33 INFO YarnAllocator: Received 1 containers from YARN, 
launching executors on 0 of them.
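
A rough sketch of the workaround direction stated above, assuming the Hadoop 2.5 RM (with its default resource calculator) honours only memory and reports vCores == 1 back regardless; the method and parameter names are illustrative, not the actual YarnAllocator code:

{code}
import org.apache.hadoop.yarn.api.records.Resource

// Hypothetical sketch: build the container capability from memory only and leave
// vCores at 1 instead of encoding executorCores, so allocated containers are not
// rejected when the RM's response comes back with vCores == 1.
def containerCapability(executorMemoryMB: Int, overheadMB: Int): Resource =
  Resource.newInstance(executorMemoryMB + overheadMB, 1)
{code}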


> Spark on YARN does not work when --executor-cores is specified
> -
>
> Key: SPARK-6050
> URL: https://issues.apache.org/jira/browse/SPARK-6050
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.3.0
> Environment: 2.5 based YARN cluster.
>Reporter: Mridul Muralidharan
>Priority: Blocker
>
> There are multiple issues here (which I will detail as comments), but to 
> reproduce: running the following ALWAYS hangs in our cluster with the 1.3 RC
> ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster \
>   --executor-cores 8 --num-executors 15 --driver-memory 4g --executor-memory 2g \
>   --queue webmap lib/spark-examples*.jar 10






[jira] [Created] (SPARK-6053) Support model save/load in Python's ALS.

2015-02-26 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-6053:


 Summary: Support model save/load in Python's ALS.
 Key: SPARK-6053
 URL: https://issues.apache.org/jira/browse/SPARK-6053
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib, PySpark
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
Priority: Minor


It should be a simple wrapper around the Scala implementation.
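
For reference, the Scala side being wrapped follows MLlib's save/load pattern, roughly as sketched below (the path and training parameters are placeholders; this shows what the Python API would delegate to through the JVM gateway, not the wrapper itself):

{code}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.recommendation.{ALS, MatrixFactorizationModel, Rating}

// Hedged sketch of the Scala save/load round trip a thin Python wrapper would call.
def roundTrip(sc: SparkContext, ratings: RDD[Rating], path: String): MatrixFactorizationModel = {
  val model = ALS.train(ratings, 10, 10, 0.01)  // rank, iterations, lambda (placeholder values)
  model.save(sc, path)                          // persist factors and metadata
  MatrixFactorizationModel.load(sc, path)       // reload on the JVM side
}
{code}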






[jira] [Updated] (SPARK-5991) Python API for ML model import/export

2015-02-26 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-5991:
-
Issue Type: Umbrella  (was: Sub-task)
Parent: (was: SPARK-4587)

> Python API for ML model import/export
> -
>
> Key: SPARK-5991
> URL: https://issues.apache.org/jira/browse/SPARK-5991
> Project: Spark
>  Issue Type: Umbrella
>  Components: MLlib, PySpark
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Priority: Critical
>
> Many ML models support save/load in Scala and Java.  The Python API needs 
> this.  It should mostly be a simple matter of calling the JVM methods for 
> save/load, except for models which are stored in Python (e.g., linear models).






[jira] [Updated] (SPARK-5991) Python API for ML model import/export

2015-02-26 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-5991:
-
Target Version/s:   (was: 1.4.0)

> Python API for ML model import/export
> -
>
> Key: SPARK-5991
> URL: https://issues.apache.org/jira/browse/SPARK-5991
> Project: Spark
>  Issue Type: Umbrella
>  Components: MLlib, PySpark
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Priority: Critical
>
> Many ML models support save/load in Scala and Java.  The Python API needs 
> this.  It should mostly be a simple matter of calling the JVM methods for 
> save/load, except for models which are stored in Python (e.g., linear models).






[jira] [Commented] (SPARK-5845) Time to clean up spilled shuffle files not included in shuffle write time

2015-02-26 Thread Ilya Ganelin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339794#comment-14339794
 ] 

Ilya Ganelin commented on SPARK-5845:
-

I'm code complete on this, will submit a PR shortly.

> Time to clean up spilled shuffle files not included in shuffle write time
> 
>
> Key: SPARK-5845
> URL: https://issues.apache.org/jira/browse/SPARK-5845
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 1.3.0, 1.2.1
>Reporter: Kay Ousterhout
>Assignee: Ilya Ganelin
>Priority: Minor
>
> When the disk is contended, I've observed cases where it takes as long as 7 
> seconds to clean up all of the intermediate spill files for a shuffle (when 
> using the sort-based shuffle, but bypassing merging because there are <=200 
> shuffle partitions).  This happens even when the shuffle data is not huge (152MB 
> written from one of the tasks where I observed this).  This is effectively 
> part of the shuffle write time (because it's a necessary side effect of 
> writing data to disk), so it should be added to the shuffle write time to 
> facilitate debugging.






[jira] [Commented] (SPARK-5281) Registering table on RDD is giving MissingRequirementError

2015-02-26 Thread Sangkyoon Nam (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339779#comment-14339779
 ] 

Sangkyoon Nam commented on SPARK-5281:
--

I have the same problem.
In my case, I used CDH 5.3.x.

> Registering table on RDD is giving MissingRequirementError
> --
>
> Key: SPARK-5281
> URL: https://issues.apache.org/jira/browse/SPARK-5281
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.0
>Reporter: sarsol
>Priority: Critical
>
> The application crashes on the line  rdd.registerTempTable("temp")  in version 1.2 
> when using sbt or the Eclipse Scala IDE.
> Stacktrace:
> Exception in thread "main" scala.reflect.internal.MissingRequirementError: 
> class org.apache.spark.sql.catalyst.ScalaReflection in JavaMirror with 
> primordial classloader with boot classpath 
> [C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-library.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-reflect.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-actor.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-swing.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-compiler.jar;C:\Program
>  Files\Java\jre7\lib\resources.jar;C:\Program 
> Files\Java\jre7\lib\rt.jar;C:\Program 
> Files\Java\jre7\lib\sunrsasign.jar;C:\Program 
> Files\Java\jre7\lib\jsse.jar;C:\Program 
> Files\Java\jre7\lib\jce.jar;C:\Program 
> Files\Java\jre7\lib\charsets.jar;C:\Program 
> Files\Java\jre7\lib\jfr.jar;C:\Program Files\Java\jre7\classes] not found.
>   at 
> scala.reflect.internal.MissingRequirementError$.signal(MissingRequirementError.scala:16)
>   at 
> scala.reflect.internal.MissingRequirementError$.notFound(MissingRequirementError.scala:17)
>   at 
> scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:48)
>   at 
> scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:61)
>   at 
> scala.reflect.internal.Mirrors$RootsBase.staticModuleOrClass(Mirrors.scala:72)
>   at 
> scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:119)
>   at 
> scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:21)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$$typecreator1$1.apply(ScalaReflection.scala:115)
>   at 
> scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:231)
>   at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:231)
>   at scala.reflect.api.TypeTags$class.typeOf(TypeTags.scala:335)
>   at scala.reflect.api.Universe.typeOf(Universe.scala:59)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:115)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:33)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:100)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:33)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$class.attributesFor(ScalaReflection.scala:94)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.attributesFor(ScalaReflection.scala:33)
>   at org.apache.spark.sql.SQLContext.createSchemaRDD(SQLContext.scala:111)
>   at 
> com.sar.spark.dq.poc.SparkPOC$delayedInit$body.apply(SparkPOC.scala:43)
>   at scala.Function0$class.apply$mcV$sp(Function0.scala:40)
>   at 
> scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
>   at scala.App$$anonfun$main$1.apply(App.scala:71)
>   at scala.App$$anonfun$main$1.apply(App.scala:71)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:32)
>   at scala.App$class.main(App.scala:71)






[jira] [Commented] (SPARK-5556) Latent Dirichlet Allocation (LDA) using Gibbs sampler

2015-02-26 Thread Pedro Rodriguez (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339772#comment-14339772
 ] 

Pedro Rodriguez commented on SPARK-5556:


See the PR for info. TL;DR: it contains refactoring for multiple LDA algorithms, 
including how EM would be refactored. In the near future it will contain the Gibbs 
implementation I have been working on.

> Latent Dirichlet Allocation (LDA) using Gibbs sampler 
> --
>
> Key: SPARK-5556
> URL: https://issues.apache.org/jira/browse/SPARK-5556
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Guoqiang Li
>







[jira] [Commented] (SPARK-5556) Latent Dirichlet Allocation (LDA) using Gibbs sampler

2015-02-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339767#comment-14339767
 ] 

Apache Spark commented on SPARK-5556:
-

User 'EntilZha' has created a pull request for this issue:
https://github.com/apache/spark/pull/4807

> Latent Dirichlet Allocation (LDA) using Gibbs sampler 
> --
>
> Key: SPARK-5556
> URL: https://issues.apache.org/jira/browse/SPARK-5556
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Guoqiang Li
>







[jira] [Commented] (SPARK-6052) In JSON schema inference, we should always set containsNull of an ArrayType to true

2015-02-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339761#comment-14339761
 ] 

Apache Spark commented on SPARK-6052:
-

User 'yhuai' has created a pull request for this issue:
https://github.com/apache/spark/pull/4806

> In JSON schema inference, we should always set containsNull of an ArrayType 
> to true
> ---
>
> Key: SPARK-6052
> URL: https://issues.apache.org/jira/browse/SPARK-6052
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Priority: Blocker
>
> We should not try to figure out whether an array contains null, because we 
> may miss arrays with nulls if we sample the data, and future data may have 
> nulls in the array.






[jira] [Commented] (SPARK-6051) Add an option for DirectKafkaInputDStream to commit the offsets into ZK

2015-02-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339758#comment-14339758
 ] 

Apache Spark commented on SPARK-6051:
-

User 'jerryshao' has created a pull request for this issue:
https://github.com/apache/spark/pull/4805

> Add an option for DirectKafkaInputDStream to commit the offsets into ZK
> ---
>
> Key: SPARK-6051
> URL: https://issues.apache.org/jira/browse/SPARK-6051
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.3.0
>Reporter: Saisai Shao
>
> Currently in DirectKafkaInputDStream, offsets are managed by Spark Streaming 
> itself without ZK or Kafka involved, which makes several third-party 
> offset-monitoring tools unable to monitor the status of the Kafka consumer. This adds 
> an option to commit the offsets to ZK when each job finishes. The commit is 
> done asynchronously, so the main processing flow will not be blocked; it has 
> already been tested with the KafkaOffsetMonitor tool.






[jira] [Comment Edited] (SPARK-6050) Spark on YARN does not work when --executor-cores is specified

2015-02-26 Thread Mridul Muralidharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339686#comment-14339686
 ] 

Mridul Muralidharan edited comment on SPARK-6050 at 2/27/15 5:24 AM:
-

Thanks to [~tgraves] for helping investigate this.

There are multiple issues in the codebase - and not all of them have been fully 
understood.

a) For some reason, either YARN returns an incorrect response to an allocate 
request or we are not setting the right param.
See snippet [1] for details.
(I can't share the logs, unfortunately - but Tom has access to them and it should be 
trivial for others to reproduce the issue.)

b) Whatever the reason (a) happens, we do not recover from it.
All subsequent heartbeat requests DO NOT contain pending allocation 
requests (and we have rejected/de-allocated whatever YARN just sent us due to 
(a)).

To elaborate: updateResourceRequests has missing == 0 since it relies on 
getNumPendingAllocate() - which DOES NOT do the right thing in our context. 
Note: the 'ask' list in the superclass was cleared as part of the previous 
allocate() call.


Fixing (a) will mask (b) - but IMO we should address it as soon as possible too.




[1] Note the vCore in the response, and the subsequent ignoring of all 
containers.
15/02/27 01:40:30 INFO YarnAllocator: Will request 1000 executor containers, 
each with 8 cores and 38912 MB memory including 10240 MB overhead
15/02/27 01:40:30 INFO YarnAllocator: Container request (host: Any, capability: 
)
15/02/27 01:40:30 INFO ApplicationMaster: Started progress reporter thread - 
sleep time : 5000
15/02/27 01:40:30 DEBUG ApplicationMaster: Sending progress
15/02/27 01:40:30 INFO YarnAllocator: missing = 0, targetNumExecutors = 1000, 
numPendingAllocate = 1000, numExecutorsRunning = 0
15/02/27 01:40:35 DEBUG ApplicationMaster: Sending progress
15/02/27 01:40:35 INFO YarnAllocator: missing = 0, targetNumExecutors = 1000, 
numPendingAllocate = 1000, numExecutorsRunning = 0
15/02/27 01:40:36 DEBUG YarnAllocator: Allocated containers: 1000. Current 
executor count: 0. Cluster resources: .
15/02/27 01:40:36 DEBUG YarnAllocator: Releasing 1000 unneeded containers that 
were allocated to us
15/02/27 01:40:36 INFO YarnAllocator: Received 1000 containers from YARN, 
launching executors on 0 of them.


was (Author: mridulm80):
Thanks to [~tgraves] for helping investigate this.

There are multiple issues in the codebase - and not all of them have been fully 
understood.

a) For some reason, either YARN returns incorrect response to an allocate 
request or we are not setting the right param.
Note the snippet [1] to detail this.
(I cant share the logs unfortunately - but Tom has access to it and should be 
trivial for others to reproduce the issue).

b) For whatever reason (a) happens, we do not recover from it.
All subsequent requests heartbeat requests DO NOT contain pending allocation 
requests (and we have rejected/de-allocated whatever yarn just sent us due to 
(a)).

To elaborate; updateResourceRequests has missing == 0 since it is relying on 
getNumPendingAllocate() - which DOES NOT do the right thing in our context. 
Note: the 'ask' list in the super class was cleared as part of the previous 
allocate() call.


Essentially we were defending against these sort of corner cases in our code 
earlier - but the move to depend on AMRMClientImpl and the subsequent changes 
to it from under us has caused these problems for spark IMO. We should be more 
careful in future and only depend on interfaces and not implementation when it 
is relatively straight forward for us to own that aspect.


Fixing (a) will mask (b) - but IMO we should address it at the earliest too.




[1] Note the vCore in the response, and the subsequent ignoring of all 
containers.
15/02/27 01:40:30 INFO YarnAllocator: Will request 1000 executor containers, 
each with 8 cores and 38912 MB memory including 10240 MB overhead
15/02/27 01:40:30 INFO YarnAllocator: Container request (host: Any, capability: 
)
15/02/27 01:40:30 INFO ApplicationMaster: Started progress reporter thread - 
sleep time : 5000
15/02/27 01:40:30 DEBUG ApplicationMaster: Sending progress
15/02/27 01:40:30 INFO YarnAllocator: missing = 0, targetNumExecutors = 1000, 
numPendingAllocate = 1000, numExecutorsRunning = 0
15/02/27 01:40:35 DEBUG ApplicationMaster: Sending progress
15/02/27 01:40:35 INFO YarnAllocator: missing = 0, targetNumExecutors = 1000, 
numPendingAllocate = 1000, numExecutorsRunning = 0
15/02/27 01:40:36 DEBUG YarnAllocator: Allocated containers: 1000. Current 
executor count: 0. Cluster resources: .
15/02/27 01:40:36 DEBUG YarnAllocator: Releasing 1000 unneeded containers that 
were allocated to us
15/02/27 01:40:36 INFO YarnAllocator: Received 1000 containers from YARN, 
launching executors on 0 of them.

> Spark on YARN does not work when --executor-cores is specified
> --

[jira] [Created] (SPARK-6052) In JSON schema inference, we should always set containsNull of an ArrayType to true

2015-02-26 Thread Yin Huai (JIRA)
Yin Huai created SPARK-6052:
---

 Summary: In JSON schema inference, we should always set 
containsNull of an ArrayType to true
 Key: SPARK-6052
 URL: https://issues.apache.org/jira/browse/SPARK-6052
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Yin Huai
Priority: Blocker


We should not try to figure out whether an array contains null, because we may 
miss arrays with nulls if we sample the data, and future data may have nulls in the 
array.
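
As a small, hedged illustration of the point (the types are just an example, not taken from the JIRA):

{code}
import org.apache.spark.sql.types._

// Sampling a document like {"a": [1, 2]} might suggest a non-nullable element type,
// but a later row such as {"a": [1, null]} would break that assumption, so the
// inferred schema should always be:
val safeArrayType = ArrayType(LongType, containsNull = true)
{code}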






[jira] [Created] (SPARK-6051) Add an option for DirectKafkaInputDStream to commit the offsets into ZK

2015-02-26 Thread Saisai Shao (JIRA)
Saisai Shao created SPARK-6051:
--

 Summary: Add an option for DirectKafkaInputDStream to commit the 
offsets into ZK
 Key: SPARK-6051
 URL: https://issues.apache.org/jira/browse/SPARK-6051
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.3.0
Reporter: Saisai Shao


Currently in DirectKafkaInputDStream, offsets are managed by Spark Streaming 
itself without ZK or Kafka involved, which makes several third-party offset 
monitoring tools unable to monitor the status of the Kafka consumer. This adds an 
option to commit the offsets to ZK when each job finishes. The commit is 
implemented asynchronously, so the main processing flow will not be blocked; it has 
already been tested with the KafkaOffsetMonitor tool.
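
A hedged sketch of how the offsets can be picked up per batch with the existing direct-stream API (HasOffsetRanges and OffsetRange are the existing classes; writeToZk is a hypothetical helper standing in for the asynchronous ZK commit this issue proposes to build in):

{code}
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

def commitOffsets[T](directStream: DStream[T], writeToZk: OffsetRange => Unit): Unit = {
  directStream.foreachRDD { rdd =>
    // RDDs produced by the direct stream carry their Kafka offset ranges.
    val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
    // Once the batch's job has run, push each (topic, partition, untilOffset) to ZK
    // asynchronously so tools like KafkaOffsetMonitor can track consumer progress.
    ranges.foreach(writeToZk)
  }
}
{code}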






[jira] [Resolved] (SPARK-6024) When a data source table has too many columns, its schema cannot be stored in metastore.

2015-02-26 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-6024.

   Resolution: Fixed
Fix Version/s: 1.3.0
 Assignee: Yin Huai

> When a data source table has too many columns, its schema cannot be stored 
> in metastore.
> -
>
> Key: SPARK-6024
> URL: https://issues.apache.org/jira/browse/SPARK-6024
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
>Priority: Blocker
> Fix For: 1.3.0
>
>
> Because we are using table properties of a Hive metastore table to store the 
> schema, when a schema is too wide, we cannot persist it in metastore.
> {code}
> 15/02/25 18:13:50 ERROR metastore.RetryingHMSHandler: Retrying HMSHandler 
> after 1000 ms (attempt 1 of 1) with error: javax.jdo.JDODataStoreException: 
> Put request failed : INSERT INTO TABLE_PARAMS (PARAM_VALUE,TBL_ID,PARAM_KEY) 
> VALUES (?,?,?) 
>   at 
> org.datanucleus.api.jdo.NucleusJDOHelper.getJDOExceptionForNucleusException(NucleusJDOHelper.java:451)
>   at 
> org.datanucleus.api.jdo.JDOPersistenceManager.jdoMakePersistent(JDOPersistenceManager.java:732)
>   at 
> org.datanucleus.api.jdo.JDOPersistenceManager.makePersistent(JDOPersistenceManager.java:752)
>   at 
> org.apache.hadoop.hive.metastore.ObjectStore.createTable(ObjectStore.java:719)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.hadoop.hive.metastore.RawStoreProxy.invoke(RawStoreProxy.java:108)
>   at com.sun.proxy.$Proxy15.createTable(Unknown Source)
>   at 
> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.create_table_core(HiveMetaStore.java:1261)
>   at 
> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.create_table_with_environment_context(HiveMetaStore.java:1294)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:105)
>   at com.sun.proxy.$Proxy16.create_table_with_environment_context(Unknown 
> Source)
>   at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.createTable(HiveMetaStoreClient.java:558)
>   at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.createTable(HiveMetaStoreClient.java:547)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:89)
>   at com.sun.proxy.$Proxy17.createTable(Unknown Source)
>   at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:613)
>   at 
> org.apache.spark.sql.hive.HiveMetastoreCatalog.createDataSourceTable(HiveMetastoreCatalog.scala:136)
>   at 
> org.apache.spark.sql.hive.execution.CreateMetastoreDataSourceAsSelect.run(commands.scala:243)
>   at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:55)
>   at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:55)
>   at 
> org.apache.spark.sql.execution.ExecutedCommand.execute(commands.scala:65)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:1092)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:1092)
>   at org.apache.spark.sql.DataFrame.saveAsTable(DataFrame.scala:1013)
>   at org.apache.spark.sql.DataFrame.saveAsTable(DataFrame.scala:963)
>   at org.apache.spark.sql.DataFrame.saveAsTable(DataFrame.scala:929)
>   at org.apache.spark.sql.DataFrame.saveAsTable(DataFrame.scala:907)
>   at 
> $line39.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:25)
>   at 
> $line39.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:30)
>   at 
> $line39.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:32)
>   at $line39.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:34)
>   at $line39.$read$$iwC$$i

[jira] [Commented] (SPARK-5984) TimSort broken

2015-02-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339706#comment-14339706
 ] 

Apache Spark commented on SPARK-5984:
-

User 'hotou' has created a pull request for this issue:
https://github.com/apache/spark/pull/4804

> TimSort broken
> --
>
> Key: SPARK-5984
> URL: https://issues.apache.org/jira/browse/SPARK-5984
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0, 1.1.1, 1.2.0, 1.3.0, 1.2.1
>Reporter: Reynold Xin
>Assignee: Aaron Davidson
>Priority: Minor
>
> See 
> http://envisage-project.eu/proving-android-java-and-python-sorting-algorithm-is-broken-and-how-to-fix-it/
> Our TimSort is based on Android's TimSort, which is broken in some corner 
> cases. Marking this minor, as the problem exists for almost all TimSort 
> implementations out there, including Android, OpenJDK, and Python, and it hasn't 
> manifested itself in practice yet.






[jira] [Resolved] (SPARK-3664) Graduate GraphX from alpha to stable

2015-02-26 Thread Ankur Dave (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankur Dave resolved SPARK-3664.
---
   Resolution: Fixed
Fix Version/s: 1.2.0

> Graduate GraphX from alpha to stable
> 
>
> Key: SPARK-3664
> URL: https://issues.apache.org/jira/browse/SPARK-3664
> Project: Spark
>  Issue Type: Improvement
>  Components: GraphX
>Reporter: Ankur Dave
>Assignee: Ankur Dave
> Fix For: 1.2.0
>
>
> The GraphX API is officially marked as alpha but has been moving toward 
> stability. This ticket tracks what will be necessary to mark it a stable part 
> of Spark.






[jira] [Comment Edited] (SPARK-1015) Visualize the DAG of RDD

2015-02-26 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339690#comment-14339690
 ] 

Jeff Zhang edited comment on SPARK-1015 at 2/27/15 4:03 AM:


[~sowen] I may not have time for this recently. 
bq. How would the visualization work with spark-shell? Is this just a utility 
you can host outside Spark?
I would prefer to use Graphviz to visualize the RDD: Spark would just build the 
dot file and let Graphviz render it. Besides, I think 
integrating the DAG view into the Spark UI may be helpful for users when debugging RDDs 
(especially from a performance perspective).


was (Author: zjffdu):
[~sowen] I may not have time for this recently. 
bq. How would the visualization work with spark-shell? Is this just a utility 
you can host outside Spark?
I would prefer to use graphviz for visualize the RDD. And spark just build the 
dot file for graphviz and let the graphviz to visualize it. 

> Visualize the DAG of RDD 
> -
>
> Key: SPARK-1015
> URL: https://issues.apache.org/jira/browse/SPARK-1015
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 0.9.0
>Reporter: Jeff Zhang
>
> The DAG of an RDD can help users understand the data flow and how Spark gets the 
> final RDD executed.  It could help users find opportunities to optimize the 
> execution of some complex RDDs.  I will leverage Graphviz to visualize the 
> DAG. 
> For this task, I plan to split it into 2 steps.
> Step 1.  Just visualize the simple DAG graph.  Each RDD is one node, and 
> there will be one edge between a parent RDD and its child RDD. (I attach one 
> simple graph in the attachments.)
> Step 2.  Put RDDs in the same stage into one subgraph. This may require 
> extracting the stage-splitting related code in DAGScheduler.
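
For a sense of what step 1 could look like, here is a minimal standalone sketch (my own illustration, not the planned implementation) that walks an RDD's lineage through rdd.dependencies and emits Graphviz DOT text:

{code}
import org.apache.spark.rdd.RDD
import scala.collection.mutable

// One node per RDD, one edge from each parent RDD to its child; feed the output to `dot -Tpng`.
def toDot(root: RDD[_]): String = {
  val edges = mutable.LinkedHashSet.empty[String]
  def visit(r: RDD[_]): Unit =
    r.dependencies.foreach { dep =>
      val parent = dep.rdd
      edges += s"""  "${parent.id}: ${parent.getClass.getSimpleName}" -> "${r.id}: ${r.getClass.getSimpleName}";"""
      visit(parent)
    }
  visit(root)
  edges.mkString("digraph rdd {\n", "\n", "\n}")
}
{code}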






[jira] [Commented] (SPARK-1015) Visualize the DAG of RDD

2015-02-26 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339690#comment-14339690
 ] 

Jeff Zhang commented on SPARK-1015:
---

[~sowen] I may not have time for this recently. 
bq. How would the visualization work with spark-shell? Is this just a utility 
you can host outside Spark?
I would prefer to use Graphviz to visualize the RDD: Spark would just build the 
dot file and let Graphviz render it.

> Visualize the DAG of RDD 
> -
>
> Key: SPARK-1015
> URL: https://issues.apache.org/jira/browse/SPARK-1015
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 0.9.0
>Reporter: Jeff Zhang
>
> The DAG of an RDD can help users understand the data flow and how Spark gets the 
> final RDD executed.  It could help users find opportunities to optimize the 
> execution of some complex RDDs.  I will leverage Graphviz to visualize the 
> DAG. 
> For this task, I plan to split it into 2 steps.
> Step 1.  Just visualize the simple DAG graph.  Each RDD is one node, and 
> there will be one edge between a parent RDD and its child RDD. (I attach one 
> simple graph in the attachments.)
> Step 2.  Put RDDs in the same stage into one subgraph. This may require 
> extracting the stage-splitting related code in DAGScheduler.






[jira] [Comment Edited] (SPARK-6050) Spark on YARN does not work when --executor-cores is specified

2015-02-26 Thread Mridul Muralidharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339686#comment-14339686
 ] 

Mridul Muralidharan edited comment on SPARK-6050 at 2/27/15 3:50 AM:
-

Thanks to [~tgraves] for helping investigate this.

There are multiple issues in the codebase - and not all of them have been fully 
understood.

a) For some reason, either YARN returns an incorrect response to an allocate 
request or we are not setting the right param.
See snippet [1] for details.
(I can't share the logs, unfortunately - but Tom has access to them and it should be 
trivial for others to reproduce the issue.)

b) Whatever the reason (a) happens, we do not recover from it.
All subsequent heartbeat requests DO NOT contain pending allocation 
requests (and we have rejected/de-allocated whatever YARN just sent us due to 
(a)).

To elaborate: updateResourceRequests has missing == 0 since it relies on 
getNumPendingAllocate() - which DOES NOT do the right thing in our context. 
Note: the 'ask' list in the superclass was cleared as part of the previous 
allocate() call.


Essentially we were defending against these sorts of corner cases in our code 
earlier - but the move to depend on AMRMClientImpl, and the subsequent changes 
to it from under us, has caused these problems for Spark, IMO. We should be more 
careful in the future and depend only on interfaces, not implementations, when it 
is relatively straightforward for us to own that aspect.


Fixing (a) will mask (b) - but IMO we should address it as soon as possible too.




[1] Note the vCore in the response, and the subsequent ignoring of all 
containers.
15/02/27 01:40:30 INFO YarnAllocator: Will request 1000 executor containers, 
each with 8 cores and 38912 MB memory including 10240 MB overhead
15/02/27 01:40:30 INFO YarnAllocator: Container request (host: Any, capability: 
)
15/02/27 01:40:30 INFO ApplicationMaster: Started progress reporter thread - 
sleep time : 5000
15/02/27 01:40:30 DEBUG ApplicationMaster: Sending progress
15/02/27 01:40:30 INFO YarnAllocator: missing = 0, targetNumExecutors = 1000, 
numPendingAllocate = 1000, numExecutorsRunning = 0
15/02/27 01:40:35 DEBUG ApplicationMaster: Sending progress
15/02/27 01:40:35 INFO YarnAllocator: missing = 0, targetNumExecutors = 1000, 
numPendingAllocate = 1000, numExecutorsRunning = 0
15/02/27 01:40:36 DEBUG YarnAllocator: Allocated containers: 1000. Current 
executor count: 0. Cluster resources: .
15/02/27 01:40:36 DEBUG YarnAllocator: Releasing 1000 unneeded containers that 
were allocated to us
15/02/27 01:40:36 INFO YarnAllocator: Received 1000 containers from YARN, 
launching executors on 0 of them.


was (Author: mridulm80):

Thanks to [~tgraves] for helping investigate this.

There are multiple issues in the codebase - and not all of them have been fully 
understood.

a) For some reason, either YARN returns incorrect response to an allocate 
request or we are not setting the right param.
Note the snippet [1] to detail this.
(I cant share the logs unfortunately - but Tom has access to it and should be 
trivial for others to reproduce the issue).

b) For whatever reason (a) happens, we do not recover from it.
All subsequent requests heartbeat requests DO NOT contain pending allocation 
requests (and we have rejected/de-allocated whatever yarn just sent us due to 
(a)).

To elaborate; updateResourceRequests has missing == 0 since it is relying on 
getNumPendingAllocate() - which DOES NOT do the right thing in our context. 
Note: the 'ask' list in the super class was cleared as part of the previous 
allocate() call.


Essentially we were defending against these sort of corner cases in our code 
earlier - but the move to depend on AMRMClientImpl and the subsequent changes 
to it from under us has caused these problems for spark.


Fixing (a) will mask (b) - but IMO we should address it at the earliest too.




[1] Not the vCore in the response, and the subsequent ignoring of all 
containers.
15/02/27 01:40:30 INFO YarnAllocator: Will request 1000 executor containers, 
each with 8 cores and 38912 MB memory including 10240 MB overhead
15/02/27 01:40:30 INFO YarnAllocator: Container request (host: Any, capability: 
)
15/02/27 01:40:30 INFO ApplicationMaster: Started progress reporter thread - 
sleep time : 5000
15/02/27 01:40:30 DEBUG ApplicationMaster: Sending progress
15/02/27 01:40:30 INFO YarnAllocator: missing = 0, targetNumExecutors = 1000, 
numPendingAllocate = 1000, numExecutorsRunning = 0
15/02/27 01:40:35 DEBUG ApplicationMaster: Sending progress
15/02/27 01:40:35 INFO YarnAllocator: missing = 0, targetNumExecutors = 1000, 
numPendingAllocate = 1000, numExecutorsRunning = 0
15/02/27 01:40:36 DEBUG YarnAllocator: Allocated containers: 1000. Current 
executor count: 0. Cluster resources: .
15/02/27 01:40:36 DEBUG YarnAllocator: Releasing 1000 unneeded container

[jira] [Commented] (SPARK-6050) Spark on YARN does not work when --executor-cores is specified

2015-02-26 Thread Mridul Muralidharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339686#comment-14339686
 ] 

Mridul Muralidharan commented on SPARK-6050:



Thanks to [~tgraves] for helping investigate this.

There are multiple issues in the codebase - and not all of them have been fully 
understood.

a) For some reason, either YARN returns an incorrect response to an allocate 
request or we are not setting the right param.
See snippet [1] for details.
(I can't share the logs, unfortunately - but Tom has access to them and it should be 
trivial for others to reproduce the issue.)

b) Whatever the reason (a) happens, we do not recover from it.
All subsequent heartbeat requests DO NOT contain pending allocation 
requests (and we have rejected/de-allocated whatever YARN just sent us due to 
(a)).

To elaborate: updateResourceRequests has missing == 0 since it relies on 
getNumPendingAllocate() - which DOES NOT do the right thing in our context. 
Note: the 'ask' list in the superclass was cleared as part of the previous 
allocate() call.


Essentially we were defending against these sorts of corner cases in our code 
earlier - but the move to depend on AMRMClientImpl and the subsequent changes 
to it from under us has caused these problems for Spark.


Fixing (a) will mask (b) - but IMO we should address it as soon as possible too.




[1] Note the vCore in the response, and the subsequent ignoring of all 
containers.
15/02/27 01:40:30 INFO YarnAllocator: Will request 1000 executor containers, 
each with 8 cores and 38912 MB memory including 10240 MB overhead
15/02/27 01:40:30 INFO YarnAllocator: Container request (host: Any, capability: 
)
15/02/27 01:40:30 INFO ApplicationMaster: Started progress reporter thread - 
sleep time : 5000
15/02/27 01:40:30 DEBUG ApplicationMaster: Sending progress
15/02/27 01:40:30 INFO YarnAllocator: missing = 0, targetNumExecutors = 1000, 
numPendingAllocate = 1000, numExecutorsRunning = 0
15/02/27 01:40:35 DEBUG ApplicationMaster: Sending progress
15/02/27 01:40:35 INFO YarnAllocator: missing = 0, targetNumExecutors = 1000, 
numPendingAllocate = 1000, numExecutorsRunning = 0
15/02/27 01:40:36 DEBUG YarnAllocator: Allocated containers: 1000. Current 
executor count: 0. Cluster resources: .
15/02/27 01:40:36 DEBUG YarnAllocator: Releasing 1000 unneeded containers that 
were allocated to us
15/02/27 01:40:36 INFO YarnAllocator: Received 1000 containers from YARN, 
launching executors on 0 of them.

> Spark on YARN does not work when --executor-cores is specified
> -
>
> Key: SPARK-6050
> URL: https://issues.apache.org/jira/browse/SPARK-6050
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.3.0
> Environment: 2.5 based YARN cluster.
>Reporter: Mridul Muralidharan
>Priority: Blocker
>
> There are multiple issues here (which I will detail as comments), but to 
> reproduce: running the following ALWAYS hangs in our cluster with the 1.3 RC
> ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster \
>   --executor-cores 8 --num-executors 15 --driver-memory 4g --executor-memory 2g \
>   --queue webmap lib/spark-examples*.jar 10






[jira] [Commented] (SPARK-6033) the description about "spark.worker.cleanup.enabled" does not match the code

2015-02-26 Thread pengxu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339684#comment-14339684
 ] 

pengxu commented on SPARK-6033:
---

I've already made a PR. Could you help review it? Thanks.

> the description about "spark.worker.cleanup.enabled" does not match 
> the code
> 
>
> Key: SPARK-6033
> URL: https://issues.apache.org/jira/browse/SPARK-6033
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 1.2.0, 1.2.1
>Reporter: pengxu
>Priority: Minor
>
> There is an error in the section _Cluster Launch Scripts_ of 
> http://spark.apache.org/docs/latest/spark-standalone.html
> In the description of the property spark.worker.cleanup.enabled, it states 
> that *all the directories* under the work dir will be removed whether the 
> application is running or not.
> After checking the implementation at the code level, I found that +only the 
> stopped applications'+ dirs are removed. So the description in the 
> document is incorrect.
> the code implementation in worker.scala
> {code: title=WorkDirCleanup}
> case WorkDirCleanup =>
>   // Spin up a separate thread (in a future) to do the dir cleanup; don't 
> tie up worker actor
>   val cleanupFuture = concurrent.future {
> val appDirs = workDir.listFiles()
> if (appDirs == null) {
>   throw new IOException("ERROR: Failed to list files in " + appDirs)
> }
> appDirs.filter { dir =>
>   // the directory is used by an application - check that the 
> application is not running
>   // when cleaning up
>   val appIdFromDir = dir.getName
>   val isAppStillRunning = 
> executors.values.map(_.appId).contains(appIdFromDir)
>   dir.isDirectory && !isAppStillRunning &&
>   !Utils.doesDirectoryContainAnyNewFiles(dir, APP_DATA_RETENTION_SECS)
> }.foreach { dir => 
>   logInfo(s"Removing directory: ${dir.getPath}")
>   Utils.deleteRecursively(dir)
> }
>   }
>   cleanupFuture onFailure {
> case e: Throwable =>
>   logError("App dir cleanup failed: " + e.getMessage, e)
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6033) the description about "spark.worker.cleanup.enabled" does not match the code

2015-02-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339683#comment-14339683
 ] 

Apache Spark commented on SPARK-6033:
-

User 'hseagle' has created a pull request for this issue:
https://github.com/apache/spark/pull/4803

> the description about "spark.worker.cleanup.enabled" does not match the code
> 
>
> Key: SPARK-6033
> URL: https://issues.apache.org/jira/browse/SPARK-6033
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 1.2.0, 1.2.1
>Reporter: pengxu
>Priority: Minor
>
> There is an error in the section _Cluster Launch Scripts_ of 
> http://spark.apache.org/docs/latest/spark-standalone.html
> In the description of the property spark.worker.cleanup.enabled, it states 
> that *all the directories* under the work dir will be removed whether the 
> application is running or not.
> After checking the implementation at the code level, I found that +only the 
> stopped applications'+ dirs are removed. So the description in the 
> document is incorrect.
> The code implementation in worker.scala:
> {code: title=WorkDirCleanup}
> case WorkDirCleanup =>
>   // Spin up a separate thread (in a future) to do the dir cleanup; don't 
> tie up worker actor
>   val cleanupFuture = concurrent.future {
> val appDirs = workDir.listFiles()
> if (appDirs == null) {
>   throw new IOException("ERROR: Failed to list files in " + appDirs)
> }
> appDirs.filter { dir =>
>   // the directory is used by an application - check that the 
> application is not running
>   // when cleaning up
>   val appIdFromDir = dir.getName
>   val isAppStillRunning = 
> executors.values.map(_.appId).contains(appIdFromDir)
>   dir.isDirectory && !isAppStillRunning &&
>   !Utils.doesDirectoryContainAnyNewFiles(dir, APP_DATA_RETENTION_SECS)
> }.foreach { dir => 
>   logInfo(s"Removing directory: ${dir.getPath}")
>   Utils.deleteRecursively(dir)
> }
>   }
>   cleanupFuture onFailure {
> case e: Throwable =>
>   logError("App dir cleanup failed: " + e.getMessage, e)
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6050) Spark on YARN does not work when --executor-cores is specified

2015-02-26 Thread Mridul Muralidharan (JIRA)
Mridul Muralidharan created SPARK-6050:
--

 Summary: Spark on YARN does not work when --executor-cores is specified
 Key: SPARK-6050
 URL: https://issues.apache.org/jira/browse/SPARK-6050
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.3.0
 Environment: 
2.5 based YARN cluster.
Reporter: Mridul Muralidharan
Priority: Blocker



There are multiple issues here (which I will detail in comments), but to 
reproduce: running the following ALWAYS hangs in our cluster with the 1.3 RC.

./bin/spark-submit --class org.apache.spark.examples.SparkPi --master 
yarn-cluster --executor-cores 8 --num-executors 15 --driver-memory 4g 
--executor-memory 2g --queue webmap lib/spark-examples*.jar 10



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6033) the description about "spark.worker.cleanup.enabled" does not match the code

2015-02-26 Thread pengxu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339667#comment-14339667
 ] 

pengxu commented on SPARK-6033:
---

Ok, I'll do it.

> the description about "spark.worker.cleanup.enabled" does not match the code
> 
>
> Key: SPARK-6033
> URL: https://issues.apache.org/jira/browse/SPARK-6033
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 1.2.0, 1.2.1
>Reporter: pengxu
>Priority: Minor
>
> There is an error in the section _Cluster Launch Scripts_ of 
> http://spark.apache.org/docs/latest/spark-standalone.html
> In the description of the property spark.worker.cleanup.enabled, it states 
> that *all the directories* under the work dir will be removed whether the 
> application is running or not.
> After checking the implementation at the code level, I found that +only the 
> stopped applications'+ dirs are removed. So the description in the 
> document is incorrect.
> The code implementation in worker.scala:
> {code: title=WorkDirCleanup}
> case WorkDirCleanup =>
>   // Spin up a separate thread (in a future) to do the dir cleanup; don't 
> tie up worker actor
>   val cleanupFuture = concurrent.future {
> val appDirs = workDir.listFiles()
> if (appDirs == null) {
>   throw new IOException("ERROR: Failed to list files in " + appDirs)
> }
> appDirs.filter { dir =>
>   // the directory is used by an application - check that the 
> application is not running
>   // when cleaning up
>   val appIdFromDir = dir.getName
>   val isAppStillRunning = 
> executors.values.map(_.appId).contains(appIdFromDir)
>   dir.isDirectory && !isAppStillRunning &&
>   !Utils.doesDirectoryContainAnyNewFiles(dir, APP_DATA_RETENTION_SECS)
> }.foreach { dir => 
>   logInfo(s"Removing directory: ${dir.getPath}")
>   Utils.deleteRecursively(dir)
> }
>   }
>   cleanupFuture onFailure {
> case e: Throwable =>
>   logError("App dir cleanup failed: " + e.getMessage, e)
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6037) Avoiding duplicate Parquet schema merging

2015-02-26 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-6037.
---
   Resolution: Fixed
Fix Version/s: 1.3.0

Issue resolved by pull request 4786
[https://github.com/apache/spark/pull/4786]

> Avoiding duplicate Parquet schema merging
> -
>
> Key: SPARK-6037
> URL: https://issues.apache.org/jira/browse/SPARK-6037
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
>Priority: Minor
> Fix For: 1.3.0
>
>
> FilteringParquetRowInputFormat manually merges Parquet schemas before 
> computing splits. However, this is duplicated work because the schemas are 
> already merged in ParquetRelation2. We don't need to re-merge them in the InputFormat.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6037) Avoiding duplicate Parquet schema merging

2015-02-26 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-6037:
--
Assignee: Liang-Chi Hsieh

> Avoiding duplicate Parquet schema merging
> -
>
> Key: SPARK-6037
> URL: https://issues.apache.org/jira/browse/SPARK-6037
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
>Priority: Minor
>
> FilteringParquetRowInputFormat manually merges Parquet schemas before 
> computing splits. However, this is duplicated work because the schemas are 
> already merged in ParquetRelation2. We don't need to re-merge them in the InputFormat.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5710) Combines two adjacent `Cast` expressions into one

2015-02-26 Thread guowei (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339649#comment-14339649
 ] 

guowei commented on SPARK-5710:
---

How about limiting the merge to adjacent casts that were only added by 
`typeCoercionRules`?
We could add a label to `Cast` to mark the ones added by 
`typeCoercionRules`.
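
For illustration only, a minimal sketch (a toy expression tree, not Catalyst code; 
the names and the label are made up) of collapsing adjacent casts while only 
touching the ones marked as coming from type coercion:
{code}
sealed trait Expr
case class Literal(value: Any) extends Expr
// fromCoercion is the hypothetical label marking casts added by typeCoercionRules.
case class Cast(child: Expr, dataType: String, fromCoercion: Boolean = false) extends Expr

def collapseCasts(e: Expr): Expr = e match {
  case Cast(child, t, f) =>
    collapseCasts(child) match {
      // Merge only when the inner cast was added by type coercion.
      case Cast(grandChild, _, true) => Cast(grandChild, t, f)
      case simplified                => Cast(simplified, t, f)
    }
  case other => other
}

// Example: two stacked coercion casts collapse into the outermost one.
val expr = Cast(Cast(Literal(1.5), "DecimalType(21,1)", fromCoercion = true), "DecimalType()", fromCoercion = true)
collapseCasts(expr)  // Cast(Literal(1.5), "DecimalType()", true)
{code}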

> Combines two adjacent `Cast` expressions into one
> -
>
> Key: SPARK-5710
> URL: https://issues.apache.org/jira/browse/SPARK-5710
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.2.1
>Reporter: guowei
>Priority: Minor
>
> A plan produced by the `analyzer` with `typeCoercionRules` may contain many `cast` 
> expressions. We can combine the adjacent ones.
> For example:
> create table test(a decimal(3,1));
> explain select * from test where a*2-1>1;
> == Physical Plan ==
> Filter (CAST(CAST((CAST(CAST((CAST(a#5, DecimalType()) * 2), 
> DecimalType(21,1)), DecimalType()) - 1), DecimalType(22,1)), DecimalType()) > 
> 1)
>  HiveTableScan [a#5], (MetastoreRelation default, test, None), None



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-5529) BlockManager heartbeat expiration does not kill executor

2015-02-26 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-5529.

  Resolution: Fixed
   Fix Version/s: 1.4.0
Target Version/s: 1.4.0

> BlockManager heartbeat expiration does not kill executor
> 
>
> Key: SPARK-5529
> URL: https://issues.apache.org/jira/browse/SPARK-5529
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, YARN
>Affects Versions: 1.2.0
>Reporter: Hong Shen
>Assignee: Hong Shen
> Fix For: 1.4.0
>
> Attachments: SPARK-5529.patch
>
>
> When I run a Spark job, one executor hangs; after 120s its BlockManager is 
> removed by the driver, but it takes about half an hour before the executor is 
> removed by the driver. Here is the log:
> {code}
> 15/02/02 14:58:43 WARN BlockManagerMasterActor: Removing BlockManager 
> BlockManagerId(1, 10.215.143.14, 47234) with no recent heart beats: 147198ms 
> exceeds 12ms
> 
> 15/02/02 15:26:55 ERROR YarnClientClusterScheduler: Lost executor 1 on 
> 10.215.143.14: remote Akka client disassociated
> 15/02/02 15:26:55 WARN ReliableDeliverySupervisor: Association with remote 
> system [akka.tcp://sparkExecutor@10.215.143.14:46182] has failed, address is 
> now gated for [5000] ms. Reason is: [Disassociated].
> 15/02/02 15:26:55 INFO TaskSetManager: Re-queueing tasks for 1 from TaskSet 
> 0.0
> 15/02/02 15:26:55 WARN TaskSetManager: Lost task 3.0 in stage 0.0 (TID 3, 
> 10.215.143.14): ExecutorLostFailure (executor 1 lost)
> 15/02/02 15:26:55 ERROR YarnClientSchedulerBackend: Asked to remove 
> non-existent executor 1
> 15/02/02 15:26:55 INFO DAGScheduler: Executor lost: 1 (epoch 0)
> 15/02/02 15:26:55 INFO BlockManagerMasterActor: Trying to remove executor 1 
> from BlockManagerMaster.
> 15/02/02 15:26:55 INFO BlockManagerMaster: Removed 1 successfully in 
> removeExecutor
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-6048) SparkConf.translateConfKey should translate on get, not set

2015-02-26 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339629#comment-14339629
 ] 

Patrick Wendell edited comment on SPARK-6048 at 2/27/15 2:33 AM:
-

Hey All,

No opinions on which design we chose to implement internally. However, I do 
feel strongly that the user-facing precedence should not change between 
versions. It's not reasonable to assume that no user has both the old and new 
names for a config value. Configuration files can be very long, or there can be 
multiple sources of configuration, such as a user using both flags and a file. 
So changing the semantics randomly in a release constitutes a breaking change 
of behavior.

In terms of the nicest possible way to achieve these semantics, that's up to 
you guys since you're much more familiar with this code. The current patch 
seems to just rewind the behavior that was introduced earlier. Marcello, unless 
you see some correctness problem with that patch, I'd like to merge it to 
unblock the release. If you guys think it's way better to do translation on 
writes than reads, it's fine to propose that in a new patch.



was (Author: pwendell):
Hey All,

No options on which design we chose to implement internally. However, I do feel 
strongly that the user-facing precedence should not change between versions. 
It's not reasonable to assume that no user has both the old and new names for a 
config value. Configuration files can be very long, or there can be multiple 
sources of configuration, such as a user using both flags and a file. So 
changing the semantics randomly in a release constitutes a breaking change of 
behavior.

In terms of the nicest possible way to achieve these semantics, that's up to 
you guys since you're much more familiar with this code. The current patch 
seems to just rewind the behavior that was introduced earlier. Marcello, unless 
you see some correctness problem with that patch, I'd like to merge it to 
unblock the release. If you guys think it's way better to do translation on 
writes than reads, it's fine to propose that in a new patch.


> SparkConf.translateConfKey should translate on get, not set
> ---
>
> Key: SPARK-6048
> URL: https://issues.apache.org/jira/browse/SPARK-6048
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>Priority: Blocker
>
> There are several issues with translating on set.
> (1) The most serious one is that if the user has both the deprecated and the 
> latest version of the same config set, then the value picked up by SparkConf 
> will be arbitrary. Why? Because during initialization of the conf we call 
> `conf.set` on each property in `sys.props` in an order arbitrarily defined by 
> Java. As a result, the value of the more recent config may be overridden by 
> that of the deprecated one. Instead, we should always use the value of the 
> most recent config.
> (2) If we translate on set, then we must keep translating everywhere else. In 
> fact, the current code does not translate on remove, which means the 
> following won't work if X is deprecated:
> {code}
> conf.set(X, Y)
> conf.remove(X) // X is not in the conf
> {code}
> This requires us to also translate in remove and other places, as we already 
> do for contains, leading to more duplicate code.
> (3) Since we call `conf.set` on all configs when initializing the conf, we 
> print all deprecation warnings in the beginning. Elsewhere in Spark, however, 
> we warn the user when the deprecated config / option / env var is actually 
> being used.
> We should keep this consistent so the user won't expect to find all 
> deprecation messages in the beginning of his logs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6048) SparkConf.translateConfKey should translate on get, not set

2015-02-26 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339629#comment-14339629
 ] 

Patrick Wendell commented on SPARK-6048:


Hey All,

No options on which design we chose to implement internally. However, I do feel 
strongly that the user-facing precedence should not change between versions. 
It's not reasonable to assume that no user has both the old and new names for a 
config value. Configuration files can be very long, or there can be multiple 
sources of configuration, such as a user using both flags and a file. So 
changing the semantics randomly in a release constitutes a breaking change of 
behavior.

In terms of the nicest possible way to achieve these semantics, that's up to 
you guys since you're much more familiar with this code. The current patch 
seems to just rewind the behavior that was introduced earlier. Marcello, unless 
you see some correctness problem with that patch, I'd like to merge it to 
unblock the release. If you guys think it's way better to do translation on 
writes than reads, it's fine to propose that in a new patch.


> SparkConf.translateConfKey should translate on get, not set
> ---
>
> Key: SPARK-6048
> URL: https://issues.apache.org/jira/browse/SPARK-6048
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>Priority: Blocker
>
> There are several issues with translating on set.
> (1) The most serious one is that if the user has both the deprecated and the 
> latest version of the same config set, then the value picked up by SparkConf 
> will be arbitrary. Why? Because during initialization of the conf we call 
> `conf.set` on each property in `sys.props` in an order arbitrarily defined by 
> Java. As a result, the value of the more recent config may be overridden by 
> that of the deprecated one. Instead, we should always use the value of the 
> most recent config.
> (2) If we translate on set, then we must keep translating everywhere else. In 
> fact, the current code does not translate on remove, which means the 
> following won't work if X is deprecated:
> {code}
> conf.set(X, Y)
> conf.remove(X) // X is not in the conf
> {code}
> This requires us to also translate in remove and other places, as we already 
> do for contains, leading to more duplicate code.
> (3) Since we call `conf.set` on all configs when initializing the conf, we 
> print all deprecation warnings in the beginning. Elsewhere in Spark, however, 
> we warn the user when the deprecated config / option / env var is actually 
> being used.
> We should keep this consistent so the user won't expect to find all 
> deprecation messages in the beginning of his logs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6048) SparkConf.translateConfKey should translate on get, not set

2015-02-26 Thread Andrew Or (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339624#comment-14339624
 ] 

Andrew Or commented on SPARK-6048:
--

bq. What do you mean by "duplicates the translation" (regarding 2)? It's just a 
call to "translateKey()".

It's not about the number of lines that are being duplicated. It's about the 
translation logic. Right now it's not correct to translate in set but not in 
all interfaces exposed by SparkConf. As we have seen with the case of `remove` 
it's easy to miss one or two of these interfaces. If we only translate in `get` 
then we don't have to worry about this.

bq. Regarding 1, that problem exists regardless of my change.

That's actually not true. Before your change, if we specify both the deprecated 
config and the most recent one, the behavior will be determined by the place 
where these values are used. Even if we called `set` on the deprecated config 
over the more recent one, the value of the latter is still preserved because we 
didn't translate on `set`. To answer your question, the expected behavior is 
for the value of the more recent config to *always* take precedence.

bq. Note the goal of the deprecated configs was to make the Spark code only 
have to care about the most recent key name. Your proposal goes against that, 
and would require the deprecated names to live both in SparkConf and in the 
code that needs to read them.

Yes, unfortunately, and I agree it's something we need to fix in the future. My 
eventual goal is to hide all the deprecation logic throughout the Spark 
code, and this is why I filed SPARK-5933 before. Currently, however, this is a 
correctness issue that is blocking the 1.3 release, so my personal opinion is 
that we should first fix this broken behavior and worry about the code style 
later.
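
To make the translate-on-get idea concrete, here is a minimal sketch (hypothetical 
key names, not the real SparkConf internals): the deprecated alias is only 
consulted when a value is read, so `set` and `remove` need no special handling and 
the most recent key always wins.
{code}
class MiniConf {
  private val settings = scala.collection.mutable.HashMap[String, String]()

  // Maps deprecated key -> current key; consulted only on reads.
  private val deprecated = Map("spark.old.key" -> "spark.new.key")

  def set(key: String, value: String): Unit = settings(key) = value

  def remove(key: String): Unit = settings -= key  // no translation required here

  def get(key: String): Option[String] =
    // Prefer the value stored under the most recent name, then fall back to an alias.
    settings.get(key).orElse {
      deprecated.collectFirst {
        case (oldKey, newKey) if newKey == key && settings.contains(oldKey) => settings(oldKey)
      }
    }
}

val conf = new MiniConf
conf.set("spark.old.key", "a")
conf.set("spark.new.key", "b")
conf.get("spark.new.key")     // Some("b"): the most recent key takes precedence
conf.remove("spark.old.key")  // works without any deprecation logic
{code}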

> SparkConf.translateConfKey should translate on get, not set
> ---
>
> Key: SPARK-6048
> URL: https://issues.apache.org/jira/browse/SPARK-6048
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>Priority: Blocker
>
> There are several issues with translating on set.
> (1) The most serious one is that if the user has both the deprecated and the 
> latest version of the same config set, then the value picked up by SparkConf 
> will be arbitrary. Why? Because during initialization of the conf we call 
> `conf.set` on each property in `sys.props` in an order arbitrarily defined by 
> Java. As a result, the value of the more recent config may be overridden by 
> that of the deprecated one. Instead, we should always use the value of the 
> most recent config.
> (2) If we translate on set, then we must keep translating everywhere else. In 
> fact, the current code does not translate on remove, which means the 
> following won't work if X is deprecated:
> {code}
> conf.set(X, Y)
> conf.remove(X) // X is not in the conf
> {code}
> This requires us to also translate in remove and other places, as we already 
> do for contains, leading to more duplicate code.
> (3) Since we call `conf.set` on all configs when initializing the conf, we 
> print all deprecation warnings in the beginning. Elsewhere in Spark, however, 
> we warn the user when the deprecated config / option / env var is actually 
> being used.
> We should keep this consistent so the user won't expect to find all 
> deprecation messages in the beginning of his logs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5979) `--packages` should not exclude spark streaming assembly jars for kafka and flume

2015-02-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339620#comment-14339620
 ] 

Apache Spark commented on SPARK-5979:
-

User 'brkyvz' has created a pull request for this issue:
https://github.com/apache/spark/pull/4802

> `--packages` should not exclude spark streaming assembly jars for kafka and 
> flume 
> --
>
> Key: SPARK-5979
> URL: https://issues.apache.org/jira/browse/SPARK-5979
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Spark Submit
>Affects Versions: 1.3.0
>Reporter: Burak Yavuz
>Priority: Blocker
>
> Currently `--packages` has an exclude rule for all dependencies with the 
> groupId `org.apache.spark` assuming that these are packaged inside the 
> spark-assembly jar. This is not the case, and more fine-grained filtering is 
> required.
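
As a sketch of what finer-grained filtering could look like (illustrative only; 
the coordinates and the assembly list below are made-up examples, not the actual 
rule set), the check would exclude only the artifacts that really live inside the 
Spark assembly:
{code}
case class Coordinate(groupId: String, artifactId: String, version: String)

// Hypothetical data: only these artifacts are assumed to ship in the assembly.
val inAssembly = Set("spark-core_2.10", "spark-sql_2.10", "spark-streaming_2.10")

def shouldExclude(c: Coordinate): Boolean =
  c.groupId == "org.apache.spark" && inAssembly.contains(c.artifactId)

shouldExclude(Coordinate("org.apache.spark", "spark-streaming-kafka_2.10", "1.3.0"))  // false: keep it
shouldExclude(Coordinate("org.apache.spark", "spark-core_2.10", "1.3.0"))             // true: already in the assembly
{code}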



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6032) Move ivy logging to System.err in --packages

2015-02-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339621#comment-14339621
 ] 

Apache Spark commented on SPARK-6032:
-

User 'brkyvz' has created a pull request for this issue:
https://github.com/apache/spark/pull/4802

> Move ivy logging to System.err in --packages
> 
>
> Key: SPARK-6032
> URL: https://issues.apache.org/jira/browse/SPARK-6032
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit
>Affects Versions: 1.3.0
>Reporter: Burak Yavuz
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6049) HiveThriftServer2 may expose Inheritable methods

2015-02-26 Thread Littlestar (JIRA)
Littlestar created SPARK-6049:
-

 Summary: HiveThriftServer2 may expose Inheritable methods
 Key: SPARK-6049
 URL: https://issues.apache.org/jira/browse/SPARK-6049
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.2.1
Reporter: Littlestar
Priority: Minor


Could HiveThriftServer2 expose inheritable methods?
HiveThriftServer2 is very good when used as a JDBC server, but 
HiveThriftServer2.scala is not inheritable or invokable by an application.

My app uses JavaSQLContext and registerTempTable.
I want to expose these temp tables through HiveThriftServer2 (the JDBC server).

Thanks.
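
A minimal sketch of the kind of hook being asked for (the start-with-context call 
is hypothetical here, not an API HiveThriftServer2 exposed at the time): build a 
context in the application, register temp tables, and hand that context to the 
JDBC server so the tables become visible to JDBC clients.
{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object ExposeTempTables {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("temp-table-jdbc"))
    val hiveContext = new HiveContext(sc)

    // Register an application-defined temp table.
    hiveContext.jsonFile("events.json").registerTempTable("events")

    // Hypothetical entry point the issue is asking for:
    // HiveThriftServer2.startWithContext(hiveContext)
  }
}
{code}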




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5537) Expand user guide for multinomial logistic regression

2015-02-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339602#comment-14339602
 ] 

Apache Spark commented on SPARK-5537:
-

User 'dbtsai' has created a pull request for this issue:
https://github.com/apache/spark/pull/4801

> Expand user guide for multinomial logistic regression
> -
>
> Key: SPARK-5537
> URL: https://issues.apache.org/jira/browse/SPARK-5537
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, MLlib
>Reporter: Xiangrui Meng
>Assignee: DB Tsai
>
> We probably don't need to work out the math in the user guide. We can point 
> users to Wikipedia for details and focus on the public APIs and how to use them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6048) SparkConf.translateConfKey should translate on get, not set

2015-02-26 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339591#comment-14339591
 ] 

Marcelo Vanzin commented on SPARK-6048:
---

What do you mean by "duplicates the translation" (regarding 2)? It's just a 
call to "translateKey()".

Regarding 1, that problem exists regardless of my change. You need to specify 
some precedence order. See the case in FsHistoryProvider, where there are 2 (!) 
deprecated keys for the same config. What if the user sets those two deprecated 
keys in the conf? What's the expectation? Perhaps you need to enforce some sort 
of ordering for the deprecated keys in SparkConf, but I don't see why that 
means translating on get and not on set.

Note the goal of the deprecated configs was to make the Spark code only have to 
care about the most recent key name. Your proposal goes against that, and would 
require the deprecated names to live both in SparkConf and in the code that 
needs to read them.

> SparkConf.translateConfKey should translate on get, not set
> ---
>
> Key: SPARK-6048
> URL: https://issues.apache.org/jira/browse/SPARK-6048
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>Priority: Blocker
>
> There are several issues with translating on set.
> (1) The most serious one is that if the user has both the deprecated and the 
> latest version of the same config set, then the value picked up by SparkConf 
> will be arbitrary. Why? Because during initialization of the conf we call 
> `conf.set` on each property in `sys.props` in an order arbitrarily defined by 
> Java. As a result, the value of the more recent config may be overridden by 
> that of the deprecated one. Instead, we should always use the value of the 
> most recent config.
> (2) If we translate on set, then we must keep translating everywhere else. In 
> fact, the current code does not translate on remove, which means the 
> following won't work if X is deprecated:
> {code}
> conf.set(X, Y)
> conf.remove(X) // X is not in the conf
> {code}
> This requires us to also translate in remove and other places, as we already 
> do for contains, leading to more duplicate code.
> (3) Since we call `conf.set` on all configs when initializing the conf, we 
> print all deprecation warnings in the beginning. Elsewhere in Spark, however, 
> we warn the user when the deprecated config / option / env var is actually 
> being used.
> We should keep this consistent so the user won't expect to find all 
> deprecation messages in the beginning of his logs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5771) Number of Cores in Completed Applications of Standalone Master Web Page is always 0 if sc.stop() is called

2015-02-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339590#comment-14339590
 ] 

Apache Spark commented on SPARK-5771:
-

User 'jerryshao' has created a pull request for this issue:
https://github.com/apache/spark/pull/4800

> Number of Cores in Completed Applications of Standalone Master Web Page 
> is always 0 if sc.stop() is called
> --
>
> Key: SPARK-5771
> URL: https://issues.apache.org/jira/browse/SPARK-5771
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.2.1
>Reporter: Liangliang Gu
>Assignee: Liangliang Gu
>Priority: Minor
> Fix For: 1.4.0
>
>
> In standalone mode, the number of cores shown under Completed Applications on 
> the Master web page will always be zero if sc.stop() is called, 
> but the number is correct if sc.stop() is not called.
> The reason may be: 
> after sc.stop() is called, removeExecutor of class 
> ApplicationInfo is called, which reduces the variable coresGranted to 
> zero. The variable coresGranted is used to display the number of cores on 
> the Web Page.
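
As a toy illustration of that explanation (not the real ApplicationInfo class), 
the same counter serves both live accounting and the final display, so tearing 
down executors on sc.stop() drives it back to zero:
{code}
class AppInfoToy {
  var coresGranted = 0
  def addExecutor(cores: Int): Unit = coresGranted += cores
  def removeExecutor(cores: Int): Unit = coresGranted -= cores
}

val app = new AppInfoToy
(1 to 4).foreach(_ => app.addExecutor(8))     // 32 cores granted while the app runs
(1 to 4).foreach(_ => app.removeExecutor(8))  // sc.stop() removes the executors
println(app.coresGranted)                     // 0 is what the Completed Applications page then shows
{code}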



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-4897) Python 3 support

2015-02-26 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu reassigned SPARK-4897:
-

Assignee: Davies Liu

> Python 3 support
> 
>
> Key: SPARK-4897
> URL: https://issues.apache.org/jira/browse/SPARK-4897
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Josh Rosen
>Assignee: Davies Liu
>Priority: Minor
>
> It would be nice to have Python 3 support in PySpark, provided that we can do 
> it in a way that maintains backwards-compatibility with Python 2.6.
> I started looking into porting this; my WIP work can be found at 
> https://github.com/JoshRosen/spark/compare/python3
> I was able to use the 
> [futurize|http://python-future.org/futurize.html#forwards-conversion-stage1] 
> tool to handle the basic conversion of things like {{print}} statements, etc. 
> and had to manually fix up a few imports for packages that moved / were 
> renamed, but the major blocker that I hit was {{cloudpickle}}:
> {code}
> [joshrosen python (python3)]$ PYSPARK_PYTHON=python3 ../bin/pyspark
> Python 3.4.2 (default, Oct 19 2014, 17:52:17)
> [GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.51)] on darwin
> Type "help", "copyright", "credits" or "license" for more information.
> Traceback (most recent call last):
>   File "/Users/joshrosen/Documents/Spark/python/pyspark/shell.py", line 28, 
> in 
> import pyspark
>   File "/Users/joshrosen/Documents/spark/python/pyspark/__init__.py", line 
> 41, in 
> from pyspark.context import SparkContext
>   File "/Users/joshrosen/Documents/spark/python/pyspark/context.py", line 26, 
> in 
> from pyspark import accumulators
>   File "/Users/joshrosen/Documents/spark/python/pyspark/accumulators.py", 
> line 97, in 
> from pyspark.cloudpickle import CloudPickler
>   File "/Users/joshrosen/Documents/spark/python/pyspark/cloudpickle.py", line 
> 120, in 
> class CloudPickler(pickle.Pickler):
>   File "/Users/joshrosen/Documents/spark/python/pyspark/cloudpickle.py", line 
> 122, in CloudPickler
> dispatch = pickle.Pickler.dispatch.copy()
> AttributeError: type object '_pickle.Pickler' has no attribute 'dispatch'
> {code}
> This code looks like it will be difficult to port to Python 3, so this 
> might be a good reason to switch to 
> [Dill|https://github.com/uqfoundation/dill] for Python serialization.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5950) Insert array into a metastore table saved as parquet should work when using datasource api

2015-02-26 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339557#comment-14339557
 ] 

Liang-Chi Hsieh commented on SPARK-5950:


Let me explain in the GitHub pull request.

> Insert array into a metastore table saved as parquet should work when using 
> datasource api
> --
>
> Key: SPARK-5950
> URL: https://issues.apache.org/jira/browse/SPARK-5950
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>Priority: Blocker
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-4579) Scheduling Delay appears negative

2015-02-26 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-4579.

  Resolution: Fixed
   Fix Version/s: 1.2.2
  1.3.0
Assignee: Sean Owen  (was: Andrew Or)
Target Version/s: 1.3.0, 1.2.2

> Scheduling Delay appears negative
> -
>
> Key: SPARK-4579
> URL: https://issues.apache.org/jira/browse/SPARK-4579
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.2.0
>Reporter: Arun Ahuja
>Assignee: Sean Owen
> Fix For: 1.3.0, 1.2.2
>
>
> !https://cloud.githubusercontent.com/assets/455755/5174438/23d08604-73ff-11e4-9a76-97233b610544.png!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6048) SparkConf.translateConfKey should translate on get, not set

2015-02-26 Thread Andrew Or (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339544#comment-14339544
 ] 

Andrew Or commented on SPARK-6048:
--

[~vanzin]

Note that (3) is orthogonal to this change. We can still do all the warnings at 
the beginning rather than later. However I still don't see why warnings should 
necessarily be tied to when keys are set, though that is a separate discussion.

For (2), yes we can just fix remove(), but doing so means duplicating the 
translation and keeping track of one more place where the translation takes 
place. In the future if we add more methods to SparkConf, we'll have to 
remember to do the translation otherwise it won't work correctly. I personally 
find limiting the scope of translation much cleaner.

(1) Maybe it's unlikely, but it breaks existing user behavior in a confounding 
way nevertheless. When it fails it will be extremely difficult to debug which 
value is used without doing some querying of the conf itself.

> SparkConf.translateConfKey should translate on get, not set
> ---
>
> Key: SPARK-6048
> URL: https://issues.apache.org/jira/browse/SPARK-6048
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>Priority: Blocker
>
> There are several issues with translating on set.
> (1) The most serious one is that if the user has both the deprecated and the 
> latest version of the same config set, then the value picked up by SparkConf 
> will be arbitrary. Why? Because during initialization of the conf we call 
> `conf.set` on each property in `sys.props` in an order arbitrarily defined by 
> Java. As a result, the value of the more recent config may be overridden by 
> that of the deprecated one. Instead, we should always use the value of the 
> most recent config.
> (2) If we translate on set, then we must keep translating everywhere else. In 
> fact, the current code does not translate on remove, which means the 
> following won't work if X is deprecated:
> {code}
> conf.set(X, Y)
> conf.remove(X) // X is not in the conf
> {code}
> This requires us to also translate in remove and other places, as we already 
> do for contains, leading to more duplicate code.
> (3) Since we call `conf.set` on all configs when initializing the conf, we 
> print all deprecation warnings in the beginning. Elsewhere in Spark, however, 
> we warn the user when the deprecated config / option / env var is actually 
> being used.
> We should keep this consistent so the user won't expect to find all 
> deprecation messages in the beginning of his logs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5950) Insert array into a metastore table saved as parquet should work when using datasource api

2015-02-26 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339542#comment-14339542
 ] 

Yin Huai commented on SPARK-5950:
-

If it is just part of the problem, would you mind explaining what the full problem is?

> Insert array into a metastore table saved as parquet should work when using 
> datasource api
> --
>
> Key: SPARK-5950
> URL: https://issues.apache.org/jira/browse/SPARK-5950
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>Priority: Blocker
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5950) Insert array into a metastore table saved as parquet should work when using datasource api

2015-02-26 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339534#comment-14339534
 ] 

Liang-Chi Hsieh commented on SPARK-5950:


Yes, this is just part of the problem. containsNull/valueContainsNull is 
most of the problem; nullable should not be a problem.

> Insert array into a metastore table saved as parquet should work when using 
> datasource api
> --
>
> Key: SPARK-5950
> URL: https://issues.apache.org/jira/browse/SPARK-5950
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>Priority: Blocker
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5508) Arrays and Maps stored with Hive Parquet Serde may not be readable by the Parquet support in the Data Source API

2015-02-26 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-5508:

Summary: Arrays and Maps stored with Hive Parquet Serde may not be readable 
by the Parquet support in the Data Source API  (was: [hive context] 
java.lang.IndexOutOfBoundsException: Index: 0, Size: 0)

> Arrays and Maps stored with Hive Parquet Serde may not be readable by the 
> Parquet support in the Data Source API
> ---
>
> Key: SPARK-5508
> URL: https://issues.apache.org/jira/browse/SPARK-5508
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.1
> Environment: mesos, cdh
>Reporter: Ayoub Benali
>  Labels: hivecontext, parquet
>
> When the table is saved as Parquet, we cannot query a field which is an array 
> of structs after an INSERT statement, as shown below:
> {noformat}
> scala> val data1="""{
>  | "timestamp": 1422435598,
>  | "data_array": [
>  | {
>  | "field1": 1,
>  | "field2": 2
>  | }
>  | ]
>  | }"""
> scala> val data2="""{
>  | "timestamp": 1422435598,
>  | "data_array": [
>  | {
>  | "field1": 3,
>  | "field2": 4
>  | }
>  | ]
>  | }"""
> scala> val jsonRDD = sc.makeRDD(data1 :: data2 :: Nil)
> scala> val rdd = hiveContext.jsonRDD(jsonRDD)
> scala> rdd.printSchema
> root
>  |-- data_array: array (nullable = true)
>  ||-- element: struct (containsNull = false)
>  |||-- field1: integer (nullable = true)
>  |||-- field2: integer (nullable = true)
>  |-- timestamp: integer (nullable = true)
> scala> rdd.registerTempTable("tmp_table")
> scala> hiveContext.sql("select data.field1 from tmp_table LATERAL VIEW 
> explode(data_array) nestedStuff AS data").collect
> res3: Array[org.apache.spark.sql.Row] = Array([1], [3])
> scala> hiveContext.sql("SET hive.exec.dynamic.partition = true")
> scala> hiveContext.sql("SET hive.exec.dynamic.partition.mode = nonstrict")
> scala> hiveContext.sql("set parquet.compression=GZIP")
> scala> hiveContext.setConf("spark.sql.parquet.binaryAsString", "true")
> scala> hiveContext.sql("create external table if not exists 
> persisted_table(data_array ARRAY<STRUCT<field1: INT, field2: INT>>, 
> timestamp INT) STORED AS PARQUET Location 'hdfs:///test_table'")
> scala> hiveContext.sql("insert into table persisted_table select * from 
> tmp_table").collect
> scala> hiveContext.sql("select data.field1 from persisted_table LATERAL VIEW 
> explode(data_array) nestedStuff AS data").collect
> parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in 
> file hdfs://*/test_table/part-1
>   at 
> parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:213)
>   at 
> parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:204)
>   at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:145)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>   at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>   at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
>   at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
>   at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:797)
>   at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:797)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1353)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1353)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
>   at org.apache.spark.scheduler.Task.run(Task.scala:56)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
>   at 
> java.util.concurrent.ThreadPoolExecutor.ru

[jira] [Updated] (SPARK-5950) Insert array into a metastore table saved as parquet should work when using datasource api

2015-02-26 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-5950:

Priority: Blocker  (was: Major)

> Insert array into a metastore table saved as parquet should work when using 
> datasource api
> --
>
> Key: SPARK-5950
> URL: https://issues.apache.org/jira/browse/SPARK-5950
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>Priority: Blocker
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5508) [hive context] java.lang.IndexOutOfBoundsException: Index: 0, Size: 0

2015-02-26 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-5508:

Target Version/s: 1.3.0

> [hive context] java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
> -
>
> Key: SPARK-5508
> URL: https://issues.apache.org/jira/browse/SPARK-5508
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.1
> Environment: mesos, cdh
>Reporter: Ayoub Benali
>  Labels: hivecontext, parquet
>
> When the table is saved as Parquet, we cannot query a field which is an array 
> of structs after an INSERT statement, as shown below:
> {noformat}
> scala> val data1="""{
>  | "timestamp": 1422435598,
>  | "data_array": [
>  | {
>  | "field1": 1,
>  | "field2": 2
>  | }
>  | ]
>  | }"""
> scala> val data2="""{
>  | "timestamp": 1422435598,
>  | "data_array": [
>  | {
>  | "field1": 3,
>  | "field2": 4
>  | }
>  | ]
>  | }"""
> scala> val jsonRDD = sc.makeRDD(data1 :: data2 :: Nil)
> scala> val rdd = hiveContext.jsonRDD(jsonRDD)
> scala> rdd.printSchema
> root
>  |-- data_array: array (nullable = true)
>  ||-- element: struct (containsNull = false)
>  |||-- field1: integer (nullable = true)
>  |||-- field2: integer (nullable = true)
>  |-- timestamp: integer (nullable = true)
> scala> rdd.registerTempTable("tmp_table")
> scala> hiveContext.sql("select data.field1 from tmp_table LATERAL VIEW 
> explode(data_array) nestedStuff AS data").collect
> res3: Array[org.apache.spark.sql.Row] = Array([1], [3])
> scala> hiveContext.sql("SET hive.exec.dynamic.partition = true")
> scala> hiveContext.sql("SET hive.exec.dynamic.partition.mode = nonstrict")
> scala> hiveContext.sql("set parquet.compression=GZIP")
> scala> hiveContext.setConf("spark.sql.parquet.binaryAsString", "true")
> scala> hiveContext.sql("create external table if not exists 
> persisted_table(data_array ARRAY<STRUCT<field1: INT, field2: INT>>, 
> timestamp INT) STORED AS PARQUET Location 'hdfs:///test_table'")
> scala> hiveContext.sql("insert into table persisted_table select * from 
> tmp_table").collect
> scala> hiveContext.sql("select data.field1 from persisted_table LATERAL VIEW 
> explode(data_array) nestedStuff AS data").collect
> parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in 
> file hdfs://*/test_table/part-1
>   at 
> parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:213)
>   at 
> parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:204)
>   at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:145)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>   at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>   at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
>   at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
>   at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:797)
>   at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:797)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1353)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1353)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
>   at org.apache.spark.scheduler.Task.run(Task.scala:56)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:744)
> Caused by: java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
>   at java.util.ArrayList.rangeChe

[jira] [Commented] (SPARK-5950) Insert array into a metastore table saved as parquet should work when using datasource api

2015-02-26 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339530#comment-14339530
 ] 

Yin Huai commented on SPARK-5950:
-

OK. Now, I understand what's going on.  For this JIRA, the table is a 
MetastoreRelation and we are trying to use data source API's write path to 
insert into it. For a MetastoreRelation, containsNull, valueContainsNull and 
nullable will always be true. When we try to insert into this table through the 
data source write path, if any of containsNull/valueContainsNull/nullable is 
false, InsertIntoTable will not be resolved because of the nullability issue.
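
A small sketch of the mismatch described above (toy types, not Spark's analyzer): 
the metastore side reports containsNull = true, the query side may report false, 
and a strict schema-equality check then fails even though the data is compatible.
{code}
case class ToyArrayType(elementType: String, containsNull: Boolean)

val fromMetastore = ToyArrayType("int", containsNull = true)   // MetastoreRelation: always nullable
val fromQuery     = ToyArrayType("int", containsNull = false)  // schema inferred from the data

val strictMatch  = fromMetastore == fromQuery                           // false -> InsertIntoTable stays unresolved
val relaxedMatch = fromMetastore.elementType == fromQuery.elementType   // true  -> ignore nullability when writing
{code}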

> Insert array into a metastore table saved as parquet should work when using 
> datasource api
> --
>
> Key: SPARK-5950
> URL: https://issues.apache.org/jira/browse/SPARK-5950
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-5508) [hive context] java.lang.IndexOutOfBoundsException: Index: 0, Size: 0

2015-02-26 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai reopened SPARK-5508:
-

I am reopening it since it is different from SPARK-5950.

> [hive context] java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
> -
>
> Key: SPARK-5508
> URL: https://issues.apache.org/jira/browse/SPARK-5508
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.1
> Environment: mesos, cdh
>Reporter: Ayoub Benali
>  Labels: hivecontext, parquet
>
> When the table is saved as Parquet, we cannot query a field which is an array 
> of structs after an INSERT statement, as shown below:
> {noformat}
> scala> val data1="""{
>  | "timestamp": 1422435598,
>  | "data_array": [
>  | {
>  | "field1": 1,
>  | "field2": 2
>  | }
>  | ]
>  | }"""
> scala> val data2="""{
>  | "timestamp": 1422435598,
>  | "data_array": [
>  | {
>  | "field1": 3,
>  | "field2": 4
>  | }
>  | ]
>  | }"""
> scala> val jsonRDD = sc.makeRDD(data1 :: data2 :: Nil)
> scala> val rdd = hiveContext.jsonRDD(jsonRDD)
> scala> rdd.printSchema
> root
>  |-- data_array: array (nullable = true)
>  ||-- element: struct (containsNull = false)
>  |||-- field1: integer (nullable = true)
>  |||-- field2: integer (nullable = true)
>  |-- timestamp: integer (nullable = true)
> scala> rdd.registerTempTable("tmp_table")
> scala> hiveContext.sql("select data.field1 from tmp_table LATERAL VIEW 
> explode(data_array) nestedStuff AS data").collect
> res3: Array[org.apache.spark.sql.Row] = Array([1], [3])
> scala> hiveContext.sql("SET hive.exec.dynamic.partition = true")
> scala> hiveContext.sql("SET hive.exec.dynamic.partition.mode = nonstrict")
> scala> hiveContext.sql("set parquet.compression=GZIP")
> scala> hiveContext.setConf("spark.sql.parquet.binaryAsString", "true")
> scala> hiveContext.sql("create external table if not exists 
> persisted_table(data_array ARRAY<STRUCT<field1: INT, field2: INT>>, 
> timestamp INT) STORED AS PARQUET Location 'hdfs:///test_table'")
> scala> hiveContext.sql("insert into table persisted_table select * from 
> tmp_table").collect
> scala> hiveContext.sql("select data.field1 from persisted_table LATERAL VIEW 
> explode(data_array) nestedStuff AS data").collect
> parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in 
> file hdfs://*/test_table/part-1
>   at 
> parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:213)
>   at 
> parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:204)
>   at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:145)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>   at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>   at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
>   at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
>   at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:797)
>   at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:797)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1353)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1353)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
>   at org.apache.spark.scheduler.Task.run(Task.scala:56)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:744)
> Caused by: java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
>  

[jira] [Updated] (SPARK-5950) Insert array into a metastore table saved as parquet should work when using datasource api

2015-02-26 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-5950:

Summary: Insert array into a metastore table saved as parquet should work 
when using datasource api  (was: Insert array into a metastore table should 
work when using datasource api)

> Insert array into a metastore table saved as parquet should work when using 
> datasource api
> --
>
> Key: SPARK-5950
> URL: https://issues.apache.org/jira/browse/SPARK-5950
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5950) Insert array into a metastore table should work when using datasource api

2015-02-26 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-5950:

Summary: Insert array into a metastore table should work when using 
datasource api  (was: Insert array into a metastore table saved as parquet 
should work when using datasource api)

> Insert array into a metastore table should work when using datasource api
> -
>
> Key: SPARK-5950
> URL: https://issues.apache.org/jira/browse/SPARK-5950
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6048) SparkConf.translateConfKey should translate on get, not set

2015-02-26 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-6048:
-
Priority: Blocker  (was: Critical)

> SparkConf.translateConfKey should translate on get, not set
> ---
>
> Key: SPARK-6048
> URL: https://issues.apache.org/jira/browse/SPARK-6048
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>Priority: Blocker
>
> There are several issues with translating on set.
> (1) The most serious one is that if the user has both the deprecated and the 
> latest version of the same config set, then the value picked up by SparkConf 
> will be arbitrary. Why? Because during initialization of the conf we call 
> `conf.set` on each property in `sys.props` in an order arbitrarily defined by 
> Java. As a result, the value of the more recent config may be overridden by 
> that of the deprecated one. Instead, we should always use the value of the 
> most recent config.
> (2) If we translate on set, then we must keep translating everywhere else. In 
> fact, the current code does not translate on remove, which means the 
> following won't work if X is deprecated:
> {code}
> conf.set(X, Y)
> conf.remove(X) // X is not in the conf
> {code}
> This requires us to also translate in remove and other places, as we already 
> do for contains, leading to more duplicate code.
> (3) Since we call `conf.set` on all configs when initializing the conf, we 
> print all deprecation warnings in the beginning. Elsewhere in Spark, however, 
> we warn the user when the deprecated config / option / env var is actually 
> being used.
> We should keep this consistent so the user won't expect to find all 
> deprecation messages in the beginning of his logs.
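
A minimal sketch (not the actual SparkConf code) of the "translate on get" direction described above, with a hypothetical deprecated-to-new key table: settings are stored exactly as the user set them, and deprecated keys are only resolved when a value is read, so the newer key wins regardless of the order in which `sys.props` was applied.

{code}
// Hedged sketch of "translate on get": nothing is rewritten at set time, and if both
// the new and the deprecated key are present, reads always prefer the new one.
import scala.collection.mutable

class SketchConf {
  // Hypothetical deprecation table; real key names may differ.
  private val deprecatedToNew = Map("spark.old.key" -> "spark.new.key")
  private val settings = mutable.HashMap.empty[String, String]

  def set(key: String, value: String): this.type = { settings(key) = value; this }
  def remove(key: String): this.type = { settings -= key; this }

  def get(key: String): Option[String] = {
    val newKey = deprecatedToNew.getOrElse(key, key)
    settings.get(newKey).orElse {
      // Fall back to a deprecated spelling only if the new key is absent; this is
      // also the natural point to log a deprecation warning, at actual use time.
      deprecatedToNew.collectFirst {
        case (old, `newKey`) if settings.contains(old) => settings(old)
      }
    }
  }
}

// val conf = new SketchConf().set("spark.old.key", "a").set("spark.new.key", "b")
// conf.get("spark.old.key")    // Some("b"): the most recent config always wins
// conf.remove("spark.old.key") // works, because nothing was translated on set
{code}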



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5950) Insert array into a metastore table saved as parquet should work when using datasource api

2015-02-26 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-5950:

Summary: Insert array into a metastore table saved as parquet should work 
when using datasource api  (was: Insert array into table saved as parquet 
should work when using datasource api)

> Insert array into a metastore table saved as parquet should work when using 
> datasource api
> --
>
> Key: SPARK-5950
> URL: https://issues.apache.org/jira/browse/SPARK-5950
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5950) Insert array into table saved as parquet should work when using datasource api

2015-02-26 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-5950:

Summary: Insert array into table saved as parquet should work when using 
datasource api  (was: Arrays and Maps stored with Hive Parquet Serde may not be 
able to read by the Parquet support in the Data Souce API )

> Insert array into table saved as parquet should work when using datasource api
> --
>
> Key: SPARK-5950
> URL: https://issues.apache.org/jira/browse/SPARK-5950
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6048) SparkConf.translateConfKey should translate on get, not set

2015-02-26 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339519#comment-14339519
 ] 

Marcelo Vanzin commented on SPARK-6048:
---

I sort of agree with (1). But I think it's both unlikely (users will probably 
use the old option or the new one, but not both), and probably sort of fixable 
(but not optimally). Basically, don't override a value that's already set when 
using the deprecated key.

I disagree with (2). Just fix remove().

I also disagree with (3), and it's not even the correct interpretation of what 
happens. Warnings *only* happen when the configuration keys are set, never when 
reading. And I think it's actually a good thing that all (or most) of the 
warnings show up when creating the conf object, which generally happens early 
in the app's life. It means it's easier to see them.

> SparkConf.translateConfKey should translate on get, not set
> ---
>
> Key: SPARK-6048
> URL: https://issues.apache.org/jira/browse/SPARK-6048
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>Priority: Critical
>
> There are several issues with translating on set.
> (1) The most serious one is that if the user has both the deprecated and the 
> latest version of the same config set, then the value picked up by SparkConf 
> will be arbitrary. Why? Because during initialization of the conf we call 
> `conf.set` on each property in `sys.props` in an order arbitrarily defined by 
> Java. As a result, the value of the more recent config may be overridden by 
> that of the deprecated one. Instead, we should always use the value of the 
> most recent config.
> (2) If we translate on set, then we must keep translating everywhere else. In 
> fact, the current code does not translate on remove, which means the 
> following won't work if X is deprecated:
> {code}
> conf.set(X, Y)
> conf.remove(X) // X is not in the conf
> {code}
> This requires us to also translate in remove and other places, as we already 
> do for contains, leading to more duplicate code.
> (3) Since we call `conf.set` on all configs when initializing the conf, we 
> print all deprecation warnings in the beginning. Elsewhere in Spark, however, 
> we warn the user when the deprecated config / option / env var is actually 
> being used.
> We should keep this consistent so the user won't expect to find all 
> deprecation messages in the beginning of his logs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6048) SparkConf.translateConfKey should translate on get, not set

2015-02-26 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-6048:
-
Description: 
There are several issues with translating on set.

(1) The most serious one is that if the user has both the deprecated and the 
latest version of the same config set, then the value picked up by SparkConf 
will be arbitrary. Why? Because during initialization of the conf we call 
`conf.set` on each property in `sys.props` in an order arbitrarily defined by 
Java. As a result, the value of the more recent config may be overridden by 
that of the deprecated one. Instead, we should always use the value of the most 
recent config.

(2) If we translate on set, then we must keep translating everywhere else. In 
fact, the current code does not translate on remove, which means the following 
won't work if X is deprecated:
{code}
conf.set(X, Y)
conf.remove(X) // X is not in the conf
{code}
This requires us to also translate in remove and other places, as we already do 
for contains, leading to more duplicate code.

(3) Since we call `conf.set` on all configs when initializing the conf, we 
print all deprecation warnings in the beginning. Elsewhere in Spark, however, 
we warn the user when the deprecated config / option / env var is actually 
being used.
We should keep this consistent so the user won't expect to find all deprecation 
messages in the beginning of his logs.

  was:
There are several issues with translating on set.

(1) The most serious one is that if the user has both the deprecated and the 
latest version of the same config set, then the value picked up by SparkConf 
will be arbitrary. Why? Because during initialization of the conf we call 
`conf.set` on each property in `sys.props` in an order arbitrarily defined by 
Java. Instead, we should always use the value of the latest version of the 
config if that is provided.

(2) If we translate on set, then we must keep translating everywhere else. In 
fact, the current code does not translate on remove, which means the following 
won't work if X is deprecated:
{code}
conf.set(X, Y)
conf.remove(X) // X is not in the conf
{code}
This requires us to also translate in remove and other places, as we already do 
for contains, leading to more duplicate code.

(3) Since we call `conf.set` on all configs when initializing the conf, we 
print all deprecation warnings in the beginning. Elsewhere in Spark, however, 
we warn the user when the deprecated config / option / env var is actually 
being used.
We should keep this consistent so the user won't expect to find all deprecation 
messages in the beginning of his logs.


> SparkConf.translateConfKey should translate on get, not set
> ---
>
> Key: SPARK-6048
> URL: https://issues.apache.org/jira/browse/SPARK-6048
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>Priority: Critical
>
> There are several issues with translating on set.
> (1) The most serious one is that if the user has both the deprecated and the 
> latest version of the same config set, then the value picked up by SparkConf 
> will be arbitrary. Why? Because during initialization of the conf we call 
> `conf.set` on each property in `sys.props` in an order arbitrarily defined by 
> Java. As a result, the value of the more recent config may be overridden by 
> that of the deprecated one. Instead, we should always use the value of the 
> most recent config.
> (2) If we translate on set, then we must keep translating everywhere else. In 
> fact, the current code does not translate on remove, which means the 
> following won't work if X is deprecated:
> {code}
> conf.set(X, Y)
> conf.remove(X) // X is not in the conf
> {code}
> This requires us to also translate in remove and other places, as we already 
> do for contains, leading to more duplicate code.
> (3) Since we call `conf.set` on all configs when initializing the conf, we 
> print all deprecation warnings in the beginning. Elsewhere in Spark, however, 
> we warn the user when the deprecated config / option / env var is actually 
> being used.
> We should keep this consistent so the user won't expect to find all 
> deprecation messages in the beginning of his logs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6048) SparkConf.translateConfKey should translate on get, not set

2015-02-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339498#comment-14339498
 ] 

Apache Spark commented on SPARK-6048:
-

User 'andrewor14' has created a pull request for this issue:
https://github.com/apache/spark/pull/4799

> SparkConf.translateConfKey should translate on get, not set
> ---
>
> Key: SPARK-6048
> URL: https://issues.apache.org/jira/browse/SPARK-6048
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>Priority: Critical
>
> There are several issues with translating on set.
> (1) The most serious one is that if the user has both the deprecated and the 
> latest version of the same config set, then the value picked up by SparkConf 
> will be arbitrary. Why? Because during initialization of the conf we call 
> `conf.set` on each property in `sys.props` in an order arbitrarily defined by 
> Java. Instead, we should always use the value of the latest version of the 
> config if that is provided.
> (2) If we translate on set, then we must keep translating everywhere else. In 
> fact, the current code does not translate on remove, which means the 
> following won't work if X is deprecated:
> {code}
> conf.set(X, Y)
> conf.remove(X) // X is not in the conf
> {code}
> (3) Since we call `conf.set` on all configs when initializing the conf, we 
> print all deprecation warnings in the beginning. Elsewhere in Spark, however, 
> we warn the user when the deprecated config / option / env var is actually 
> being used.
> We should keep this consistent so the user won't expect to find all 
> deprecation messages in the beginning of his logs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6048) SparkConf.translateConfKey should translate on get, not set

2015-02-26 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-6048:
-
Description: 
There are several issues with translating on set.

(1) The most serious one is that if the user has both the deprecated and the 
latest version of the same config set, then the value picked up by SparkConf 
will be arbitrary. Why? Because during initialization of the conf we call 
`conf.set` on each property in `sys.props` in an order arbitrarily defined by 
Java. Instead, we should always use the value of the latest version of the 
config if that is provided.

(2) If we translate on set, then we must keep translating everywhere else. In 
fact, the current code does not translate on remove, which means the following 
won't work if X is deprecated:
{code}
conf.set(X, Y)
conf.remove(X) // X is not in the conf
{code}
This requires us to also translate in remove and other places, as we already do 
for contains, leading to more duplicate code.

(3) Since we call `conf.set` on all configs when initializing the conf, we 
print all deprecation warnings in the beginning. Elsewhere in Spark, however, 
we warn the user when the deprecated config / option / env var is actually 
being used.
We should keep this consistent so the user won't expect to find all deprecation 
messages in the beginning of his logs.

  was:
There are several issues with translating on set.

(1) The most serious one is that if the user has both the deprecated and the 
latest version of the same config set, then the value picked up by SparkConf 
will be arbitrary. Why? Because during initialization of the conf we call 
`conf.set` on each property in `sys.props` in an order arbitrarily defined by 
Java. Instead, we should always use the value of the latest version of the 
config if that is provided.

(2) If we translate on set, then we must keep translating everywhere else. In 
fact, the current code does not translate on remove, which means the following 
won't work if X is deprecated:
{code}
conf.set(X, Y)
conf.remove(X) // X is not in the conf
{code}

(3) Since we call `conf.set` on all configs when initializing the conf, we 
print all deprecation warnings in the beginning. Elsewhere in Spark, however, 
we warn the user when the deprecated config / option / env var is actually 
being used.
We should keep this consistent so the user won't expect to find all deprecation 
messages in the beginning of his logs.


> SparkConf.translateConfKey should translate on get, not set
> ---
>
> Key: SPARK-6048
> URL: https://issues.apache.org/jira/browse/SPARK-6048
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>Priority: Critical
>
> There are several issues with translating on set.
> (1) The most serious one is that if the user has both the deprecated and the 
> latest version of the same config set, then the value picked up by SparkConf 
> will be arbitrary. Why? Because during initialization of the conf we call 
> `conf.set` on each property in `sys.props` in an order arbitrarily defined by 
> Java. Instead, we should always use the value of the latest version of the 
> config if that is provided.
> (2) If we translate on set, then we must keep translating everywhere else. In 
> fact, the current code does not translate on remove, which means the 
> following won't work if X is deprecated:
> {code}
> conf.set(X, Y)
> conf.remove(X) // X is not in the conf
> {code}
> This requires us to also translate in remove and other places, as we already 
> do for contains, leading to more duplicate code.
> (3) Since we call `conf.set` on all configs when initializing the conf, we 
> print all deprecation warnings in the beginning. Elsewhere in Spark, however, 
> we warn the user when the deprecated config / option / env var is actually 
> being used.
> We should keep this consistent so the user won't expect to find all 
> deprecation messages in the beginning of his logs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6048) SparkConf.translateConfKey should translate on get, not set

2015-02-26 Thread Andrew Or (JIRA)
Andrew Or created SPARK-6048:


 Summary: SparkConf.translateConfKey should translate on get, not 
set
 Key: SPARK-6048
 URL: https://issues.apache.org/jira/browse/SPARK-6048
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.3.0
Reporter: Andrew Or
Assignee: Andrew Or
Priority: Critical


There are several issues with translating on set.

(1) The most serious one is that if the user has both the deprecated and the 
latest version of the same config set, then the value picked up by SparkConf 
will be arbitrary. Why? Because during initialization of the conf we call 
`conf.set` on each property in `sys.props` in an order arbitrarily defined by 
Java. Instead, we should always use the value of the latest version of the 
config if that is provided.

(2) If we translate on set, then we must keep translating everywhere else. In 
fact, the current code does not translate on remove, which means the following 
won't work if X is deprecated:
{code}
conf.set(X, Y)
conf.remove(X) // X is not in the conf
{code}

(3) Since we call `conf.set` on all configs when initializing the conf, we 
print all deprecation warnings in the beginning. Elsewhere in Spark, however, 
we warn the user when the deprecated config / option / env var is actually 
being used.
We should keep this consistent so the user won't expect to find all deprecation 
messages in the beginning of his logs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5775) GenericRow cannot be cast to SpecificMutableRow when nested data and partitioned table

2015-02-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339457#comment-14339457
 ] 

Apache Spark commented on SPARK-5775:
-

User 'yhuai' has created a pull request for this issue:
https://github.com/apache/spark/pull/4798

> GenericRow cannot be cast to SpecificMutableRow when nested data and 
> partitioned table
> --
>
> Key: SPARK-5775
> URL: https://issues.apache.org/jira/browse/SPARK-5775
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.1
>Reporter: Ayoub Benali
>Assignee: Cheng Lian
>Priority: Blocker
>  Labels: hivecontext, nested, parquet, partition
>
> Using the "LOAD" sql command in Hive context to load parquet files into a 
> partitioned table causes exceptions during query time. 
> The bug requires the table to have a column of *type Array of struct* and to 
> be *partitioned*. 
> The example below shows how to reproduce the bug and you can see that if the 
> table is not partitioned the query works fine. 
> {noformat}
> scala> val data1 = """{"data_array":[{"field1":1,"field2":2}]}"""
> scala> val data2 = """{"data_array":[{"field1":3,"field2":4}]}"""
> scala> val jsonRDD = sc.makeRDD(data1 :: data2 :: Nil)
> scala> val schemaRDD = hiveContext.jsonRDD(jsonRDD)
> scala> schemaRDD.printSchema
> root
>  |-- data_array: array (nullable = true)
>  ||-- element: struct (containsNull = false)
>  |||-- field1: integer (nullable = true)
>  |||-- field2: integer (nullable = true)
> scala> hiveContext.sql("create external table if not exists 
> partitioned_table(data_array ARRAY<STRUCT<field1: INT, field2: INT>>) 
> Partitioned by (date STRING) STORED AS PARQUET Location 
> 'hdfs:///partitioned_table'")
> scala> hiveContext.sql("create external table if not exists 
> none_partitioned_table(data_array ARRAY<STRUCT<field1: INT, field2: INT>>) 
> STORED AS PARQUET Location 'hdfs:///none_partitioned_table'")
> scala> schemaRDD.saveAsParquetFile("hdfs:///tmp_data_1")
> scala> schemaRDD.saveAsParquetFile("hdfs:///tmp_data_2")
> scala> hiveContext.sql("LOAD DATA INPATH 
> 'hdfs://qa-hdc001.ffm.nugg.ad:8020/erlogd/tmp_data_1' INTO TABLE 
> partitioned_table PARTITION(date='2015-02-12')")
> scala> hiveContext.sql("LOAD DATA INPATH 
> 'hdfs://qa-hdc001.ffm.nugg.ad:8020/erlogd/tmp_data_2' INTO TABLE 
> none_partitioned_table")
> scala> hiveContext.sql("select data.field1 from none_partitioned_table 
> LATERAL VIEW explode(data_array) nestedStuff AS data").collect
> res23: Array[org.apache.spark.sql.Row] = Array([1], [3])
> scala> hiveContext.sql("select data.field1 from partitioned_table LATERAL 
> VIEW explode(data_array) nestedStuff AS data").collect
> 15/02/12 16:21:03 INFO ParseDriver: Parsing command: select data.field1 from 
> partitioned_table LATERAL VIEW explode(data_array) nestedStuff AS data
> 15/02/12 16:21:03 INFO ParseDriver: Parse Completed
> 15/02/12 16:21:03 INFO MemoryStore: ensureFreeSpace(260661) called with 
> curMem=0, maxMem=280248975
> 15/02/12 16:21:03 INFO MemoryStore: Block broadcast_18 stored as values in 
> memory (estimated size 254.6 KB, free 267.0 MB)
> 15/02/12 16:21:03 INFO MemoryStore: ensureFreeSpace(28615) called with 
> curMem=260661, maxMem=280248975
> 15/02/12 16:21:03 INFO MemoryStore: Block broadcast_18_piece0 stored as bytes 
> in memory (estimated size 27.9 KB, free 267.0 MB)
> 15/02/12 16:21:03 INFO BlockManagerInfo: Added broadcast_18_piece0 in memory 
> on *:51990 (size: 27.9 KB, free: 267.2 MB)
> 15/02/12 16:21:03 INFO BlockManagerMaster: Updated info of block 
> broadcast_18_piece0
> 15/02/12 16:21:03 INFO SparkContext: Created broadcast 18 from NewHadoopRDD 
> at ParquetTableOperations.scala:119
> 15/02/12 16:21:03 INFO FileInputFormat: Total input paths to process : 3
> 15/02/12 16:21:03 INFO ParquetInputFormat: Total input paths to process : 3
> 15/02/12 16:21:03 INFO FilteringParquetRowInputFormat: Using Task Side 
> Metadata Split Strategy
> 15/02/12 16:21:03 INFO SparkContext: Starting job: collect at 
> SparkPlan.scala:84
> 15/02/12 16:21:03 INFO DAGScheduler: Got job 12 (collect at 
> SparkPlan.scala:84) with 3 output partitions (allowLocal=false)
> 15/02/12 16:21:03 INFO DAGScheduler: Final stage: Stage 13(collect at 
> SparkPlan.scala:84)
> 15/02/12 16:21:03 INFO DAGScheduler: Parents of final stage: List()
> 15/02/12 16:21:03 INFO DAGScheduler: Missing parents: List()
> 15/02/12 16:21:03 INFO DAGScheduler: Submitting Stage 13 (MappedRDD[111] at 
> map at SparkPlan.scala:84), which has no missing parents
> 15/02/12 16:21:03 INFO MemoryStore: ensureFreeSpace(7632) called with 
> curMem=289276, maxMem=280248975
> 15/02/12 16:21:03 INFO MemoryStore: Block broadcast_19 stored as values in 
> memory (estimated size 7.5 KB, free 267.0 MB)
>

[jira] [Commented] (SPARK-6047) pyspark - class loading on driver failing with --jars and --packages

2015-02-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339453#comment-14339453
 ] 

Apache Spark commented on SPARK-6047:
-

User 'brkyvz' has created a pull request for this issue:
https://github.com/apache/spark/pull/4754

> pyspark - class loading on driver failing with --jars and --packages
> 
>
> Key: SPARK-6047
> URL: https://issues.apache.org/jira/browse/SPARK-6047
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Submit
>Affects Versions: 1.3.0
>Reporter: Burak Yavuz
>
> Because py4j uses the system ClassLoader instead of the contextClassLoader of 
> the thread, the dynamically added jars in Spark Submit can't be loaded in the 
> driver.
> This causes `Py4JError: Trying to call a package` errors.
> Usually `--packages` are downloaded from some remote repo right before runtime, so 
> they cannot be added explicitly to `--driver-class-path` the way we can with 
> `--jars`. One solution is to move the fetching of `--packages` to the 
> SparkSubmitDriverBootstrapper, and add it to the driver class-path there.
> A more complete solution can be achieved through [SPARK-4924].
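
A minimal illustration of the class-loading distinction described above (not py4j's actual code; the class name is a placeholder):

{code}
// A class whose jar was added to the thread's context class loader at runtime is
// visible to a lookup on that loader, but Class.forName(name) resolves against the
// caller's defining loader and can fail for a --jars/--packages jar, which surfaces
// as the "Trying to call a package" errors mentioned above.
val className = "com.example.SomeDynamicallyAddedClass" // hypothetical

val viaContextLoader: Class[_] =
  Thread.currentThread().getContextClassLoader.loadClass(className)

val viaDefiningLoader: Class[_] =
  Class.forName(className) // may throw ClassNotFoundException for a dynamically added jar
{code}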



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6047) pyspark - class loading on driver failing with --jars and --packages

2015-02-26 Thread Burak Yavuz (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz updated SPARK-6047:
---
Description: 
Because py4j uses the system ClassLoader instead of the contextClassLoader of 
the thread, the dynamically added jars in Spark Submit can't be loaded in the 
driver.

This causes `Py4JError: Trying to call a package` errors.

Usually `--packages` are downloaded from some remote repo right before runtime, so 
they cannot be added explicitly to `--driver-class-path` the way we can with 
`--jars`. One solution is to move the fetching of `--packages` to the 
SparkSubmitDriverBootstrapper, and add it to the driver class-path there.

A more complete solution can be achieved through [SPARK-4924].

  was:
Because py4j uses the system ClassLoader instead of the contextClassLoader of 
the thread, the dynamically added jars in Spark Submit can't be loaded in the 
driver.

This causes "package not found" errors in py4j. 


> pyspark - class loading on driver failing with --jars and --packages
> 
>
> Key: SPARK-6047
> URL: https://issues.apache.org/jira/browse/SPARK-6047
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Submit
>Affects Versions: 1.3.0
>Reporter: Burak Yavuz
>
> Because py4j uses the system ClassLoader instead of the contextClassLoader of 
> the thread, the dynamically added jars in Spark Submit can't be loaded in the 
> driver.
> This causes `Py4JError: Trying to call a package` errors.
> Usually `--packages` are downloaded from some remote repo right before runtime, so 
> they cannot be added explicitly to `--driver-class-path` the way we can with 
> `--jars`. One solution is to move the fetching of `--packages` to the 
> SparkSubmitDriverBootstrapper, and add it to the driver class-path there.
> A more complete solution can be achieved through [SPARK-4924].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6047) pyspark - class loading on driver failing with --jars and --packages

2015-02-26 Thread Burak Yavuz (JIRA)
Burak Yavuz created SPARK-6047:
--

 Summary: pyspark - class loading on driver failing with --jars and 
--packages
 Key: SPARK-6047
 URL: https://issues.apache.org/jira/browse/SPARK-6047
 Project: Spark
  Issue Type: Bug
  Components: PySpark, Spark Submit
Affects Versions: 1.3.0
Reporter: Burak Yavuz


Because py4j uses the system ClassLoader instead of the contextClassLoader of 
the thread, the dynamically added jars in Spark Submit can't be loaded in the 
driver.

This causes "package not found" errors in py4j. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-5942) DataFrame should not do query optimization when dataFrameEagerAnalysis is off

2015-02-26 Thread Liang-Chi Hsieh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang-Chi Hsieh closed SPARK-5942.
--
Resolution: Won't Fix

> DataFrame should not do query optimization when dataFrameEagerAnalysis is off
> -
>
> Key: SPARK-5942
> URL: https://issues.apache.org/jira/browse/SPARK-5942
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>Priority: Minor
>
> DataFrame will force query optimization to happen right away for the commands 
> and queries with side effects.
> However, I think we should not do that when dataFrameEagerAnalysis is off.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2989) Error sending message to BlockManagerMaster

2015-02-26 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-2989:
-
Component/s: (was: Deploy)
   Priority: Major  (was: Critical)

I'm not sure if there's enough info here. This basically says the executor 
couldn't talk to the block manager. Do you have any more detail? This itself 
isn't the error but a symptom of some underlying cause.

> Error sending message to BlockManagerMaster
> ---
>
> Key: SPARK-2989
> URL: https://issues.apache.org/jira/browse/SPARK-2989
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager
>Affects Versions: 1.0.2
>Reporter: pengyanhong
>
> Ran a simple Hive SQL Spark app via yarn-cluster and retrieved 3 segments of log 
> content via the yarn logs --applicationID command line; the details are below:
> * The 1st segment covers the Driver & Application Master; everything is fine 
> without error, start time is 16:43:49 and end time is 16:44:08.
> * The 2nd & 3rd segments cover the Executor; the start time is 16:43:52, and 
> from 16:44:38 the following error occurs many times:
> {quote}
> WARN org.apache.spark.Logging$class.logWarning(Logging.scala:91): Error 
> sending message to BlockManagerMaster in 1 attempts
> java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
>   at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
>   at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
>   at 
> scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
>   at scala.concurrent.Await$.result(package.scala:107)
>   at 
> org.apache.spark.storage.BlockManagerMaster.askDriverWithReply(BlockManagerMaster.scala:237)
>   at 
> org.apache.spark.storage.BlockManagerMaster.sendHeartBeat(BlockManagerMaster.scala:51)
>   at 
> org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$heartBeat(BlockManager.scala:113)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$initialize$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(BlockManager.scala:158)
>   at org.apache.spark.util.Utils$.tryOrExit(Utils.scala:790)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$initialize$1.apply$mcV$sp(BlockManager.scala:158)
>   at akka.actor.Scheduler$$anon$9.run(Scheduler.scala:80)
>   at 
> akka.actor.LightArrayRevolverScheduler$$anon$3$$anon$2.run(Scheduler.scala:241)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>   at java.lang.Thread.run(Thread.java:662)
> 14/08/12 16:45:31 WARN 
> org.apache.spark.Logging$class.logWarning(Logging.scala:91): Error sending 
> message to BlockManagerMaster in 2 attempts
> ..
> {quote}
> confirmed that the date time of 3 nodes is sync.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5066) Can not get all key that has same hashcode when reading key ordered from different Streaming.

2015-02-26 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-5066.
--
Resolution: Not a Problem

> Can not get all key that has same hashcode  when reading key ordered  from 
> different Streaming.
> ---
>
> Key: SPARK-5066
> URL: https://issues.apache.org/jira/browse/SPARK-5066
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.2.0
>Reporter: DoingDone9
>Priority: Critical
>
> When spilling is enabled, data ordered by hashCode is spilled to disk. When merging 
> values we need to collect every key with the same hashCode from the different tmp 
> files, but the code only reads the keys sharing the minimum hashCode within each tmp 
> file, so we cannot read every key.
> Example:
> If file1 has [k1, k2, k3] and file2 has [k4, k5, k1],
> and hashcode of k4 < hashcode of k5 < hashcode of k1 < hashcode of k2 < 
> hashcode of k3,
> then we only read k1 from file1 and k4 from file2, and cannot read every occurrence of k1.
> Code :
> private val inputStreams = (Seq(sortedMap) ++ spilledMaps).map(it => 
> it.buffered)
> inputStreams.foreach { it =>
>   val kcPairs = new ArrayBuffer[(K, C)]
>   readNextHashCode(it, kcPairs)
>   if (kcPairs.length > 0) {
> mergeHeap.enqueue(new StreamBuffer(it, kcPairs))
>   }
> }
>  private def readNextHashCode(it: BufferedIterator[(K, C)], buf: 
> ArrayBuffer[(K, C)]): Unit = {
>   if (it.hasNext) {
> var kc = it.next()
> buf += kc
> val minHash = hashKey(kc)
> while (it.hasNext && it.head._1.hashCode() == minHash) {
>   kc = it.next()
>   buf += kc
> }
>   }
> }
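
For reference, a simplified single-machine sketch of the hash-ordered merge being discussed (not the actual ExternalAppendOnlyMap code): each buffer is refilled with its next hash group after it is drained and re-enqueued, so under this scheme both occurrences of k1 are eventually read and combined, just in a later round.

{code}
// Simplified sketch of a min-hash merge over hash-sorted spill files. Each spill is
// an iterator sorted by key hash; the merge repeatedly dequeues every buffer whose
// current group has the smallest hash, combines those pairs, then refills the
// buffers with their next hash group and puts them back on the heap.
import scala.collection.mutable

object MinHashMergeSketch {
  type KV = (String, Int)

  // Read the next run of pairs sharing one hash code from a buffered iterator.
  private def nextHashGroup(it: BufferedIterator[KV]): mutable.ArrayBuffer[KV] = {
    val group = mutable.ArrayBuffer.empty[KV]
    if (it.hasNext) {
      val h = it.head._1.hashCode
      while (it.hasNext && it.head._1.hashCode == h) group += it.next()
    }
    group
  }

  def merge(spills: Seq[Seq[KV]]): Map[String, Int] = {
    final class Buffer(val it: BufferedIterator[KV], var group: mutable.ArrayBuffer[KV]) {
      def minHash: Int = group.head._1.hashCode
    }
    // Min-heap on the hash of each buffer's current group.
    val heap = mutable.PriorityQueue.empty[Buffer](Ordering.by[Buffer, Int](_.minHash).reverse)
    for (spill <- spills) {
      val it = spill.sortBy(_._1.hashCode).iterator.buffered
      val g = nextHashGroup(it)
      if (g.nonEmpty) heap.enqueue(new Buffer(it, g))
    }

    val combined = mutable.Map.empty[String, Int]
    while (heap.nonEmpty) {
      val minHash = heap.head.minHash
      val selected = mutable.ArrayBuffer.empty[Buffer]
      while (heap.nonEmpty && heap.head.minHash == minHash) selected += heap.dequeue()
      for (b <- selected; (k, v) <- b.group) combined(k) = combined.getOrElse(k, 0) + v
      for (b <- selected) {          // refill and re-enqueue: nothing is lost
        b.group = nextHashGroup(b.it)
        if (b.group.nonEmpty) heap.enqueue(b)
      }
    }
    combined.toMap
  }

  def main(args: Array[String]): Unit = {
    val file1 = Seq("k1" -> 1, "k2" -> 1, "k3" -> 1)
    val file2 = Seq("k4" -> 1, "k5" -> 1, "k1" -> 1)
    println(merge(Seq(file1, file2))) // k1 -> 2: both occurrences are merged
  }
}
{code}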



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2356) Exception: Could not locate executable null\bin\winutils.exe in the Hadoop

2015-02-26 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-2356:
-
Component/s: (was: Spark Core)
 Windows

> Exception: Could not locate executable null\bin\winutils.exe in the Hadoop 
> ---
>
> Key: SPARK-2356
> URL: https://issues.apache.org/jira/browse/SPARK-2356
> Project: Spark
>  Issue Type: Bug
>  Components: Windows
>Affects Versions: 1.0.0
>Reporter: Kostiantyn Kudriavtsev
>Priority: Critical
>
> I'm trying to run some transformations on Spark; they work fine on the cluster 
> (YARN, Linux machines). However, when I try to run them on a local machine 
> (Windows 7) under a unit test, I get errors (I don't use Hadoop; I read files 
> from the local filesystem):
> {code}
> 14/07/02 19:59:31 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 14/07/02 19:59:31 ERROR Shell: Failed to locate the winutils binary in the 
> hadoop binary path
> java.io.IOException: Could not locate executable null\bin\winutils.exe in the 
> Hadoop binaries.
>   at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:318)
>   at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:333)
>   at org.apache.hadoop.util.Shell.<clinit>(Shell.java:326)
>   at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:76)
>   at org.apache.hadoop.security.Groups.parseStaticMapping(Groups.java:93)
>   at org.apache.hadoop.security.Groups.<init>(Groups.java:77)
>   at 
> org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:240)
>   at 
> org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:255)
>   at 
> org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:283)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil.<init>(SparkHadoopUtil.scala:36)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil$.<init>(SparkHadoopUtil.scala:109)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil$.<clinit>(SparkHadoopUtil.scala)
>   at org.apache.spark.SparkContext.<init>(SparkContext.scala:228)
>   at org.apache.spark.SparkContext.<init>(SparkContext.scala:97)
> {code}
> This happens because the Hadoop config is initialized each time a Spark 
> context is created, regardless of whether Hadoop is required or not.
> I propose adding a special flag to indicate whether the Hadoop config is required 
> (or starting this configuration manually).
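
As an aside (an assumption on my part, not something stated in this report), a commonly used local workaround is to point `hadoop.home.dir` at a directory containing `bin\winutils.exe` before the SparkContext is created, since Hadoop's Shell class resolves winutils from that property or from HADOOP_HOME:

{code}
// Hedged sketch of the usual local-Windows workaround for the trace above.
// "C:\\hadoop" is a placeholder path; it must contain bin\winutils.exe.
System.setProperty("hadoop.home.dir", "C:\\hadoop")

// Only after the property is set should the SparkContext (and hence the Hadoop
// configuration it initializes) be created.
val sc = new org.apache.spark.SparkContext("local[2]", "winutils-workaround-test")
{code}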



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2348) In Windows having a enviorinment variable named 'classpath' gives error

2015-02-26 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-2348:
-
Component/s: (was: Spark Core)
 Windows
   Priority: Major  (was: Critical)

I think that in general you shouldn't have a global CLASSPATH env variable set 
(on any platform). Hm, why would you want Scala to use it? I'm not getting why 
that's the fix.

> In Windows having a enviorinment variable named 'classpath' gives error
> ---
>
> Key: SPARK-2348
> URL: https://issues.apache.org/jira/browse/SPARK-2348
> Project: Spark
>  Issue Type: Bug
>  Components: Windows
>Affects Versions: 1.0.0
> Environment: Windows 7 Enterprise
>Reporter: Chirag Todarka
>Assignee: Chirag Todarka
>
> Operating System:: Windows 7 Enterprise
> If having enviorinment variable named 'classpath' gives then starting 
> 'spark-shell' gives below error::
> \spark\bin>spark-shell
> Failed to initialize compiler: object scala.runtime in compiler mirror not 
> found
> .
> ** Note that as of 2.8 scala does not assume use of the java classpath.
> ** For the old behavior pass -usejavacp to scala, or if using a Settings
> ** object programatically, settings.usejavacp.value = true.
> 14/07/02 14:22:06 WARN SparkILoop$SparkILoopInterpreter: Warning: compiler 
> acces
> sed before init set up.  Assuming no postInit code.
> Failed to initialize compiler: object scala.runtime in compiler mirror not 
> found
> .
> ** Note that as of 2.8 scala does not assume use of the java classpath.
> ** For the old behavior pass -usejavacp to scala, or if using a Settings
> ** object programatically, settings.usejavacp.value = true.
> Exception in thread "main" java.lang.AssertionError: assertion failed: null
> at scala.Predef$.assert(Predef.scala:179)
> at 
> org.apache.spark.repl.SparkIMain.initializeSynchronous(SparkIMain.sca
> la:202)
> at 
> org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(Spar
> kILoop.scala:929)
> at 
> org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.
> scala:884)
> at 
> org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.
> scala:884)
> at 
> scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClass
> Loader.scala:135)
> at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:884)
> at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:982)
> at org.apache.spark.repl.Main$.main(Main.scala:31)
> at org.apache.spark.repl.Main.main(Main.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
> at java.lang.reflect.Method.invoke(Unknown Source)
> at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:292)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6046) Provide an easier way for developers to handle deprecated configs

2015-02-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339420#comment-14339420
 ] 

Apache Spark commented on SPARK-6046:
-

User 'andrewor14' has created a pull request for this issue:
https://github.com/apache/spark/pull/4797

> Provide an easier way for developers to handle deprecated configs
> -
>
> Key: SPARK-6046
> URL: https://issues.apache.org/jira/browse/SPARK-6046
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>Priority: Minor
>
> Right now we have code that looks like this:
> https://github.com/apache/spark/blob/8942b522d8a3269a2a357e3a274ed4b3e66ebdde/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L52
> where a random class calls `SparkConf.translateConfKey` to warn the user 
> against a deprecated config. We should refactor this slightly so we can make 
> `translateConfKey` private instead of calling it from everywhere.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6043) Error when trying to rename table with alter table after using INSERT OVERWITE to populate the table

2015-02-26 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6043:
-
Component/s: SQL

> Error when trying to rename table with alter table after using INSERT 
> OVERWITE to populate the table
> 
>
> Key: SPARK-6043
> URL: https://issues.apache.org/jira/browse/SPARK-6043
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.1
>Reporter: Trystan Leftwich
>Priority: Minor
>
> If you populate a table using INSERT OVERWRITE and then try to rename the 
> table using alter table it fails with:
> {noformat}
> Error: org.apache.spark.sql.execution.QueryExecutionException: FAILED: 
> Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. 
> Unable to alter table. (state=,code=0)
> {noformat}
> Using the following SQL statement creates the error:
> {code:sql}
> CREATE TABLE `tmp_table` (salesamount_c1 DOUBLE);
> INSERT OVERWRITE table tmp_table SELECT
>MIN(sales_customer.salesamount) salesamount_c1
> FROM
> (
>   SELECT
>  SUM(sales.salesamount) salesamount
>   FROM
>  internalsales sales
> ) sales_customer;
> ALTER TABLE tmp_table RENAME to not_tmp;
> {code}
> But if you change the 'OVERWRITE' to be 'INTO' the SQL statement works.
> This is happening on our CDH5.3 cluster with multiple workers; if we use the 
> CDH5.3 Quickstart VM the SQL does not produce an error. Both cases were Spark 
> 1.2.1 built for hadoop2.4+.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6046) Provide an easier way for developers to handle deprecated configs

2015-02-26 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-6046:
-
Priority: Minor  (was: Major)

> Provide an easier way for developers to handle deprecated configs
> -
>
> Key: SPARK-6046
> URL: https://issues.apache.org/jira/browse/SPARK-6046
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>Priority: Minor
>
> Right now we have code that looks like this:
> https://github.com/apache/spark/blob/8942b522d8a3269a2a357e3a274ed4b3e66ebdde/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L52
> where a random class calls `SparkConf.translateConfKey` to warn the user 
> against a deprecated config. We should refactor this slightly so we can make 
> `translateConfKey` private instead of calling it from everywhere.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4579) Scheduling Delay appears negative

2015-02-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339412#comment-14339412
 ] 

Apache Spark commented on SPARK-4579:
-

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/4796

> Scheduling Delay appears negative
> -
>
> Key: SPARK-4579
> URL: https://issues.apache.org/jira/browse/SPARK-4579
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.2.0
>Reporter: Arun Ahuja
>Assignee: Andrew Or
>
> !https://cloud.githubusercontent.com/assets/455755/5174438/23d08604-73ff-11e4-9a76-97233b610544.png!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4579) Scheduling Delay appears negative

2015-02-26 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-4579:
-
Priority: Major  (was: Critical)

> Scheduling Delay appears negative
> -
>
> Key: SPARK-4579
> URL: https://issues.apache.org/jira/browse/SPARK-4579
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.2.0
>Reporter: Arun Ahuja
>Assignee: Andrew Or
>
> !https://cloud.githubusercontent.com/assets/455755/5174438/23d08604-73ff-11e4-9a76-97233b610544.png!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4571) History server shows negative time

2015-02-26 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-4571.
--
   Resolution: Fixed
Fix Version/s: 1.3.0
 Assignee: Masayoshi TSUZUKI

I'm quite sure this was solved as part of SPARK-2458, and this change: 
https://github.com/apache/spark/commit/6e74edeca31acd7dc84a34402e430e017591d858#diff-a19a4359f1a7f63bc020acf145664af4R132

> History server shows negative time
> --
>
> Key: SPARK-4571
> URL: https://issues.apache.org/jira/browse/SPARK-4571
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Andrew Or
>Assignee: Masayoshi TSUZUKI
> Fix For: 1.3.0
>
> Attachments: Screen Shot 2014-11-21 at 2.49.25 PM.png
>
>
> See attachment



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6045) RecordWriter should be checked against null in PairRDDFunctions#saveAsNewAPIHadoopDataset

2015-02-26 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6045:
-
Component/s: Input/Output
   Assignee: Ted Yu

> RecordWriter should be checked against null in 
> PairRDDFunctions#saveAsNewAPIHadoopDataset
> -
>
> Key: SPARK-6045
> URL: https://issues.apache.org/jira/browse/SPARK-6045
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Trivial
> Fix For: 1.4.0
>
>
> gtinside reported in the thread 'NullPointerException in TaskSetManager' with 
> the following stack trace:
> {code}
> WARN 2015-02-26 14:21:43,217 [task-result-getter-0] TaskSetManager - Lost
> task 14.2 in stage 0.0 (TID 29, devntom003.dev.blackrock.com):
> java.lang.NullPointerException
> org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala:1007)
> com.bfm.spark.test.CassandraHadoopMigrator$.main(CassandraHadoopMigrator.scala:77)
> com.bfm.spark.test.CassandraHadoopMigrator.main(CassandraHadoopMigrator.scala)
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> java.lang.reflect.Method.invoke(Method.java:606)
> org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:358)
> org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
> org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> {code}
> Looks like the following call in the finally block was the cause:
> {code}
> writer.close(hadoopContext)
> {code}
> We should check writer against null before calling close().
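
A small self-contained sketch of the guarded close being asked for (the Writer class below is a stand-in for Hadoop's RecordWriter, not the real API):

{code}
// Do not call close() on a writer that was never successfully created: if opening
// the writer throws, the finally block would otherwise raise a NullPointerException
// that masks the original failure.
class Writer {
  def write(k: String, v: String): Unit = println(s"$k -> $v")
  def close(): Unit = println("writer closed")
}

def openWriter(failOnOpen: Boolean): Writer =
  if (failOnOpen) throw new RuntimeException("getRecordWriter failed") else new Writer

def save(records: Iterator[(String, String)], failOnOpen: Boolean): Unit = {
  var writer: Writer = null
  try {
    writer = openWriter(failOnOpen)                 // may throw before assignment
    records.foreach { case (k, v) => writer.write(k, v) }
  } finally {
    if (writer != null) {                           // the null check this issue adds
      writer.close()
    }
  }
}
{code}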



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6045) RecordWriter should be checked against null in PairRDDFunctions#saveAsNewAPIHadoopDataset

2015-02-26 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-6045.
--
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 4794
[https://github.com/apache/spark/pull/4794]

> RecordWriter should be checked against null in 
> PairRDDFunctions#saveAsNewAPIHadoopDataset
> -
>
> Key: SPARK-6045
> URL: https://issues.apache.org/jira/browse/SPARK-6045
> Project: Spark
>  Issue Type: Bug
>Reporter: Ted Yu
>Priority: Trivial
> Fix For: 1.4.0
>
>
> gtinside reported in the thread 'NullPointerException in TaskSetManager' with 
> the following stack trace:
> {code}
> WARN 2015-02-26 14:21:43,217 [task-result-getter-0] TaskSetManager - Lost
> task 14.2 in stage 0.0 (TID 29, devntom003.dev.blackrock.com):
> java.lang.NullPointerException
> org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala:1007)
> com.bfm.spark.test.CassandraHadoopMigrator$.main(CassandraHadoopMigrator.scala:77)
> com.bfm.spark.test.CassandraHadoopMigrator.main(CassandraHadoopMigrator.scala)
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> java.lang.reflect.Method.invoke(Method.java:606)
> org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:358)
> org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
> org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> {code}
> Looks like the following call in the finally block was the cause:
> {code}
> writer.close(hadoopContext)
> {code}
> We should check writer against null before calling close().



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5977) PySpark SPARK_CLASSPATH doesn't distribute jars to executors

2015-02-26 Thread Michael Nazario (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339344#comment-14339344
 ] 

Michael Nazario commented on SPARK-5977:


I tried setting spark.executor.extraClassPath in the SparkConf and 
--driver-class-path in PYSPARK_SUBMIT_ARGS, and neither of those helped.

> PySpark SPARK_CLASSPATH doesn't distribute jars to executors
> 
>
> Key: SPARK-5977
> URL: https://issues.apache.org/jira/browse/SPARK-5977
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.2.1
>Reporter: Michael Nazario
>  Labels: jars
>
> In PySpark 1.2.1, I added a jar for avro support similar to the one in 
> spark-examples. I need this jar to convert avro files into rows. However, in 
> the worker logs, I kept getting a ClassNotFoundException for my 
> AvroToPythonConverter class.
> I double checked the jar to make sure the class was in there which it was. I 
> made sure I used the SPARK_CLASSPATH environment variable to place this jar 
> on the executor and driver classpaths. I then checked the application web UI 
> which also had this jar on both the executor and driver classpaths.
> The final thing I tried was explicitly dropping the jars in the same location 
> as on my driver machine. That made the ClassNotFoundException go away.
> This makes me think that the jars which back in 1.1.1 used to be sent to the 
> workers are no longer being sent over.
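A hedged workaround sketch (not a confirmed fix for this ticket): distribute
the jar explicitly through spark.jars or SparkContext.addJar instead of relying
on SPARK_CLASSPATH. The jar path and app name below are placeholders.

{code}
import org.apache.spark.{SparkConf, SparkContext}

object ExplicitJarDistribution {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("avro-converter-example")
      // Jars listed in spark.jars are shipped to executors and added to their classpaths.
      .set("spark.jars", "/path/to/avro-converters.jar")
    val sc = new SparkContext(conf)

    // Equivalent programmatic route: addJar also distributes the file to executors.
    sc.addJar("/path/to/avro-converters.jar")

    // ... job that uses the converter classes would go here ...

    sc.stop()
  }
}
{code}

For a PySpark job the analogous route is passing --jars to spark-submit, which
is roughly what the SPARK_CLASSPATH route was expected to accomplish here.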



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1673) GLMNET implementation in Spark

2015-02-26 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339331#comment-14339331
 ] 

Joseph K. Bradley commented on SPARK-1673:
--

That sounds good---I'll look forward to hearing how it does!

> GLMNET implementation in Spark
> --
>
> Key: SPARK-1673
> URL: https://issues.apache.org/jira/browse/SPARK-1673
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Sung Chung
>
> This is a Spark implementation of GLMNET by Jerome Friedman, Trevor Hastie, 
> Rob Tibshirani.
> http://www.jstatsoft.org/v33/i01/paper
> It's a straightforward implementation of the Coordinate-Descent based L1/L2 
> regularized linear models, including Linear/Logistic/Multinomial regressions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1673) GLMNET implementation in Spark

2015-02-26 Thread mike bowles (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339320#comment-14339320
 ] 

mike bowles commented on SPARK-1673:


Good discussion. I can see how it might be faster to propagate an approximate 
path as a way to provide good starting conditions for an accurate iteration. 
To some extent the accuracy of the glmnet path can be modulated by loosening 
the convergence criteria for the inner iteration (the iteration done to find 
the new minimum after the penalty parameter is decremented).

The big time sink is making passes through the data. With glmnet regression 
the inner iterations don't require passes through the data, so they are much 
less expensive than the steps in the penalty parameter, which may provoke a 
pass through the data to deal with a new element being added to the active 
list.

It would be interesting to see what happens if the active set of coefficients 
were constrained to change less frequently than the penalty parameter. I have 
a hunch that it might take more (inexpensive) inner iterations to converge 
when the coefficients were allowed to change, but it would save passes through 
the data.

It would be relatively easy for us to implement this in our code. We can try 
letting the active set change only every other or every third step in the 
penalty parameter and see how much change it makes in the coefficient curves.

Thanks for the idea.
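A toy, single-machine sketch of the lazy-active-set idea above, under the usual
lasso assumptions (standardized columns, a decreasing lambda path). It is not
the glmnet-on-Spark code; it only illustrates rescanning the full coordinate
set every few penalty steps while the cheap inner loop uses the current set.

{code}
object LazyActiveSetSketch {
  private def softThreshold(z: Double, g: Double): Double =
    math.signum(z) * math.max(math.abs(z) - g, 0.0)

  /** x is column-major (p columns of length n) and assumed standardized. */
  def lassoPath(
      x: Array[Array[Double]],
      y: Array[Double],
      lambdas: Seq[Double],
      rescanEvery: Int = 3,
      innerIters: Int = 100): Seq[Array[Double]] = {
    val n = y.length
    val p = x.length
    val beta = Array.fill(p)(0.0)
    var active: Seq[Int] = Seq.empty

    // Current fitted value for row i under the working coefficients.
    def fitted(i: Int): Double = (0 until p).map(j => x(j)(i) * beta(j)).sum

    lambdas.zipWithIndex.map { case (lambda, step) =>
      if (step % rescanEvery == 0) {
        // The expensive "pass through the data": scan every coordinate's gradient.
        active = (0 until p).filter { j =>
          val grad = (0 until n).map(i => x(j)(i) * (y(i) - fitted(i))).sum / n
          beta(j) != 0.0 || math.abs(grad) > lambda
        }
      }
      // Cheap inner iterations restricted to the (possibly stale) active set.
      for (_ <- 0 until innerIters; j <- active) {
        val partial = (0 until n)
          .map(i => x(j)(i) * (y(i) - fitted(i) + x(j)(i) * beta(j))).sum / n
        beta(j) = softThreshold(partial, lambda)
      }
      beta.clone()
    }
  }
}
{code}

The knob of interest is rescanEvery: raising it trades extra cheap inner
iterations for fewer full passes over the data, which is the trade-off
discussed above.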

> GLMNET implementation in Spark
> --
>
> Key: SPARK-1673
> URL: https://issues.apache.org/jira/browse/SPARK-1673
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Sung Chung
>
> This is a Spark implementation of GLMNET by Jerome Friedman, Trevor Hastie, 
> Rob Tibshirani.
> http://www.jstatsoft.org/v33/i01/paper
> It's a straightforward implementation of the Coordinate-Descent based L1/L2 
> regularized linear models, including Linear/Logistic/Multinomial regressions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-5951) Remove unreachable driver memory properties in yarn client mode (YarnClientSchedulerBackend)

2015-02-26 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-5951.

  Resolution: Fixed
Target Version/s: 1.3.0

> Remove unreachable driver memory properties in yarn client mode 
> (YarnClientSchedulerBackend)
> 
>
> Key: SPARK-5951
> URL: https://issues.apache.org/jira/browse/SPARK-5951
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.3.0
> Environment: yarn
>Reporter: Shekhar Bansal
>Assignee: Shekhar Bansal
>Priority: Trivial
> Fix For: 1.3.0
>
>
> SPARK-4730 added a warning for deprecated configs, and SPARK-1953 removed the 
> driver memory configs in yarn client mode.
> During that integration, spark.master.memory and SPARK_MASTER_MEMORY were not 
> removed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5951) Remove unreachable driver memory properties in yarn client mode (YarnClientSchedulerBackend)

2015-02-26 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-5951:
-
Assignee: Shekhar Bansal

> Remove unreachable driver memory properties in yarn client mode 
> (YarnClientSchedulerBackend)
> 
>
> Key: SPARK-5951
> URL: https://issues.apache.org/jira/browse/SPARK-5951
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.3.0
> Environment: yarn
>Reporter: Shekhar Bansal
>Assignee: Shekhar Bansal
>Priority: Trivial
> Fix For: 1.3.0
>
>
> SPARK-4730 added a warning for deprecated configs, and SPARK-1953 removed the 
> driver memory configs in yarn client mode.
> During that integration, spark.master.memory and SPARK_MASTER_MEMORY were not 
> removed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4300) Race condition during SparkWorker shutdown

2015-02-26 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-4300:
-
Labels: backport-needed  (was: )

> Race condition during SparkWorker shutdown
> --
>
> Key: SPARK-4300
> URL: https://issues.apache.org/jira/browse/SPARK-4300
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 1.1.0
>Reporter: Alex Liu
>Assignee: Sean Owen
>Priority: Minor
>  Labels: backport-needed
> Fix For: 1.2.2, 1.4.0
>
>
> When a Shark job is done, error messages like the following show up in the 
> log:
> {code}
> INFO 22:10:41,635 SparkMaster: 
> akka.tcp://sparkdri...@ip-172-31-11-204.us-west-1.compute.internal:57641 got 
> disassociated, removing it.
>  INFO 22:10:41,640 SparkMaster: Removing app app-20141106221014-
>  INFO 22:10:41,687 SparkMaster: Removing application 
> Shark::ip-172-31-11-204.us-west-1.compute.internal
>  INFO 22:10:41,710 SparkWorker: Asked to kill executor 
> app-20141106221014-/0
>  INFO 22:10:41,712 SparkWorker: Runner thread for executor 
> app-20141106221014-/0 interrupted
>  INFO 22:10:41,714 SparkWorker: Killing process!
> ERROR 22:10:41,738 SparkWorker: Error writing stream to file 
> /var/lib/spark/work/app-20141106221014-/0/stdout
> ERROR 22:10:41,739 SparkWorker: java.io.IOException: Stream closed
> ERROR 22:10:41,739 SparkWorker:   at 
> java.io.BufferedInputStream.getBufIfOpen(BufferedInputStream.java:162)
> ERROR 22:10:41,740 SparkWorker:   at 
> java.io.BufferedInputStream.read1(BufferedInputStream.java:272)
> ERROR 22:10:41,740 SparkWorker:   at 
> java.io.BufferedInputStream.read(BufferedInputStream.java:334)
> ERROR 22:10:41,740 SparkWorker:   at 
> java.io.FilterInputStream.read(FilterInputStream.java:107)
> ERROR 22:10:41,741 SparkWorker:   at 
> org.apache.spark.util.logging.FileAppender.appendStreamToFile(FileAppender.scala:70)
> ERROR 22:10:41,741 SparkWorker:   at 
> org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply$mcV$sp(FileAppender.scala:39)
> ERROR 22:10:41,741 SparkWorker:   at 
> org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39)
> ERROR 22:10:41,742 SparkWorker:   at 
> org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39)
> ERROR 22:10:41,742 SparkWorker:   at 
> org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1311)
> ERROR 22:10:41,742 SparkWorker:   at 
> org.apache.spark.util.logging.FileAppender$$anon$1.run(FileAppender.scala:38)
>  INFO 22:10:41,838 SparkMaster: Connected to Cassandra cluster: 4299
>  INFO 22:10:41,839 SparkMaster: Adding host 172.31.11.204 (Analytics)
>  INFO 22:10:41,840 SparkMaster: New Cassandra host /172.31.11.204:9042 added
>  INFO 22:10:41,841 SparkMaster: Adding host 172.31.11.204 (Analytics)
>  INFO 22:10:41,842 SparkMaster: Adding host 172.31.11.204 (Analytics)
>  INFO 22:10:41,852 SparkMaster: 
> akka.tcp://sparkdri...@ip-172-31-11-204.us-west-1.compute.internal:57641 got 
> disassociated, removing it.
>  INFO 22:10:41,853 SparkMaster: 
> akka.tcp://sparkdri...@ip-172-31-11-204.us-west-1.compute.internal:57641 got 
> disassociated, removing it.
>  INFO 22:10:41,853 SparkMaster: 
> akka.tcp://sparkdri...@ip-172-31-11-204.us-west-1.compute.internal:57641 got 
> disassociated, removing it.
>  INFO 22:10:41,857 SparkMaster: 
> akka.tcp://sparkdri...@ip-172-31-11-204.us-west-1.compute.internal:57641 got 
> disassociated, removing it.
>  INFO 22:10:41,862 SparkMaster: Adding host 172.31.11.204 (Analytics)
>  WARN 22:10:42,200 SparkMaster: Got status update for unknown executor 
> app-20141106221014-/0
>  INFO 22:10:42,211 SparkWorker: Executor app-20141106221014-/0 finished 
> with state KILLED exitStatus 143
> {code}
> /var/lib/spark/work/app-20141106221014-/0/stdout is on the disk. It is 
> trying to write to a closed IO stream.
> The Spark worker shuts down via {code}
>  private def killProcess(message: Option[String]) {
> var exitCode: Option[Int] = None
> logInfo("Killing process!")
> process.destroy()
> process.waitFor()
> if (stdoutAppender != null) {
>   stdoutAppender.stop()
> }
> if (stderrAppender != null) {
>   stderrAppender.stop()
> }
> if (process != null) {
> exitCode = Some(process.waitFor())
> }
> worker ! ExecutorStateChanged(appId, execId, state, message, exitCode)
>  
> {code}
> But stdoutAppender concurrently writes to the output log file, which creates 
> a race condition.
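A self-contained sketch of one way to avoid reporting an error for the expected
"Stream closed" failure during shutdown. It mirrors the shape of the appender
but is not the actual Spark fix, and the class and thread names are
illustrative.

{code}
import java.io.{IOException, InputStream}

// Copies a process output stream to a sink on a background thread; once asked
// to stop, a "Stream closed" IOException is treated as expected, not an error.
class StoppableAppender(in: InputStream, sink: String => Unit) {
  @volatile private var stopping = false

  private val thread = new Thread("appender-sketch") {
    override def run(): Unit = {
      val buf = new Array[Byte](1024)
      try {
        var n = in.read(buf)
        while (n != -1) {
          sink(new String(buf, 0, n))
          n = in.read(buf)
        }
      } catch {
        case _: IOException if stopping => () // expected during shutdown, stay quiet
        case e: IOException => Console.err.println(s"Error writing stream: $e")
      }
    }
  }
  thread.start()

  /** Mark the appender as stopping, then wait briefly for the copy thread. */
  def stop(): Unit = {
    stopping = true
    thread.join(1000)
  }
}
{code}

In a killProcess sequence like the one quoted above, the appenders would be
marked as stopping before (or while) the process is destroyed, so the
inevitable read failure on the closed stdout stream is not logged as an error.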



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-794) Remove sleep() in ClusterScheduler.stop

2015-02-26 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-794:

Target Version/s:   (was: 1.2.1)
   Fix Version/s: 1.2.2
Assignee: Brennon York
  Labels:   (was: backport-needed)

Backported to 1.2

> Remove sleep() in ClusterScheduler.stop
> ---
>
> Key: SPARK-794
> URL: https://issues.apache.org/jira/browse/SPARK-794
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 0.9.0
>Reporter: Matei Zaharia
>Assignee: Brennon York
> Fix For: 1.3.0, 1.2.2
>
>
> This temporary change, made a while back, slows down the unit tests quite a bit.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-794) Remove sleep() in ClusterScheduler.stop

2015-02-26 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-794.
-
Resolution: Fixed

> Remove sleep() in ClusterScheduler.stop
> ---
>
> Key: SPARK-794
> URL: https://issues.apache.org/jira/browse/SPARK-794
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 0.9.0
>Reporter: Matei Zaharia
>Assignee: Brennon York
> Fix For: 1.3.0, 1.2.2
>
>
> This temporary change, made a while back, slows down the unit tests quite a bit.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-4300) Race condition during SparkWorker shutdown

2015-02-26 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-4300.

  Resolution: Fixed
   Fix Version/s: 1.4.0
  1.2.2
Assignee: Sean Owen
Target Version/s: 1.3.0, 1.2.2, 1.4.0

> Race condition during SparkWorker shutdown
> --
>
> Key: SPARK-4300
> URL: https://issues.apache.org/jira/browse/SPARK-4300
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 1.1.0
>Reporter: Alex Liu
>Assignee: Sean Owen
>Priority: Minor
>  Labels: backport-needed
> Fix For: 1.2.2, 1.4.0
>
>
> When a Shark job is done, error messages like the following show up in the 
> log:
> {code}
> INFO 22:10:41,635 SparkMaster: 
> akka.tcp://sparkdri...@ip-172-31-11-204.us-west-1.compute.internal:57641 got 
> disassociated, removing it.
>  INFO 22:10:41,640 SparkMaster: Removing app app-20141106221014-
>  INFO 22:10:41,687 SparkMaster: Removing application 
> Shark::ip-172-31-11-204.us-west-1.compute.internal
>  INFO 22:10:41,710 SparkWorker: Asked to kill executor 
> app-20141106221014-/0
>  INFO 22:10:41,712 SparkWorker: Runner thread for executor 
> app-20141106221014-/0 interrupted
>  INFO 22:10:41,714 SparkWorker: Killing process!
> ERROR 22:10:41,738 SparkWorker: Error writing stream to file 
> /var/lib/spark/work/app-20141106221014-/0/stdout
> ERROR 22:10:41,739 SparkWorker: java.io.IOException: Stream closed
> ERROR 22:10:41,739 SparkWorker:   at 
> java.io.BufferedInputStream.getBufIfOpen(BufferedInputStream.java:162)
> ERROR 22:10:41,740 SparkWorker:   at 
> java.io.BufferedInputStream.read1(BufferedInputStream.java:272)
> ERROR 22:10:41,740 SparkWorker:   at 
> java.io.BufferedInputStream.read(BufferedInputStream.java:334)
> ERROR 22:10:41,740 SparkWorker:   at 
> java.io.FilterInputStream.read(FilterInputStream.java:107)
> ERROR 22:10:41,741 SparkWorker:   at 
> org.apache.spark.util.logging.FileAppender.appendStreamToFile(FileAppender.scala:70)
> ERROR 22:10:41,741 SparkWorker:   at 
> org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply$mcV$sp(FileAppender.scala:39)
> ERROR 22:10:41,741 SparkWorker:   at 
> org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39)
> ERROR 22:10:41,742 SparkWorker:   at 
> org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39)
> ERROR 22:10:41,742 SparkWorker:   at 
> org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1311)
> ERROR 22:10:41,742 SparkWorker:   at 
> org.apache.spark.util.logging.FileAppender$$anon$1.run(FileAppender.scala:38)
>  INFO 22:10:41,838 SparkMaster: Connected to Cassandra cluster: 4299
>  INFO 22:10:41,839 SparkMaster: Adding host 172.31.11.204 (Analytics)
>  INFO 22:10:41,840 SparkMaster: New Cassandra host /172.31.11.204:9042 added
>  INFO 22:10:41,841 SparkMaster: Adding host 172.31.11.204 (Analytics)
>  INFO 22:10:41,842 SparkMaster: Adding host 172.31.11.204 (Analytics)
>  INFO 22:10:41,852 SparkMaster: 
> akka.tcp://sparkdri...@ip-172-31-11-204.us-west-1.compute.internal:57641 got 
> disassociated, removing it.
>  INFO 22:10:41,853 SparkMaster: 
> akka.tcp://sparkdri...@ip-172-31-11-204.us-west-1.compute.internal:57641 got 
> disassociated, removing it.
>  INFO 22:10:41,853 SparkMaster: 
> akka.tcp://sparkdri...@ip-172-31-11-204.us-west-1.compute.internal:57641 got 
> disassociated, removing it.
>  INFO 22:10:41,857 SparkMaster: 
> akka.tcp://sparkdri...@ip-172-31-11-204.us-west-1.compute.internal:57641 got 
> disassociated, removing it.
>  INFO 22:10:41,862 SparkMaster: Adding host 172.31.11.204 (Analytics)
>  WARN 22:10:42,200 SparkMaster: Got status update for unknown executor 
> app-20141106221014-/0
>  INFO 22:10:42,211 SparkWorker: Executor app-20141106221014-/0 finished 
> with state KILLED exitStatus 143
> {code}
> /var/lib/spark/work/app-20141106221014-/0/stdout is on the disk. It is 
> trying to write to a closed IO stream.
> The Spark worker shuts down via {code}
>  private def killProcess(message: Option[String]) {
> var exitCode: Option[Int] = None
> logInfo("Killing process!")
> process.destroy()
> process.waitFor()
> if (stdoutAppender != null) {
>   stdoutAppender.stop()
> }
> if (stderrAppender != null) {
>   stderrAppender.stop()
> }
> if (process != null) {
> exitCode = Some(process.waitFor())
> }
> worker ! ExecutorStateChanged(appId, execId, state, message, exitCode)
>  
> {code}
> But stdoutAppender concurrently writes to the output log file, which creates 
> a race condition.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Reopened] (SPARK-4300) Race condition during SparkWorker shutdown

2015-02-26 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or reopened SPARK-4300:
--

> Race condition during SparkWorker shutdown
> --
>
> Key: SPARK-4300
> URL: https://issues.apache.org/jira/browse/SPARK-4300
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 1.1.0
>Reporter: Alex Liu
>Assignee: Sean Owen
>Priority: Minor
>  Labels: backport-needed
> Fix For: 1.2.2, 1.4.0
>
>
> When a Shark job is done, error messages like the following show up in the 
> log:
> {code}
> INFO 22:10:41,635 SparkMaster: 
> akka.tcp://sparkdri...@ip-172-31-11-204.us-west-1.compute.internal:57641 got 
> disassociated, removing it.
>  INFO 22:10:41,640 SparkMaster: Removing app app-20141106221014-
>  INFO 22:10:41,687 SparkMaster: Removing application 
> Shark::ip-172-31-11-204.us-west-1.compute.internal
>  INFO 22:10:41,710 SparkWorker: Asked to kill executor 
> app-20141106221014-/0
>  INFO 22:10:41,712 SparkWorker: Runner thread for executor 
> app-20141106221014-/0 interrupted
>  INFO 22:10:41,714 SparkWorker: Killing process!
> ERROR 22:10:41,738 SparkWorker: Error writing stream to file 
> /var/lib/spark/work/app-20141106221014-/0/stdout
> ERROR 22:10:41,739 SparkWorker: java.io.IOException: Stream closed
> ERROR 22:10:41,739 SparkWorker:   at 
> java.io.BufferedInputStream.getBufIfOpen(BufferedInputStream.java:162)
> ERROR 22:10:41,740 SparkWorker:   at 
> java.io.BufferedInputStream.read1(BufferedInputStream.java:272)
> ERROR 22:10:41,740 SparkWorker:   at 
> java.io.BufferedInputStream.read(BufferedInputStream.java:334)
> ERROR 22:10:41,740 SparkWorker:   at 
> java.io.FilterInputStream.read(FilterInputStream.java:107)
> ERROR 22:10:41,741 SparkWorker:   at 
> org.apache.spark.util.logging.FileAppender.appendStreamToFile(FileAppender.scala:70)
> ERROR 22:10:41,741 SparkWorker:   at 
> org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply$mcV$sp(FileAppender.scala:39)
> ERROR 22:10:41,741 SparkWorker:   at 
> org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39)
> ERROR 22:10:41,742 SparkWorker:   at 
> org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39)
> ERROR 22:10:41,742 SparkWorker:   at 
> org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1311)
> ERROR 22:10:41,742 SparkWorker:   at 
> org.apache.spark.util.logging.FileAppender$$anon$1.run(FileAppender.scala:38)
>  INFO 22:10:41,838 SparkMaster: Connected to Cassandra cluster: 4299
>  INFO 22:10:41,839 SparkMaster: Adding host 172.31.11.204 (Analytics)
>  INFO 22:10:41,840 SparkMaster: New Cassandra host /172.31.11.204:9042 added
>  INFO 22:10:41,841 SparkMaster: Adding host 172.31.11.204 (Analytics)
>  INFO 22:10:41,842 SparkMaster: Adding host 172.31.11.204 (Analytics)
>  INFO 22:10:41,852 SparkMaster: 
> akka.tcp://sparkdri...@ip-172-31-11-204.us-west-1.compute.internal:57641 got 
> disassociated, removing it.
>  INFO 22:10:41,853 SparkMaster: 
> akka.tcp://sparkdri...@ip-172-31-11-204.us-west-1.compute.internal:57641 got 
> disassociated, removing it.
>  INFO 22:10:41,853 SparkMaster: 
> akka.tcp://sparkdri...@ip-172-31-11-204.us-west-1.compute.internal:57641 got 
> disassociated, removing it.
>  INFO 22:10:41,857 SparkMaster: 
> akka.tcp://sparkdri...@ip-172-31-11-204.us-west-1.compute.internal:57641 got 
> disassociated, removing it.
>  INFO 22:10:41,862 SparkMaster: Adding host 172.31.11.204 (Analytics)
>  WARN 22:10:42,200 SparkMaster: Got status update for unknown executor 
> app-20141106221014-/0
>  INFO 22:10:42,211 SparkWorker: Executor app-20141106221014-/0 finished 
> with state KILLED exitStatus 143
> {code}
> /var/lib/spark/work/app-20141106221014-/0/stdout is on the disk. It is 
> trying to write to a closed IO stream.
> The Spark worker shuts down via {code}
>  private def killProcess(message: Option[String]) {
> var exitCode: Option[Int] = None
> logInfo("Killing process!")
> process.destroy()
> process.waitFor()
> if (stdoutAppender != null) {
>   stdoutAppender.stop()
> }
> if (stderrAppender != null) {
>   stderrAppender.stop()
> }
> if (process != null) {
> exitCode = Some(process.waitFor())
> }
> worker ! ExecutorStateChanged(appId, execId, state, message, exitCode)
>  
> {code}
> But stdoutAppender concurrently writes to the output log file, which creates 
> a race condition.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-5546) Improve path to Kafka assembly when trying Kafka Python API

2015-02-26 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-5546:
-
Affects Version/s: 1.3.0

> Improve path to Kafka assembly when trying Kafka Python API
> ---
>
> Key: SPARK-5546
> URL: https://issues.apache.org/jira/browse/SPARK-5546
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.3.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Blocker
> Fix For: 1.3.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6018) NoSuchMethodError in Spark app is swallowed by YARN AM

2015-02-26 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-6018:
-
Assignee: Cheolsoo Park

> NoSuchMethodError in Spark app is swallowed by YARN AM
> --
>
> Key: SPARK-6018
> URL: https://issues.apache.org/jira/browse/SPARK-6018
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.2.0
>Reporter: Cheolsoo Park
>Assignee: Cheolsoo Park
>Priority: Minor
>  Labels: yarn
> Fix For: 1.3.0, 1.2.2
>
>
> I discovered this bug while testing the 1.3 RC with an old 1.2 Spark job that 
> I had. Due to changes in DataFrame and SchemaRDD, my app failed with 
> {{java.lang.NoSuchMethodError}}. However, the AM was marked as succeeded, and 
> the error was silently swallowed.
> The problem is that the pattern matching in the Spark AM fails to catch 
> NoSuchMethodError:
> {code}
> 15/02/25 20:13:27 INFO cluster.YarnClusterScheduler: 
> YarnClusterScheduler.postStartHook done
> Exception in thread "Driver" scala.MatchError: java.lang.NoSuchMethodError: 
> org.apache.spark.sql.hive.HiveContext.table(Ljava/lang/String;)Lorg/apache/spark/sql/SchemaRDD;
>  (of class java.lang.NoSuchMethodError)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:485)
> {code}
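A minimal standalone reproduction of that failure mode, outside Spark (the
object and method names are illustrative, not the ApplicationMaster code): a
pattern match over Throwable that only lists Exception cases turns a
NoSuchMethodError into a scala.MatchError, while a Throwable catch-all reports
it.

{code}
object MatchErrorSketch {
  // Only handles Exception subtypes; an Error such as NoSuchMethodError falls
  // through and surfaces as scala.MatchError at runtime.
  def reportNarrow(t: Throwable): String = t match {
    case e: InterruptedException => s"interrupted: ${e.getMessage}"
    case e: Exception            => s"app exception: ${e.getMessage}"
  }

  // Adds a Throwable case, so JVM errors are reported instead of being
  // converted into a MatchError.
  def reportWide(t: Throwable): String = t match {
    case e: InterruptedException => s"interrupted: ${e.getMessage}"
    case e: Exception            => s"app exception: ${e.getMessage}"
    case e: Throwable            => s"fatal error: $e"
  }

  def main(args: Array[String]): Unit = {
    val err = new NoSuchMethodError("HiveContext.table")
    try println(reportNarrow(err))
    catch { case m: MatchError => println(s"narrow match blew up: $m") }
    println(reportWide(err))
  }
}
{code}

The wide version simply adds a Throwable case, which also catches JVM errors
such as NoSuchMethodError thrown by user code.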



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6018) NoSuchMethodError in Spark app is swallowed by YARN AM

2015-02-26 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-6018:
-
Affects Version/s: 1.2.0

> NoSuchMethodError in Spark app is swallowed by YARN AM
> --
>
> Key: SPARK-6018
> URL: https://issues.apache.org/jira/browse/SPARK-6018
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.2.0
>Reporter: Cheolsoo Park
>Priority: Minor
>  Labels: yarn
> Fix For: 1.3.0, 1.2.2
>
>
> I discovered this bug while testing the 1.3 RC with an old 1.2 Spark job that 
> I had. Due to changes in DataFrame and SchemaRDD, my app failed with 
> {{java.lang.NoSuchMethodError}}. However, the AM was marked as succeeded, and 
> the error was silently swallowed.
> The problem is that the pattern matching in the Spark AM fails to catch 
> NoSuchMethodError:
> {code}
> 15/02/25 20:13:27 INFO cluster.YarnClusterScheduler: 
> YarnClusterScheduler.postStartHook done
> Exception in thread "Driver" scala.MatchError: java.lang.NoSuchMethodError: 
> org.apache.spark.sql.hive.HiveContext.table(Ljava/lang/String;)Lorg/apache/spark/sql/SchemaRDD;
>  (of class java.lang.NoSuchMethodError)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:485)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-6018) NoSuchMethodError in Spark app is swallowed by YARN AM

2015-02-26 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-6018.

  Resolution: Fixed
   Fix Version/s: 1.2.2
  1.3.0
Target Version/s: 1.3.0, 1.2.2

> NoSuchMethodError in Spark app is swallowed by YARN AM
> --
>
> Key: SPARK-6018
> URL: https://issues.apache.org/jira/browse/SPARK-6018
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.2.0
>Reporter: Cheolsoo Park
>Priority: Minor
>  Labels: yarn
> Fix For: 1.3.0, 1.2.2
>
>
> I discovered this bug while testing the 1.3 RC with an old 1.2 Spark job that 
> I had. Due to changes in DataFrame and SchemaRDD, my app failed with 
> {{java.lang.NoSuchMethodError}}. However, the AM was marked as succeeded, and 
> the error was silently swallowed.
> The problem is that the pattern matching in the Spark AM fails to catch 
> NoSuchMethodError:
> {code}
> 15/02/25 20:13:27 INFO cluster.YarnClusterScheduler: 
> YarnClusterScheduler.postStartHook done
> Exception in thread "Driver" scala.MatchError: java.lang.NoSuchMethodError: 
> org.apache.spark.sql.hive.HiveContext.table(Ljava/lang/String;)Lorg/apache/spark/sql/SchemaRDD;
>  (of class java.lang.NoSuchMethodError)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:485)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


