[jira] [Resolved] (SPARK-3455) **HotFix** Unit test failed due to can not resolve the attribute references

2014-09-13 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-3455.
-
Resolution: Fixed

 **HotFix** Unit test failed due to can not resolve the attribute references
 ---

 Key: SPARK-3455
 URL: https://issues.apache.org/jira/browse/SPARK-3455
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Cheng Hao
Priority: Blocker

 The test case "SPARK-3349 partitioning after limit" failed with the following exception:
 {panel}
 23:10:04.117 ERROR org.apache.spark.scheduler.TaskSetManager: Task 0 in stage 
 274.0 failed 1 times; aborting job
 [info] - SPARK-3349 partitioning after limit *** FAILED ***
 [info]   Exception thrown while executing query:
 [info]   == Parsed Logical Plan ==
 [info]   Project [*]
 [info]Join Inner, Some(('subset1.n = 'lowerCaseData.n))
 [info] UnresolvedRelation None, lowerCaseData, None
 [info] UnresolvedRelation None, subset1, None
 [info]   
 [info]   == Analyzed Logical Plan ==
 [info]   Project [n#605,l#606,n#12]
 [info]Join Inner, Some((n#12 = n#605))
 [info] SparkLogicalPlan (ExistingRdd [n#605,l#606], MapPartitionsRDD[13] 
 at mapPartitions at basicOperators.scala:219)
 [info] Limit 2
 [info]  Sort [n#12 DESC]
 [info]   Distinct 
 [info]Project [n#12]
 [info] SparkLogicalPlan (ExistingRdd [n#607,l#608], 
 MapPartitionsRDD[13] at mapPartitions at basicOperators.scala:219)
 [info]   
 [info]   == Optimized Logical Plan ==
 [info]   Project [n#605,l#606,n#12]
 [info]Join Inner, Some((n#12 = n#605))
 [info] SparkLogicalPlan (ExistingRdd [n#605,l#606], MapPartitionsRDD[13] 
 at mapPartitions at basicOperators.scala:219)
 [info] Limit 2
 [info]  Sort [n#12 DESC]
 [info]   Distinct 
 [info]Project [n#12]
 [info] SparkLogicalPlan (ExistingRdd [n#607,l#608], 
 MapPartitionsRDD[13] at mapPartitions at basicOperators.scala:219)
 [info]   
 [info]   == Physical Plan ==
 [info]   Project [n#605,l#606,n#12]
 [info]ShuffledHashJoin [n#605], [n#12], BuildRight
 [info] Exchange (HashPartitioning [n#605], 10)
 [info]  ExistingRdd [n#605,l#606], MapPartitionsRDD[13] at mapPartitions 
 at basicOperators.scala:219
 [info] Exchange (HashPartitioning [n#12], 10)
 [info]  TakeOrdered 2, [n#12 DESC]
 [info]   Distinct false
 [info]Exchange (HashPartitioning [n#12], 10)
 [info] Distinct true
 [info]  Project [n#12]
 [info]   ExistingRdd [n#607,l#608], MapPartitionsRDD[13] at 
 mapPartitions at basicOperators.scala:219
 [info]   
 [info]   Code Generation: false
 [info]   == RDD ==
 [info]   == Exception ==
 [info]   org.apache.spark.sql.catalyst.errors.package$TreeNodeException: 
 execute, tree:
 [info]   Exchange (HashPartitioning [n#12], 10)
 [info]TakeOrdered 2, [n#12 DESC]
 [info] Distinct false
 [info]  Exchange (HashPartitioning [n#12], 10)
 [info]   Distinct true
 [info]Project [n#12]
 [info] ExistingRdd [n#607,l#608], MapPartitionsRDD[13] at 
 mapPartitions at basicOperators.scala:219
 [info]   
 [info]   org.apache.spark.sql.catalyst.errors.package$TreeNodeException: 
 execute, tree:
 [info]   Exchange (HashPartitioning [n#12], 10)
 [info]TakeOrdered 2, [n#12 DESC]
 [info] Distinct false
 [info]  Exchange (HashPartitioning [n#12], 10)
 [info]   Distinct true
 [info]Project [n#12]
 [info] ExistingRdd [n#607,l#608], MapPartitionsRDD[13] at 
 mapPartitions at basicOperators.scala:219
 [info]   
 [info]at 
 org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:47)
 [info]at 
 org.apache.spark.sql.execution.Exchange.execute(Exchange.scala:44)
 [info]at 
 org.apache.spark.sql.execution.ShuffledHashJoin.execute(joins.scala:354)
 [info]at 
 org.apache.spark.sql.execution.Project.execute(basicOperators.scala:42)
 [info]at 
 org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:85)
 [info]at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:438)
 [info]at 
 org.apache.spark.sql.QueryTest.checkAnswer(QueryTest.scala:40)
 [info]at 
 org.apache.spark.sql.SQLQuerySuite$$anonfun$31.apply$mcV$sp(SQLQuerySuite.scala:369)
 [info]at 
 org.apache.spark.sql.SQLQuerySuite$$anonfun$31.apply(SQLQuerySuite.scala:362)
 [info]at 
 org.apache.spark.sql.SQLQuerySuite$$anonfun$31.apply(SQLQuerySuite.scala:362)
 [info]at 
 org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
 [info]at 
 org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
 [info]at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
 [info]at 

[jira] [Updated] (SPARK-2883) Spark Support for ORCFile format

2014-09-13 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-2883:

Priority: Blocker  (was: Major)

 Spark Support for ORCFile format
 

 Key: SPARK-2883
 URL: https://issues.apache.org/jira/browse/SPARK-2883
 Project: Spark
  Issue Type: Bug
  Components: Input/Output, SQL
Reporter: Zhan Zhang
Priority: Blocker
 Attachments: 2014-09-12 07.05.24 pm Spark UI.png, 2014-09-12 07.07.19 
 pm jobtracker.png


 Verify the support of OrcInputFormat in Spark, fix any issues that exist, and add 
 documentation of its usage.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2883) Spark Support for ORCFile format

2014-09-13 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-2883:

Target Version/s: 1.2.0

 Spark Support for ORCFile format
 

 Key: SPARK-2883
 URL: https://issues.apache.org/jira/browse/SPARK-2883
 Project: Spark
  Issue Type: Bug
  Components: Input/Output, SQL
Reporter: Zhan Zhang
Priority: Blocker
 Attachments: 2014-09-12 07.05.24 pm Spark UI.png, 2014-09-12 07.07.19 
 pm jobtracker.png


 Verify the support of OrcInputFormat in Spark, fix any issues that exist, and add 
 documentation of its usage.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3481) HiveComparisonTest throws exception of org.apache.hadoop.hive.ql.metadata.HiveException: Database does not exist: default

2014-09-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14132567#comment-14132567
 ] 

Apache Spark commented on SPARK-3481:
-

User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/2377

 HiveComparisonTest throws exception of 
 org.apache.hadoop.hive.ql.metadata.HiveException: Database does not exist: 
 default
 ---

 Key: SPARK-3481
 URL: https://issues.apache.org/jira/browse/SPARK-3481
 Project: Spark
  Issue Type: Test
  Components: SQL
Reporter: Cheng Hao
Priority: Minor

 In local tests, lots of exceptions are raised like:
 {panel}
 11:08:01.746 ERROR hive.ql.exec.DDLTask: 
 org.apache.hadoop.hive.ql.metadata.HiveException: Database does not exist: 
 default
   at 
 org.apache.hadoop.hive.ql.exec.DDLTask.switchDatabase(DDLTask.java:3480)
   at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:237)
   at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:151)
   at 
 org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:65)
   at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1414)
   at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1192)
   at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1020)
   at org.apache.hadoop.hive.ql.Driver.run(Driver.java:888)
   at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:298)
   at 
 org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:272)
   at 
 org.apache.spark.sql.hive.test.TestHiveContext.runSqlHive(TestHive.scala:88)
   at 
 org.apache.spark.sql.hive.test.TestHiveContext.reset(TestHive.scala:348)
   at 
 org.apache.spark.sql.hive.execution.HiveComparisonTest$$anonfun$createQueryTest$1.apply$mcV$sp(HiveComparisonTest.scala:255)
   at 
 org.apache.spark.sql.hive.execution.HiveComparisonTest$$anonfun$createQueryTest$1.apply(HiveComparisonTest.scala:225)
   at 
 org.apache.spark.sql.hive.execution.HiveComparisonTest$$anonfun$createQueryTest$1.apply(HiveComparisonTest.scala:225)
   at 
 org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
   at 
 org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
   at org.scalatest.Transformer.apply(Transformer.scala:22)
   at org.scalatest.Transformer.apply(Transformer.scala:20)
   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:158)
   at org.scalatest.Suite$class.withFixture(Suite.scala:1121)
   at org.scalatest.FunSuite.withFixture(FunSuite.scala:1559)
   at 
 org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:155)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:167)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:167)
   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:167)
   at org.scalatest.FunSuite.runTest(FunSuite.scala:1559)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:200)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:200)
   at 
 org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
   at 
 org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
   at scala.collection.immutable.List.foreach(List.scala:318)
   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
   at 
 org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:200)
   at org.scalatest.FunSuite.runTests(FunSuite.scala:1559)
   at org.scalatest.Suite$class.run(Suite.scala:1423)
   at 
 org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1559)
   at 
 org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:204)
   at 
 org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:204)
   at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
   at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:204)
   at 
 org.apache.spark.sql.hive.execution.HiveComparisonTest.org$scalatest$BeforeAndAfterAll$$super$run(HiveComparisonTest.scala:41)
   at 
 org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:257)
   at 
 

[jira] [Commented] (SPARK-3491) Use pickle to serialize the data in MLlib Python

2014-09-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14132590#comment-14132590
 ] 

Apache Spark commented on SPARK-3491:
-

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/2378

 Use pickle to serialize the data in MLlib Python
 

 Key: SPARK-3491
 URL: https://issues.apache.org/jira/browse/SPARK-3491
 Project: Spark
  Issue Type: Improvement
  Components: MLlib, PySpark
Reporter: Davies Liu
Assignee: Davies Liu

 Currently, we write the serialization/deserialization code in Python and Scala 
 manually; this does not scale to the large number of MLlib APIs.
 If the serialization could be done with pickle (using Pyrolite in the JVM) in an 
 extensible way, it would be much easier to add Python APIs for MLlib.
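 A minimal round-trip sketch of the mechanism, assuming Pyrolite's 
 net.razorvine.pickle API is on the JVM classpath (an illustration only, not the 
 proposed MLlib serializer):
 {code}
 import net.razorvine.pickle.{Pickler, Unpickler}

 // Pickle a JVM object into Python's pickle format, so the Python side can
 // simply call pickle.loads(...) on the bytes it receives.
 val bytes: Array[Byte] = new Pickler().dumps(java.util.Arrays.asList(1.0, 2.0, 3.0))

 // Unpickle bytes produced by Python back into plain JVM objects.
 val restored = new Unpickler().loads(bytes)  // a java.util.List on the JVM side
 {code}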



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2098) All Spark processes should support spark-defaults.conf, config file

2014-09-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14132593#comment-14132593
 ] 

Apache Spark commented on SPARK-2098:
-

User 'witgo' has created a pull request for this issue:
https://github.com/apache/spark/pull/2379

 All Spark processes should support spark-defaults.conf, config file
 ---

 Key: SPARK-2098
 URL: https://issues.apache.org/jira/browse/SPARK-2098
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Marcelo Vanzin
Assignee: Guoqiang Li

 SparkSubmit supports the idea of a config file to set SparkConf 
 configurations. This is handy because you can easily set a site-wide 
 configuration file, and power users can use their own when needed, or resort 
 to JVM properties or other means of overriding configs.
 It would be nice if all Spark processes (e.g. master / worker / history 
 server) also supported something like this. For daemon processes this is 
 particularly interesting because it makes it easy to decouple starting the 
 daemon (e.g. some /etc/init.d script packaged by some distribution) from 
 configuring that daemon. Right now you have to set environment variables to 
 modify the configuration of those daemons, which is not very friendly to the 
 above scenario.
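 As a hedged sketch of what a daemon could do (file location and filtering are 
 assumptions, not existing Spark code): spark-defaults.conf is a plain 
 whitespace-separated key/value file, so java.util.Properties can read it directly.
 {code}
 import java.io.FileInputStream
 import java.util.Properties
 import scala.collection.JavaConverters._

 // Load "key value" pairs from a spark-defaults.conf style file and keep only
 // spark.* settings, which a daemon could then apply to its SparkConf.
 def loadDefaults(path: String): Map[String, String] = {
   val props = new Properties()
   val in = new FileInputStream(path)
   try props.load(in) finally in.close()
   props.asScala.toMap.filter { case (k, _) => k.startsWith("spark.") }
 }

 // e.g. loadDefaults(sys.env.getOrElse("SPARK_CONF_DIR", "conf") + "/spark-defaults.conf")
 {code}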



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2926) Add MR-style (merge-sort) SortShuffleReader for sort-based shuffle

2014-09-13 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14132598#comment-14132598
 ] 

Saisai Shao commented on SPARK-2926:


Ok, I will take a try and let you know then it is ready. Thanks a lot.

 Add MR-style (merge-sort) SortShuffleReader for sort-based shuffle
 --

 Key: SPARK-2926
 URL: https://issues.apache.org/jira/browse/SPARK-2926
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle
Affects Versions: 1.1.0
Reporter: Saisai Shao
 Attachments: SortBasedShuffleRead.pdf, Spark Shuffle Test 
 Report(contd).pdf, Spark Shuffle Test Report.pdf


 Currently Spark has already integrated sort-based shuffle write, which greatly 
 improves IO performance and reduces memory consumption when the number of 
 reducers is very large. On the reducer side, however, it still uses the 
 hash-based shuffle reader, which ignores the ordering of map output data in 
 some situations.
 Here we propose an MR-style, sort-merge-like shuffle reader for sort-based 
 shuffle to further improve its performance.
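 The following is a conceptual sketch only (plain Scala collections, not Spark's 
 ShuffleReader API), showing the MR-style merge of already-sorted map-output 
 streams:
 {code}
 import scala.collection.mutable

 // k-way merge of sorted per-map-output iterators using a priority queue,
 // instead of re-hashing everything on the reduce side.
 def mergeSorted[K, V](parts: Seq[Iterator[(K, V)]])(implicit ord: Ordering[K]): Iterator[(K, V)] = {
   // min-heap on each iterator's current key (PriorityQueue is a max-heap, hence .reverse)
   val heap = mutable.PriorityQueue.empty[BufferedIterator[(K, V)]](
     Ordering.by[BufferedIterator[(K, V)], K](_.head._1).reverse)
   parts.map(_.buffered).filter(_.hasNext).foreach(heap.enqueue(_))
   new Iterator[(K, V)] {
     def hasNext = heap.nonEmpty
     def next() = {
       val it = heap.dequeue()
       val kv = it.next()
       if (it.hasNext) heap.enqueue(it)  // put the iterator back with its new head
       kv
     }
   }
 }
 {code}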
 Work-in-progress code and a performance test report will be posted once some 
 unit test bugs are fixed.
 Any comments would be greatly appreciated. 
 Thanks a lot.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-2926) Add MR-style (merge-sort) SortShuffleReader for sort-based shuffle

2014-09-13 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14132598#comment-14132598
 ] 

Saisai Shao edited comment on SPARK-2926 at 9/13/14 8:09 AM:
-

Ok, I will take a try and let you know when it is ready. Thanks a lot.


was (Author: jerryshao):
Ok, I will take a try and let you know then it is ready. Thanks a lot.

 Add MR-style (merge-sort) SortShuffleReader for sort-based shuffle
 --

 Key: SPARK-2926
 URL: https://issues.apache.org/jira/browse/SPARK-2926
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle
Affects Versions: 1.1.0
Reporter: Saisai Shao
 Attachments: SortBasedShuffleRead.pdf, Spark Shuffle Test 
 Report(contd).pdf, Spark Shuffle Test Report.pdf


 Currently Spark has already integrated sort-based shuffle write, which greatly 
 improves IO performance and reduces memory consumption when the number of 
 reducers is very large. On the reducer side, however, it still uses the 
 hash-based shuffle reader, which ignores the ordering of map output data in 
 some situations.
 Here we propose an MR-style, sort-merge-like shuffle reader for sort-based 
 shuffle to further improve its performance.
 Work-in-progress code and a performance test report will be posted once some 
 unit test bugs are fixed.
 Any comments would be greatly appreciated. 
 Thanks a lot.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3518) Remove useless statement in JsonProtocol

2014-09-13 Thread Kousuke Saruta (JIRA)
Kousuke Saruta created SPARK-3518:
-

 Summary: Remove useless statement in JsonProtocol
 Key: SPARK-3518
 URL: https://issues.apache.org/jira/browse/SPARK-3518
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Kousuke Saruta
Priority: Minor


In org.apache.spark.util.JsonProtocol#taskInfoToJson, a variable named 
accumUpdateMap is created as follows.

{code}
val accumUpdateMap = taskInfo.accumulables
{code}

But accumUpdateMap is never used, and there is a second invocation of 
taskInfo.accumulables, as follows:

{code}
("Accumulables" -> JArray(taskInfo.accumulables.map(accumulableInfoToJson).toList))
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3518) Remove useless statement in JsonProtocol

2014-09-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14132608#comment-14132608
 ] 

Apache Spark commented on SPARK-3518:
-

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/2380

 Remove useless statement in JsonProtocol
 

 Key: SPARK-3518
 URL: https://issues.apache.org/jira/browse/SPARK-3518
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Kousuke Saruta
Priority: Minor

 In org.apache.spark.util.JsonProtocol#taskInfoToJson, a variable named 
 accumUpdateMap is created as follows.
 {code}
 val accumUpdateMap = taskInfo.accumulables
 {code}
 But accumUpdateMap is never used, and there is a second invocation of 
 taskInfo.accumulables, as follows:
 {code}
 ("Accumulables" -> JArray(taskInfo.accumulables.map(accumulableInfoToJson).toList))
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2593) Add ability to pass an existing Akka ActorSystem into Spark

2014-09-13 Thread Helena Edelson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Helena Edelson updated SPARK-2593:
--
Description: 
As a developer I want to pass an existing ActorSystem into StreamingContext at 
load time so that I do not have 2 actor systems running on a node in an Akka 
application.

This would mean having Spark's actor system run on its own named dispatchers, as 
well as exposing the currently private creation of its own actor system.
  
 

  was:
As a developer I want to pass an existing ActorSystem into StreamingContext in 
load-time so that I do not have 2 actor systems running on a node in an Akka 
application.

This would mean having spark's actor system on its own named-dispatchers as 
well as exposing the new private creation of its own actor system.
 
I would like to create an Akka Extension that wraps around Spark/Spark 
Streaming and Cassandra. So the programmatic creation would simply be this for 
a user

val extension = SparkCassandra(system)
 


 Add ability to pass an existing Akka ActorSystem into Spark
 ---

 Key: SPARK-2593
 URL: https://issues.apache.org/jira/browse/SPARK-2593
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Helena Edelson

 As a developer I want to pass an existing ActorSystem into StreamingContext at 
 load time so that I do not have 2 actor systems running on a node in an Akka 
 application.
 This would mean having Spark's actor system run on its own named dispatchers, as 
 well as exposing the currently private creation of its own actor system.
   
  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2593) Add ability to pass an existing Akka ActorSystem into Spark

2014-09-13 Thread Helena Edelson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14132797#comment-14132797
 ] 

Helena Edelson commented on SPARK-2593:
---

Here is a good example of just one of the issues: it is difficult to locate a 
remote Spark actor to publish data to the stream. Here I have to have the 
streaming actor created and, in its preStart, publish a custom message with 
`self` which actors in my ActorSystem can receive in order to get the ActorRef 
to send to. This is incredibly clunky.
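A minimal Akka sketch of that workaround, with illustrative names only (neither 
Spark code nor real application code):

{code}
import akka.actor.{Actor, ActorRef, ActorSelection}

case class RegisterReceiver(ref: ActorRef)

// The receiving actor lives in Spark's ActorSystem; in preStart it publishes its
// own ActorRef to a coordinator actor in the application's ActorSystem, which is
// the only way the application can learn where to send data for the stream.
class StreamReceiverActor(coordinator: ActorSelection) extends Actor {
  override def preStart(): Unit = coordinator ! RegisterReceiver(self)
  def receive = {
    case msg => // push msg into the stream here
  }
}
{code}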

I will try to carve out some time to do this PR this week.
 

 Add ability to pass an existing Akka ActorSystem into Spark
 ---

 Key: SPARK-2593
 URL: https://issues.apache.org/jira/browse/SPARK-2593
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Helena Edelson

 As a developer I want to pass an existing ActorSystem into StreamingContext at 
 load time so that I do not have 2 actor systems running on a node in an Akka 
 application.
 This would mean having Spark's actor system run on its own named dispatchers, as 
 well as exposing the currently private creation of its own actor system.
   
  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3519) PySpark RDDs are missing the distinct(n) method

2014-09-13 Thread Nicholas Chammas (JIRA)
Nicholas Chammas created SPARK-3519:
---

 Summary: PySpark RDDs are missing the distinct(n) method
 Key: SPARK-3519
 URL: https://issues.apache.org/jira/browse/SPARK-3519
 Project: Spark
  Issue Type: Bug
  Components: PySpark, Spark Core
Affects Versions: 1.1.0
Reporter: Nicholas Chammas


{{distinct()}} works but {{distinct(N)}} doesn't.

{code}
>>> sc.parallelize([1,1,2]).distinct()
PythonRDD[15] at RDD at PythonRDD.scala:43
>>> sc.parallelize([1,1,2]).distinct(2)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: distinct() takes exactly 1 argument (2 given)
{code}

The PySpark docs only call out [the {{distinct()}} 
signature|http://spark.apache.org/docs/1.1.0/api/python/pyspark.rdd.RDD-class.html#distinct],
 but the programming guide [includes the {{distinct(N)}} 
signature|http://spark.apache.org/docs/1.1.0/programming-guide.html#transformations]
 as well.

{quote}
{noformat}
distinct([numTasks]))   Return a new dataset that contains the distinct 
elements of the source dataset.
{noformat}
{quote}
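For reference, the Scala RDD already exposes this knob; paraphrased from 
RDD.scala (treat this as a paraphrase in sketch form, not a patch), it is just a 
reduceByKey over (x, null) pairs with a configurable partition count, which 
PySpark could mirror:

{code}
import org.apache.spark.SparkContext._   // pair RDD functions (Spark 1.x)
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

// Deduplicate by reducing over (x, null) pairs, exposing the partition count.
def distinctSketch[T: ClassTag](rdd: RDD[T], numPartitions: Int): RDD[T] =
  rdd.map(x => (x, null)).reduceByKey((x, y) => x, numPartitions).map(_._1)
{code}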



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3519) PySpark RDDs are missing the distinct(n) method

2014-09-13 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14132833#comment-14132833
 ] 

Nicholas Chammas commented on SPARK-3519:
-

[~joshrosen]  [~davies]: Here is a ticket for the missing {{distinct(N)}} 
method. I marked it as a bug since the programming guide says it should exist.

 PySpark RDDs are missing the distinct(n) method
 ---

 Key: SPARK-3519
 URL: https://issues.apache.org/jira/browse/SPARK-3519
 Project: Spark
  Issue Type: Bug
  Components: PySpark, Spark Core
Affects Versions: 1.1.0
Reporter: Nicholas Chammas

 {{distinct()}} works but {{distinct(N)}} doesn't.
 {code}
 >>> sc.parallelize([1,1,2]).distinct()
 PythonRDD[15] at RDD at PythonRDD.scala:43
 >>> sc.parallelize([1,1,2]).distinct(2)
 Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
 TypeError: distinct() takes exactly 1 argument (2 given)
 {code}
 The PySpark docs only call out [the {{distinct()}} 
 signature|http://spark.apache.org/docs/1.1.0/api/python/pyspark.rdd.RDD-class.html#distinct],
  but the programming guide [includes the {{distinct(N)}} 
 signature|http://spark.apache.org/docs/1.1.0/programming-guide.html#transformations]
  as well.
 {quote}
 {noformat}
 distinct([numTasks])) Return a new dataset that contains the distinct 
 elements of the source dataset.
 {noformat}
 {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3407) Add Date type support

2014-09-13 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-3407:

Target Version/s: 1.2.0

 Add Date type support
 -

 Key: SPARK-3407
 URL: https://issues.apache.org/jira/browse/SPARK-3407
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Cheng Hao





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3407) Add Date type support

2014-09-13 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-3407:

Assignee: Adrian Wang

 Add Date type support
 -

 Key: SPARK-3407
 URL: https://issues.apache.org/jira/browse/SPARK-3407
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Cheng Hao
Assignee: Adrian Wang





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2562) Add Date datatype support to Spark SQL

2014-09-13 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-2562.
-
Resolution: Duplicate

 Add Date datatype support to Spark SQL
 --

 Key: SPARK-2562
 URL: https://issues.apache.org/jira/browse/SPARK-2562
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.0.1
Reporter: Zongheng Yang
Priority: Minor

 Spark SQL currently supports Timestamp, but not Date. Hive introduced support 
 for Date in [HIVE-4055|https://issues.apache.org/jira/browse/HIVE-4055], 
 where the underlying representation is {{java.sql.Date}}.
 (Thanks to user Rindra Ramamonjison for reporting this.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2594) Add CACHE TABLE name AS SELECT ...

2014-09-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14132876#comment-14132876
 ] 

Apache Spark commented on SPARK-2594:
-

User 'ravipesala' has created a pull request for this issue:
https://github.com/apache/spark/pull/2381

 Add CACHE TABLE name AS SELECT ...
 

 Key: SPARK-2594
 URL: https://issues.apache.org/jira/browse/SPARK-2594
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Michael Armbrust
Priority: Critical





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-3414) Case insensitivity breaks when unresolved relation contains attributes with uppercase letters in their names

2014-09-13 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust reopened SPARK-3414:
-
  Assignee: Michael Armbrust  (was: Cheng Lian)

 Case insensitivity breaks when unresolved relation contains attributes with 
 uppercase letters in their names
 

 Key: SPARK-3414
 URL: https://issues.apache.org/jira/browse/SPARK-3414
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.2
Reporter: Cheng Lian
Assignee: Michael Armbrust
Priority: Critical
 Fix For: 1.2.0


 Paste the following snippet into {{spark-shell}} (Hive support required) to 
 reproduce this issue:
 {code}
 import org.apache.spark.sql.hive.HiveContext
 val hiveContext = new HiveContext(sc)
 import hiveContext._
 case class LogEntry(filename: String, message: String)
 case class LogFile(name: String)
 sc.makeRDD(Seq.empty[LogEntry]).registerTempTable("rawLogs")
 sc.makeRDD(Seq.empty[LogFile]).registerTempTable("logFiles")
 val srdd = sql(
   """
   SELECT name, message
   FROM rawLogs
   JOIN (
     SELECT name
     FROM logFiles
   ) files
   ON rawLogs.filename = files.name
   """)
 srdd.registerTempTable("boom")
 sql("select * from boom")
 {code}
 Exception thrown:
 {code}
 SchemaRDD[7] at RDD at SchemaRDD.scala:103
 == Query Plan ==
 == Physical Plan ==
 org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved 
 attributes: *, tree:
 Project [*]
  LowerCaseSchema
   Subquery boom
Project ['name,'message]
 Join Inner, Some(('rawLogs.filename = name#2))
  LowerCaseSchema
   Subquery rawlogs
SparkLogicalPlan (ExistingRdd [filename#0,message#1], 
 MapPartitionsRDD[1] at mapPartitions at basicOperators.scala:208)
  Subquery files
   Project [name#2]
LowerCaseSchema
 Subquery logfiles
  SparkLogicalPlan (ExistingRdd [name#2], MapPartitionsRDD[4] at 
 mapPartitions at basicOperators.scala:208)
 {code}
 Notice that {{rawLogs}} in the join operator is not lowercased.
 The reason is that, during the analysis phase, the 
 {{CaseInsensitiveAttributeReferences}} batch is only executed before the 
 {{Resolution}} batch, and when {{srdd}} is registered as the temporary table 
 {{boom}}, its original (unanalyzed) logical plan is stored in the catalog:
 {code}
 Join Inner, Some(('rawLogs.filename = 'files.name))
  UnresolvedRelation None, rawLogs, None
  Subquery files
   Project ['name]
UnresolvedRelation None, logFiles, None
 {code}
 Notice that the attributes referenced in the join operator (esp. {{rawLogs}}) 
 are not lowercased yet.
 Then, when {{select * from boom}} is analyzed, its input logical plan is:
 {code}
 Project [*]
  UnresolvedRelation None, boom, None
 {code}
 Here the unresolved relation points to the unanalyzed logical plan of {{srdd}} 
 above, which is only discovered later by the {{ResolveRelations}} rule and thus 
 never touched by {{CaseInsensitiveAttributeReferences}}; {{rawLogs.filename}} is 
 therefore not lowercased:
 {code}
 === Applying Rule 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations ===
  Project [*]Project [*]
 ! UnresolvedRelation None, boom, NoneLowerCaseSchema
 ! Subquery boom
 !  Project ['name,'message]
 !   Join Inner, 
 Some(('rawLogs.filename = 'files.name))
 !LowerCaseSchema
 ! Subquery rawlogs
 !  SparkLogicalPlan (ExistingRdd 
 [filename#0,message#1], MapPartitionsRDD[1] at mapPartitions at 
 basicOperators.scala:208)
 !Subquery files
 ! Project ['name]
 !  LowerCaseSchema
 !   Subquery logfiles
 !SparkLogicalPlan 
 (ExistingRdd [name#2], MapPartitionsRDD[4] at mapPartitions at 
 basicOperators.scala:208)
 {code}
 A reasonable fix for this could be to always register the analyzed logical plan 
 in the catalog when registering temporary tables.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3414) Case insensitivity breaks when unresolved relation contains attributes with uppercase letters in their names

2014-09-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14132893#comment-14132893
 ] 

Apache Spark commented on SPARK-3414:
-

User 'marmbrus' has created a pull request for this issue:
https://github.com/apache/spark/pull/2382

 Case insensitivity breaks when unresolved relation contains attributes with 
 uppercase letters in their names
 

 Key: SPARK-3414
 URL: https://issues.apache.org/jira/browse/SPARK-3414
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.2
Reporter: Cheng Lian
Assignee: Michael Armbrust
Priority: Critical
 Fix For: 1.2.0


 Paste the following snippet into {{spark-shell}} (Hive support required) to 
 reproduce this issue:
 {code}
 import org.apache.spark.sql.hive.HiveContext
 val hiveContext = new HiveContext(sc)
 import hiveContext._
 case class LogEntry(filename: String, message: String)
 case class LogFile(name: String)
 sc.makeRDD(Seq.empty[LogEntry]).registerTempTable("rawLogs")
 sc.makeRDD(Seq.empty[LogFile]).registerTempTable("logFiles")
 val srdd = sql(
   """
   SELECT name, message
   FROM rawLogs
   JOIN (
     SELECT name
     FROM logFiles
   ) files
   ON rawLogs.filename = files.name
   """)
 srdd.registerTempTable("boom")
 sql("select * from boom")
 {code}
 Exception thrown:
 {code}
 SchemaRDD[7] at RDD at SchemaRDD.scala:103
 == Query Plan ==
 == Physical Plan ==
 org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved 
 attributes: *, tree:
 Project [*]
  LowerCaseSchema
   Subquery boom
Project ['name,'message]
 Join Inner, Some(('rawLogs.filename = name#2))
  LowerCaseSchema
   Subquery rawlogs
SparkLogicalPlan (ExistingRdd [filename#0,message#1], 
 MapPartitionsRDD[1] at mapPartitions at basicOperators.scala:208)
  Subquery files
   Project [name#2]
LowerCaseSchema
 Subquery logfiles
  SparkLogicalPlan (ExistingRdd [name#2], MapPartitionsRDD[4] at 
 mapPartitions at basicOperators.scala:208)
 {code}
 Notice that {{rawLogs}} in the join operator is not lowercased.
 The reason is that, during the analysis phase, the 
 {{CaseInsensitiveAttributeReferences}} batch is only executed before the 
 {{Resolution}} batch, and when {{srdd}} is registered as the temporary table 
 {{boom}}, its original (unanalyzed) logical plan is stored in the catalog:
 {code}
 Join Inner, Some(('rawLogs.filename = 'files.name))
  UnresolvedRelation None, rawLogs, None
  Subquery files
   Project ['name]
UnresolvedRelation None, logFiles, None
 {code}
 Notice that the attributes referenced in the join operator (esp. {{rawLogs}}) 
 are not lowercased yet.
 Then, when {{select * from boom}} is analyzed, its input logical plan is:
 {code}
 Project [*]
  UnresolvedRelation None, boom, None
 {code}
 Here the unresolved relation points to the unanalyzed logical plan of {{srdd}} 
 above, which is only discovered later by the {{ResolveRelations}} rule and thus 
 never touched by {{CaseInsensitiveAttributeReferences}}; {{rawLogs.filename}} is 
 therefore not lowercased:
 {code}
 === Applying Rule 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations ===
  Project [*]Project [*]
 ! UnresolvedRelation None, boom, NoneLowerCaseSchema
 ! Subquery boom
 !  Project ['name,'message]
 !   Join Inner, 
 Some(('rawLogs.filename = 'files.name))
 !LowerCaseSchema
 ! Subquery rawlogs
 !  SparkLogicalPlan (ExistingRdd 
 [filename#0,message#1], MapPartitionsRDD[1] at mapPartitions at 
 basicOperators.scala:208)
 !Subquery files
 ! Project ['name]
 !  LowerCaseSchema
 !   Subquery logfiles
 !SparkLogicalPlan 
 (ExistingRdd [name#2], MapPartitionsRDD[4] at mapPartitions at 
 basicOperators.scala:208)
 {code}
 A reasonable fix for this could be to always register the analyzed logical plan 
 in the catalog when registering temporary tables.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3481) HiveComparisonTest throws exception of org.apache.hadoop.hive.ql.metadata.HiveException: Database does not exist: default

2014-09-13 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-3481.
-
   Resolution: Fixed
Fix Version/s: 1.2.0
 Assignee: Cheng Hao

 HiveComparisonTest throws exception of 
 org.apache.hadoop.hive.ql.metadata.HiveException: Database does not exist: 
 default
 ---

 Key: SPARK-3481
 URL: https://issues.apache.org/jira/browse/SPARK-3481
 Project: Spark
  Issue Type: Test
  Components: SQL
Reporter: Cheng Hao
Assignee: Cheng Hao
Priority: Minor
 Fix For: 1.2.0


 In local tests, lots of exceptions are raised like:
 {panel}
 11:08:01.746 ERROR hive.ql.exec.DDLTask: 
 org.apache.hadoop.hive.ql.metadata.HiveException: Database does not exist: 
 default
   at 
 org.apache.hadoop.hive.ql.exec.DDLTask.switchDatabase(DDLTask.java:3480)
   at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:237)
   at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:151)
   at 
 org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:65)
   at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1414)
   at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1192)
   at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1020)
   at org.apache.hadoop.hive.ql.Driver.run(Driver.java:888)
   at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:298)
   at 
 org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:272)
   at 
 org.apache.spark.sql.hive.test.TestHiveContext.runSqlHive(TestHive.scala:88)
   at 
 org.apache.spark.sql.hive.test.TestHiveContext.reset(TestHive.scala:348)
   at 
 org.apache.spark.sql.hive.execution.HiveComparisonTest$$anonfun$createQueryTest$1.apply$mcV$sp(HiveComparisonTest.scala:255)
   at 
 org.apache.spark.sql.hive.execution.HiveComparisonTest$$anonfun$createQueryTest$1.apply(HiveComparisonTest.scala:225)
   at 
 org.apache.spark.sql.hive.execution.HiveComparisonTest$$anonfun$createQueryTest$1.apply(HiveComparisonTest.scala:225)
   at 
 org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
   at 
 org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
   at org.scalatest.Transformer.apply(Transformer.scala:22)
   at org.scalatest.Transformer.apply(Transformer.scala:20)
   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:158)
   at org.scalatest.Suite$class.withFixture(Suite.scala:1121)
   at org.scalatest.FunSuite.withFixture(FunSuite.scala:1559)
   at 
 org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:155)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:167)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:167)
   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:167)
   at org.scalatest.FunSuite.runTest(FunSuite.scala:1559)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:200)
   at 
 org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:200)
   at 
 org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
   at 
 org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
   at scala.collection.immutable.List.foreach(List.scala:318)
   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
   at 
 org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:200)
   at org.scalatest.FunSuite.runTests(FunSuite.scala:1559)
   at org.scalatest.Suite$class.run(Suite.scala:1423)
   at 
 org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1559)
   at 
 org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:204)
   at 
 org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:204)
   at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
   at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:204)
   at 
 org.apache.spark.sql.hive.execution.HiveComparisonTest.org$scalatest$BeforeAndAfterAll$$super$run(HiveComparisonTest.scala:41)
   at 
 org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:257)
   at 
 

[jira] [Commented] (SPARK-3519) PySpark RDDs are missing the distinct(n) method

2014-09-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14132963#comment-14132963
 ] 

Apache Spark commented on SPARK-3519:
-

User 'mattf' has created a pull request for this issue:
https://github.com/apache/spark/pull/2383

 PySpark RDDs are missing the distinct(n) method
 ---

 Key: SPARK-3519
 URL: https://issues.apache.org/jira/browse/SPARK-3519
 Project: Spark
  Issue Type: Bug
  Components: PySpark, Spark Core
Affects Versions: 1.1.0
Reporter: Nicholas Chammas
Assignee: Matthew Farrellee

 {{distinct()}} works but {{distinct(N)}} doesn't.
 {code}
 >>> sc.parallelize([1,1,2]).distinct()
 PythonRDD[15] at RDD at PythonRDD.scala:43
 >>> sc.parallelize([1,1,2]).distinct(2)
 Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
 TypeError: distinct() takes exactly 1 argument (2 given)
 {code}
 The PySpark docs only call out [the {{distinct()}} 
 signature|http://spark.apache.org/docs/1.1.0/api/python/pyspark.rdd.RDD-class.html#distinct],
  but the programming guide [includes the {{distinct(N)}} 
 signature|http://spark.apache.org/docs/1.1.0/programming-guide.html#transformations]
  as well.
 {quote}
 {noformat}
 distinct([numTasks])) Return a new dataset that contains the distinct 
 elements of the source dataset.
 {noformat}
 {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3294) Avoid boxing/unboxing when handling in-memory columnar storage

2014-09-13 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-3294.
-
   Resolution: Fixed
Fix Version/s: 1.2.0

 Avoid boxing/unboxing when handling in-memory columnar storage
 --

 Key: SPARK-3294
 URL: https://issues.apache.org/jira/browse/SPARK-3294
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.0.2, 1.1.0
Reporter: Cheng Lian
Assignee: Cheng Lian
Priority: Critical
 Fix For: 1.2.0


 When Spark SQL's in-memory columnar storage was implemented, we tried to avoid 
 boxing/unboxing costs as much as possible, but {{javap}} shows that there is 
 still code that boxes/unboxes on critical paths due to type erasure, especially 
 in methods of sub-classes of {{ColumnType}}. We should eliminate it wherever 
 possible for better performance.
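 As a self-contained illustration of the erasure-induced boxing in question 
 (hypothetical classes, not the actual {{ColumnType}} hierarchy):
 {code}
 import java.nio.ByteBuffer

 // The generic method's return type erases to Object, so an Int result is boxed
 // whenever callers go through the generic signature.
 abstract class GenericColumnType[T] {
   def extract(buffer: ByteBuffer): T
 }

 // Adding a concrete, primitive-typed method lets hot paths call it directly and
 // stay unboxed; the generic override simply delegates to it.
 object IntColumnTypeSketch extends GenericColumnType[Int] {
   def extractInt(buffer: ByteBuffer): Int = buffer.getInt()
   override def extract(buffer: ByteBuffer): Int = extractInt(buffer)
 }
 {code}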



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3030) reuse python worker

2014-09-13 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-3030.
---
   Resolution: Fixed
Fix Version/s: 1.2.0

Issue resolved by pull request 2259
[https://github.com/apache/spark/pull/2259]

 reuse python worker
 ---

 Key: SPARK-3030
 URL: https://issues.apache.org/jira/browse/SPARK-3030
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Reporter: Davies Liu
Assignee: Davies Liu
 Fix For: 1.2.0


 Currently, Spark forks a Python worker for each task; it would be better if we 
 could reuse the worker for later tasks.
 This is very useful for large datasets with big broadcasts, since the broadcast 
 does not need to be sent to the worker again and again. It also reduces the 
 overhead of launching a task.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3485) should check parameter type when find constructors

2014-09-13 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-3485:

Target Version/s: 1.2.0

 should check parameter type when find constructors
 --

 Key: SPARK-3485
 URL: https://issues.apache.org/jira/browse/SPARK-3485
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Adrian Wang

 In hiveUdfs, we get constructors for primitive types by finding a constructor 
 that takes only one parameter. This is very dangerous when more than one 
 constructor matches. As the sequence of primitiveTypes grows larger, the problem 
 will occur.
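 An illustrative sketch of the safer lookup (generic reflection, not the actual 
 hiveUdfs code):
 {code}
 import java.lang.reflect.Constructor

 // Instead of picking the first single-argument constructor, also require that
 // the single parameter can actually accept the argument's runtime type.
 def constructorFor(clazz: Class[_], argType: Class[_]): Option[Constructor[_]] =
   clazz.getDeclaredConstructors.find { c =>
     val params = c.getParameterTypes
     params.length == 1 && params(0).isAssignableFrom(argType)
   }
 {code}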



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3515) ParquetMetastoreSuite fails when executed together with other suites under Maven

2014-09-13 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-3515.
-
   Resolution: Fixed
Fix Version/s: 1.2.0
 Assignee: Cheng Lian

 ParquetMetastoreSuite fails when executed together with other suites under 
 Maven
 

 Key: SPARK-3515
 URL: https://issues.apache.org/jira/browse/SPARK-3515
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.1
Reporter: Cheng Lian
Assignee: Cheng Lian
 Fix For: 1.2.0


 Reproduction step:
 {code}
 mvn -Phive,hadoop-2.4 
 -DwildcardSuites=org.apache.spark.sql.parquet.ParquetMetastoreSuite,org.apache.spark.sql.hive.StatisticsSuite
  -pl core,sql/catalyst,sql/core,sql/hive test
 {code}
 Maven instantiates all discovered test suite objects first, and then starts 
 executing all test cases. {{ParquetMetastoreSuite}} sets up several temporary 
 tables in its constructor, but these tables are deleted immediately because 
 {{StatisticsSuite}}'s constructor calls {{TestHiveContext.reset()}}.
 To fix this issue, we shouldn't put this kind of side effect in the constructor, 
 but in {{beforeAll}}.
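 A structural sketch of that change (suite name and setup details are 
 illustrative only):
 {code}
 import org.scalatest.{BeforeAndAfterAll, FunSuite}

 class ParquetMetastoreSuiteSketch extends FunSuite with BeforeAndAfterAll {
   // Setup moved out of the constructor body: under Maven every suite is
   // instantiated before any test runs, so constructor-time setup can be undone
   // by another suite's constructor calling TestHiveContext.reset().
   override def beforeAll(): Unit = {
     // create the temporary tables here
   }

   override def afterAll(): Unit = {
     // drop the temporary tables here
   }

   test("queries against the temporary tables") {
     // ...
   }
 }
 {code}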



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3501) Hive SimpleUDF will create duplicated type cast which cause exception in constant folding

2014-09-13 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-3501:

Target Version/s: 1.2.0

 Hive SimpleUDF will create duplicated type cast which cause exception in 
 constant folding
 -

 Key: SPARK-3501
 URL: https://issues.apache.org/jira/browse/SPARK-3501
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Cheng Hao
Assignee: Cheng Hao
Priority: Minor

 When running a query like:
 select datediff(cast(value as timestamp), cast('2002-03-21 00:00:00' as 
 timestamp)) from src;
 Spark SQL will raise an exception:
 {panel}
 [info] - Cast Timestamp to Timestamp in UDF *** FAILED ***
 [info]   scala.MatchError: TimestampType (of class 
 org.apache.spark.sql.catalyst.types.TimestampType$)
 [info]   at 
 org.apache.spark.sql.catalyst.expressions.Cast.castToTimestamp(Cast.scala:77)
 [info]   at 
 org.apache.spark.sql.catalyst.expressions.Cast.cast$lzycompute(Cast.scala:251)
 [info]   at 
 org.apache.spark.sql.catalyst.expressions.Cast.cast(Cast.scala:247)
 [info]   at 
 org.apache.spark.sql.catalyst.expressions.Cast.eval(Cast.scala:263)
 [info]   at 
 org.apache.spark.sql.catalyst.optimizer.ConstantFolding$$anonfun$apply$5$$anonfun$applyOrElse$2.applyOrElse(Optimizer.scala:217)
 [info]   at 
 org.apache.spark.sql.catalyst.optimizer.ConstantFolding$$anonfun$apply$5$$anonfun$applyOrElse$2.applyOrElse(Optimizer.scala:210)
 [info]   at 
 org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144)
 [info]   at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4$$anonfun$apply$2.apply(TreeNode.scala:180)
 [info]   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 [info]   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 {panel}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3501) Hive SimpleUDF will create duplicated type cast which cause exception in constant folding

2014-09-13 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-3501:

Assignee: Cheng Hao

 Hive SimpleUDF will create duplicated type cast which cause exception in 
 constant folding
 -

 Key: SPARK-3501
 URL: https://issues.apache.org/jira/browse/SPARK-3501
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Cheng Hao
Assignee: Cheng Hao
Priority: Minor

 When running a query like:
 select datediff(cast(value as timestamp), cast('2002-03-21 00:00:00' as 
 timestamp)) from src;
 Spark SQL will raise an exception:
 {panel}
 [info] - Cast Timestamp to Timestamp in UDF *** FAILED ***
 [info]   scala.MatchError: TimestampType (of class 
 org.apache.spark.sql.catalyst.types.TimestampType$)
 [info]   at 
 org.apache.spark.sql.catalyst.expressions.Cast.castToTimestamp(Cast.scala:77)
 [info]   at 
 org.apache.spark.sql.catalyst.expressions.Cast.cast$lzycompute(Cast.scala:251)
 [info]   at 
 org.apache.spark.sql.catalyst.expressions.Cast.cast(Cast.scala:247)
 [info]   at 
 org.apache.spark.sql.catalyst.expressions.Cast.eval(Cast.scala:263)
 [info]   at 
 org.apache.spark.sql.catalyst.optimizer.ConstantFolding$$anonfun$apply$5$$anonfun$applyOrElse$2.applyOrElse(Optimizer.scala:217)
 [info]   at 
 org.apache.spark.sql.catalyst.optimizer.ConstantFolding$$anonfun$apply$5$$anonfun$applyOrElse$2.applyOrElse(Optimizer.scala:210)
 [info]   at 
 org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144)
 [info]   at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4$$anonfun$apply$2.apply(TreeNode.scala:180)
 [info]   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 [info]   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 {panel}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3438) Support for accessing secured HDFS in Standalone Mode

2014-09-13 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3438:
---
Description: 
Access to secured HDFS is currently supported in YARN using YARN's built-in 
security mechanism. In YARN mode, a user application is authenticated when it is 
submitted; it then acquires delegation tokens and ships them (via YARN) securely 
to the workers.

In standalone mode, it would be nice to support a mechanism for accessing HDFS 
where we rely on a single shared secret to authenticate communication in the 
standalone cluster.

1. A company is running a standalone cluster.
2. They are fine if all Spark jobs in the cluster share a global secret, i.e. 
all Spark jobs can trust one another.
3. They are able to provide a Hadoop login on the driver node via a keytab or 
kinit. They want tokens from this login to be distributed to the executors to 
allow access to secure HDFS.
4. They also don't want to trust the network on the cluster. I.e. don't want to 
allow someone to fetch HDFS tokens easily over a known protocol, without 
authentication.

  was:Secured HDFS is supported in YARN currently, but not in standalone mode. 
The tricky bit is how disseminate the delegation tokens securely in standalone 
mode.


 Support for accessing secured HDFS in Standalone Mode
 -

 Key: SPARK-3438
 URL: https://issues.apache.org/jira/browse/SPARK-3438
 Project: Spark
  Issue Type: New Feature
  Components: Deploy, Spark Core
Affects Versions: 1.0.2
Reporter: Zhanfeng Huo

 Access to secured HDFS is currently supported in YARN using YARN's built-in 
 security mechanism. In YARN mode, a user application is authenticated when it is 
 submitted; it then acquires delegation tokens and ships them (via YARN) securely 
 to the workers.
 In standalone mode, it would be nice to support a mechanism for accessing HDFS 
 where we rely on a single shared secret to authenticate communication in the 
 standalone cluster.
 1. A company is running a standalone cluster.
 2. They are fine if all Spark jobs in the cluster share a global secret, i.e. 
 all Spark jobs can trust one another.
 3. They are able to provide a Hadoop login on the driver node via a keytab or 
 kinit. They want tokens from this login to be distributed to the executors to 
 allow access to secure HDFS.
 4. They also don't want to trust the network on the cluster. I.e. don't want 
 to allow someone to fetch HDFS tokens easily over a known protocol, without 
 authentication.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3463) Show metrics about spilling in Python

2014-09-13 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-3463.
---
   Resolution: Fixed
Fix Version/s: 1.2.0

Issue resolved by pull request 2336
[https://github.com/apache/spark/pull/2336]

 Show metrics about spilling in Python
 -

 Key: SPARK-3463
 URL: https://issues.apache.org/jira/browse/SPARK-3463
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Reporter: Davies Liu
Assignee: Davies Liu
 Fix For: 1.2.0


 It should also show the number of bytes spilled into disks while doing 
 aggregation in Python.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1087) Separate file for traceback and callsite related functions

2014-09-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14133094#comment-14133094
 ] 

Apache Spark commented on SPARK-1087:
-

User 'staple' has created a pull request for this issue:
https://github.com/apache/spark/pull/2385

 Separate file for traceback and callsite related functions
 --

 Key: SPARK-1087
 URL: https://issues.apache.org/jira/browse/SPARK-1087
 Project: Spark
  Issue Type: New Feature
  Components: PySpark
Reporter: Jyotiska NK

 Right now, _extract_concise_traceback(), which provides the callsite 
 information, is written inside rdd.py. But for 
 [SPARK-972](https://spark-project.atlassian.net/browse/SPARK-972) in PR #581, we 
 used the function from context.py. Some issues were also faced regarding the 
 return string format.
 It would be a good idea to move the traceback function out of rdd.py and create 
 a separate file for future development.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org