[jira] [Commented] (SPARK-13352) BlockFetch does not scale well on large block

2016-04-17 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15245209#comment-15245209
 ] 

Davies Liu commented on SPARK-13352:


corrected, thanks

> BlockFetch does not scale well on large block
> -
>
> Key: SPARK-13352
> URL: https://issues.apache.org/jira/browse/SPARK-13352
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Spark Core
>Reporter: Davies Liu
>Assignee: Zhang, Liye
>Priority: Critical
> Fix For: 1.6.2, 2.0.0
>
>
> BlockManager.getRemoteBytes() performs poorly on large blocks
> {code}
>   test("block manager") {
> val N = 500 << 20
> val bm = sc.env.blockManager
> val blockId = TaskResultBlockId(0)
> val buffer = ByteBuffer.allocate(N)
> buffer.limit(N)
> bm.putBytes(blockId, buffer, StorageLevel.MEMORY_AND_DISK_SER)
> val result = bm.getRemoteBytes(blockId)
> assert(result.isDefined)
> assert(result.get.limit() === (N))
>   }
> {code}
> Here are the runtimes for different block sizes:
> {code}
> 50M    3 seconds
> 100M   7 seconds
> 250M   33 seconds
> 500M   2 min
> {code}






[jira] [Comment Edited] (SPARK-13352) BlockFetch does not scale well on large block

2016-04-17 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15234559#comment-15234559
 ] 

Davies Liu edited comment on SPARK-13352 at 4/18/16 6:40 AM:
-

The result is much better now (there is some fixed overhead for tests):
{code}
50M    2.2 seconds
100M   2.8 seconds
250M   3.7 seconds
500M   7.8 seconds
{code}


was (Author: davies):
The result is much better now (there is some fixed overhead for tests):
{code}
50M    2.2 seconds
100M   2.8 seconds
250M   3.7 seconds
500M   7.8 min
{code}

> BlockFetch does not scale well on large block
> -
>
> Key: SPARK-13352
> URL: https://issues.apache.org/jira/browse/SPARK-13352
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Spark Core
>Reporter: Davies Liu
>Assignee: Zhang, Liye
>Priority: Critical
> Fix For: 1.6.2, 2.0.0
>
>
> BlockManager.getRemoteBytes() performs poorly on large blocks
> {code}
>   test("block manager") {
> val N = 500 << 20
> val bm = sc.env.blockManager
> val blockId = TaskResultBlockId(0)
> val buffer = ByteBuffer.allocate(N)
> buffer.limit(N)
> bm.putBytes(blockId, buffer, StorageLevel.MEMORY_AND_DISK_SER)
> val result = bm.getRemoteBytes(blockId)
> assert(result.isDefined)
> assert(result.get.limit() === (N))
>   }
> {code}
> Here are the runtimes for different block sizes:
> {code}
> 50M    3 seconds
> 100M   7 seconds
> 250M   33 seconds
> 500M   2 min
> {code}






[jira] [Commented] (SPARK-14696) Needs implicit encoders for boxed primitive types

2016-04-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15245204#comment-15245204
 ] 

Apache Spark commented on SPARK-14696:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/12466

> Needs implicit encoders for boxed primitive types
> -
>
> Key: SPARK-14696
> URL: https://issues.apache.org/jira/browse/SPARK-14696
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> We currently only have implicit encoders for Scala primitive types. We should 
> also add implicit encoders for boxed primitives. Otherwise, the following 
> code would not have an encoder:
> {code}
> sqlContext.range(1000).map { i => i }
> {code}






[jira] [Assigned] (SPARK-14696) Needs implicit encoders for boxed primitive types

2016-04-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14696:


Assignee: Reynold Xin  (was: Apache Spark)

> Needs implicit encoders for boxed primitive types
> -
>
> Key: SPARK-14696
> URL: https://issues.apache.org/jira/browse/SPARK-14696
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> We currently only have implicit encoders for Scala primitive types. We should 
> also add implicit encoders for boxed primitives. Otherwise, the following 
> code would not have an encoder:
> {code}
> sqlContext.range(1000).map { i => i }
> {code}






[jira] [Assigned] (SPARK-14696) Needs implicit encoders for boxed primitive types

2016-04-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14696:


Assignee: Apache Spark  (was: Reynold Xin)

> Needs implicit encoders for boxed primitive types
> -
>
> Key: SPARK-14696
> URL: https://issues.apache.org/jira/browse/SPARK-14696
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>
> We currently only have implicit encoders for Scala primitive types. We should 
> also add implicit encoders for boxed primitives. Otherwise, the following 
> code would not have an encoder:
> {code}
> sqlContext.range(1000).map { i => i }
> {code}






[jira] [Created] (SPARK-14696) Needs implicit encoders for boxed primitive types

2016-04-17 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-14696:
---

 Summary: Needs implicit encoders for boxed primitive types
 Key: SPARK-14696
 URL: https://issues.apache.org/jira/browse/SPARK-14696
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin


We currently only have implicit encoders for Scala primitive types. We should 
also add implicit encoders for boxed primitives. Otherwise, the following code 
would not have an encoder:

{code}
sqlContext.range(1000).map { i => i }
{code}
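
As an illustration only (not part of the issue text), a minimal sketch of the workaround 
this proposal removes, assuming the Dataset-returning range API: the encoder for the 
boxed type has to be passed explicitly via Encoders.LONG.

{code}
import org.apache.spark.sql.{Encoders, SQLContext}

// Sketch: without an implicit encoder for java.lang.Long, the encoder must be
// supplied explicitly; the proposed implicits would make this unnecessary.
def boxedExample(sqlContext: SQLContext): Unit = {
  val ds = sqlContext.range(1000).map(i => i)(Encoders.LONG)
  println(ds.count())
}
{code}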







[jira] [Commented] (SPARK-14453) Consider removing SPARK_JAVA_OPTS env variable

2016-04-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15245169#comment-15245169
 ] 

Apache Spark commented on SPARK-14453:
--

User 'jerryshao' has created a pull request for this issue:
https://github.com/apache/spark/pull/12465

> Consider removing SPARK_JAVA_OPTS env variable
> --
>
> Key: SPARK-14453
> URL: https://issues.apache.org/jira/browse/SPARK-14453
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, YARN
>Reporter: Saisai Shao
>Priority: Minor
>
> SPARK_JAVA_OPTS has been deprecated since 1.0; with the release of the major version 
> (2.0), I think it would be better to remove support for this env variable.
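
For illustration only (not from the issue), a sketch of the migration users would do, 
assuming typical JVM options; -Dmy.app.setting is a made-up example flag.

{code}
# Before (deprecated since 1.0): one env variable applied to both driver and executors
export SPARK_JAVA_OPTS="-XX:+UseG1GC -Dmy.app.setting=foo"

# After: set the options per role, e.g. in conf/spark-defaults.conf or via --conf
spark.driver.extraJavaOptions    -XX:+UseG1GC -Dmy.app.setting=foo
spark.executor.extraJavaOptions  -XX:+UseG1GC -Dmy.app.setting=foo
{code}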






[jira] [Assigned] (SPARK-14453) Consider removing SPARK_JAVA_OPTS env variable

2016-04-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14453:


Assignee: Apache Spark

> Consider removing SPARK_JAVA_OPTS env variable
> --
>
> Key: SPARK-14453
> URL: https://issues.apache.org/jira/browse/SPARK-14453
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, YARN
>Reporter: Saisai Shao
>Assignee: Apache Spark
>Priority: Minor
>
> SPARK_JAVA_OPTS has been deprecated since 1.0; with the release of the major version 
> (2.0), I think it would be better to remove support for this env variable.






[jira] [Assigned] (SPARK-14453) Consider removing SPARK_JAVA_OPTS env variable

2016-04-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14453:


Assignee: (was: Apache Spark)

> Consider removing SPARK_JAVA_OPTS env variable
> --
>
> Key: SPARK-14453
> URL: https://issues.apache.org/jira/browse/SPARK-14453
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, YARN
>Reporter: Saisai Shao
>Priority: Minor
>
> SPARK_JAVA_OPTS has been deprecated since 1.0; with the release of the major version 
> (2.0), I think it would be better to remove support for this env variable.






[jira] [Commented] (SPARK-12810) PySpark CrossValidatorModel should support avgMetrics

2016-04-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15245152#comment-15245152
 ] 

Apache Spark commented on SPARK-12810:
--

User 'vectorijk' has created a pull request for this issue:
https://github.com/apache/spark/pull/12464

> PySpark CrossValidatorModel should support avgMetrics
> -
>
> Key: SPARK-12810
> URL: https://issues.apache.org/jira/browse/SPARK-12810
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Feynman Liang
>  Labels: starter
>
> The {CrossValidator} in Scala has supported {avgMetrics} since 1.5.0, which allows 
> the user to evaluate how well each {ParamMap} in the grid search performed 
> and identify the best parameters. We should support this in PySpark as well.
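
For reference, a minimal Scala sketch (not from the issue) of the behaviour PySpark would 
mirror; `cv` and `trainingData` are assumed to be a configured CrossValidator and a DataFrame.

{code}
// Scala side today: avgMetrics holds one averaged metric per ParamMap, in the same
// order as the param grid, so the best parameters can be identified.
val cvModel = cv.fit(trainingData)
cvModel.avgMetrics.zip(cv.getEstimatorParamMaps).foreach { case (metric, params) =>
  println(s"$metric for $params")
}
{code}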






[jira] [Assigned] (SPARK-12810) PySpark CrossValidatorModel should support avgMetrics

2016-04-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12810:


Assignee: (was: Apache Spark)

> PySpark CrossValidatorModel should support avgMetrics
> -
>
> Key: SPARK-12810
> URL: https://issues.apache.org/jira/browse/SPARK-12810
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Feynman Liang
>  Labels: starter
>
> The {CrossValidator} in Scala has supported {avgMetrics} since 1.5.0, which allows 
> the user to evaluate how well each {ParamMap} in the grid search performed 
> and identify the best parameters. We should support this in PySpark as well.






[jira] [Assigned] (SPARK-12810) PySpark CrossValidatorModel should support avgMetrics

2016-04-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12810:


Assignee: Apache Spark

> PySpark CrossValidatorModel should support avgMetrics
> -
>
> Key: SPARK-12810
> URL: https://issues.apache.org/jira/browse/SPARK-12810
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Feynman Liang
>Assignee: Apache Spark
>  Labels: starter
>
> The {CrossValidator} in Scala has supported {avgMetrics} since 1.5.0, which allows 
> the user to evaluate how well each {ParamMap} in the grid search performed 
> and identify the best parameters. We should support this in PySpark as well.






[jira] [Commented] (SPARK-13662) [SQL][Hive] Have SHOW TABLES return additional fields from Hive MetaStore

2016-04-17 Thread Evan Chan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15245129#comment-15245129
 ] 

Evan Chan commented on SPARK-13662:
---

Vijay,

That would be awesome!   Please go ahead.




> [SQL][Hive] Have SHOW TABLES return additional fields from Hive MetaStore 
> --
>
> Key: SPARK-13662
> URL: https://issues.apache.org/jira/browse/SPARK-13662
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.2, 1.6.0
> Environment: All
>Reporter: Evan Chan
>
> Currently, the SHOW TABLES command in Spark's Hive ThriftServer, or 
> equivalently the HiveContext.tables method, returns a DataFrame with only two 
> columns: the name of the table and whether it is temporary. It would be 
> really nice to add support for returning some extra information, such as:
> - Whether this table is Spark-only or a native Hive table
> - If Spark-only, the name of the data source
> - Potentially other properties
> The first two are really useful for BI environments that connect to multiple 
> data sources and work with both Hive and Spark.
> Some thoughts:
> - The SQL/HiveContext Catalog API might need to be expanded to return 
> something like a TableEntry, rather than just a tuple of (name, temporary).
> - I believe there is a Hive Catalog/client API to get information about each 
> table. I suppose one concern would be the speed of using this API. Perhaps 
> there are other APIs that can get this info faster.
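
A rough Scala sketch (purely illustrative, not an agreed design) of what such a TableEntry 
could look like; all field names are hypothetical.

{code}
// Hypothetical richer catalog return type, replacing the (name, isTemporary) tuple.
case class TableEntry(
    name: String,
    isTemporary: Boolean,
    isSparkOnly: Boolean,                          // Spark-only vs. native Hive table
    dataSource: Option[String],                    // e.g. "parquet" when Spark-only
    properties: Map[String, String] = Map.empty)   // room for other table properties
{code}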






[jira] [Commented] (SPARK-13904) Add support for pluggable cluster manager

2016-04-17 Thread Hemant Bhanawat (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15245118#comment-15245118
 ] 

Hemant Bhanawat commented on SPARK-13904:
-

[~kiszk] I am looking into this. 

> Add support for pluggable cluster manager
> -
>
> Key: SPARK-13904
> URL: https://issues.apache.org/jira/browse/SPARK-13904
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Reporter: Hemant Bhanawat
>
> Currently Spark allows only a few cluster managers, viz. YARN, Mesos and 
> Standalone. But, as Spark is now being used in newer and different use cases, 
> there is a need to allow other cluster managers to manage Spark 
> components. One such use case is embedding Spark components like the executor 
> and driver inside another process, which may be a datastore. This allows 
> colocation of data and processing. Another requirement that stems from such a 
> use case is that the executors/driver should not take the parent process down 
> when they go down, and that the components can be relaunched inside the same 
> process again. 
> So, this JIRA requests two functionalities:
> 1. Support for external cluster managers
> 2. Allow a cluster manager to clean up the tasks without taking the parent 
> process down. 
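
A rough Scala sketch (illustrative only, not a committed API) of the plug-in point this 
asks for; the trait and method names are hypothetical, written as if it lived inside 
Spark's scheduler package.

{code}
// Hypothetical extension point: an external cluster manager selected by master URL,
// responsible for creating the scheduler pieces that Spark currently hard-codes
// for Standalone/YARN/Mesos.
package org.apache.spark.scheduler

import org.apache.spark.SparkContext

trait ExternalClusterManager {
  // Chosen based on the master URL, e.g. "mydatastore://..."
  def canCreate(masterURL: String): Boolean
  def createTaskScheduler(sc: SparkContext, masterURL: String): TaskScheduler
  def createSchedulerBackend(sc: SparkContext, masterURL: String,
      scheduler: TaskScheduler): SchedulerBackend
}
{code}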






[jira] [Commented] (SPARK-14647) Group SQLContext/HiveContext state into PersistentState

2016-04-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15245109#comment-15245109
 ] 

Apache Spark commented on SPARK-14647:
--

User 'yhuai' has created a pull request for this issue:
https://github.com/apache/spark/pull/12463

> Group SQLContext/HiveContext state into PersistentState
> ---
>
> Key: SPARK-14647
> URL: https://issues.apache.org/jira/browse/SPARK-14647
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Andrew Or
>Assignee: Andrew Or
> Fix For: 2.0.0
>
>
> This is analogous to SPARK-13526, which moved some things into 
> `SessionState`. After this issue we'll have an analogous `PersistentState` 
> that groups things to be shared across sessions. This will simplify the 
> constructors of the contexts significantly by allowing us to pass fewer 
> things into the contexts.
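
A rough Scala sketch (illustrative, not the actual classes) of the grouping being proposed; 
the member lists are hypothetical.

{code}
import org.apache.spark.SparkContext

// Hypothetical shape: everything shared across sessions lives in one object, and each
// session keeps its own SessionState plus a reference to the shared state, so the
// context constructors take one argument instead of many.
class PersistentState(val sparkContext: SparkContext) {
  // e.g. external catalog (Hive metastore client), cached-data manager, SQL UI listener
}

class SessionState(val persistent: PersistentState) {
  // per-session pieces moved by SPARK-13526: conf, UDF registry, analyzer, ...
}
{code}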






[jira] [Assigned] (SPARK-14647) Group SQLContext/HiveContext state into PersistentState

2016-04-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14647:


Assignee: Andrew Or  (was: Apache Spark)

> Group SQLContext/HiveContext state into PersistentState
> ---
>
> Key: SPARK-14647
> URL: https://issues.apache.org/jira/browse/SPARK-14647
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Andrew Or
>Assignee: Andrew Or
> Fix For: 2.0.0
>
>
> This is analogous to SPARK-13526, which moved some things into 
> `SessionState`. After this issue we'll have an analogous `PersistentState` 
> that groups things to be shared across sessions. This will simplify the 
> constructors of the contexts significantly by allowing us to pass fewer 
> things into the contexts.






[jira] [Assigned] (SPARK-14647) Group SQLContext/HiveContext state into PersistentState

2016-04-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14647:


Assignee: Apache Spark  (was: Andrew Or)

> Group SQLContext/HiveContext state into PersistentState
> ---
>
> Key: SPARK-14647
> URL: https://issues.apache.org/jira/browse/SPARK-14647
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Andrew Or
>Assignee: Apache Spark
> Fix For: 2.0.0
>
>
> This is analogous to SPARK-13526, which moved some things into 
> `SessionState`. After this issue we'll have an analogous `PersistentState` 
> that groups things to be shared across sessions. This will simplify the 
> constructors of the contexts significantly by allowing us to pass fewer 
> things into the contexts.






[jira] [Commented] (SPARK-14647) Group SQLContext/HiveContext state into PersistentState

2016-04-17 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15245100#comment-15245100
 ] 

Yin Huai commented on SPARK-14647:
--

It seems that the test still timed out after we reverted the commit 
(https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.6/628/testReport/org.apache.spark.sql.hive/HiveSparkSubmitSuite/SPARK_9757_Persist_Parquet_relation_with_decimal_column/).

> Group SQLContext/HiveContext state into PersistentState
> ---
>
> Key: SPARK-14647
> URL: https://issues.apache.org/jira/browse/SPARK-14647
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Andrew Or
>Assignee: Andrew Or
> Fix For: 2.0.0
>
>
> This is analogous to SPARK-13526, which moved some things into 
> `SessionState`. After this issue we'll have an analogous `PersistentState` 
> that groups things to be shared across sessions. This will simplify the 
> constructors of the contexts significantly by allowing us to pass fewer 
> things into the contexts.






[jira] [Updated] (SPARK-14695) Error occurs when using OFF_HEAP persistent level

2016-04-17 Thread Liang Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang Lee updated SPARK-14695:
--
Description: 
When running a PageRank job through the default examples, e.g., the class 
'org.apache.spark.examples.graphx.Analytics' in the 
spark-examples-1.6.0-hadoop2.6.0.jar package, we got the following errors:
16/04/18 03:16:55 INFO scheduler.TaskSetManager: Finished task 2.0 in stage 9.0 
(TID 66) in 1662 ms on R1S1 (1/10)
16/04/18 03:16:55 INFO scheduler.TaskSetManager: Finished task 8.0 in stage 9.0 
(TID 73) in 1663 ms on R1S1 (2/10)
16/04/18 03:16:55 INFO scheduler.TaskSetManager: Finished task 5.0 in stage 9.0 
(TID 70) in 1672 ms on R1S1 (3/10)
16/04/18 03:16:55 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 9.0 
(TID 69) in 1680 ms on R1S1 (4/10)
16/04/18 03:16:55 INFO scheduler.TaskSetManager: Finished task 7.0 in stage 9.0 
(TID 72) in 1678 ms on R1S1 (5/10)
16/04/18 03:16:55 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 9.0 
(TID 67) in 1682 ms on R1S1 (6/10)
16/04/18 03:16:55 INFO scheduler.TaskSetManager: Finished task 4.0 in stage 9.0 
(TID 75) in 1710 ms on R1S1 (7/10)
16/04/18 03:16:55 INFO scheduler.TaskSetManager: Finished task 3.0 in stage 9.0 
(TID 74) in 1729 ms on R1S1 (8/10)
16/04/18 03:16:55 INFO scheduler.TaskSetManager: Finished task 9.0 in stage 9.0 
(TID 68) in 1838 ms on R1S1 (9/10)
16/04/18 03:17:25 WARN scheduler.TaskSetManager: Lost task 6.0 in stage 9.0 
(TID 71, R1S1): java.lang.IllegalArgumentException: requirement failed: 
sizeInBytes was negative: -1
at scala.Predef$.require(Predef.scala:233)
at org.apache.spark.storage.BlockInfo.markReady(BlockInfo.scala:55)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:822)
at 
org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:645)
at 
org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:153)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

16/04/18 03:17:25 INFO scheduler.TaskSetManager: Starting task 6.1 in stage 9.0 
(TID 76, R1S1, partition 6,PROCESS_LOCAL, 2171 bytes)
16/04/18 03:17:25 DEBUG hdfs.DFSClient: DataStreamer block 
BP-1194875811-10.3.1.3-1460617951862:blk_1073742842_2018 sending packet packet 
seqno:-1 offsetInBlock:0 lastPacketInBlock:false lastByteOffsetInBlock: 0
16/04/18 03:17:25 DEBUG hdfs.DFSClient: DFSClient seqno: -1 status: SUCCESS 
status: SUCCESS downstreamAckTimeNanos: 653735
16/04/18 03:17:25 WARN scheduler.TaskSetManager: Lost task 6.1 in stage 9.0 
(TID 76, R1S1): org.apache.spark.storage.BlockException: Block manager failed 
to return cached value for rdd_28_6!
at 
org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:158)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)



We use the following script to submit the job:
/Hadoop/spark-1.6.0-bin-hadoop2.6/bin/spark-submit --class 
org.apache.spark.examples.graphx.Analytics 
/Hadoop/spark-1.6.0-bin-hadoop2.6/lib/spark-examples-1.6.0-hadoop2.6.0.jar 
pagerank /data/soc-LiveJournal1.txt --output=/output/live-off.res --numEPart=10 
--numIter=1 --edgeStorageLevel=OFF_HEAP --vertexStorageLevel=OFF_HEAP

When we set the storage level to MEMORY_ONLY

[jira] [Comment Edited] (SPARK-14409) Investigate adding a RankingEvaluator to ML

2016-04-17 Thread Yong Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15245084#comment-15245084
 ] 

Yong Tang edited comment on SPARK-14409 at 4/18/16 3:12 AM:


Thanks [~mlnick] for the references. I will take a look at those and see what 
we can do with them.

By the way, I initially thought I could easily call RankingMetrics in 
mllib.evaluation from the new ml.evaluation.RankingEvaluator. However, I am 
having some trouble with the implementation because the 
{code}
  @Since("2.0.0")
  override def evaluate(dataset: Dataset[_]): Double
{code}
in `RankingEvaluator` is not so easy to convert into RankingMetrics's input 
(`RDD[(Array[T], Array[T])]`).

I will do some further investigation. If I cannot find an easy way to convert 
the dataset into a generic `RDD[(Array[T], Array[T])]`, I will implement the 
methods directly in the new ml.evaluation (instead of calling mllib.evaluation).


was (Author: yongtang):
Thanks [~mlnick] for the references. I will take a look at those and see what 
we can do with them.

By the way, I initially thought I could easily call RankingMetrics in 
mllib.evaluation from the new ml.evaluation.RankingEvaluator. However, I am 
having some trouble with the implementation because the 
`
  @Since("2.0.0")
  override def evaluate(dataset: Dataset[_]): Double
`
in `RankingEvaluator` is not so easy to convert into RankingMetrics's input 
(`RDD[(Array[T], Array[T])]`).

I will do some further investigation. If I cannot find an easy way to convert 
the dataset into a generic `RDD[(Array[T], Array[T])]`, I will implement the 
methods directly in the new ml.evaluation (instead of calling mllib.evaluation).

> Investigate adding a RankingEvaluator to ML
> ---
>
> Key: SPARK-14409
> URL: https://issues.apache.org/jira/browse/SPARK-14409
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Nick Pentreath
>Priority: Minor
>
> {{mllib.evaluation}} contains a {{RankingMetrics}} class, while there is no 
> {{RankingEvaluator}} in {{ml.evaluation}}. Such an evaluator can be useful 
> for recommendation evaluation (and can be useful in other settings 
> potentially).
> Should be thought about in conjunction with adding the "recommendAll" methods 
> in SPARK-13857, so that top-k ranking metrics can be used in cross-validators.






[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML

2016-04-17 Thread Yong Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15245084#comment-15245084
 ] 

Yong Tang commented on SPARK-14409:
---

Thanks [~mlnick] for the references. I will take a look at those and see what 
we can do with them.

By the way, I initially thought I could easily call RankingMetrics in 
mllib.evaluation from the new ml.evaluation.RankingEvaluator. However, I am 
having some trouble with the implementation because the 
`
  @Since("2.0.0")
  override def evaluate(dataset: Dataset[_]): Double
`
in `RankingEvaluator` is not so easy to convert into RankingMetrics's input 
(`RDD[(Array[T], Array[T])]`).

I will do some further investigation. If I cannot find an easy way to convert 
the dataset into a generic `RDD[(Array[T], Array[T])]`, I will implement the 
methods directly in the new ml.evaluation (instead of calling mllib.evaluation).
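
For what it is worth, a minimal sketch of the conversion being discussed, assuming 
(hypothetically) that the evaluated Dataset carries array-typed "prediction" and "label" 
columns:

{code}
import org.apache.spark.mllib.evaluation.RankingMetrics
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Dataset

// Turn the Dataset handed to evaluate() into RankingMetrics' expected input.
def toRankingInput(dataset: Dataset[_]): RDD[(Array[Double], Array[Double])] = {
  val spark = dataset.sparkSession
  import spark.implicits._
  dataset.select("prediction", "label")
    .as[(Seq[Double], Seq[Double])]
    .rdd
    .map { case (pred, lab) => (pred.toArray, lab.toArray) }
}

// val metric = new RankingMetrics(toRankingInput(dataset)).meanAveragePrecision
{code}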

> Investigate adding a RankingEvaluator to ML
> ---
>
> Key: SPARK-14409
> URL: https://issues.apache.org/jira/browse/SPARK-14409
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Nick Pentreath
>Priority: Minor
>
> {{mllib.evaluation}} contains a {{RankingMetrics}} class, while there is no 
> {{RankingEvaluator}} in {{ml.evaluation}}. Such an evaluator can be useful 
> for recommendation evaluation (and can be useful in other settings 
> potentially).
> Should be thought about in conjunction with adding the "recommendAll" methods 
> in SPARK-13857, so that top-k ranking metrics can be used in cross-validators.






[jira] [Commented] (SPARK-14695) Error occurs when using OFF_HEAP persistent level

2016-04-17 Thread Liang Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15245083#comment-15245083
 ] 

Liang Lee commented on SPARK-14695:
---

The cluster environment is like this:
3 servers in total. One acts as NameNode, Spark Master and Tachyon Master; the 
other two act as DataNode, Spark Worker and Tachyon Worker.
We set the Tachyon worker memory to 64GB per node (128GB in total), and only the 
memory tier is enabled in Tachyon.
We submit the job on the master node.

The strangest part is:
on the same worker, most executors finish the task correctly, but 1 or 
2 executors fail to cache the block and cause the above errors.

> Error occurs when using OFF_HEAP persistent level 
> --
>
> Key: SPARK-14695
> URL: https://issues.apache.org/jira/browse/SPARK-14695
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Spark Core
>Affects Versions: 1.6.0
> Environment: Spark 1.6.0
> Tachyon 0.8.2
> Hadoop 2.6.0
>Reporter: Liang Lee
>
> When running a PageRank job through the default examples, e.g., the class 
> 'org.apache.spark.examples.graphx.Analytics' in the 
> spark-examples-1.6.0-hadoop2.6.0.jar package, we got the following errors:
> 16/04/18 02:30:01 WARN scheduler.TaskSetManager: Lost task 9.0 in stage 6.0 
> (TID 53, R1S1): java.lang.IllegalArgumentException: requirement failed: 
> sizeInBytes was negative: -1
> at scala.Predef$.require(Predef.scala:233)
> at org.apache.spark.storage.BlockInfo.markReady(BlockInfo.scala:55)
> at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:822)
> at org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:645)
> at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:153)
> at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:89)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> We use the following script to submit the job:
> /Hadoop/spark-1.6.0-bin-hadoop2.6/bin/spark-submit --class 
> org.apache.spark.examples.graphx.Analytics 
> /Hadoop/spark-1.6.0-bin-hadoop2.6/lib/spark-examples-1.6.0-hadoop2.6.0.jar 
> pagerank /data/soc-LiveJournal1.txt --output=/output/live-off.res 
> --numEPart=10 --numIter=1 --edgeStorageLevel=OFF_HEAP 
> --vertexStorageLevel=OFF_HEAP
> When we set the storage level to MEMORY_ONLY or DISK_ONLY, there is no error 
> and the job can finish correctly.
> But when we set the storage level to OFF_HEAP, which means using Tachyon for 
> the storage process, the error occurs.
> The executor stack is like this; it seems the block write to Tachyon failed.
> 16/04/18 02:25:54 ERROR ExternalBlockStore: Error in putValues(rdd_20_1)
> java.io.IOException: Fail to cache: null
>   at 
> tachyon.client.file.FileOutStream.handleCacheWriteException(FileOutStream.java:276)
>   at tachyon.client.file.FileOutStream.close(FileOutStream.java:165)
>   at 
> org.apache.spark.storage.TachyonBlockManager.putValues(TachyonBlockManager.scala:126)
>   at 
> org.apache.spark.storage.ExternalBlockStore.putIntoExternalBlockStore(ExternalBlockStore.scala:79)
>   at 
> org.apache.spark.storage.ExternalBlockStore.putIterator(ExternalBlockStore.scala:67)
>   at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:798)
>   at 
> org.apac

[jira] [Commented] (SPARK-14628) Remove all the Options in TaskMetrics

2016-04-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15245079#comment-15245079
 ] 

Apache Spark commented on SPARK-14628:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/12462

> Remove all the Options in TaskMetrics
> -
>
> Key: SPARK-14628
> URL: https://issues.apache.org/jira/browse/SPARK-14628
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Reynold Xin
>Assignee: Wenchen Fan
> Fix For: 2.0.0
>
>
> Part of the reason why TaskMetrics and its callers are complicated is the 
> optional metrics we collect, including input, output, shuffle read, and 
> shuffle write. Given that their default values are zero, I think we can always 
> track them. It is usually very obvious whether a task is supposed to read any 
> data or not. By always tracking them, we can remove a lot of map, foreach, 
> flatMap, and getOrElse(0L) calls throughout Spark.
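
A tiny before/after Scala sketch (simplified, hypothetical shapes, not the real classes) of 
the cleanup this enables:

{code}
// Simplified stand-in for one of the optional metric groups.
case class InputMetrics(bytesRead: Long = 0L)

// Before: the metric is optional, so every caller needs Option plumbing.
case class TaskMetricsOld(inputMetrics: Option[InputMetrics] = None)
def bytesReadOld(m: TaskMetricsOld): Long = m.inputMetrics.map(_.bytesRead).getOrElse(0L)

// After: the metric always exists and simply stays at zero when the task reads nothing.
case class TaskMetricsNew(inputMetrics: InputMetrics = InputMetrics())
def bytesReadNew(m: TaskMetricsNew): Long = m.inputMetrics.bytesRead
{code}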






[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML

2016-04-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15245077#comment-15245077
 ] 

Apache Spark commented on SPARK-14409:
--

User 'yongtang' has created a pull request for this issue:
https://github.com/apache/spark/pull/12461

> Investigate adding a RankingEvaluator to ML
> ---
>
> Key: SPARK-14409
> URL: https://issues.apache.org/jira/browse/SPARK-14409
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Nick Pentreath
>Priority: Minor
>
> {{mllib.evaluation}} contains a {{RankingMetrics}} class, while there is no 
> {{RankingEvaluator}} in {{ml.evaluation}}. Such an evaluator can be useful 
> for recommendation evaluation (and can be useful in other settings 
> potentially).
> Should be thought about in conjunction with adding the "recommendAll" methods 
> in SPARK-13857, so that top-k ranking metrics can be used in cross-validators.






[jira] [Assigned] (SPARK-14409) Investigate adding a RankingEvaluator to ML

2016-04-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14409:


Assignee: (was: Apache Spark)

> Investigate adding a RankingEvaluator to ML
> ---
>
> Key: SPARK-14409
> URL: https://issues.apache.org/jira/browse/SPARK-14409
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Nick Pentreath
>Priority: Minor
>
> {{mllib.evaluation}} contains a {{RankingMetrics}} class, while there is no 
> {{RankingEvaluator}} in {{ml.evaluation}}. Such an evaluator can be useful 
> for recommendation evaluation (and can be useful in other settings 
> potentially).
> Should be thought about in conjunction with adding the "recommendAll" methods 
> in SPARK-13857, so that top-k ranking metrics can be used in cross-validators.






[jira] [Assigned] (SPARK-14409) Investigate adding a RankingEvaluator to ML

2016-04-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14409:


Assignee: Apache Spark

> Investigate adding a RankingEvaluator to ML
> ---
>
> Key: SPARK-14409
> URL: https://issues.apache.org/jira/browse/SPARK-14409
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Nick Pentreath
>Assignee: Apache Spark
>Priority: Minor
>
> {{mllib.evaluation}} contains a {{RankingMetrics}} class, while there is no 
> {{RankingEvaluator}} in {{ml.evaluation}}. Such an evaluator can be useful 
> for recommendation evaluation (and can be useful in other settings 
> potentially).
> Should be thought about in conjunction with adding the "recommendAll" methods 
> in SPARK-13857, so that top-k ranking metrics can be used in cross-validators.






[jira] [Created] (SPARK-14695) Error occurs when using OFF_HEAP persistent level

2016-04-17 Thread Liang Lee (JIRA)
Liang Lee created SPARK-14695:
-

 Summary: Error occurs when using OFF_HEAP persistent level 
 Key: SPARK-14695
 URL: https://issues.apache.org/jira/browse/SPARK-14695
 Project: Spark
  Issue Type: Bug
  Components: Block Manager, Spark Core
Affects Versions: 1.6.0
 Environment: Spark 1.6.0
Tachyon 0.8.2
Hadoop 2.6.0
Reporter: Liang Lee


When running a PageRank job through the default examples, e.g., the class 
'org.apache.spark.examples.graphx.Analytics' in the 
spark-examples-1.6.0-hadoop2.6.0.jar package, we got the following errors:
16/04/18 02:30:01 WARN scheduler.TaskSetManager: Lost task 9.0 in stage 6.0 
(TID 53, R1S1): java.lang.IllegalArgumentException: requirement failed: 
sizeInBytes was negative: -1
at scala.Predef$.require(Predef.scala:233)
at org.apache.spark.storage.BlockInfo.markReady(BlockInfo.scala:55)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:822)
at org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:645)
at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:153)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)


We use the following script to submit the job:
/Hadoop/spark-1.6.0-bin-hadoop2.6/bin/spark-submit --class 
org.apache.spark.examples.graphx.Analytics 
/Hadoop/spark-1.6.0-bin-hadoop2.6/lib/spark-examples-1.6.0-hadoop2.6.0.jar 
pagerank /data/soc-LiveJournal1.txt --output=/output/live-off.res --numEPart=10 
--numIter=1 --edgeStorageLevel=OFF_HEAP --vertexStorageLevel=OFF_HEAP

When we set the storage level to MEMORY_ONLY or DISK_ONLY, there is no error 
and the job can finish correctly.
But when we set the storage level to OFF_HEAP, which means using Tachyon for 
the storage process, the error occurs.

The executor stack is like this; it seems the block write to Tachyon failed.
16/04/18 02:25:54 ERROR ExternalBlockStore: Error in putValues(rdd_20_1)
java.io.IOException: Fail to cache: null
at 
tachyon.client.file.FileOutStream.handleCacheWriteException(FileOutStream.java:276)
at tachyon.client.file.FileOutStream.close(FileOutStream.java:165)
at 
org.apache.spark.storage.TachyonBlockManager.putValues(TachyonBlockManager.scala:126)
at 
org.apache.spark.storage.ExternalBlockStore.putIntoExternalBlockStore(ExternalBlockStore.scala:79)
at 
org.apache.spark.storage.ExternalBlockStore.putIterator(ExternalBlockStore.scala:67)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:798)
at 
org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:645)
at 
org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:153)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
at 
org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.

[jira] [Created] (SPARK-14694) Thrift Server + Hive Metastore + Kerberos doesn't work

2016-04-17 Thread zhangguancheng (JIRA)
zhangguancheng created SPARK-14694:
--

 Summary: Thrift Server + Hive Metastore + Kerberos doesn't work
 Key: SPARK-14694
 URL: https://issues.apache.org/jira/browse/SPARK-14694
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.6.1, 1.6.0
 Environment: Spark 1.6.1. compiled with hadoop 2.6.0, yarn, hive
Hadoop 2.6.4 
Hive 1.1.1 
Kerberos
Reporter: zhangguancheng


My Hive Metastore is MySQL-based. I started a Spark Thrift Server on the same 
node as the Hive Metastore. I can open beeline and run select statements, but 
for some commands like "show databases", I get an error:

{quote}
ERROR pool-24-thread-1 org.apache.thrift.transport.TSaslTransport:315 SASL 
negotiation failure
javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: 
No valid credentials provided (Mechanism level: Failed to find any Kerberos 
tgt)]
at 
com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211)
at 
org.apache.thrift.transport.TSaslClientTransport.handleSaslStartMessage(TSaslClientTransport.java:94)
at org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:271)
at 
org.apache.thrift.transport.TSaslClientTransport.open(TSaslClientTransport.java:37)
at 
org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:52)
at 
org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:49)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
at 
org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport.open(TUGIAssumingTransport.java:49)
at 
org.apache.hadoop.hive.metastore.HiveMetaStoreClient.open(HiveMetaStoreClient.java:420)
at 
org.apache.hadoop.hive.metastore.HiveMetaStoreClient.(HiveMetaStoreClient.java:236)
at 
org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.(SessionHiveMetaStoreClient.java:74)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
at 
org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1521)
at 
org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.(RetryingMetaStoreClient.java:86)
at 
org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:132)
at 
org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:104)
at 
org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:3005)
at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3024)
at org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1234)
at org.apache.hadoop.hive.ql.exec.DDLTask.showDatabases(DDLTask.java:2223)
at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:385)
at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:160)
at 
org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:88)
at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1653)
at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1412)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1195)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1059)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1049)
at 
org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$runHive$1.apply(ClientWrapper.scala:495)
at 
org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$runHive$1.apply(ClientWrapper.scala:484)
at 
org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$withHiveState$1.apply(ClientWrapper.scala:290)
at 
org.apache.spark.sql.hive.client.ClientWrapper.liftedTree1$1(ClientWrapper.scala:237)
at 
org.apache.spark.sql.hive.client.ClientWrapper.retryLocked(ClientWrapper.scala:236)
at 
org.apache.spark.sql.hive.client.ClientWrapper.withHiveState(ClientWrapper.scala:279)
at 
org.apache.spark.sql.hive.client.ClientWrapper.runHive(ClientWrapper.scala:484)
at 
org.apache.spark.sql.hive.client.ClientWrapper.runSqlHive(ClientWrapper.scala:474)
at org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:605)
at 
org.apache.spark.sql.hive.execution.HiveNativeCommand.run(HiveNativeCommand.scala:33)
at 
org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58)
at 
org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56)
at 
org.ap

[jira] [Updated] (SPARK-14693) Spark Streaming Context Hangs on Start

2016-04-17 Thread Evan Oman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Evan Oman updated SPARK-14693:
--
Description: 
All,

I am trying to use Kinesis with Spark Streaming on Spark 1.6.0 via Databricks 
and my `ssc.start()` command is hanging. 

I am using the following function (based on [this 
guide|http://spark.apache.org/docs/latest/streaming-kinesis-integration.html], 
which, as an aside, contains some broken Github links) to make my Spark 
Streaming Context:

{code:borderStyle=solid}
def creatingFunc(sc: SparkContext): StreamingContext = 
{
// Create a StreamingContext
val ssc = new StreamingContext(sc, 
Seconds(batchIntervalSeconds))

// Create a Kinesis stream
val kinesisStream = KinesisUtils.createStream(ssc,
kinesisAppName, kinesisStreamName,
kinesisEndpointUrl, 
RegionUtils.getRegionByEndpoint(kinesisEndpointUrl).getName,
InitialPositionInStream.LATEST, 
Seconds(kinesisCheckpointIntervalSeconds),
StorageLevel.MEMORY_AND_DISK_SER_2, 
config.awsAccessKeyId, config.awsSecretKey)

kinesisStream.print()

ssc.remember(Minutes(1))
ssc.checkpoint(checkpointDir)
ssc
}
{code}


However when I run the following to start the streaming context:

{code:borderStyle=solid}
// Stop any existing StreamingContext 
val stopActiveContext = true
if (stopActiveContext) {
  StreamingContext.getActive.foreach { _.stop(stopSparkContext = false) }
} 

// Get or create a streaming context.
val ssc = StreamingContext.getActiveOrCreate(() => main.creatingFunc(sc))

// This starts the streaming context in the background. 
ssc.start()
{code}

The last bit, `ssc.start()`, hangs indefinitely without issuing any log 
messages. I am running this on a freshly spun up cluster with no other 
notebooks attached so there aren't any other streaming contexts running.

Any thoughts?

Additionally, here are the libraries I am using (from my build.sbt file):

{code:borderStyle=solid}
"org.apache.spark" % "spark-core_2.10" % "1.6.0"
"org.apache.spark" % "spark-sql_2.10" % "1.6.0"
"org.apache.spark" % "spark-streaming-kinesis-asl_2.10" % "1.6.0"
"org.apache.spark" % "spark-streaming_2.10" % "1.6.0"
{code}

  was:
All,

I am trying to use Kinesis with Spark Streaming on Spark 1.6.0 via Databricks 
and my `ssc.start()` command is hanging. 

I am using the following function (based on 
http://spark.apache.org/docs/latest/streaming-kinesis-integration.html, which, 
as an aside, contains some broken Github links) to make my Spark Streaming 
Context:

{code:borderStyle=solid}
def creatingFunc(sc: SparkContext): StreamingContext = 
{
// Create a StreamingContext
val ssc = new StreamingContext(sc, 
Seconds(batchIntervalSeconds))

// Create a Kinesis stream
val kinesisStream = KinesisUtils.createStream(ssc,
kinesisAppName, kinesisStreamName,
kinesisEndpointUrl, 
RegionUtils.getRegionByEndpoint(kinesisEndpointUrl).getName,
InitialPositionInStream.LATEST, 
Seconds(kinesisCheckpointIntervalSeconds),
StorageLevel.MEMORY_AND_DISK_SER_2, 
config.awsAccessKeyId, config.awsSecretKey)

kinesisStream.print()

ssc.remember(Minutes(1))
ssc.checkpoint(checkpointDir)
ssc
}
{code}


However when I run the following to start the streaming context:

{code:borderStyle=solid}
// Stop any existing StreamingContext 
val stopActiveContext = true
if (stopActiveContext) {
  StreamingContext.getActive.foreach { _.stop(stopSparkContext = false) }
} 

// Get or create a streaming context.
val ssc = StreamingContext.getActiveOrCreate(() => main.creatingFunc(sc))

// This starts the streaming context in the background. 
ssc.start()
{code}

The last bit, `ssc.start()`, hangs indefinitely without issuing any log 
messages. I am running this on a freshly spun up cluster with no other 
notebooks attached so there aren't any other streaming contexts running.

Any thoughts?

Additionally, here are the libraries I am using (from my build.sbt file):

{code:borderStyle=solid}
"org.apache.spark" % "spark-core_2.10" % "1.6.0"
"org.apache.spark" % "spark-sql_2.10" % "1.6.0"
"org.apache.spark" % "spark-streaming-kinesis-asl_2.10" % "1.6.0"
"org.apache.spark" % "spark-streaming_2.10" % "1.6.0"
{code}


> Spark Streaming Context Hangs on Start
> --
>
> Key: SPARK-14693
> URL: https://issues.apache.org/jira/browse/SPARK-14693
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.6.0, 1.6.1
> Environment: Databricks Cl

[jira] [Updated] (SPARK-14693) Spark Streaming Context Hangs on Start

2016-04-17 Thread Evan Oman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Evan Oman updated SPARK-14693:
--
Description: 
All,

I am trying to use Kinesis with Spark Streaming on Spark 1.6.0 via Databricks 
and my `ssc.start()` command is hanging. 

I am using the following function (based on 
http://spark.apache.org/docs/latest/streaming-kinesis-integration.html, which, 
as an aside, contains some broken Github links) to make my Spark Streaming 
Context:

{code:borderStyle=solid}
def creatingFunc(sc: SparkContext): StreamingContext = 
{
// Create a StreamingContext
val ssc = new StreamingContext(sc, 
Seconds(batchIntervalSeconds))

// Create a Kinesis stream
val kinesisStream = KinesisUtils.createStream(ssc,
kinesisAppName, kinesisStreamName,
kinesisEndpointUrl, 
RegionUtils.getRegionByEndpoint(kinesisEndpointUrl).getName,
InitialPositionInStream.LATEST, 
Seconds(kinesisCheckpointIntervalSeconds),
StorageLevel.MEMORY_AND_DISK_SER_2, 
config.awsAccessKeyId, config.awsSecretKey)

kinesisStream.print()

ssc.remember(Minutes(1))
ssc.checkpoint(checkpointDir)
ssc
}
{code}


However when I run the following to start the streaming context:

{code:borderStyle=solid}
// Stop any existing StreamingContext 
val stopActiveContext = true
if (stopActiveContext) {
  StreamingContext.getActive.foreach { _.stop(stopSparkContext = false) }
} 

// Get or create a streaming context.
val ssc = StreamingContext.getActiveOrCreate(() => main.creatingFunc(sc))

// This starts the streaming context in the background. 
ssc.start()
{code}

The last bit, `ssc.start()`, hangs indefinitely without issuing any log 
messages. I am running this on a freshly spun up cluster with no other 
notebooks attached so there aren't any other streaming contexts running.

Any thoughts?

Additionally, here are the libraries I am using (from my build.sbt file):

{code:borderStyle=solid}
"org.apache.spark" % "spark-core_2.10" % "1.6.0"
"org.apache.spark" % "spark-sql_2.10" % "1.6.0"
"org.apache.spark" % "spark-streaming-kinesis-asl_2.10" % "1.6.0"
"org.apache.spark" % "spark-streaming_2.10" % "1.6.0"
{code}

  was:
All,

I am trying to use Kinesis with Spark Streaming on Spark 1.6.0 via Databricks 
and my `ssc.start()` command is hanging. 

I am using the following function to make my Spark Streaming Context:

{code:borderStyle=solid}
def creatingFunc(sc: SparkContext): StreamingContext = 
{
// Create a StreamingContext
val ssc = new StreamingContext(sc, 
Seconds(batchIntervalSeconds))

// Create a Kinesis stream
val kinesisStream = KinesisUtils.createStream(ssc,
kinesisAppName, kinesisStreamName,
kinesisEndpointUrl, 
RegionUtils.getRegionByEndpoint(kinesisEndpointUrl).getName,
InitialPositionInStream.LATEST, 
Seconds(kinesisCheckpointIntervalSeconds),
StorageLevel.MEMORY_AND_DISK_SER_2, 
config.awsAccessKeyId, config.awsSecretKey)

kinesisStream.print()

ssc.remember(Minutes(1))
ssc.checkpoint(checkpointDir)
ssc
}
{code}


However when I run the following to start the streaming context:

{code:borderStyle=solid}
// Stop any existing StreamingContext 
val stopActiveContext = true
if (stopActiveContext) {
  StreamingContext.getActive.foreach { _.stop(stopSparkContext = false) }
} 

// Get or create a streaming context.
val ssc = StreamingContext.getActiveOrCreate(() => main.creatingFunc(sc))

// This starts the streaming context in the background. 
ssc.start()
{code}

The last bit, `ssc.start()`, hangs indefinitely without issuing any log 
messages. I am running this on a freshly spun up cluster with no other 
notebooks attached so there aren't any other streaming contexts running.

Any thoughts?

Additionally, here are the libraries I am using (from my build.sbt file):

{code:borderStyle=solid}
"org.apache.spark" % "spark-core_2.10" % "1.6.0"
"org.apache.spark" % "spark-sql_2.10" % "1.6.0"
"org.apache.spark" % "spark-streaming-kinesis-asl_2.10" % "1.6.0"
"org.apache.spark" % "spark-streaming_2.10" % "1.6.0"
{code}


> Spark Streaming Context Hangs on Start
> --
>
> Key: SPARK-14693
> URL: https://issues.apache.org/jira/browse/SPARK-14693
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.6.0, 1.6.1
> Environment: Databricks Cloud
>Reporter: Evan Oman
>
> All,
> I am trying to use Kinesis with Spark Streaming on Spark 1.6.0 via Databricks 
> and my `ssc.start()` com

[jira] [Updated] (SPARK-14693) Spark Streaming Context Hangs on Start

2016-04-17 Thread Evan Oman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Evan Oman updated SPARK-14693:
--
Description: 
All,

I am trying to use Kinesis with Spark Streaming on Spark 1.6.0 via Databricks 
and my `ssc.start()` command is hanging. 

I am using the following function to make my Spark Streaming Context:

{code:borderStyle=solid}
def creatingFunc(sc: SparkContext): StreamingContext = 
{
// Create a StreamingContext
val ssc = new StreamingContext(sc, 
Seconds(batchIntervalSeconds))

// Create a Kinesis stream
val kinesisStream = KinesisUtils.createStream(ssc,
kinesisAppName, kinesisStreamName,
kinesisEndpointUrl, 
RegionUtils.getRegionByEndpoint(kinesisEndpointUrl).getName,
InitialPositionInStream.LATEST, 
Seconds(kinesisCheckpointIntervalSeconds),
StorageLevel.MEMORY_AND_DISK_SER_2, 
config.awsAccessKeyId, config.awsSecretKey)

kinesisStream.print()

ssc.remember(Minutes(1))
ssc.checkpoint(checkpointDir)
ssc
}
{code}


However when I run the following to start the streaming context:

{code:borderStyle=solid}
// Stop any existing StreamingContext 
val stopActiveContext = true
if (stopActiveContext) {
  StreamingContext.getActive.foreach { _.stop(stopSparkContext = false) }
} 

// Get or create a streaming context.
val ssc = StreamingContext.getActiveOrCreate(() => main.creatingFunc(sc))

// This starts the streaming context in the background. 
ssc.start()
{code}

The last bit, `ssc.start()`, hangs indefinitely without issuing any log 
messages. I am running this on a freshly spun up cluster with no other 
notebooks attached so there aren't any other streaming contexts running.

Any thoughts?

Additionally, here are the libraries I am using (from my build.sbt file):

{code:borderStyle=solid}
"org.apache.spark" % "spark-core_2.10" % "1.6.0"
"org.apache.spark" % "spark-sql_2.10" % "1.6.0"
"org.apache.spark" % "spark-streaming-kinesis-asl_2.10" % "1.6.0"
"org.apache.spark" % "spark-streaming_2.10" % "1.6.0"
{code}

  was:
All,

I am trying to use Kinesis with Spark Streaming on Spark 1.6.0 via Databricks 
and my `ssc.start()` command is hanging. 

I am using the following function to make my Spark Streaming Context:

{code:scala|borderStyle=solid}
def creatingFunc(sc: SparkContext): StreamingContext = 
{
// Create a StreamingContext
val ssc = new StreamingContext(sc, 
Seconds(batchIntervalSeconds))

// Create a Kinesis stream
val kinesisStream = KinesisUtils.createStream(ssc,
kinesisAppName, kinesisStreamName,
kinesisEndpointUrl, 
RegionUtils.getRegionByEndpoint(kinesisEndpointUrl).getName,
InitialPositionInStream.LATEST, 
Seconds(kinesisCheckpointIntervalSeconds),
StorageLevel.MEMORY_AND_DISK_SER_2, 
config.awsAccessKeyId, config.awsSecretKey)

kinesisStream.print()

ssc.remember(Minutes(1))
ssc.checkpoint(checkpointDir)
ssc
}
{code}


However when I run the following to start the streaming context:

{code:scala|borderStyle=solid}
// Stop any existing StreamingContext 
val stopActiveContext = true
if (stopActiveContext) {
  StreamingContext.getActive.foreach { _.stop(stopSparkContext = false) }
} 

// Get or create a streaming context.
val ssc = StreamingContext.getActiveOrCreate(() => main.creatingFunc(sc))

// This starts the streaming context in the background. 
ssc.start()
{code}

The last bit, `ssc.start()`, hangs indefinitely without issuing any log 
messages. I am running this on a freshly spun up cluster with no other 
notebooks attached so there aren't any other streaming contexts running.

Any thoughts?

Additionally, here are the libraries I am using (from my build.sbt file):

{code:borderStyle=solid}
"org.apache.spark" % "spark-core_2.10" % "1.6.0"
"org.apache.spark" % "spark-sql_2.10" % "1.6.0"
"org.apache.spark" % "spark-streaming-kinesis-asl_2.10" % "1.6.0"
"org.apache.spark" % "spark-streaming_2.10" % "1.6.0"
{code}


> Spark Streaming Context Hangs on Start
> --
>
> Key: SPARK-14693
> URL: https://issues.apache.org/jira/browse/SPARK-14693
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.6.0, 1.6.1
> Environment: Databricks Cloud
>Reporter: Evan Oman
>
> All,
> I am trying to use Kinesis with Spark Streaming on Spark 1.6.0 via Databricks 
> and my `ssc.start()` command is hanging. 
> I am using the following function to make my Spark Streaming Context:
> {code:borderStyle=solid}
> def creat

[jira] [Assigned] (SPARK-14127) [Table related commands] Describe table

2016-04-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14127:


Assignee: Apache Spark

> [Table related commands] Describe table
> ---
>
> Key: SPARK-14127
> URL: https://issues.apache.org/jira/browse/SPARK-14127
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Apache Spark
>
> TOK_DESCTABLE
> Describe a column/table/partition (see here and here). Seems we support 
> DESCRIBE and DESCRIBE EXTENDED. It will be good to also support other 
> syntaxes (and check if we are missing anything).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14693) Spark Streaming Context Hangs on Start

2016-04-17 Thread Evan Oman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Evan Oman updated SPARK-14693:
--
Description: 
All,

I am trying to use Kinesis with Spark Streaming on Spark 1.6.0 via Databricks 
and my `ssc.start()` command is hanging. 

I am using the following function to make my Spark Streaming Context:

{code:scala|borderStyle=solid}
def creatingFunc(sc: SparkContext): StreamingContext = 
{
// Create a StreamingContext
val ssc = new StreamingContext(sc, 
Seconds(batchIntervalSeconds))

// Create a Kinesis stream
val kinesisStream = KinesisUtils.createStream(ssc,
kinesisAppName, kinesisStreamName,
kinesisEndpointUrl, 
RegionUtils.getRegionByEndpoint(kinesisEndpointUrl).getName,
InitialPositionInStream.LATEST, 
Seconds(kinesisCheckpointIntervalSeconds),
StorageLevel.MEMORY_AND_DISK_SER_2, 
config.awsAccessKeyId, config.awsSecretKey)

kinesisStream.print()

ssc.remember(Minutes(1))
ssc.checkpoint(checkpointDir)
ssc
}
{code}


However when I run the following to start the streaming context:

{code:scala|borderStyle=solid}
// Stop any existing StreamingContext 
val stopActiveContext = true
if (stopActiveContext) {
  StreamingContext.getActive.foreach { _.stop(stopSparkContext = false) }
} 

// Get or create a streaming context.
val ssc = StreamingContext.getActiveOrCreate(() => main.creatingFunc(sc))

// This starts the streaming context in the background. 
ssc.start()
{code}

The last bit, `ssc.start()`, hangs indefinitely without issuing any log 
messages. I am running this on a freshly spun up cluster with no other 
notebooks attached so there aren't any other streaming contexts running.

Any thoughts?

Additionally, here are the libraries I am using (from my build.sbt file):

{code:borderStyle=solid}
"org.apache.spark" % "spark-core_2.10" % "1.6.0"
"org.apache.spark" % "spark-sql_2.10" % "1.6.0"
"org.apache.spark" % "spark-streaming-kinesis-asl_2.10" % "1.6.0"
"org.apache.spark" % "spark-streaming_2.10" % "1.6.0"
{code}

  was:
All,

I am trying to use Kinesis with Spark Streaming on Spark 1.6.0 via Databricks 
and my `ssc.start()` command is hanging. 

I am using the following function to make my Spark Streaming Context:

{code:borderStyle=solid}
def creatingFunc(sc: SparkContext): StreamingContext = 
{
// Create a StreamingContext
val ssc = new StreamingContext(sc, 
Seconds(batchIntervalSeconds))

// Create a Kinesis stream
val kinesisStream = KinesisUtils.createStream(ssc,
kinesisAppName, kinesisStreamName,
kinesisEndpointUrl, 
RegionUtils.getRegionByEndpoint(kinesisEndpointUrl).getName,
InitialPositionInStream.LATEST, 
Seconds(kinesisCheckpointIntervalSeconds),
StorageLevel.MEMORY_AND_DISK_SER_2, 
config.awsAccessKeyId, config.awsSecretKey)

kinesisStream.print()

ssc.remember(Minutes(1))
ssc.checkpoint(checkpointDir)
ssc
}
{code}


However when I run the following to start the streaming context:

{code:borderStyle=solid}
// Stop any existing StreamingContext 
val stopActiveContext = true
if (stopActiveContext) {
  StreamingContext.getActive.foreach { _.stop(stopSparkContext = false) }
} 

// Get or create a streaming context.
val ssc = StreamingContext.getActiveOrCreate(() => main.creatingFunc(sc))

// This starts the streaming context in the background. 
ssc.start()
{code}

The last bit, `ssc.start()`, hangs indefinitely without issuing any log 
messages. I am running this on a freshly spun up cluster with no other 
notebooks attached so there aren't any other streaming contexts running.

Any thoughts?

Additionally, here are the libraries I am using (from my build.sbt file):

{code:borderStyle=solid}
"org.apache.spark" % "spark-core_2.10" % "1.6.0"
"org.apache.spark" % "spark-sql_2.10" % "1.6.0"
"org.apache.spark" % "spark-streaming-kinesis-asl_2.10" % "1.6.0"
"org.apache.spark" % "spark-streaming_2.10" % "1.6.0"
{code}


> Spark Streaming Context Hangs on Start
> --
>
> Key: SPARK-14693
> URL: https://issues.apache.org/jira/browse/SPARK-14693
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.6.0, 1.6.1
> Environment: Databricks Cloud
>Reporter: Evan Oman
>
> All,
> I am trying to use Kinesis with Spark Streaming on Spark 1.6.0 via Databricks 
> and my `ssc.start()` command is hanging. 
> I am using the following function to make my Spark Streaming Context:
> {code:scala|borderStyle=solid}
> def

[jira] [Assigned] (SPARK-14127) [Table related commands] Describe table

2016-04-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14127:


Assignee: (was: Apache Spark)

> [Table related commands] Describe table
> ---
>
> Key: SPARK-14127
> URL: https://issues.apache.org/jira/browse/SPARK-14127
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>
> TOK_DESCTABLE
> Describe a column/table/partition (see here and here). Seems we support 
> DESCRIBE and DESCRIBE EXTENDED. It will be good to also support other 
> syntaxes (and check if we are missing anything).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14127) [Table related commands] Describe table

2016-04-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15245069#comment-15245069
 ] 

Apache Spark commented on SPARK-14127:
--

User 'dilipbiswal' has created a pull request for this issue:
https://github.com/apache/spark/pull/12460

> [Table related commands] Describe table
> ---
>
> Key: SPARK-14127
> URL: https://issues.apache.org/jira/browse/SPARK-14127
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>
> TOK_DESCTABLE
> Describe a column/table/partition (see here and here). Seems we support 
> DESCRIBE and DESCRIBE EXTENDED. It will be good to also support other 
> syntaxes (and check if we are missing anything).
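
As a rough illustration of the syntaxes in question, one could probe what the current 
parser accepts; this is a hypothetical sketch only, and the `sqlContext` and table name 
`t` are assumed rather than taken from this issue:

{code}
// Probe which DESCRIBE variants the current parser accepts; table `t` is hypothetical.
val variants = Seq(
  "DESCRIBE t",
  "DESCRIBE EXTENDED t",
  "DESCRIBE FORMATTED t",   // Hive-style formatted output
  "DESCRIBE t col1"         // column-level describe; exact syntax varies across Hive versions
)
variants.foreach { stmt =>
  try {
    sqlContext.sql(stmt).show()
  } catch {
    case e: Exception => println(s"'$stmt' not supported: ${e.getMessage}")
  }
}
{code}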



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14693) Spark Streaming Context Hangs on Start

2016-04-17 Thread Evan Oman (JIRA)
Evan Oman created SPARK-14693:
-

 Summary: Spark Streaming Context Hangs on Start
 Key: SPARK-14693
 URL: https://issues.apache.org/jira/browse/SPARK-14693
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.6.1, 1.6.0
 Environment: Databricks Cloud
Reporter: Evan Oman


All,

I am trying to use Kinesis with Spark Streaming on Spark 1.6.0 via Databricks 
and my `ssc.start()` command is hanging. 

I am using the following function to make my Spark Streaming Context:

{code:borderStyle=solid}
def creatingFunc(sc: SparkContext): StreamingContext = 
{
// Create a StreamingContext
val ssc = new StreamingContext(sc, 
Seconds(batchIntervalSeconds))

// Create a Kinesis stream
val kinesisStream = KinesisUtils.createStream(ssc,
kinesisAppName, kinesisStreamName,
kinesisEndpointUrl, 
RegionUtils.getRegionByEndpoint(kinesisEndpointUrl).getName,
InitialPositionInStream.LATEST, 
Seconds(kinesisCheckpointIntervalSeconds),
StorageLevel.MEMORY_AND_DISK_SER_2, 
config.awsAccessKeyId, config.awsSecretKey)

kinesisStream.print()

ssc.remember(Minutes(1))
ssc.checkpoint(checkpointDir)
ssc
}
{code}


However when I run the following to start the streaming context:

{code:borderStyle=solid}
// Stop any existing StreamingContext 
val stopActiveContext = true
if (stopActiveContext) {
  StreamingContext.getActive.foreach { _.stop(stopSparkContext = false) }
} 

// Get or create a streaming context.
val ssc = StreamingContext.getActiveOrCreate(() => main.creatingFunc(sc))

// This starts the streaming context in the background. 
ssc.start()
{code}

The last bit, `ssc.start()`, hangs indefinitely without issuing any log 
messages. I am running this on a freshly spun up cluster with no other 
notebooks attached so there aren't any other streaming contexts running.

Any thoughts?

Additionally, here are the libraries I am using (from my build.sbt file):

{code:borderStyle=solid}
"org.apache.spark" % "spark-core_2.10" % "1.6.0"
"org.apache.spark" % "spark-sql_2.10" % "1.6.0"
"org.apache.spark" % "spark-streaming-kinesis-asl_2.10" % "1.6.0"
"org.apache.spark" % "spark-streaming_2.10" % "1.6.0"
{code}
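
A minimal diagnostic sketch, assuming the same `ssc` built by `creatingFunc` above and 
the Spark 1.6 Streaming API: `start()` itself normally returns quickly, so checking the 
context state and waiting with a bounded timeout can help tell a genuinely blocked 
`start()` from a receiver that never comes up.

{code}
// Diagnostic sketch only; `ssc` is the StreamingContext from creatingFunc above.
ssc.start()

// start() normally returns almost immediately; the context should report ACTIVE here.
println(s"Streaming context state after start(): ${ssc.getState()}")

// Wait with a timeout instead of blocking forever, so a stuck receiver shows up as a timeout.
val finished = ssc.awaitTerminationOrTimeout(30 * 1000L)
println(s"Terminated within 30s: $finished")
{code}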



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14453) Consider removing SPARK_JAVA_OPTS env variable

2016-04-17 Thread Krishnan Narayan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15245065#comment-15245065
 ] 

Krishnan Narayan commented on SPARK-14453:
--

+1 

> Consider removing SPARK_JAVA_OPTS env variable
> --
>
> Key: SPARK-14453
> URL: https://issues.apache.org/jira/browse/SPARK-14453
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, YARN
>Reporter: Saisai Shao
>Priority: Minor
>
> SPARK_JAVA_OPTS has been deprecated since 1.0; with the release of the next major 
> version (2.0), I think it would be better to remove support for this env variable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14453) Consider removing SPARK_JAVA_OPTS env variable

2016-04-17 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15245048#comment-15245048
 ] 

Saisai Shao commented on SPARK-14453:
-

Yes, this should be part of SPARK-12344. However, since SPARK-12344 is itself already a 
subtask, I cannot make this a subtask of it.

> Consider removing SPARK_JAVA_OPTS env variable
> --
>
> Key: SPARK-14453
> URL: https://issues.apache.org/jira/browse/SPARK-14453
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, YARN
>Reporter: Saisai Shao
>Priority: Minor
>
> SPARK_JAVA_OPTS has been deprecated since 1.0; with the release of the next major 
> version (2.0), I think it would be better to remove support for this env variable.
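
For context, a hedged migration sketch: the per-role JVM option configs are the usual 
replacements for SPARK_JAVA_OPTS, settable in spark-defaults.conf or programmatically. 
The config keys are real; the option values below are placeholders.

{code}
import org.apache.spark.SparkConf

// Illustrative values only; the keys are the documented replacements for SPARK_JAVA_OPTS.
val conf = new SparkConf()
  .set("spark.driver.extraJavaOptions", "-XX:+UseG1GC")      // driver JVM options
  .set("spark.executor.extraJavaOptions", "-XX:+UseG1GC")    // executor JVM options
{code}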



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14692) Error While Setting the path for R front end

2016-04-17 Thread Niranjan Molkeri` (JIRA)
Niranjan Molkeri` created SPARK-14692:
-

 Summary: Error While Setting the path for R front end
 Key: SPARK-14692
 URL: https://issues.apache.org/jira/browse/SPARK-14692
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 1.6.1
 Environment: Mac OSX
Reporter: Niranjan Molkeri`


I am trying to set the environment path for SparkR in RStudio and am getting this error.

> .libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
> library(SparkR)
Error in library(SparkR) : there is no package called ‘SparkR’
> sc <- sparkR.init(master="local")
Error: could not find function "sparkR.init"


In the directory it points to, there is a directory called SparkR. I don't know how to 
proceed with this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13662) [SQL][Hive] Have SHOW TABLES return additional fields from Hive MetaStore

2016-04-17 Thread Vijay Parmar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15245029#comment-15245029
 ] 

Vijay Parmar commented on SPARK-13662:
--

Hi Evan,

I would like to look into the issue.

Please let me know if I can go ahead with it.


Thanks
Vijay

> [SQL][Hive] Have SHOW TABLES return additional fields from Hive MetaStore 
> --
>
> Key: SPARK-13662
> URL: https://issues.apache.org/jira/browse/SPARK-13662
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.2, 1.6.0
> Environment: All
>Reporter: Evan Chan
>
> Currently, the SHOW TABLES command in Spark's Hive ThriftServer, or 
> equivalently the HiveContext.tables method, returns a DataFrame with only two 
> columns: the name of the table and whether it is temporary.  It would be 
> really nice to add support to return some extra information, such as:
> - Whether this table is Spark-only or a native Hive table
> - If spark-only, the name of the data source
> - potentially other properties
> The first two are really useful for BI environments that connect to multiple 
> data sources and work with both Hive and Spark.
> Some thoughts:
> - The SQL/HiveContext Catalog API might need to be expanded to return 
> something like a TableEntry, rather than just a tuple of (name, temporary).
> - I believe there is a Hive Catalog/client API to get information about each 
> table.  I suppose one concern would be the speed of using this API.  Perhaps 
> there are other APIs that can get this info faster.
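
A rough sketch of the richer catalog entry being suggested; the name TableEntry and its 
fields are illustrative only, not an existing Spark API:

{code}
// Illustrative shape only, not the actual Catalog API.
case class TableEntry(
    name: String,
    isTemporary: Boolean,
    isSparkOnly: Boolean,            // Spark data source table vs. native Hive table
    dataSource: Option[String],      // e.g. Some("parquet") when Spark-only
    properties: Map[String, String] = Map.empty)
{code}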



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14691) Simplify and Unify Error Generation for Unsupported Alter Table DDL

2016-04-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14691:


Assignee: (was: Apache Spark)

> Simplify and Unify Error Generation for Unsupported Alter Table DDL
> ---
>
> Key: SPARK-14691
> URL: https://issues.apache.org/jira/browse/SPARK-14691
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>
> So far, we are capturing each unsupported Alter Table in separate visit 
> functions. They should be unified and issue a ParseException instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14691) Simplify and Unify Error Generation for Unsupported Alter Table DDL

2016-04-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14691:


Assignee: Apache Spark

> Simplify and Unify Error Generation for Unsupported Alter Table DDL
> ---
>
> Key: SPARK-14691
> URL: https://issues.apache.org/jira/browse/SPARK-14691
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>
> So far, we are capturing each unsupported Alter Table in separate visit 
> functions. They should be unified and issue a ParseException instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14691) Simplify and Unify Error Generation for Unsupported Alter Table DDL

2016-04-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15245028#comment-15245028
 ] 

Apache Spark commented on SPARK-14691:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/12459

> Simplify and Unify Error Generation for Unsupported Alter Table DDL
> ---
>
> Key: SPARK-14691
> URL: https://issues.apache.org/jira/browse/SPARK-14691
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>
> So far, we are capturing each unsupported Alter Table in separate visit 
> functions. They should be unified and issue a ParseException instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14691) Simplify and Unify Error Generation for Unsupported Alter Table DDL

2016-04-17 Thread Xiao Li (JIRA)
Xiao Li created SPARK-14691:
---

 Summary: Simplify and Unify Error Generation for Unsupported Alter 
Table DDL
 Key: SPARK-14691
 URL: https://issues.apache.org/jira/browse/SPARK-14691
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Xiao Li


So far, we are capturing each unsupported Alter Table in separate visit 
functions. They should be unified and issue a ParseException instead.
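
A purely illustrative sketch of the unification being described; the helper name and the 
exception type below are placeholders, since the real implementation would throw the 
parser's ParseException with statement context attached.

{code}
// Placeholder sketch: one shared helper instead of many per-command visit functions.
def operationNotAllowed(operation: String, sqlText: String): Nothing =
  throw new UnsupportedOperationException(
    s"Operation not allowed: $operation (in statement: $sqlText)")
{code}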



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14325) some strange name conflicts in `group_by`

2016-04-17 Thread Vijay Parmar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15245024#comment-15245024
 ] 

Vijay Parmar commented on SPARK-14325:
--

Hi Dmitriy,

I joined the community only a short time ago.

I would like to look into this issue if no other member has taken it up.

Please let me know if I can go ahead.

Thanks
Vijay 

> some strange name conflicts in `group_by`
> -
>
> Key: SPARK-14325
> URL: https://issues.apache.org/jira/browse/SPARK-14325
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.6.0, 1.6.1
> Environment: sparkR 1.6.0
>Reporter: Dmitriy Selivanov
>
> group_by strange behaviour when try to aggregate by column with name "x".
> consider following example
> {code}
> df
> # DataFrame[userId:bigint, type:string, j:int, x:int]
> df %>% group_by(df$userId, df$type, df$j) %>% agg(x = "sum")
> #Error in (function (classes, fdef, mtable)  : 
> #  unable to find an inherited method for function ‘agg’ for signature 
> ‘"character"’
> {code}
> after renaming x -> x2 it works just fine.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14647) Group SQLContext/HiveContext state into PersistentState

2016-04-17 Thread Andrew Or (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15245009#comment-15245009
 ] 

Andrew Or commented on SPARK-14647:
---

I've reverted it for now.

> Group SQLContext/HiveContext state into PersistentState
> ---
>
> Key: SPARK-14647
> URL: https://issues.apache.org/jira/browse/SPARK-14647
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Andrew Or
>Assignee: Andrew Or
> Fix For: 2.0.0
>
>
> This is analogous to SPARK-13526, which moved some things into 
> `SessionState`. After this issue we'll have an analogous `PersistentState` 
> that groups things to be shared across sessions. This will simplify the 
> constructors of the contexts significantly by allowing us to pass fewer 
> things into the contexts.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14586) SparkSQL doesn't parse decimal like Hive

2016-04-17 Thread Stephane Maarek (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15244986#comment-15244986
 ] 

Stephane Maarek commented on SPARK-14586:
-

Hi [~tsuresh], thanks for your reply. It makes sense! I'm using Hive 1.2.1.
My only concern: looking at the code, I understand why the number wouldn't be parsed 
correctly in Spark and Hive, but I don't understand why the Hive 1.2.1 CLI does parse it 
correctly (as seen in my troubleshooting). Isn't Spark using the exact same logic as Hive?

> SparkSQL doesn't parse decimal like Hive
> 
>
> Key: SPARK-14586
> URL: https://issues.apache.org/jira/browse/SPARK-14586
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Stephane Maarek
>
> create a test_data.csv with the following
> {code:none}
> a, 2.0
> ,3.0
> {code}
> (the space is intended before the 2)
> copy the test_data.csv to hdfs:///spark_testing_2
> go in hive, run the following statements
> {code:sql}
> CREATE SCHEMA IF NOT EXISTS spark_testing;
> DROP TABLE IF EXISTS spark_testing.test_csv_2;
> CREATE EXTERNAL TABLE `spark_testing.test_csv_2`(
>   column_1 varchar(10),
>   column_2 decimal(4,2))
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY ','
> STORED AS TEXTFILE LOCATION '/spark_testing_2'
> TBLPROPERTIES('serialization.null.format'='');
> select * from spark_testing.test_csv_2;
> OK
> a   2
> NULL3
> {code}
> As you can see, the value " 2" gets parsed correctly to 2
> Now onto Spark-shell:
> {code:java}
> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
> sqlContext.sql("select * from spark_testing.test_csv_2").show()
> +--------+--------+
> |column_1|column_2|
> +--------+--------+
> |       a|    null|
> |    null|    3.00|
> +--------+--------+
> {code}
> As you can see, the " 2" got parsed to null. Therefore Hive and Spark don't 
> have a similar parsing behavior for decimals. I wouldn't say it is a bug per 
> se, but it looks like a necessary improvement for the two engines to 
> converge. Hive version is 1.5.1
> Not sure if relevant, but Scala does parse numbers with leading space 
> correctly
> {code}
> scala> "2.0".toDouble
> res21: Double = 2.0
> scala> " 2.0".toDouble
> res22: Double = 2.0
> {code}
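
One possible workaround sketch until the two parsing behaviours converge, assuming the 
column is first read as a string (for example from a staging table with string columns); 
`rawDf` below is hypothetical: trim before casting so the leading space does not turn the 
value into null.

{code}
import org.apache.spark.sql.functions.{col, trim}

// `rawDf` is a hypothetical DataFrame in which column_2 is still a string.
val parsed = rawDf.withColumn("column_2", trim(col("column_2")).cast("decimal(4,2)"))
{code}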



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14642) import org.apache.spark.sql.expressions._ breaks udf under functions

2016-04-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15244984#comment-15244984
 ] 

Apache Spark commented on SPARK-14642:
--

User 'sbcd90' has created a pull request for this issue:
https://github.com/apache/spark/pull/12458

> import org.apache.spark.sql.expressions._ breaks udf under functions
> 
>
> Key: SPARK-14642
> URL: https://issues.apache.org/jira/browse/SPARK-14642
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Priority: Critical
>
> The following code works
> {code}
> scala> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.functions._
> scala> udf((v: String) => v.stripSuffix("-abc"))
> res0: org.apache.spark.sql.expressions.UserDefinedFunction = 
> UserDefinedFunction(,StringType,Some(List(StringType)))
> {code}
> But, the following does not
> {code}
> scala> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.functions._
> scala> import org.apache.spark.sql.expressions._
> import org.apache.spark.sql.expressions._
> scala> udf((v: String) => v.stripSuffix("-abc"))
> :30: error: No TypeTag available for String
>udf((v: String) => v.stripSuffix("-abc"))
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14642) import org.apache.spark.sql.expressions._ breaks udf under functions

2016-04-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14642:


Assignee: Apache Spark

> import org.apache.spark.sql.expressions._ breaks udf under functions
> 
>
> Key: SPARK-14642
> URL: https://issues.apache.org/jira/browse/SPARK-14642
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Apache Spark
>Priority: Critical
>
> The following code works
> {code}
> scala> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.functions._
> scala> udf((v: String) => v.stripSuffix("-abc"))
> res0: org.apache.spark.sql.expressions.UserDefinedFunction = 
> UserDefinedFunction(,StringType,Some(List(StringType)))
> {code}
> But, the following does not
> {code}
> scala> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.functions._
> scala> import org.apache.spark.sql.expressions._
> import org.apache.spark.sql.expressions._
> scala> udf((v: String) => v.stripSuffix("-abc"))
> :30: error: No TypeTag available for String
>udf((v: String) => v.stripSuffix("-abc"))
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14642) import org.apache.spark.sql.expressions._ breaks udf under functions

2016-04-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14642:


Assignee: (was: Apache Spark)

> import org.apache.spark.sql.expressions._ breaks udf under functions
> 
>
> Key: SPARK-14642
> URL: https://issues.apache.org/jira/browse/SPARK-14642
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Priority: Critical
>
> The following code works
> {code}
> scala> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.functions._
> scala> udf((v: String) => v.stripSuffix("-abc"))
> res0: org.apache.spark.sql.expressions.UserDefinedFunction = 
> UserDefinedFunction(,StringType,Some(List(StringType)))
> {code}
> But, the following does not
> {code}
> scala> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.functions._
> scala> import org.apache.spark.sql.expressions._
> import org.apache.spark.sql.expressions._
> scala> udf((v: String) => v.stripSuffix("-abc"))
> :30: error: No TypeTag available for String
>udf((v: String) => v.stripSuffix("-abc"))
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14632) randomSplit method fails on dataframes with maps in schema

2016-04-17 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-14632.
-
   Resolution: Fixed
 Assignee: Subhobrata Dey
Fix Version/s: 2.0.0

> randomSplit method fails on dataframes with maps in schema
> --
>
> Key: SPARK-14632
> URL: https://issues.apache.org/jira/browse/SPARK-14632
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.1
>Reporter: Stefano Costantini
>Assignee: Subhobrata Dey
> Fix For: 2.0.0
>
>
> Applying the randomSplit method to a dataframe with at least one map in the 
> schema results in an exception
> {noformat}
> org.apache.spark.sql.AnalysisException: cannot resolve 'features ASC' due to 
> data type mismatch: cannot sort data type map;
> {noformat}
> This bug can be reproduced as follows:
> {code}
> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
> import sqlContext.implicits._
> import org.apache.spark.sql._
> import org.apache.spark.sql.types._
> val arr = Array(("user1", Map("f1" -> 1.0, "f2" -> 1.0)), ("user2", Map("f2" 
> -> 1.0, "f3" -> 1.0)), ("user3",Map("f1" -> 1.0, "f2" -> 1.0)))
> val df = sc.parallelize(arr).toDF("user","features")
> df.printSchema
> val Array(split1, split2) = df.randomSplit(Array(0.7, 0.3), seed = 101L)
> {code}
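
A hedged workaround sketch until randomSplit handles map columns: the quoted error comes 
from randomSplit sorting by every column (and MapType is not sortable), so splitting on 
an explicit random column avoids the sort. As with randomSplit itself, the proportions 
are approximate.

{code}
import org.apache.spark.sql.functions.rand

// Split on an explicit random column instead of letting randomSplit sort the whole row;
// `df` is the DataFrame from the repro above.
val withRand = df.withColumn("__rand", rand(101L))
val split1 = withRand.filter(withRand("__rand") < 0.7).drop("__rand")
val split2 = withRand.filter(withRand("__rand") >= 0.7).drop("__rand")
{code}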



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12922) Implement gapply() on DataFrame in SparkR

2016-04-17 Thread Narine Kokhlikyan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15244918#comment-15244918
 ] 

Narine Kokhlikyan commented on SPARK-12922:
---

Hi [~sunrui],

I've made some progress in putting the logical and physical plans together and 
calling the R workers; however, I still have some questions.

1. I'm still not quite sure about the number of partitions. As you wrote in 
https://issues.apache.org/jira/browse/SPARK-6817, we need to tune the number of 
partitions based on "spark.sql.shuffle.partitions". What exactly do you mean by 
tuning? Repartitioning?

2. I have another question about grouping by keys: groupByKey with one key is 
fine, but if we have more than one key we probably need to introduce a case 
class. With a case class it looks okay too, but I'm not sure how convenient 
that is. Any ideas?

  case class KeyData(a: Int, b: Int)
  val gd1 = df.groupByKey(r => KeyData(r.getInt(0), r.getInt(1)))


Thanks,
Narine
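
Regarding the multi-key grouping question above, one hedged alternative sketch is a tuple 
key instead of a dedicated case class. It assumes `ds` is a Dataset[Row] and that the SQL 
implicits are in scope so an encoder for (Int, Int) is available:

{code}
// Assumes `ds: Dataset[Row]` and `sqlContext.implicits._` (or `spark.implicits._`) imported.
val grouped = ds.groupByKey(row => (row.getInt(0), row.getInt(1)))
{code}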

> Implement gapply() on DataFrame in SparkR
> -
>
> Key: SPARK-12922
> URL: https://issues.apache.org/jira/browse/SPARK-12922
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 1.6.0
>Reporter: Sun Rui
>
> gapply() applies an R function on groups grouped by one or more columns of a 
> DataFrame, and returns a DataFrame. It is like GroupedDataSet.flatMapGroups() 
> in the Dataset API.
> Two API styles are supported:
> 1.
> {code}
> gd <- groupBy(df, col1, ...)
> gapply(gd, function(grouping_key, group) {}, schema)
> {code}
> 2.
> {code}
> gapply(df, grouping_columns, function(grouping_key, group) {}, schema) 
> {code}
> R function input: grouping keys value, a local data.frame of this grouped 
> data 
> R function output: local data.frame
> Schema specifies the Row format of the output of the R function. It must 
> match the R function's output.
> Note that map-side combination (partial aggregation) is not supported, user 
> could do map-side combination via dapply().



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14642) import org.apache.spark.sql.expressions._ breaks udf under functions

2016-04-17 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15244899#comment-15244899
 ] 

Yin Huai commented on SPARK-14642:
--

[~sbcd90] Yea sure. I am not very sure about the right solution. But, having a 
PR can definitely help the discussion and help others better understand the 
problem :)

> import org.apache.spark.sql.expressions._ breaks udf under functions
> 
>
> Key: SPARK-14642
> URL: https://issues.apache.org/jira/browse/SPARK-14642
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Priority: Critical
>
> The following code works
> {code}
> scala> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.functions._
> scala> udf((v: String) => v.stripSuffix("-abc"))
> res0: org.apache.spark.sql.expressions.UserDefinedFunction = 
> UserDefinedFunction(,StringType,Some(List(StringType)))
> {code}
> But, the following does not
> {code}
> scala> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.functions._
> scala> import org.apache.spark.sql.expressions._
> import org.apache.spark.sql.expressions._
> scala> udf((v: String) => v.stripSuffix("-abc"))
> :30: error: No TypeTag available for String
>udf((v: String) => v.stripSuffix("-abc"))
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14642) import org.apache.spark.sql.expressions._ breaks udf under functions

2016-04-17 Thread Subhobrata Dey (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15244896#comment-15244896
 ] 

Subhobrata Dey commented on SPARK-14642:


Hello [~yhuai], I see that the issue gets resolved when the package 

{code:java}
org.apache.spark.sql.expressions.scala
{code}

does not exist and the file 

{code:java}
typed.scala
{code}

is put directly under the package 

{code:java}
org.apache.spark.sql.expressions
{code}

in spark-sql_.jar

Can I submit a PR for this?

> import org.apache.spark.sql.expressions._ breaks udf under functions
> 
>
> Key: SPARK-14642
> URL: https://issues.apache.org/jira/browse/SPARK-14642
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Priority: Critical
>
> The following code works
> {code}
> scala> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.functions._
> scala> udf((v: String) => v.stripSuffix("-abc"))
> res0: org.apache.spark.sql.expressions.UserDefinedFunction = 
> UserDefinedFunction(,StringType,Some(List(StringType)))
> {code}
> But, the following does not
> {code}
> scala> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.functions._
> scala> import org.apache.spark.sql.expressions._
> import org.apache.spark.sql.expressions._
> scala> udf((v: String) => v.stripSuffix("-abc"))
> :30: error: No TypeTag available for String
>udf((v: String) => v.stripSuffix("-abc"))
> {code}
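
Assuming the clash really is the org.apache.spark.sql.expressions.scala sub-package 
shadowing the root scala package, as described in the comment above, a possible 
workaround sketch until a fix lands is to import only the specific members needed from 
expressions rather than the wildcard (Window here is just an example member):

{code}
import org.apache.spark.sql.functions._
// Import specific members instead of expressions._, so the nested `scala` package never
// shadows the root one and TypeTag materialization keeps working.
import org.apache.spark.sql.expressions.Window

val stripSuffixUdf = udf((v: String) => v.stripSuffix("-abc"))
{code}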



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13904) Add support for pluggable cluster manager

2016-04-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15244882#comment-15244882
 ] 

Apache Spark commented on SPARK-13904:
--

User 'tedyu' has created a pull request for this issue:
https://github.com/apache/spark/pull/12457

> Add support for pluggable cluster manager
> -
>
> Key: SPARK-13904
> URL: https://issues.apache.org/jira/browse/SPARK-13904
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Reporter: Hemant Bhanawat
>
> Currently Spark allows only a few cluster managers viz Yarn, Mesos and 
> Standalone. But, as Spark is now being used in newer and different use cases, 
> there is a need for allowing other cluster managers to manage spark 
> components. One such use case is - embedding spark components like executor 
> and driver inside another process which may be a datastore. This allows 
> colocation of data and processing. Another requirement that stems from such a 
> use case is that the executors/driver should not take the parent process down 
> when they go down and the components can be relaunched inside the same 
> process again. 
> So, this JIRA requests two functionalities:
> 1. Support for external cluster managers
> 2. Allow a cluster manager to clean up the tasks without taking the parent 
> process down. 
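
An illustrative sketch of the extension point being requested; the trait and method names 
are placeholders rather than the final Spark API, and the real scheduler types are 
replaced so the sketch stays self-contained:

{code}
// Placeholder shape only, not the actual Spark interface.
trait PluggableClusterManager {
  /** True if this manager handles the given master URL, e.g. "mycluster://host:port". */
  def canCreate(masterURL: String): Boolean

  /** Build whatever scheduling components this manager needs for the application. */
  def createSchedulerComponents(masterURL: String): AnyRef

  /** Clean up running tasks without taking the embedding (parent) process down. */
  def shutdownTasks(): Unit
}
{code}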



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14647) Group SQLContext/HiveContext state into PersistentState

2016-04-17 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15244880#comment-15244880
 ] 

Yin Huai commented on SPARK-14647:
--

Let me check the code and see if there is any suspicious place.

> Group SQLContext/HiveContext state into PersistentState
> ---
>
> Key: SPARK-14647
> URL: https://issues.apache.org/jira/browse/SPARK-14647
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Andrew Or
>Assignee: Andrew Or
> Fix For: 2.0.0
>
>
> This is analogous to SPARK-13526, which moved some things into 
> `SessionState`. After this issue we'll have an analogous `PersistentState` 
> that groups things to be shared across sessions. This will simplify the 
> constructors of the contexts significantly by allowing us to pass fewer 
> things into the contexts.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14647) Group SQLContext/HiveContext state into PersistentState

2016-04-17 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15244879#comment-15244879
 ] 

Yin Huai commented on SPARK-14647:
--

Looking at the log, it seems that it took a long time to resolve the Maven 
dependencies (that test is specific to the Hive 0.13 metastore, so it first 
downloads jars using Ivy).

> Group SQLContext/HiveContext state into PersistentState
> ---
>
> Key: SPARK-14647
> URL: https://issues.apache.org/jira/browse/SPARK-14647
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Andrew Or
>Assignee: Andrew Or
> Fix For: 2.0.0
>
>
> This is analogous to SPARK-13526, which moved some things into 
> `SessionState`. After this issue we'll have an analogous `PersistentState` 
> that groups things to be shared across sessions. This will simplify the 
> constructors of the contexts significantly by allowing us to pass fewer 
> things into the contexts.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14686) Implement a non-inheritable localProperty facility

2016-04-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15244807#comment-15244807
 ] 

Apache Spark commented on SPARK-14686:
--

User 'marcintustin' has created a pull request for this issue:
https://github.com/apache/spark/pull/12456

> Implement a non-inheritable localProperty facility
> --
>
> Key: SPARK-14686
> URL: https://issues.apache.org/jira/browse/SPARK-14686
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Marcin Tustin
>Priority: Minor
>
> As discussed here: 
> http://mail-archives.apache.org/mod_mbox/spark-dev/201604.mbox/%3CCANXtaKA4ZdpiUbZPnzDBN8ZL7_RwKSMuz6n45ixnHVEBE1hgjg%40mail.gmail.com%3E
> Spark localProperties are always inherited by spawned threads. There are 
> situations in which this is undesirable (notably spark.sql.execution.id and 
> any other localProperty that should always be cleaned up). This is a ticket 
> to implement a non-inheritable mechanism for localProperties. 
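
A small standalone sketch of the underlying mechanism (plain JVM code, not Spark's): a 
value stored in an InheritableThreadLocal is copied into threads spawned afterwards, 
while a plain ThreadLocal value is not, which is exactly the distinction a non-inheritable 
localProperty would expose.

{code}
val inheritable = new InheritableThreadLocal[String]()
val plain = new ThreadLocal[String]()

inheritable.set("seen by child thread")
plain.set("not seen by child thread")

val child = new Thread(new Runnable {
  override def run(): Unit = {
    println(s"inheritable = ${inheritable.get()}") // prints "seen by child thread"
    println(s"plain = ${plain.get()}")             // prints "null"
  }
})
child.start()
child.join()
{code}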



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14686) Implement a non-inheritable localProperty facility

2016-04-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14686:


Assignee: (was: Apache Spark)

> Implement a non-inheritable localProperty facility
> --
>
> Key: SPARK-14686
> URL: https://issues.apache.org/jira/browse/SPARK-14686
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Marcin Tustin
>Priority: Minor
>
> As discussed here: 
> http://mail-archives.apache.org/mod_mbox/spark-dev/201604.mbox/%3CCANXtaKA4ZdpiUbZPnzDBN8ZL7_RwKSMuz6n45ixnHVEBE1hgjg%40mail.gmail.com%3E
> Spark localProperties are always inherited by spawned threads. There are 
> situations in which this is undesirable (notably spark.sql.execution.id and 
> any other localProperty that should always be cleaned up). This is a ticket 
> to implement a non-inheritable mechanism for localProperties. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14686) Implement a non-inheritable localProperty facility

2016-04-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14686:


Assignee: Apache Spark

> Implement a non-inheritable localProperty facility
> --
>
> Key: SPARK-14686
> URL: https://issues.apache.org/jira/browse/SPARK-14686
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Marcin Tustin
>Assignee: Apache Spark
>Priority: Minor
>
> As discussed here: 
> http://mail-archives.apache.org/mod_mbox/spark-dev/201604.mbox/%3CCANXtaKA4ZdpiUbZPnzDBN8ZL7_RwKSMuz6n45ixnHVEBE1hgjg%40mail.gmail.com%3E
> Spark localProperties are always inherited by spawned threads. There are 
> situations in which this is undesirable (notably spark.sql.execution.id and 
> any other localProperty that should always be cleaned up). This is a ticket 
> to implement a non-inheritable mechanism for localProperties. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13904) Add support for pluggable cluster manager

2016-04-17 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15244805#comment-15244805
 ] 

Kazuaki Ishizaki commented on SPARK-13904:
--

Merging this PR may have started causing test failures. Would it be possible to 
look at these links?
https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.6/627/
https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.6/626/

cf. [SPARK-14690]

> Add support for pluggable cluster manager
> -
>
> Key: SPARK-13904
> URL: https://issues.apache.org/jira/browse/SPARK-13904
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Reporter: Hemant Bhanawat
>
> Currently Spark allows only a few cluster managers viz Yarn, Mesos and 
> Standalone. But, as Spark is now being used in newer and different use cases, 
> there is a need for allowing other cluster managers to manage spark 
> components. One such use case is - embedding spark components like executor 
> and driver inside another process which may be a datastore. This allows 
> colocation of data and processing. Another requirement that stems from such a 
> use case is that the executors/driver should not take the parent process down 
> when they go down and the components can be relaunched inside the same 
> process again. 
> So, this JIRA requests two functionalities:
> 1. Support for external cluster managers
> 2. Allow a cluster manager to clean up the tasks without taking the parent 
> process down. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-14690) [SQL] SPARK-8020 fails in Jenkins for master

2016-04-17 Thread Kazuaki Ishizaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuaki Ishizaki updated SPARK-14690:
-
Comment: was deleted

(was: Add a link to the original JIRA)

> [SQL] SPARK-8020 fails in Jenkins for master
> 
>
> Key: SPARK-14690
> URL: https://issues.apache.org/jira/browse/SPARK-14690
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Kazuaki Ishizaki
>
> After merging a PR for [SPARK-14672], [SPARK-13904], and another one, a test 
> "SPARK-8020" fails.
> Here is a result at amplab Jenkins.
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.6/627/
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.6/626/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-14690) [SQL] SPARK-8020 fails in Jenkins for master

2016-04-17 Thread Kazuaki Ishizaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuaki Ishizaki closed SPARK-14690.


Add a link to the original JIRA

> [SQL] SPARK-8020 fails in Jenkins for master
> 
>
> Key: SPARK-14690
> URL: https://issues.apache.org/jira/browse/SPARK-14690
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Kazuaki Ishizaki
>
> After merging a PR for [SPARK-14672], [SPARK-13904], and another one, a test 
> "SPARK-8020" fails.
> Here is a result at amplab Jenkins.
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.6/627/
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.6/626/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14690) [SQL] SPARK-8020 fails in Jenkins for master

2016-04-17 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15244798#comment-15244798
 ] 

Kazuaki Ishizaki commented on SPARK-14690:
--

I see. I will reopen the original JIRA soon.

> [SQL] SPARK-8020 fails in Jenkins for master
> 
>
> Key: SPARK-14690
> URL: https://issues.apache.org/jira/browse/SPARK-14690
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Kazuaki Ishizaki
>
> After merging a PR for [SPARK-14672], [SPARK-13904], and another one, a test 
> "SPARK-8020" fails.
> Here is a result at amplab Jenkins.
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.6/627/
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.6/626/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14690) [SQL] SPARK-8020 fails in Jenkins for master

2016-04-17 Thread Kazuaki Ishizaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuaki Ishizaki resolved SPARK-14690.
--
Resolution: Duplicate

> [SQL] SPARK-8020 fails in Jenkins for master
> 
>
> Key: SPARK-14690
> URL: https://issues.apache.org/jira/browse/SPARK-14690
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Kazuaki Ishizaki
>
> After merging a PR for [SPARK-14672], [SPARK-13904], and another one, a test 
> "SPARK-8020" fails.
> Here is a result at amplab Jenkins.
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.6/627/
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.6/626/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14690) [SQL] SPARK-8020 fails in Jenkins for master

2016-04-17 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15244784#comment-15244784
 ] 

Sean Owen commented on SPARK-14690:
---

Same, please reopen the JIRA whose resolution you believe caused the failures. 
Creating a new JIRA splits the thread of discussion for anyone following the 
original change.

> [SQL] SPARK-8020 fails in Jenkins for master
> 
>
> Key: SPARK-14690
> URL: https://issues.apache.org/jira/browse/SPARK-14690
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Kazuaki Ishizaki
>
> After merging a PR for [SPARK-14672], [SPARK-13904], and another one, a test 
> "SPARK-8020" fails.
> Here is a result at amplab Jenkins.
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.6/627/
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.6/626/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-14647) Group SQLContext/HiveContext state into PersistentState

2016-04-17 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reopened SPARK-14647:
---

Pardon, does look like this may have begun causing test failures:

https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.6/625/

Cf. SPARK-14689

> Group SQLContext/HiveContext state into PersistentState
> ---
>
> Key: SPARK-14647
> URL: https://issues.apache.org/jira/browse/SPARK-14647
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Andrew Or
>Assignee: Andrew Or
> Fix For: 2.0.0
>
>
> This is analogous to SPARK-13526, which moved some things into 
> `SessionState`. After this issue we'll have an analogous `PersistentState` 
> that groups things to be shared across sessions. This will simplify the 
> constructors of the contexts significantly by allowing us to pass fewer 
> things into the contexts.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14689) [SQL] SPARK-9757 fails in Jenkins for master

2016-04-17 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-14689.
---
Resolution: Duplicate

I don't think it helps to make a new JIRA for this, as it splits the thread of 
discussion. I'm going to mark this as a duplicate and reopen the other JIRA.

> [SQL] SPARK-9757 fails in Jenkins for master
> 
>
> Key: SPARK-14689
> URL: https://issues.apache.org/jira/browse/SPARK-14689
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Kazuaki Ishizaki
>Priority: Blocker
>
> After merging a PR for [SPARK-14647], a test "SPARK-9757" fails.
> Here is a result at amplab Jenkins.
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.6/625/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14690) [SQL] SPARK-8020 fails in Jenkins for master

2016-04-17 Thread Kazuaki Ishizaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuaki Ishizaki updated SPARK-14690:
-
Summary: [SQL] SPARK-8020 fails in Jenkins for master  (was: [SQL] 
SPARK-9757 fails in Jenkins for master)

> [SQL] SPARK-8020 fails in Jenkins for master
> 
>
> Key: SPARK-14690
> URL: https://issues.apache.org/jira/browse/SPARK-14690
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Kazuaki Ishizaki
>
> After merging a PR for [SPARK-14672], [SPARK-13904], and another one, a test 
> "SPARK-8020" fails.
> Here is a result at amplab Jenkins.
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.6/627/
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.6/626/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14690) [SQL] SPARK-9757 fails in Jenkins for master

2016-04-17 Thread Kazuaki Ishizaki (JIRA)
Kazuaki Ishizaki created SPARK-14690:


 Summary: [SQL] SPARK-9757 fails in Jenkins for master
 Key: SPARK-14690
 URL: https://issues.apache.org/jira/browse/SPARK-14690
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Kazuaki Ishizaki


After merging a PR for [SPARK-14672], [SPARK-13904], and another one, a test 
"SPARK-8020" fails.

Here is a result at amplab Jenkins.
https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.6/627/
https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.6/626/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14689) [SQL] SPARK-9757 fails in Jenkins for master

2016-04-17 Thread Kazuaki Ishizaki (JIRA)
Kazuaki Ishizaki created SPARK-14689:


 Summary: [SQL] SPARK-9757 fails in Jenkins for master
 Key: SPARK-14689
 URL: https://issues.apache.org/jira/browse/SPARK-14689
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Kazuaki Ishizaki
Priority: Blocker


After merging a PR for [SPARK-14647], a test "SPARK-9757" fails.

Here is a result at amplab Jenkins.
https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.6/625/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14688) pyspark textFileStream gzipped

2016-04-17 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15244770#comment-15244770
 ] 

Sean Owen commented on SPARK-14688:
---

Could you provide some detail? AFAIK it's just delegating to the same Hadoop APIs 
to read, right?

> pyspark textFileStream gzipped
> --
>
> Key: SPARK-14688
> URL: https://issues.apache.org/jira/browse/SPARK-14688
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Streaming
>Affects Versions: 1.6.1
>Reporter: seth
>  Labels: pyspark, streaming
>
> The pyspark streaming object does not support reading gzip files.
> Two notes: 
> 1. The regular sparkContext does support gzip files.
> 2. The Java/Scala methods do support streaming gzip files.
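
For reference, a hedged sketch of the Scala route the report says does handle gzip 
(the directory path is a placeholder); as far as I can tell, the Scala textFileStream 
is just fileStream with TextInputFormat, which goes through Hadoop's compression codecs:

{code}
// Hedged sketch, not the reporter's code: the Scala streaming path that is said
// to handle gzip, via TextInputFormat and Hadoop's compression codecs.
// "hdfs:///incoming" is a placeholder directory.
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10))
val lines = ssc
  .fileStream[LongWritable, Text, TextInputFormat]("hdfs:///incoming")
  .map(_._2.toString)
lines.print()
ssc.start()
{code}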



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14688) pyspark textFileStream gzipped

2016-04-17 Thread seth (JIRA)
seth created SPARK-14688:


 Summary: pyspark textFileStream gzipped
 Key: SPARK-14688
 URL: https://issues.apache.org/jira/browse/SPARK-14688
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, Streaming
Affects Versions: 1.6.1
Reporter: seth


The pyspark streaming object does not support reading gzip files.
Two notes: 
1. The regular sparkContext does support gzip files.
2. The Java/Scala methods do support streaming gzip files.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14685) Properly document heritability of localProperties

2016-04-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14685:


Assignee: Apache Spark

> Properly document heritability of localProperties
> -
>
> Key: SPARK-14685
> URL: https://issues.apache.org/jira/browse/SPARK-14685
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Marcin Tustin
>Assignee: Apache Spark
>Priority: Minor
>
> As discussed here: 
> http://mail-archives.apache.org/mod_mbox/spark-dev/201604.mbox/%3CCANXtaKA4ZdpiUbZPnzDBN8ZL7_RwKSMuz6n45ixnHVEBE1hgjg%40mail.gmail.com%3E
> One thread spawned by another will inherit spark localProperties. This is not 
> currently documented, and there are no tests for that specific behaviour.
> This is a ticket to document this behaviour, including its consequences, and 
> implement an appropriate test. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14685) Properly document heritability of localProperties

2016-04-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15244692#comment-15244692
 ] 

Apache Spark commented on SPARK-14685:
--

User 'marcintustin' has created a pull request for this issue:
https://github.com/apache/spark/pull/12455

> Properly document heritability of localProperties
> -
>
> Key: SPARK-14685
> URL: https://issues.apache.org/jira/browse/SPARK-14685
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Marcin Tustin
>Priority: Minor
>
> As discussed here: 
> http://mail-archives.apache.org/mod_mbox/spark-dev/201604.mbox/%3CCANXtaKA4ZdpiUbZPnzDBN8ZL7_RwKSMuz6n45ixnHVEBE1hgjg%40mail.gmail.com%3E
> One thread spawned by another will inherit spark localProperties. This is not 
> currently documented, and there are no tests for that specific behaviour.
> This is a ticket to document this behaviour, including its consequences, and 
> implement an appropriate test. 
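
As a reference for the behaviour the ticket asks to document and test, a minimal 
sketch (assuming a live SparkContext {{sc}}; this is not the actual test in the PR):

{code}
// Minimal sketch, assuming a live SparkContext `sc`: local properties set in a
// parent thread are visible in a thread it spawns, because they are stored in
// an inheritable thread-local.
sc.setLocalProperty("myKey", "parentValue")

val child = new Thread(new Runnable {
  override def run(): Unit = {
    // Expected to print "parentValue" because the property is inherited.
    println(sc.getLocalProperty("myKey"))
  }
})
child.start()
child.join()
{code}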



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14685) Properly document heritability of localProperties

2016-04-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14685:


Assignee: (was: Apache Spark)

> Properly document heritability of localProperties
> -
>
> Key: SPARK-14685
> URL: https://issues.apache.org/jira/browse/SPARK-14685
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Marcin Tustin
>Priority: Minor
>
> As discussed here: 
> http://mail-archives.apache.org/mod_mbox/spark-dev/201604.mbox/%3CCANXtaKA4ZdpiUbZPnzDBN8ZL7_RwKSMuz6n45ixnHVEBE1hgjg%40mail.gmail.com%3E
> One thread spawned by another will inherit spark localProperties. This is not 
> currently documented, and there are no tests for that specific behaviour.
> This is a ticket to document this behaviour, including its consequences, and 
> implement an appropriate test. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14343) Dataframe operations on a partitioned dataset (using partition discovery) return invalid results

2016-04-17 Thread Jurriaan Pruis (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15244669#comment-15244669
 ] 

Jurriaan Pruis commented on SPARK-14343:


On the Spark 2.0.0 nightly build it doesn't work at all:

{code:none}
>>> df=sqlContext.read.text('dataset')
16/04/17 16:11:34 INFO HDFSFileCatalog: Listing file:/Users/.../dataset on 
driver
16/04/17 16:11:34 INFO HDFSFileCatalog: Listing 
file:/Users/.../dataset/year=2014 on driver
16/04/17 16:11:34 INFO HDFSFileCatalog: Listing 
file:/Users/.../dataset/year=2015 on driver
Traceback (most recent call last):
  File "", line 1, in 
  File 
"/Users/.../Downloads/spark-2.0.0-SNAPSHOT-bin-hadoop2.7/python/pyspark/sql/readwriter.py",
 line 245, in text
return 
self._df(self._jreader.text(self._sqlContext._sc._jvm.PythonUtils.toSeq(paths)))
  File 
"/Users/.../Downloads/spark-2.0.0-SNAPSHOT-bin-hadoop2.7/python/lib/py4j-0.9.2-src.zip/py4j/java_gateway.py",
 line 836, in __call__
  File 
"/Users/.../Downloads/spark-2.0.0-SNAPSHOT-bin-hadoop2.7/python/pyspark/sql/utils.py",
 line 57, in deco
raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: u'Try to map struct 
to Tuple1, but failed as the number of fields does not line up.\n - Input 
schema: struct\n - Target schema: struct;'
{code}

> Dataframe operations on a partitioned dataset (using partition discovery) 
> return invalid results
> 
>
> Key: SPARK-14343
> URL: https://issues.apache.org/jira/browse/SPARK-14343
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1, 2.0.0
> Environment: Mac OS X 10.11.4
>Reporter: Jurriaan Pruis
>
> When reading a dataset using {{sqlContext.read.text()}} queries on the 
> partitioned column return invalid results.
> h2. How to reproduce:
> h3. Generate datasets
> {code:title=repro.sh}
> #!/bin/sh
> mkdir -p dataset/year=2014
> mkdir -p dataset/year=2015
> echo "data from 2014" > dataset/year=2014/part01.txt
> echo "data from 2015" > dataset/year=2015/part01.txt
> {code}
> {code:title=repro2.sh}
> #!/bin/sh
> mkdir -p dataset2/month=june
> mkdir -p dataset2/month=july
> echo "data from june" > dataset2/month=june/part01.txt
> echo "data from july" > dataset2/month=july/part01.txt
> {code}
> h3. using first dataset
> {code:none}
> >>> df = sqlContext.read.text('dataset')
> ...
> >>> df
> DataFrame[value: string, year: int]
> >>> df.show()
> +--------------+----+
> |         value|year|
> +--------------+----+
> |data from 2014|2014|
> |data from 2015|2015|
> +--------------+----+
> >>> df.select('year').show()
> +----+
> |year|
> +----+
> |  14|
> |  14|
> +----+
> {code}
> This is clearly wrong. Seems like it returns the length of the value column?
> h3. using second dataset
> With another dataset it looks like this:
> {code:none}
> >>> df = sqlContext.read.text('dataset2')
> >>> df
> DataFrame[value: string, month: string]
> >>> df.show()
> +--------------+-----+
> |         value|month|
> +--------------+-----+
> |data from june| june|
> |data from july| july|
> +--------------+-----+
> >>> df.select('month').show()
> +--------------+
> |         month|
> +--------------+
> |data from june|
> |data from july|
> +--------------+
> {code}
> Here it returns the value of the value column instead of the month partition.
> h3. Workaround
> When I convert the dataframe to an RDD and back to a DataFrame I get the 
> following result (which is the expected behaviour):
> {code:none}
> >>> df.rdd.toDF().select('month').show()
> +-----+
> |month|
> +-----+
> | june|
> | july|
> +-----+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14343) Dataframe operations on a partitioned dataset (using partition discovery) return invalid results

2016-04-17 Thread Jurriaan Pruis (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jurriaan Pruis updated SPARK-14343:
---
Affects Version/s: 2.0.0

> Dataframe operations on a partitioned dataset (using partition discovery) 
> return invalid results
> 
>
> Key: SPARK-14343
> URL: https://issues.apache.org/jira/browse/SPARK-14343
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1, 2.0.0
> Environment: Mac OS X 10.11.4
>Reporter: Jurriaan Pruis
>
> When reading a dataset using {{sqlContext.read.text()}} queries on the 
> partitioned column return invalid results.
> h2. How to reproduce:
> h3. Generate datasets
> {code:title=repro.sh}
> #!/bin/sh
> mkdir -p dataset/year=2014
> mkdir -p dataset/year=2015
> echo "data from 2014" > dataset/year=2014/part01.txt
> echo "data from 2015" > dataset/year=2015/part01.txt
> {code}
> {code:title=repro2.sh}
> #!/bin/sh
> mkdir -p dataset2/month=june
> mkdir -p dataset2/month=july
> echo "data from june" > dataset2/month=june/part01.txt
> echo "data from july" > dataset2/month=july/part01.txt
> {code}
> h3. using first dataset
> {code:none}
> >>> df = sqlContext.read.text('dataset')
> ...
> >>> df
> DataFrame[value: string, year: int]
> >>> df.show()
> +--------------+----+
> |         value|year|
> +--------------+----+
> |data from 2014|2014|
> |data from 2015|2015|
> +--------------+----+
> >>> df.select('year').show()
> +----+
> |year|
> +----+
> |  14|
> |  14|
> +----+
> {code}
> This is clearly wrong. Seems like it returns the length of the value column?
> h3. using second dataset
> With another dataset it looks like this:
> {code:none}
> >>> df = sqlContext.read.text('dataset2')
> >>> df
> DataFrame[value: string, month: string]
> >>> df.show()
> +--------------+-----+
> |         value|month|
> +--------------+-----+
> |data from june| june|
> |data from july| july|
> +--------------+-----+
> >>> df.select('month').show()
> +--------------+
> |         month|
> +--------------+
> |data from june|
> |data from july|
> +--------------+
> {code}
> Here it returns the value of the value column instead of the month partition.
> h3. Workaround
> When I convert the dataframe to an RDD and back to a DataFrame I get the 
> following result (which is the expected behaviour):
> {code:none}
> >>> df.rdd.toDF().select('month').show()
> +-----+
> |month|
> +-----+
> | june|
> | july|
> +-----+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13753) Column nullable is derived incorrectly

2016-04-17 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15244640#comment-15244640
 ] 

Takeshi Yamamuro commented on SPARK-13753:
--

Could you also post the explain result of your query?

> Column nullable is derived incorrectly
> --
>
> Key: SPARK-13753
> URL: https://issues.apache.org/jira/browse/SPARK-13753
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Jingwei Lu
>Priority: Critical
>
> There is a problem in spark sql to derive nullable column and used in 
> optimization incorrectly. In following query:
> {code}
> select concat("perf.realtime.web", b.tags[1]) as metric, b.value, b.tags[0]
>   from (
> select explode(map(a.frontend[0], 
> ARRAY(concat("metric:frontend", ",controller:", COALESCE(controller, "null"), 
> ",action:", COALESCE(action, "null")), ".p50"),
>  a.frontend[1], 
> ARRAY(concat("metric:frontend", ",controller:", COALESCE(controller, "null"), 
> ",action:", COALESCE(action, "null")), ".p90"),
>  a.backend[0], ARRAY(concat("metric:backend", 
> ",controller:", COALESCE(controller, "null"), ",action:", COALESCE(action, 
> "null")), ".p50"),
>  a.backend[1], ARRAY(concat("metric:backend", 
> ",controller:", COALESCE(controller, "null"), ",action:", COALESCE(action, 
> "null")), ".p90"),
>  a.render[0], ARRAY(concat("metric:render", 
> ",controller:", COALESCE(controller, "null"), ",action:", COALESCE(action, 
> "null")), ".p50"),
>  a.render[1], ARRAY(concat("metric:render", 
> ",controller:", COALESCE(controller, "null"), ",action:", COALESCE(action, 
> "null")), ".p90"),
>  a.page_load_time[0], 
> ARRAY(concat("metric:page_load_time", ",controller:", COALESCE(controller, 
> "null"), ",action:", COALESCE(action, "null")), ".p50"),
>  a.page_load_time[1], 
> ARRAY(concat("metric:page_load_time", ",controller:", COALESCE(controller, 
> "null"), ",action:", COALESCE(action, "null")), ".p90"),
>  a.total_load_time[0], 
> ARRAY(concat("metric:total_load_time", ",controller:", COALESCE(controller, 
> "null"), ",action:", COALESCE(action, "null")), ".p50"),
>  a.total_load_time[1], 
> ARRAY(concat("metric:total_load_time", ",controller:", COALESCE(controller, 
> "null"), ",action:", COALESCE(action, "null")), ".p90"))) as (value, tags)
> from (
>   select  data.controller as controller, data.action as 
> action,
>   percentile(data.frontend, array(0.5, 0.9)) as 
> frontend,
>   percentile(data.backend, array(0.5, 0.9)) as 
> backend,
>   percentile(data.render, array(0.5, 0.9)) as render,
>   percentile(data.page_load_time, array(0.5, 0.9)) as 
> page_load_time,
>   percentile(data.total_load_time, array(0.5, 0.9)) 
> as total_load_time
>   from air_events_rt
>   where type='air_events' and data.event_name='pageload'
>   group by data.controller, data.action
> ) a
>   ) b
>   where b.value is not null
> {code}
> b.value is incorrectly derived as not nullable. The "b.value is not null" 
> predicate is then ignored by the optimizer, which causes the query to return an 
> incorrect result. 
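
As a reference for gathering what the comment above asks for, a hedged sketch 
(the DataFrame name {{q}} and the variable {{queryText}} are hypothetical) that 
prints the nullability the analyzer derives along with the plan:

{code}
// Hedged sketch with hypothetical names: `queryText` holds the SQL shown in the
// description, and `q` the resulting DataFrame. Print the nullability derived
// for each output column plus the analyzed/optimized plans. A `nullable = false`
// on `value` would explain why the IS NOT NULL filter gets optimized away.
val q = sqlContext.sql(queryText)
q.schema.fields.foreach(f => println(s"${f.name}: nullable = ${f.nullable}"))
q.explain(true)
{code}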



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14635) Documentation and Examples for TF-IDF only refer to HashingTF

2016-04-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15244575#comment-15244575
 ] 

Apache Spark commented on SPARK-14635:
--

User 'hhbyyh' has created a pull request for this issue:
https://github.com/apache/spark/pull/12454

> Documentation and Examples for TF-IDF only refer to HashingTF
> -
>
> Key: SPARK-14635
> URL: https://issues.apache.org/jira/browse/SPARK-14635
> Project: Spark
>  Issue Type: Documentation
>  Components: ML
>Reporter: Nick Pentreath
>Priority: Minor
>
> Currently, the [docs for 
> TF-IDF|http://spark.apache.org/docs/latest/ml-features.html#tf-idf-hashingtf-and-idf]
>  only refer to using {{HashingTF}} with {{IDF}}. However, {{CountVectorizer}} 
> can also be used. We should probably amend the user guide and examples to 
> show this.
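
For illustration, a rough spark.ml sketch of the CountVectorizer-plus-IDF combination 
the guide could show alongside HashingTF (assumes a DataFrame {{docs}} with a tokenized 
"words" column; this is not the wording of the eventual docs or examples):

{code}
// Rough sketch, assuming a DataFrame `docs` with an array-of-strings column
// "words": term frequencies via CountVectorizer instead of HashingTF, then IDF.
import org.apache.spark.ml.feature.{CountVectorizer, IDF}

val cvModel = new CountVectorizer()
  .setInputCol("words")
  .setOutputCol("rawFeatures")
  .fit(docs)
val featurized = cvModel.transform(docs)

val idfModel = new IDF()
  .setInputCol("rawFeatures")
  .setOutputCol("features")
  .fit(featurized)
val rescaled = idfModel.transform(featurized)
{code}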



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14635) Documentation and Examples for TF-IDF only refer to HashingTF

2016-04-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14635:


Assignee: Apache Spark

> Documentation and Examples for TF-IDF only refer to HashingTF
> -
>
> Key: SPARK-14635
> URL: https://issues.apache.org/jira/browse/SPARK-14635
> Project: Spark
>  Issue Type: Documentation
>  Components: ML
>Reporter: Nick Pentreath
>Assignee: Apache Spark
>Priority: Minor
>
> Currently, the [docs for 
> TF-IDF|http://spark.apache.org/docs/latest/ml-features.html#tf-idf-hashingtf-and-idf]
>  only refer to using {{HashingTF}} with {{IDF}}. However, {{CountVectorizer}} 
> can also be used. We should probably amend the user guide and examples to 
> show this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14635) Documentation and Examples for TF-IDF only refer to HashingTF

2016-04-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14635:


Assignee: (was: Apache Spark)

> Documentation and Examples for TF-IDF only refer to HashingTF
> -
>
> Key: SPARK-14635
> URL: https://issues.apache.org/jira/browse/SPARK-14635
> Project: Spark
>  Issue Type: Documentation
>  Components: ML
>Reporter: Nick Pentreath
>Priority: Minor
>
> Currently, the [docs for 
> TF-IDF|http://spark.apache.org/docs/latest/ml-features.html#tf-idf-hashingtf-and-idf]
>  only refer to using {{HashingTF}} with {{IDF}}. However, {{CountVectorizer}} 
> can also be used. We should probably amend the user guide and examples to 
> show this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14681) Provide label/impurity stats for spark.ml decision tree nodes

2016-04-17 Thread zhengruifeng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15244574#comment-15244574
 ] 

zhengruifeng commented on SPARK-14681:
--

Will these stats be included in the trainingSummary, or in a non-training summary 
evaluated on some DataFrame?

> Provide label/impurity stats for spark.ml decision tree nodes
> -
>
> Key: SPARK-14681
> URL: https://issues.apache.org/jira/browse/SPARK-14681
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>
> Currently, spark.ml decision trees provide all node info except for the 
> aggregated stats about labels and impurities.  This task is to provide those 
> publicly.  We need to choose a good API for it, so we should discuss the 
> design on this issue before implementing it.
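
For context, a hedged sketch of what the spark.ml node API exposes today (assumes a 
fitted tree model bound to a hypothetical name {{model}}); the aggregated per-label 
counts behind each {{impurity}} value are the part this ticket proposes to surface:

{code}
// Hedged sketch, assuming a fitted DecisionTreeClassificationModel `model`:
// walk the tree and print the stats that are already public today. The
// aggregated label counts behind `impurity` are what this ticket would add.
import org.apache.spark.ml.tree.{InternalNode, LeafNode, Node}

def describe(node: Node, depth: Int = 0): Unit = node match {
  case n: InternalNode =>
    println(s"${"  " * depth}split on feature ${n.split.featureIndex}, impurity = ${n.impurity}")
    describe(n.leftChild, depth + 1)
    describe(n.rightChild, depth + 1)
  case l: LeafNode =>
    println(s"${"  " * depth}leaf: prediction = ${l.prediction}, impurity = ${l.impurity}")
}

describe(model.rootNode)
{code}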



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14283) Avoid sort in randomSplit when possible

2016-04-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14283:


Assignee: Apache Spark

> Avoid sort in randomSplit when possible
> ---
>
> Key: SPARK-14283
> URL: https://issues.apache.org/jira/browse/SPARK-14283
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Joseph K. Bradley
>Assignee: Apache Spark
>
> Dataset.randomSplit sorts each partition in order to guarantee an ordering 
> and make randomSplit deterministic given the seed.  Since randomSplit is used 
> a fair amount in ML, it would be great to avoid the sort when possible.
> Are there cases when it could be avoided?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14283) Avoid sort in randomSplit when possible

2016-04-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15244571#comment-15244571
 ] 

Apache Spark commented on SPARK-14283:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/12453

> Avoid sort in randomSplit when possible
> ---
>
> Key: SPARK-14283
> URL: https://issues.apache.org/jira/browse/SPARK-14283
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Joseph K. Bradley
>
> Dataset.randomSplit sorts each partition in order to guarantee an ordering 
> and make randomSplit deterministic given the seed.  Since randomSplit is used 
> a fair amount in ML, it would be great to avoid the sort when possible.
> Are there cases when it could be avoided?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14283) Avoid sort in randomSplit when possible

2016-04-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14283:


Assignee: (was: Apache Spark)

> Avoid sort in randomSplit when possible
> ---
>
> Key: SPARK-14283
> URL: https://issues.apache.org/jira/browse/SPARK-14283
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Joseph K. Bradley
>
> Dataset.randomSplit sorts each partition in order to guarantee an ordering 
> and make randomSplit deterministic given the seed.  Since randomSplit is used 
> a fair amount in ML, it would be great to avoid the sort when possible.
> Are there cases when it could be avoided?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14283) Avoid sort in randomSplit when possible

2016-04-17 Thread zhengruifeng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15244568#comment-15244568
 ] 

zhengruifeng commented on SPARK-14283:
--

[~josephkb] I can work on this.
There should be a version of randomSplit that avoids the local sort, which is 
meaningless in ML.
But the calls in ML should add an extra param to avoid the local sort, IMO.
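
A short sketch of the call pattern under discussion (not a proposed fix; assumes an 
existing DataFrame {{df}}): as far as I understand, the per-partition sort only exists 
so that the seeded sampling behind randomSplit sees rows in a deterministic order.

{code}
// Sketch of the usage being discussed, assuming an existing DataFrame `df`:
// randomSplit samples rows with a seeded RNG, so the per-partition sort is
// only there to make the row order -- and hence the split -- reproducible.
val Array(train, test) = df.randomSplit(Array(0.8, 0.2), seed = 42L)
println(s"train = ${train.count()}, test = ${test.count()}")
{code}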

> Avoid sort in randomSplit when possible
> ---
>
> Key: SPARK-14283
> URL: https://issues.apache.org/jira/browse/SPARK-14283
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Joseph K. Bradley
>
> Dataset.randomSplit sorts each partition in order to guarantee an ordering 
> and make randomSplit deterministic given the seed.  Since randomSplit is used 
> a fair amount in ML, it would be great to avoid the sort when possible.
> Are there cases when it could be avoided?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13363) Aggregator not working with DataFrame

2016-04-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15244566#comment-15244566
 ] 

Apache Spark commented on SPARK-13363:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/12451

> Aggregator not working with DataFrame
> -
>
> Key: SPARK-13363
> URL: https://issues.apache.org/jira/browse/SPARK-13363
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: koert kuipers
>Assignee: Wenchen Fan
>Priority: Blocker
> Fix For: 2.0.0
>
>
> org.apache.spark.sql.expressions.Aggregator doc/comments says: A base class 
> for user-defined aggregations, which can be used in [[DataFrame]] and 
> [[Dataset]]
> It works well with Dataset/GroupedDataset, but I am having no luck using it 
> with DataFrame/GroupedData. Does anyone have an example of how to use it with a 
> DataFrame?
> in particular i would like to use it with this method in GroupedData:
> {noformat}
>   def agg(expr: Column, exprs: Column*): DataFrame
> {noformat}
> clearly it should be possible, since GroupedDataset uses that very same 
> method to do the work:
> {noformat}
>   private def agg(exprs: Column*): DataFrame =
> groupedData.agg(withEncoder(exprs.head), exprs.tail.map(withEncoder): _*)
> {noformat}
> The trick seems to be the wrapping in withEncoder, which is private. I tried 
> to do something like it myself, but I had no luck since it uses more private 
> stuff in TypedColumn.
> Anyhow, my attempt at using it in a DataFrame:
> {noformat}
> val simpleSum = new Aggregator[Int, Int, Int] {
>   def zero: Int = 0 // The initial value.
>   def reduce(b: Int, a: Int) = b + a// Add an element to the running total
>   def merge(b1: Int, b2: Int) = b1 + b2 // Merge intermediate values.
>   def finish(b: Int) = b// Return the final result.
> }.toColumn
> val df = sc.makeRDD(1 to 3).map(i => (i, i)).toDF("k", "v")
> df.groupBy("k").agg(simpleSum).show
> {noformat}
> and the resulting error:
> {noformat}
> org.apache.spark.sql.AnalysisException: unresolved operator 'Aggregate 
> [k#104], [k#104,($anon$3(),mode=Complete,isDistinct=false) AS sum#106];
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:46)
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:241)
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:50)
> at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:122)
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:50)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:46)
> at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:34)
> at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:130)
> at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:49)
> {noformat}
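
For contrast with the DataFrame attempt above, a hedged sketch of the typed route that 
does work (Spark 1.6-era API names; assumes {{import sqlContext.implicits._}} is in scope):

{code}
// Hedged sketch (Spark 1.6-era API; groupBy-with-function became groupByKey in
// 2.0): the same kind of Aggregator used through the typed Dataset path, which
// is the case the class comment covers today.
import org.apache.spark.sql.expressions.Aggregator

val sumOfV = new Aggregator[(Int, Int), Int, Int] {
  def zero: Int = 0
  def reduce(b: Int, a: (Int, Int)): Int = b + a._2   // add the "v" field
  def merge(b1: Int, b2: Int): Int = b1 + b2
  def finish(b: Int): Int = b
}.toColumn

val ds = sc.makeRDD(1 to 3).map(i => (i, i)).toDS()
ds.groupBy(_._1).agg(sumOfV).show()
{code}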



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org