[jira] [Updated] (SPARK-20965) Support PREPARE/EXECUTE/DECLARE/FETCH statements

2017-06-26 Thread Takeshi Yamamuro (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-20965:
-
Summary: Support PREPARE/EXECUTE/DECLARE/FETCH statements  (was: Support 
PREPARE and EXECUTE statements)

> Support PREPARE/EXECUTE/DECLARE/FETCH statements
> 
>
> Key: SPARK-20965
> URL: https://issues.apache.org/jira/browse/SPARK-20965
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> https://github.com/apache/spark/compare/master...maropu:SPARK-20965#diff-06e7b9d9e1afd7aca1a3b2cd18561953R63






[jira] [Updated] (SPARK-20965) Support PREPARE/EXECUTE/DECLARE/FETCH statements

2017-06-26 Thread Takeshi Yamamuro (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-20965:
-
Description: 
It might help to implement PREPARE/EXECUTE/DECLARE/FETCH statements by 
referring to the ANSI SQL standard.
An example in PostgreSQL (MySQL also supports these statements) is as 
follows:

{code}
// Prepared statement
PREPARE pstmt(int) AS SELECT * FROM t WHERE id = $1;
EXECUTE pstmt(1);
DEALLOCATE pstmt;

// Cursor
DECLARE cursor CURSOR FOR SELECT * FROM t;
FETCH 1 FROM cursor;
CLOSE cursor;
{code}

One possible way to implement these statements is to add registries 
in `SessionState` that hold prepared statements and open cursors, as in this 
prototype: 
https://github.com/apache/spark/compare/master...maropu:SPARK-20965#diff-06e7b9d9e1afd7aca1a3b2cd18561953R63
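
For illustration only, here is a minimal Scala sketch of the kind of per-session 
registries the prototype hints at. The names below (ParsedStatement, 
PreparedStatementRegistry, CursorRegistry) are hypothetical and are not existing 
Spark APIs.

{code:java}
import scala.collection.mutable

import org.apache.spark.sql.Row

// Stand-in for an analyzed statement; in Spark this would be a LogicalPlan.
case class ParsedStatement(sql: String, paramTypes: Seq[String])

// Hypothetical per-session registry that PREPARE/DEALLOCATE would populate
// and EXECUTE would look up.
class PreparedStatementRegistry {
  private val statements = mutable.Map.empty[String, ParsedStatement]
  def register(name: String, stmt: ParsedStatement): Unit = { statements(name) = stmt }
  def lookup(name: String): Option[ParsedStatement] = statements.get(name)
  def deallocate(name: String): Unit = { statements.remove(name) }
}

// Hypothetical per-session registry that DECLARE/CLOSE would manage and FETCH
// would read from, holding an open row iterator per cursor name.
class CursorRegistry {
  private val cursors = mutable.Map.empty[String, Iterator[Row]]
  def open(name: String, rows: Iterator[Row]): Unit = { cursors(name) = rows }
  def fetch(name: String, n: Int): Seq[Row] =
    cursors.get(name).map(_.take(n).toSeq).getOrElse(Seq.empty)
  def close(name: String): Unit = { cursors.remove(name) }
}
{code}

A `SessionState`-like holder would then own one instance of each registry per session.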



  
was:https://github.com/apache/spark/compare/master...maropu:SPARK-20965#diff-06e7b9d9e1afd7aca1a3b2cd18561953R63


> Support PREPARE/EXECUTE/DECLARE/FETCH statements
> 
>
> Key: SPARK-20965
> URL: https://issues.apache.org/jira/browse/SPARK-20965
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> It might help to implement PREPARE/EXECUTE/DECLARE/FETCH statements by 
> referring to the ANSI SQL standard.
> An example in PostgreSQL (MySQL also supports these statements) is as 
> follows:
> {code}
> // Prepared statement
> PREPARE pstmt(int) AS SELECT * FROM t WHERE id = $1;
> EXECUTE pstmt(1);
> DEALLOCATE pstmt;
> // Cursor
> DECLARE cursor CURSOR FOR SELECT * FROM t;
> FETCH 1 FROM cursor;
> CLOSE cursor;
> {code}
> One possible way to implement these statements is to add registries 
> in `SessionState` that hold prepared statements and open cursors, as in this 
> prototype: 
> https://github.com/apache/spark/compare/master...maropu:SPARK-20965#diff-06e7b9d9e1afd7aca1a3b2cd18561953R63






[jira] [Updated] (SPARK-20965) Support PREPARE and EXECUTE statements

2017-06-26 Thread Takeshi Yamamuro (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-20965:
-
Description: 
https://github.com/apache/spark/compare/master...maropu:SPARK-20965#diff-06e7b9d9e1afd7aca1a3b2cd18561953R63
  (was: Parameterized queries might help some users, so we might support 
PREPARE and EXECUTE statements by referring to the ANSI SQL standard (e.g., it 
is useful for users who frequently run the same queries)

{code}
PREPARE sqlstmt (int) AS SELECT * FROM t WHERE id = $1;
EXECUTE sqlstmt(1);
{code}

One of implementation references: 
https://www.postgresql.org/docs/current/static/sql-prepare.html)

> Support PREPARE and EXECUTE statements
> --
>
> Key: SPARK-20965
> URL: https://issues.apache.org/jira/browse/SPARK-20965
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> https://github.com/apache/spark/compare/master...maropu:SPARK-20965#diff-06e7b9d9e1afd7aca1a3b2cd18561953R63






[jira] [Updated] (SPARK-21221) CrossValidator and TrainValidationSplit Persist Nested Estimators

2017-06-26 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-21221:
--
Shepherd: Joseph K. Bradley

> CrossValidator and TrainValidationSplit Persist Nested Estimators
> -
>
> Key: SPARK-21221
> URL: https://issues.apache.org/jira/browse/SPARK-21221
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.1.1
>Reporter: Ajay Saini
>
> Currently, the saving of parameters done in ValidatorParams.scala assumes 
> that all parameters in EstimatorParameterMaps are JSON serializable. Such an 
> assumption causes CrossValidator and TrainValidationSplit persistence to fail 
> when the internal estimator to these meta-algorithms is not JSON 
> serializable. One example is OneVsRest which has a classifier (estimator) as 
> a parameter.
> The changes would involve removing the assumption in validator params that 
> all the estimator params are JSON serializable. This could mean saving 
> parameters that are not JSON serializable separately at a specified path. 






[jira] [Commented] (SPARK-21222) Move elimination of Distinct clause from analyzer to optimizer

2017-06-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16064229#comment-16064229
 ] 

Apache Spark commented on SPARK-21222:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/18429

> Move elimination of Distinct clause from analyzer to optimizer
> --
>
> Key: SPARK-21222
> URL: https://issues.apache.org/jira/browse/SPARK-21222
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Gengliang Wang
>Priority: Minor
>
> When a DISTINCT clause appears inside a MAX/MIN aggregate,
> "SELECT MAX(DISTINCT a) FROM src"
> is equivalent to
> "SELECT MAX(a) FROM src"
> However, this elimination is currently implemented in the analyzer. It should 
> be in the optimizer.






[jira] [Assigned] (SPARK-21222) Move elimination of Distinct clause from analyzer to optimizer

2017-06-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21222:


Assignee: Apache Spark

> Move elimination of Distinct clause from analyzer to optimizer
> --
>
> Key: SPARK-21222
> URL: https://issues.apache.org/jira/browse/SPARK-21222
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Gengliang Wang
>Assignee: Apache Spark
>Priority: Minor
>
> When a DISTINCT clause appears inside a MAX/MIN aggregate,
> "SELECT MAX(DISTINCT a) FROM src"
> is equivalent to
> "SELECT MAX(a) FROM src"
> However, this elimination is currently implemented in the analyzer. It should 
> be in the optimizer.






[jira] [Assigned] (SPARK-21222) Move elimination of Distinct clause from analyzer to optimizer

2017-06-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21222:


Assignee: (was: Apache Spark)

> Move elimination of Distinct clause from analyzer to optimizer
> --
>
> Key: SPARK-21222
> URL: https://issues.apache.org/jira/browse/SPARK-21222
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Gengliang Wang
>Priority: Minor
>
> When a DISTINCT clause appears inside a MAX/MIN aggregate,
> "SELECT MAX(DISTINCT a) FROM src"
> is equivalent to
> "SELECT MAX(a) FROM src"
> However, this elimination is currently implemented in the analyzer. It should 
> be in the optimizer.






[jira] [Created] (SPARK-21222) Move elimination of Distinct clause from analyzer to optimizer

2017-06-26 Thread Gengliang Wang (JIRA)
Gengliang Wang created SPARK-21222:
--

 Summary: Move elimination of Distinct clause from analyzer to 
optimizer
 Key: SPARK-21222
 URL: https://issues.apache.org/jira/browse/SPARK-21222
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.2.0
Reporter: Gengliang Wang
Priority: Minor


When a DISTINCT clause appears inside a MAX/MIN aggregate,
"SELECT MAX(DISTINCT a) FROM src"
is equivalent to
"SELECT MAX(a) FROM src"

However, this elimination is currently implemented in the analyzer. It should be 
in the optimizer.
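
A rough Scala sketch of what such an optimizer rule could look like (class names assume 
Spark's Catalyst internals around 2.2; this is not necessarily the eventual patch): drop 
the DISTINCT flag when the aggregate function is MAX or MIN, where it cannot change the 
result.

{code:java}
import org.apache.spark.sql.catalyst.expressions.aggregate.{AggregateExpression, Max, Min}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

object EliminateDistinctInMaxMin extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
    // DISTINCT does not affect MAX/MIN, so the flag can simply be cleared.
    case ae @ AggregateExpression(_: Max | _: Min, _, true, _) =>
      ae.copy(isDistinct = false)
  }
}
{code}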






[jira] [Comment Edited] (SPARK-6635) DataFrame.withColumn can create columns with identical names

2017-06-26 Thread Yichuan Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16064095#comment-16064095
 ] 

Yichuan Wang edited comment on SPARK-6635 at 6/27/17 1:43 AM:
--

withColumn has this strange behavior with join: it matches both columns from 
the left and right side of the join and replaces them with the new column 
(Spark 2.0.2).

{code:java}
val left = Seq((1, "one", null, "p1"), (2, "two", "b", null)).toDF("id", 
"name", "note", "note2")
val right = Seq((1, "c", 11, "n1"), (2,  null, 22, "n2")).toDF("id",  "note", 
"seq", "note2")
val j = left.join(right, "id")
j.show()
{code}

{code:java}
+---+++-++---+-+
| id|name|note|note2|note|seq|note2|
+---+++-++---+-+
|  1| one|null|   p1|   c| 11|   n1|
|  2| two|   b| null|null| 22|   n2|
+---+++-++---+-+
{code}


{code:java}
val k = j.withColumn("note", coalesce(left("note"), right("note")))
k.show()
{code}


{code:java}
+---+++-++---+-+
| id|name|note|note2|note|seq|note2|
+---+++-++---+-+
|  1| one|   c|   p1|   c| 11|   n1|
|  2| two|   b| null|   b| 22|   n2|
+---+++-++---+-+
{code}


{code:java}
val l = k.drop("note")
l.show()
{code}



{code:java}
+---++-+---+-+
| id|name|note2|seq|note2|
+---++-+---+-+
|  1| one|   p1| 11|   n1|
|  2| two| null| 22|   n2|
+---++-+---+-+
{code}
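
A possible workaround (continuing from the snippet above, assuming Spark 2.x): write 
the merged value under a new, hypothetical column name and drop each side's "note" by 
Column reference, so only the intended occurrence is removed.

{code:java}
// note_merged is just an illustrative name; left, right, j and coalesce come from the snippet above.
val merged = j.withColumn("note_merged", coalesce(left("note"), right("note")))
  .drop(left("note"))    // Dataset.drop(col: Column) removes only this occurrence
  .drop(right("note"))
merged.show()
{code}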







was (Author: yichuan):
withColumn have this strange behavior with join, it replace both columns from 
the left and right side of join, and replace them with the new column (Spark 
2.0.2)

{code:java}
val left = Seq((1, "one", null, "p1"), (2, "two", "b", null)).toDF("id", 
"name", "note", "note2")
val right = Seq((1, "c", 11, "n1"), (2,  null, 22, "n2")).toDF("id",  "note", 
"seq", "note2")
val j = left.join(right, "id")
j.show()
{code}

{code:java}
+---+++-++---+-+
| id|name|note|note2|note|seq|note2|
+---+++-++---+-+
|  1| one|null|   p1|   c| 11|   n1|
|  2| two|   b| null|null| 22|   n2|
+---+++-++---+-+
{code}


{code:java}
val k = j.withColumn("note", coalesce(left("note"), right("note")))
k.show()
{code}


{code:java}
+---+++-++---+-+
| id|name|note|note2|note|seq|note2|
+---+++-++---+-+
|  1| one|   c|   p1|   c| 11|   n1|
|  2| two|   b| null|   b| 22|   n2|
+---+++-++---+-+
{code}


{code:java}
val l = k.drop("note")
l.show()
{code}



{code:java}
+---++-+---+-+
| id|name|note2|seq|note2|
+---++-+---+-+
|  1| one|   p1| 11|   n1|
|  2| two| null| 22|   n2|
+---++-+---+-+
{code}






> DataFrame.withColumn can create columns with identical names
> 
>
> Key: SPARK-6635
> URL: https://issues.apache.org/jira/browse/SPARK-6635
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Assignee: Liang-Chi Hsieh
> Fix For: 1.4.0
>
>
> DataFrame lets you create multiple columns with the same name, which causes 
> problems when you try to refer to columns by name.
> Proposal: If a column is added to a DataFrame with a column of the same name, 
> then the new column should replace the old column.
> {code}
> scala> val df = sc.parallelize(Array(1,2,3)).toDF("x")
> df: org.apache.spark.sql.DataFrame = [x: int]
> scala> val df3 = df.withColumn("x", df("x") + 1)
> df3: org.apache.spark.sql.DataFrame = [x: int, x: int]
> scala> df3.collect()
> res1: Array[org.apache.spark.sql.Row] = Array([1,2], [2,3], [3,4])
> scala> df3("x")
> org.apache.spark.sql.AnalysisException: Reference 'x' is ambiguous, could be: 
> x, x.;
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:216)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:121)
>   at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:161)
>   at org.apache.spark.sql.DataFrame.col(DataFrame.scala:436)
>   at org.apache.spark.sql.DataFrame.apply(DataFrame.scala:426)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:26)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:31)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:33)
>   at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:35)
>   at $iwC$$iwC$$iwC$$iwC.<init>(<console>:37)
>   at $iwC$$iwC$$iwC.<init>(<console>:39)
>   at $iwC$$iwC.<init>(<console>:41)
>   at $iwC.<init>(<console>:43)
>   at <init>(<console>:45)
>   at .<init>(<console>:49)
>   at .<clinit>(<console>)
>   at .<init>(<console>:7)
>   at .<clinit>(<console>)
>   at $print(<console>)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAc

[jira] [Commented] (SPARK-6635) DataFrame.withColumn can create columns with identical names

2017-06-26 Thread Yichuan Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16064095#comment-16064095
 ] 

Yichuan Wang commented on SPARK-6635:
-

withColumn has this strange behavior with join: it matches both columns from 
the left and right side of the join and replaces them with the new column (Spark 
2.0.2).

{code:java}
val left = Seq((1, "one", null, "p1"), (2, "two", "b", null)).toDF("id", 
"name", "note", "note2")
val right = Seq((1, "c", 11, "n1"), (2,  null, 22, "n2")).toDF("id",  "note", 
"seq", "note2")
val j = left.join(right, "id")
j.show()
{code}

{code:java}
+---+++-++---+-+
| id|name|note|note2|note|seq|note2|
+---+++-++---+-+
|  1| one|null|   p1|   c| 11|   n1|
|  2| two|   b| null|null| 22|   n2|
+---+++-++---+-+
{code}


{code:java}
val k = j.withColumn("note", coalesce(left("note"), right("note")))
k.show()
{code}


{code:java}
+---+++-++---+-+
| id|name|note|note2|note|seq|note2|
+---+++-++---+-+
|  1| one|   c|   p1|   c| 11|   n1|
|  2| two|   b| null|   b| 22|   n2|
+---+++-++---+-+
{code}


{code:java}
val l = k.drop("note")
l.show()
{code}



{code:java}
+---++-+---+-+
| id|name|note2|seq|note2|
+---++-+---+-+
|  1| one|   p1| 11|   n1|
|  2| two| null| 22|   n2|
+---++-+---+-+
{code}






> DataFrame.withColumn can create columns with identical names
> 
>
> Key: SPARK-6635
> URL: https://issues.apache.org/jira/browse/SPARK-6635
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Assignee: Liang-Chi Hsieh
> Fix For: 1.4.0
>
>
> DataFrame lets you create multiple columns with the same name, which causes 
> problems when you try to refer to columns by name.
> Proposal: If a column is added to a DataFrame with a column of the same name, 
> then the new column should replace the old column.
> {code}
> scala> val df = sc.parallelize(Array(1,2,3)).toDF("x")
> df: org.apache.spark.sql.DataFrame = [x: int]
> scala> val df3 = df.withColumn("x", df("x") + 1)
> df3: org.apache.spark.sql.DataFrame = [x: int, x: int]
> scala> df3.collect()
> res1: Array[org.apache.spark.sql.Row] = Array([1,2], [2,3], [3,4])
> scala> df3("x")
> org.apache.spark.sql.AnalysisException: Reference 'x' is ambiguous, could be: 
> x, x.;
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:216)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:121)
>   at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:161)
>   at org.apache.spark.sql.DataFrame.col(DataFrame.scala:436)
>   at org.apache.spark.sql.DataFrame.apply(DataFrame.scala:426)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:26)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:31)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:33)
>   at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:35)
>   at $iwC$$iwC$$iwC$$iwC.<init>(<console>:37)
>   at $iwC$$iwC$$iwC.<init>(<console>:39)
>   at $iwC$$iwC.<init>(<console>:41)
>   at $iwC.<init>(<console>:43)
>   at <init>(<console>:45)
>   at .<init>(<console>:49)
>   at .<clinit>(<console>)
>   at .<init>(<console>:7)
>   at .<clinit>(<console>)
>   at $print(<console>)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
>   at 
> org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338)
>   at 
> org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
>   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
>   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
>   at 
> org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:856)
>   at 
> org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:901)
>   at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:813)
>   at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:656)
>   at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:664)
>   at 
> org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:669)
>   at 
> org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:996)
>   at 
> org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:944)
>   at

[jira] [Commented] (SPARK-21221) CrossValidator and TrainValidationSplit Persist Nested Estimators

2017-06-26 Thread Ajay Saini (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16064068#comment-16064068
 ] 

Ajay Saini commented on SPARK-21221:


Note: In order for python persistence of OneVsRest inside a 
CrossValidator/TrainValidationSplit to work this change needs to be merged 
because Python persistence of meta-algorithms relies on the Scala saving 
implementation.

> CrossValidator and TrainValidationSplit Persist Nested Estimators
> -
>
> Key: SPARK-21221
> URL: https://issues.apache.org/jira/browse/SPARK-21221
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.1.1
>Reporter: Ajay Saini
>
> Currently, the saving of parameters done in ValidatorParams.scala assumes 
> that all parameters in EstimatorParameterMaps are JSON serializable. Such an 
> assumption causes CrossValidator and TrainValidationSplit persistence to fail 
> when the internal estimator to these meta-algorithms is not JSON 
> serializable. One example is OneVsRest which has a classifier (estimator) as 
> a parameter.
> The changes would involve removing the assumption in validator params that 
> all the estimator params are JSON serializable. This could mean saving 
> parameters that are not JSON serializable separately at a specified path. 






[jira] [Issue Comment Deleted] (SPARK-21221) CrossValidator and TrainValidationSplit Persist Nested Estimators

2017-06-26 Thread Ajay Saini (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ajay Saini updated SPARK-21221:
---
Comment: was deleted

(was: Pull Request Here: https://github.com/apache/spark/pull/18428)

> CrossValidator and TrainValidationSplit Persist Nested Estimators
> -
>
> Key: SPARK-21221
> URL: https://issues.apache.org/jira/browse/SPARK-21221
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.1.1
>Reporter: Ajay Saini
>
> Currently, the saving of parameters done in ValidatorParams.scala assumes 
> that all parameters in EstimatorParameterMaps are JSON serializable. Such an 
> assumption causes CrossValidator and TrainValidationSplit persistence to fail 
> when the internal estimator to these meta-algorithms is not JSON 
> serializable. One example is OneVsRest which has a classifier (estimator) as 
> a parameter.
> The changes would involve removing the assumption in validator params that 
> all the estimator params are JSON serializable. This could mean saving 
> parameters that are not JSON serializable separately at a specified path. 






[jira] [Commented] (SPARK-21221) CrossValidator and TrainValidationSplit Persist Nested Estimators

2017-06-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16064065#comment-16064065
 ] 

Apache Spark commented on SPARK-21221:
--

User 'ajaysaini725' has created a pull request for this issue:
https://github.com/apache/spark/pull/18428

> CrossValidator and TrainValidationSplit Persist Nested Estimators
> -
>
> Key: SPARK-21221
> URL: https://issues.apache.org/jira/browse/SPARK-21221
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.1.1
>Reporter: Ajay Saini
>
> Currently, the saving of parameters done in ValidatorParams.scala assumes 
> that all parameters in EstimatorParameterMaps are JSON serializable. Such an 
> assumption causes CrossValidator and TrainValidationSplit persistence to fail 
> when the internal estimator to these meta-algorithms is not JSON 
> serializable. One example is OneVsRest which has a classifier (estimator) as 
> a parameter.
> The changes would involve removing the assumption in validator params that 
> all the estimator params are JSON serializable. This could mean saving 
> parameters that are not JSON serializable separately at a specified path. 






[jira] [Assigned] (SPARK-21221) CrossValidator and TrainValidationSplit Persist Nested Estimators

2017-06-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21221:


Assignee: (was: Apache Spark)

> CrossValidator and TrainValidationSplit Persist Nested Estimators
> -
>
> Key: SPARK-21221
> URL: https://issues.apache.org/jira/browse/SPARK-21221
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.1.1
>Reporter: Ajay Saini
>
> Currently, the saving of parameters done in ValidatorParams.scala assumes 
> that all parameters in EstimatorParameterMaps are JSON serializable. Such an 
> assumption causes CrossValidator and TrainValidationSplit persistence to fail 
> when the internal estimator to these meta-algorithms is not JSON 
> serializable. One example is OneVsRest which has a classifier (estimator) as 
> a parameter.
> The changes would involve removing the assumption in validator params that 
> all the estimator params are JSON serializable. This could mean saving 
> parameters that are not JSON serializable separately at a specified path. 






[jira] [Assigned] (SPARK-21221) CrossValidator and TrainValidationSplit Persist Nested Estimators

2017-06-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21221:


Assignee: Apache Spark

> CrossValidator and TrainValidationSplit Persist Nested Estimators
> -
>
> Key: SPARK-21221
> URL: https://issues.apache.org/jira/browse/SPARK-21221
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.1.1
>Reporter: Ajay Saini
>Assignee: Apache Spark
>
> Currently, the saving of parameters done in ValidatorParams.scala assumes 
> that all parameters in EstimatorParameterMaps are JSON serializable. Such an 
> assumption causes CrossValidator and TrainValidationSplit persistence to fail 
> when the internal estimator to these meta-algorithms is not JSON 
> serializable. One example is OneVsRest which has a classifier (estimator) as 
> a parameter.
> The changes would involve removing the assumption in validator params that 
> all the estimator params are JSON serializable. This could mean saving 
> parameters that are not JSON serializable separately at a specified path. 






[jira] [Commented] (SPARK-21221) CrossValidator and TrainValidationSplit Persist Nested Estimators

2017-06-26 Thread Ajay Saini (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16064064#comment-16064064
 ] 

Ajay Saini commented on SPARK-21221:


Pull Request Here: https://github.com/apache/spark/pull/18428

> CrossValidator and TrainValidationSplit Persist Nested Estimators
> -
>
> Key: SPARK-21221
> URL: https://issues.apache.org/jira/browse/SPARK-21221
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.1.1
>Reporter: Ajay Saini
>
> Currently, the saving of parameters done in ValidatorParams.scala assumes 
> that all parameters in EstimatorParameterMaps are JSON serializable. Such an 
> assumption causes CrossValidator and TrainValidationSplit persistence to fail 
> when the internal estimator to these meta-algorithms is not JSON 
> serializable. One example is OneVsRest which has a classifier (estimator) as 
> a parameter.
> The changes would involve removing the assumption in validator params that 
> all the estimator params are JSON serializable. This could mean saving 
> parameters that are not JSON serializable separately at a specified path. 






[jira] [Commented] (SPARK-17129) Support statistics collection and cardinality estimation for partitioned tables

2017-06-26 Thread Zhenhua Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16064062#comment-16064062
 ] 

Zhenhua Wang commented on SPARK-17129:
--

[~mbasmanova] Thanks for working on it! I'll review it in the next few days.

> Support statistics collection and cardinality estimation for partitioned 
> tables
> ---
>
> Key: SPARK-17129
> URL: https://issues.apache.org/jira/browse/SPARK-17129
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Zhenhua Wang
>
> I upgraded this JIRA because many tasks have been found here that need to be 
> done.






[jira] [Created] (SPARK-21221) CrossValidator and TrainValidationSplit Persist Nested Estimators

2017-06-26 Thread Ajay Saini (JIRA)
Ajay Saini created SPARK-21221:
--

 Summary: CrossValidator and TrainValidationSplit Persist Nested 
Estimators
 Key: SPARK-21221
 URL: https://issues.apache.org/jira/browse/SPARK-21221
 Project: Spark
  Issue Type: New Feature
  Components: ML
Affects Versions: 2.1.1
Reporter: Ajay Saini


Currently, the saving of parameters done in ValidatorParams.scala assumes that 
all parameters in EstimatorParameterMaps are JSON serializable. Such an 
assumption causes CrossValidator and TrainValidationSplit persistence to fail 
when the internal estimator to these meta-algorithms is not JSON serializable. 
One example is OneVsRest which has a classifier (estimator) as a parameter.

The changes would involve removing the assumption in validator params that all 
the estimator params are JSON serializable. This could mean saving parameters 
that are not JSON serializable separately at a specified path. 
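
A minimal Scala sketch of the failing setup described above (assumes Spark 2.1.x ML 
and an active SparkSession; the save path is hypothetical):

{code:java}
import org.apache.spark.ml.classification.{LogisticRegression, OneVsRest}
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val lr = new LogisticRegression()
// OneVsRest carries a classifier (an Estimator) as a parameter.
val ovr = new OneVsRest().setClassifier(lr)
val grid = new ParamGridBuilder().addGrid(lr.maxIter, Array(10, 20)).build()

val cv = new CrossValidator()
  .setEstimator(ovr)
  .setEvaluator(new MulticlassClassificationEvaluator())
  .setEstimatorParamMaps(grid)

// Per the description above, persisting a validator whose internal estimator is not
// JSON serializable currently fails:
// cv.write.overwrite().save("/tmp/cv-with-ovr")
{code}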






[jira] [Commented] (SPARK-21063) Spark return an empty result from remote hadoop cluster

2017-06-26 Thread Yuming Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16064007#comment-16064007
 ] 

Yuming Wang commented on SPARK-21063:
-

[~pbykov], I just verified it. It returns the result without any specific 
configuration if {{jdbc:hive2://remote.hive.local:1/default}} is the [Spark 
hive-thriftserver|https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2.scala].
But it throws {{org.apache.thrift.TApplicationException}} if 
{{jdbc:hive2://remote.hive.local:1/default}} is the Hive 
[hive-thriftserver|https://github.com/apache/hive/blob/master/service/src/java/org/apache/hive/service/server/HiveServer2.java].

So what is your {{jdbc:hive2://remote.hive.local:1/default}}?

> Spark return an empty result from remote hadoop cluster
> ---
>
> Key: SPARK-21063
> URL: https://issues.apache.org/jira/browse/SPARK-21063
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.1.0, 2.1.1
>Reporter: Peter Bykov
>
> Spark returns an empty result when querying a remote Hadoop cluster.
> All firewall settings have been removed.
> Querying via JDBC works properly using the hive-jdbc driver, version 1.1.1.
> The code snippet is:
> {code:java}
> val spark = SparkSession.builder
> .appName("RemoteSparkTest")
> .master("local")
> .getOrCreate()
> val df = spark.read
>   .option("url", "jdbc:hive2://remote.hive.local:1/default")
>   .option("user", "user")
>   .option("password", "pass")
>   .option("dbtable", "test_table")
>   .option("driver", "org.apache.hive.jdbc.HiveDriver")
>   .format("jdbc")
>   .load()
>  
> df.show()
> {code}
> Result:
> {noformat}
> +---+
> |test_table.test_col|
> +---+
> +---+
> {noformat}
> All manipulations, like 
> {code:java}
> df.select("*").show()
> {code}
> return an empty result too.






[jira] [Closed] (SPARK-21212) Can't use Count(*) with Order Clause

2017-06-26 Thread Shawn Lavelle (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shawn Lavelle closed SPARK-21212.
-
Resolution: Not A Problem

> Can't use Count(*) with Order Clause
> 
>
> Key: SPARK-21212
> URL: https://issues.apache.org/jira/browse/SPARK-21212
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
> Environment: Windows; external data provided through data source api
>Reporter: Shawn Lavelle
>Priority: Minor
>
> I don't think this should fail the query:
> _Notes: VALUE is a column of table TABLE. columns and table names redacted. I 
> can generate a simplified test case if needed, but this is easy to reproduce. 
> _ 
> {code}jdbc:hive2://user:port/> select count(*) from table where value between 
> 1498240079000 and cast(now() as bigint)*1000 order by value;
> {code}
> {code}
> Error: org.apache.spark.sql.AnalysisException: cannot resolve '`value`' given 
> input columns: [count(1)]; line 1 pos 113;
> 'Sort ['value ASC NULLS FIRST], true
> +- Aggregate [count(1) AS count(1)#718L]
>+- Filter ((value#413L >= 1498240079000) && (value#413L <= 
> (cast(current_timestamp() as bigint) * cast(1000 as bigint
>   +- SubqueryAlias table
>  +- 
> Relation[field1#411L,field2#412,value#413L,field3#414,field4#415,field5#416,field6#417,field7#418,field8#419,field9#420]
>  com.redacted@16004579 (state=,code=0)
> {code}
> Arguably, the optimizer could ignore the "order by" clause, but I leave that 
> to more informed minds than my own.






[jira] [Commented] (SPARK-21212) Can't use Count(*) with Order Clause

2017-06-26 Thread Shawn Lavelle (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16064005#comment-16064005
 ] 

Shawn Lavelle commented on SPARK-21212:
---

I think you're right. I know my users (not skilled at SQL) will continue to 
attempt the query above regardless of what training I throw at them.

> Can't use Count(*) with Order Clause
> 
>
> Key: SPARK-21212
> URL: https://issues.apache.org/jira/browse/SPARK-21212
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
> Environment: Windows; external data provided through data source api
>Reporter: Shawn Lavelle
>Priority: Minor
>
> I don't think this should fail the query:
> _Notes: VALUE is a column of table TABLE. columns and table names redacted. I 
> can generate a simplified test case if needed, but this is easy to reproduce. 
> _ 
> {code}jdbc:hive2://user:port/> select count(*) from table where value between 
> 1498240079000 and cast(now() as bigint)*1000 order by value;
> {code}
> {code}
> Error: org.apache.spark.sql.AnalysisException: cannot resolve '`value`' given 
> input columns: [count(1)]; line 1 pos 113;
> 'Sort ['value ASC NULLS FIRST], true
> +- Aggregate [count(1) AS count(1)#718L]
>+- Filter ((value#413L >= 1498240079000) && (value#413L <= 
> (cast(current_timestamp() as bigint) * cast(1000 as bigint
>   +- SubqueryAlias table
>  +- 
> Relation[field1#411L,field2#412,value#413L,field3#414,field4#415,field5#416,field6#417,field7#418,field8#419,field9#420]
>  com.redacted@16004579 (state=,code=0)
> {code}
> Arguably, the optimizer could ignore the "order by" clause, but I leave that 
> to more informed minds than my own.






[jira] [Created] (SPARK-21220) Use outputPartitioning's bucketing if possible on write

2017-06-26 Thread Andrew Ash (JIRA)
Andrew Ash created SPARK-21220:
--

 Summary: Use outputPartitioning's bucketing if possible on write
 Key: SPARK-21220
 URL: https://issues.apache.org/jira/browse/SPARK-21220
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.2.0
Reporter: Andrew Ash


When reading a bucketed dataset and writing it back with no transformations (a 
copy) the bucketing information is lost and the user is required to re-specify 
the bucketing information on write.  This negatively affects read performance 
on the copied dataset since the bucketing information enables significant 
optimizations that aren't possible on the un-bucketed copied table.

Spark should propagate this bucketing information for copied datasets, and more 
generally could support inferring bucket information based on the known 
partitioning of the final RDD at save time when that partitioning is a 
{{HashPartitioning}}.

https://github.com/apache/spark/blob/v2.2.0-rc5/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala#L118

In the above linked {{bucketIdExpression}}, we could {{.orElse}} a bucket 
expression based on outputPartitionings that are HashPartitioning.
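
A minimal, self-contained Scala sketch of that ".orElse" fallback. All types here are 
simplified stand-ins rather than Spark's real BucketSpec/HashPartitioning classes: 
prefer the user-supplied bucket spec, otherwise derive bucket columns from a known hash 
partitioning of the output.

{code:java}
case class BucketSpec(numBuckets: Int, bucketColumnNames: Seq[String])

sealed trait Partitioning
case class HashPartitioning(columnNames: Seq[String], numPartitions: Int) extends Partitioning
case object UnknownPartitioning extends Partitioning

def bucketColumnsForWrite(
    userBucketSpec: Option[BucketSpec],
    outputPartitioning: Partitioning): Option[(Seq[String], Int)] =
  userBucketSpec
    .map(spec => (spec.bucketColumnNames, spec.numBuckets))
    .orElse(outputPartitioning match {
      // Infer bucketing from the final plan's partitioning when the user gave none.
      case HashPartitioning(cols, n) => Some((cols, n))
      case _ => None
    })
{code}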

This preserves bucket information for bucketed datasets, and also supports 
saving this metadata at write time for datasets with a known partitioning.  
Both of these cases should improve performance at read time of the 
newly-written dataset.






[jira] [Commented] (SPARK-14220) Build and test Spark against Scala 2.12

2017-06-26 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16063984#comment-16063984
 ] 

Reynold Xin commented on SPARK-14220:
-

If all those issues have been released, then it would be easy. Do you want to 
take a stab at that and see if it works with the DataFrame API?


> Build and test Spark against Scala 2.12
> ---
>
> Key: SPARK-14220
> URL: https://issues.apache.org/jira/browse/SPARK-14220
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build, Project Infra
>Affects Versions: 2.1.0
>Reporter: Josh Rosen
>Priority: Blocker
>
> This umbrella JIRA tracks the requirements for building and testing Spark 
> against the current Scala 2.12 milestone.






[jira] [Commented] (SPARK-21218) Convert IN predicate to equivalent Parquet filter

2017-06-26 Thread Andrew Duffy (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16063981#comment-16063981
 ] 

Andrew Duffy commented on SPARK-21218:
--

Good catch, looks like a dupe. [~hyukjin.kwon] did profiling on the issue last 
year and it was determined at the time that eval'ing the disjunction in Spark 
was faster.

> Convert IN predicate to equivalent Parquet filter
> -
>
> Key: SPARK-21218
> URL: https://issues.apache.org/jira/browse/SPARK-21218
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 2.1.1
>Reporter: Michael Styles
>
> Convert IN predicate to equivalent expression involving equality conditions 
> to allow the filter to be pushed down to Parquet.
> For instance,
> C1 IN (10, 20) is rewritten as (C1 = 10) OR (C1 = 20)
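
For illustration, a hedged Scala sketch of one way such a rewrite could be expressed 
over Catalyst expressions (class names from Spark's catalyst module; this is not the 
actual patch, and the rewrite only applies when every list item is a literal):

{code:java}
import org.apache.spark.sql.catalyst.expressions.{EqualTo, Expression, In, Literal, Or}

def rewriteInAsOr(pred: Expression): Expression = pred match {
  // C1 IN (10, 20)  ==>  (C1 = 10) OR (C1 = 20)
  case In(value, list) if list.nonEmpty && list.forall(_.isInstanceOf[Literal]) =>
    list.map(v => EqualTo(value, v): Expression).reduce(Or(_, _))
  case other => other
}
{code}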






[jira] [Resolved] (SPARK-17091) ParquetFilters rewrite IN to OR of Eq

2017-06-26 Thread Andrew Duffy (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Duffy resolved SPARK-17091.
--
Resolution: Won't Fix

Should've closed this last year; at the time, based on Hyukjin Kwon's 
profiling, evaluating the disjunction at the Spark level was determined to 
yield better performance.

> ParquetFilters rewrite IN to OR of Eq
> -
>
> Key: SPARK-17091
> URL: https://issues.apache.org/jira/browse/SPARK-17091
> Project: Spark
>  Issue Type: Bug
>Reporter: Andrew Duffy
>
> Past attempts at pushing down the InSet operation for Parquet relied on 
> user-defined predicates. It would be simpler to rewrite an IN clause into the 
> corresponding OR union of a set of equality conditions.






[jira] [Comment Edited] (SPARK-21212) Can't use Count(*) with Order Clause

2017-06-26 Thread Shawn Lavelle (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16063952#comment-16063952
 ] 

Shawn Lavelle edited comment on SPARK-21212 at 6/26/17 11:26 PM:
-

[~srowen], are you trying to say that the ORDER BY is attempting to apply 
itself to the aggregate count column while ignoring columns within the table 
itself? 


was (Author: azeroth2b):
[~srowen], I can assure you that value is a column in the table.  For example, 

{code}
select * from table where value between 1 and 5 order by value;
{code}
and
{code}
select count(*) from table where value between 1 and 5
{code}
both complete successfully.

> Can't use Count(*) with Order Clause
> 
>
> Key: SPARK-21212
> URL: https://issues.apache.org/jira/browse/SPARK-21212
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
> Environment: Windows; external data provided through data source api
>Reporter: Shawn Lavelle
>Priority: Minor
>
> I don't think this should fail the query:
> _Notes: VALUE is a column of table TABLE. columns and table names redacted. I 
> can generate a simplified test case if needed, but this is easy to reproduce. 
> _ 
> {code}jdbc:hive2://user:port/> select count(*) from table where value between 
> 1498240079000 and cast(now() as bigint)*1000 order by value;
> {code}
> {code}
> Error: org.apache.spark.sql.AnalysisException: cannot resolve '`value`' given 
> input columns: [count(1)]; line 1 pos 113;
> 'Sort ['value ASC NULLS FIRST], true
> +- Aggregate [count(1) AS count(1)#718L]
>+- Filter ((value#413L >= 1498240079000) && (value#413L <= 
> (cast(current_timestamp() as bigint) * cast(1000 as bigint
>   +- SubqueryAlias table
>  +- 
> Relation[field1#411L,field2#412,value#413L,field3#414,field4#415,field5#416,field6#417,field7#418,field8#419,field9#420]
>  com.redacted@16004579 (state=,code=0)
> {code}
> Arguably, the optimizer could ignore the "order by" clause, but I leave that 
> to more informed minds than my own.






[jira] [Commented] (SPARK-21212) Can't use Count(*) with Order Clause

2017-06-26 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16063968#comment-16063968
 ] 

Sean Owen commented on SPARK-21212:
---

Yes, but you are not selecting the thing you order by. I thought this was 
required in SQL; it seems to be in at least some engines, and it seems to be 
here. Add whatever it is to the select clause. 
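
For example (shown via spark.sql with the table and column names from the report; the 
"cnt" alias is hypothetical), either order by something the aggregate actually 
produces, or group by the column being ordered:

{code:java}
// Order by the aggregate's own output column:
spark.sql(
  """SELECT COUNT(*) AS cnt
    |FROM table
    |WHERE value BETWEEN 1498240079000 AND CAST(now() AS BIGINT) * 1000
    |ORDER BY cnt""".stripMargin)

// Or group by the ordered column so it is part of the output:
spark.sql(
  """SELECT value, COUNT(*) AS cnt
    |FROM table
    |WHERE value BETWEEN 1498240079000 AND CAST(now() AS BIGINT) * 1000
    |GROUP BY value
    |ORDER BY value""".stripMargin)
{code}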

> Can't use Count(*) with Order Clause
> 
>
> Key: SPARK-21212
> URL: https://issues.apache.org/jira/browse/SPARK-21212
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
> Environment: Windows; external data provided through data source api
>Reporter: Shawn Lavelle
>Priority: Minor
>
> I don't think this should fail the query:
> _Notes: VALUE is a column of table TABLE. columns and table names redacted. I 
> can generate a simplified test case if needed, but this is easy to reproduce. 
> _ 
> {code}jdbc:hive2://user:port/> select count(*) from table where value between 
> 1498240079000 and cast(now() as bigint)*1000 order by value;
> {code}
> {code}
> Error: org.apache.spark.sql.AnalysisException: cannot resolve '`value`' given 
> input columns: [count(1)]; line 1 pos 113;
> 'Sort ['value ASC NULLS FIRST], true
> +- Aggregate [count(1) AS count(1)#718L]
>+- Filter ((value#413L >= 1498240079000) && (value#413L <= 
> (cast(current_timestamp() as bigint) * cast(1000 as bigint
>   +- SubqueryAlias table
>  +- 
> Relation[field1#411L,field2#412,value#413L,field3#414,field4#415,field5#416,field6#417,field7#418,field8#419,field9#420]
>  com.redacted@16004579 (state=,code=0)
> {code}
> Arguably, the optimizer could ignore the "order by" clause, but I leave that 
> to more informed minds than my own.






[jira] [Updated] (SPARK-21212) Can't use Count(*) with Order Clause

2017-06-26 Thread Shawn Lavelle (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shawn Lavelle updated SPARK-21212:
--
Description: 
I don't think this should fail the query:

_Notes: VALUE is a column of table TABLE. columns and table names redacted. I 
can generate a simplified test case if needed, but this is easy to reproduce. _ 

{code}jdbc:hive2://user:port/> select count(*) from table where value between 
1498240079000 and cast(now() as bigint)*1000 order by value;
{code}


{code}
Error: org.apache.spark.sql.AnalysisException: cannot resolve '`value`' given 
input columns: [count(1)]; line 1 pos 113;
'Sort ['value ASC NULLS FIRST], true
+- Aggregate [count(1) AS count(1)#718L]
   +- Filter ((value#413L >= 1498240079000) && (value#413L <= 
(cast(current_timestamp() as bigint) * cast(1000 as bigint
  +- SubqueryAlias table
 +- 
Relation[field1#411L,field2#412,value#413L,field3#414,field4#415,field5#416,field6#417,field7#418,field8#419,field9#420]
 com.redacted@16004579 (state=,code=0)
{code}

Arguably, the optimizer could ignore the "order by" clause, but I leave that to 
more informed minds than my own.


  was:
I don't think this should fail the query:

{code}jdbc:hive2://user:port/> select count(*) from table where value between 
1498240079000 and cast(now() as bigint)*1000 order by value;
{code}


{code}
Error: org.apache.spark.sql.AnalysisException: cannot resolve '`value`' given 
input columns: [count(1)]; line 1 pos 113;
'Sort ['value ASC NULLS FIRST], true
+- Aggregate [count(1) AS count(1)#718L]
   +- Filter ((value#413L >= 1498240079000) && (value#413L <= 
(cast(current_timestamp() as bigint) * cast(1000 as bigint
  +- SubqueryAlias table
 +- 
Relation[field1#411L,field2#412,value#413L,field3#414,field4#415,field5#416,field6#417,field7#418,field8#419,field9#420]
 com.redacted@16004579 (state=,code=0)
{code}

Arguably, the optimizer could ignore the "order by" clause, but I leave that to 
more informed minds than my own.



> Can't use Count(*) with Order Clause
> 
>
> Key: SPARK-21212
> URL: https://issues.apache.org/jira/browse/SPARK-21212
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
> Environment: Windows; external data provided through data source api
>Reporter: Shawn Lavelle
>Priority: Minor
>
> I don't think this should fail the query:
> _Notes: VALUE is a column of table TABLE. columns and table names redacted. I 
> can generate a simplified test case if needed, but this is easy to reproduce. 
> _ 
> {code}jdbc:hive2://user:port/> select count(*) from table where value between 
> 1498240079000 and cast(now() as bigint)*1000 order by value;
> {code}
> {code}
> Error: org.apache.spark.sql.AnalysisException: cannot resolve '`value`' given 
> input columns: [count(1)]; line 1 pos 113;
> 'Sort ['value ASC NULLS FIRST], true
> +- Aggregate [count(1) AS count(1)#718L]
>+- Filter ((value#413L >= 1498240079000) && (value#413L <= 
> (cast(current_timestamp() as bigint) * cast(1000 as bigint
>   +- SubqueryAlias table
>  +- 
> Relation[field1#411L,field2#412,value#413L,field3#414,field4#415,field5#416,field6#417,field7#418,field8#419,field9#420]
>  com.redacted@16004579 (state=,code=0)
> {code}
> Arguably, the optimizer could ignore the "order by" clause, but I leave that 
> to more informed minds than my own.






[jira] [Commented] (SPARK-14220) Build and test Spark against Scala 2.12

2017-06-26 Thread Flavio Brasil (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16063957#comment-16063957
 ] 

Flavio Brasil commented on SPARK-14220:
---

[~rxin] Could you expand your last comment? Is it hard because of the issue 
described in [this doc | 
https://docs.google.com/document/d/1P_wmH3U356f079AYgSsN53HKixuNdxSEvo8nw_tgLgM/edit]?
 Scala 2.12 already [supports automatic conversion | 
https://scastie.scala-lang.org/dvkk8TFOR1OYKUz2snfzbw] from Scala functions to 
Java functional interfaces; maybe that wasn't the case at the time the doc 
was written.
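
A small illustration of the Scala 2.12 behavior referenced above (requires a 2.12 
compiler): a Scala lambda is accepted directly where a Java functional interface 
(SAM type) is expected.

{code:java}
val task: Runnable = () => println("converted to a Java SAM type")
new Thread(task).start()
{code}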

Spark is one of the last dependencies at Twitter that doesn't support Scala 
2.12. Would it be possible for us to collaborate on this?

> Build and test Spark against Scala 2.12
> ---
>
> Key: SPARK-14220
> URL: https://issues.apache.org/jira/browse/SPARK-14220
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build, Project Infra
>Affects Versions: 2.1.0
>Reporter: Josh Rosen
>Priority: Blocker
>
> This umbrella JIRA tracks the requirements for building and testing Spark 
> against the current Scala 2.12 milestone.






[jira] [Comment Edited] (SPARK-21212) Can't use Count(*) with Order Clause

2017-06-26 Thread Shawn Lavelle (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16063952#comment-16063952
 ] 

Shawn Lavelle edited comment on SPARK-21212 at 6/26/17 11:17 PM:
-

[~srowen], I can assure you that value is a column in the table.  For example, 

{code}
select * from table where value between 1 and 5 order by value;
{code}
and
{code}
select count(*) from table where value between 1 and 5
{code}
both complete successfully.


was (Author: azeroth2b):
[~srowen], I can assure you that value is a column in the table.  For example, 

[code]
select * from table where value between 1 and 5 order by value;
[code]
and
[code]
select coun(*) from table where value between 1 and 5
[code]
both complete successfully.

> Can't use Count(*) with Order Clause
> 
>
> Key: SPARK-21212
> URL: https://issues.apache.org/jira/browse/SPARK-21212
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
> Environment: Windows; external data provided through data source api
>Reporter: Shawn Lavelle
>Priority: Minor
>
> I don't think this should fail the query:
> {code}jdbc:hive2://user:port/> select count(*) from table where value between 
> 1498240079000 and cast(now() as bigint)*1000 order by value;
> {code}
> {code}
> Error: org.apache.spark.sql.AnalysisException: cannot resolve '`value`' given 
> input columns: [count(1)]; line 1 pos 113;
> 'Sort ['value ASC NULLS FIRST], true
> +- Aggregate [count(1) AS count(1)#718L]
>+- Filter ((value#413L >= 1498240079000) && (value#413L <= 
> (cast(current_timestamp() as bigint) * cast(1000 as bigint
>   +- SubqueryAlias table
>  +- 
> Relation[field1#411L,field2#412,value#413L,field3#414,field4#415,field5#416,field6#417,field7#418,field8#419,field9#420]
>  com.redacted@16004579 (state=,code=0)
> {code}
> Arguably, the optimizer could ignore the "order by" clause, but I leave that 
> to more informed minds than my own.






[jira] [Comment Edited] (SPARK-21212) Can't use Count(*) with Order Clause

2017-06-26 Thread Shawn Lavelle (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16063952#comment-16063952
 ] 

Shawn Lavelle edited comment on SPARK-21212 at 6/26/17 11:17 PM:
-

[~srowen], I can assure you that value is a column in the table.  For example, 

[code]
select * from table where value between 1 and 5 order by value;
[code]
and
[code]
select coun(*) from table where value between 1 and 5
[code]
both complete successfully.


was (Author: azeroth2b):
[~srowen], I can assure you that value is a column in the table.

> Can't use Count(*) with Order Clause
> 
>
> Key: SPARK-21212
> URL: https://issues.apache.org/jira/browse/SPARK-21212
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
> Environment: Windows; external data provided through data source api
>Reporter: Shawn Lavelle
>Priority: Minor
>
> I don't think this should fail the query:
> {code}jdbc:hive2://user:port/> select count(*) from table where value between 
> 1498240079000 and cast(now() as bigint)*1000 order by value;
> {code}
> {code}
> Error: org.apache.spark.sql.AnalysisException: cannot resolve '`value`' given 
> input columns: [count(1)]; line 1 pos 113;
> 'Sort ['value ASC NULLS FIRST], true
> +- Aggregate [count(1) AS count(1)#718L]
>+- Filter ((value#413L >= 1498240079000) && (value#413L <= 
> (cast(current_timestamp() as bigint) * cast(1000 as bigint
>   +- SubqueryAlias table
>  +- 
> Relation[field1#411L,field2#412,value#413L,field3#414,field4#415,field5#416,field6#417,field7#418,field8#419,field9#420]
>  com.redacted@16004579 (state=,code=0)
> {code}
> Arguably, the optimizer could ignore the "order by" clause, but I leave that 
> to more informed minds than my own.






[jira] [Commented] (SPARK-21212) Can't use Count(*) with Order Clause

2017-06-26 Thread Shawn Lavelle (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16063952#comment-16063952
 ] 

Shawn Lavelle commented on SPARK-21212:
---

[~srowen], I can assure you that value is a column in the table.

> Can't use Count(*) with Order Clause
> 
>
> Key: SPARK-21212
> URL: https://issues.apache.org/jira/browse/SPARK-21212
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
> Environment: Windows; external data provided through data source api
>Reporter: Shawn Lavelle
>Priority: Minor
>
> I don't think this should fail the query:
> {code}jdbc:hive2://user:port/> select count(*) from table where value between 
> 1498240079000 and cast(now() as bigint)*1000 order by value;
> {code}
> {code}
> Error: org.apache.spark.sql.AnalysisException: cannot resolve '`value`' given 
> input columns: [count(1)]; line 1 pos 113;
> 'Sort ['value ASC NULLS FIRST], true
> +- Aggregate [count(1) AS count(1)#718L]
>+- Filter ((value#413L >= 1498240079000) && (value#413L <= 
> (cast(current_timestamp() as bigint) * cast(1000 as bigint
>   +- SubqueryAlias table
>  +- 
> Relation[field1#411L,field2#412,value#413L,field3#414,field4#415,field5#416,field6#417,field7#418,field8#419,field9#420]
>  com.redacted@16004579 (state=,code=0)
> {code}
> Arguably, the optimizer could ignore the "order by" clause, but I leave that 
> to more informed minds than my own.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21218) Convert IN predicate to equivalent Parquet filter

2017-06-26 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16063925#comment-16063925
 ] 

Hyukjin Kwon commented on SPARK-21218:
--

I believe it is a duplicate of SPARK-17091.

> Convert IN predicate to equivalent Parquet filter
> -
>
> Key: SPARK-21218
> URL: https://issues.apache.org/jira/browse/SPARK-21218
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 2.1.1
>Reporter: Michael Styles
>
> Convert IN predicate to equivalent expression involving equality conditions 
> to allow the filter to be pushed down to Parquet.
> For instance,
> C1 IN (10, 20) is rewritten as (C1 = 10) OR (C1 = 20)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21212) Can't use Count(*) with Order Clause

2017-06-26 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16063915#comment-16063915
 ] 

Sean Owen commented on SPARK-21212:
---

Yes, but this doesn't work unless you say what 'value' is: {{select count(*) as 
value from table where ...}}
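
Spelled out as a minimal runnable sketch (table and column names here are illustrative, not taken from the redacted report): aliasing the aggregate gives the ORDER BY something it can resolve against the aggregated output.

{code}
// Illustrative only: after aggregation the only output column is the count,
// so the sort has to refer to that output (here via the alias "value").
// On a single-row result the ORDER BY is of course a no-op.
val counted = spark.sql(
  """SELECT count(*) AS value
    |FROM events
    |WHERE value BETWEEN 1498240079000 AND cast(current_timestamp() AS bigint) * 1000
    |ORDER BY value""".stripMargin)
counted.show()
{code}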

> Can't use Count(*) with Order Clause
> 
>
> Key: SPARK-21212
> URL: https://issues.apache.org/jira/browse/SPARK-21212
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
> Environment: Windows; external data provided through data source api
>Reporter: Shawn Lavelle
>Priority: Minor
>
> I don't think this should fail the query:
> {code}jdbc:hive2://user:port/> select count(*) from table where value between 
> 1498240079000 and cast(now() as bigint)*1000 order by value;
> {code}
> {code}
> Error: org.apache.spark.sql.AnalysisException: cannot resolve '`value`' given 
> input columns: [count(1)]; line 1 pos 113;
> 'Sort ['value ASC NULLS FIRST], true
> +- Aggregate [count(1) AS count(1)#718L]
>+- Filter ((value#413L >= 1498240079000) && (value#413L <= 
> (cast(current_timestamp() as bigint) * cast(1000 as bigint
>   +- SubqueryAlias table
>  +- 
> Relation[field1#411L,field2#412,value#413L,field3#414,field4#415,field5#416,field6#417,field7#418,field8#419,field9#420]
>  com.redacted@16004579 (state=,code=0)
> {code}
> Arguably, the optimizer could ignore the "order by" clause, but I leave that 
> to more informed minds than my own.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21219) Task retry occurs on same executor due to race condition with blacklisting

2017-06-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16063909#comment-16063909
 ] 

Apache Spark commented on SPARK-21219:
--

User 'ericvandenbergfb' has created a pull request for this issue:
https://github.com/apache/spark/pull/18427

> Task retry occurs on same executor due to race condition with blacklisting
> --
>
> Key: SPARK-21219
> URL: https://issues.apache.org/jira/browse/SPARK-21219
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.1.1
>Reporter: Eric Vandenberg
>Priority: Minor
> Attachments: spark_driver.log.anon, spark_executor.log.anon
>
>
> When a task fails it is (1) added into the pending task list and then (2) 
> corresponding black list policy is enforced (ie, specifying if it can/can't 
> run on a particular node/executor/etc.)  Unfortunately the ordering is such 
> that retrying the task could assign the task to the same executor, which, 
> incidentally could be shutting down and immediately fail the retry.   Instead 
> the order should be (1) the black list state should be updated and then (2) 
> the task assigned, ensuring that the black list policy is properly enforced.
> The attached logs demonstrate the race condition.
> See spark_executor.log.anon:
> 1. Task 55.2 fails on the executor
> 17/06/20 13:25:07 ERROR Executor: Exception in task 55.2 in stage 5.0 (TID 
> 39575)
> java.lang.OutOfMemoryError: Java heap space
> 2. Immediately the same executor is assigned the retry task:
> 17/06/20 13:25:07 INFO CoarseGrainedExecutorBackend: Got assigned task 39651
> 17/06/20 13:25:07 INFO Executor: Running task 55.3 in stage 5.0 (TID 39651)
> 3. The retry task of course fails since the executor is also shutting down 
> due to the original task 55.2 OOM failure.
> See the spark_driver.log.anon:
> The driver processes the lost task 55.2:
> 17/06/20 13:25:07 WARN TaskSetManager: Lost task 55.2 in stage 5.0 (TID 
> 39575, foobar.masked-server.com, executor 
> attempt_foobar.masked-server.com-___.masked-server.com-____0):
>  java.lang.OutOfMemoryError: Java heap space
> The driver then receives the ExecutorLostFailure for the retry task 55.3 
> (although it's obfuscated in these logs, the server info is same...)
> 17/06/20 13:25:10 WARN TaskSetManager: Lost task 55.3 in stage 5.0 (TID 
> 39651, foobar.masked-server.com, executor 
> attempt_foobar.masked-server.com-___.masked-server.com-____0):
>  ExecutorLostFailure (executor 
> attempt_foobar.masked-server.com-___.masked-server.com-____0
>  exited caused by one of the running tasks) Reason: Remote RPC client 
> disassociated. Likely due to containers exceeding thresholds, or network 
> issues. Check driver logs for WARN messages.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21219) Task retry occurs on same executor due to race condition with blacklisting

2017-06-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21219:


Assignee: (was: Apache Spark)

> Task retry occurs on same executor due to race condition with blacklisting
> --
>
> Key: SPARK-21219
> URL: https://issues.apache.org/jira/browse/SPARK-21219
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.1.1
>Reporter: Eric Vandenberg
>Priority: Minor
> Attachments: spark_driver.log.anon, spark_executor.log.anon
>
>
> When a task fails it is (1) added into the pending task list and then (2) 
> corresponding black list policy is enforced (ie, specifying if it can/can't 
> run on a particular node/executor/etc.)  Unfortunately the ordering is such 
> that retrying the task could assign the task to the same executor, which, 
> incidentally could be shutting down and immediately fail the retry.   Instead 
> the order should be (1) the black list state should be updated and then (2) 
> the task assigned, ensuring that the black list policy is properly enforced.
> The attached logs demonstrate the race condition.
> See spark_executor.log.anon:
> 1. Task 55.2 fails on the executor
> 17/06/20 13:25:07 ERROR Executor: Exception in task 55.2 in stage 5.0 (TID 
> 39575)
> java.lang.OutOfMemoryError: Java heap space
> 2. Immediately the same executor is assigned the retry task:
> 17/06/20 13:25:07 INFO CoarseGrainedExecutorBackend: Got assigned task 39651
> 17/06/20 13:25:07 INFO Executor: Running task 55.3 in stage 5.0 (TID 39651)
> 3. The retry task of course fails since the executor is also shutting down 
> due to the original task 55.2 OOM failure.
> See the spark_driver.log.anon:
> The driver processes the lost task 55.2:
> 17/06/20 13:25:07 WARN TaskSetManager: Lost task 55.2 in stage 5.0 (TID 
> 39575, foobar.masked-server.com, executor 
> attempt_foobar.masked-server.com-___.masked-server.com-____0):
>  java.lang.OutOfMemoryError: Java heap space
> The driver then receives the ExecutorLostFailure for the retry task 55.3 
> (although it's obfuscated in these logs, the server info is same...)
> 17/06/20 13:25:10 WARN TaskSetManager: Lost task 55.3 in stage 5.0 (TID 
> 39651, foobar.masked-server.com, executor 
> attempt_foobar.masked-server.com-___.masked-server.com-____0):
>  ExecutorLostFailure (executor 
> attempt_foobar.masked-server.com-___.masked-server.com-____0
>  exited caused by one of the running tasks) Reason: Remote RPC client 
> disassociated. Likely due to containers exceeding thresholds, or network 
> issues. Check driver logs for WARN messages.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21219) Task retry occurs on same executor due to race condition with blacklisting

2017-06-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21219:


Assignee: Apache Spark

> Task retry occurs on same executor due to race condition with blacklisting
> --
>
> Key: SPARK-21219
> URL: https://issues.apache.org/jira/browse/SPARK-21219
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.1.1
>Reporter: Eric Vandenberg
>Assignee: Apache Spark
>Priority: Minor
> Attachments: spark_driver.log.anon, spark_executor.log.anon
>
>
> When a task fails it is (1) added into the pending task list and then (2) 
> corresponding black list policy is enforced (ie, specifying if it can/can't 
> run on a particular node/executor/etc.)  Unfortunately the ordering is such 
> that retrying the task could assign the task to the same executor, which, 
> incidentally could be shutting down and immediately fail the retry.   Instead 
> the order should be (1) the black list state should be updated and then (2) 
> the task assigned, ensuring that the black list policy is properly enforced.
> The attached logs demonstrate the race condition.
> See spark_executor.log.anon:
> 1. Task 55.2 fails on the executor
> 17/06/20 13:25:07 ERROR Executor: Exception in task 55.2 in stage 5.0 (TID 
> 39575)
> java.lang.OutOfMemoryError: Java heap space
> 2. Immediately the same executor is assigned the retry task:
> 17/06/20 13:25:07 INFO CoarseGrainedExecutorBackend: Got assigned task 39651
> 17/06/20 13:25:07 INFO Executor: Running task 55.3 in stage 5.0 (TID 39651)
> 3. The retry task of course fails since the executor is also shutting down 
> due to the original task 55.2 OOM failure.
> See the spark_driver.log.anon:
> The driver processes the lost task 55.2:
> 17/06/20 13:25:07 WARN TaskSetManager: Lost task 55.2 in stage 5.0 (TID 
> 39575, foobar.masked-server.com, executor 
> attempt_foobar.masked-server.com-___.masked-server.com-____0):
>  java.lang.OutOfMemoryError: Java heap space
> The driver then receives the ExecutorLostFailure for the retry task 55.3 
> (although it's obfuscated in these logs, the server info is same...)
> 17/06/20 13:25:10 WARN TaskSetManager: Lost task 55.3 in stage 5.0 (TID 
> 39651, foobar.masked-server.com, executor 
> attempt_foobar.masked-server.com-___.masked-server.com-____0):
>  ExecutorLostFailure (executor 
> attempt_foobar.masked-server.com-___.masked-server.com-____0
>  exited caused by one of the running tasks) Reason: Remote RPC client 
> disassociated. Likely due to containers exceeding thresholds, or network 
> issues. Check driver logs for WARN messages.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21212) Can't use Count(*) with Order Clause

2017-06-26 Thread Shawn Lavelle (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16063903#comment-16063903
 ] 

Shawn Lavelle commented on SPARK-21212:
---

I redacted most of the details to protect proprietary information. The SQL 
optimizer replaces count(*) with count(1).

> Can't use Count(*) with Order Clause
> 
>
> Key: SPARK-21212
> URL: https://issues.apache.org/jira/browse/SPARK-21212
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
> Environment: Windows; external data provided through data source api
>Reporter: Shawn Lavelle
>Priority: Minor
>
> I don't think this should fail the query:
> {code}jdbc:hive2://user:port/> select count(*) from table where value between 
> 1498240079000 and cast(now() as bigint)*1000 order by value;
> {code}
> {code}
> Error: org.apache.spark.sql.AnalysisException: cannot resolve '`value`' given 
> input columns: [count(1)]; line 1 pos 113;
> 'Sort ['value ASC NULLS FIRST], true
> +- Aggregate [count(1) AS count(1)#718L]
>+- Filter ((value#413L >= 1498240079000) && (value#413L <= 
> (cast(current_timestamp() as bigint) * cast(1000 as bigint
>   +- SubqueryAlias table
>  +- 
> Relation[field1#411L,field2#412,value#413L,field3#414,field4#415,field5#416,field6#417,field7#418,field8#419,field9#420]
>  com.redacted@16004579 (state=,code=0)
> {code}
> Arguably, the optimizer could ignore the "order by" clause, but I leave that 
> to more informed minds than my own.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21190) SPIP: Vectorized UDFs in Python

2017-06-26 Thread Li Jin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16063873#comment-16063873
 ] 

Li Jin edited comment on SPARK-21190 at 6/26/17 10:02 PM:
--

[~r...@databricks.com],

The use case of seeing an entire partition at a time is one of the less 
interesting ones, because partitioning is largely an implementation detail of 
Spark. From the use cases I have seen from our internal PySpark/pandas users, the 
main interest is in using pandas UDFs with Spark windowing and grouping 
operations, for instance breaking data into day/month/year pieces and performing 
some analysis with pandas on each piece.

However, the challenge still holds if one group is too large. One idea is to 
implement local aggregation and merge as pandas operations so they can be 
executed efficiently, but I haven't put much thought into this yet.

I think that even without solving the "one group doesn't fit in memory" problem, 
this can already be a pretty powerful tool, because currently the *entire 
dataset* has to fit in memory to perform any pandas analysis (that is, without 
using tools like dask).


was (Author: icexelloss):
[~r...@databricks.com],

The use case of seeing entire partition at a time is one of the less 
interesting use case, because partition is somehow an implementation detail of 
spark. From the use cases I saw from our internal pyspark/pandas users, the 
majority interests are to use pandas udf with a spark windowing and grouping 
operation, for instance, breaking data into day/month/year piece and perform 
some analysis using pandas for each piece.

However, the challenge still holds if one group is too large. One idea is to 
implement local aggregation and merge in pandas operations so it can be 
executed efficiently.

I think without solving the problem of "one group doesn't fit in memory", this 
can already be a pretty powerful tool because currently to perform any pandas 
analysis, the *entire dataset* has to fit in memory (that is, without using 
tools like dask).

> SPIP: Vectorized UDFs in Python
> ---
>
> Key: SPARK-21190
> URL: https://issues.apache.org/jira/browse/SPARK-21190
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>  Labels: SPIP
> Attachments: SPIPVectorizedUDFsforPython (1).pdf
>
>
> *Background and Motivation*
> Python is one of the most popular programming languages among Spark users. 
> Spark currently exposes a row-at-a-time interface for defining and executing 
> user-defined functions (UDFs). This introduces high overhead in serialization 
> and deserialization, and also makes it difficult to leverage Python libraries 
> (e.g. numpy, Pandas) that are written in native code.
>  
> This proposal advocates introducing new APIs to support vectorized UDFs in 
> Python, in which a block of data is transferred over to Python in some 
> columnar format for execution.
>  
>  
> *Target Personas*
> Data scientists, data engineers, library developers.
>  
> *Goals*
> - Support vectorized UDFs that apply on chunks of the data frame
> - Low system overhead: Substantially reduce serialization and deserialization 
> overhead when compared with row-at-a-time interface
> - UDF performance: Enable users to leverage native libraries in Python (e.g. 
> numpy, Pandas) for data manipulation in these UDFs
>  
> *Non-Goals*
> The following are explicitly out of scope for the current SPIP, and should be 
> done in future SPIPs. Nonetheless, it would be good to consider these future 
> use cases during API design, so we can achieve some consistency when rolling 
> out new APIs.
>  
> - Define block oriented UDFs in other languages (that are not Python).
> - Define aggregate UDFs
> - Tight integration with machine learning frameworks
>  
> *Proposed API Changes*
> The following sketches some possibilities. I haven’t spent a lot of time 
> thinking about the API (wrote it down in 5 mins) and I am not attached to 
> this design at all. The main purpose of the SPIP is to get feedback on use 
> cases and see how they can impact API design.
>  
> Two things to consider are:
>  
> 1. Python is dynamically typed, whereas DataFrames/SQL requires static, 
> analysis time typing. This means users would need to specify the return type 
> of their UDFs.
>  
> 2. Ratio of input rows to output rows. We propose initially we require number 
> of output rows to be the same as the number of input rows. In the future, we 
> can consider relaxing this constraint with support for vectorized aggregate 
> UDFs.
>  
> Proposed API sketch (using examples):
>  
> Use case 1. A function that defines all the columns of a DataFrame (similar 
> to a “map” function):
>  
> {code}
> @spark_

[jira] [Commented] (SPARK-21190) SPIP: Vectorized UDFs in Python

2017-06-26 Thread Li Jin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16063873#comment-16063873
 ] 

Li Jin commented on SPARK-21190:


[~r...@databricks.com],

The use case of seeing entire partition at a time is one of the less 
interesting use case, because partition is somehow an implementation detail of 
spark. From the use cases I saw from our internal pyspark/pandas users, the 
majority interests are to use pandas udf with a spark windowing and grouping 
operation, for instance, breaking data into day/month/year piece and perform 
some analysis using pandas for each piece.

However, the challenge still holds if one group is too large. One idea is to 
implement local aggregation and merge in pandas operations so it can be 
executed efficiently.

I think without solving the problem of "one group doesn't fit in memory", this 
can already be a pretty powerful tool because currently to perform any pandas 
analysis, the *entire dataset* has to fit in memory (that is, without using 
tools like dask).

> SPIP: Vectorized UDFs in Python
> ---
>
> Key: SPARK-21190
> URL: https://issues.apache.org/jira/browse/SPARK-21190
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>  Labels: SPIP
> Attachments: SPIPVectorizedUDFsforPython (1).pdf
>
>
> *Background and Motivation*
> Python is one of the most popular programming languages among Spark users. 
> Spark currently exposes a row-at-a-time interface for defining and executing 
> user-defined functions (UDFs). This introduces high overhead in serialization 
> and deserialization, and also makes it difficult to leverage Python libraries 
> (e.g. numpy, Pandas) that are written in native code.
>  
> This proposal advocates introducing new APIs to support vectorized UDFs in 
> Python, in which a block of data is transferred over to Python in some 
> columnar format for execution.
>  
>  
> *Target Personas*
> Data scientists, data engineers, library developers.
>  
> *Goals*
> - Support vectorized UDFs that apply on chunks of the data frame
> - Low system overhead: Substantially reduce serialization and deserialization 
> overhead when compared with row-at-a-time interface
> - UDF performance: Enable users to leverage native libraries in Python (e.g. 
> numpy, Pandas) for data manipulation in these UDFs
>  
> *Non-Goals*
> The following are explicitly out of scope for the current SPIP, and should be 
> done in future SPIPs. Nonetheless, it would be good to consider these future 
> use cases during API design, so we can achieve some consistency when rolling 
> out new APIs.
>  
> - Define block oriented UDFs in other languages (that are not Python).
> - Define aggregate UDFs
> - Tight integration with machine learning frameworks
>  
> *Proposed API Changes*
> The following sketches some possibilities. I haven’t spent a lot of time 
> thinking about the API (wrote it down in 5 mins) and I am not attached to 
> this design at all. The main purpose of the SPIP is to get feedback on use 
> cases and see how they can impact API design.
>  
> Two things to consider are:
>  
> 1. Python is dynamically typed, whereas DataFrames/SQL requires static, 
> analysis time typing. This means users would need to specify the return type 
> of their UDFs.
>  
> 2. Ratio of input rows to output rows. We propose initially we require number 
> of output rows to be the same as the number of input rows. In the future, we 
> can consider relaxing this constraint with support for vectorized aggregate 
> UDFs.
>  
> Proposed API sketch (using examples):
>  
> Use case 1. A function that defines all the columns of a DataFrame (similar 
> to a “map” function):
>  
> {code}
> @spark_udf(some way to describe the return schema)
> def my_func_on_entire_df(input):
>   """ Some user-defined function.
>  
>   :param input: A Pandas DataFrame with two columns, a and b.
>   :return: :class: A Pandas data frame.
>   """
>   input[c] = input[a] + input[b]
>   Input[d] = input[a] - input[b]
>   return input
>  
> spark.range(1000).selectExpr("id a", "id / 2 b")
>   .mapBatches(my_func_on_entire_df)
> {code}
>  
> Use case 2. A function that defines only one column (similar to existing 
> UDFs):
>  
> {code}
> @spark_udf(some way to describe the return schema)
> def my_func_that_returns_one_column(input):
>   """ Some user-defined function.
>  
>   :param input: A Pandas DataFrame with two columns, a and b.
>   :return: :class: A numpy array
>   """
>   return input[a] + input[b]
>  
> my_func = udf(my_func_that_returns_one_column)
>  
> df = spark.range(1000).selectExpr("id a", "id / 2 b")
> df.withColumn("c", my_func(df.a, df.b))
> {code}
>  
>  
>  
> *Optional Design Sketch*

[jira] [Updated] (SPARK-21219) Task retry occurs on same executor due to race condition with blacklisting

2017-06-26 Thread Eric Vandenberg (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Vandenberg updated SPARK-21219:

Description: 
When a task fails it is (1) added into the pending task list and then (2) 
corresponding black list policy is enforced (ie, specifying if it can/can't run 
on a particular node/executor/etc.)  Unfortunately the ordering is such that 
retrying the task could assign the task to the same executor, which, 
incidentally could be shutting down and immediately fail the retry.   Instead 
the order should be (1) the black list state should be updated and then (2) the 
task assigned, ensuring that the black list policy is properly enforced.

The attached logs demonstrate the race condition.

See spark_executor.log.anon:

1. Task 55.2 fails on the executor

17/06/20 13:25:07 ERROR Executor: Exception in task 55.2 in stage 5.0 (TID 
39575)
java.lang.OutOfMemoryError: Java heap space

2. Immediately the same executor is assigned the retry task:

17/06/20 13:25:07 INFO CoarseGrainedExecutorBackend: Got assigned task 39651
17/06/20 13:25:07 INFO Executor: Running task 55.3 in stage 5.0 (TID 39651)

3. The retry task of course fails since the executor is also shutting down due 
to the original task 55.2 OOM failure.

See the spark_driver.log.anon:

The driver processes the lost task 55.2:

17/06/20 13:25:07 WARN TaskSetManager: Lost task 55.2 in stage 5.0 (TID 39575, 
foobar.masked-server.com, executor 
attempt_foobar.masked-server.com-___.masked-server.com-____0):
 java.lang.OutOfMemoryError: Java heap space

The driver then receives the ExecutorLostFailure for the retry task 55.3 
(although it's obfuscated in these logs, the server info is same...)

17/06/20 13:25:10 WARN TaskSetManager: Lost task 55.3 in stage 5.0 (TID 39651, 
foobar.masked-server.com, executor 
attempt_foobar.masked-server.com-___.masked-server.com-____0):
 ExecutorLostFailure (executor 
attempt_foobar.masked-server.com-___.masked-server.com-____0
 exited caused by one of the running tasks) Reason: Remote RPC client 
disassociated. Likely due to containers exceeding thresholds, or network 
issues. Check driver logs for WARN messages.


  was:
When a task fails it is added into the pending task list and corresponding 
black list policy is enforced (ie, specifying if it can/can't run on a 
particular node/executor/etc.)  Unfortunately the ordering is such that 
retrying the task could assign the task to the same executor, which, 
incidentally could be shutting down and immediately fail the retry.   Instead 
the black list state should be updated and then the task assigned, ensuring 
that the black list policy is properly enforced.

The attached logs demonstrate the race condition.

See spark_executor.log.anon:

1. Task 55.2 fails on the executor

17/06/20 13:25:07 ERROR Executor: Exception in task 55.2 in stage 5.0 (TID 
39575)
java.lang.OutOfMemoryError: Java heap space

2. Immediately the same executor is assigned the retry task:

17/06/20 13:25:07 INFO CoarseGrainedExecutorBackend: Got assigned task 39651
17/06/20 13:25:07 INFO Executor: Running task 55.3 in stage 5.0 (TID 39651)

3. The retry task of course fails since the executor is also shutting down due 
to the original task 55.2 OOM failure.

See the spark_driver.log.anon:

The driver processes the lost task 55.2:

17/06/20 13:25:07 WARN TaskSetManager: Lost task 55.2 in stage 5.0 (TID 39575, 
foobar.masked-server.com, executor 
attempt_foobar.masked-server.com-___.masked-server.com-____0):
 java.lang.OutOfMemoryError: Java heap space

The driver then receives the ExecutorLostFailure for the retry task 55.3 
(although it's obfuscated in these logs, the server info is same...)

17/06/20 13:25:10 WARN TaskSetManager: Lost task 55.3 in stage 5.0 (TID 39651, 
foobar.masked-server.com, executor 
attempt_foobar.masked-server.com-___.masked-server.com-____0):
 ExecutorLostFailure (executor 
attempt_foobar.masked-server.com-___.masked-server.com-____0
 exited caused by one of the running tasks) Reason: Remote RPC client 
disassociated. Likely due to containers exceeding thresholds, or network 
issues. Check driver logs for WARN messages.



> Task retry occurs on same executor due to race condition with blacklisting
> --
>
> Key: SPARK-21219
> URL: https://issues.apache.org/jira/browse/SPARK-21219
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.1.1
>Reporter: Eric Vandenberg
>Priority: Minor
> Attachments: spark_driver.log.anon, spark_executor.log.anon
>
>
> W

[jira] [Updated] (SPARK-21219) Task retry occurs on same executor due to race condition with blacklisting

2017-06-26 Thread Eric Vandenberg (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Vandenberg updated SPARK-21219:

Description: 
When a task fails it is added into the pending task list and corresponding 
black list policy is enforced (ie, specifying if it can/can't run on a 
particular node/executor/etc.)  Unfortunately the ordering is such that 
retrying the task could assign the task to the same executor, which, 
incidentally could be shutting down and immediately fail the retry.   Instead 
the black list state should be updated and then the task assigned, ensuring 
that the black list policy is properly enforced.

The attached logs demonstrate the race condition.

See spark_executor.log.anon:

1. Task 55.2 fails on the executor

17/06/20 13:25:07 ERROR Executor: Exception in task 55.2 in stage 5.0 (TID 
39575)
java.lang.OutOfMemoryError: Java heap space

2. Immediately the same executor is assigned the retry task:

17/06/20 13:25:07 INFO CoarseGrainedExecutorBackend: Got assigned task 39651
17/06/20 13:25:07 INFO Executor: Running task 55.3 in stage 5.0 (TID 39651)

3. The retry task of course fails since the executor is also shutting down due 
to the original task 55.2 OOM failure.

See the spark_driver.log.anon:

The driver processes the lost task 55.2:

17/06/20 13:25:07 WARN TaskSetManager: Lost task 55.2 in stage 5.0 (TID 39575, 
foobar.masked-server.com, executor 
attempt_foobar.masked-server.com-___.masked-server.com-____0):
 java.lang.OutOfMemoryError: Java heap space

The driver then receives the ExecutorLostFailure for the retry task 55.3 
(although it's obfuscated in these logs, the server info is same...)

17/06/20 13:25:10 WARN TaskSetManager: Lost task 55.3 in stage 5.0 (TID 39651, 
foobar.masked-server.com, executor 
attempt_foobar.masked-server.com-___.masked-server.com-____0):
 ExecutorLostFailure (executor 
attempt_foobar.masked-server.com-___.masked-server.com-____0
 exited caused by one of the running tasks) Reason: Remote RPC client 
disassociated. Likely due to containers exceeding thresholds, or network 
issues. Check driver logs for WARN messages.


  was:
When a task fails it is added into the pending task list and corresponding 
black list policy is enforced (ie, specifying if it can/can't run on a 
particular node/executor/etc.)  Unfortunately the ordering is such that 
retrying the task could assign the task to the same executor, which, 
incidentally could be shutting down and immediately fail the retry.   Instead 
the black list state should be updated and then the task assigned, ensuring 
that the black list policy is properly enforced.



> Task retry occurs on same executor due to race condition with blacklisting
> --
>
> Key: SPARK-21219
> URL: https://issues.apache.org/jira/browse/SPARK-21219
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.1.1
>Reporter: Eric Vandenberg
>Priority: Minor
> Attachments: spark_driver.log.anon, spark_executor.log.anon
>
>
> When a task fails it is added into the pending task list and corresponding 
> black list policy is enforced (ie, specifying if it can/can't run on a 
> particular node/executor/etc.)  Unfortunately the ordering is such that 
> retrying the task could assign the task to the same executor, which, 
> incidentally could be shutting down and immediately fail the retry.   Instead 
> the black list state should be updated and then the task assigned, ensuring 
> that the black list policy is properly enforced.
> The attached logs demonstrate the race condition.
> See spark_executor.log.anon:
> 1. Task 55.2 fails on the executor
> 17/06/20 13:25:07 ERROR Executor: Exception in task 55.2 in stage 5.0 (TID 
> 39575)
> java.lang.OutOfMemoryError: Java heap space
> 2. Immediately the same executor is assigned the retry task:
> 17/06/20 13:25:07 INFO CoarseGrainedExecutorBackend: Got assigned task 39651
> 17/06/20 13:25:07 INFO Executor: Running task 55.3 in stage 5.0 (TID 39651)
> 3. The retry task of course fails since the executor is also shutting down 
> due to the original task 55.2 OOM failure.
> See the spark_driver.log.anon:
> The driver processes the lost task 55.2:
> 17/06/20 13:25:07 WARN TaskSetManager: Lost task 55.2 in stage 5.0 (TID 
> 39575, foobar.masked-server.com, executor 
> attempt_foobar.masked-server.com-___.masked-server.com-____0):
>  java.lang.OutOfMemoryError: Java heap space
> The driver then receives the ExecutorLostFailure for the retry task 55.3 
> (although it's obfuscated in these logs, the server info is same...)
> 17/06/20 13:25:10 WARN TaskSetManager: Lost 

[jira] [Updated] (SPARK-21219) Task retry occurs on same executor due to race condition with blacklisting

2017-06-26 Thread Eric Vandenberg (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Vandenberg updated SPARK-21219:

Attachment: spark_executor.log.anon
spark_driver.log.anon

> Task retry occurs on same executor due to race condition with blacklisting
> --
>
> Key: SPARK-21219
> URL: https://issues.apache.org/jira/browse/SPARK-21219
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.1.1
>Reporter: Eric Vandenberg
>Priority: Minor
> Attachments: spark_driver.log.anon, spark_executor.log.anon
>
>
> When a task fails it is added into the pending task list and corresponding 
> black list policy is enforced (ie, specifying if it can/can't run on a 
> particular node/executor/etc.)  Unfortunately the ordering is such that 
> retrying the task could assign the task to the same executor, which, 
> incidentally could be shutting down and immediately fail the retry.   Instead 
> the black list state should be updated and then the task assigned, ensuring 
> that the black list policy is properly enforced.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21219) Task retry occurs on same executor due to race condition with blacklisting

2017-06-26 Thread Eric Vandenberg (JIRA)
Eric Vandenberg created SPARK-21219:
---

 Summary: Task retry occurs on same executor due to race condition 
with blacklisting
 Key: SPARK-21219
 URL: https://issues.apache.org/jira/browse/SPARK-21219
 Project: Spark
  Issue Type: Bug
  Components: Scheduler
Affects Versions: 2.1.1
Reporter: Eric Vandenberg
Priority: Minor


When a task fails it is added into the pending task list and corresponding 
black list policy is enforced (ie, specifying if it can/can't run on a 
particular node/executor/etc.)  Unfortunately the ordering is such that 
retrying the task could assign the task to the same executor, which, 
incidentally could be shutting down and immediately fail the retry.   Instead 
the black list state should be updated and then the task assigned, ensuring 
that the black list policy is properly enforced.
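
A minimal sketch of the ordering change described above, with entirely 
illustrative names (the real logic lives in Spark's scheduler code, not in these 
helpers): record the failure in the blacklist first, and only then make the 
failed task schedulable again.

{code}
// Illustrative pseudo-Scala only; these names are hypothetical, not Spark's actual API.
class RetryScheduler(blacklist: scala.collection.mutable.Set[String],
                     pending: scala.collection.mutable.Queue[Long]) {

  // Racy ordering: the task is re-queued before the failure is recorded,
  // so it can be handed straight back to the failing executor.
  def handleFailureRacy(taskId: Long, executorId: String): Unit = {
    pending.enqueue(taskId)          // (1) task becomes schedulable
    blacklist += executorId          // (2) blacklist updated too late
  }

  // Fixed ordering: record the failure first, then re-queue the task,
  // so the next assignment already sees the blacklisted executor.
  def handleFailureFixed(taskId: Long, executorId: String): Unit = {
    blacklist += executorId          // (1) blacklist state updated
    pending.enqueue(taskId)          // (2) task assigned afterwards
  }
}
{code}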




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21216) Streaming DataFrames fail to join with Hive tables

2017-06-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21216:


Assignee: Apache Spark  (was: Burak Yavuz)

> Streaming DataFrames fail to join with Hive tables
> --
>
> Key: SPARK-21216
> URL: https://issues.apache.org/jira/browse/SPARK-21216
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.1
>Reporter: Burak Yavuz
>Assignee: Apache Spark
>
> The following code will throw a cryptic exception:
> {code}
> import org.apache.spark.sql.execution.streaming.MemoryStream
> import testImplicits._
> implicit val _sqlContext = spark.sqlContext
> Seq((1, "one"), (2, "two"), (4, "four")).toDF("number", 
> "word").createOrReplaceTempView("t1")
> // Make a table and ensure it will be broadcast.
> sql("""CREATE TABLE smallTable(word string, number int)
>   |ROW FORMAT SERDE 
> 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
>   |STORED AS TEXTFILE
> """.stripMargin)
> sql(
>   """INSERT INTO smallTable
> |SELECT word, number from t1
>   """.stripMargin)
> val inputData = MemoryStream[Int]
> val joined = inputData.toDS().toDF()
>   .join(spark.table("smallTable"), $"value" === $"number")
> val sq = joined.writeStream
>   .format("memory")
>   .queryName("t2")
>   .start()
> try {
>   inputData.addData(1, 2)
>   sq.processAllAvailable()
> } finally {
>   sq.stop()
> }
> {code}
> If someone creates a HiveSession, the planner in `IncrementalExecution` 
> doesn't take into account the Hive scan strategies



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21216) Streaming DataFrames fail to join with Hive tables

2017-06-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16063662#comment-16063662
 ] 

Apache Spark commented on SPARK-21216:
--

User 'brkyvz' has created a pull request for this issue:
https://github.com/apache/spark/pull/18426

> Streaming DataFrames fail to join with Hive tables
> --
>
> Key: SPARK-21216
> URL: https://issues.apache.org/jira/browse/SPARK-21216
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.1
>Reporter: Burak Yavuz
>Assignee: Burak Yavuz
>
> The following code will throw a cryptic exception:
> {code}
> import org.apache.spark.sql.execution.streaming.MemoryStream
> import testImplicits._
> implicit val _sqlContext = spark.sqlContext
> Seq((1, "one"), (2, "two"), (4, "four")).toDF("number", 
> "word").createOrReplaceTempView("t1")
> // Make a table and ensure it will be broadcast.
> sql("""CREATE TABLE smallTable(word string, number int)
>   |ROW FORMAT SERDE 
> 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
>   |STORED AS TEXTFILE
> """.stripMargin)
> sql(
>   """INSERT INTO smallTable
> |SELECT word, number from t1
>   """.stripMargin)
> val inputData = MemoryStream[Int]
> val joined = inputData.toDS().toDF()
>   .join(spark.table("smallTable"), $"value" === $"number")
> val sq = joined.writeStream
>   .format("memory")
>   .queryName("t2")
>   .start()
> try {
>   inputData.addData(1, 2)
>   sq.processAllAvailable()
> } finally {
>   sq.stop()
> }
> {code}
> If someone creates a HiveSession, the planner in `IncrementalExecution` 
> doesn't take into account the Hive scan strategies



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21216) Streaming DataFrames fail to join with Hive tables

2017-06-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21216:


Assignee: Burak Yavuz  (was: Apache Spark)

> Streaming DataFrames fail to join with Hive tables
> --
>
> Key: SPARK-21216
> URL: https://issues.apache.org/jira/browse/SPARK-21216
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.1
>Reporter: Burak Yavuz
>Assignee: Burak Yavuz
>
> The following code will throw a cryptic exception:
> {code}
> import org.apache.spark.sql.execution.streaming.MemoryStream
> import testImplicits._
> implicit val _sqlContext = spark.sqlContext
> Seq((1, "one"), (2, "two"), (4, "four")).toDF("number", 
> "word").createOrReplaceTempView("t1")
> // Make a table and ensure it will be broadcast.
> sql("""CREATE TABLE smallTable(word string, number int)
>   |ROW FORMAT SERDE 
> 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
>   |STORED AS TEXTFILE
> """.stripMargin)
> sql(
>   """INSERT INTO smallTable
> |SELECT word, number from t1
>   """.stripMargin)
> val inputData = MemoryStream[Int]
> val joined = inputData.toDS().toDF()
>   .join(spark.table("smallTable"), $"value" === $"number")
> val sq = joined.writeStream
>   .format("memory")
>   .queryName("t2")
>   .start()
> try {
>   inputData.addData(1, 2)
>   sq.processAllAvailable()
> } finally {
>   sq.stop()
> }
> {code}
> If someone creates a HiveSession, the planner in `IncrementalExecution` 
> doesn't take into account the Hive scan strategies



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21217) Support ColumnVector.Array.toArray()

2017-06-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16063631#comment-16063631
 ] 

Apache Spark commented on SPARK-21217:
--

User 'kiszk' has created a pull request for this issue:
https://github.com/apache/spark/pull/18425

> Support ColumnVector.Array.toArray()
> --
>
> Key: SPARK-21217
> URL: https://issues.apache.org/jira/browse/SPARK-21217
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Kazuaki Ishizaki
>
> When {{toArray()}} in {{ColumnVector.Array}} (e.g. {{toIntArray()}}) is 
> called, the generic method in {{ArrayData}} is called. It is not fast since 
> element-wise copy is performed.
> This JIRA entry implements bulk-copy for these methods in 
> {{ColumnVector.Array}} by using {{System.arrayCopy()}} or 
> {{Platform.copyMemory()}}.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21217) Support ColumnVector.Array.toArray()

2017-06-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21217:


Assignee: (was: Apache Spark)

> Support ColumnVector.Array.toArray()
> --
>
> Key: SPARK-21217
> URL: https://issues.apache.org/jira/browse/SPARK-21217
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Kazuaki Ishizaki
>
> When {{toArray()}} in {{ColumnVector.Array}} (e.g. {{toIntArray()}}) is 
> called, the generic method in {{ArrayData}} is called. It is not fast since 
> element-wise copy is performed.
> This JIRA entry implements bulk-copy for these methods in 
> {{ColumnVector.Array}} by using {{System.arrayCopy()}} or 
> {{Platform.copyMemory()}}.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21217) Support ColumnVector.Array.toArray()

2017-06-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21217:


Assignee: Apache Spark

> Support ColumnVector.Array.toArray()
> --
>
> Key: SPARK-21217
> URL: https://issues.apache.org/jira/browse/SPARK-21217
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Kazuaki Ishizaki
>Assignee: Apache Spark
>
> When {{toArray()}} in {{ColumnVector.Array}} (e.g. {{toIntArray()}}) is 
> called, the generic method in {{ArrayData}} is called. It is not fast since 
> element-wise copy is performed.
> This JIRA entry implements bulk-copy for these methods in 
> {{ColumnVector.Array}} by using {{System.arrayCopy()}} or 
> {{Platform.copyMemory()}}.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21218) Convert IN predicate to equivalent Parquet filter

2017-06-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21218:


Assignee: (was: Apache Spark)

> Convert IN predicate to equivalent Parquet filter
> -
>
> Key: SPARK-21218
> URL: https://issues.apache.org/jira/browse/SPARK-21218
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 2.1.1
>Reporter: Michael Styles
>
> Convert IN predicate to equivalent expression involving equality conditions 
> to allow the filter to be pushed down to Parquet.
> For instance,
> C1 IN (10, 20) is rewritten as (C1 = 10) OR (C1 = 20)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21218) Convert IN predicate to equivalent Parquet filter

2017-06-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21218:


Assignee: Apache Spark

> Convert IN predicate to equivalent Parquet filter
> -
>
> Key: SPARK-21218
> URL: https://issues.apache.org/jira/browse/SPARK-21218
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 2.1.1
>Reporter: Michael Styles
>Assignee: Apache Spark
>
> Convert IN predicate to equivalent expression involving equality conditions 
> to allow the filter to be pushed down to Parquet.
> For instance,
> C1 IN (10, 20) is rewritten as (C1 = 10) OR (C1 = 20)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21218) Convert IN predicate to equivalent Parquet filter

2017-06-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16063620#comment-16063620
 ] 

Apache Spark commented on SPARK-21218:
--

User 'ptkool' has created a pull request for this issue:
https://github.com/apache/spark/pull/18424

> Convert IN predicate to equivalent Parquet filter
> -
>
> Key: SPARK-21218
> URL: https://issues.apache.org/jira/browse/SPARK-21218
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 2.1.1
>Reporter: Michael Styles
>
> Convert IN predicate to equivalent expression involving equality conditions 
> to allow the filter to be pushed down to Parquet.
> For instance,
> C1 IN (10, 20) is rewritten as (C1 = 10) OR (C1 = 20)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21218) Convert IN predicate to equivalent Parquet filter

2017-06-26 Thread Michael Styles (JIRA)
Michael Styles created SPARK-21218:
--

 Summary: Convert IN predicate to equivalent Parquet filter
 Key: SPARK-21218
 URL: https://issues.apache.org/jira/browse/SPARK-21218
 Project: Spark
  Issue Type: Improvement
  Components: Optimizer
Affects Versions: 2.1.1
Reporter: Michael Styles


Convert IN predicate to equivalent expression involving equality conditions to 
allow the filter to be pushed down to Parquet.

For instance,

C1 IN (10, 20) is rewritten as (C1 = 10) OR (C1 = 20)
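
A minimal sketch of the idea at the data source Filter level, using Spark's 
public org.apache.spark.sql.sources classes; this is only an illustration of the 
rewrite, not the actual change in the pull request.

{code}
import org.apache.spark.sql.sources.{EqualTo, Filter, In, Or}

// Hedged sketch: expand an IN filter into OR'ed equalities that the existing
// Parquet pushdown already understands; leave every other filter untouched.
def expandIn(filter: Filter): Filter = filter match {
  case In(attribute, values) if values.nonEmpty =>
    values.map(v => EqualTo(attribute, v): Filter).reduceLeft(Or(_, _))
  case other => other
}

// expandIn(In("C1", Array[Any](10, 20))) == Or(EqualTo("C1", 10), EqualTo("C1", 20))
{code}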



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21217) Support ColumnVector.Array.toArray()

2017-06-26 Thread Kazuaki Ishizaki (JIRA)
Kazuaki Ishizaki created SPARK-21217:


 Summary: Support ColumnVector.Array.toArray()
 Key: SPARK-21217
 URL: https://issues.apache.org/jira/browse/SPARK-21217
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.0
Reporter: Kazuaki Ishizaki


When {{toArray()}} in {{ColumnVector.Array}} (e.g. {{toIntArray()}}) is 
called, the generic method in {{ArrayData}} is called. It is not fast since 
element-wise copy is performed.
This JIRA entry implements bulk-copy for these methods in 
{{ColumnVector.Array}} by using {{System.arrayCopy()}} or 
{{Platform.copyMemory()}}.
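
A hedged sketch of the difference, using standalone helpers rather than the 
actual ColumnVector.Array code: the generic path copies one element per 
iteration, while the bulk path hands the whole range to System.arraycopy in a 
single call.

{code}
// Illustrative helpers only; the real change lives in ColumnVector.Array.
// Element-wise copy, conceptually what the generic ArrayData path does:
def toIntArraySlow(getInt: Int => Int, length: Int): Array[Int] = {
  val out = new Array[Int](length)
  var i = 0
  while (i < length) { out(i) = getInt(i); i += 1 }
  out
}

// Bulk copy from a backing primitive array in one call:
def toIntArrayFast(backing: Array[Int], offset: Int, length: Int): Array[Int] = {
  val out = new Array[Int](length)
  System.arraycopy(backing, offset, out, 0, length)
  out
}
{code}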



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21216) Streaming DataFrames fail to join with Hive tables

2017-06-26 Thread Burak Yavuz (JIRA)
Burak Yavuz created SPARK-21216:
---

 Summary: Streaming DataFrames fail to join with Hive tables
 Key: SPARK-21216
 URL: https://issues.apache.org/jira/browse/SPARK-21216
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.1.1
Reporter: Burak Yavuz
Assignee: Burak Yavuz


The following code will throw a cryptic exception:

{code}
import org.apache.spark.sql.execution.streaming.MemoryStream
import testImplicits._

implicit val _sqlContext = spark.sqlContext

Seq((1, "one"), (2, "two"), (4, "four")).toDF("number", 
"word").createOrReplaceTempView("t1")
// Make a table and ensure it will be broadcast.
sql("""CREATE TABLE smallTable(word string, number int)
  |ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
  |STORED AS TEXTFILE
""".stripMargin)

sql(
  """INSERT INTO smallTable
|SELECT word, number from t1
  """.stripMargin)

val inputData = MemoryStream[Int]
val joined = inputData.toDS().toDF()
  .join(spark.table("smallTable"), $"value" === $"number")

val sq = joined.writeStream
  .format("memory")
  .queryName("t2")
  .start()
try {
  inputData.addData(1, 2)

  sq.processAllAvailable()
} finally {
  sq.stop()
}
{code}

If someone creates a HiveSession, the planner in `IncrementalExecution` doesn't 
take into account the Hive scan strategies
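
For context, the failure above is tied to sessions created with Hive support; a 
minimal way to obtain such a session (standard SparkSession API, shown only to 
make that precondition concrete) is:

{code}
import org.apache.spark.sql.SparkSession

// A Hive-enabled session; the streaming join above is reported to fail under this setup.
val spark = SparkSession.builder()
  .appName("streaming-hive-join")
  .enableHiveSupport()
  .getOrCreate()
{code}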



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21158) SparkSQL function SparkSession.Catalog.ListTables() does not handle spark setting for case-sensitivity

2017-06-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16063561#comment-16063561
 ] 

Apache Spark commented on SPARK-21158:
--

User 'cammachusa' has created a pull request for this issue:
https://github.com/apache/spark/pull/18423

> SparkSQL function SparkSession.Catalog.ListTables() does not handle spark 
> setting for case-sensitivity
> --
>
> Key: SPARK-21158
> URL: https://issues.apache.org/jira/browse/SPARK-21158
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: Windows 10
> IntelliJ 
> Scala
>Reporter: Kathryn McClintic
>Priority: Minor
>  Labels: easyfix, features, sparksql, windows
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> When working with SQL table names in Spark SQL we have noticed some issues 
> with case-sensitivity.
> If you set spark.sql.caseSensitive setting to be true, SparkSQL stores the 
> table names in the way it was provided. This is correct.
> If you set  spark.sql.caseSensitive setting to be false, SparkSQL stores the 
> table names in lower case.
> Then, we use the function sqlContext.tableNames() to get all the tables in 
> our DB. We check if this list contains(<"string of table name">) to determine 
> if we have already created a table. If case-sensitivity is turned off 
> (false), this function should look if the table name is contained in the 
> table list regardless of case.
> However, it tries to look for only ones that match the lower case version of 
> the stored table. Therefore, if you pass in a camel or upper case table name, 
> this function would return false when in fact the table does exist.
> The root cause of this issue is in the function 
> SparkSession.Catalog.ListTables()
> For example:
> In your SQL context - you have  four tables and you have chosen to have 
> spark.sql.case-Sensitive=false so it stores your tables in lowercase: 
> carnames
> carmodels
> carnamesandmodels
> users
> dealerlocations
> When running your pipeline, you want to see if you have already created the 
> temp join table of 'carnamesandmodels'. However, you have stored it as a 
> constant which reads: CarNamesAndModels for readability.
> So you can use the function
> sqlContext.tableNames().contains("CarNamesAndModels").
> This should return true - because we know its already created, but it will 
> currently return false since CarNamesAndModels is not in lowercase.
> The responsibility to change the name passed into the .contains method to be 
> lowercase should not be put on the spark user. This should be done by spark 
> sql if case-sensitivity is turned to false.
> Proposed solutions:
> - Setting case sensitive in the sql context should make the sql context 
> be agnostic to case but not change the storage of the table
> - There should be a custom contains method for ListTables() which converts 
> the tablename to be lowercase before checking
> - SparkSession.Catalog.ListTables() should return the list of tables in the 
> input format instead of in all lowercase.
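
As a user-side workaround for the behaviour described above (a hedged sketch 
only, assuming a `sqlContext` in scope as in the report; the proper fix belongs 
in the catalog itself), the existence check can be made case-insensitive 
explicitly:

{code}
// Compare table names case-insensitively instead of relying on the stored casing.
val tableExists = sqlContext.tableNames().exists(_.equalsIgnoreCase("CarNamesAndModels"))
{code}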



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21158) SparkSQL function SparkSession.Catalog.ListTables() does not handle spark setting for case-sensitivity

2017-06-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21158:


Assignee: (was: Apache Spark)

> SparkSQL function SparkSession.Catalog.ListTables() does not handle spark 
> setting for case-sensitivity
> --
>
> Key: SPARK-21158
> URL: https://issues.apache.org/jira/browse/SPARK-21158
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: Windows 10
> IntelliJ 
> Scala
>Reporter: Kathryn McClintic
>Priority: Minor
>  Labels: easyfix, features, sparksql, windows
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> When working with SQL table names in Spark SQL we have noticed some issues 
> with case-sensitivity.
> If you set spark.sql.caseSensitive setting to be true, SparkSQL stores the 
> table names in the way it was provided. This is correct.
> If you set  spark.sql.caseSensitive setting to be false, SparkSQL stores the 
> table names in lower case.
> Then, we use the function sqlContext.tableNames() to get all the tables in 
> our DB. We check if this list contains(<"string of table name">) to determine 
> if we have already created a table. If case-sensitivity is turned off 
> (false), this function should look if the table name is contained in the 
> table list regardless of case.
> However, it tries to look for only ones that match the lower case version of 
> the stored table. Therefore, if you pass in a camel or upper case table name, 
> this function would return false when in fact the table does exist.
> The root cause of this issue is in the function 
> SparkSession.Catalog.ListTables()
> For example:
> In your SQL context - you have  four tables and you have chosen to have 
> spark.sql.case-Sensitive=false so it stores your tables in lowercase: 
> carnames
> carmodels
> carnamesandmodels
> users
> dealerlocations
> When running your pipeline, you want to see if you have already created the 
> temp join table of 'carnamesandmodels'. However, you have stored it as a 
> constant which reads: CarNamesAndModels for readability.
> So you can use the function
> sqlContext.tableNames().contains("CarNamesAndModels").
> This should return true because we know it's already created, but it will 
> currently return false since CarNamesAndModels is not in lowercase.
> The responsibility to change the name passed into the .contains method to be 
> lowercase should not be put on the spark user. This should be done by spark 
> sql if case-sensitivity is turned to false.
> Proposed solutions:
> - Setting case sensitive in the sql context should make the sql context 
> be agnostic to case but not change the storage of the table
> - There should be a custom contains method for ListTables() which converts 
> the tablename to be lowercase before checking
> - SparkSession.Catalog.ListTables() should return the list of tables in the 
> input format instead of in all lowercase.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21158) SparkSQL function SparkSession.Catalog.ListTables() does not handle spark setting for case-sensitivity

2017-06-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21158:


Assignee: Apache Spark

> SparkSQL function SparkSession.Catalog.ListTables() does not handle spark 
> setting for case-sensitivity
> --
>
> Key: SPARK-21158
> URL: https://issues.apache.org/jira/browse/SPARK-21158
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: Windows 10
> IntelliJ 
> Scala
>Reporter: Kathryn McClintic
>Assignee: Apache Spark
>Priority: Minor
>  Labels: easyfix, features, sparksql, windows
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> When working with SQL table names in Spark SQL we have noticed some issues 
> with case-sensitivity.
> If you set spark.sql.caseSensitive setting to be true, SparkSQL stores the 
> table names in the way it was provided. This is correct.
> If you set  spark.sql.caseSensitive setting to be false, SparkSQL stores the 
> table names in lower case.
> Then, we use the function sqlContext.tableNames() to get all the tables in 
> our DB. We check if this list contains(<"string of table name">) to determine 
> if we have already created a table. If case-sensitivity is turned off 
> (false), this function should look if the table name is contained in the 
> table list regardless of case.
> However, it tries to look for only ones that match the lower case version of 
> the stored table. Therefore, if you pass in a camel or upper case table name, 
> this function would return false when in fact the table does exist.
> The root cause of this issue is in the function 
> SparkSession.Catalog.ListTables()
> For example:
> In your SQL context you have five tables, and you have chosen to set 
> spark.sql.caseSensitive=false, so it stores your tables in lowercase: 
> carnames
> carmodels
> carnamesandmodels
> users
> dealerlocations
> When running your pipeline, you want to see if you have already created the 
> temp join table of 'carnamesandmodels'. However, you have stored it as a 
> constant which reads: CarNamesAndModels for readability.
> So you can use the function
> sqlContext.tableNames().contains("CarNamesAndModels").
> This should return true because we know it's already created, but it will 
> currently return false since CarNamesAndModels is not in lowercase.
> The responsibility to change the name passed into the .contains method to be 
> lowercase should not be put on the spark user. This should be done by spark 
> sql if case-sensitivity is turned to false.
> Proposed solutions:
> - Setting case sensitive in the sql context should make the sql context 
> be agnostic to case but not change the storage of the table
> - There should be a custom contains method for ListTables() which converts 
> the tablename to be lowercase before checking
> - SparkSession.Catalog.ListTables() should return the list of tables in the 
> input format instead of in all lowercase.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20889) SparkR grouped documentation for Column methods

2017-06-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16063552#comment-16063552
 ] 

Apache Spark commented on SPARK-20889:
--

User 'actuaryzhang' has created a pull request for this issue:
https://github.com/apache/spark/pull/18422

> SparkR grouped documentation for Column methods
> ---
>
> Key: SPARK-20889
> URL: https://issues.apache.org/jira/browse/SPARK-20889
> Project: Spark
>  Issue Type: Documentation
>  Components: SparkR
>Affects Versions: 2.1.1
>Reporter: Wayne Zhang
>Assignee: Wayne Zhang
>  Labels: documentation
> Fix For: 2.3.0
>
>
> Group the documentation of individual methods defined for the Column class. 
> This aims to create the following improvements:
> - Centralized documentation for easy navigation (user can view multiple 
> related methods on one single page).
> - Reduced number of items in Seealso.
> - Better examples using shared data. This avoids creating a data frame for 
> each function if they are documented separately. And more importantly, user 
> can copy and paste to run them directly!
> - Cleaner structure and much fewer Rd files (remove a large number of Rd 
> files).
> - Remove duplicated definition of param (since they share exactly the same 
> argument).
> - No need to write meaningless examples for trivial functions (because of 
> grouping).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21190) SPIP: Vectorized UDFs in Python

2017-06-26 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16063515#comment-16063515
 ] 

Reynold Xin commented on SPARK-21190:
-

[~icexelloss] Thanks. Your proposal brings up a good point, which is whether we 
want to allow users to see the entire partition at a time. As demonstrated by your 
example, this can be a very powerful tool (especially for leveraging the time 
series analysis features available in Pandas). The primary challenge there is 
that a partition can be very large, and it is possible to run out of memory 
when using Pandas. Any thoughts on how we should handle those cases?




> SPIP: Vectorized UDFs in Python
> ---
>
> Key: SPARK-21190
> URL: https://issues.apache.org/jira/browse/SPARK-21190
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>  Labels: SPIP
> Attachments: SPIPVectorizedUDFsforPython (1).pdf
>
>
> *Background and Motivation*
> Python is one of the most popular programming languages among Spark users. 
> Spark currently exposes a row-at-a-time interface for defining and executing 
> user-defined functions (UDFs). This introduces high overhead in serialization 
> and deserialization, and also makes it difficult to leverage Python libraries 
> (e.g. numpy, Pandas) that are written in native code.
>  
> This proposal advocates introducing new APIs to support vectorized UDFs in 
> Python, in which a block of data is transferred over to Python in some 
> columnar format for execution.
>  
>  
> *Target Personas*
> Data scientists, data engineers, library developers.
>  
> *Goals*
> - Support vectorized UDFs that apply on chunks of the data frame
> - Low system overhead: Substantially reduce serialization and deserialization 
> overhead when compared with row-at-a-time interface
> - UDF performance: Enable users to leverage native libraries in Python (e.g. 
> numpy, Pandas) for data manipulation in these UDFs
>  
> *Non-Goals*
> The following are explicitly out of scope for the current SPIP, and should be 
> done in future SPIPs. Nonetheless, it would be good to consider these future 
> use cases during API design, so we can achieve some consistency when rolling 
> out new APIs.
>  
> - Define block oriented UDFs in other languages (that are not Python).
> - Define aggregate UDFs
> - Tight integration with machine learning frameworks
>  
> *Proposed API Changes*
> The following sketches some possibilities. I haven’t spent a lot of time 
> thinking about the API (wrote it down in 5 mins) and I am not attached to 
> this design at all. The main purpose of the SPIP is to get feedback on use 
> cases and see how they can impact API design.
>  
> Two things to consider are:
>  
> 1. Python is dynamically typed, whereas DataFrames/SQL requires static, 
> analysis time typing. This means users would need to specify the return type 
> of their UDFs.
>  
> 2. Ratio of input rows to output rows. We propose initially we require number 
> of output rows to be the same as the number of input rows. In the future, we 
> can consider relaxing this constraint with support for vectorized aggregate 
> UDFs.
>  
> Proposed API sketch (using examples):
>  
> Use case 1. A function that defines all the columns of a DataFrame (similar 
> to a “map” function):
>  
> {code}
> @spark_udf(some way to describe the return schema)
> def my_func_on_entire_df(input):
>   """ Some user-defined function.
>  
>   :param input: A Pandas DataFrame with two columns, a and b.
>   :return: :class: A Pandas data frame.
>   """
>   input['c'] = input['a'] + input['b']
>   input['d'] = input['a'] - input['b']
>   return input
>  
> spark.range(1000).selectExpr("id a", "id / 2 b")
>   .mapBatches(my_func_on_entire_df)
> {code}
>  
> Use case 2. A function that defines only one column (similar to existing 
> UDFs):
>  
> {code}
> @spark_udf(some way to describe the return schema)
> def my_func_that_returns_one_column(input):
>   """ Some user-defined function.
>  
>   :param input: A Pandas DataFrame with two columns, a and b.
>   :return: :class: A numpy array
>   """
>   return input['a'] + input['b']
>  
> my_func = udf(my_func_that_returns_one_column)
>  
> df = spark.range(1000).selectExpr("id a", "id / 2 b")
> df.withColumn("c", my_func(df.a, df.b))
> {code}
>  
>  
>  
> *Optional Design Sketch*
> I’m more concerned about getting proper feedback for API design. The 
> implementation should be pretty straightforward and is not a huge concern at 
> this point. We can leverage the same implementation for faster toPandas 
> (using Arrow).
>  
>  
> *Optional Rejected Designs*
> See above.
>  
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-

[jira] [Comment Edited] (SPARK-21198) SparkSession catalog is terribly slow

2017-06-26 Thread Saif Addin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16063406#comment-16063406
 ] 

Saif Addin edited comment on SPARK-21198 at 6/26/17 5:56 PM:
-

Okay, I think there is something odd somewhere in between. It may be hard to 
tackle the issue, but I'll go through it slowly, line by line.

(*spark-submit*, local[8])
1. This line gets stuck forever and the program doesn't continue after waiting 
2 minutes (the Spark task is stuck in collect()):
{code:java}
import spark.implicits._
private val databases: Array[String] =  REFERENCEDB +: 
(spark.catalog.listDatabases.map(_.name).filter(_ != blacklistdb).collect)
  println((new Date().getTime - d2)/1000.0)
{code}
2. Changing such line to (notice filter after collect) takes 8ms

{code:java}
import spark.implicits._
private val databases: Array[String] = REFERENCEDB +: 
(spark.catalog.listDatabases.map(_.name).collect.filterNot(_ == blacklistdb))
{code}

3. The following line, takes instead 2ms

{code:java}
private val databases: Array[String] = (REFERENCEDB +: spark.sql("show 
databases").collect.map(_.getString(0))).filterNot(_ == blacklistdb)
{code}


Here's the weirdest part of all: if I instead start *spark-shell* and run item number 
1, it works (it takes 1 ms, which is even faster than the other lines in my 
*spark-submit* run; the other lines also get faster, down to 1 ms as well).


was (Author: revolucion09):
Okay, I think there is something odd somewhere in between. It may be hard to 
tackle the issue but, i'll start slowly line to line.

(*spark-submit*, local[8])
1. This line, gets stuck forever and the program doesnt continue after waiting 
2 minutes (spark-task is stuck in collect())
{code:java}
import spark.implicits._
private val databases: Array[String] = RREFERENCEDB +: 
(spark.catalog.listDatabases.map(_.name).collect.filter(_ != blacklistdb))
{code}
2. Changing such line to (notice filter after collect) takes 8ms

{code:java}
import spark.implicits._
private val databases: Array[String] = REFERENCEDB +: 
(spark.catalog.listDatabases.map(_.name).collect.filterNot(_ == blacklistdb))
{code}

3. The following line, takes instead 2ms

{code:java}
private val databases: Array[String] = (REFERENCEDB +: spark.sql("show 
databases").collect.map(_.getString(0))).filterNot(_ == blacklistdb)
{code}


Here's the weirdest of all, if I instead start *spark-shell* and do item number 
1, it works (takes 1ms which is even faster than other lines in my 
*spark-submit.* the other lines are also faster down to 1ms as well)

> SparkSession catalog is terribly slow
> -
>
> Key: SPARK-21198
> URL: https://issues.apache.org/jira/browse/SPARK-21198
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Saif Addin
>
> We have a considerably large Hive metastore and a Spark program that goes 
> through Hive data availability.
> In spark 1.x, we were using sqlContext.tableNames, sqlContext.sql() and 
> sqlContext.isCached() to go through Hive metastore information.
> Once migrated to spark 2.x we switched over to SparkSession.catalog instead, but 
> it turns out that both listDatabases() and listTables() take between 5 and 20 
> minutes, depending on the database, to return results, using operations such as 
> the following one:
> spark.catalog.listTables(db).filter(_.isTemporary).map(_.name).collect
> which made the program unbearably slow at returning a list of tables.
> I know we still have spark.sqlContext.tableNames as a workaround, but I am 
> assuming this is going to be deprecated any time soon?
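
For reference, the two code paths being compared in the comments below look roughly like this; a minimal sketch assuming a SparkSession named spark and a database name held in db:

{code:java}
// Sketch of the two routes: SQLContext.tables vs. SparkSession.catalog.listTables,
// keeping only non-temporary tables in both cases.
import spark.implicits._  // needed for the Dataset map/filter on catalog results

val db = "somedb"  // assumed database name

val viaSqlContext: Array[String] =
  spark.sqlContext.tables(db)
    .filter("isTemporary = false")
    .select("tableName")
    .collect()
    .map(_.getString(0))

val viaCatalog: Array[String] =
  spark.catalog.listTables(db)
    .filter(!_.isTemporary)
    .map(_.name)
    .collect()
{code}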



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21198) SparkSession catalog is terribly slow

2017-06-26 Thread Saif Addin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16063406#comment-16063406
 ] 

Saif Addin edited comment on SPARK-21198 at 6/26/17 5:55 PM:
-

Okay, I think there is something odd somewhere in between. It may be hard to 
tackle the issue, but I'll go through it slowly, line by line.

(*spark-submit*, local[8])
1. This line gets stuck forever and the program doesn't continue after waiting 
2 minutes (the Spark task is stuck in collect()):
{code:java}
import spark.implicits._
private val databases: Array[String] = RREFERENCEDB +: 
(spark.catalog.listDatabases.map(_.name).collect.filter(_ != blacklistdb))
{code}
2. Changing such line to (notice filter after collect) takes 8ms

{code:java}
import spark.implicits._
private val databases: Array[String] = REFERENCEDB +: 
(spark.catalog.listDatabases.map(_.name).collect.filterNot(_ == blacklistdb))
{code}

3. The following line, takes instead 2ms

{code:java}
private val databases: Array[String] = (REFERENCEDB +: spark.sql("show 
databases").collect.map(_.getString(0))).filterNot(_ == blacklistdb)
{code}


Here's the weirdest part of all: if I instead start *spark-shell* and run item number 
1, it works (it takes 1 ms, which is even faster than the other lines in my 
*spark-submit* run; the other lines also get faster, down to 1 ms as well).


was (Author: revolucion09):
Okay, I think there is something odd somewhere in between. It may be hard to 
tackle the issue but, i'll start slowly line to line.

(*spark-submit*, local[8])
1. This line, gets stuck forever and the program doesnt continue after waiting 
2 minutes (spark-task is stuck in collect())
{code:java}
import spark.implicits._
private val databases: Array[String] = REFERENCEDB +: 
(spark.catalog.listDatabases.filter(_ != blacklistdb).map(_.name).collect)
{code}
2. Changing such line to (notice filter after collect) takes 8ms

{code:java}
import spark.implicits._
private val databases: Array[String] = REFERENCEDB +: 
(spark.catalog.listDatabases.map(_.name).collect.filterNot(_ == blacklistdb))
{code}

3. The following line, takes instead 2ms

{code:java}
private val databases: Array[String] = (REFERENCEDB +: spark.sql("show 
databases").collect.map(_.getString(0))).filterNot(_ == blacklistdb)
{code}


Here's the weirdest of all, if I instead start *spark-shell* and do item number 
1, it works (takes 1ms which is even faster than other lines in my 
*spark-submit.* the other lines are also faster down to 1ms as well)

> SparkSession catalog is terribly slow
> -
>
> Key: SPARK-21198
> URL: https://issues.apache.org/jira/browse/SPARK-21198
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Saif Addin
>
> We have a considerably large Hive metastore and a Spark program that goes 
> through Hive data availability.
> In spark 1.x, we were using sqlContext.tableNames, sqlContext.sql() and 
> sqlContext.isCached() to go through Hive metastore information.
> Once migrated to spark 2.x we switched over to SparkSession.catalog instead, but 
> it turns out that both listDatabases() and listTables() take between 5 and 20 
> minutes, depending on the database, to return results, using operations such as 
> the following one:
> spark.catalog.listTables(db).filter(_.isTemporary).map(_.name).collect
> which made the program unbearably slow at returning a list of tables.
> I know we still have spark.sqlContext.tableNames as a workaround, but I am 
> assuming this is going to be deprecated any time soon?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21214) Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve '

2017-06-26 Thread Michael Kunkel (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16063462#comment-16063462
 ] 

Michael Kunkel commented on SPARK-21214:


Greetings,

Forget my last email.


BR
MK

Michael C. Kunkel, USMC, PhD
Forschungszentrum Jülich
Nuclear Physics Institute and Juelich Center for Hadron Physics
Experimental Hadron Structure (IKP-1)
www.fz-juelich.de/ikp


> Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot 
> resolve '
> ---
>
> Key: SPARK-21214
> URL: https://issues.apache.org/jira/browse/SPARK-21214
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, Spark Core, SQL
>Affects Versions: 2.1.1
> Environment: macOSX
>Reporter: Michael Kunkel
>
> First Spark project.
> I have a Java method that returns a Dataset<Row>. I want to convert this to a 
> Dataset<StatusChangeDB>, where the object is named StatusChangeDB. I have created a 
> POJO StatusChangeDB.java and coded it with all the query objects found in the 
> MySQL table.
> I then create an Encoder<StatusChangeDB> and convert the Dataset<Row> to a 
> Dataset<StatusChangeDB>. However, when I try to .show() the values of the 
> Dataset<StatusChangeDB> I receive the error
> bq. 
> bq. Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot 
> resolve '`hvpinid_quad`' given input columns: [status_change_type, 
> superLayer, loclayer, sector, locwire];
> bq.   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
> bq.   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:86)
> bq.   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:83)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:290)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:290)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:289)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:307)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:305)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4$$anonfun$apply$10.apply(TreeNode.scala:324)
> bq.   at 
> scala.collection.MapLike$MappedValues$$anonfun$iterator$3.apply(MapLike.scala:246)
> bq.   at 
> scala.collection.MapLike$MappedValues$$anonfun$iterator$3.apply(MapLike.scala:246)
> bq.   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
> bq.   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> bq.   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> bq.   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> bq.   at scala.collection.IterableLike$$anon$1.foreach(IterableLike.scala:311)
> bq.   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
> bq.   at 
> scala.collection.mutable.MapBuilder.$plus$plus$eq(MapBuilder.scala:25)
> bq.   at 
> scala.collection.TraversableViewLike$class.force(TraversableViewLike.scala:88)
> bq.   at scala.collection.IterableLike$$anon$1.force(IterableLike.scala:311)
> bq.   at 
> org.
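
The analysis error quoted above means the bean encoder derived from StatusChangeDB expects a column named hvpinid_quad that the input columns do not contain. A minimal Scala sketch of the usual remedies, assuming df is the untyped Dataset of rows; the source column name hvpinid_quad_raw and the int type are hypothetical:

{code:java}
// Sketch only: make the DataFrame's columns line up with the bean's fields before
// applying the bean encoder. Source column name and type below are assumptions.
import org.apache.spark.sql.Encoders
import org.apache.spark.sql.functions.lit

val enc = Encoders.bean(classOf[StatusChangeDB])

// Option 1: rename an existing column to the field name the encoder expects.
val typed1 = df.withColumnRenamed("hvpinid_quad_raw", "hvpinid_quad").as(enc)

// Option 2: add a placeholder column so the encoder can resolve the field.
val typed2 = df.withColumn("hvpinid_quad", lit(null).cast("int")).as(enc)

typed2.show()
{code}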

[jira] [Commented] (SPARK-21214) Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve '

2017-06-26 Thread Michael Kunkel (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16063461#comment-16063461
 ] 

Michael Kunkel commented on SPARK-21214:


Greetings,

Would you please inform me on the location of the mailing list?

BR
MK

Michael C. Kunkel, USMC, PhD
Forschungszentrum Jülich
Nuclear Physics Institute and Juelich Center for Hadron Physics
Experimental Hadron Structure (IKP-1)
www.fz-juelich.de/ikp


> Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot 
> resolve '
> ---
>
> Key: SPARK-21214
> URL: https://issues.apache.org/jira/browse/SPARK-21214
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, Spark Core, SQL
>Affects Versions: 2.1.1
> Environment: macOSX
>Reporter: Michael Kunkel
>
> First Spark project.
> I have a Java method that returns a Dataset<Row>. I want to convert this to a 
> Dataset<StatusChangeDB>, where the object is named StatusChangeDB. I have created a 
> POJO StatusChangeDB.java and coded it with all the query objects found in the 
> MySQL table.
> I then create an Encoder<StatusChangeDB> and convert the Dataset<Row> to a 
> Dataset<StatusChangeDB>. However, when I try to .show() the values of the 
> Dataset<StatusChangeDB> I receive the error
> bq. 
> bq. Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot 
> resolve '`hvpinid_quad`' given input columns: [status_change_type, 
> superLayer, loclayer, sector, locwire];
> bq.   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
> bq.   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:86)
> bq.   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:83)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:290)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:290)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:289)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:307)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:305)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4$$anonfun$apply$10.apply(TreeNode.scala:324)
> bq.   at 
> scala.collection.MapLike$MappedValues$$anonfun$iterator$3.apply(MapLike.scala:246)
> bq.   at 
> scala.collection.MapLike$MappedValues$$anonfun$iterator$3.apply(MapLike.scala:246)
> bq.   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
> bq.   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> bq.   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> bq.   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> bq.   at scala.collection.IterableLike$$anon$1.foreach(IterableLike.scala:311)
> bq.   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
> bq.   at 
> scala.collection.mutable.MapBuilder.$plus$plus$eq(MapBuilder.scala:25)
> bq.   at 
> scala.collection.TraversableViewLike$class.force(TraversableViewLike.scala:88)
> bq.   at scala.collection.IterableLike$$anon$1.force(I

[jira] [Comment Edited] (SPARK-21198) SparkSession catalog is terribly slow

2017-06-26 Thread Saif Addin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16063431#comment-16063431
 ] 

Saif Addin edited comment on SPARK-21198 at 6/26/17 5:24 PM:
-

Regarding listtables, here is the  code used inside the program:

{code:java}

 println(s"processing all tables for every db. db length is: 
${databases.tail.length}")
  for (d <- databases.tail) {
val d1 = new Date().getTime
val dbs = spark.sqlContext.tables(d).filter("isTemporary = 
false").select("tableName").collect.map(_.getString(0))
println("Processed tables in DB using sqlContext. Time: " + ((new 
Date().getTime - d1) / 1000.0) + s" seconds. ${dbs.length} tables")
val d2 = new Date().getTime
val dbs2 = 
spark.catalog.listTables(d).filter(!_.isTemporary).map(_.name).collect
println("Processed tables in DB using catalog. Time: " + ((new 
Date().getTime - d2) / 1000.0) + s" seconds. ${dbs.length} tables")
other stuff
{code}

and timings are as follows 

{code:java}

processing all tables for every db. db length is: 30
Processed tables in DB using sqlContext. Time: 0.863 seconds. 19 tables
Processed tables in DB using catalog. Time: 6.978 seconds. 19 tables
Processed tables in DB using sqlContext. Time: 0.276 seconds. 607 tables
Processed tables in DB using catalog. Time: 194.501 seconds. 607 tables
Processed tables in DB using sqlContext. Time: 0.243 seconds. 55 tables
Processed tables in DB using catalog. Time: 17.907 seconds. 55 tables
Processed tables in DB using sqlContext. Time: 0.238 seconds. 13 tables
Processed tables in DB using catalog. Time: 4.642 seconds. 13 tables
Processed tables in DB using sqlContext. Time: 0.238 seconds. 392 tables
Processed tables in DB using catalog. Time: 126.999 seconds. 392 tables
... goes on...
{code}

in stand-alone spark-shell, as opposed to full-program spark-submit:

{code:java}

import java.util.Date
val dbs = spark.catalog.listDatabases.map(_.name).collect
for (d <- dbs) {
val d1 = new Date().getTime
val dbs = spark.sqlContext.tables(d).filter("isTemporary = 
false").select("tableName").collect.map(_.getString(0))
println("Processed tables in DB using sqlContext. Time: " + ((new 
Date().getTime - d1) / 1000.0) + s" seconds. ${dbs.length} tables")
val d2 = new Date().getTime
val dbs2 = 
spark.catalog.listTables(d).filter(!_.isTemporary).map(_.name).collect
println("Processed tables in DB using catalog. Time: " + ((new 
Date().getTime - d2) / 1000.0) + s" seconds. ${dbs.length} tables")
}
{code}


{code:java}
Processed tables in DB using sqlContext. Time: 0.59 seconds. 19 tables
Processed tables in DB using catalog. Time: 6.285 seconds. 19 tables
Processed tables in DB using sqlContext. Time: 0.276 seconds. 608 tables
Processed tables in DB using catalog. Time: 201.295 seconds. 608 tables
Processed tables in DB using sqlContext. Time: 0.241 seconds. 55 tables
... goes on. timings similar
{code}

So. Apart from the weird listDatabases issue, listTables is consistently slow.



was (Author: revolucion09):
Regarding listtables, here is the  code used inside the program:

{code:java}

 println(s"processing all tables for every db. db length is: 
${databases.tail.length}")
  for (d <- databases.tail) {
val d1 = new Date().getTime
val dbs = spark.sqlContext.tables(d).filter("isTemporary = 
false").select("tableName").collect.map(_.getString(0))
println("Processed tables in DB using sqlContext. Time: " + ((new 
Date().getTime - d1) / 1000.0) + s" seconds. ${dbs.length} tables")
val d2 = new Date().getTime
val dbs2 = 
spark.catalog.listTables(d).filter(!_.isTemporary).map(_.name).collect
println("Processed tables in DB using catalog. Time: " + ((new 
Date().getTime - d2) / 1000.0) + s" seconds. ${dbs.length} tables")
other stuff
{code}

and timings are as follows 

{code:java}

processing all tables for every db. db length is: 30
Processed tables in DB using sqlContext. Time: 0.863 seconds. 19 tables
Processed tables in DB using catalog. Time: {color:red}6.978 seconds{color}. 19 
tables
Processed tables in DB using sqlContext. Time: 0.276 seconds. 607 tables
Processed tables in DB using catalog. Time: {color:red}194.501 seconds{color}. 
607 tables
Processed tables in DB using sqlContext. Time: 0.243 seconds. 55 tables
Processed tables in DB using catalog. Time: {color:red}17.907 seconds{color}. 
55 tables
Processed tables in DB using sqlContext. Time: 0.238 seconds. 13 tables
Processed tables in DB using catalog. Time: {color:red}4.642 seconds{color}. 13 
tables
Processed tables in DB using sqlContext. Time: 0.238 seconds. 392 tables
Processed tables in DB using catalog. Time: {color:red}126.999 seconds{color}. 
392 tables
... goes on...
{code}

in stand-alone spark-shell, as opposed to full-program 

[jira] [Comment Edited] (SPARK-21198) SparkSession catalog is terribly slow

2017-06-26 Thread Saif Addin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16063431#comment-16063431
 ] 

Saif Addin edited comment on SPARK-21198 at 6/26/17 5:23 PM:
-

Regarding listtables, here is the  code used inside the program:

{code:java}

 println(s"processing all tables for every db. db length is: 
${databases.tail.length}")
  for (d <- databases.tail) {
val d1 = new Date().getTime
val dbs = spark.sqlContext.tables(d).filter("isTemporary = 
false").select("tableName").collect.map(_.getString(0))
println("Processed tables in DB using sqlContext. Time: " + ((new 
Date().getTime - d1) / 1000.0) + s" seconds. ${dbs.length} tables")
val d2 = new Date().getTime
val dbs2 = 
spark.catalog.listTables(d).filter(!_.isTemporary).map(_.name).collect
println("Processed tables in DB using catalog. Time: " + ((new 
Date().getTime - d2) / 1000.0) + s" seconds. ${dbs.length} tables")
other stuff
{code}

and timings are as follows 

{code:java}

processing all tables for every db. db length is: 30
Processed tables in DB using sqlContext. Time: 0.863 seconds. 19 tables
Processed tables in DB using catalog. Time: {color:red}6.978 seconds{color}. 19 
tables
Processed tables in DB using sqlContext. Time: 0.276 seconds. 607 tables
Processed tables in DB using catalog. Time: {color:red}194.501 seconds{color}. 
607 tables
Processed tables in DB using sqlContext. Time: 0.243 seconds. 55 tables
Processed tables in DB using catalog. Time: {color:red}17.907 seconds{color}. 
55 tables
Processed tables in DB using sqlContext. Time: 0.238 seconds. 13 tables
Processed tables in DB using catalog. Time: {color:red}4.642 seconds{color}. 13 
tables
Processed tables in DB using sqlContext. Time: 0.238 seconds. 392 tables
Processed tables in DB using catalog. Time: {color:red}126.999 seconds{color}. 
392 tables
... goes on...
{code}

in stand-alone spark-shell, as opposed to full-program spark-submit:

{code:java}

import java.util.Date
val dbs = spark.catalog.listDatabases.map(_.name).collect
for (d <- dbs) {
val d1 = new Date().getTime
val dbs = spark.sqlContext.tables(d).filter("isTemporary = 
false").select("tableName").collect.map(_.getString(0))
println("Processed tables in DB using sqlContext. Time: " + ((new 
Date().getTime - d1) / 1000.0) + s" seconds. ${dbs.length} tables")
val d2 = new Date().getTime
val dbs2 = 
spark.catalog.listTables(d).filter(!_.isTemporary).map(_.name).collect
println("Processed tables in DB using catalog. Time: " + ((new 
Date().getTime - d2) / 1000.0) + s" seconds. ${dbs.length} tables")
}
{code}


{code:java}
Processed tables in DB using sqlContext. Time: 0.59 seconds. 19 tables
Processed tables in DB using catalog. Time: {color:red}6.285 seconds{color}. 19 
tables
Processed tables in DB using sqlContext. Time: 0.276 seconds. 608 tables
Processed tables in DB using catalog. Time: {color:red}201.295 seconds{color}. 
608 tables
Processed tables in DB using sqlContext. Time: 0.241 seconds. 55 tables
... goes on. timings similar
{code}

So. Apart from the weird listDatabases issue, listTables is consistently slow.



was (Author: revolucion09):
Regarding listtables, here is the  code used inside the program:

{code:java}

 println(s"processing all tables for every db. db length is: 
${databases.tail.length}")
  for (d <- databases.tail) {
val d1 = new Date().getTime
val dbs = spark.sqlContext.tables(d).filter("isTemporary = 
false").select("tableName").collect.map(_.getString(0))
println("Processed tables in DB using sqlContext. Time: " + ((new 
Date().getTime - d1) / 1000.0) + s" seconds. ${dbs.length} tables")
val d2 = new Date().getTime
val dbs2 = 
spark.catalog.listTables(d).filter(!_.isTemporary).map(_.name).collect
println("Processed tables in DB using catalog. Time: " + ((new 
Date().getTime - d2) / 1000.0) + s" seconds. ${dbs.length} tables")
other stuff
{code}

and timings are as follows 

{code:java}

processing all tables for every db. db length is: 30
Processed tables in DB using sqlContext. Time: 0.863 seconds. 19 tables
Processed tables in DB using catalog. Time: 6.978 seconds. 19 tables
Processed tables in DB using sqlContext. Time: 0.276 seconds. 607 tables
Processed tables in DB using catalog. Time: 194.501 seconds. 607 tables
Processed tables in DB using sqlContext. Time: 0.243 seconds. 55 tables
Processed tables in DB using catalog. Time: 17.907 seconds. 55 tables
Processed tables in DB using sqlContext. Time: 0.238 seconds. 13 tables
Processed tables in DB using catalog. Time: 4.642 seconds. 13 tables
Processed tables in DB using sqlContext. Time: 0.238 seconds. 392 tables
Processed tables in DB using catalog. Time: 126.999 seconds. 392 tables
... goes on...
{code}

in stand-alone sp

[jira] [Comment Edited] (SPARK-21198) SparkSession catalog is terribly slow

2017-06-26 Thread Saif Addin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16063431#comment-16063431
 ] 

Saif Addin edited comment on SPARK-21198 at 6/26/17 5:22 PM:
-

Regarding listtables, here is the  code used inside the program:

{code:java}

 println(s"processing all tables for every db. db length is: 
${databases.tail.length}")
  for (d <- databases.tail) {
val d1 = new Date().getTime
val dbs = spark.sqlContext.tables(d).filter("isTemporary = 
false").select("tableName").collect.map(_.getString(0))
println("Processed tables in DB using sqlContext. Time: " + ((new 
Date().getTime - d1) / 1000.0) + s" seconds. ${dbs.length} tables")
val d2 = new Date().getTime
val dbs2 = 
spark.catalog.listTables(d).filter(!_.isTemporary).map(_.name).collect
println("Processed tables in DB using catalog. Time: " + ((new 
Date().getTime - d2) / 1000.0) + s" seconds. ${dbs.length} tables")
other stuff
{code}

and timings are as follows 

{code:java}

processing all tables for every db. db length is: 30
Processed tables in DB using sqlContext. Time: 0.863 seconds. 19 tables
Processed tables in DB using catalog. Time: 6.978 seconds. 19 tables
Processed tables in DB using sqlContext. Time: 0.276 seconds. 607 tables
Processed tables in DB using catalog. Time: 194.501 seconds. 607 tables
Processed tables in DB using sqlContext. Time: 0.243 seconds. 55 tables
Processed tables in DB using catalog. Time: 17.907 seconds. 55 tables
Processed tables in DB using sqlContext. Time: 0.238 seconds. 13 tables
Processed tables in DB using catalog. Time: 4.642 seconds. 13 tables
Processed tables in DB using sqlContext. Time: 0.238 seconds. 392 tables
Processed tables in DB using catalog. Time: 126.999 seconds. 392 tables
... goes on...
{code}

in stand-alone spark-shell, as opposed to full-program spark-submit:

{code:java}

import java.util.Date
val dbs = spark.catalog.listDatabases.map(_.name).collect
for (d <- dbs) {
val d1 = new Date().getTime
val dbs = spark.sqlContext.tables(d).filter("isTemporary = 
false").select("tableName").collect.map(_.getString(0))
println("Processed tables in DB using sqlContext. Time: " + ((new 
Date().getTime - d1) / 1000.0) + s" seconds. ${dbs.length} tables")
val d2 = new Date().getTime
val dbs2 = 
spark.catalog.listTables(d).filter(!_.isTemporary).map(_.name).collect
println("Processed tables in DB using catalog. Time: " + ((new 
Date().getTime - d2) / 1000.0) + s" seconds. ${dbs.length} tables")
}
{code}


{code:java}
Processed tables in DB using sqlContext. Time: 0.59 seconds. 19 tables
Processed tables in DB using catalog. Time: 6.285 seconds. 19 tables
Processed tables in DB using sqlContext. Time: 0.276 seconds. 608 tables
Processed tables in DB using catalog. Time: 201.295 seconds. 608 tables
Processed tables in DB using sqlContext. Time: 0.241 seconds. 55 tables
... goes on. timings similar
{code}

So. Apart from the weird listDatabases issue, listTables is consistently slow.



was (Author: revolucion09):
Regarding listtables, here is the  code used inside the program:

{code:java}

 println(s"processing all tables for every db. db length is: 
${databases.tail.length}")
  for (d <- databases.tail) {
val d1 = new Date().getTime
val dbs = spark.sqlContext.tables(d).filter("isTemporary = 
false").select("tableName").collect.map(_.getString(0))
println("Processed tables in DB using sqlContext. Time: " + ((new 
Date().getTime - d1) / 1000.0) + s" seconds. ${dbs.length} tables")
val d2 = new Date().getTime
val dbs2 = 
spark.catalog.listTables(d).filter(!_.isTemporary).map(_.name).collect
println("Processed tables in DB using catalog. Time: " + ((new 
Date().getTime - d2) / 1000.0) + s" seconds. ${dbs.length} tables")
other stuff
{code}

and timings are as follows 

{code:java}

processing all tables for every db. db length is: 30
Processed tables in DB using sqlContext. Time: 0.863 seconds. 19 tables
Processed tables in DB using catalog. Time: 6.978 seconds. 19 tables
Processed tables in DB using sqlContext. Time: 0.276 seconds. 607 tables
Processed tables in DB using catalog. Time: 194.501 seconds. 607 tables
Processed tables in DB using sqlContext. Time: 0.243 seconds. 55 tables
Processed tables in DB using catalog. Time: 17.907 seconds. 55 tables
Processed tables in DB using sqlContext. Time: 0.238 seconds. 13 tables
Processed tables in DB using catalog. Time: 4.642 seconds. 13 tables
Processed tables in DB using sqlContext. Time: 0.238 seconds. 392 tables
Processed tables in DB using catalog. Time: 126.999 seconds. 392 tables
... goes on...
{code}




> SparkSession catalog is terribly slow
> -
>
> Key: SPARK-21198
> URL: https:/

[jira] [Commented] (SPARK-21215) Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve

2017-06-26 Thread Michael Kunkel (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16063453#comment-16063453
 ] 

Michael Kunkel commented on SPARK-21215:


[~srowen] I am new to this board, so I do not understand how to see the 
resolution.

> Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot 
> resolve
> -
>
> Key: SPARK-21215
> URL: https://issues.apache.org/jira/browse/SPARK-21215
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, SQL
>Affects Versions: 2.1.1
> Environment: macOSX
>Reporter: Michael Kunkel
>
> First Spark project.
> I have a Java method that returns a Dataset<Row>. I want to convert this to a 
> Dataset<StatusChangeDB>, where the object is named StatusChangeDB. I have created a 
> POJO StatusChangeDB.java and coded it with all the query objects found in the 
> MySQL table.
> I then create an Encoder<StatusChangeDB> and convert the Dataset<Row> to a 
> Dataset<StatusChangeDB>. However, when I try to .show() the values of the 
> Dataset<StatusChangeDB> I receive the error
> bq. 
> bq. Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot 
> resolve '`hvpinid_quad`' given input columns: [status_change_type, 
> superLayer, loclayer, sector, locwire];
> bq.   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
> bq.   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:86)
> bq.   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:83)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:290)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:290)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:289)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:307)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:305)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4$$anonfun$apply$10.apply(TreeNode.scala:324)
> bq.   at 
> scala.collection.MapLike$MappedValues$$anonfun$iterator$3.apply(MapLike.scala:246)
> bq.   at 
> scala.collection.MapLike$MappedValues$$anonfun$iterator$3.apply(MapLike.scala:246)
> bq.   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
> bq.   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> bq.   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> bq.   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> bq.   at scala.collection.IterableLike$$anon$1.foreach(IterableLike.scala:311)
> bq.   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
> bq.   at 
> scala.collection.mutable.MapBuilder.$plus$plus$eq(MapBuilder.scala:25)
> bq.   at 
> scala.collection.TraversableViewLike$class.force(TraversableViewLike.scala:88)
> bq.   at scala.collection.IterableLike$$anon$1.force(IterableLike.scala:311)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:332)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:305)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:255)
> bq.   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:255)
> bq.   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:266)
> bq.   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:276)
> bq.   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$6.apply(QueryPlan.scala:285)
> bq.   at 
> 

[jira] [Commented] (SPARK-21183) Unable to return Google BigQuery INTEGER data type into Spark via google BigQuery JDBC driver: java.sql.SQLDataException: [Simba][JDBC](10140) Error converting value t

2017-06-26 Thread Matthew Walton (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16063448#comment-16063448
 ] 

Matthew Walton commented on SPARK-21183:


Hi Sean,

I think the issue is Spark is not handling the integer data types properly and 
this causes a conversion error.  Running a simple select against the same data 
with a tool like SQuirrel works fine to fetch back integer data but Spark seems 
to be using a different method to fetch.  I'm not sure if you have access to 
Google BigQuery but you can easily reproduce this issue.  If not, please take a 
look at the other ticket for Hive JDBC drivers as I think you probably have 
access to Hive for testing.  If this is asking too much I guess I'll try to log 
this and the other issue with Simba and let them pursue the issues more.  
Essentially, I would not expect an error on handling the integer type with 
Spark especially with the driver that Google expects its customers to use.

Thanks,
Matt
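
One possible way to sidestep the driver-side long conversion while this is investigated is to push a cast into the query the JDBC source runs, by passing a subquery as dbtable. This is only a sketch: it assumes the Simba BigQuery driver accepts a standard-SQL subquery with CAST ... AS STRING in that position, and the column name intcol is hypothetical:

{code:java}
// Hedged workaround sketch: have BigQuery return the value as a string so Spark's
// JDBC reader never calls getLong on it. Column name and subquery support are assumptions.
val gbqUrl = "jdbc:bigquery://https://www.googleapis.com/bigquery/v2;..."  // full option string as in step 4 of the description

val gbqAsString = spark.read.format("jdbc")
  .option("url", gbqUrl)
  .option("driver", "com.simba.googlebigquery.jdbc42.Driver")
  .option("dbtable", "(SELECT CAST(intcol AS STRING) AS intcol_str FROM test.lu_test_integer) AS t")
  .option("user", "").option("password", "")
  .load()

gbqAsString.show()
{code}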

> Unable to return Google BigQuery INTEGER data type into Spark via google 
> BigQuery JDBC driver: java.sql.SQLDataException: [Simba][JDBC](10140) Error 
> converting value to long.
> --
>
> Key: SPARK-21183
> URL: https://issues.apache.org/jira/browse/SPARK-21183
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, SQL
>Affects Versions: 1.6.0, 2.0.0, 2.1.1
> Environment: OS:  Linux
> Spark  version 2.1.1
> JDBC:  Download the latest google BigQuery JDBC Driver from Google
>Reporter: Matthew Walton
>
> I'm trying to fetch back data in Spark using a JDBC connection to Google 
> BigQuery.  Unfortunately, when I try to query data that resides in an INTEGER 
> column I get the following error:  
> java.sql.SQLDataException: [Simba][JDBC](10140) Error converting value to 
> long.  
> Steps to reproduce:
> 1) On Google BigQuery console create a simple table with an INT column and 
> insert some data 
> 2) Copy the Google BigQuery JDBC driver to the machine where you will run 
> Spark Shell
> 3) Start Spark shell loading the GoogleBigQuery JDBC driver jar files
> ./spark-shell --jars 
> /home/ec2-user/jdbc/gbq/GoogleBigQueryJDBC42.jar,/home/ec2-user/jdbc/gbq/google-api-client-1.22.0.jar,/home/ec2-user/jdbc/gbq/google-api-services-bigquery-v2-rev320-1.22.0.jar,/home/ec2-user/jdbc/gbq/google-http-client-1.22.0.jar,/home/ec2-user/jdbc/gbq/google-http-client-jackson2-1.22.0.jar,/home/ec2-user/jdbc/gbq/google-oauth-client-1.22.0.jar,/home/ec2-user/jdbc/gbq/jackson-core-2.1.3.jar
> 4) In Spark shell load the data from Google BigQuery using the JDBC driver
> val gbq = spark.read.format("jdbc").options(Map("url" -> 
> "jdbc:bigquery://https://www.googleapis.com/bigquery/v2;ProjectId=your-project-name-here;OAuthType=0;OAuthPvtKeyPath=/usr/lib/spark/YourProjectPrivateKey.json;OAuthServiceAcctEmail=YourEmail@gmail.comAllowLargeResults=1;LargeResultDataset=_bqodbc_temp_tables;LargeResultTable=_matthew;Timeout=600","dbtable";
>  -> 
> "test.lu_test_integer")).option("driver","com.simba.googlebigquery.jdbc42.Driver").option("user","").option("password","").load()
> 5) In Spark shell try to display the data
> gbq.show()
> At this point you should see the error:
> scala> gbq.show()
> 17/06/22 19:34:57 WARN TaskSetManager: Lost task 0.0 in stage 3.0 (TID 6, 
> ip-172-31-37-165.ec2.internal, executor 3): java.sql.SQLDataException: 
> [Simba][JDBC](10140) Error converting value to long.
> at com.simba.exceptions.ExceptionConverter.toSQLException(Unknown 
> Source)
> at com.simba.utilities.conversion.TypeConverter.toLong(Unknown Source)
> at com.simba.jdbc.common.SForwardResultSet.getLong(Unknown Source)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$8.apply(JdbcUtils.scala:365)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$8.apply(JdbcUtils.scala:364)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:286)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:268)
> at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
> at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
> at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.

[jira] [Comment Edited] (SPARK-21199) Its not possible to impute Vector types

2017-06-26 Thread Franklyn Dsouza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16063416#comment-16063416
 ] 

Franklyn Dsouza edited comment on SPARK-21199 at 6/26/17 5:16 PM:
--

For this particular scenario I have a table with two columns: one is a string 
`document_type` and the other is an array of tokens for the document. I want 
to do a TF-IDF on these tokens.
The IDF needs to be done per `document_type`, so I pivot on `document_type` and 
then do the IDF on the TF vectors.

This pivoting introduces nulls for missing columns that need to be imputed. I 
can't impute array types either, and fixing it at the token generation step would 
involve a lot of left joins to align various data sources.


was (Author: franklyndsouza):
For this particular scenario I have a table with two columns one is a string 
`document_type` and the other is a an array of tokens for the document. I want 
to do a TF-IDF on these tokens.
the IDF needs to be done per `document_type` so i pivot on `document_type` and 
then do the IDF on the TF vetors.

This pivoting introduces nulls for missing columns that need to be imputed. I 
can't impute array type either and fixing it at the token generation step would 
involve a lot of left joins to align various data sources.

> Its not possible to impute Vector types
> ---
>
> Key: SPARK-21199
> URL: https://issues.apache.org/jira/browse/SPARK-21199
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0, 2.1.1
>Reporter: Franklyn Dsouza
>
> There are cases where nulls end up in vector columns in dataframes. Currently 
> there is no way to fill in these nulls because it's not possible to create a 
> literal vector column expression using lit().
> Also the entire pyspark ml api will fail when it encounters nulls, so this 
> makes it hard to work with the data.
> I think that either vector support should be added to the imputer or vectors 
> should be supported in column expressions so they can be used in a coalesce.
> [~mlnick]
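
Until vector literals are supported, one workaround is to fill the nulls with a UDF instead of coalesce(). A minimal sketch, assuming the column holds org.apache.spark.ml.linalg vectors of a known size; numFeatures, df and the column name tf are placeholders:

{code:java}
// Workaround sketch: lit() cannot build a Vector literal, but a UDF can return one,
// so the nulls introduced by the pivot are replaced with an empty sparse vector.
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.functions.{col, udf}

val numFeatures = 1000  // assumed: must match the TF vector dimensionality

val fillNullVector = udf { (v: Vector) =>
  if (v == null) Vectors.sparse(numFeatures, Array.empty[Int], Array.empty[Double]) else v
}

val imputed = df.withColumn("tf", fillNullVector(col("tf")))  // df and "tf" are placeholders
{code}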



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21199) Its not possible to impute Vector types

2017-06-26 Thread Franklyn Dsouza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16063416#comment-16063416
 ] 

Franklyn Dsouza edited comment on SPARK-21199 at 6/26/17 5:16 PM:
--

For this particular scenario I have a table with two columns one is a string 
`document_type` and the other is a an array of tokens for the document. I want 
to do a TF-IDF on these tokens.
the IDF needs to be done per `document_type` so i pivot on `document_type` and 
then do the IDF on the TF vetors.

This pivoting introduces nulls for missing columns that need to be imputed. I 
can't impute array type either and fixing it at the token generation step would 
involve a lot of left joins to align various data sources.


was (Author: franklyndsouza):
For this particular scenario I have a table with two columns one is a string 
`document_type` and the other is a an array of tokens for the document. I want 
to do a TF-IDF on these tokens.
the IDF needs to be done per `document_type` so i do pivot on `document_type` 
and then do the IDF on the TF vetors.

This pivoting introduces nulls for missing columns that need to be imputed. I 
can't impute array type either and fixing it at the token generation step would 
involve a lot of left joins to align various data sources.

> Its not possible to impute Vector types
> ---
>
> Key: SPARK-21199
> URL: https://issues.apache.org/jira/browse/SPARK-21199
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0, 2.1.1
>Reporter: Franklyn Dsouza
>
> There are cases where nulls end up in vector columns in dataframes. Currently 
> there is no way to fill in these nulls because it's not possible to create a 
> literal vector column expression using lit().
> Also the entire pyspark ml api will fail when it encounters nulls, so this 
> makes it hard to work with the data.
> I think that either vector support should be added to the imputer or vectors 
> should be supported in column expressions so they can be used in a coalesce.
> [~mlnick]



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21198) SparkSession catalog is terribly slow

2017-06-26 Thread Saif Addin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16063431#comment-16063431
 ] 

Saif Addin edited comment on SPARK-21198 at 6/26/17 5:15 PM:
-

Regarding listTables, here is the code used inside the program:

{code:java}
import java.util.Date

println(s"processing all tables for every db. db length is: ${databases.tail.length}")
for (d <- databases.tail) {
  val d1 = new Date().getTime
  val dbs = spark.sqlContext.tables(d).filter("isTemporary = false").select("tableName").collect.map(_.getString(0))
  println("Processed tables in DB using sqlContext. Time: " + ((new Date().getTime - d1) / 1000.0) + s" seconds. ${dbs.length} tables")
  val d2 = new Date().getTime
  val dbs2 = spark.catalog.listTables(d).filter(!_.isTemporary).map(_.name).collect
  println("Processed tables in DB using catalog. Time: " + ((new Date().getTime - d2) / 1000.0) + s" seconds. ${dbs2.length} tables")
  // other stuff
}
{code}

and the timings are as follows:

{code:java}

processing all tables for every db. db length is: 30
Processed tables in DB using sqlContext. Time: 0.863 seconds. 19 tables
Processed tables in DB using catalog. Time: 6.978 seconds. 19 tables
Processed tables in DB using sqlContext. Time: 0.276 seconds. 607 tables
Processed tables in DB using catalog. Time: 194.501 seconds. 607 tables
Processed tables in DB using sqlContext. Time: 0.243 seconds. 55 tables
Processed tables in DB using catalog. Time: 17.907 seconds. 55 tables
Processed tables in DB using sqlContext. Time: 0.238 seconds. 13 tables
Processed tables in DB using catalog. Time: 4.642 seconds. 13 tables
Processed tables in DB using sqlContext. Time: 0.238 seconds. 392 tables
Processed tables in DB using catalog. Time: 126.999 seconds. 392 tables
... goes on...
{code}





was (Author: revolucion09):
Regarding listTables, here is the code used inside the program:

{code:java}
import java.util.Date

println(s"processing all tables for every db. db length is: ${databases.tail.length}")
for (d <- databases.tail) {
  val d1 = new Date().getTime
  val dbs = spark.sqlContext.tables(d).filter("isTemporary = false").select("tableName").collect.map(_.getString(0))
  println("Processed tables in DB using sqlContext. Time: " + ((new Date().getTime - d1) / 1000.0) + s" seconds. ${dbs.length} tables")
  val d2 = new Date().getTime
  val dbs2 = spark.catalog.listTables(d).filter(!_.isTemporary).map(_.name).collect
  println("Processed tables in DB using catalog. Time: " + ((new Date().getTime - d2) / 1000.0) + s" seconds. ${dbs2.length} tables")
  // other stuff
}
{code}

and the timings are as follows:

{code:java}

processing all tables for every db. db length is: 30
Processed tables in DB using sqlContext. Time: 0.863 seconds. 19 tables
Processed tables in DB using catalog. Time: 6.978 seconds. 19 tables
Processed tables in DB using sqlContext. Time: 0.276 seconds. 607 tables
Processed tables in DB using catalog. Time: 194.501 seconds. 607 tables
Processed tables in DB using sqlContext. Time: 0.243 seconds. 55 tables
Processed tables in DB using catalog. Time: 17.907 seconds. 55 tables
Processed tables in DB using sqlContext. Time: 0.238 seconds. 13 tables
Processed tables in DB using catalog. Time: 4.642 seconds. 13 tables
Processed tables in DB using sqlContext. Time: 0.238 seconds. 392 tables
... goes on...
{code}




> SparkSession catalog is terribly slow
> -
>
> Key: SPARK-21198
> URL: https://issues.apache.org/jira/browse/SPARK-21198
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Saif Addin
>
> We have a considerably large Hive metastore and a Spark program that goes 
> through Hive data availability.
> In Spark 1.x, we were using sqlContext.tableNames, sqlContext.sql() and 
> sqlContext.isCached() to go through Hive metastore information.
> Once migrated to Spark 2.x we switched over to SparkSession.catalog instead, but 
> it turns out that both listDatabases() and listTables() take between 5 and 20 
> minutes, depending on the database, to return results, using operations such as 
> the following one:
> spark.catalog.listTables(db).filter(_.isTemporary).map(_.name).collect
> which made the program unbearably slow to return a list of tables.
> I know we still have spark.sqlContext.tableNames as a workaround, but I assume 
> this is going to be deprecated soon?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21199) Its not possible to impute Vector types

2017-06-26 Thread Franklyn Dsouza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16063416#comment-16063416
 ] 

Franklyn Dsouza edited comment on SPARK-21199 at 6/26/17 5:15 PM:
--

For this particular scenario I have a table with two columns: one is a string 
`document_type` and the other is an array of tokens for the document. I want 
to do a TF-IDF on these tokens.
The IDF needs to be done per `document_type`, so I do a pivot on `document_type` 
and then do the IDF on the TF vectors.

This pivoting introduces nulls for missing columns that need to be imputed. I 
can't impute array types either, and fixing it at the token generation step would 
involve a lot of left joins to align various data sources.


was (Author: franklyndsouza):
For this particular scenario I have a table with two columns: one is a string 
`document_type` and the other is an array of tokens for the document. I want 
to do a TF-IDF.
The IDF needs to be done per `document_type`, so I do a pivot on `document_type` 
and then do the IDF on the TF vectors.

This pivoting introduces nulls for missing columns that need to be imputed. I 
can't impute array types either, and fixing it at the token generation step would 
involve a lot of left joins to align various data sources.

> Its not possible to impute Vector types
> ---
>
> Key: SPARK-21199
> URL: https://issues.apache.org/jira/browse/SPARK-21199
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0, 2.1.1
>Reporter: Franklyn Dsouza
>
> There are cases where nulls end up in vector columns in dataframes. Currently 
> there is no way to fill in these nulls because it's not possible to create a 
> literal vector column expression using lit().
> Also, the entire PySpark ML API will fail when it encounters nulls, so this 
> makes it hard to work with the data.
> I think that either vector support should be added to the imputer or vectors 
> should be supported in column expressions so they can be used in a coalesce.
> [~mlnick]



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21215) Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve

2017-06-26 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-21215.
---
Resolution: Duplicate

> Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot 
> resolve
> -
>
> Key: SPARK-21215
> URL: https://issues.apache.org/jira/browse/SPARK-21215
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, SQL
>Affects Versions: 2.1.1
> Environment: macOSX
>Reporter: Michael Kunkel
>
> First Spark project.
> I have a Java method that returns a Dataset<Row>. I want to convert this to a 
> Dataset<StatusChangeDB>, where the object is named StatusChangeDB. I have created 
> a POJO StatusChangeDB.java and coded it with all the query objects found in the 
> MySQL table.
> I then create an Encoder<StatusChangeDB> and convert the Dataset<Row> to a 
> Dataset<StatusChangeDB>. However, when I try to .show() the values of the 
> Dataset<StatusChangeDB> I receive the error
> bq. 
> bq. Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot 
> resolve '`hvpinid_quad`' given input columns: [status_change_type, 
> superLayer, loclayer, sector, locwire];
> bq.   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
> bq.   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:86)
> bq.   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:83)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:290)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:290)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:289)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:307)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:305)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4$$anonfun$apply$10.apply(TreeNode.scala:324)
> bq.   at 
> scala.collection.MapLike$MappedValues$$anonfun$iterator$3.apply(MapLike.scala:246)
> bq.   at 
> scala.collection.MapLike$MappedValues$$anonfun$iterator$3.apply(MapLike.scala:246)
> bq.   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
> bq.   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> bq.   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> bq.   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> bq.   at scala.collection.IterableLike$$anon$1.foreach(IterableLike.scala:311)
> bq.   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
> bq.   at 
> scala.collection.mutable.MapBuilder.$plus$plus$eq(MapBuilder.scala:25)
> bq.   at 
> scala.collection.TraversableViewLike$class.force(TraversableViewLike.scala:88)
> bq.   at scala.collection.IterableLike$$anon$1.force(IterableLike.scala:311)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:332)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:305)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:255)
> bq.   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:255)
> bq.   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:266)
> bq.   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:276)
> bq.   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$6.apply(QueryPlan.scala:285)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
> bq.   at 
> org.apache.spark.sql.cataly

[jira] [Resolved] (SPARK-21214) Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve '

2017-06-26 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-21214.
---
Resolution: Invalid

This should be a question on the mailing list. 

> Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot 
> resolve '
> ---
>
> Key: SPARK-21214
> URL: https://issues.apache.org/jira/browse/SPARK-21214
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, Spark Core, SQL
>Affects Versions: 2.1.1
> Environment: macOSX
>Reporter: Michael Kunkel
>
> First Spark project.
> I have a Java method that returns a Dataset<Row>. I want to convert this to a 
> Dataset<StatusChangeDB>, where the object is named StatusChangeDB. I have created 
> a POJO StatusChangeDB.java and coded it with all the query objects found in the 
> MySQL table.
> I then create an Encoder<StatusChangeDB> and convert the Dataset<Row> to a 
> Dataset<StatusChangeDB>. However, when I try to .show() the values of the 
> Dataset<StatusChangeDB> I receive the error
> bq. 
> bq. Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot 
> resolve '`hvpinid_quad`' given input columns: [status_change_type, 
> superLayer, loclayer, sector, locwire];
> bq.   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
> bq.   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:86)
> bq.   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:83)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:290)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:290)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:289)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:307)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:305)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4$$anonfun$apply$10.apply(TreeNode.scala:324)
> bq.   at 
> scala.collection.MapLike$MappedValues$$anonfun$iterator$3.apply(MapLike.scala:246)
> bq.   at 
> scala.collection.MapLike$MappedValues$$anonfun$iterator$3.apply(MapLike.scala:246)
> bq.   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
> bq.   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> bq.   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> bq.   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> bq.   at scala.collection.IterableLike$$anon$1.foreach(IterableLike.scala:311)
> bq.   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
> bq.   at 
> scala.collection.mutable.MapBuilder.$plus$plus$eq(MapBuilder.scala:25)
> bq.   at 
> scala.collection.TraversableViewLike$class.force(TraversableViewLike.scala:88)
> bq.   at scala.collection.IterableLike$$anon$1.force(IterableLike.scala:311)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:332)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:305)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:255)
> bq.   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:255)
> bq.   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:266)
> bq.   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:276)
> bq.   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$6.apply(QueryPlan.scala:285)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterato

[jira] [Commented] (SPARK-21198) SparkSession catalog is terribly slow

2017-06-26 Thread Saif Addin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16063431#comment-16063431
 ] 

Saif Addin commented on SPARK-21198:


Regarding listTables, here is the code used inside the program:

{code:java}
import java.util.Date

println(s"processing all tables for every db. db length is: ${databases.tail.length}")
for (d <- databases.tail) {
  val d1 = new Date().getTime
  val dbs = spark.sqlContext.tables(d).filter("isTemporary = false").select("tableName").collect.map(_.getString(0))
  println("Processed tables in DB using sqlContext. Time: " + ((new Date().getTime - d1) / 1000.0) + s" seconds. ${dbs.length} tables")
  val d2 = new Date().getTime
  val dbs2 = spark.catalog.listTables(d).filter(!_.isTemporary).map(_.name).collect
  println("Processed tables in DB using catalog. Time: " + ((new Date().getTime - d2) / 1000.0) + s" seconds. ${dbs2.length} tables")
  // other stuff
}
{code}

and the timings are as follows:

{code:java}

processing all tables for every db. db length is: 30
Processed tables in DB using sqlContext. Time: 0.863 seconds. 19 tables
Processed tables in DB using catalog. Time: 6.978 seconds. 19 tables
Processed tables in DB using sqlContext. Time: 0.276 seconds. 607 tables
Processed tables in DB using catalog. Time: 194.501 seconds. 607 tables
Processed tables in DB using sqlContext. Time: 0.243 seconds. 55 tables
Processed tables in DB using catalog. Time: 17.907 seconds. 55 tables
Processed tables in DB using sqlContext. Time: 0.238 seconds. 13 tables
Processed tables in DB using catalog. Time: 4.642 seconds. 13 tables
Processed tables in DB using sqlContext. Time: 0.238 seconds. 392 tables
... goes on...
{code}
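
As a further comparison point, one more variant could be timed (an illustrative 
sketch only; it assumes SHOW TABLES in this Spark version exposes the same 
tableName/isTemporary columns that sqlContext.tables(d) returns above):

{code:java}
// Sketch: time the plain SQL command path alongside the two approaches above.
val d3 = new Date().getTime
val dbs3 = spark.sql(s"SHOW TABLES IN $d")
  .filter("isTemporary = false")
  .select("tableName")
  .collect
  .map(_.getString(0))
println("Processed tables in DB using SHOW TABLES. Time: " + ((new Date().getTime - d3) / 1000.0) + s" seconds. ${dbs3.length} tables")
{code}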




> SparkSession catalog is terribly slow
> -
>
> Key: SPARK-21198
> URL: https://issues.apache.org/jira/browse/SPARK-21198
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Saif Addin
>
> We have a considerably large Hive metastore and a Spark program that goes 
> through Hive data availability.
> In Spark 1.x, we were using sqlContext.tableNames, sqlContext.sql() and 
> sqlContext.isCached() to go through Hive metastore information.
> Once migrated to Spark 2.x we switched over to SparkSession.catalog instead, but 
> it turns out that both listDatabases() and listTables() take between 5 and 20 
> minutes, depending on the database, to return results, using operations such as 
> the following one:
> spark.catalog.listTables(db).filter(_.isTemporary).map(_.name).collect
> which made the program unbearably slow to return a list of tables.
> I know we still have spark.sqlContext.tableNames as a workaround, but I assume 
> this is going to be deprecated soon?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21183) Unable to return Google BigQuery INTEGER data type into Spark via google BigQuery JDBC driver: java.sql.SQLDataException: [Simba][JDBC](10140) Error converting value t

2017-06-26 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16063428#comment-16063428
 ] 

Sean Owen commented on SPARK-21183:
---

I'm not sure that follows, but I don't know either. I think you'd have to 
establish what you think the Spark issue is here.

> Unable to return Google BigQuery INTEGER data type into Spark via google 
> BigQuery JDBC driver: java.sql.SQLDataException: [Simba][JDBC](10140) Error 
> converting value to long.
> --
>
> Key: SPARK-21183
> URL: https://issues.apache.org/jira/browse/SPARK-21183
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, SQL
>Affects Versions: 1.6.0, 2.0.0, 2.1.1
> Environment: OS:  Linux
> Spark  version 2.1.1
> JDBC:  Download the latest google BigQuery JDBC Driver from Google
>Reporter: Matthew Walton
>
> I'm trying to fetch back data in Spark using a JDBC connection to Google 
> BigQuery.  Unfortunately, when I try to query data that resides in an INTEGER 
> column I get the following error:  
> java.sql.SQLDataException: [Simba][JDBC](10140) Error converting value to 
> long.  
> Steps to reproduce:
> 1) On Google BigQuery console create a simple table with an INT column and 
> insert some data 
> 2) Copy the Google BigQuery JDBC driver to the machine where you will run 
> Spark Shell
> 3) Start Spark shell loading the GoogleBigQuery JDBC driver jar files
> ./spark-shell --jars 
> /home/ec2-user/jdbc/gbq/GoogleBigQueryJDBC42.jar,/home/ec2-user/jdbc/gbq/google-api-client-1.22.0.jar,/home/ec2-user/jdbc/gbq/google-api-services-bigquery-v2-rev320-1.22.0.jar,/home/ec2-user/jdbc/gbq/google-http-client-1.22.0.jar,/home/ec2-user/jdbc/gbq/google-http-client-jackson2-1.22.0.jar,/home/ec2-user/jdbc/gbq/google-oauth-client-1.22.0.jar,/home/ec2-user/jdbc/gbq/jackson-core-2.1.3.jar
> 4) In Spark shell load the data from Google BigQuery using the JDBC driver
> val gbq = spark.read.format("jdbc").options(Map("url" -> 
> "jdbc:bigquery://https://www.googleapis.com/bigquery/v2;ProjectId=your-project-name-here;OAuthType=0;OAuthPvtKeyPath=/usr/lib/spark/YourProjectPrivateKey.json;OAuthServiceAcctEmail=YourEmail@gmail.comAllowLargeResults=1;LargeResultDataset=_bqodbc_temp_tables;LargeResultTable=_matthew;Timeout=600","dbtable";
>  -> 
> "test.lu_test_integer")).option("driver","com.simba.googlebigquery.jdbc42.Driver").option("user","").option("password","").load()
> 5) In Spark shell try to display the data
> gbq.show()
> At this point you should see the error:
> scala> gbq.show()
> 17/06/22 19:34:57 WARN TaskSetManager: Lost task 0.0 in stage 3.0 (TID 6, 
> ip-172-31-37-165.ec2.internal, executor 3): java.sql.SQLDataException: 
> [Simba][JDBC](10140) Error converting value to long.
> at com.simba.exceptions.ExceptionConverter.toSQLException(Unknown 
> Source)
> at com.simba.utilities.conversion.TypeConverter.toLong(Unknown Source)
> at com.simba.jdbc.common.SForwardResultSet.getLong(Unknown Source)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$8.apply(JdbcUtils.scala:365)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$8.apply(JdbcUtils.scala:364)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:286)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:268)
> at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
> at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
> at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.sc

[jira] [Created] (SPARK-21215) Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve

2017-06-26 Thread Michael Kunkel (JIRA)
Michael Kunkel created SPARK-21215:
--

 Summary: Exception in thread "main" 
org.apache.spark.sql.AnalysisException: cannot resolve
 Key: SPARK-21215
 URL: https://issues.apache.org/jira/browse/SPARK-21215
 Project: Spark
  Issue Type: Bug
  Components: Java API, SQL
Affects Versions: 2.1.1
 Environment: macOSX
Reporter: Michael Kunkel


First Spark project.

I have a Java method that returns a Dataset<Row>. I want to convert this to a 
Dataset<StatusChangeDB>, where the object is named StatusChangeDB. I have created 
a POJO StatusChangeDB.java and coded it with all the query objects found in the 
MySQL table.
I then create an Encoder<StatusChangeDB> and convert the Dataset<Row> to a 
Dataset<StatusChangeDB>. However, when I try to .show() the values of the 
Dataset<StatusChangeDB> I receive the error
bq. 
bq. Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot 
resolve '`hvpinid_quad`' given input columns: [status_change_type, superLayer, 
loclayer, sector, locwire];
bq. at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
bq. at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:86)
bq. at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:83)
bq. at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:290)
bq. at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:290)
bq. at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
bq. at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:289)
bq. at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:287)
bq. at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:287)
bq. at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:307)
bq. at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
bq. at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:305)
bq. at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:287)
bq. at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:287)
bq. at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:287)
bq. at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4$$anonfun$apply$10.apply(TreeNode.scala:324)
bq. at 
scala.collection.MapLike$MappedValues$$anonfun$iterator$3.apply(MapLike.scala:246)
bq. at 
scala.collection.MapLike$MappedValues$$anonfun$iterator$3.apply(MapLike.scala:246)
bq. at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
bq. at scala.collection.Iterator$class.foreach(Iterator.scala:893)
bq. at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
bq. at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
bq. at scala.collection.IterableLike$$anon$1.foreach(IterableLike.scala:311)
bq. at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
bq. at 
scala.collection.mutable.MapBuilder.$plus$plus$eq(MapBuilder.scala:25)
bq. at 
scala.collection.TraversableViewLike$class.force(TraversableViewLike.scala:88)
bq. at scala.collection.IterableLike$$anon$1.force(IterableLike.scala:311)
bq. at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:332)
bq. at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
bq. at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:305)
bq. at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:287)
bq. at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:255)
bq. at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:255)
bq. at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:266)
bq. at 
org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:276)
bq. at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$6.apply(QueryPlan.scala:285)
bq. at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
bq. at 
org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:285)
bq. at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:255)
bq. at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:83)
bq. at 
org.apache.spark.s
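
This analysis error is typically what Spark raises when the encoder's schema asks 
for a field the incoming Dataset does not contain. A minimal Scala sketch of the 
same situation (the field list is taken from the error message; the DataFrame 
`df` and everything else here is illustrative, not the reporter's code):

{code:java}
// Hypothetical sketch: an encoder can only bind fields that exist as columns.
case class StatusChangeDB(status_change_type: String, superLayer: Int,
                          loclayer: Int, sector: Int, locwire: Int)

import spark.implicits._
// Works: every case-class field resolves against a selected column.
val ok = df.select("status_change_type", "superLayer", "loclayer", "sector", "locwire")
  .as[StatusChangeDB]

// Fails with "cannot resolve '`hvpinid_quad`'" if the class declares a field
// the query never selected. Either drop the field or select/alias it.
{code}
The same applies through the Java API with Encoders.bean(StatusChangeDB.class): the 
bean's property names must match the Dataset's column names.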

[jira] [Created] (SPARK-21214) Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve '

2017-06-26 Thread Michael Kunkel (JIRA)
Michael Kunkel created SPARK-21214:
--

 Summary: Exception in thread "main" 
org.apache.spark.sql.AnalysisException: cannot resolve '
 Key: SPARK-21214
 URL: https://issues.apache.org/jira/browse/SPARK-21214
 Project: Spark
  Issue Type: Bug
  Components: Java API, Spark Core, SQL
Affects Versions: 2.1.1
 Environment: macOSX
Reporter: Michael Kunkel


First Spark project.

I have a Java method that returns a Dataset<Row>. I want to convert this to a 
Dataset<StatusChangeDB>, where the object is named StatusChangeDB. I have created 
a POJO StatusChangeDB.java and coded it with all the query objects found in the 
MySQL table.
I then create an Encoder<StatusChangeDB> and convert the Dataset<Row> to a 
Dataset<StatusChangeDB>. However, when I try to .show() the values of the 
Dataset<StatusChangeDB> I receive the error
bq. 
bq. Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot 
resolve '`hvpinid_quad`' given input columns: [status_change_type, superLayer, 
loclayer, sector, locwire];
bq. at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
bq. at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:86)
bq. at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:83)
bq. at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:290)
bq. at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:290)
bq. at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
bq. at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:289)
bq. at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:287)
bq. at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:287)
bq. at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:307)
bq. at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
bq. at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:305)
bq. at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:287)
bq. at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:287)
bq. at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:287)
bq. at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4$$anonfun$apply$10.apply(TreeNode.scala:324)
bq. at 
scala.collection.MapLike$MappedValues$$anonfun$iterator$3.apply(MapLike.scala:246)
bq. at 
scala.collection.MapLike$MappedValues$$anonfun$iterator$3.apply(MapLike.scala:246)
bq. at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
bq. at scala.collection.Iterator$class.foreach(Iterator.scala:893)
bq. at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
bq. at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
bq. at scala.collection.IterableLike$$anon$1.foreach(IterableLike.scala:311)
bq. at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
bq. at 
scala.collection.mutable.MapBuilder.$plus$plus$eq(MapBuilder.scala:25)
bq. at 
scala.collection.TraversableViewLike$class.force(TraversableViewLike.scala:88)
bq. at scala.collection.IterableLike$$anon$1.force(IterableLike.scala:311)
bq. at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:332)
bq. at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
bq. at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:305)
bq. at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:287)
bq. at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:255)
bq. at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:255)
bq. at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:266)
bq. at 
org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:276)
bq. at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$6.apply(QueryPlan.scala:285)
bq. at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
bq. at 
org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:285)
bq. at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:255)
bq. at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:83)
bq. at 
org.

[jira] [Commented] (SPARK-21183) Unable to return Google BigQuery INTEGER data type into Spark via google BigQuery JDBC driver: java.sql.SQLDataException: [Simba][JDBC](10140) Error converting value t

2017-06-26 Thread Matthew Walton (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16063421#comment-16063421
 ] 

Matthew Walton commented on SPARK-21183:


Well, the SQuirreL client tool does work on the same query, so it seems that the 
Spark code is the culprit and is causing the Simba driver to report that error. I 
have another ticket open for Spark with data coming from Hive drivers, and it is 
similar in that it deals with how the integer data type is handled by Spark. In 
that case I have several drivers I can test to show that the issue does not seem 
to originate in the driver. Unfortunately, that one has not been assigned: 
[https://issues.apache.org/jira/browse/SPARK-21179]. I only mention that issue 
because it seems to show a behavior in Spark when it comes to handling INTEGER 
types with certain JDBC drivers. In each case, running the same SQL outside of 
Spark works fine.
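
One possible way to sidestep the conversion while this is investigated (a 
speculative sketch only: the column name int_col is made up, and whether the 
Simba driver accepts a subquery as dbtable is an assumption) is to cast on the 
BigQuery side so Spark never asks the driver for a long:

{code:java}
// Hypothetical workaround sketch: cast the BigQuery INTEGER to STRING in a
// dbtable subquery, then cast back in Spark if a numeric column is needed.
val gbq = spark.read.format("jdbc")
  .option("url", "jdbc:bigquery://https://www.googleapis.com/bigquery/v2;ProjectId=...")  // same connection string as in the description
  .option("driver", "com.simba.googlebigquery.jdbc42.Driver")
  .option("dbtable", "(SELECT CAST(int_col AS STRING) AS int_col FROM test.lu_test_integer) t")
  .option("user", "").option("password", "")
  .load()

gbq.withColumn("int_col", gbq("int_col").cast("long")).show()
{code}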

> Unable to return Google BigQuery INTEGER data type into Spark via google 
> BigQuery JDBC driver: java.sql.SQLDataException: [Simba][JDBC](10140) Error 
> converting value to long.
> --
>
> Key: SPARK-21183
> URL: https://issues.apache.org/jira/browse/SPARK-21183
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, SQL
>Affects Versions: 1.6.0, 2.0.0, 2.1.1
> Environment: OS:  Linux
> Spark  version 2.1.1
> JDBC:  Download the latest google BigQuery JDBC Driver from Google
>Reporter: Matthew Walton
>
> I'm trying to fetch back data in Spark using a JDBC connection to Google 
> BigQuery.  Unfortunately, when I try to query data that resides in an INTEGER 
> column I get the following error:  
> java.sql.SQLDataException: [Simba][JDBC](10140) Error converting value to 
> long.  
> Steps to reproduce:
> 1) On Google BigQuery console create a simple table with an INT column and 
> insert some data 
> 2) Copy the Google BigQuery JDBC driver to the machine where you will run 
> Spark Shell
> 3) Start Spark shell loading the GoogleBigQuery JDBC driver jar files
> ./spark-shell --jars 
> /home/ec2-user/jdbc/gbq/GoogleBigQueryJDBC42.jar,/home/ec2-user/jdbc/gbq/google-api-client-1.22.0.jar,/home/ec2-user/jdbc/gbq/google-api-services-bigquery-v2-rev320-1.22.0.jar,/home/ec2-user/jdbc/gbq/google-http-client-1.22.0.jar,/home/ec2-user/jdbc/gbq/google-http-client-jackson2-1.22.0.jar,/home/ec2-user/jdbc/gbq/google-oauth-client-1.22.0.jar,/home/ec2-user/jdbc/gbq/jackson-core-2.1.3.jar
> 4) In Spark shell load the data from Google BigQuery using the JDBC driver
> val gbq = spark.read.format("jdbc").options(Map("url" -> 
> "jdbc:bigquery://https://www.googleapis.com/bigquery/v2;ProjectId=your-project-name-here;OAuthType=0;OAuthPvtKeyPath=/usr/lib/spark/YourProjectPrivateKey.json;OAuthServiceAcctEmail=YourEmail@gmail.comAllowLargeResults=1;LargeResultDataset=_bqodbc_temp_tables;LargeResultTable=_matthew;Timeout=600","dbtable";
>  -> 
> "test.lu_test_integer")).option("driver","com.simba.googlebigquery.jdbc42.Driver").option("user","").option("password","").load()
> 5) In Spark shell try to display the data
> gbq.show()
> At this point you should see the error:
> scala> gbq.show()
> 17/06/22 19:34:57 WARN TaskSetManager: Lost task 0.0 in stage 3.0 (TID 6, 
> ip-172-31-37-165.ec2.internal, executor 3): java.sql.SQLDataException: 
> [Simba][JDBC](10140) Error converting value to long.
> at com.simba.exceptions.ExceptionConverter.toSQLException(Unknown 
> Source)
> at com.simba.utilities.conversion.TypeConverter.toLong(Unknown Source)
> at com.simba.jdbc.common.SForwardResultSet.getLong(Unknown Source)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$8.apply(JdbcUtils.scala:365)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$8.apply(JdbcUtils.scala:364)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:286)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:268)
> at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
> at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
> at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIt

[jira] [Commented] (SPARK-21199) Its not possible to impute Vector types

2017-06-26 Thread Franklyn Dsouza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16063416#comment-16063416
 ] 

Franklyn Dsouza commented on SPARK-21199:
-

For this particular scenario I have a table with two columns: one is a string 
`document_type` and the other is an array of tokens for the document. I want 
to do a TF-IDF.
The IDF needs to be done per `document_type`, so I do a pivot on `document_type` 
and then do the IDF on the TF vectors.

This pivoting introduces nulls for missing columns that need to be imputed. I 
can't impute array types either, and fixing it at the token generation step would 
involve a lot of left joins to align various data sources.

> Its not possible to impute Vector types
> ---
>
> Key: SPARK-21199
> URL: https://issues.apache.org/jira/browse/SPARK-21199
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0, 2.1.1
>Reporter: Franklyn Dsouza
>
> There are cases where nulls end up in vector columns in dataframes. Currently 
> there is no way to fill in these nulls because it's not possible to create a 
> literal vector column expression using lit().
> Also, the entire PySpark ML API will fail when it encounters nulls, so this 
> makes it hard to work with the data.
> I think that either vector support should be added to the imputer or vectors 
> should be supported in column expressions so they can be used in a coalesce.
> [~mlnick]



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21198) SparkSession catalog is terribly slow

2017-06-26 Thread Saif Addin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16063406#comment-16063406
 ] 

Saif Addin edited comment on SPARK-21198 at 6/26/17 5:03 PM:
-

Okay, I think there is something odd somewhere in between. It may be hard to 
tackle the issue, but I'll start slowly, line by line.

(*spark-submit*, local[8])
1. This line gets stuck forever and the program doesn't continue after waiting 
2 minutes (the Spark task is stuck in collect()):
{code:java}
import spark.implicits._
private val databases: Array[String] = REFERENCEDB +: (spark.catalog.listDatabases.filter(_ != blacklistdb).map(_.name).collect)
{code}
2. Changing that line to the following (note the filter after collect) takes 8 ms:

{code:java}
import spark.implicits._
private val databases: Array[String] = REFERENCEDB +: (spark.catalog.listDatabases.map(_.name).collect.filterNot(_ == blacklistdb))
{code}

3. The following line takes 2 ms instead:

{code:java}
private val databases: Array[String] = (REFERENCEDB +: spark.sql("show databases").collect.map(_.getString(0))).filterNot(_ == blacklistdb)
{code}


Here's the weirdest part of all: if I instead start *spark-shell* and run item 
number 1, it works (it takes 1 ms, which is even faster than the other lines in 
my *spark-submit*; the other lines are also faster, down to 1 ms as well)


was (Author: revolucion09):
Okay, I think there is something odd somewhere in between. It may be hard to 
tackle the issue, but I'll start slowly, line by line.

(*spark-submit*, local[8])
1. This line gets stuck forever and the program doesn't continue after waiting 
2 minutes (the Spark task is stuck in collect()):
{code:java}
import spark.implicits._
private val databases: Array[String] = REFERENCEDB +: (spark.catalog.listDatabases.filter(_ != blacklistdb).map(_.name).collect)
{code}
2. Changing that line to the following (note the filter after collect) takes 8 ms:

{code:java}
import spark.implicits._
private val databases: Array[String] = REFERENCEDB +: (spark.catalog.listDatabases.map(_.name).collect.filterNot(_ == blacklistdb))
{code}

3. The following line takes 2 ms instead:

{code:java}
private val databases: Array[String] = (REFERENCEDB +: spark.sql("show databases").collect.map(_.getString(0))).filterNot(_ == blacklistdb)
{code}


Here's the weirdest part of all: if I instead start *spark-shell* and run item 
number 1, it works (it takes 1 ms, which is even faster than in my program; the 
other lines are also faster, down to 1 ms as well)

> SparkSession catalog is terribly slow
> -
>
> Key: SPARK-21198
> URL: https://issues.apache.org/jira/browse/SPARK-21198
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Saif Addin
>
> We have a considerably large Hive metastore and a Spark program that goes 
> through Hive data availability.
> In Spark 1.x, we were using sqlContext.tableNames, sqlContext.sql() and 
> sqlContext.isCached() to go through Hive metastore information.
> Once migrated to Spark 2.x we switched over to SparkSession.catalog instead, but 
> it turns out that both listDatabases() and listTables() take between 5 and 20 
> minutes, depending on the database, to return results, using operations such as 
> the following one:
> spark.catalog.listTables(db).filter(_.isTemporary).map(_.name).collect
> which made the program unbearably slow to return a list of tables.
> I know we still have spark.sqlContext.tableNames as a workaround, but I assume 
> this is going to be deprecated soon?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21198) SparkSession catalog is terribly slow

2017-06-26 Thread Saif Addin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16063406#comment-16063406
 ] 

Saif Addin edited comment on SPARK-21198 at 6/26/17 5:02 PM:
-

Okay, I think there is something odd somewhere in between. It may be hard to 
tackle the issue, but I'll start slowly, line by line.

(*spark-submit*, local[8])
1. This line gets stuck forever and the program doesn't continue after waiting 
2 minutes (the Spark task is stuck in collect()):
{code:java}
import spark.implicits._
private val databases: Array[String] = REFERENCEDB +: (spark.catalog.listDatabases.filter(_ != blacklistdb).map(_.name).collect)
{code}
2. Changing that line to the following (note the filter after collect) takes 8 ms:

{code:java}
import spark.implicits._
private val databases: Array[String] = REFERENCEDB +: (spark.catalog.listDatabases.map(_.name).collect.filterNot(_ == blacklistdb))
{code}

3. The following line takes 2 ms instead:

{code:java}
private val databases: Array[String] = (REFERENCEDB +: spark.sql("show databases").collect.map(_.getString(0))).filterNot(_ == blacklistdb)
{code}


Here's the weirdest part of all: if I instead start *spark-shell* and run item 
number 1, it works (it takes 1 ms, which is even faster than in my program; the 
other lines are also faster, down to 1 ms as well)


was (Author: revolucion09):
Okay, I think there is something odd somewhere in between. It may be hard to 
tackle the issue, but I'll start slowly, line by line.

(*spark-submit*, local[8])
1. This line gets stuck forever and the program doesn't continue after waiting 
2 minutes (the Spark task is stuck in collect()):
{code:java}
import spark.implicits._
private val databases: Array[String] = REFERENCEDB +: (spark.catalog.listDatabases.filter(_ != blacklistdb).map(_.name).collect)
{code}
2. Changing that line to the following (note the filter after collect) takes 8 ms:
import spark.implicits._
private val databases: Array[String] = REFERENCEDB +: (spark.catalog.listDatabases.map(_.name).collect.filterNot(_ == blacklistdb))
3. The following line takes 2 ms instead:
private val databases: Array[String] = (REFERENCEDB +: spark.sql("show databases").collect.map(_.getString(0))).filterNot(_ == blacklistdb)

Here's the weirdest part of all: if I instead start *spark-shell* and run item 
number 1, it works (it takes 1 ms, which is even faster than in my program; the 
other lines are also faster, down to 1 ms as well)

> SparkSession catalog is terribly slow
> -
>
> Key: SPARK-21198
> URL: https://issues.apache.org/jira/browse/SPARK-21198
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Saif Addin
>
> We have a considerably large Hive metastore and a Spark program that goes 
> through Hive data availability.
> In Spark 1.x, we were using sqlContext.tableNames, sqlContext.sql() and 
> sqlContext.isCached() to go through Hive metastore information.
> Once migrated to Spark 2.x we switched over to SparkSession.catalog instead, but 
> it turns out that both listDatabases() and listTables() take between 5 and 20 
> minutes, depending on the database, to return results, using operations such as 
> the following one:
> spark.catalog.listTables(db).filter(_.isTemporary).map(_.name).collect
> which made the program unbearably slow to return a list of tables.
> I know we still have spark.sqlContext.tableNames as a workaround, but I assume 
> this is going to be deprecated soon?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21198) SparkSession catalog is terribly slow

2017-06-26 Thread Saif Addin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16063406#comment-16063406
 ] 

Saif Addin commented on SPARK-21198:


Okay, I think there is something odd somewhere in between. It may be hard to 
tackle the issue, but I'll start slowly, line by line.

(spark-submit, local[8])
1. This line gets stuck forever and the program doesn't continue after waiting 
2 minutes (the Spark task is stuck in collect()):
{code:java}
import spark.implicits._
private val databases: Array[String] = REFERENCEDB +: (spark.catalog.listDatabases.filter(_ != blacklistdb).map(_.name).collect)
{code}
2. Changing that line to the following (note the filter after collect) takes 8 ms:
import spark.implicits._
private val databases: Array[String] = REFERENCEDB +: (spark.catalog.listDatabases.map(_.name).collect.filterNot(_ == blacklistdb))
3. The following line takes 2 ms instead:
private val databases: Array[String] = (REFERENCEDB +: spark.sql("show databases").collect.map(_.getString(0))).filterNot(_ == blacklistdb)

Here's the weirdest part of all: if I instead start spark-shell and run item 
number 1, it works (it takes 1 ms, which is even faster than in my program; the 
other lines are also faster, down to 1 ms as well)

> SparkSession catalog is terribly slow
> -
>
> Key: SPARK-21198
> URL: https://issues.apache.org/jira/browse/SPARK-21198
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Saif Addin
>
> We have a considerably large Hive metastore and a Spark program that goes 
> through Hive data availability.
> In Spark 1.x, we were using sqlContext.tableNames, sqlContext.sql() and 
> sqlContext.isCached() to go through Hive metastore information.
> Once migrated to Spark 2.x we switched over to SparkSession.catalog instead, but 
> it turns out that both listDatabases() and listTables() take between 5 and 20 
> minutes, depending on the database, to return results, using operations such as 
> the following one:
> spark.catalog.listTables(db).filter(_.isTemporary).map(_.name).collect
> which made the program unbearably slow to return a list of tables.
> I know we still have spark.sqlContext.tableNames as a workaround, but I assume 
> this is going to be deprecated soon?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21198) SparkSession catalog is terribly slow

2017-06-26 Thread Saif Addin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16063406#comment-16063406
 ] 

Saif Addin edited comment on SPARK-21198 at 6/26/17 5:01 PM:
-

Okay, I think there is something odd somewhere in between. It may be hard to 
tackle the issue, but I'll start slowly, line by line.

(*spark-submit*, local[8])
1. This line gets stuck forever and the program doesn't continue after waiting 
2 minutes (the Spark task is stuck in collect()):
{code:java}
import spark.implicits._
private val databases: Array[String] = REFERENCEDB +: (spark.catalog.listDatabases.filter(_ != blacklistdb).map(_.name).collect)
{code}
2. Changing that line to the following (note the filter after collect) takes 8 ms:
import spark.implicits._
private val databases: Array[String] = REFERENCEDB +: (spark.catalog.listDatabases.map(_.name).collect.filterNot(_ == blacklistdb))
3. The following line takes 2 ms instead:
private val databases: Array[String] = (REFERENCEDB +: spark.sql("show databases").collect.map(_.getString(0))).filterNot(_ == blacklistdb)

Here's the weirdest part of all: if I instead start *spark-shell* and run item 
number 1, it works (it takes 1 ms, which is even faster than in my program; the 
other lines are also faster, down to 1 ms as well)


was (Author: revolucion09):
Okay, I think there is something odd somewhere in between. It may be hard to 
tackle the issue, but I'll start slowly, line by line.

(spark-submit, local[8])
1. This line gets stuck forever and the program doesn't continue after waiting 
2 minutes (the Spark task is stuck in collect()):
{code:java}
import spark.implicits._
private val databases: Array[String] = REFERENCEDB +: (spark.catalog.listDatabases.filter(_ != blacklistdb).map(_.name).collect)
{code}
2. Changing that line to the following (note the filter after collect) takes 8 ms:
import spark.implicits._
private val databases: Array[String] = REFERENCEDB +: (spark.catalog.listDatabases.map(_.name).collect.filterNot(_ == blacklistdb))
3. The following line takes 2 ms instead:
private val databases: Array[String] = (REFERENCEDB +: spark.sql("show databases").collect.map(_.getString(0))).filterNot(_ == blacklistdb)

Here's the weirdest part of all: if I instead start spark-shell and run item 
number 1, it works (it takes 1 ms, which is even faster than in my program; the 
other lines are also faster, down to 1 ms as well)

> SparkSession catalog is terribly slow
> -
>
> Key: SPARK-21198
> URL: https://issues.apache.org/jira/browse/SPARK-21198
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Saif Addin
>
> We have a considerably large Hive metastore and a Spark program that goes 
> through Hive data availability.
> In Spark 1.x, we were using sqlContext.tableNames, sqlContext.sql() and 
> sqlContext.isCached() to go through Hive metastore information.
> Once migrated to Spark 2.x we switched over to SparkSession.catalog instead, but 
> it turns out that both listDatabases() and listTables() take between 5 and 20 
> minutes, depending on the database, to return results, using operations such as 
> the following one:
> spark.catalog.listTables(db).filter(_.isTemporary).map(_.name).collect
> which made the program unbearably slow to return a list of tables.
> I know we still have spark.sqlContext.tableNames as a workaround, but I assume 
> this is going to be deprecated soon?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21183) Unable to return Google BigQuery INTEGER data type into Spark via google BigQuery JDBC driver: java.sql.SQLDataException: [Simba][JDBC](10140) Error converting value t

2017-06-26 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16063387#comment-16063387
 ] 

Sean Owen commented on SPARK-21183:
---

This looks like an error from Simba, not Spark?

> Unable to return Google BigQuery INTEGER data type into Spark via google 
> BigQuery JDBC driver: java.sql.SQLDataException: [Simba][JDBC](10140) Error 
> converting value to long.
> --
>
> Key: SPARK-21183
> URL: https://issues.apache.org/jira/browse/SPARK-21183
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, SQL
>Affects Versions: 1.6.0, 2.0.0, 2.1.1
> Environment: OS:  Linux
> Spark  version 2.1.1
> JDBC:  Download the latest google BigQuery JDBC Driver from Google
>Reporter: Matthew Walton
>
> I'm trying to fetch back data in Spark using a JDBC connection to Google 
> BigQuery.  Unfortunately, when I try to query data that resides in an INTEGER 
> column I get the following error:  
> java.sql.SQLDataException: [Simba][JDBC](10140) Error converting value to 
> long.  
> Steps to reproduce:
> 1) On Google BigQuery console create a simple table with an INT column and 
> insert some data 
> 2) Copy the Google BigQuery JDBC driver to the machine where you will run 
> Spark Shell
> 3) Start Spark shell loading the GoogleBigQuery JDBC driver jar files
> ./spark-shell --jars 
> /home/ec2-user/jdbc/gbq/GoogleBigQueryJDBC42.jar,/home/ec2-user/jdbc/gbq/google-api-client-1.22.0.jar,/home/ec2-user/jdbc/gbq/google-api-services-bigquery-v2-rev320-1.22.0.jar,/home/ec2-user/jdbc/gbq/google-http-client-1.22.0.jar,/home/ec2-user/jdbc/gbq/google-http-client-jackson2-1.22.0.jar,/home/ec2-user/jdbc/gbq/google-oauth-client-1.22.0.jar,/home/ec2-user/jdbc/gbq/jackson-core-2.1.3.jar
> 4) In Spark shell load the data from Google BigQuery using the JDBC driver
> val gbq = spark.read.format("jdbc").options(Map("url" -> 
> "jdbc:bigquery://https://www.googleapis.com/bigquery/v2;ProjectId=your-project-name-here;OAuthType=0;OAuthPvtKeyPath=/usr/lib/spark/YourProjectPrivateKey.json;OAuthServiceAcctEmail=YourEmail@gmail.comAllowLargeResults=1;LargeResultDataset=_bqodbc_temp_tables;LargeResultTable=_matthew;Timeout=600","dbtable";
>  -> 
> "test.lu_test_integer")).option("driver","com.simba.googlebigquery.jdbc42.Driver").option("user","").option("password","").load()
> 5) In Spark shell try to display the data
> gbq.show()
> At this point you should see the error:
> scala> gbq.show()
> 17/06/22 19:34:57 WARN TaskSetManager: Lost task 0.0 in stage 3.0 (TID 6, 
> ip-172-31-37-165.ec2.internal, executor 3): java.sql.SQLDataException: 
> [Simba][JDBC](10140) Error converting value to long.
> at com.simba.exceptions.ExceptionConverter.toSQLException(Unknown 
> Source)
> at com.simba.utilities.conversion.TypeConverter.toLong(Unknown Source)
> at com.simba.jdbc.common.SForwardResultSet.getLong(Unknown Source)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$8.apply(JdbcUtils.scala:365)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$8.apply(JdbcUtils.scala:364)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:286)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:268)
> at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
> at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
> at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.sca
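
One possible workaround sketch, not verified against BigQuery: push the cast 
down to BigQuery through a subquery in the dbtable option, so the Simba driver 
never has to convert the INTEGER value to a long on the client. The dataset, 
column, and abbreviated URL below are placeholders, and the CAST syntax depends 
on which BigQuery SQL dialect the driver is configured to use:

{code:scala}
// Hedged workaround sketch; names and the shortened JDBC URL are placeholders.
val gbqUrl = "jdbc:bigquery://https://www.googleapis.com/bigquery/v2;ProjectId=<project>;OAuthType=0;..."

val gbq = spark.read
  .format("jdbc")
  .option("url", gbqUrl)
  .option("driver", "com.simba.googlebigquery.jdbc42.Driver")
  // Wrap the table in a subquery and cast the problematic column to STRING,
  // so the value crosses JDBC as text instead of a long.
  .option("dbtable", "(SELECT CAST(int_col AS STRING) AS int_col_str FROM test.lu_test_integer) t")
  .load()

gbq.show()
{code}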

[jira] [Commented] (SPARK-21212) Can't use Count(*) with Order Clause

2017-06-26 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16063376#comment-16063376
 ] 

Sean Owen commented on SPARK-21212:
---

You don't define 'value' anywhere, as it says.

> Can't use Count(*) with Order Clause
> 
>
> Key: SPARK-21212
> URL: https://issues.apache.org/jira/browse/SPARK-21212
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
> Environment: Windows; external data provided through data source api
>Reporter: Shawn Lavelle
>Priority: Minor
>
> I don't think this should fail the query:
> {code}jdbc:hive2://user:port/> select count(*) from table where value between 
> 1498240079000 and cast(now() as bigint)*1000 order by value;
> {code}
> {code}
> Error: org.apache.spark.sql.AnalysisException: cannot resolve '`value`' given 
> input columns: [count(1)]; line 1 pos 113;
> 'Sort ['value ASC NULLS FIRST], true
> +- Aggregate [count(1) AS count(1)#718L]
>+- Filter ((value#413L >= 1498240079000) && (value#413L <= 
> (cast(current_timestamp() as bigint) * cast(1000 as bigint
>   +- SubqueryAlias table
>  +- 
> Relation[field1#411L,field2#412,value#413L,field3#414,field4#415,field5#416,field6#417,field7#418,field8#419,field9#420]
>  com.redacted@16004579 (state=,code=0)
> {code}
> Arguably, the optimizer could ignore the "order by" clause, but I leave that 
> to more informed minds than my own.
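
For context, the analyzer rejects the query because after the global aggregate 
only count(1) is in the output, so 'value' no longer resolves. Two query shapes 
that do resolve, sketched against a hypothetical table "events" with a BIGINT 
column "value" (run from spark-shell):

{code:scala}
// A global COUNT(*) returns a single row, so the ORDER BY adds nothing and
// can simply be dropped:
spark.sql("""
  SELECT count(*) AS cnt
  FROM events
  WHERE value BETWEEN 1498240079000 AND cast(now() AS bigint) * 1000
""").show()

// If the intent was a per-value count ordered by value, the column has to
// survive the aggregate via GROUP BY:
spark.sql("""
  SELECT value, count(*) AS cnt
  FROM events
  WHERE value BETWEEN 1498240079000 AND cast(now() AS bigint) * 1000
  GROUP BY value
  ORDER BY value
""").show()
{code}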



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17129) Support statistics collection and cardinality estimation for partitioned tables

2017-06-26 Thread Maria (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16063342#comment-16063342
 ] 

Maria commented on SPARK-17129:
---

[~ZenWzh], I opened SPARK-21213 and submitted PR 
https://github.com/apache/spark/pull/18421 . Would you help review it?

> Support statistics collection and cardinality estimation for partitioned 
> tables
> ---
>
> Key: SPARK-17129
> URL: https://issues.apache.org/jira/browse/SPARK-17129
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Zhenhua Wang
>
> I am upgrading this JIRA because many tasks have been identified here that 
> need to be done.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13669) Job will always fail in the external shuffle service unavailable situation

2017-06-26 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-13669.
---
   Resolution: Fixed
Fix Version/s: 2.3.0

> Job will always fail in the external shuffle service unavailable situation
> --
>
> Key: SPARK-13669
> URL: https://issues.apache.org/jira/browse/SPARK-13669
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, YARN
>Reporter: Saisai Shao
>Assignee: Saisai Shao
> Fix For: 2.3.0
>
>
> Currently we are running into an issue with YARN work-preserving restart 
> enabled plus the external shuffle service.
> In the work-preserving scenario, a NodeManager failure does not lead to the 
> exit of executors, so executors can still accept and run tasks. The problem 
> is that when the NM fails, the external shuffle service is inaccessible, so 
> reduce tasks keep complaining about “Fetch failure”, and the failure of the 
> reduce stage makes the parent (map) stage rerun. The tricky part is that the 
> Spark scheduler is not aware of the unavailability of the external shuffle 
> service and will reschedule the map tasks on the executors where the NM 
> failed, so the reduce stage again fails with “Fetch failure”; after 4 
> retries, the job fails.
> So the main problem is that we should avoid assigning tasks to those bad 
> executors (where the shuffle service is unavailable). Spark's current 
> blacklist mechanism can blacklist executors/nodes based on failed tasks, but 
> it does not handle this specific fetch-failure scenario. So here we propose 
> to improve the application blacklist mechanism to handle fetch failures 
> (especially when the external shuffle service is unavailable) by blacklisting 
> the executors/nodes where shuffle fetch is unavailable.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13669) Job will always fail in the external shuffle service unavailable situation

2017-06-26 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves reassigned SPARK-13669:
-

Assignee: Saisai Shao

> Job will always fail in the external shuffle service unavailable situation
> --
>
> Key: SPARK-13669
> URL: https://issues.apache.org/jira/browse/SPARK-13669
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, YARN
>Reporter: Saisai Shao
>Assignee: Saisai Shao
> Fix For: 2.3.0
>
>
> Currently we are running into an issue with YARN work-preserving restart 
> enabled plus the external shuffle service.
> In the work-preserving scenario, a NodeManager failure does not lead to the 
> exit of executors, so executors can still accept and run tasks. The problem 
> is that when the NM fails, the external shuffle service is inaccessible, so 
> reduce tasks keep complaining about “Fetch failure”, and the failure of the 
> reduce stage makes the parent (map) stage rerun. The tricky part is that the 
> Spark scheduler is not aware of the unavailability of the external shuffle 
> service and will reschedule the map tasks on the executors where the NM 
> failed, so the reduce stage again fails with “Fetch failure”; after 4 
> retries, the job fails.
> So the main problem is that we should avoid assigning tasks to those bad 
> executors (where the shuffle service is unavailable). Spark's current 
> blacklist mechanism can blacklist executors/nodes based on failed tasks, but 
> it does not handle this specific fetch-failure scenario. So here we propose 
> to improve the application blacklist mechanism to handle fetch failures 
> (especially when the external shuffle service is unavailable) by blacklisting 
> the executors/nodes where shuffle fetch is unavailable.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20898) spark.blacklist.killBlacklistedExecutors doesn't work in YARN

2017-06-26 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves reassigned SPARK-20898:
-

Assignee: Saisai Shao

> spark.blacklist.killBlacklistedExecutors doesn't work in YARN
> -
>
> Key: SPARK-20898
> URL: https://issues.apache.org/jira/browse/SPARK-20898
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Thomas Graves
>Assignee: Saisai Shao
> Fix For: 2.3.0
>
>
> I was trying out the new spark.blacklist.killBlacklistedExecutors on YARN but 
> it doesn't appear to work.  Every time I get:
> 17/05/26 16:28:12 WARN BlacklistTracker: Not attempting to kill blacklisted 
> executor id 4 since allocation client is not defined
> even though dynamic allocation is on.  Taking a quick look, I think the way 
> it creates the BlacklistTracker and passes the allocation client is wrong: 
> the scheduler backend is not set yet, so the allocation client is never 
> passed to the BlacklistTracker correctly.  Thus it will never kill.
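
For reference, a hedged sketch of the settings that should exercise the kill 
path on YARN; whether the kill actually happens depends on the allocation 
client being wired into the BlacklistTracker, which is exactly what this issue 
reports as broken:

{code:scala}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.blacklist.enabled", "true")
  // Ask Spark to actively kill executors once they are blacklisted.
  .set("spark.blacklist.killBlacklistedExecutors", "true")
  // Dynamic allocation supplies the ExecutorAllocationClient used for the kill;
  // the external shuffle service is required when dynamic allocation is on.
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")
{code}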



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20898) spark.blacklist.killBlacklistedExecutors doesn't work in YARN

2017-06-26 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-20898.
---
   Resolution: Fixed
Fix Version/s: 2.3.0

> spark.blacklist.killBlacklistedExecutors doesn't work in YARN
> -
>
> Key: SPARK-20898
> URL: https://issues.apache.org/jira/browse/SPARK-20898
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Thomas Graves
>Assignee: Saisai Shao
> Fix For: 2.3.0
>
>
> I was trying out the new spark.blacklist.killBlacklistedExecutors on YARN but 
> it doesn't appear to work.  Every time I get:
> 17/05/26 16:28:12 WARN BlacklistTracker: Not attempting to kill blacklisted 
> executor id 4 since allocation client is not defined
> even though dynamic allocation is on.  Taking a quick look, I think the way 
> it creates the BlacklistTracker and passes the allocation client is wrong: 
> the scheduler backend is not set yet, so the allocation client is never 
> passed to the BlacklistTracker correctly.  Thus it will never kill.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21213) Support collecting partition-level statistics: rowCount and sizeInBytes

2017-06-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16063338#comment-16063338
 ] 

Apache Spark commented on SPARK-21213:
--

User 'mbasmanova' has created a pull request for this issue:
https://github.com/apache/spark/pull/18421

> Support collecting partition-level statistics: rowCount and sizeInBytes
> ---
>
> Key: SPARK-21213
> URL: https://issues.apache.org/jira/browse/SPARK-21213
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Maria
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Support the ANALYZE TABLE table PARTITION (key=value,...) COMPUTE STATISTICS 
> [NOSCAN] SQL command to compute and store in the Hive Metastore the number of 
> rows and total size in bytes for a single partition.
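
A short sketch of the proposed command shape, with hypothetical table and 
partition names; this follows the description above rather than a verified 
implementation:

{code:scala}
// Full scan: compute row count and size in bytes for one partition.
spark.sql("ANALYZE TABLE sales PARTITION (dt = '2017-06-26') COMPUTE STATISTICS")

// NOSCAN variant: gather only what can be computed without reading the data.
spark.sql("ANALYZE TABLE sales PARTITION (dt = '2017-06-26') COMPUTE STATISTICS NOSCAN")
{code}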



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21213) Support collecting partition-level statistics: rowCount and sizeInBytes

2017-06-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21213:


Assignee: (was: Apache Spark)

> Support collecting partition-level statistics: rowCount and sizeInBytes
> ---
>
> Key: SPARK-21213
> URL: https://issues.apache.org/jira/browse/SPARK-21213
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Maria
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Support the ANALYZE TABLE table PARTITION (key=value,...) COMPUTE STATISTICS 
> [NOSCAN] SQL command to compute and store in the Hive Metastore the number of 
> rows and total size in bytes for a single partition.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21213) Support collecting partition-level statistics: rowCount and sizeInBytes

2017-06-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21213:


Assignee: Apache Spark

> Support collecting partition-level statistics: rowCount and sizeInBytes
> ---
>
> Key: SPARK-21213
> URL: https://issues.apache.org/jira/browse/SPARK-21213
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Maria
>Assignee: Apache Spark
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Support the ANALYZE TABLE table PARTITION (key=value,...) COMPUTE STATISTICS 
> [NOSCAN] SQL command to compute and store in the Hive Metastore the number of 
> rows and total size in bytes for a single partition.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21213) Support collecting partition-level statistics: rowCount and sizeInBytes

2017-06-26 Thread Maria (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maria updated SPARK-21213:
--
Summary: Support collecting partition-level statistics: rowCount and 
sizeInBytes  (was: Support collecting partition-level statistics)

> Support collecting partition-level statistics: rowCount and sizeInBytes
> ---
>
> Key: SPARK-21213
> URL: https://issues.apache.org/jira/browse/SPARK-21213
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Maria
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Support the ANALYZE TABLE table PARTITION (key=value,...) COMPUTE STATISTICS 
> [NOSCAN] SQL command to compute and store in the Hive Metastore the number of 
> rows and total size in bytes for a single partition.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21063) Spark return an empty result from remote hadoop cluster

2017-06-26 Thread Peter Bykov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16063289#comment-16063289
 ] 

Peter Bykov commented on SPARK-21063:
-

[~q79969786] I tried this solution, but same result (empty result set).
[~srowen] Can you share a specific configuration from your cluster?

> Spark return an empty result from remote hadoop cluster
> ---
>
> Key: SPARK-21063
> URL: https://issues.apache.org/jira/browse/SPARK-21063
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.1.0, 2.1.1
>Reporter: Peter Bykov
>
> Spark returns an empty result when querying a remote Hadoop cluster.
> All firewall settings have been removed.
> Querying over JDBC works properly using the hive-jdbc driver, version 1.1.1.
> Code snippet is:
> {code:java}
> val spark = SparkSession.builder
> .appName("RemoteSparkTest")
> .master("local")
> .getOrCreate()
> val df = spark.read
>   .option("url", "jdbc:hive2://remote.hive.local:1/default")
>   .option("user", "user")
>   .option("password", "pass")
>   .option("dbtable", "test_table")
>   .option("driver", "org.apache.hive.jdbc.HiveDriver")
>   .format("jdbc")
>   .load()
>  
> df.show()
> {code}
> Result:
> {noformat}
> +---+
> |test_table.test_col|
> +---+
> +---+
> {noformat}
> All manipulations like: 
> {code:java}
> df.select("*").show()
> {code}
> return an empty result too.
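
Not a fix, but a hedged way to narrow the problem down: query the same table 
over plain JDBC with the Hive driver. If rows come back here but not through 
the DataFrame, the issue lies in how the Spark JDBC source interacts with the 
Hive driver rather than in connectivity. Connection details below are 
placeholders mirroring the snippet above (10000 is the default HiveServer2 
port, assumed here):

{code:scala}
import java.sql.DriverManager

// Load the Hive JDBC driver and open a connection with placeholder credentials.
Class.forName("org.apache.hive.jdbc.HiveDriver")
val conn = DriverManager.getConnection(
  "jdbc:hive2://remote.hive.local:10000/default", "user", "pass")
try {
  val rs = conn.createStatement().executeQuery("SELECT * FROM test_table LIMIT 5")
  while (rs.next()) {
    println(rs.getString(1)) // print the first column; adjust for the real schema
  }
} finally {
  conn.close()
}
{code}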



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21213) Support collecting partition-level statistics

2017-06-26 Thread Maria (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maria updated SPARK-21213:
--
Issue Type: Sub-task  (was: New Feature)
Parent: SPARK-17129

> Support collecting partition-level statistics
> -
>
> Key: SPARK-21213
> URL: https://issues.apache.org/jira/browse/SPARK-21213
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Maria
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Support the ANALYZE TABLE table PARTITION (key=value,...) COMPUTE STATISTICS 
> [NOSCAN] SQL command to compute and store in the Hive Metastore the number of 
> rows and total size in bytes for a single partition.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21213) Support collecting partition-level statistics

2017-06-26 Thread Maria (JIRA)
Maria created SPARK-21213:
-

 Summary: Support collecting partition-level statistics
 Key: SPARK-21213
 URL: https://issues.apache.org/jira/browse/SPARK-21213
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 2.1.1
Reporter: Maria


Support the ANALYZE TABLE table PARTITION (key=value,...) COMPUTE STATISTICS 
[NOSCAN] SQL command to compute and store in the Hive Metastore the number of 
rows and total size in bytes for a single partition.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


