[jira] [Updated] (SPARK-20965) Support PREPARE/EXECUTE/DECLARE/FETCH statements
[ https://issues.apache.org/jira/browse/SPARK-20965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-20965: - Summary: Support PREPARE/EXECUTE/DECLARE/FETCH statements (was: Support PREPARE and EXECUTE statements) > Support PREPARE/EXECUTE/DECLARE/FETCH statements > > > Key: SPARK-20965 > URL: https://issues.apache.org/jira/browse/SPARK-20965 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.1 >Reporter: Takeshi Yamamuro >Priority: Minor > > https://github.com/apache/spark/compare/master...maropu:SPARK-20965#diff-06e7b9d9e1afd7aca1a3b2cd18561953R63 -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20965) Support PREPARE/EXECUTE/DECLARE/FETCH statements
[ https://issues.apache.org/jira/browse/SPARK-20965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-20965: - Description: It might help to implement PREPARE/EXECUTE/DECLARE/FETCH statements by referring to the ANSI/SQL standard. An example query in PostgreSQL (MySQL also supports these statements) is as follows:
{code}
-- Prepared statement
PREPARE pstmt(int) AS SELECT * FROM t WHERE id = $1;
EXECUTE pstmt(1);
DEALLOCATE pstmt;

-- Cursor
DECLARE cursor CURSOR FOR SELECT * FROM t;
FETCH 1 FROM cursor;
CLOSE cursor;
{code}
One idea for implementing these statements is to put registry implementations in `SessionState` to hold prepared statements and open cursors; prototype: https://github.com/apache/spark/compare/master...maropu:SPARK-20965#diff-06e7b9d9e1afd7aca1a3b2cd18561953R63 was:https://github.com/apache/spark/compare/master...maropu:SPARK-20965#diff-06e7b9d9e1afd7aca1a3b2cd18561953R63 > Support PREPARE/EXECUTE/DECLARE/FETCH statements > > > Key: SPARK-20965 > URL: https://issues.apache.org/jira/browse/SPARK-20965 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.1 >Reporter: Takeshi Yamamuro >Priority: Minor > > It might help to implement PREPARE/EXECUTE/DECLARE/FETCH statements by > referring to the ANSI/SQL standard.
> An example query in PostgreSQL (MySQL also supports these statements) is as > follows:
> {code}
> -- Prepared statement
> PREPARE pstmt(int) AS SELECT * FROM t WHERE id = $1;
> EXECUTE pstmt(1);
> DEALLOCATE pstmt;
> -- Cursor
> DECLARE cursor CURSOR FOR SELECT * FROM t;
> FETCH 1 FROM cursor;
> CLOSE cursor;
> {code}
> One idea for implementing these statements is to put registry implementations > in `SessionState` to hold prepared statements and open cursors; prototype: > https://github.com/apache/spark/compare/master...maropu:SPARK-20965#diff-06e7b9d9e1afd7aca1a3b2cd18561953R63
[jira] [Updated] (SPARK-20965) Support PREPARE and EXECUTE statements
[ https://issues.apache.org/jira/browse/SPARK-20965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-20965: - Description: https://github.com/apache/spark/compare/master...maropu:SPARK-20965#diff-06e7b9d9e1afd7aca1a3b2cd18561953R63 (was: Parameterized queries might help some users, so we might support PREPARE and EXECUTE statements by referring to the ANSI/SQL standard (e.g., it is useful for users who frequently run the same queries)
{code}
PREPARE sqlstmt (int) AS SELECT * FROM t WHERE id = $1;
EXECUTE sqlstmt(1);
{code}
One implementation reference: https://www.postgresql.org/docs/current/static/sql-prepare.html) > Support PREPARE and EXECUTE statements > -- > > Key: SPARK-20965 > URL: https://issues.apache.org/jira/browse/SPARK-20965 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.1 >Reporter: Takeshi Yamamuro >Priority: Minor > > https://github.com/apache/spark/compare/master...maropu:SPARK-20965#diff-06e7b9d9e1afd7aca1a3b2cd18561953R63
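The session-scoped registry idea sketched in the description above can be illustrated outside of Spark. Below is a minimal, hypothetical Python sketch of a registry that holds prepared statements and substitutes positional `$n` parameters; the class and method names are illustrative only and do not correspond to any Spark or PostgreSQL API.

```python
# Hypothetical sketch of a session-scoped prepared-statement registry.
# Names (PreparedStatementRegistry, prepare, execute, deallocate) are
# illustrative and not taken from Spark's SessionState.

class PreparedStatementRegistry:
    def __init__(self):
        self._statements = {}

    def prepare(self, name, sql_template):
        # PREPARE pstmt(int) AS SELECT * FROM t WHERE id = $1;
        self._statements[name] = sql_template

    def execute(self, name, *params):
        # EXECUTE pstmt(1);  -> substitute $1, $2, ... positionally
        sql = self._statements[name]
        for i, p in enumerate(params, start=1):
            sql = sql.replace(f"${i}", repr(p))
        return sql  # a real engine would now parse, plan, and run this

    def deallocate(self, name):
        # DEALLOCATE pstmt;
        del self._statements[name]


registry = PreparedStatementRegistry()
registry.prepare("pstmt", "SELECT * FROM t WHERE id = $1")
print(registry.execute("pstmt", 1))  # SELECT * FROM t WHERE id = 1
registry.deallocate("pstmt")
```

A cursor registry could follow the same pattern, mapping cursor names to open result iterators instead of SQL templates.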
[jira] [Updated] (SPARK-21221) CrossValidator and TrainValidationSplit Persist Nested Estimators
[ https://issues.apache.org/jira/browse/SPARK-21221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-21221: -- Shepherd: Joseph K. Bradley > CrossValidator and TrainValidationSplit Persist Nested Estimators > - > > Key: SPARK-21221 > URL: https://issues.apache.org/jira/browse/SPARK-21221 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.1.1 >Reporter: Ajay Saini > > Currently, the saving of parameters done in ValidatorParams.scala assumes > that all parameters in EstimatorParameterMaps are JSON serializable. Such an > assumption causes CrossValidator and TrainValidationSplit persistence to fail > when the internal estimator of these meta-algorithms is not JSON > serializable. One example is OneVsRest, which has a classifier (estimator) as > a parameter. > The changes would involve removing the assumption in validator params that > all the estimator params are JSON serializable. This could mean saving > parameters that are not JSON serializable separately at a specified path.
[jira] [Commented] (SPARK-21222) Move elimination of Distinct clause from analyzer to optimizer
[ https://issues.apache.org/jira/browse/SPARK-21222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16064229#comment-16064229 ] Apache Spark commented on SPARK-21222: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/18429 > Move elimination of Distinct clause from analyzer to optimizer > -- > > Key: SPARK-21222 > URL: https://issues.apache.org/jira/browse/SPARK-21222 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Gengliang Wang >Priority: Minor > > A Distinct clause after a MAX/MIN clause can be eliminated: > "SELECT MAX(DISTINCT a) FROM src" > is equivalent to > "SELECT MAX(a) FROM src" > However, this optimization is implemented in the analyzer. It should be in the > optimizer.
[jira] [Assigned] (SPARK-21222) Move elimination of Distinct clause from analyzer to optimizer
[ https://issues.apache.org/jira/browse/SPARK-21222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21222: Assignee: Apache Spark > Move elimination of Distinct clause from analyzer to optimizer > -- > > Key: SPARK-21222 > URL: https://issues.apache.org/jira/browse/SPARK-21222 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Gengliang Wang >Assignee: Apache Spark >Priority: Minor > > A Distinct clause after a MAX/MIN clause can be eliminated: > "SELECT MAX(DISTINCT a) FROM src" > is equivalent to > "SELECT MAX(a) FROM src" > However, this optimization is implemented in the analyzer. It should be in the > optimizer.
[jira] [Assigned] (SPARK-21222) Move elimination of Distinct clause from analyzer to optimizer
[ https://issues.apache.org/jira/browse/SPARK-21222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21222: Assignee: (was: Apache Spark) > Move elimination of Distinct clause from analyzer to optimizer > -- > > Key: SPARK-21222 > URL: https://issues.apache.org/jira/browse/SPARK-21222 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Gengliang Wang >Priority: Minor > > A Distinct clause after a MAX/MIN clause can be eliminated: > "SELECT MAX(DISTINCT a) FROM src" > is equivalent to > "SELECT MAX(a) FROM src" > However, this optimization is implemented in the analyzer. It should be in the > optimizer.
[jira] [Created] (SPARK-21222) Move elimination of Distinct clause from analyzer to optimizer
Gengliang Wang created SPARK-21222: -- Summary: Move elimination of Distinct clause from analyzer to optimizer Key: SPARK-21222 URL: https://issues.apache.org/jira/browse/SPARK-21222 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.2.0 Reporter: Gengliang Wang Priority: Minor A Distinct clause after a MAX/MIN clause can be eliminated: "SELECT MAX(DISTINCT a) FROM src" is equivalent to "SELECT MAX(a) FROM src". However, this optimization is implemented in the analyzer. It should be in the optimizer.
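The rewrite described in the issue above is sound because MAX and MIN ignore duplicates: the extremum of a multiset equals the extremum of its distinct values, so the DISTINCT qualifier can be dropped. A small Python check of that equivalence (plain lists standing in for a column; nothing Spark-specific):

```python
# MAX(DISTINCT a) == MAX(a) and MIN(DISTINCT a) == MIN(a):
# duplicates never change a max or min, which is why an optimizer
# rule can safely eliminate the DISTINCT qualifier.
column = [3, 1, 3, 7, 7, 2]

assert max(column) == max(set(column)) == 7
assert min(column) == min(set(column)) == 1
print(max(column))  # 7
```

The same does not hold for SUM or AVG, which is why the elimination rule is restricted to MAX/MIN.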
[jira] [Comment Edited] (SPARK-6635) DataFrame.withColumn can create columns with identical names
[ https://issues.apache.org/jira/browse/SPARK-6635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16064095#comment-16064095 ] Yichuan Wang edited comment on SPARK-6635 at 6/27/17 1:43 AM: -- withColumn has this strange behavior with a join: it matches both the left-side and right-side columns of the join and replaces them with the new column (Spark 2.0.2)
{code:java}
val left = Seq((1, "one", null, "p1"), (2, "two", "b", null)).toDF("id", "name", "note", "note2")
val right = Seq((1, "c", 11, "n1"), (2, null, 22, "n2")).toDF("id", "note", "seq", "note2")
val j = left.join(right, "id")
j.show()
{code}
{code:java}
+---+----+----+-----+----+---+-----+
| id|name|note|note2|note|seq|note2|
+---+----+----+-----+----+---+-----+
|  1| one|null|   p1|   c| 11|   n1|
|  2| two|   b| null|null| 22|   n2|
+---+----+----+-----+----+---+-----+
{code}
{code:java}
val k = j.withColumn("note", coalesce(left("note"), right("note")))
k.show()
{code}
{code:java}
+---+----+----+-----+----+---+-----+
| id|name|note|note2|note|seq|note2|
+---+----+----+-----+----+---+-----+
|  1| one|   c|   p1|   c| 11|   n1|
|  2| two|   b| null|   b| 22|   n2|
+---+----+----+-----+----+---+-----+
{code}
{code:java}
val l = k.drop("note")
l.show()
{code}
{code:java}
+---+----+-----+---+-----+
| id|name|note2|seq|note2|
+---+----+-----+---+-----+
|  1| one|   p1| 11|   n1|
|  2| two| null| 22|   n2|
+---+----+-----+---+-----+
{code}
was (Author: yichuan): withColumn has this strange behavior with a join: it matches both the left-side and right-side columns of the join and replaces them with the new column (Spark 2.0.2)
{code:java}
val left = Seq((1, "one", null, "p1"), (2, "two", "b", null)).toDF("id", "name", "note", "note2")
val right = Seq((1, "c", 11, "n1"), (2, null, 22, "n2")).toDF("id", "note", "seq", "note2")
val j = left.join(right, "id")
j.show()
{code}
{code:java}
+---+----+----+-----+----+---+-----+
| id|name|note|note2|note|seq|note2|
+---+----+----+-----+----+---+-----+
|  1| one|null|   p1|   c| 11|   n1|
|  2| two|   b| null|null| 22|   n2|
+---+----+----+-----+----+---+-----+
{code}
{code:java}
val k = j.withColumn("note", coalesce(left("note"), right("note")))
k.show()
{code}
{code:java}
+---+----+----+-----+----+---+-----+
| id|name|note|note2|note|seq|note2|
+---+----+----+-----+----+---+-----+
|  1| one|   c|   p1|   c| 11|   n1|
|  2| two|   b| null|   b| 22|   n2|
+---+----+----+-----+----+---+-----+
{code}
{code:java}
val l = k.drop("note")
l.show()
{code}
{code:java}
+---+----+-----+---+-----+
| id|name|note2|seq|note2|
+---+----+-----+---+-----+
|  1| one|   p1| 11|   n1|
|  2| two| null| 22|   n2|
+---+----+-----+---+-----+
{code}
> DataFrame.withColumn can create columns with identical names > > > Key: SPARK-6635 > URL: https://issues.apache.org/jira/browse/SPARK-6635 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.0 >Reporter: Joseph K. Bradley >Assignee: Liang-Chi Hsieh > Fix For: 1.4.0 > > > DataFrame lets you create multiple columns with the same name, which causes > problems when you try to refer to columns by name. > Proposal: If a column is added to a DataFrame with a column of the same name, > then the new column should replace the old column. > {code} > scala> val df = sc.parallelize(Array(1,2,3)).toDF("x") > df: org.apache.spark.sql.DataFrame = [x: int] > scala> val df3 = df.withColumn("x", df("x") + 1) > df3: org.apache.spark.sql.DataFrame = [x: int, x: int] > scala> df3.collect() > res1: Array[org.apache.spark.sql.Row] = Array([1,2], [2,3], [3,4]) > scala> df3("x") > org.apache.spark.sql.AnalysisException: Reference 'x' is ambiguous, could be: > x, x.; > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:216) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:121) > at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:161) > at org.apache.spark.sql.DataFrame.col(DataFrame.scala:436) > at org.apache.spark.sql.DataFrame.apply(DataFrame.scala:426) > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:26) > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:31) > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:33) > at $iwC$$iwC$$iwC$$iwC$$iwC.(:35) > at $iwC$$iwC$$iwC$$iwC.(:37) > at $iwC$$iwC$$iwC.(:39) > at $iwC$$iwC.(:41) > at $iwC.(:43) > at (:45) > at .(:49) > at .() > at .(:7) > at .()
> at $print() > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAc
[jira] [Commented] (SPARK-6635) DataFrame.withColumn can create columns with identical names
[ https://issues.apache.org/jira/browse/SPARK-6635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16064095#comment-16064095 ] Yichuan Wang commented on SPARK-6635: - withColumn has this strange behavior with a join: it matches both the left-side and right-side columns of the join and replaces them with the new column (Spark 2.0.2)
{code:java}
val left = Seq((1, "one", null, "p1"), (2, "two", "b", null)).toDF("id", "name", "note", "note2")
val right = Seq((1, "c", 11, "n1"), (2, null, 22, "n2")).toDF("id", "note", "seq", "note2")
val j = left.join(right, "id")
j.show()
{code}
{code:java}
+---+----+----+-----+----+---+-----+
| id|name|note|note2|note|seq|note2|
+---+----+----+-----+----+---+-----+
|  1| one|null|   p1|   c| 11|   n1|
|  2| two|   b| null|null| 22|   n2|
+---+----+----+-----+----+---+-----+
{code}
{code:java}
val k = j.withColumn("note", coalesce(left("note"), right("note")))
k.show()
{code}
{code:java}
+---+----+----+-----+----+---+-----+
| id|name|note|note2|note|seq|note2|
+---+----+----+-----+----+---+-----+
|  1| one|   c|   p1|   c| 11|   n1|
|  2| two|   b| null|   b| 22|   n2|
+---+----+----+-----+----+---+-----+
{code}
{code:java}
val l = k.drop("note")
l.show()
{code}
{code:java}
+---+----+-----+---+-----+
| id|name|note2|seq|note2|
+---+----+-----+---+-----+
|  1| one|   p1| 11|   n1|
|  2| two| null| 22|   n2|
+---+----+-----+---+-----+
{code}
> DataFrame.withColumn can create columns with identical names > > > Key: SPARK-6635 > URL: https://issues.apache.org/jira/browse/SPARK-6635 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.0 >Reporter: Joseph K. Bradley >Assignee: Liang-Chi Hsieh > Fix For: 1.4.0 > > > DataFrame lets you create multiple columns with the same name, which causes > problems when you try to refer to columns by name. > Proposal: If a column is added to a DataFrame with a column of the same name, > then the new column should replace the old column.
> {code} > scala> val df = sc.parallelize(Array(1,2,3)).toDF("x") > df: org.apache.spark.sql.DataFrame = [x: int] > scala> val df3 = df.withColumn("x", df("x") + 1) > df3: org.apache.spark.sql.DataFrame = [x: int, x: int] > scala> df3.collect() > res1: Array[org.apache.spark.sql.Row] = Array([1,2], [2,3], [3,4]) > scala> df3("x") > org.apache.spark.sql.AnalysisException: Reference 'x' is ambiguous, could be: > x, x.; > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:216) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:121) > at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:161) > at org.apache.spark.sql.DataFrame.col(DataFrame.scala:436) > at org.apache.spark.sql.DataFrame.apply(DataFrame.scala:426) > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:26) > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:31) > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:33) > at $iwC$$iwC$$iwC$$iwC$$iwC.(:35) > at $iwC$$iwC$$iwC$$iwC.(:37) > at $iwC$$iwC$$iwC.(:39) > at $iwC$$iwC.(:41) > at $iwC.(:43) > at (:45) > at .(:49) > at .() > at .(:7) > at .() > at $print() > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065) > at > org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338) > at > org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840) > at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871) > at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819) > at > org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:856) > at > org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:901) > at 
org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:813) > at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:656) > at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:664) > at > org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:669) > at > org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:996) > at > org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:944) > at
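The behavior reported in the comment above can be mimicked without Spark. A minimal Python model of a result with duplicate column names (the `drop` helper is hypothetical, not a Spark API) shows why dropping `"note"` by name removes both occurrences: name-based operations match every column with that name.

```python
# Model the joined result of left.join(right, "id") as a header plus rows;
# duplicate column names are allowed, as in the DataFrame above.
header = ["id", "name", "note", "note2", "note", "seq", "note2"]
rows = [
    [1, "one", None, "p1", "c", 11, "n1"],
    [2, "two", "b", None, None, 22, "n2"],
]

def drop(header, rows, name):
    # Dropping by name removes EVERY column called `name`, mirroring
    # the surprising effect of k.drop("note") in the comment.
    keep = [i for i, h in enumerate(header) if h != name]
    return [header[i] for i in keep], [[r[i] for i in keep] for r in rows]

new_header, new_rows = drop(header, rows, "note")
print(new_header)  # ['id', 'name', 'note2', 'seq', 'note2']
```

Disambiguating with `left("note")` / `right("note")` before the name collision occurs, or renaming one side with `withColumnRenamed`, avoids the ambiguity entirely.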
[jira] [Commented] (SPARK-21221) CrossValidator and TrainValidationSplit Persist Nested Estimators
[ https://issues.apache.org/jira/browse/SPARK-21221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16064068#comment-16064068 ] Ajay Saini commented on SPARK-21221: Note: In order for Python persistence of OneVsRest inside a CrossValidator/TrainValidationSplit to work, this change needs to be merged, because Python persistence of meta-algorithms relies on the Scala saving implementation. > CrossValidator and TrainValidationSplit Persist Nested Estimators > - > > Key: SPARK-21221 > URL: https://issues.apache.org/jira/browse/SPARK-21221 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.1.1 >Reporter: Ajay Saini > > Currently, the saving of parameters done in ValidatorParams.scala assumes > that all parameters in EstimatorParameterMaps are JSON serializable. Such an > assumption causes CrossValidator and TrainValidationSplit persistence to fail > when the internal estimator of these meta-algorithms is not JSON > serializable. One example is OneVsRest, which has a classifier (estimator) as > a parameter. > The changes would involve removing the assumption in validator params that > all the estimator params are JSON serializable. This could mean saving > parameters that are not JSON serializable separately at a specified path.
[jira] [Issue Comment Deleted] (SPARK-21221) CrossValidator and TrainValidationSplit Persist Nested Estimators
[ https://issues.apache.org/jira/browse/SPARK-21221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ajay Saini updated SPARK-21221: --- Comment: was deleted (was: Pull Request Here: https://github.com/apache/spark/pull/18428) > CrossValidator and TrainValidationSplit Persist Nested Estimators > - > > Key: SPARK-21221 > URL: https://issues.apache.org/jira/browse/SPARK-21221 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.1.1 >Reporter: Ajay Saini > > Currently, the saving of parameters done in ValidatorParams.scala assumes > that all parameters in EstimatorParameterMaps are JSON serializable. Such an > assumption causes CrossValidator and TrainValidationSplit persistence to fail > when the internal estimator of these meta-algorithms is not JSON > serializable. One example is OneVsRest, which has a classifier (estimator) as > a parameter. > The changes would involve removing the assumption in validator params that > all the estimator params are JSON serializable. This could mean saving > parameters that are not JSON serializable separately at a specified path.
[jira] [Commented] (SPARK-21221) CrossValidator and TrainValidationSplit Persist Nested Estimators
[ https://issues.apache.org/jira/browse/SPARK-21221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16064065#comment-16064065 ] Apache Spark commented on SPARK-21221: -- User 'ajaysaini725' has created a pull request for this issue: https://github.com/apache/spark/pull/18428 > CrossValidator and TrainValidationSplit Persist Nested Estimators > - > > Key: SPARK-21221 > URL: https://issues.apache.org/jira/browse/SPARK-21221 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.1.1 >Reporter: Ajay Saini > > Currently, the saving of parameters done in ValidatorParams.scala assumes > that all parameters in EstimatorParameterMaps are JSON serializable. Such an > assumption causes CrossValidator and TrainValidationSplit persistence to fail > when the internal estimator of these meta-algorithms is not JSON > serializable. One example is OneVsRest, which has a classifier (estimator) as > a parameter. > The changes would involve removing the assumption in validator params that > all the estimator params are JSON serializable. This could mean saving > parameters that are not JSON serializable separately at a specified path.
[jira] [Assigned] (SPARK-21221) CrossValidator and TrainValidationSplit Persist Nested Estimators
[ https://issues.apache.org/jira/browse/SPARK-21221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21221: Assignee: (was: Apache Spark) > CrossValidator and TrainValidationSplit Persist Nested Estimators > - > > Key: SPARK-21221 > URL: https://issues.apache.org/jira/browse/SPARK-21221 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.1.1 >Reporter: Ajay Saini > > Currently, the saving of parameters done in ValidatorParams.scala assumes > that all parameters in EstimatorParameterMaps are JSON serializable. Such an > assumption causes CrossValidator and TrainValidationSplit persistence to fail > when the internal estimator of these meta-algorithms is not JSON > serializable. One example is OneVsRest, which has a classifier (estimator) as > a parameter. > The changes would involve removing the assumption in validator params that > all the estimator params are JSON serializable. This could mean saving > parameters that are not JSON serializable separately at a specified path.
[jira] [Assigned] (SPARK-21221) CrossValidator and TrainValidationSplit Persist Nested Estimators
[ https://issues.apache.org/jira/browse/SPARK-21221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21221: Assignee: Apache Spark > CrossValidator and TrainValidationSplit Persist Nested Estimators > - > > Key: SPARK-21221 > URL: https://issues.apache.org/jira/browse/SPARK-21221 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.1.1 >Reporter: Ajay Saini >Assignee: Apache Spark > > Currently, the saving of parameters done in ValidatorParams.scala assumes > that all parameters in EstimatorParameterMaps are JSON serializable. Such an > assumption causes CrossValidator and TrainValidationSplit persistence to fail > when the internal estimator of these meta-algorithms is not JSON > serializable. One example is OneVsRest, which has a classifier (estimator) as > a parameter. > The changes would involve removing the assumption in validator params that > all the estimator params are JSON serializable. This could mean saving > parameters that are not JSON serializable separately at a specified path.
[jira] [Commented] (SPARK-21221) CrossValidator and TrainValidationSplit Persist Nested Estimators
[ https://issues.apache.org/jira/browse/SPARK-21221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16064064#comment-16064064 ] Ajay Saini commented on SPARK-21221: Pull request here: https://github.com/apache/spark/pull/18428 > CrossValidator and TrainValidationSplit Persist Nested Estimators > - > > Key: SPARK-21221 > URL: https://issues.apache.org/jira/browse/SPARK-21221 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.1.1 >Reporter: Ajay Saini > > Currently, the saving of parameters done in ValidatorParams.scala assumes > that all parameters in EstimatorParameterMaps are JSON serializable. Such an > assumption causes CrossValidator and TrainValidationSplit persistence to fail > when the internal estimator of these meta-algorithms is not JSON > serializable. One example is OneVsRest, which has a classifier (estimator) as > a parameter. > The changes would involve removing the assumption in validator params that > all the estimator params are JSON serializable. This could mean saving > parameters that are not JSON serializable separately at a specified path.
[jira] [Commented] (SPARK-17129) Support statistics collection and cardinality estimation for partitioned tables
[ https://issues.apache.org/jira/browse/SPARK-17129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16064062#comment-16064062 ] Zhenhua Wang commented on SPARK-17129: -- [~mbasmanova] Thanks for working on it~ I'll review it in the coming days. > Support statistics collection and cardinality estimation for partitioned > tables > --- > > Key: SPARK-17129 > URL: https://issues.apache.org/jira/browse/SPARK-17129 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Zhenhua Wang > > I am upgrading this JIRA because there are many tasks found that need to be > done here.
[jira] [Created] (SPARK-21221) CrossValidator and TrainValidationSplit Persist Nested Estimators
Ajay Saini created SPARK-21221: -- Summary: CrossValidator and TrainValidationSplit Persist Nested Estimators Key: SPARK-21221 URL: https://issues.apache.org/jira/browse/SPARK-21221 Project: Spark Issue Type: New Feature Components: ML Affects Versions: 2.1.1 Reporter: Ajay Saini Currently, the saving of parameters done in ValidatorParams.scala assumes that all parameters in EstimatorParameterMaps are JSON serializable. Such an assumption causes CrossValidator and TrainValidationSplit persistence to fail when the internal estimator of these meta-algorithms is not JSON serializable. One example is OneVsRest, which has a classifier (estimator) as a parameter. The changes would involve removing the assumption in validator params that all the estimator params are JSON serializable. This could mean saving parameters that are not JSON serializable separately at a specified path.
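The failure mode described in this issue, persistence code assuming every parameter is JSON serializable, is easy to reproduce with plain `json`. The sketch below is illustrative only: the `Classifier` stand-in and the split-by-serializability workaround are assumptions for demonstration, not the actual Spark fix.

```python
import json

class Classifier:
    """Stand-in for a nested estimator, e.g. OneVsRest's classifier param."""
    pass

params = {"maxIter": 10, "classifier": Classifier()}

# Naive save fails exactly as ValidatorParams does for non-JSON params.
try:
    json.dumps(params)
except TypeError:
    pass  # Object of type Classifier is not JSON serializable

# One possible workaround: JSON for plain values, a separate save path
# (using the estimator's own persistence) for everything else.
plain = {k: v for k, v in params.items() if isinstance(v, (int, float, str, bool))}
special = {k: v for k, v in params.items() if k not in plain}
print(json.dumps(plain))  # {"maxIter": 10}
print(list(special))      # ['classifier'] -> saved separately at its own path
```

This mirrors the proposal at the end of the description: keep JSON for what JSON can hold, and persist the nested estimator at a specified path of its own.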
[jira] [Commented] (SPARK-21063) Spark return an empty result from remote hadoop cluster
[ https://issues.apache.org/jira/browse/SPARK-21063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16064007#comment-16064007 ] Yuming Wang commented on SPARK-21063: - [~pbykov], I just verified it. It can get the result without any specific configuration if {{jdbc:hive2://remote.hive.local:1/default}} is the [Spark hive-thriftserver|https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2.scala], but it will throw {{org.apache.thrift.TApplicationException}} if {{jdbc:hive2://remote.hive.local:1/default}} is the [hive-thriftserver|https://github.com/apache/hive/blob/master/service/src/java/org/apache/hive/service/server/HiveServer2.java]. So what is your {{jdbc:hive2://remote.hive.local:1/default}}? > Spark return an empty result from remote hadoop cluster > --- > > Key: SPARK-21063 > URL: https://issues.apache.org/jira/browse/SPARK-21063 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.1.0, 2.1.1 >Reporter: Peter Bykov > > Spark returns an empty result when querying a remote hadoop cluster. > All firewall settings removed. > Querying using JDBC works properly using the hive-jdbc driver from version 1.1.1. > Code snippet is: > {code:java} > val spark = SparkSession.builder > .appName("RemoteSparkTest") > .master("local") > .getOrCreate() > val df = spark.read > .option("url", "jdbc:hive2://remote.hive.local:1/default") > .option("user", "user") > .option("password", "pass") > .option("dbtable", "test_table") > .option("driver", "org.apache.hive.jdbc.HiveDriver") > .format("jdbc") > .load() > > df.show() > {code} > Result: > {noformat} > +-------------------+ > |test_table.test_col| > +-------------------+ > +-------------------+ > {noformat} > All manipulations like: > {code:java} > df.select(*).show() > {code} > return an empty result too.
[jira] [Closed] (SPARK-21212) Can't use Count(*) with Order Clause
[ https://issues.apache.org/jira/browse/SPARK-21212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shawn Lavelle closed SPARK-21212. - Resolution: Not A Problem > Can't use Count(*) with Order Clause > > > Key: SPARK-21212 > URL: https://issues.apache.org/jira/browse/SPARK-21212 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 > Environment: Windows; external data provided through data source api >Reporter: Shawn Lavelle >Priority: Minor > > I don't think this should fail the query: > _Notes: VALUE is a column of table TABLE. columns and table names redacted. I > can generate a simplified test case if needed, but this is easy to reproduce. > _ > {code}jdbc:hive2://user:port/> select count(*) from table where value between > 1498240079000 and cast(now() as bigint)*1000 order by value; > {code} > {code} > Error: org.apache.spark.sql.AnalysisException: cannot resolve '`value`' given > input columns: [count(1)]; line 1 pos 113; > 'Sort ['value ASC NULLS FIRST], true > +- Aggregate [count(1) AS count(1)#718L] >+- Filter ((value#413L >= 1498240079000) && (value#413L <= > (cast(current_timestamp() as bigint) * cast(1000 as bigint > +- SubqueryAlias table > +- > Relation[field1#411L,field2#412,value#413L,field3#414,field4#415,field5#416,field6#417,field7#418,field8#419,field9#420] > com.redacted@16004579 (state=,code=0) > {code} > Arguably, the optimizer could ignore the "order by" clause, but I leave that > to more informed minds than my own.
[jira] [Commented] (SPARK-21212) Can't use Count(*) with Order Clause
[ https://issues.apache.org/jira/browse/SPARK-21212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16064005#comment-16064005 ] Shawn Lavelle commented on SPARK-21212: --- I think you're right. I know my users (not skilled at SQL) will continue to attempt the supplied query regardless of what training I throw at them. > Can't use Count(*) with Order Clause > > > Key: SPARK-21212 > URL: https://issues.apache.org/jira/browse/SPARK-21212 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 > Environment: Windows; external data provided through data source api >Reporter: Shawn Lavelle >Priority: Minor > > I don't think this should fail the query: > _Notes: VALUE is a column of table TABLE. columns and table names redacted. I > can generate a simplified test case if needed, but this is easy to reproduce. > _ > {code}jdbc:hive2://user:port/> select count(*) from table where value between > 1498240079000 and cast(now() as bigint)*1000 order by value; > {code} > {code} > Error: org.apache.spark.sql.AnalysisException: cannot resolve '`value`' given > input columns: [count(1)]; line 1 pos 113; > 'Sort ['value ASC NULLS FIRST], true > +- Aggregate [count(1) AS count(1)#718L] >+- Filter ((value#413L >= 1498240079000) && (value#413L <= > (cast(current_timestamp() as bigint) * cast(1000 as bigint > +- SubqueryAlias table > +- > Relation[field1#411L,field2#412,value#413L,field3#414,field4#415,field5#416,field6#417,field7#418,field8#419,field9#420] > com.redacted@16004579 (state=,code=0) > {code} > Arguably, the optimizer could ignore the "order by" clause, but I leave that > to more informed minds than my own.
[jira] [Created] (SPARK-21220) Use outputPartitioning's bucketing if possible on write
Andrew Ash created SPARK-21220: -- Summary: Use outputPartitioning's bucketing if possible on write Key: SPARK-21220 URL: https://issues.apache.org/jira/browse/SPARK-21220 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.2.0 Reporter: Andrew Ash When reading a bucketed dataset and writing it back with no transformations (a copy) the bucketing information is lost and the user is required to re-specify the bucketing information on write. This negatively affects read performance on the copied dataset since the bucketing information enables significant optimizations that aren't possible on the un-bucketed copied table. Spark should propagate this bucketing information for copied datasets, and more generally could support inferring bucket information based on the known partitioning of the final RDD at save time when that partitioning is a {{HashPartitioning}}. https://github.com/apache/spark/blob/v2.2.0-rc5/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala#L118 In the above linked {{bucketIdExpression}}, we could {{.orElse}} a bucket expression based on outputPartitionings that are HashPartitioning. This preserves bucket information for bucketed datasets, and also supports saving this metadata at write time for datasets with a known partitioning. Both of these cases should improve performance at read time of the newly-written dataset. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
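To make the copy scenario concrete: until bucketing information is propagated automatically, the bucket spec has to be restated by hand on every write. The sketch below is only an illustration of the status quo the ticket wants to remove; the table names {{src}}/{{dst}}, the bucket column {{id}}, and the bucket count are hypothetical, not taken from the report. {code:java} // Hypothetical sketch: copying a bucketed table today drops the bucket spec. val df = spark.table("src") // suppose src was bucketed by "id" into 8 buckets // A plain copy loses the bucketing metadata, hurting later reads: df.write.saveAsTable("dst_unbucketed") // What the user must currently do, and what SPARK-21220 proposes to infer // from the final RDD's HashPartitioning at save time: df.write .bucketBy(8, "id") // restated manually today .sortBy("id") .saveAsTable("dst") {code}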
[jira] [Commented] (SPARK-14220) Build and test Spark against Scala 2.12
[ https://issues.apache.org/jira/browse/SPARK-14220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16063984#comment-16063984 ] Reynold Xin commented on SPARK-14220: - If all those issues have been released, then it would be easy. Do you want to take a stab at that and see if it works with the DataFrame API? > Build and test Spark against Scala 2.12 > --- > > Key: SPARK-14220 > URL: https://issues.apache.org/jira/browse/SPARK-14220 > Project: Spark > Issue Type: Umbrella > Components: Build, Project Infra >Affects Versions: 2.1.0 >Reporter: Josh Rosen >Priority: Blocker > > This umbrella JIRA tracks the requirements for building and testing Spark > against the current Scala 2.12 milestone.
[jira] [Commented] (SPARK-21218) Convert IN predicate to equivalent Parquet filter
[ https://issues.apache.org/jira/browse/SPARK-21218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16063981#comment-16063981 ] Andrew Duffy commented on SPARK-21218: -- Good catch, looks like a dupe. [~hyukjin.kwon] did profiling on the issue last year and it was determined at the time that eval'ing the disjunction in Spark was faster. > Convert IN predicate to equivalent Parquet filter > - > > Key: SPARK-21218 > URL: https://issues.apache.org/jira/browse/SPARK-21218 > Project: Spark > Issue Type: Improvement > Components: Optimizer >Affects Versions: 2.1.1 >Reporter: Michael Styles > > Convert IN predicate to equivalent expression involving equality conditions > to allow the filter to be pushed down to Parquet. > For instance, > C1 IN (10, 20) is rewritten as (C1 = 10) OR (C1 = 20) -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-17091) ParquetFilters rewrite IN to OR of Eq
[ https://issues.apache.org/jira/browse/SPARK-17091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Duffy resolved SPARK-17091. -- Resolution: Won't Fix Should've closed this last year, but at the time based on Hyukjin Kwon's profiling, evaluating the disjunction at the Spark level was determined to yield better performance. > ParquetFilters rewrite IN to OR of Eq > - > > Key: SPARK-17091 > URL: https://issues.apache.org/jira/browse/SPARK-17091 > Project: Spark > Issue Type: Bug >Reporter: Andrew Duffy > > Past attempts at pushing down the InSet operation for Parquet relied on > user-defined predicates. It would be simpler to rewrite an IN clause into the > corresponding OR union of a set of equality conditions. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
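The rewrite proposed in these two tickets is mechanical. As a rough sketch over simplified expression types (this is not the actual Catalyst API, just an assumed mini-AST for illustration), {{c IN (v1, ..., vn)}} becomes a right-fold of equalities under OR, so each equality can be pushed down as an ordinary Parquet filter: {code:java} // Simplified sketch; Expr/In/EqualTo/Or are stand-ins, not Catalyst classes. sealed trait Expr case class In(col: String, values: Seq[Any]) extends Expr case class EqualTo(col: String, value: Any) extends Expr case class Or(left: Expr, right: Expr) extends Expr def rewriteIn(e: Expr): Expr = e match { case In(c, vs) if vs.nonEmpty => // C1 IN (10, 20) => (C1 = 10) OR (C1 = 20) vs.map(v => EqualTo(c, v): Expr).reduce(Or(_, _)) case other => other } {code}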
[jira] [Comment Edited] (SPARK-21212) Can't use Count(*) with Order Clause
[ https://issues.apache.org/jira/browse/SPARK-21212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16063952#comment-16063952 ] Shawn Lavelle edited comment on SPARK-21212 at 6/26/17 11:26 PM: - [~srowen], are you saying that the ORDER BY is being applied to the aggregate count column while ignoring the columns of the table itself? was (Author: azeroth2b): [~srowen], I can assure you that value is a column in the table. For example, {code} select * from table where value between 1 and 5 order by value; {code} and {code} select count(*) from table where value between 1 and 5 {code} both complete successfully. > Can't use Count(*) with Order Clause > > > Key: SPARK-21212 > URL: https://issues.apache.org/jira/browse/SPARK-21212 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 > Environment: Windows; external data provided through data source api >Reporter: Shawn Lavelle >Priority: Minor > > I don't think this should fail the query: > _Notes: VALUE is a column of table TABLE. columns and table names redacted. I > can generate a simplified test case if needed, but this is easy to reproduce.
> _ > {code}jdbc:hive2://user:port/> select count(*) from table where value between > 1498240079000 and cast(now() as bigint)*1000 order by value; > {code} > {code} > Error: org.apache.spark.sql.AnalysisException: cannot resolve '`value`' given > input columns: [count(1)]; line 1 pos 113; > 'Sort ['value ASC NULLS FIRST], true > +- Aggregate [count(1) AS count(1)#718L] >+- Filter ((value#413L >= 1498240079000) && (value#413L <= > (cast(current_timestamp() as bigint) * cast(1000 as bigint > +- SubqueryAlias table > +- > Relation[field1#411L,field2#412,value#413L,field3#414,field4#415,field5#416,field6#417,field7#418,field8#419,field9#420] > com.redacted@16004579 (state=,code=0) > {code} > Arguably, the optimizer could ignore the "order by" clause, but I leave that > to more informed minds than my own. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21212) Can't use Count(*) with Order Clause
[ https://issues.apache.org/jira/browse/SPARK-21212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16063968#comment-16063968 ] Sean Owen commented on SPARK-21212: --- Yes but you are not selecting the thing you order by. I thought this was required in SQL. It seems to be at least in some engines and seems to be here. Add whatever it is to the select clause. > Can't use Count(*) with Order Clause > > > Key: SPARK-21212 > URL: https://issues.apache.org/jira/browse/SPARK-21212 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 > Environment: Windows; external data provided through data source api >Reporter: Shawn Lavelle >Priority: Minor > > I don't think this should fail the query: > _Notes: VALUE is a column of table TABLE. columns and table names redacted. I > can generate a simplified test case if needed, but this is easy to reproduce. > _ > {code}jdbc:hive2://user:port/> select count(*) from table where value between > 1498240079000 and cast(now() as bigint)*1000 order by value; > {code} > {code} > Error: org.apache.spark.sql.AnalysisException: cannot resolve '`value`' given > input columns: [count(1)]; line 1 pos 113; > 'Sort ['value ASC NULLS FIRST], true > +- Aggregate [count(1) AS count(1)#718L] >+- Filter ((value#413L >= 1498240079000) && (value#413L <= > (cast(current_timestamp() as bigint) * cast(1000 as bigint > +- SubqueryAlias table > +- > Relation[field1#411L,field2#412,value#413L,field3#414,field4#415,field5#416,field6#417,field7#418,field8#419,field9#420] > com.redacted@16004579 (state=,code=0) > {code} > Arguably, the optimizer could ignore the "order by" clause, but I leave that > to more informed minds than my own. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21212) Can't use Count(*) with Order Clause
[ https://issues.apache.org/jira/browse/SPARK-21212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shawn Lavelle updated SPARK-21212: -- Description: I don't think this should fail the query: _Notes: VALUE is a column of table TABLE. columns and table names redacted. I can generate a simplified test case if needed, but this is easy to reproduce. _ {code}jdbc:hive2://user:port/> select count(*) from table where value between 1498240079000 and cast(now() as bigint)*1000 order by value; {code} {code} Error: org.apache.spark.sql.AnalysisException: cannot resolve '`value`' given input columns: [count(1)]; line 1 pos 113; 'Sort ['value ASC NULLS FIRST], true +- Aggregate [count(1) AS count(1)#718L] +- Filter ((value#413L >= 1498240079000) && (value#413L <= (cast(current_timestamp() as bigint) * cast(1000 as bigint +- SubqueryAlias table +- Relation[field1#411L,field2#412,value#413L,field3#414,field4#415,field5#416,field6#417,field7#418,field8#419,field9#420] com.redacted@16004579 (state=,code=0) {code} Arguably, the optimizer could ignore the "order by" clause, but I leave that to more informed minds than my own. was: I don't think this should fail the query: {code}jdbc:hive2://user:port/> select count(*) from table where value between 1498240079000 and cast(now() as bigint)*1000 order by value; {code} {code} Error: org.apache.spark.sql.AnalysisException: cannot resolve '`value`' given input columns: [count(1)]; line 1 pos 113; 'Sort ['value ASC NULLS FIRST], true +- Aggregate [count(1) AS count(1)#718L] +- Filter ((value#413L >= 1498240079000) && (value#413L <= (cast(current_timestamp() as bigint) * cast(1000 as bigint +- SubqueryAlias table +- Relation[field1#411L,field2#412,value#413L,field3#414,field4#415,field5#416,field6#417,field7#418,field8#419,field9#420] com.redacted@16004579 (state=,code=0) {code} Arguably, the optimizer could ignore the "order by" clause, but I leave that to more informed minds than my own. 
> Can't use Count(*) with Order Clause > > > Key: SPARK-21212 > URL: https://issues.apache.org/jira/browse/SPARK-21212 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 > Environment: Windows; external data provided through data source api >Reporter: Shawn Lavelle >Priority: Minor > > I don't think this should fail the query: > _Notes: VALUE is a column of table TABLE. columns and table names redacted. I > can generate a simplified test case if needed, but this is easy to reproduce. > _ > {code}jdbc:hive2://user:port/> select count(*) from table where value between > 1498240079000 and cast(now() as bigint)*1000 order by value; > {code} > {code} > Error: org.apache.spark.sql.AnalysisException: cannot resolve '`value`' given > input columns: [count(1)]; line 1 pos 113; > 'Sort ['value ASC NULLS FIRST], true > +- Aggregate [count(1) AS count(1)#718L] >+- Filter ((value#413L >= 1498240079000) && (value#413L <= > (cast(current_timestamp() as bigint) * cast(1000 as bigint > +- SubqueryAlias table > +- > Relation[field1#411L,field2#412,value#413L,field3#414,field4#415,field5#416,field6#417,field7#418,field8#419,field9#420] > com.redacted@16004579 (state=,code=0) > {code} > Arguably, the optimizer could ignore the "order by" clause, but I leave that > to more informed minds than my own. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14220) Build and test Spark against Scala 2.12
[ https://issues.apache.org/jira/browse/SPARK-14220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16063957#comment-16063957 ] Flavio Brasil commented on SPARK-14220: --- [~rxin] Could you expand your last comment? Is it hard because of the issue described in [this doc | https://docs.google.com/document/d/1P_wmH3U356f079AYgSsN53HKixuNdxSEvo8nw_tgLgM/edit]? Scala 2.12 already [supports automatic conversion | https://scastie.scala-lang.org/dvkk8TFOR1OYKUz2snfzbw] from scala functions to java functional interfaces, maybe that wasn't the case at the time that the doc was written. Spark is one of the last dependencies at Twitter that doesn't support Scala 2.12. Would it be possible for us to collaborate on this? > Build and test Spark against Scala 2.12 > --- > > Key: SPARK-14220 > URL: https://issues.apache.org/jira/browse/SPARK-14220 > Project: Spark > Issue Type: Umbrella > Components: Build, Project Infra >Affects Versions: 2.1.0 >Reporter: Josh Rosen >Priority: Blocker > > This umbrella JIRA tracks the requirements for building and testing Spark > against the current Scala 2.12 milestone. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-21212) Can't use Count(*) with Order Clause
[ https://issues.apache.org/jira/browse/SPARK-21212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16063952#comment-16063952 ] Shawn Lavelle edited comment on SPARK-21212 at 6/26/17 11:17 PM: - [~srowen], I can assure you that value is a column in the table. For example, {code} select * from table where value between 1 and 5 order by value; {code} and {code} select count(*) from table where value between 1 and 5 {code} both complete successfully. was (Author: azeroth2b): [~srowen], I can assure you that value is a column in the table. For example, [code] select * from table where value between 1 and 5 order by value; [code] and [code] select coun(*) from table where value between 1 and 5 [code] both complete successfully. > Can't use Count(*) with Order Clause > > > Key: SPARK-21212 > URL: https://issues.apache.org/jira/browse/SPARK-21212 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 > Environment: Windows; external data provided through data source api >Reporter: Shawn Lavelle >Priority: Minor > > I don't think this should fail the query: > {code}jdbc:hive2://user:port/> select count(*) from table where value between > 1498240079000 and cast(now() as bigint)*1000 order by value; > {code} > {code} > Error: org.apache.spark.sql.AnalysisException: cannot resolve '`value`' given > input columns: [count(1)]; line 1 pos 113; > 'Sort ['value ASC NULLS FIRST], true > +- Aggregate [count(1) AS count(1)#718L] >+- Filter ((value#413L >= 1498240079000) && (value#413L <= > (cast(current_timestamp() as bigint) * cast(1000 as bigint > +- SubqueryAlias table > +- > Relation[field1#411L,field2#412,value#413L,field3#414,field4#415,field5#416,field6#417,field7#418,field8#419,field9#420] > com.redacted@16004579 (state=,code=0) > {code} > Arguably, the optimizer could ignore the "order by" clause, but I leave that > to more informed minds than my own. 
[jira] [Comment Edited] (SPARK-21212) Can't use Count(*) with Order Clause
[ https://issues.apache.org/jira/browse/SPARK-21212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16063952#comment-16063952 ] Shawn Lavelle edited comment on SPARK-21212 at 6/26/17 11:17 PM: - [~srowen], I can assure you that value is a column in the table. For example, [code] select * from table where value between 1 and 5 order by value; [code] and [code] select coun(*) from table where value between 1 and 5 [code] both complete successfully. was (Author: azeroth2b): [~srowen], I can assure you that value is a column in the table. > Can't use Count(*) with Order Clause > > > Key: SPARK-21212 > URL: https://issues.apache.org/jira/browse/SPARK-21212 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 > Environment: Windows; external data provided through data source api >Reporter: Shawn Lavelle >Priority: Minor > > I don't think this should fail the query: > {code}jdbc:hive2://user:port/> select count(*) from table where value between > 1498240079000 and cast(now() as bigint)*1000 order by value; > {code} > {code} > Error: org.apache.spark.sql.AnalysisException: cannot resolve '`value`' given > input columns: [count(1)]; line 1 pos 113; > 'Sort ['value ASC NULLS FIRST], true > +- Aggregate [count(1) AS count(1)#718L] >+- Filter ((value#413L >= 1498240079000) && (value#413L <= > (cast(current_timestamp() as bigint) * cast(1000 as bigint > +- SubqueryAlias table > +- > Relation[field1#411L,field2#412,value#413L,field3#414,field4#415,field5#416,field6#417,field7#418,field8#419,field9#420] > com.redacted@16004579 (state=,code=0) > {code} > Arguably, the optimizer could ignore the "order by" clause, but I leave that > to more informed minds than my own. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21212) Can't use Count(*) with Order Clause
[ https://issues.apache.org/jira/browse/SPARK-21212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16063952#comment-16063952 ] Shawn Lavelle commented on SPARK-21212: --- [~srowen], I can assure you that value is a column in the table. > Can't use Count(*) with Order Clause > > > Key: SPARK-21212 > URL: https://issues.apache.org/jira/browse/SPARK-21212 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 > Environment: Windows; external data provided through data source api >Reporter: Shawn Lavelle >Priority: Minor > > I don't think this should fail the query: > {code}jdbc:hive2://user:port/> select count(*) from table where value between > 1498240079000 and cast(now() as bigint)*1000 order by value; > {code} > {code} > Error: org.apache.spark.sql.AnalysisException: cannot resolve '`value`' given > input columns: [count(1)]; line 1 pos 113; > 'Sort ['value ASC NULLS FIRST], true > +- Aggregate [count(1) AS count(1)#718L] >+- Filter ((value#413L >= 1498240079000) && (value#413L <= > (cast(current_timestamp() as bigint) * cast(1000 as bigint > +- SubqueryAlias table > +- > Relation[field1#411L,field2#412,value#413L,field3#414,field4#415,field5#416,field6#417,field7#418,field8#419,field9#420] > com.redacted@16004579 (state=,code=0) > {code} > Arguably, the optimizer could ignore the "order by" clause, but I leave that > to more informed minds than my own. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21218) Convert IN predicate to equivalent Parquet filter
[ https://issues.apache.org/jira/browse/SPARK-21218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16063925#comment-16063925 ] Hyukjin Kwon commented on SPARK-21218: -- I believe it is a duplicate of SPARK-17091. > Convert IN predicate to equivalent Parquet filter > - > > Key: SPARK-21218 > URL: https://issues.apache.org/jira/browse/SPARK-21218 > Project: Spark > Issue Type: Improvement > Components: Optimizer >Affects Versions: 2.1.1 >Reporter: Michael Styles > > Convert IN predicate to equivalent expression involving equality conditions > to allow the filter to be pushed down to Parquet. > For instance, > C1 IN (10, 20) is rewritten as (C1 = 10) OR (C1 = 20) -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21212) Can't use Count(*) with Order Clause
[ https://issues.apache.org/jira/browse/SPARK-21212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16063915#comment-16063915 ] Sean Owen commented on SPARK-21212: --- Yes, but this doesn't work unless you say what 'value' is: {{select count(*) as value from table where ...}} > Can't use Count(*) with Order Clause > > > Key: SPARK-21212 > URL: https://issues.apache.org/jira/browse/SPARK-21212 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 > Environment: Windows; external data provided through data source api >Reporter: Shawn Lavelle >Priority: Minor > > I don't think this should fail the query: > {code}jdbc:hive2://user:port/> select count(*) from table where value between > 1498240079000 and cast(now() as bigint)*1000 order by value; > {code} > {code} > Error: org.apache.spark.sql.AnalysisException: cannot resolve '`value`' given > input columns: [count(1)]; line 1 pos 113; > 'Sort ['value ASC NULLS FIRST], true > +- Aggregate [count(1) AS count(1)#718L] >+- Filter ((value#413L >= 1498240079000) && (value#413L <= > (cast(current_timestamp() as bigint) * cast(1000 as bigint > +- SubqueryAlias table > +- > Relation[field1#411L,field2#412,value#413L,field3#414,field4#415,field5#416,field6#417,field7#418,field8#419,field9#420] > com.redacted@16004579 (state=,code=0) > {code} > Arguably, the optimizer could ignore the "order by" clause, but I leave that > to more informed minds than my own. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
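To spell out the workarounds suggested in this thread (the table and column names follow the redacted report, so treat them as placeholders): either drop the ORDER BY, since ordering a single-row aggregate is a no-op, or order by an expression that survives the aggregation, such as the aliased count itself. {code} -- Fails: `value` no longer exists after the aggregation select count(*) from table where value between 1498240079000 and cast(now() as bigint)*1000 order by value; -- Works: the result is a single row, so the ordering can simply be dropped select count(*) from table where value between 1498240079000 and cast(now() as bigint)*1000; -- Also works: order by a column that the aggregate actually produces select count(*) as cnt from table where value between 1498240079000 and cast(now() as bigint)*1000 order by cnt; {code}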
[jira] [Commented] (SPARK-21219) Task retry occurs on same executor due to race condition with blacklisting
[ https://issues.apache.org/jira/browse/SPARK-21219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16063909#comment-16063909 ] Apache Spark commented on SPARK-21219: -- User 'ericvandenbergfb' has created a pull request for this issue: https://github.com/apache/spark/pull/18427 > Task retry occurs on same executor due to race condition with blacklisting > -- > > Key: SPARK-21219 > URL: https://issues.apache.org/jira/browse/SPARK-21219 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.1.1 >Reporter: Eric Vandenberg >Priority: Minor > Attachments: spark_driver.log.anon, spark_executor.log.anon > > > When a task fails it is (1) added into the pending task list and then (2) > corresponding black list policy is enforced (ie, specifying if it can/can't > run on a particular node/executor/etc.) Unfortunately the ordering is such > that retrying the task could assign the task to the same executor, which, > incidentally could be shutting down and immediately fail the retry. Instead > the order should be (1) the black list state should be updated and then (2) > the task assigned, ensuring that the black list policy is properly enforced. > The attached logs demonstrate the race condition. > See spark_executor.log.anon: > 1. Task 55.2 fails on the executor > 17/06/20 13:25:07 ERROR Executor: Exception in task 55.2 in stage 5.0 (TID > 39575) > java.lang.OutOfMemoryError: Java heap space > 2. Immediately the same executor is assigned the retry task: > 17/06/20 13:25:07 INFO CoarseGrainedExecutorBackend: Got assigned task 39651 > 17/06/20 13:25:07 INFO Executor: Running task 55.3 in stage 5.0 (TID 39651) > 3. The retry task of course fails since the executor is also shutting down > due to the original task 55.2 OOM failure. 
> See the spark_driver.log.anon: > The driver processes the lost task 55.2: > 17/06/20 13:25:07 WARN TaskSetManager: Lost task 55.2 in stage 5.0 (TID > 39575, foobar.masked-server.com, executor > attempt_foobar.masked-server.com-___.masked-server.com-____0): > java.lang.OutOfMemoryError: Java heap space > The driver then receives the ExecutorLostFailure for the retry task 55.3 > (although it's obfuscated in these logs, the server info is same...) > 17/06/20 13:25:10 WARN TaskSetManager: Lost task 55.3 in stage 5.0 (TID > 39651, foobar.masked-server.com, executor > attempt_foobar.masked-server.com-___.masked-server.com-____0): > ExecutorLostFailure (executor > attempt_foobar.masked-server.com-___.masked-server.com-____0 > exited caused by one of the running tasks) Reason: Remote RPC client > disassociated. Likely due to containers exceeding thresholds, or network > issues. Check driver logs for WARN messages. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
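The fix described in the report is purely an ordering change. Schematically (hypothetical method and field names, not the actual {{TaskSetManager}} code): {code:java} // Hypothetical sketch of the race and its fix; names are illustrative only. // Buggy order: the failed task becomes schedulable again before the // blacklist records the failure, so it can land on the same executor. def handleFailedTaskBuggy(task: Task, executorId: String): Unit = { pendingTasks += task // (1) task re-enqueued blacklist.recordFailure(task, executorId) // (2) too late: task may already be re-assigned } // Fixed order: enforce the blacklist first, then requeue the task. def handleFailedTaskFixed(task: Task, executorId: String): Unit = { blacklist.recordFailure(task, executorId) // (1) policy updated first pendingTasks += task // (2) retry cannot land on executorId } {code}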
[jira] [Assigned] (SPARK-21219) Task retry occurs on same executor due to race condition with blacklisting
[ https://issues.apache.org/jira/browse/SPARK-21219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21219: Assignee: (was: Apache Spark) > Task retry occurs on same executor due to race condition with blacklisting > -- > > Key: SPARK-21219 > URL: https://issues.apache.org/jira/browse/SPARK-21219 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.1.1 >Reporter: Eric Vandenberg >Priority: Minor > Attachments: spark_driver.log.anon, spark_executor.log.anon > > > When a task fails it is (1) added into the pending task list and then (2) > corresponding black list policy is enforced (ie, specifying if it can/can't > run on a particular node/executor/etc.) Unfortunately the ordering is such > that retrying the task could assign the task to the same executor, which, > incidentally could be shutting down and immediately fail the retry. Instead > the order should be (1) the black list state should be updated and then (2) > the task assigned, ensuring that the black list policy is properly enforced. > The attached logs demonstrate the race condition. > See spark_executor.log.anon: > 1. Task 55.2 fails on the executor > 17/06/20 13:25:07 ERROR Executor: Exception in task 55.2 in stage 5.0 (TID > 39575) > java.lang.OutOfMemoryError: Java heap space > 2. Immediately the same executor is assigned the retry task: > 17/06/20 13:25:07 INFO CoarseGrainedExecutorBackend: Got assigned task 39651 > 17/06/20 13:25:07 INFO Executor: Running task 55.3 in stage 5.0 (TID 39651) > 3. The retry task of course fails since the executor is also shutting down > due to the original task 55.2 OOM failure. 
> See the spark_driver.log.anon: > The driver processes the lost task 55.2: > 17/06/20 13:25:07 WARN TaskSetManager: Lost task 55.2 in stage 5.0 (TID > 39575, foobar.masked-server.com, executor > attempt_foobar.masked-server.com-___.masked-server.com-____0): > java.lang.OutOfMemoryError: Java heap space > The driver then receives the ExecutorLostFailure for the retry task 55.3 > (although it's obfuscated in these logs, the server info is same...) > 17/06/20 13:25:10 WARN TaskSetManager: Lost task 55.3 in stage 5.0 (TID > 39651, foobar.masked-server.com, executor > attempt_foobar.masked-server.com-___.masked-server.com-____0): > ExecutorLostFailure (executor > attempt_foobar.masked-server.com-___.masked-server.com-____0 > exited caused by one of the running tasks) Reason: Remote RPC client > disassociated. Likely due to containers exceeding thresholds, or network > issues. Check driver logs for WARN messages. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21219) Task retry occurs on same executor due to race condition with blacklisting
[ https://issues.apache.org/jira/browse/SPARK-21219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21219: Assignee: Apache Spark > Task retry occurs on same executor due to race condition with blacklisting > -- > > Key: SPARK-21219 > URL: https://issues.apache.org/jira/browse/SPARK-21219 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.1.1 >Reporter: Eric Vandenberg >Assignee: Apache Spark >Priority: Minor > Attachments: spark_driver.log.anon, spark_executor.log.anon > > > When a task fails it is (1) added into the pending task list and then (2) > corresponding black list policy is enforced (ie, specifying if it can/can't > run on a particular node/executor/etc.) Unfortunately the ordering is such > that retrying the task could assign the task to the same executor, which, > incidentally could be shutting down and immediately fail the retry. Instead > the order should be (1) the black list state should be updated and then (2) > the task assigned, ensuring that the black list policy is properly enforced. > The attached logs demonstrate the race condition. > See spark_executor.log.anon: > 1. Task 55.2 fails on the executor > 17/06/20 13:25:07 ERROR Executor: Exception in task 55.2 in stage 5.0 (TID > 39575) > java.lang.OutOfMemoryError: Java heap space > 2. Immediately the same executor is assigned the retry task: > 17/06/20 13:25:07 INFO CoarseGrainedExecutorBackend: Got assigned task 39651 > 17/06/20 13:25:07 INFO Executor: Running task 55.3 in stage 5.0 (TID 39651) > 3. The retry task of course fails since the executor is also shutting down > due to the original task 55.2 OOM failure. 
> See the spark_driver.log.anon: > The driver processes the lost task 55.2: > 17/06/20 13:25:07 WARN TaskSetManager: Lost task 55.2 in stage 5.0 (TID > 39575, foobar.masked-server.com, executor > attempt_foobar.masked-server.com-___.masked-server.com-____0): > java.lang.OutOfMemoryError: Java heap space > The driver then receives the ExecutorLostFailure for the retry task 55.3 > (although it's obfuscated in these logs, the server info is same...) > 17/06/20 13:25:10 WARN TaskSetManager: Lost task 55.3 in stage 5.0 (TID > 39651, foobar.masked-server.com, executor > attempt_foobar.masked-server.com-___.masked-server.com-____0): > ExecutorLostFailure (executor > attempt_foobar.masked-server.com-___.masked-server.com-____0 > exited caused by one of the running tasks) Reason: Remote RPC client > disassociated. Likely due to containers exceeding thresholds, or network > issues. Check driver logs for WARN messages. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
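The ordering fix described in this issue can be pictured without Spark at all. Below is a hypothetical, framework-free model (this is not Spark's actual TaskSetManager code; all names are made up) showing why updating the blacklist before re-queueing the retry closes the race window:

```python
# Hypothetical sketch, not Spark's TaskSetManager: models the two orderings.
class Scheduler:
    def __init__(self):
        self.pending = []       # tasks eligible for scheduling
        self.blacklist = set()  # (task, executor) pairs that must not pair up

    def offer(self, executor):
        """Give `executor` the first pending task the blacklist allows."""
        for task in self.pending:
            if (task, executor) not in self.blacklist:
                self.pending.remove(task)
                return task
        return None

    def on_failure_fixed(self, task, executor):
        # Fixed ordering: blacklist first, then make the retry schedulable,
        # so no scheduling pass can see the task without its blacklist entry.
        self.blacklist.add((task, executor))
        self.pending.append(task)

# The race with the buggy ordering, written out as one bad interleaving:
s = Scheduler()
s.pending.append("task 55.3")                  # (1) retry queued first...
retry = s.offer("executor_0")                  # ...scheduler runs in between
assert retry == "task 55.3"                    # back on the failing executor
s.blacklist.add(("task 55.3", "executor_0"))   # (2) blacklist arrives too late

# With the fixed ordering, the failing executor can never win the retry:
s2 = Scheduler()
s2.on_failure_fixed("task 55.3", "executor_0")
assert s2.offer("executor_0") is None          # blacklisted
assert s2.offer("executor_1") == "task 55.3"   # a healthy executor gets it
```

The sketch makes the point of the report concrete: the bug is not in either step but in the window between them.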
[jira] [Commented] (SPARK-21212) Can't use Count(*) with Order Clause
[ https://issues.apache.org/jira/browse/SPARK-21212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16063903#comment-16063903 ] Shawn Lavelle commented on SPARK-21212: --- I redacted most of the information to protect proprietary information. The SQL Optimizer replaces count(*) with count(1). > Can't use Count(*) with Order Clause > > > Key: SPARK-21212 > URL: https://issues.apache.org/jira/browse/SPARK-21212 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 > Environment: Windows; external data provided through data source api >Reporter: Shawn Lavelle >Priority: Minor > > I don't think this should fail the query: > {code}jdbc:hive2://user:port/> select count(*) from table where value between > 1498240079000 and cast(now() as bigint)*1000 order by value; > {code} > {code} > Error: org.apache.spark.sql.AnalysisException: cannot resolve '`value`' given > input columns: [count(1)]; line 1 pos 113; > 'Sort ['value ASC NULLS FIRST], true > +- Aggregate [count(1) AS count(1)#718L] >+- Filter ((value#413L >= 1498240079000) && (value#413L <= > (cast(current_timestamp() as bigint) * cast(1000 as bigint > +- SubqueryAlias table > +- > Relation[field1#411L,field2#412,value#413L,field3#414,field4#415,field5#416,field6#417,field7#418,field8#419,field9#420] > com.redacted@16004579 (state=,code=0) > {code} > Arguably, the optimizer could ignore the "order by" clause, but I leave that > to more informed minds than my own. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
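For reference, a global count(*) produces a single row, so the ORDER BY contributes nothing; the common workaround is to drop it, or to order by the aggregate's output column, which (unlike `value`) remains in scope after aggregation. A small SQLite stand-in (not Spark; the table and data are made up) illustrating the working form:

```python
# SQLite stands in for Spark SQL here purely to show the scoping workaround;
# table name, column name, and data are invented for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (value INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (3,)])

# Ordering by the aggregate's own output alias resolves, because `cnt`
# still exists after aggregation while `value` does not.
(count,) = conn.execute(
    "SELECT count(*) AS cnt FROM t WHERE value BETWEEN 1 AND 2 ORDER BY cnt"
).fetchone()
assert count == 2
```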
[jira] [Comment Edited] (SPARK-21190) SPIP: Vectorized UDFs in Python
[ https://issues.apache.org/jira/browse/SPARK-21190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16063873#comment-16063873 ] Li Jin edited comment on SPARK-21190 at 6/26/17 10:02 PM: -- [~r...@databricks.com], Seeing an entire partition at a time is one of the less interesting use cases, because partitioning is largely an implementation detail of Spark. From the use cases I have seen among our internal PySpark/Pandas users, most of the interest is in using Pandas UDFs with Spark windowing and grouping operations, for instance breaking the data into day/month/year pieces and performing some analysis with Pandas on each piece. However, the challenge still holds if one group is too large. One idea is to implement local aggregation and merging as Pandas operations so this can be executed efficiently, but I haven't put much thought into it. I think that even without solving the "one group doesn't fit in memory" problem, this can already be a pretty powerful tool, because today performing any Pandas analysis requires the *entire dataset* to fit in memory (that is, without using tools like dask). was (Author: icexelloss): [~r...@databricks.com], The use case of seeing entire partition at a time is one of the less interesting use case, because partition is somehow an implementation detail of spark. From the use cases I saw from our internal pyspark/pandas users, the majority interests are to use pandas udf with a spark windowing and grouping operation, for instance, breaking data into day/month/year piece and perform some analysis using pandas for each piece. However, the challenge still holds if one group is too large. One idea is to implement local aggregation and merge in pandas operations so it can be executed efficiently. 
I think without solving the problem of "one group doesn't fit in memory", this can already be a pretty powerful tool because currently to perform any pandas analysis, the *entire dataset* has to fit in memory (that is, without using tools like dask). > SPIP: Vectorized UDFs in Python > --- > > Key: SPARK-21190 > URL: https://issues.apache.org/jira/browse/SPARK-21190 > Project: Spark > Issue Type: New Feature > Components: PySpark, SQL >Affects Versions: 2.2.0 >Reporter: Reynold Xin >Assignee: Reynold Xin > Labels: SPIP > Attachments: SPIPVectorizedUDFsforPython (1).pdf > > > *Background and Motivation* > Python is one of the most popular programming languages among Spark users. > Spark currently exposes a row-at-a-time interface for defining and executing > user-defined functions (UDFs). This introduces high overhead in serialization > and deserialization, and also makes it difficult to leverage Python libraries > (e.g. numpy, Pandas) that are written in native code. > > This proposal advocates introducing new APIs to support vectorized UDFs in > Python, in which a block of data is transferred over to Python in some > columnar format for execution. > > > *Target Personas* > Data scientists, data engineers, library developers. > > *Goals* > - Support vectorized UDFs that apply on chunks of the data frame > - Low system overhead: Substantially reduce serialization and deserialization > overhead when compared with row-at-a-time interface > - UDF performance: Enable users to leverage native libraries in Python (e.g. > numpy, Pandas) for data manipulation in these UDFs > > *Non-Goals* > The following are explicitly out of scope for the current SPIP, and should be > done in future SPIPs. Nonetheless, it would be good to consider these future > use cases during API design, so we can achieve some consistency when rolling > out new APIs. > > - Define block oriented UDFs in other languages (that are not Python). 
> - Define aggregate UDFs > - Tight integration with machine learning frameworks > > *Proposed API Changes* > The following sketches some possibilities. I haven’t spent a lot of time > thinking about the API (wrote it down in 5 mins) and I am not attached to > this design at all. The main purpose of the SPIP is to get feedback on use > cases and see how they can impact API design. > > Two things to consider are: > > 1. Python is dynamically typed, whereas DataFrames/SQL requires static, > analysis time typing. This means users would need to specify the return type > of their UDFs. > > 2. Ratio of input rows to output rows. We propose initially we require number > of output rows to be the same as the number of input rows. In the future, we > can consider relaxing this constraint with support for vectorized aggregate > UDFs. > > Proposed API sketch (using examples): > > Use case 1. A function that defines all the columns of a DataFrame (similar > to a “map” function): > > {code} > @spark_
[jira] [Commented] (SPARK-21190) SPIP: Vectorized UDFs in Python
[ https://issues.apache.org/jira/browse/SPARK-21190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16063873#comment-16063873 ] Li Jin commented on SPARK-21190: [~r...@databricks.com], Seeing an entire partition at a time is one of the less interesting use cases, because partitioning is largely an implementation detail of Spark. From the use cases I have seen among our internal PySpark/Pandas users, most of the interest is in using Pandas UDFs with Spark windowing and grouping operations, for instance breaking the data into day/month/year pieces and performing some analysis with Pandas on each piece. However, the challenge still holds if one group is too large. One idea is to implement local aggregation and merging as Pandas operations so this can be executed efficiently. I think that even without solving the "one group doesn't fit in memory" problem, this can already be a pretty powerful tool, because today performing any Pandas analysis requires the *entire dataset* to fit in memory (that is, without using tools like dask). > SPIP: Vectorized UDFs in Python > --- > > Key: SPARK-21190 > URL: https://issues.apache.org/jira/browse/SPARK-21190 > Project: Spark > Issue Type: New Feature > Components: PySpark, SQL >Affects Versions: 2.2.0 >Reporter: Reynold Xin >Assignee: Reynold Xin > Labels: SPIP > Attachments: SPIPVectorizedUDFsforPython (1).pdf > > > *Background and Motivation* > Python is one of the most popular programming languages among Spark users. > Spark currently exposes a row-at-a-time interface for defining and executing > user-defined functions (UDFs). This introduces high overhead in serialization > and deserialization, and also makes it difficult to leverage Python libraries > (e.g. numpy, Pandas) that are written in native code. > > This proposal advocates introducing new APIs to support vectorized UDFs in > Python, in which a block of data is transferred over to Python in some > columnar format for execution. 
> > > *Target Personas* > Data scientists, data engineers, library developers. > > *Goals* > - Support vectorized UDFs that apply on chunks of the data frame > - Low system overhead: Substantially reduce serialization and deserialization > overhead when compared with row-at-a-time interface > - UDF performance: Enable users to leverage native libraries in Python (e.g. > numpy, Pandas) for data manipulation in these UDFs > > *Non-Goals* > The following are explicitly out of scope for the current SPIP, and should be > done in future SPIPs. Nonetheless, it would be good to consider these future > use cases during API design, so we can achieve some consistency when rolling > out new APIs. > > - Define block oriented UDFs in other languages (that are not Python). > - Define aggregate UDFs > - Tight integration with machine learning frameworks > > *Proposed API Changes* > The following sketches some possibilities. I haven’t spent a lot of time > thinking about the API (wrote it down in 5 mins) and I am not attached to > this design at all. The main purpose of the SPIP is to get feedback on use > cases and see how they can impact API design. > > Two things to consider are: > > 1. Python is dynamically typed, whereas DataFrames/SQL requires static, > analysis time typing. This means users would need to specify the return type > of their UDFs. > > 2. Ratio of input rows to output rows. We propose initially we require number > of output rows to be the same as the number of input rows. In the future, we > can consider relaxing this constraint with support for vectorized aggregate > UDFs. > > Proposed API sketch (using examples): > > Use case 1. A function that defines all the columns of a DataFrame (similar > to a “map” function): > > {code} > @spark_udf(some way to describe the return schema) > def my_func_on_entire_df(input): > """ Some user-defined function. > > :param input: A Pandas DataFrame with two columns, a and b. > :return: :class: A Pandas data frame. 
> """ > input[c] = input[a] + input[b] > Input[d] = input[a] - input[b] > return input > > spark.range(1000).selectExpr("id a", "id / 2 b") > .mapBatches(my_func_on_entire_df) > {code} > > Use case 2. A function that defines only one column (similar to existing > UDFs): > > {code} > @spark_udf(some way to describe the return schema) > def my_func_that_returns_one_column(input): > """ Some user-defined function. > > :param input: A Pandas DataFrame with two columns, a and b. > :return: :class: A numpy array > """ > return input[a] + input[b] > > my_func = udf(my_func_that_returns_one_column) > > df = spark.range(1000).selectExpr("id a", "id / 2 b") > df.withColumn("c", my_func(df.a, df.b)) > {code} > > > > *Optional Design Sketch*
[jira] [Updated] (SPARK-21219) Task retry occurs on same executor due to race condition with blacklisting
[ https://issues.apache.org/jira/browse/SPARK-21219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Vandenberg updated SPARK-21219: Description: When a task fails it is (1) added into the pending task list and then (2) the corresponding blacklist policy is enforced (i.e., specifying whether it can or can't run on a particular node/executor/etc.). Unfortunately, this ordering means that retrying the task could assign it to the same executor, which, incidentally, could be shutting down and would immediately fail the retry. Instead the order should be: (1) update the blacklist state, and then (2) assign the task, ensuring that the blacklist policy is properly enforced. The attached logs demonstrate the race condition. See spark_executor.log.anon: 1. Task 55.2 fails on the executor 17/06/20 13:25:07 ERROR Executor: Exception in task 55.2 in stage 5.0 (TID 39575) java.lang.OutOfMemoryError: Java heap space 2. Immediately the same executor is assigned the retry task: 17/06/20 13:25:07 INFO CoarseGrainedExecutorBackend: Got assigned task 39651 17/06/20 13:25:07 INFO Executor: Running task 55.3 in stage 5.0 (TID 39651) 3. The retry task of course fails since the executor is also shutting down due to the original task 55.2 OOM failure. See the spark_driver.log.anon: The driver processes the lost task 55.2: 17/06/20 13:25:07 WARN TaskSetManager: Lost task 55.2 in stage 5.0 (TID 39575, foobar.masked-server.com, executor attempt_foobar.masked-server.com-___.masked-server.com-____0): java.lang.OutOfMemoryError: Java heap space The driver then receives the ExecutorLostFailure for the retry task 55.3 (although it's obfuscated in these logs, the server info is the same...) 
17/06/20 13:25:10 WARN TaskSetManager: Lost task 55.3 in stage 5.0 (TID 39651, foobar.masked-server.com, executor attempt_foobar.masked-server.com-___.masked-server.com-____0): ExecutorLostFailure (executor attempt_foobar.masked-server.com-___.masked-server.com-____0 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages. was: When a task fails it is added into the pending task list and corresponding black list policy is enforced (ie, specifying if it can/can't run on a particular node/executor/etc.) Unfortunately the ordering is such that retrying the task could assign the task to the same executor, which, incidentally could be shutting down and immediately fail the retry. Instead the black list state should be updated and then the task assigned, ensuring that the black list policy is properly enforced. The attached logs demonstrate the race condition. See spark_executor.log.anon: 1. Task 55.2 fails on the executor 17/06/20 13:25:07 ERROR Executor: Exception in task 55.2 in stage 5.0 (TID 39575) java.lang.OutOfMemoryError: Java heap space 2. Immediately the same executor is assigned the retry task: 17/06/20 13:25:07 INFO CoarseGrainedExecutorBackend: Got assigned task 39651 17/06/20 13:25:07 INFO Executor: Running task 55.3 in stage 5.0 (TID 39651) 3. The retry task of course fails since the executor is also shutting down due to the original task 55.2 OOM failure. See the spark_driver.log.anon: The driver processes the lost task 55.2: 17/06/20 13:25:07 WARN TaskSetManager: Lost task 55.2 in stage 5.0 (TID 39575, foobar.masked-server.com, executor attempt_foobar.masked-server.com-___.masked-server.com-____0): java.lang.OutOfMemoryError: Java heap space The driver then receives the ExecutorLostFailure for the retry task 55.3 (although it's obfuscated in these logs, the server info is same...) 
17/06/20 13:25:10 WARN TaskSetManager: Lost task 55.3 in stage 5.0 (TID 39651, foobar.masked-server.com, executor attempt_foobar.masked-server.com-___.masked-server.com-____0): ExecutorLostFailure (executor attempt_foobar.masked-server.com-___.masked-server.com-____0 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages. > Task retry occurs on same executor due to race condition with blacklisting > -- > > Key: SPARK-21219 > URL: https://issues.apache.org/jira/browse/SPARK-21219 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.1.1 >Reporter: Eric Vandenberg >Priority: Minor > Attachments: spark_driver.log.anon, spark_executor.log.anon > > > W
[jira] [Updated] (SPARK-21219) Task retry occurs on same executor due to race condition with blacklisting
[ https://issues.apache.org/jira/browse/SPARK-21219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Vandenberg updated SPARK-21219: Description: When a task fails it is added into the pending task list and corresponding black list policy is enforced (ie, specifying if it can/can't run on a particular node/executor/etc.) Unfortunately the ordering is such that retrying the task could assign the task to the same executor, which, incidentally could be shutting down and immediately fail the retry. Instead the black list state should be updated and then the task assigned, ensuring that the black list policy is properly enforced. The attached logs demonstrate the race condition. See spark_executor.log.anon: 1. Task 55.2 fails on the executor 17/06/20 13:25:07 ERROR Executor: Exception in task 55.2 in stage 5.0 (TID 39575) java.lang.OutOfMemoryError: Java heap space 2. Immediately the same executor is assigned the retry task: 17/06/20 13:25:07 INFO CoarseGrainedExecutorBackend: Got assigned task 39651 17/06/20 13:25:07 INFO Executor: Running task 55.3 in stage 5.0 (TID 39651) 3. The retry task of course fails since the executor is also shutting down due to the original task 55.2 OOM failure. See the spark_driver.log.anon: The driver processes the lost task 55.2: 17/06/20 13:25:07 WARN TaskSetManager: Lost task 55.2 in stage 5.0 (TID 39575, foobar.masked-server.com, executor attempt_foobar.masked-server.com-___.masked-server.com-____0): java.lang.OutOfMemoryError: Java heap space The driver then receives the ExecutorLostFailure for the retry task 55.3 (although it's obfuscated in these logs, the server info is same...) 
17/06/20 13:25:10 WARN TaskSetManager: Lost task 55.3 in stage 5.0 (TID 39651, foobar.masked-server.com, executor attempt_foobar.masked-server.com-___.masked-server.com-____0): ExecutorLostFailure (executor attempt_foobar.masked-server.com-___.masked-server.com-____0 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages. was: When a task fails it is added into the pending task list and corresponding black list policy is enforced (ie, specifying if it can/can't run on a particular node/executor/etc.) Unfortunately the ordering is such that retrying the task could assign the task to the same executor, which, incidentally could be shutting down and immediately fail the retry. Instead the black list state should be updated and then the task assigned, ensuring that the black list policy is properly enforced. > Task retry occurs on same executor due to race condition with blacklisting > -- > > Key: SPARK-21219 > URL: https://issues.apache.org/jira/browse/SPARK-21219 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.1.1 >Reporter: Eric Vandenberg >Priority: Minor > Attachments: spark_driver.log.anon, spark_executor.log.anon > > > When a task fails it is added into the pending task list and corresponding > black list policy is enforced (ie, specifying if it can/can't run on a > particular node/executor/etc.) Unfortunately the ordering is such that > retrying the task could assign the task to the same executor, which, > incidentally could be shutting down and immediately fail the retry. Instead > the black list state should be updated and then the task assigned, ensuring > that the black list policy is properly enforced. > The attached logs demonstrate the race condition. > See spark_executor.log.anon: > 1. 
Task 55.2 fails on the executor > 17/06/20 13:25:07 ERROR Executor: Exception in task 55.2 in stage 5.0 (TID > 39575) > java.lang.OutOfMemoryError: Java heap space > 2. Immediately the same executor is assigned the retry task: > 17/06/20 13:25:07 INFO CoarseGrainedExecutorBackend: Got assigned task 39651 > 17/06/20 13:25:07 INFO Executor: Running task 55.3 in stage 5.0 (TID 39651) > 3. The retry task of course fails since the executor is also shutting down > due to the original task 55.2 OOM failure. > See the spark_driver.log.anon: > The driver processes the lost task 55.2: > 17/06/20 13:25:07 WARN TaskSetManager: Lost task 55.2 in stage 5.0 (TID > 39575, foobar.masked-server.com, executor > attempt_foobar.masked-server.com-___.masked-server.com-____0): > java.lang.OutOfMemoryError: Java heap space > The driver then receives the ExecutorLostFailure for the retry task 55.3 > (although it's obfuscated in these logs, the server info is same...) > 17/06/20 13:25:10 WARN TaskSetManager: Lost
[jira] [Updated] (SPARK-21219) Task retry occurs on same executor due to race condition with blacklisting
[ https://issues.apache.org/jira/browse/SPARK-21219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Vandenberg updated SPARK-21219: Attachment: spark_executor.log.anon spark_driver.log.anon > Task retry occurs on same executor due to race condition with blacklisting > -- > > Key: SPARK-21219 > URL: https://issues.apache.org/jira/browse/SPARK-21219 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.1.1 >Reporter: Eric Vandenberg >Priority: Minor > Attachments: spark_driver.log.anon, spark_executor.log.anon > > > When a task fails it is added into the pending task list and corresponding > black list policy is enforced (ie, specifying if it can/can't run on a > particular node/executor/etc.) Unfortunately the ordering is such that > retrying the task could assign the task to the same executor, which, > incidentally could be shutting down and immediately fail the retry. Instead > the black list state should be updated and then the task assigned, ensuring > that the black list policy is properly enforced. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21219) Task retry occurs on same executor due to race condition with blacklisting
Eric Vandenberg created SPARK-21219: --- Summary: Task retry occurs on same executor due to race condition with blacklisting Key: SPARK-21219 URL: https://issues.apache.org/jira/browse/SPARK-21219 Project: Spark Issue Type: Bug Components: Scheduler Affects Versions: 2.1.1 Reporter: Eric Vandenberg Priority: Minor When a task fails it is added into the pending task list and the corresponding blacklist policy is enforced (i.e., specifying whether it can or can't run on a particular node/executor/etc.). Unfortunately, this ordering means that retrying the task could assign it to the same executor, which, incidentally, could be shutting down and would immediately fail the retry. Instead the blacklist state should be updated first and then the task assigned, ensuring that the blacklist policy is properly enforced. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21216) Streaming DataFrames fail to join with Hive tables
[ https://issues.apache.org/jira/browse/SPARK-21216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21216: Assignee: Apache Spark (was: Burak Yavuz) > Streaming DataFrames fail to join with Hive tables > -- > > Key: SPARK-21216 > URL: https://issues.apache.org/jira/browse/SPARK-21216 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.1 >Reporter: Burak Yavuz >Assignee: Apache Spark > > The following code will throw a cryptic exception: > {code} > import org.apache.spark.sql.execution.streaming.MemoryStream > import testImplicits._ > implicit val _sqlContext = spark.sqlContext > Seq((1, "one"), (2, "two"), (4, "four")).toDF("number", > "word").createOrReplaceTempView("t1") > // Make a table and ensure it will be broadcast. > sql("""CREATE TABLE smallTable(word string, number int) > |ROW FORMAT SERDE > 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' > |STORED AS TEXTFILE > """.stripMargin) > sql( > """INSERT INTO smallTable > |SELECT word, number from t1 > """.stripMargin) > val inputData = MemoryStream[Int] > val joined = inputData.toDS().toDF() > .join(spark.table("smallTable"), $"value" === $"number") > val sq = joined.writeStream > .format("memory") > .queryName("t2") > .start() > try { > inputData.addData(1, 2) > sq.processAllAvailable() > } finally { > sq.stop() > } > {code} > If someone creates a HiveSession, the planner in `IncrementalExecution` > doesn't take into account the Hive scan strategies -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21216) Streaming DataFrames fail to join with Hive tables
[ https://issues.apache.org/jira/browse/SPARK-21216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16063662#comment-16063662 ] Apache Spark commented on SPARK-21216: -- User 'brkyvz' has created a pull request for this issue: https://github.com/apache/spark/pull/18426 > Streaming DataFrames fail to join with Hive tables > -- > > Key: SPARK-21216 > URL: https://issues.apache.org/jira/browse/SPARK-21216 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.1 >Reporter: Burak Yavuz >Assignee: Burak Yavuz > > The following code will throw a cryptic exception: > {code} > import org.apache.spark.sql.execution.streaming.MemoryStream > import testImplicits._ > implicit val _sqlContext = spark.sqlContext > Seq((1, "one"), (2, "two"), (4, "four")).toDF("number", > "word").createOrReplaceTempView("t1") > // Make a table and ensure it will be broadcast. > sql("""CREATE TABLE smallTable(word string, number int) > |ROW FORMAT SERDE > 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' > |STORED AS TEXTFILE > """.stripMargin) > sql( > """INSERT INTO smallTable > |SELECT word, number from t1 > """.stripMargin) > val inputData = MemoryStream[Int] > val joined = inputData.toDS().toDF() > .join(spark.table("smallTable"), $"value" === $"number") > val sq = joined.writeStream > .format("memory") > .queryName("t2") > .start() > try { > inputData.addData(1, 2) > sq.processAllAvailable() > } finally { > sq.stop() > } > {code} > If someone creates a HiveSession, the planner in `IncrementalExecution` > doesn't take into account the Hive scan strategies -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21216) Streaming DataFrames fail to join with Hive tables
[ https://issues.apache.org/jira/browse/SPARK-21216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21216: Assignee: Burak Yavuz (was: Apache Spark) > Streaming DataFrames fail to join with Hive tables > -- > > Key: SPARK-21216 > URL: https://issues.apache.org/jira/browse/SPARK-21216 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.1 >Reporter: Burak Yavuz >Assignee: Burak Yavuz > > The following code will throw a cryptic exception: > {code} > import org.apache.spark.sql.execution.streaming.MemoryStream > import testImplicits._ > implicit val _sqlContext = spark.sqlContext > Seq((1, "one"), (2, "two"), (4, "four")).toDF("number", > "word").createOrReplaceTempView("t1") > // Make a table and ensure it will be broadcast. > sql("""CREATE TABLE smallTable(word string, number int) > |ROW FORMAT SERDE > 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' > |STORED AS TEXTFILE > """.stripMargin) > sql( > """INSERT INTO smallTable > |SELECT word, number from t1 > """.stripMargin) > val inputData = MemoryStream[Int] > val joined = inputData.toDS().toDF() > .join(spark.table("smallTable"), $"value" === $"number") > val sq = joined.writeStream > .format("memory") > .queryName("t2") > .start() > try { > inputData.addData(1, 2) > sq.processAllAvailable() > } finally { > sq.stop() > } > {code} > If someone creates a HiveSession, the planner in `IncrementalExecution` > doesn't take into account the Hive scan strategies -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21217) Support ColumnVector.Array.toArray()
[ https://issues.apache.org/jira/browse/SPARK-21217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16063631#comment-16063631 ] Apache Spark commented on SPARK-21217: -- User 'kiszk' has created a pull request for this issue: https://github.com/apache/spark/pull/18425 > Support ColumnVector.Array.toArray() > -- > > Key: SPARK-21217 > URL: https://issues.apache.org/jira/browse/SPARK-21217 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Kazuaki Ishizaki > > When {{toArray()}} in {{ColumnVector.Array}} (e.g. {{toIntArray()}}) is > called, the generic method in {{ArrayData}} is called. It is not fast since > element-wise copy is performed. > This JIRA entry implements bulk-copy for these methods in > {{ColumnVector.Array}} by using {{System.arrayCopy()}} or > {{Platform.copyMemory()}}. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
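The element-wise vs. bulk copy distinction behind SPARK-21217 is easy to picture. Below is a Python stand-in for the two code paths (the actual patch works in Java on the vector's backing storage via System.arrayCopy() / Platform.copyMemory(); these function names are invented for illustration):

```python
# Python stand-in for the two copy strategies in ColumnVector.Array.

def to_int_array_elementwise(backing, offset, length):
    # Generic ArrayData path: one iteration (in Java, one virtual call)
    # per element.
    out = []
    for i in range(length):
        out.append(backing[offset + i])
    return out

def to_int_array_bulk(backing, offset, length):
    # Bulk path: copy the whole backing region in a single operation.
    return backing[offset:offset + length]

data = list(range(10))
assert (to_int_array_elementwise(data, 2, 4)
        == to_int_array_bulk(data, 2, 4)
        == [2, 3, 4, 5])
```

Both paths return the same slice; the improvement is purely in how many operations the copy takes.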
[jira] [Assigned] (SPARK-21217) Support ColumnVector.Array.toArray()
[ https://issues.apache.org/jira/browse/SPARK-21217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21217: Assignee: (was: Apache Spark) > Support ColumnVector.Array.toArray() > -- > > Key: SPARK-21217 > URL: https://issues.apache.org/jira/browse/SPARK-21217 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Kazuaki Ishizaki > > When {{toArray()}} in {{ColumnVector.Array}} (e.g. {{toIntArray()}}) is > called, the generic method in {{ArrayData}} is called. It is not fast since > element-wise copy is performed. > This JIRA entry implements bulk-copy for these methods in > {{ColumnVector.Array}} by using {{System.arrayCopy()}} or > {{Platform.copyMemory()}}. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21217) Support ColumnVector.Array.toArray()
[ https://issues.apache.org/jira/browse/SPARK-21217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21217: Assignee: Apache Spark > Support ColumnVector.Array.toArray() > -- > > Key: SPARK-21217 > URL: https://issues.apache.org/jira/browse/SPARK-21217 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Kazuaki Ishizaki >Assignee: Apache Spark > > When {{toArray()}} in {{ColumnVector.Array}} (e.g. {{toIntArray()}}) is > called, the generic method in {{ArrayData}} is called. It is not fast since > element-wise copy is performed. > This JIRA entry implements bulk-copy for these methods in > {{ColumnVector.Array}} by using {{System.arrayCopy()}} or > {{Platform.copyMemory()}}. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21218) Convert IN predicate to equivalent Parquet filter
[ https://issues.apache.org/jira/browse/SPARK-21218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21218: Assignee: (was: Apache Spark) > Convert IN predicate to equivalent Parquet filter > - > > Key: SPARK-21218 > URL: https://issues.apache.org/jira/browse/SPARK-21218 > Project: Spark > Issue Type: Improvement > Components: Optimizer >Affects Versions: 2.1.1 >Reporter: Michael Styles > > Convert IN predicate to equivalent expression involving equality conditions > to allow the filter to be pushed down to Parquet. > For instance, > C1 IN (10, 20) is rewritten as (C1 = 10) OR (C1 = 20)
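The rewrite the ticket describes — expanding an IN list into a disjunction of equalities that the Parquet filter API can push down — can be sketched as a tiny string-level transformation (pure Python; this is an illustration of the idea, not the actual Catalyst rule):

```python
# Illustrative sketch of the IN -> OR-of-equalities rewrite. The real
# change would operate on Catalyst filter expressions, not SQL strings.
def rewrite_in(column, values):
    # C1 IN (10, 20)  ->  (C1 = 10) OR (C1 = 20)
    return " OR ".join(f"({column} = {v})" for v in values)

print(rewrite_in("C1", [10, 20]))  # (C1 = 10) OR (C1 = 20)
```

Each equality condition is individually pushable to Parquet, so the whole disjunction can be evaluated against row-group statistics.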
[jira] [Assigned] (SPARK-21218) Convert IN predicate to equivalent Parquet filter
[ https://issues.apache.org/jira/browse/SPARK-21218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21218: Assignee: Apache Spark > Convert IN predicate to equivalent Parquet filter > - > > Key: SPARK-21218 > URL: https://issues.apache.org/jira/browse/SPARK-21218 > Project: Spark > Issue Type: Improvement > Components: Optimizer >Affects Versions: 2.1.1 >Reporter: Michael Styles >Assignee: Apache Spark > > Convert IN predicate to equivalent expression involving equality conditions > to allow the filter to be pushed down to Parquet. > For instance, > C1 IN (10, 20) is rewritten as (C1 = 10) OR (C1 = 20)
[jira] [Commented] (SPARK-21218) Convert IN predicate to equivalent Parquet filter
[ https://issues.apache.org/jira/browse/SPARK-21218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16063620#comment-16063620 ] Apache Spark commented on SPARK-21218: -- User 'ptkool' has created a pull request for this issue: https://github.com/apache/spark/pull/18424 > Convert IN predicate to equivalent Parquet filter > - > > Key: SPARK-21218 > URL: https://issues.apache.org/jira/browse/SPARK-21218 > Project: Spark > Issue Type: Improvement > Components: Optimizer >Affects Versions: 2.1.1 >Reporter: Michael Styles > > Convert IN predicate to equivalent expression involving equality conditions > to allow the filter to be pushed down to Parquet. > For instance, > C1 IN (10, 20) is rewritten as (C1 = 10) OR (C1 = 20)
[jira] [Created] (SPARK-21218) Convert IN predicate to equivalent Parquet filter
Michael Styles created SPARK-21218: -- Summary: Convert IN predicate to equivalent Parquet filter Key: SPARK-21218 URL: https://issues.apache.org/jira/browse/SPARK-21218 Project: Spark Issue Type: Improvement Components: Optimizer Affects Versions: 2.1.1 Reporter: Michael Styles Convert IN predicate to equivalent expression involving equality conditions to allow the filter to be pushed down to Parquet. For instance, C1 IN (10, 20) is rewritten as (C1 = 10) OR (C1 = 20)
[jira] [Created] (SPARK-21217) Support ColumnVector.Array.toArray()
Kazuaki Ishizaki created SPARK-21217: Summary: Support ColumnVector.Array.toArray() Key: SPARK-21217 URL: https://issues.apache.org/jira/browse/SPARK-21217 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.0 Reporter: Kazuaki Ishizaki When {{toArray()}} in {{ColumnVector.Array}} (e.g. {{toIntArray()}}) is called, the generic method in {{ArrayData}} is called. This is slow because it performs an element-wise copy. This JIRA entry implements bulk-copy for these methods in {{ColumnVector.Array}} by using {{System.arraycopy()}} or {{Platform.copyMemory()}}.
[jira] [Created] (SPARK-21216) Streaming DataFrames fail to join with Hive tables
Burak Yavuz created SPARK-21216: --- Summary: Streaming DataFrames fail to join with Hive tables Key: SPARK-21216 URL: https://issues.apache.org/jira/browse/SPARK-21216 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 2.1.1 Reporter: Burak Yavuz Assignee: Burak Yavuz The following code will throw a cryptic exception: {code} import org.apache.spark.sql.execution.streaming.MemoryStream import testImplicits._ implicit val _sqlContext = spark.sqlContext Seq((1, "one"), (2, "two"), (4, "four")).toDF("number", "word").createOrReplaceTempView("t1") // Make a table and ensure it will be broadcast. sql("""CREATE TABLE smallTable(word string, number int) |ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' |STORED AS TEXTFILE """.stripMargin) sql( """INSERT INTO smallTable |SELECT word, number from t1 """.stripMargin) val inputData = MemoryStream[Int] val joined = inputData.toDS().toDF() .join(spark.table("smallTable"), $"value" === $"number") val sq = joined.writeStream .format("memory") .queryName("t2") .start() try { inputData.addData(1, 2) sq.processAllAvailable() } finally { sq.stop() } {code} If someone creates a HiveSession, the planner in `IncrementalExecution` doesn't take into account the Hive scan strategies
[jira] [Commented] (SPARK-21158) SparkSQL function SparkSession.Catalog.ListTables() does not handle spark setting for case-sensitivity
[ https://issues.apache.org/jira/browse/SPARK-21158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16063561#comment-16063561 ] Apache Spark commented on SPARK-21158: -- User 'cammachusa' has created a pull request for this issue: https://github.com/apache/spark/pull/18423 > SparkSQL function SparkSession.Catalog.ListTables() does not handle spark > setting for case-sensitivity > -- > > Key: SPARK-21158 > URL: https://issues.apache.org/jira/browse/SPARK-21158 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 > Environment: Windows 10 > IntelliJ > Scala >Reporter: Kathryn McClintic >Priority: Minor > Labels: easyfix, features, sparksql, windows > Original Estimate: 24h > Remaining Estimate: 24h > > When working with SQL table names in Spark SQL we have noticed some issues > with case-sensitivity. > If you set spark.sql.caseSensitive setting to be true, SparkSQL stores the > table names in the way it was provided. This is correct. > If you set spark.sql.caseSensitive setting to be false, SparkSQL stores the > table names in lower case. > Then, we use the function sqlContext.tableNames() to get all the tables in > our DB. We check if this list contains(<"string of table name">) to determine > if we have already created a table. If case-sensitivity is turned off > (false), this function should look if the table name is contained in the > table list regardless of case. > However, it tries to look for only ones that match the lower case version of > the stored table. Therefore, if you pass in a camel or upper case table name, > this function would return false when in fact the table does exist. 
> The root cause of this issue is in the function > SparkSession.Catalog.ListTables() > For example: > In your SQL context - you have five tables and you have chosen to have > spark.sql.caseSensitive=false so it stores your tables in lowercase: > carnames > carmodels > carnamesandmodels > users > dealerlocations > When running your pipeline, you want to see if you have already created the > temp join table of 'carnamesandmodels'. However, you have stored it as a > constant which reads: CarNamesAndModels for readability. > So you can use the function > sqlContext.tableNames().contains("CarNamesAndModels"). > This should return true - because we know it's already created, but it will > currently return false since CarNamesAndModels is not in lowercase. > The responsibility to change the name passed into the .contains method to be > lowercase should not be put on the spark user. This should be done by spark > sql if case-sensitivity is turned to false. > Proposed solutions: > - Setting case sensitive in the sql context should make the sql context > be agnostic to case but not change the storage of the table > - There should be a custom contains method for ListTables() which converts > the table name to be lowercase before checking > - SparkSession.Catalog.ListTables() should return the list of tables in the > input format instead of in all lowercase.
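The behavior the reporter asks for — a membership check that ignores case when case sensitivity is off — can be sketched in a few lines (plain Python; the table names come from the report's example, and the function name is illustrative, not a Spark API):

```python
# Illustrative sketch of a case-insensitive table lookup, mirroring the
# reporter's proposed fix. contains_table is a hypothetical helper, not
# part of SparkSession.catalog.
stored_tables = ["carnames", "carmodels", "carnamesandmodels",
                 "users", "dealerlocations"]

def contains_table(tables, name, case_sensitive=False):
    if case_sensitive:
        return name in tables
    # When case sensitivity is off, normalize both sides before comparing.
    return name.lower() in (t.lower() for t in tables)

# The camel-case constant from the report now matches the stored name.
assert contains_table(stored_tables, "CarNamesAndModels")
assert not contains_table(stored_tables, "CarNamesAndModels", case_sensitive=True)
```

This is the second of the proposed solutions: normalize inside the lookup rather than asking the user to lowercase names themselves.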
[jira] [Assigned] (SPARK-21158) SparkSQL function SparkSession.Catalog.ListTables() does not handle spark setting for case-sensitivity
[ https://issues.apache.org/jira/browse/SPARK-21158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21158: Assignee: (was: Apache Spark) > SparkSQL function SparkSession.Catalog.ListTables() does not handle spark > setting for case-sensitivity > -- > > Key: SPARK-21158 > URL: https://issues.apache.org/jira/browse/SPARK-21158 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 > Environment: Windows 10 > IntelliJ > Scala >Reporter: Kathryn McClintic >Priority: Minor > Labels: easyfix, features, sparksql, windows > Original Estimate: 24h > Remaining Estimate: 24h > > When working with SQL table names in Spark SQL we have noticed some issues > with case-sensitivity. > If you set spark.sql.caseSensitive setting to be true, SparkSQL stores the > table names in the way it was provided. This is correct. > If you set spark.sql.caseSensitive setting to be false, SparkSQL stores the > table names in lower case. > Then, we use the function sqlContext.tableNames() to get all the tables in > our DB. We check if this list contains(<"string of table name">) to determine > if we have already created a table. If case-sensitivity is turned off > (false), this function should look if the table name is contained in the > table list regardless of case. > However, it tries to look for only ones that match the lower case version of > the stored table. Therefore, if you pass in a camel or upper case table name, > this function would return false when in fact the table does exist. > The root cause of this issue is in the function > SparkSession.Catalog.ListTables() > For example: > In your SQL context - you have five tables and you have chosen to have > spark.sql.caseSensitive=false so it stores your tables in lowercase: > carnames > carmodels > carnamesandmodels > users > dealerlocations > When running your pipeline, you want to see if you have already created the > temp join table of 'carnamesandmodels'.
However, you have stored it as a > constant which reads: CarNamesAndModels for readability. > So you can use the function > sqlContext.tableNames().contains("CarNamesAndModels"). > This should return true - because we know it's already created, but it will > currently return false since CarNamesAndModels is not in lowercase. > The responsibility to change the name passed into the .contains method to be > lowercase should not be put on the spark user. This should be done by spark > sql if case-sensitivity is turned to false. > Proposed solutions: > - Setting case sensitive in the sql context should make the sql context > be agnostic to case but not change the storage of the table > - There should be a custom contains method for ListTables() which converts > the table name to be lowercase before checking > - SparkSession.Catalog.ListTables() should return the list of tables in the > input format instead of in all lowercase.
[jira] [Assigned] (SPARK-21158) SparkSQL function SparkSession.Catalog.ListTables() does not handle spark setting for case-sensitivity
[ https://issues.apache.org/jira/browse/SPARK-21158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21158: Assignee: Apache Spark > SparkSQL function SparkSession.Catalog.ListTables() does not handle spark > setting for case-sensitivity > -- > > Key: SPARK-21158 > URL: https://issues.apache.org/jira/browse/SPARK-21158 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 > Environment: Windows 10 > IntelliJ > Scala >Reporter: Kathryn McClintic >Assignee: Apache Spark >Priority: Minor > Labels: easyfix, features, sparksql, windows > Original Estimate: 24h > Remaining Estimate: 24h > > When working with SQL table names in Spark SQL we have noticed some issues > with case-sensitivity. > If you set spark.sql.caseSensitive setting to be true, SparkSQL stores the > table names in the way it was provided. This is correct. > If you set spark.sql.caseSensitive setting to be false, SparkSQL stores the > table names in lower case. > Then, we use the function sqlContext.tableNames() to get all the tables in > our DB. We check if this list contains(<"string of table name">) to determine > if we have already created a table. If case-sensitivity is turned off > (false), this function should look if the table name is contained in the > table list regardless of case. > However, it tries to look for only ones that match the lower case version of > the stored table. Therefore, if you pass in a camel or upper case table name, > this function would return false when in fact the table does exist. 
> The root cause of this issue is in the function > SparkSession.Catalog.ListTables() > For example: > In your SQL context - you have five tables and you have chosen to have > spark.sql.caseSensitive=false so it stores your tables in lowercase: > carnames > carmodels > carnamesandmodels > users > dealerlocations > When running your pipeline, you want to see if you have already created the > temp join table of 'carnamesandmodels'. However, you have stored it as a > constant which reads: CarNamesAndModels for readability. > So you can use the function > sqlContext.tableNames().contains("CarNamesAndModels"). > This should return true - because we know it's already created, but it will > currently return false since CarNamesAndModels is not in lowercase. > The responsibility to change the name passed into the .contains method to be > lowercase should not be put on the spark user. This should be done by spark > sql if case-sensitivity is turned to false. > Proposed solutions: > - Setting case sensitive in the sql context should make the sql context > be agnostic to case but not change the storage of the table > - There should be a custom contains method for ListTables() which converts > the table name to be lowercase before checking > - SparkSession.Catalog.ListTables() should return the list of tables in the > input format instead of in all lowercase.
[jira] [Commented] (SPARK-20889) SparkR grouped documentation for Column methods
[ https://issues.apache.org/jira/browse/SPARK-20889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16063552#comment-16063552 ] Apache Spark commented on SPARK-20889: -- User 'actuaryzhang' has created a pull request for this issue: https://github.com/apache/spark/pull/18422 > SparkR grouped documentation for Column methods > --- > > Key: SPARK-20889 > URL: https://issues.apache.org/jira/browse/SPARK-20889 > Project: Spark > Issue Type: Documentation > Components: SparkR >Affects Versions: 2.1.1 >Reporter: Wayne Zhang >Assignee: Wayne Zhang > Labels: documentation > Fix For: 2.3.0 > > > Group the documentation of individual methods defined for the Column class. > This aims to create the following improvements: > - Centralized documentation for easy navigation (user can view multiple > related methods on one single page). > - Reduced number of items in Seealso. > - Better examples using shared data. This avoids creating a data frame for > each function if they are documented separately. And more importantly, user > can copy and paste to run them directly! > - Cleaner structure and much fewer Rd files (remove a large number of Rd > files). > - Remove duplicated definition of param (since they share exactly the same > argument). > - No need to write meaningless examples for trivial functions (because of > grouping).
[jira] [Commented] (SPARK-21190) SPIP: Vectorized UDFs in Python
[ https://issues.apache.org/jira/browse/SPARK-21190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16063515#comment-16063515 ] Reynold Xin commented on SPARK-21190: - [~icexelloss] Thanks. Your proposal brings up a good point, which is whether we want to allow users to see the entire partition at a time. As demonstrated by your example, this can be a very powerful tool (especially for leveraging the time series analysis features available in Pandas). The primary challenge there is that a partition can be very large, and it is possible to run out of memory when using Pandas. Any thoughts on how we should handle those cases? > SPIP: Vectorized UDFs in Python > --- > > Key: SPARK-21190 > URL: https://issues.apache.org/jira/browse/SPARK-21190 > Project: Spark > Issue Type: New Feature > Components: PySpark, SQL >Affects Versions: 2.2.0 >Reporter: Reynold Xin >Assignee: Reynold Xin > Labels: SPIP > Attachments: SPIPVectorizedUDFsforPython (1).pdf > > > *Background and Motivation* > Python is one of the most popular programming languages among Spark users. > Spark currently exposes a row-at-a-time interface for defining and executing > user-defined functions (UDFs). This introduces high overhead in serialization > and deserialization, and also makes it difficult to leverage Python libraries > (e.g. numpy, Pandas) that are written in native code. > > This proposal advocates introducing new APIs to support vectorized UDFs in > Python, in which a block of data is transferred over to Python in some > columnar format for execution. > > > *Target Personas* > Data scientists, data engineers, library developers. > > *Goals* > - Support vectorized UDFs that apply on chunks of the data frame > - Low system overhead: Substantially reduce serialization and deserialization > overhead when compared with row-at-a-time interface > - UDF performance: Enable users to leverage native libraries in Python (e.g.
> numpy, Pandas) for data manipulation in these UDFs > > *Non-Goals* > The following are explicitly out of scope for the current SPIP, and should be > done in future SPIPs. Nonetheless, it would be good to consider these future > use cases during API design, so we can achieve some consistency when rolling > out new APIs. > > - Define block oriented UDFs in other languages (that are not Python). > - Define aggregate UDFs > - Tight integration with machine learning frameworks > > *Proposed API Changes* > The following sketches some possibilities. I haven’t spent a lot of time > thinking about the API (wrote it down in 5 mins) and I am not attached to > this design at all. The main purpose of the SPIP is to get feedback on use > cases and see how they can impact API design. > > Two things to consider are: > > 1. Python is dynamically typed, whereas DataFrames/SQL requires static, > analysis time typing. This means users would need to specify the return type > of their UDFs. > > 2. Ratio of input rows to output rows. We propose initially we require number > of output rows to be the same as the number of input rows. In the future, we > can consider relaxing this constraint with support for vectorized aggregate > UDFs. > > Proposed API sketch (using examples): > > Use case 1. A function that defines all the columns of a DataFrame (similar > to a “map” function): > > {code} > @spark_udf(some way to describe the return schema) > def my_func_on_entire_df(input): > """ Some user-defined function. > > :param input: A Pandas DataFrame with two columns, a and b. > :return: :class: A Pandas data frame. > """ > input['c'] = input['a'] + input['b'] > input['d'] = input['a'] - input['b'] > return input > > spark.range(1000).selectExpr("id a", "id / 2 b") > .mapBatches(my_func_on_entire_df) > {code} > > Use case 2.
A function that defines only one column (similar to existing > UDFs): > > {code} > @spark_udf(some way to describe the return schema) > def my_func_that_returns_one_column(input): > """ Some user-defined function. > > :param input: A Pandas DataFrame with two columns, a and b. > :return: :class: A numpy array > """ > return input['a'] + input['b'] > > my_func = udf(my_func_that_returns_one_column) > > df = spark.range(1000).selectExpr("id a", "id / 2 b") > df.withColumn("c", my_func(df.a, df.b)) > {code} > > > > *Optional Design Sketch* > I’m more concerned about getting proper feedback for API design. The > implementation should be pretty straightforward and is not a huge concern at > this point. We can leverage the same implementation for faster toPandas > (using Arrow). > > > *Optional Rejected Designs* > See above. > > > >
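The core contrast the SPIP draws — one Python call per row versus one call per batch — can be shown without Spark at all (pure Python with lists standing in for the Pandas/Arrow batches of the proposal; the function names are illustrative, not the proposed API):

```python
# Illustrative sketch: row-at-a-time vs. vectorized (batched) UDF dispatch.
# In Spark the batch would be a Pandas/Arrow chunk; plain lists suffice here.
def row_udf(a, b):
    # Invoked once per row: N rows means N Python calls plus
    # per-row serialization in the real row-at-a-time interface.
    return a + b

def vectorized_udf(col_a, col_b):
    # Invoked once per batch; output has the same number of rows as the
    # input, matching the SPIP's initial 1:1 row-count constraint.
    return [a + b for a, b in zip(col_a, col_b)]

col_a = list(range(1000))
col_b = [x / 2 for x in col_a]

per_row = [row_udf(a, b) for a, b in zip(col_a, col_b)]  # 1000 calls
batched = vectorized_udf(col_a, col_b)                   # 1 call
assert per_row == batched
```

The savings come from amortizing the Python-call and serialization overhead over the whole batch, which is also what lets the UDF body use native-code libraries like numpy.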
[jira] [Comment Edited] (SPARK-21198) SparkSession catalog is terribly slow
[ https://issues.apache.org/jira/browse/SPARK-21198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16063406#comment-16063406 ] Saif Addin edited comment on SPARK-21198 at 6/26/17 5:56 PM: - Okay, I think there is something odd somewhere in between. It may be hard to tackle the issue but, I'll start slowly, line by line. (*spark-submit*, local[8]) 1. This line, gets stuck forever and the program doesn't continue after waiting 2 minutes (spark-task is stuck in collect()) {code:java} import spark.implicits._ private val databases: Array[String] = REFERENCEDB +: (spark.catalog.listDatabases.map(_.name).filter(_ != blacklistdb).collect) println((new Date().getTime - d2)/1000.0) {code} 2. Changing such line to (notice filter after collect) takes 8ms {code:java} import spark.implicits._ private val databases: Array[String] = REFERENCEDB +: (spark.catalog.listDatabases.map(_.name).collect.filterNot(_ == blacklistdb)) {code} 3. The following line, takes instead 2ms {code:java} private val databases: Array[String] = (REFERENCEDB +: spark.sql("show databases").collect.map(_.getString(0))).filterNot(_ == blacklistdb) {code} Here's the weirdest of all, if I instead start *spark-shell* and do item number 1, it works (takes 1ms which is even faster than other lines in my *spark-submit.* the other lines are also faster down to 1ms as well) was (Author: revolucion09): Okay, I think there is something odd somewhere in between. It may be hard to tackle the issue but, I'll start slowly, line by line. (*spark-submit*, local[8]) 1. This line, gets stuck forever and the program doesn't continue after waiting 2 minutes (spark-task is stuck in collect()) {code:java} import spark.implicits._ private val databases: Array[String] = RREFERENCEDB +: (spark.catalog.listDatabases.map(_.name).collect.filter(_ != blacklistdb)) {code} 2.
Changing such line to (notice filter after collect) takes 8ms {code:java} import spark.implicits._ private val databases: Array[String] = REFERENCEDB +: (spark.catalog.listDatabases.map(_.name).collect.filterNot(_ == blacklistdb)) {code} 3. The following line, takes instead 2ms {code:java} private val databases: Array[String] = (REFERENCEDB +: spark.sql("show databases").collect.map(_.getString(0))).filterNot(_ == blacklistdb) {code} Here's the weirdest of all, if I instead start *spark-shell* and do item number 1, it works (takes 1ms which is even faster than other lines in my *spark-submit.* the other lines are also faster down to 1ms as well) > SparkSession catalog is terribly slow > - > > Key: SPARK-21198 > URL: https://issues.apache.org/jira/browse/SPARK-21198 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Saif Addin > > We have a considerably large Hive metastore and a Spark program that goes > through Hive data availability. > In spark 1.x, we were using sqlContext.tableNames, sqlContext.sql() and > sqlContext.isCached() to go through Hive metastore information. > Once migrated to spark 2.x we switched over to SparkSession.catalog instead, but > it turns out that both listDatabases() and listTables() take between 5 to 20 > minutes depending on the database to return results, using operations such as > the following one: > spark.catalog.listTables(db).filter(__.isTemporary).map(__.name).collect > and made the program unbearably slow to return a list of tables. > I know we still have spark.sqlContext.tableNames as a workaround but I am > assuming this is going to be deprecated anytime soon?
[jira] [Comment Edited] (SPARK-21198) SparkSession catalog is terribly slow
[ https://issues.apache.org/jira/browse/SPARK-21198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16063406#comment-16063406 ] Saif Addin edited comment on SPARK-21198 at 6/26/17 5:55 PM: - Okay, I think there is something odd somewhere in between. It may be hard to tackle the issue but, I'll start slowly, line by line. (*spark-submit*, local[8]) 1. This line, gets stuck forever and the program doesn't continue after waiting 2 minutes (spark-task is stuck in collect()) {code:java} import spark.implicits._ private val databases: Array[String] = RREFERENCEDB +: (spark.catalog.listDatabases.map(_.name).collect.filter(_ != blacklistdb)) {code} 2. Changing such line to (notice filter after collect) takes 8ms {code:java} import spark.implicits._ private val databases: Array[String] = REFERENCEDB +: (spark.catalog.listDatabases.map(_.name).collect.filterNot(_ == blacklistdb)) {code} 3. The following line, takes instead 2ms {code:java} private val databases: Array[String] = (REFERENCEDB +: spark.sql("show databases").collect.map(_.getString(0))).filterNot(_ == blacklistdb) {code} Here's the weirdest of all, if I instead start *spark-shell* and do item number 1, it works (takes 1ms which is even faster than other lines in my *spark-submit.* the other lines are also faster down to 1ms as well) was (Author: revolucion09): Okay, I think there is something odd somewhere in between. It may be hard to tackle the issue but, I'll start slowly, line by line. (*spark-submit*, local[8]) 1. This line, gets stuck forever and the program doesn't continue after waiting 2 minutes (spark-task is stuck in collect()) {code:java} import spark.implicits._ private val databases: Array[String] = REFERENCEDB +: (spark.catalog.listDatabases.filter(_ != blacklistdb).map(_.name).collect) {code} 2.
Changing such line to (notice filter after collect) takes 8ms {code:java} import spark.implicits._ private val databases: Array[String] = REFERENCEDB +: (spark.catalog.listDatabases.map(_.name).collect.filterNot(_ == blacklistdb)) {code} 3. The following line, takes instead 2ms {code:java} private val databases: Array[String] = (REFERENCEDB +: spark.sql("show databases").collect.map(_.getString(0))).filterNot(_ == blacklistdb) {code} Here's the weirdest of all, if I instead start *spark-shell* and do item number 1, it works (takes 1ms which is even faster than other lines in my *spark-submit.* the other lines are also faster down to 1ms as well) > SparkSession catalog is terribly slow > - > > Key: SPARK-21198 > URL: https://issues.apache.org/jira/browse/SPARK-21198 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Saif Addin > > We have a considerably large Hive metastore and a Spark program that goes > through Hive data availability. > In spark 1.x, we were using sqlContext.tableNames, sqlContext.sql() and > sqlContext.isCached() to go through Hive metastore information. > Once migrated to spark 2.x we switched over to SparkSession.catalog instead, but > it turns out that both listDatabases() and listTables() take between 5 to 20 > minutes depending on the database to return results, using operations such as > the following one: > spark.catalog.listTables(db).filter(__.isTemporary).map(__.name).collect > and made the program unbearably slow to return a list of tables. > I know we still have spark.sqlContext.tableNames as a workaround but I am > assuming this is going to be deprecated anytime soon?
[jira] [Commented] (SPARK-21214) Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve '
[ https://issues.apache.org/jira/browse/SPARK-21214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16063462#comment-16063462 ] Michael Kunkel commented on SPARK-21214: Greetings, Forget my last email. BR MK
Michael C. Kunkel, USMC, PhD, Forschungszentrum Jülich, Nuclear Physics Institute and Juelich Center for Hadron Physics, Experimental Hadron Structure (IKP-1), www.fz-juelich.de/ikp
> Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve '
> ---
>
> Key: SPARK-21214
> URL: https://issues.apache.org/jira/browse/SPARK-21214
> Project: Spark
> Issue Type: Bug
> Components: Java API, Spark Core, SQL
> Affects Versions: 2.1.1
> Environment: macOSX
> Reporter: Michael Kunkel
>
> First Spark project.
> I have a Java method that returns a Dataset<Row>. I want to convert this to a Dataset<StatusChangeDB>. I have created a POJO StatusChangeDB.java and coded it with all the query objects found in the MySQL table.
> I then create an Encoder<StatusChangeDB> and convert the Dataset<Row> to a Dataset<StatusChangeDB>. However, when I try to .show() the values of the Dataset<StatusChangeDB> I receive the error:
{code}
Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve '`hvpinid_quad`' given input columns: [status_change_type, superLayer, loclayer, sector, locwire];
	at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:86)
	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:83)
	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:290)
	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:290)
	at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
	at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:289)
	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:287)
	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:287)
	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:307)
	at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
	at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:305)
	at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:287)
	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:287)
	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:287)
	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4$$anonfun$apply$10.apply(TreeNode.scala:324)
	at scala.collection.MapLike$MappedValues$$anonfun$iterator$3.apply(MapLike.scala:246)
	at scala.collection.MapLike$MappedValues$$anonfun$iterator$3.apply(MapLike.scala:246)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
	at scala.collection.Iterator$class.foreach(Iterator.scala:893)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
	at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
	at scala.collection.IterableLike$$anon$1.foreach(IterableLike.scala:311)
	at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
	at scala.collection.mutable.MapBuilder.$plus$plus$eq(MapBuilder.scala:25)
	at scala.collection.TraversableViewLike$class.force(TraversableViewLike.scala:88)
	at scala.collection.IterableLike$$anon$1.force(IterableLike.scala:311)
	...
{code}
[jira] [Commented] (SPARK-21214) Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve '
[ https://issues.apache.org/jira/browse/SPARK-21214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16063461#comment-16063461 ] Michael Kunkel commented on SPARK-21214: Greetings, Would you please inform me on the location of the mailing list? BR MK
[jira] [Comment Edited] (SPARK-21198) SparkSession catalog is terribly slow
[ https://issues.apache.org/jira/browse/SPARK-21198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16063431#comment-16063431 ] Saif Addin edited comment on SPARK-21198 at 6/26/17 5:24 PM:
-
Regarding listTables, here is the code used inside the program:
{code:java}
println(s"processing all tables for every db. db length is: ${databases.tail.length}")
for (d <- databases.tail) {
  val d1 = new Date().getTime
  val dbs = spark.sqlContext.tables(d).filter("isTemporary = false").select("tableName").collect.map(_.getString(0))
  println("Processed tables in DB using sqlContext. Time: " + ((new Date().getTime - d1) / 1000.0) + s" seconds. ${dbs.length} tables")
  val d2 = new Date().getTime
  val dbs2 = spark.catalog.listTables(d).filter(!_.isTemporary).map(_.name).collect
  println("Processed tables in DB using catalog. Time: " + ((new Date().getTime - d2) / 1000.0) + s" seconds. ${dbs2.length} tables")
  // other stuff
}
{code}
and the timings are as follows:
{code:java}
processing all tables for every db. db length is: 30
Processed tables in DB using sqlContext. Time: 0.863 seconds. 19 tables
Processed tables in DB using catalog. Time: {color:red}6.978 seconds{color}. 19 tables
Processed tables in DB using sqlContext. Time: 0.276 seconds. 607 tables
Processed tables in DB using catalog. Time: {color:red}194.501 seconds{color}. 607 tables
Processed tables in DB using sqlContext. Time: 0.243 seconds. 55 tables
Processed tables in DB using catalog. Time: {color:red}17.907 seconds{color}. 55 tables
Processed tables in DB using sqlContext. Time: 0.238 seconds. 13 tables
Processed tables in DB using catalog. Time: {color:red}4.642 seconds{color}. 13 tables
Processed tables in DB using sqlContext. Time: 0.238 seconds. 392 tables
Processed tables in DB using catalog. Time: {color:red}126.999 seconds{color}. 392 tables
... goes on ...
{code}
In a stand-alone spark-shell, as opposed to a full-program spark-submit:
{code:java}
import java.util.Date
val dbs = spark.catalog.listDatabases.map(_.name).collect
for (d <- dbs) {
  val d1 = new Date().getTime
  val tables1 = spark.sqlContext.tables(d).filter("isTemporary = false").select("tableName").collect.map(_.getString(0))
  println("Processed tables in DB using sqlContext. Time: " + ((new Date().getTime - d1) / 1000.0) + s" seconds. ${tables1.length} tables")
  val d2 = new Date().getTime
  val tables2 = spark.catalog.listTables(d).filter(!_.isTemporary).map(_.name).collect
  println("Processed tables in DB using catalog. Time: " + ((new Date().getTime - d2) / 1000.0) + s" seconds. ${tables2.length} tables")
}
{code}
{code:java}
Processed tables in DB using sqlContext. Time: 0.59 seconds. 19 tables
Processed tables in DB using catalog. Time: {color:red}6.285 seconds{color}. 19 tables
Processed tables in DB using sqlContext. Time: 0.276 seconds. 608 tables
Processed tables in DB using catalog. Time: {color:red}201.295 seconds{color}. 608 tables
Processed tables in DB using sqlContext. Time: 0.241 seconds. 55 tables
... goes on; timings similar
{code}
So, apart from the weird listDatabases issue, listTables is consistently slow.
> SparkSession catalog is terribly slow
> ---
>
> Key: SPARK-21198
> URL: https://issues.apache.org/jira/browse/SPARK-21198
[jira] [Commented] (SPARK-21215) Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve
[ https://issues.apache.org/jira/browse/SPARK-21215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16063453#comment-16063453 ] Michael Kunkel commented on SPARK-21215: [~srowen] I am new to this board, so I do not understand how to see the resolution.
> Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve
> ---
>
> Key: SPARK-21215
> URL: https://issues.apache.org/jira/browse/SPARK-21215
> Project: Spark
> Issue Type: Bug
> Components: Java API, SQL
> Affects Versions: 2.1.1
> Environment: macOSX
> Reporter: Michael Kunkel
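The "cannot resolve '`hvpinid_quad`'" failure reported above is what the analyzer raises when a bean property has no matching column in the Dataset<Row>. A hedged sketch of the usual remedy (the bean class and column names here are illustrative stand-ins for StatusChangeDB, not the reporter's actual code): make sure every bean property exists as an identically named column before applying the bean encoder:

```scala
import scala.beans.BeanProperty
import org.apache.spark.sql.{Encoders, SparkSession}
import org.apache.spark.sql.functions.lit

// Illustrative Java-bean-style class standing in for StatusChangeDB.
class StatusChange extends Serializable {
  @BeanProperty var sector: Int = 0
  @BeanProperty var hvpinidQuad: Int = 0
}

val spark = SparkSession.builder.appName("bean-encoder-sketch").getOrCreate()
import spark.implicits._

// A DataFrame that is missing the hvpinidQuad column, mirroring the report.
val rows = Seq(1, 2).toDF("sector")

// Encoders.bean matches bean properties to columns by name; a property with
// no matching column triggers the AnalysisException. Adding (or renaming to)
// the expected column first lets the conversion succeed.
val typed = rows
  .withColumn("hvpinidQuad", lit(0))
  .as(Encoders.bean(classOf[StatusChange]))
typed.show()
```

The same applies in the Java API: the column names produced by the JDBC query must match the POJO's getter/setter names exactly, or be renamed with `withColumnRenamed` before `.as(...)`.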
[jira] [Commented] (SPARK-21183) Unable to return Google BigQuery INTEGER data type into Spark via google BigQuery JDBC driver: java.sql.SQLDataException: [Simba][JDBC](10140) Error converting value t
[ https://issues.apache.org/jira/browse/SPARK-21183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16063448#comment-16063448 ] Matthew Walton commented on SPARK-21183: Hi Sean, I think the issue is Spark is not handling the integer data types properly and this causes a conversion error. Running a simple select against the same data with a tool like SQuirrel works fine to fetch back integer data but Spark seems to be using a different method to fetch. I'm not sure if you have access to Google BigQuery but you can easily reproduce this issue. If not, please take a look at the other ticket for Hive JDBC drivers as I think you probably have access to Hive for testing. If this is asking too much I guess I'll try to log this and the other issue with Simba and let them pursue the issues more. Essentially, I would not expect an error on handling the integer type with Spark especially with the driver that Google expects its customers to use. Thanks, Matt > Unable to return Google BigQuery INTEGER data type into Spark via google > BigQuery JDBC driver: java.sql.SQLDataException: [Simba][JDBC](10140) Error > converting value to long. > -- > > Key: SPARK-21183 > URL: https://issues.apache.org/jira/browse/SPARK-21183 > Project: Spark > Issue Type: Bug > Components: Spark Shell, SQL >Affects Versions: 1.6.0, 2.0.0, 2.1.1 > Environment: OS: Linux > Spark version 2.1.1 > JDBC: Download the latest google BigQuery JDBC Driver from Google >Reporter: Matthew Walton > > I'm trying to fetch back data in Spark using a JDBC connection to Google > BigQuery. Unfortunately, when I try to query data that resides in an INTEGER > column I get the following error: > java.sql.SQLDataException: [Simba][JDBC](10140) Error converting value to > long. 
> Steps to reproduce:
> 1) On the Google BigQuery console, create a simple table with an INT column and insert some data
> 2) Copy the Google BigQuery JDBC driver to the machine where you will run Spark Shell
> 3) Start Spark shell, loading the Google BigQuery JDBC driver jar files:
{code}
./spark-shell --jars /home/ec2-user/jdbc/gbq/GoogleBigQueryJDBC42.jar,/home/ec2-user/jdbc/gbq/google-api-client-1.22.0.jar,/home/ec2-user/jdbc/gbq/google-api-services-bigquery-v2-rev320-1.22.0.jar,/home/ec2-user/jdbc/gbq/google-http-client-1.22.0.jar,/home/ec2-user/jdbc/gbq/google-http-client-jackson2-1.22.0.jar,/home/ec2-user/jdbc/gbq/google-oauth-client-1.22.0.jar,/home/ec2-user/jdbc/gbq/jackson-core-2.1.3.jar
{code}
> 4) In Spark shell, load the data from Google BigQuery using the JDBC driver:
{code:java}
val gbq = spark.read.format("jdbc")
  .options(Map(
    "url" -> "jdbc:bigquery://https://www.googleapis.com/bigquery/v2;ProjectId=your-project-name-here;OAuthType=0;OAuthPvtKeyPath=/usr/lib/spark/YourProjectPrivateKey.json;OAuthServiceAcctEmail=YourEmail@gmail.com;AllowLargeResults=1;LargeResultDataset=_bqodbc_temp_tables;LargeResultTable=_matthew;Timeout=600",
    "dbtable" -> "test.lu_test_integer"))
  .option("driver", "com.simba.googlebigquery.jdbc42.Driver")
  .option("user", "")
  .option("password", "")
  .load()
{code}
> 5) In Spark shell, try to display the data: gbq.show()
> At this point you should see the error:
{code}
scala> gbq.show()
17/06/22 19:34:57 WARN TaskSetManager: Lost task 0.0 in stage 3.0 (TID 6, ip-172-31-37-165.ec2.internal, executor 3): java.sql.SQLDataException: [Simba][JDBC](10140) Error converting value to long.
	at com.simba.exceptions.ExceptionConverter.toSQLException(Unknown Source)
	at com.simba.utilities.conversion.TypeConverter.toLong(Unknown Source)
	at com.simba.jdbc.common.SForwardResultSet.getLong(Unknown Source)
	at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$8.apply(JdbcUtils.scala:365)
	at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$8.apply(JdbcUtils.scala:364)
	at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:286)
	at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:268)
	at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
	at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
	...
{code}
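A workaround sometimes used while a driver-side conversion issue is pursued with the vendor (a hedged sketch, untested against BigQuery here: `int_col` is an illustrative column name, `bigQueryJdbcUrl` stands for the URL from the reproduction steps, and the CAST syntax assumes the driver passes it through to BigQuery) is to push a cast into the `dbtable` subquery so the JDBC result set never asks `getLong` to convert the problematic value:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("gbq-cast-sketch").getOrCreate()
val bigQueryJdbcUrl = "jdbc:bigquery://..." // same URL as in the reproduction steps

// The JDBC source accepts a parenthesized subquery as dbtable; casting the
// INTEGER column there changes the type the driver reports to Spark.
val gbqCast = spark.read.format("jdbc")
  .option("url", bigQueryJdbcUrl)
  .option("driver", "com.simba.googlebigquery.jdbc42.Driver")
  .option("dbtable", "(SELECT CAST(int_col AS STRING) AS int_col FROM test.lu_test_integer)")
  .load()
```

Whether this helps depends on how the Simba driver maps the cast result; it narrows the problem to the driver's type reporting rather than fixing it.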
[jira] [Comment Edited] (SPARK-21199) Its not possible to impute Vector types
[ https://issues.apache.org/jira/browse/SPARK-21199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16063416#comment-16063416 ] Franklyn Dsouza edited comment on SPARK-21199 at 6/26/17 5:16 PM:
-
For this particular scenario I have a table with two columns: one is a string `document_type` and the other is an array of tokens for the document. I want to do a TF-IDF on these tokens. The IDF needs to be done per `document_type`, so I pivot on `document_type` and then do the IDF on the TF vectors. This pivoting introduces nulls for missing columns that need to be imputed. I can't impute the array type either, and fixing it at the token generation step would involve a lot of left joins to align various data sources.
> Its not possible to impute Vector types
> ---
>
> Key: SPARK-21199
> URL: https://issues.apache.org/jira/browse/SPARK-21199
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.0.0, 2.1.1
> Reporter: Franklyn Dsouza
>
> There are cases where nulls end up in vector columns in dataframes. Currently there is no way to fill in these nulls because it's not possible to create a literal vector column expression using lit().
> Also the entire pyspark ml api will fail when it encounters nulls, so this makes it hard to work with the data.
> I think that either vector support should be added to the imputer, or vectors should be supported in column expressions so they can be used in a coalesce.
> [~mlnick]
-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
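Since `lit()` cannot build a vector literal, a commonly used interim workaround (a hedged sketch: the column name `features`, the vector size, and the input DataFrame `pivoted` are illustrative assumptions, not from the report) is to produce the default vector from a zero-argument UDF, which gives `coalesce` a vector-typed fallback expression:

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{coalesce, col, udf}

val spark = SparkSession.builder.appName("vector-fill-sketch").getOrCreate()

// Assumed: `pivoted` is the post-pivot DataFrame with a nullable
// vector column named "features".
val pivoted = spark.table("pivoted_tf") // illustrative source

// lit() cannot construct a VectorUDT literal, but a UDF can return one;
// coalesce then substitutes the empty vector wherever "features" is null.
val numFeatures = 3 // illustrative vector size; must match the TF vectors
val emptyVector = udf(() =>
  Vectors.sparse(numFeatures, Array.empty[Int], Array.empty[Double]): Vector)

val filled = pivoted.withColumn("features", coalesce(col("features"), emptyVector()))
```

This only patches the nulls before the ML stages run; adding vector support to the Imputer or to `lit()` itself, as the ticket proposes, would make the workaround unnecessary.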
[jira] [Comment Edited] (SPARK-21199) Its not possible to impute Vector types
[ https://issues.apache.org/jira/browse/SPARK-21199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16063416#comment-16063416 ] Franklyn Dsouza edited comment on SPARK-21199 at 6/26/17 5:16 PM: -- For this particular scenario I have a table with two columns one is a string `document_type` and the other is a an array of tokens for the document. I want to do a TF-IDF on these tokens. the IDF needs to be done per `document_type` so i pivot on `document_type` and then do the IDF on the TF vetors. This pivoting introduces nulls for missing columns that need to be imputed. I can't impute array type either and fixing it at the token generation step would involve a lot of left joins to align various data sources. was (Author: franklyndsouza): For this particular scenario I have a table with two columns one is a string `document_type` and the other is a an array of tokens for the document. I want to do a TF-IDF on these tokens. the IDF needs to be done per `document_type` so i do pivot on `document_type` and then do the IDF on the TF vetors. This pivoting introduces nulls for missing columns that need to be imputed. I can't impute array type either and fixing it at the token generation step would involve a lot of left joins to align various data sources. > Its not possible to impute Vector types > --- > > Key: SPARK-21199 > URL: https://issues.apache.org/jira/browse/SPARK-21199 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0, 2.1.1 >Reporter: Franklyn Dsouza > > There are cases where nulls end up in vector columns in dataframes. Currently > there is no way to fill in these nulls because its not possible to create a > literal vector column expression using lit(). > Also the entire pyspark ml api will fail when they encounter nulls so this > makes it hard to work with the data. 
> I think that either vector support should be added to the imputer or vectors > should be supported in column expressions so they can be used in a coalesce. > [~mlnick]
[jira] [Comment Edited] (SPARK-21198) SparkSession catalog is terribly slow
[ https://issues.apache.org/jira/browse/SPARK-21198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16063431#comment-16063431 ] Saif Addin edited comment on SPARK-21198 at 6/26/17 5:15 PM: - Regarding listtables, here is the code used inside the program: {code:java} println(s"processing all tables for every db. db length is: ${databases.tail.length}") for (d <- databases.tail) { val d1 = new Date().getTime val dbs = spark.sqlContext.tables(d).filter("isTemporary = false").select("tableName").collect.map(_.getString(0)) println("Processed tables in DB using sqlContext. Time: " + ((new Date().getTime - d1) / 1000.0) + s" seconds. ${dbs.length} tables") val d2 = new Date().getTime val dbs2 = spark.catalog.listTables(d).filter(!_.isTemporary).map(_.name).collect println("Processed tables in DB using catalog. Time: " + ((new Date().getTime - d2) / 1000.0) + s" seconds. ${dbs.length} tables") other stuff {code} and timings are as follows {code:java} processing all tables for every db. db length is: 30 Processed tables in DB using sqlContext. Time: 0.863 seconds. 19 tables Processed tables in DB using catalog. Time: 6.978 seconds. 19 tables Processed tables in DB using sqlContext. Time: 0.276 seconds. 607 tables Processed tables in DB using catalog. Time: 194.501 seconds. 607 tables Processed tables in DB using sqlContext. Time: 0.243 seconds. 55 tables Processed tables in DB using catalog. Time: 17.907 seconds. 55 tables Processed tables in DB using sqlContext. Time: 0.238 seconds. 13 tables Processed tables in DB using catalog. Time: 4.642 seconds. 13 tables Processed tables in DB using sqlContext. Time: 0.238 seconds. 392 tables Processed tables in DB using catalog. Time: 126.999 seconds. 392 tables ... goes on... {code} was (Author: revolucion09): Regarding listtables, here is the code used inside the program: {code:java} println(s"processing all tables for every db. 
db length is: ${databases.tail.length}") for (d <- databases.tail) { val d1 = new Date().getTime val dbs = spark.sqlContext.tables(d).filter("isTemporary = false").select("tableName").collect.map(_.getString(0)) println("Processed tables in DB using sqlContext. Time: " + ((new Date().getTime - d1) / 1000.0) + s" seconds. ${dbs.length} tables") val d2 = new Date().getTime val dbs2 = spark.catalog.listTables(d).filter(!_.isTemporary).map(_.name).collect println("Processed tables in DB using catalog. Time: " + ((new Date().getTime - d2) / 1000.0) + s" seconds. ${dbs.length} tables") other stuff {code} and timings are as follows {code:java} processing all tables for every db. db length is: 30 Processed tables in DB using sqlContext. Time: 0.863 seconds. 19 tables Processed tables in DB using catalog. Time: 6.978 seconds. 19 tables Processed tables in DB using sqlContext. Time: 0.276 seconds. 607 tables Processed tables in DB using catalog. Time: 194.501 seconds. 607 tables Processed tables in DB using sqlContext. Time: 0.243 seconds. 55 tables Processed tables in DB using catalog. Time: 17.907 seconds. 55 tables Processed tables in DB using sqlContext. Time: 0.238 seconds. 13 tables Processed tables in DB using catalog. Time: 4.642 seconds. 13 tables Processed tables in DB using sqlContext. Time: 0.238 seconds. 392 tables ... goes on... {code} > SparkSession catalog is terribly slow > - > > Key: SPARK-21198 > URL: https://issues.apache.org/jira/browse/SPARK-21198 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Saif Addin > > We have a considerably large Hive metastore and a Spark program that goes > through Hive data availability. > In Spark 1.x, we were using sqlContext.tableNames, sqlContext.sql() and > sqlContext.isCached() to go through Hive metastore information.
> Once migrated to Spark 2.x we switched over to SparkSession.catalog instead, but > it turns out that both listDatabases() and listTables() take between 5 and 20 > minutes, depending on the database, to return results, using operations such as > the following one: > spark.catalog.listTables(db).filter(__.isTemporary).map(__.name).collect > which made the program unbearably slow to return a list of tables. > I know we still have spark.sqlContext.tableNames as a workaround, but I am > assuming this is going to be deprecated anytime soon?
[jira] [Comment Edited] (SPARK-21199) Its not possible to impute Vector types
[ https://issues.apache.org/jira/browse/SPARK-21199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16063416#comment-16063416 ] Franklyn Dsouza edited comment on SPARK-21199 at 6/26/17 5:15 PM: -- For this particular scenario I have a table with two columns: one is a string `document_type` and the other is an array of tokens for the document. I want to do a TF-IDF on these tokens. The IDF needs to be done per `document_type`, so I do a pivot on `document_type` and then do the IDF on the TF vectors. This pivoting introduces nulls for missing columns that need to be imputed. I can't impute the array type either, and fixing it at the token generation step would involve a lot of left joins to align various data sources. was (Author: franklyndsouza): For this particular scenario I have a table with two columns: one is a string `document_type` and the other is an array of tokens for the document. I want to do a TF-IDF. The IDF needs to be done per `document_type`, so I do a pivot on `document_type` and then do the IDF on the TF vectors. This pivoting introduces nulls for missing columns that need to be imputed. I can't impute the array type either, and fixing it at the token generation step would involve a lot of left joins to align various data sources. > Its not possible to impute Vector types > --- > > Key: SPARK-21199 > URL: https://issues.apache.org/jira/browse/SPARK-21199 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0, 2.1.1 >Reporter: Franklyn Dsouza > > There are cases where nulls end up in vector columns in DataFrames. Currently > there is no way to fill in these nulls because it's not possible to create a > literal vector column expression using lit(). > Also the entire PySpark ML API will fail when it encounters nulls, so this > makes it hard to work with the data.
> I think that either vector support should be added to the imputer or vectors > should be supported in column expressions so they can be used in a coalesce. > [~mlnick]
[jira] [Resolved] (SPARK-21215) Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve
[ https://issues.apache.org/jira/browse/SPARK-21215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-21215. --- Resolution: Duplicate > Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot > resolve > - > > Key: SPARK-21215 > URL: https://issues.apache.org/jira/browse/SPARK-21215 > Project: Spark > Issue Type: Bug > Components: Java API, SQL >Affects Versions: 2.1.1 > Environment: macOSX >Reporter: Michael Kunkel > > First Spark project. > I have a Java method that returns a Dataset<Row>. I want to convert this to a > Dataset<Object>, where the Object is named StatusChangeDB. I have created a > POJO StatusChangeDB.java and coded it with all the query objects found in the > MySQL table. > I then create an Encoder<StatusChangeDB> and convert the Dataset<Row> to a > Dataset<StatusChangeDB>. However, when I try to .show() the values of the > Dataset<StatusChangeDB> I receive the error > bq. > bq. Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot > resolve '`hvpinid_quad`' given input columns: [status_change_type, > superLayer, loclayer, sector, locwire]; > bq. at > org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) > bq. at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:86) > bq. at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:83) > bq. at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:290) > bq. at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:290) > bq. at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > bq. at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:289) > bq. at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:287) > bq.
at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:287) > bq. at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:307) > bq. at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188) > bq. at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:305) > bq. at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:287) > bq. at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:287) > bq. at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:287) > bq. at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4$$anonfun$apply$10.apply(TreeNode.scala:324) > bq. at > scala.collection.MapLike$MappedValues$$anonfun$iterator$3.apply(MapLike.scala:246) > bq. at > scala.collection.MapLike$MappedValues$$anonfun$iterator$3.apply(MapLike.scala:246) > bq. at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > bq. at scala.collection.Iterator$class.foreach(Iterator.scala:893) > bq. at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > bq. at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > bq. at scala.collection.IterableLike$$anon$1.foreach(IterableLike.scala:311) > bq. at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59) > bq. at > scala.collection.mutable.MapBuilder.$plus$plus$eq(MapBuilder.scala:25) > bq. at > scala.collection.TraversableViewLike$class.force(TraversableViewLike.scala:88) > bq. at scala.collection.IterableLike$$anon$1.force(IterableLike.scala:311) > bq. at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:332) > bq. at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188) > bq. at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:305) > bq. 
at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:287) > bq. at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:255) > bq. at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:255) > bq. at > org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:266) > bq. at > org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:276) > bq. at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$6.apply(QueryPlan.scala:285) > bq. at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188) > bq. at > org.apache.spark.sql.cataly
[jira] [Resolved] (SPARK-21214) Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve '
[ https://issues.apache.org/jira/browse/SPARK-21214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-21214. --- Resolution: Invalid This should be a question on the mailing list. > Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot > resolve ' > --- > > Key: SPARK-21214 > URL: https://issues.apache.org/jira/browse/SPARK-21214 > Project: Spark > Issue Type: Bug > Components: Java API, Spark Core, SQL >Affects Versions: 2.1.1 > Environment: macOSX >Reporter: Michael Kunkel > > First Spark project. > I have a Java method that returns a Dataset<Row>. I want to convert this to a > Dataset<Object>, where the Object is named StatusChangeDB. I have created a > POJO StatusChangeDB.java and coded it with all the query objects found in the > MySQL table. > I then create an Encoder<StatusChangeDB> and convert the Dataset<Row> to a > Dataset<StatusChangeDB>. However, when I try to .show() the values of the > Dataset<StatusChangeDB> I receive the error > bq. > bq. Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot > resolve '`hvpinid_quad`' given input columns: [status_change_type, > superLayer, loclayer, sector, locwire]; > bq. at > org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) > bq. at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:86) > bq. at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:83) > bq. at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:290) > bq. at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:290) > bq. at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > bq. at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:289) > bq.
at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:287) > bq. at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:287) > bq. at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:307) > bq. at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188) > bq. at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:305) > bq. at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:287) > bq. at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:287) > bq. at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:287) > bq. at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4$$anonfun$apply$10.apply(TreeNode.scala:324) > bq. at > scala.collection.MapLike$MappedValues$$anonfun$iterator$3.apply(MapLike.scala:246) > bq. at > scala.collection.MapLike$MappedValues$$anonfun$iterator$3.apply(MapLike.scala:246) > bq. at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > bq. at scala.collection.Iterator$class.foreach(Iterator.scala:893) > bq. at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > bq. at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > bq. at scala.collection.IterableLike$$anon$1.foreach(IterableLike.scala:311) > bq. at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59) > bq. at > scala.collection.mutable.MapBuilder.$plus$plus$eq(MapBuilder.scala:25) > bq. at > scala.collection.TraversableViewLike$class.force(TraversableViewLike.scala:88) > bq. at scala.collection.IterableLike$$anon$1.force(IterableLike.scala:311) > bq. at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:332) > bq. at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188) > bq. 
at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:305) > bq. at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:287) > bq. at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:255) > bq. at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:255) > bq. at > org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:266) > bq. at > org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:276) > bq. at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$6.apply(QueryPlan.scala:285) > bq. at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterato
[jira] [Commented] (SPARK-21198) SparkSession catalog is terribly slow
[ https://issues.apache.org/jira/browse/SPARK-21198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16063431#comment-16063431 ] Saif Addin commented on SPARK-21198: Regarding listtables, here is the code used inside the program: {code:java} println(s"processing all tables for every db. db length is: ${databases.tail.length}") for (d <- databases.tail) { val d1 = new Date().getTime val dbs = spark.sqlContext.tables(d).filter("isTemporary = false").select("tableName").collect.map(_.getString(0)) println("Processed tables in DB using sqlContext. Time: " + ((new Date().getTime - d1) / 1000.0) + s" seconds. ${dbs.length} tables") val d2 = new Date().getTime val dbs2 = spark.catalog.listTables(d).filter(!_.isTemporary).map(_.name).collect println("Processed tables in DB using catalog. Time: " + ((new Date().getTime - d2) / 1000.0) + s" seconds. ${dbs.length} tables") other stuff {code} and timings are as follows {code:java} processing all tables for every db. db length is: 30 Processed tables in DB using sqlContext. Time: 0.863 seconds. 19 tables Processed tables in DB using catalog. Time: 6.978 seconds. 19 tables Processed tables in DB using sqlContext. Time: 0.276 seconds. 607 tables Processed tables in DB using catalog. Time: 194.501 seconds. 607 tables Processed tables in DB using sqlContext. Time: 0.243 seconds. 55 tables Processed tables in DB using catalog. Time: 17.907 seconds. 55 tables Processed tables in DB using sqlContext. Time: 0.238 seconds. 13 tables Processed tables in DB using catalog. Time: 4.642 seconds. 13 tables Processed tables in DB using sqlContext. Time: 0.238 seconds. 392 tables ... goes on... 
{code} > SparkSession catalog is terribly slow > - > > Key: SPARK-21198 > URL: https://issues.apache.org/jira/browse/SPARK-21198 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Saif Addin > > We have a considerably large Hive metastore and a Spark program that goes > through Hive data availability. > In Spark 1.x, we were using sqlContext.tableNames, sqlContext.sql() and > sqlContext.isCached() to go through Hive metastore information. > Once migrated to Spark 2.x we switched over to SparkSession.catalog instead, but > it turns out that both listDatabases() and listTables() take between 5 and 20 > minutes, depending on the database, to return results, using operations such as > the following one: > spark.catalog.listTables(db).filter(__.isTemporary).map(__.name).collect > which made the program unbearably slow to return a list of tables. > I know we still have spark.sqlContext.tableNames as a workaround, but I am > assuming this is going to be deprecated anytime soon?
[jira] [Commented] (SPARK-21183) Unable to return Google BigQuery INTEGER data type into Spark via google BigQuery JDBC driver: java.sql.SQLDataException: [Simba][JDBC](10140) Error converting value t
[ https://issues.apache.org/jira/browse/SPARK-21183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16063428#comment-16063428 ] Sean Owen commented on SPARK-21183: --- I'm not sure that follows, but I don't know either. I think you'd have to establish what you think the Spark issue is here. > Unable to return Google BigQuery INTEGER data type into Spark via google > BigQuery JDBC driver: java.sql.SQLDataException: [Simba][JDBC](10140) Error > converting value to long. > -- > > Key: SPARK-21183 > URL: https://issues.apache.org/jira/browse/SPARK-21183 > Project: Spark > Issue Type: Bug > Components: Spark Shell, SQL >Affects Versions: 1.6.0, 2.0.0, 2.1.1 > Environment: OS: Linux > Spark version 2.1.1 > JDBC: Download the latest google BigQuery JDBC Driver from Google >Reporter: Matthew Walton > > I'm trying to fetch back data in Spark using a JDBC connection to Google > BigQuery. Unfortunately, when I try to query data that resides in an INTEGER > column I get the following error: > java.sql.SQLDataException: [Simba][JDBC](10140) Error converting value to > long. 
> Steps to reproduce: > 1) On Google BigQuery console create a simple table with an INT column and > insert some data > 2) Copy the Google BigQuery JDBC driver to the machine where you will run > Spark Shell > 3) Start Spark shell loading the GoogleBigQuery JDBC driver jar files > ./spark-shell --jars > /home/ec2-user/jdbc/gbq/GoogleBigQueryJDBC42.jar,/home/ec2-user/jdbc/gbq/google-api-client-1.22.0.jar,/home/ec2-user/jdbc/gbq/google-api-services-bigquery-v2-rev320-1.22.0.jar,/home/ec2-user/jdbc/gbq/google-http-client-1.22.0.jar,/home/ec2-user/jdbc/gbq/google-http-client-jackson2-1.22.0.jar,/home/ec2-user/jdbc/gbq/google-oauth-client-1.22.0.jar,/home/ec2-user/jdbc/gbq/jackson-core-2.1.3.jar > 4) In Spark shell load the data from Google BigQuery using the JDBC driver > val gbq = spark.read.format("jdbc").options(Map("url" -> > "jdbc:bigquery://https://www.googleapis.com/bigquery/v2;ProjectId=your-project-name-here;OAuthType=0;OAuthPvtKeyPath=/usr/lib/spark/YourProjectPrivateKey.json;OAuthServiceAcctEmail=YourEmail@gmail.com;AllowLargeResults=1;LargeResultDataset=_bqodbc_temp_tables;LargeResultTable=_matthew;Timeout=600", "dbtable" > -> > "test.lu_test_integer")).option("driver","com.simba.googlebigquery.jdbc42.Driver").option("user","").option("password","").load() > 5) In Spark shell try to display the data > gbq.show() > At this point you should see the error: > scala> gbq.show() > 17/06/22 19:34:57 WARN TaskSetManager: Lost task 0.0 in stage 3.0 (TID 6, > ip-172-31-37-165.ec2.internal, executor 3): java.sql.SQLDataException: > [Simba][JDBC](10140) Error converting value to long.
> at com.simba.exceptions.ExceptionConverter.toSQLException(Unknown > Source) > at com.simba.utilities.conversion.TypeConverter.toLong(Unknown Source) > at com.simba.jdbc.common.SForwardResultSet.getLong(Unknown Source) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$8.apply(JdbcUtils.scala:365) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$8.apply(JdbcUtils.scala:364) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:286) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:268) > at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.sc
[jira] [Created] (SPARK-21215) Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve
Michael Kunkel created SPARK-21215: -- Summary: Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve Key: SPARK-21215 URL: https://issues.apache.org/jira/browse/SPARK-21215 Project: Spark Issue Type: Bug Components: Java API, SQL Affects Versions: 2.1.1 Environment: macOSX Reporter: Michael Kunkel First Spark project. I have a Java method that returns a Dataset<Row>. I want to convert this to a Dataset<Object>, where the Object is named StatusChangeDB. I have created a POJO StatusChangeDB.java and coded it with all the query objects found in the MySQL table. I then create an Encoder<StatusChangeDB> and convert the Dataset<Row> to a Dataset<StatusChangeDB>. However, when I try to .show() the values of the Dataset<StatusChangeDB> I receive the error bq. bq. Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve '`hvpinid_quad`' given input columns: [status_change_type, superLayer, loclayer, sector, locwire]; bq. at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) bq. at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:86) bq. at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:83) bq. at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:290) bq. at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:290) bq. at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) bq. at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:289) bq. at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:287) bq. at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:287) bq. at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:307) bq.
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188) bq. at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:305) bq. at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:287) bq. at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:287) bq. at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:287) bq. at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4$$anonfun$apply$10.apply(TreeNode.scala:324) bq. at scala.collection.MapLike$MappedValues$$anonfun$iterator$3.apply(MapLike.scala:246) bq. at scala.collection.MapLike$MappedValues$$anonfun$iterator$3.apply(MapLike.scala:246) bq. at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) bq. at scala.collection.Iterator$class.foreach(Iterator.scala:893) bq. at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) bq. at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) bq. at scala.collection.IterableLike$$anon$1.foreach(IterableLike.scala:311) bq. at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59) bq. at scala.collection.mutable.MapBuilder.$plus$plus$eq(MapBuilder.scala:25) bq. at scala.collection.TraversableViewLike$class.force(TraversableViewLike.scala:88) bq. at scala.collection.IterableLike$$anon$1.force(IterableLike.scala:311) bq. at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:332) bq. at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188) bq. at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:305) bq. at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:287) bq. at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:255) bq. at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:255) bq. 
at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:266) bq. at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:276) bq. at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$6.apply(QueryPlan.scala:285) bq. at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188) bq. at org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:285) bq. at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:255) bq. at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:83) bq. at org.apache.spark.s
[jira] [Created] (SPARK-21214) Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve '
Michael Kunkel created SPARK-21214: -- Summary: Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve ' Key: SPARK-21214 URL: https://issues.apache.org/jira/browse/SPARK-21214 Project: Spark Issue Type: Bug Components: Java API, Spark Core, SQL Affects Versions: 2.1.1 Environment: macOSX Reporter: Michael Kunkel First Spark project. I have a Java method that returns a Dataset<Row>. I want to convert this to a Dataset<Object>, where the Object is named StatusChangeDB. I have created a POJO StatusChangeDB.java and coded it with all the query objects found in the MySQL table. I then create an Encoder<StatusChangeDB> and convert the Dataset<Row> to a Dataset<StatusChangeDB>. However, when I try to .show() the values of the Dataset<StatusChangeDB> I receive the error bq. bq. Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve '`hvpinid_quad`' given input columns: [status_change_type, superLayer, loclayer, sector, locwire]; bq. at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) bq. at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:86) bq. at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:83) bq. at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:290) bq. at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:290) bq. at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) bq. at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:289) bq. at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:287) bq. at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:287) bq. at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:307) bq.
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188) bq. at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:305) bq. at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:287) bq. at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:287) bq. at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:287) bq. at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4$$anonfun$apply$10.apply(TreeNode.scala:324) bq. at scala.collection.MapLike$MappedValues$$anonfun$iterator$3.apply(MapLike.scala:246) bq. at scala.collection.MapLike$MappedValues$$anonfun$iterator$3.apply(MapLike.scala:246) bq. at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) bq. at scala.collection.Iterator$class.foreach(Iterator.scala:893) bq. at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) bq. at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) bq. at scala.collection.IterableLike$$anon$1.foreach(IterableLike.scala:311) bq. at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59) bq. at scala.collection.mutable.MapBuilder.$plus$plus$eq(MapBuilder.scala:25) bq. at scala.collection.TraversableViewLike$class.force(TraversableViewLike.scala:88) bq. at scala.collection.IterableLike$$anon$1.force(IterableLike.scala:311) bq. at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:332) bq. at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188) bq. at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:305) bq. at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:287) bq. at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:255) bq. at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:255) bq. 
at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:266) bq. at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:276) bq. at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$6.apply(QueryPlan.scala:285) bq. at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188) bq. at org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:285) bq. at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:255) bq. at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:83) bq. at org.
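Editorial note: the analyzer error above means the bean encoder is asking for a property (`hvpinid_quad`) that does not exist among the Dataset's columns ([status_change_type, superLayer, loclayer, sector, locwire]). `Encoders.bean` resolves each POJO field against a column of the same name, so one possible fix is to rename or alias the columns to the bean's field names before converting. A hedged sketch (the specific column/field names are illustrative, not taken from the reporter's schema):

{code:java}
import org.apache.spark.sql.{Encoder, Encoders}

// Encoders.bean matches bean properties to columns by name, so align the
// names first. The rename below is illustrative; use whatever the POJO declares.
val enc: Encoder[StatusChangeDB] = Encoders.bean(classOf[StatusChangeDB])
val typed = rows
  .withColumnRenamed("loclayer", "hvpinid_quad") // rename to the bean's field name
  .as(enc)
typed.show()
{code}

Alternatively, renaming the POJO fields (or adding matching getters/setters) to the table's column names avoids the renames entirely.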
[jira] [Commented] (SPARK-21183) Unable to return Google BigQuery INTEGER data type into Spark via google BigQuery JDBC driver: java.sql.SQLDataException: [Simba][JDBC](10140) Error converting value t
[ https://issues.apache.org/jira/browse/SPARK-21183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16063421#comment-16063421 ] Matthew Walton commented on SPARK-21183: The SQuirreL SQL client runs the same query through this driver without any error, so it seems that the Spark code is the culprit and is causing the Simba driver to report that error. I have another ticket opened for Spark with data coming from Hive drivers, and it is similar in that it deals with how the integer data type is handled by Spark. In that case I have several drivers I can test to show that the issue does not seem to originate in the driver. Unfortunately, that one has not been assigned: [https://issues.apache.org/jira/browse/SPARK-21179] I only mention this issue since it seems to show a behavior in Spark when it comes to handling INTEGER types with certain JDBC drivers. In each case, running the same SQL outside of Spark works fine. > Unable to return Google BigQuery INTEGER data type into Spark via google > BigQuery JDBC driver: java.sql.SQLDataException: [Simba][JDBC](10140) Error > converting value to long. > -- > > Key: SPARK-21183 > URL: https://issues.apache.org/jira/browse/SPARK-21183 > Project: Spark > Issue Type: Bug > Components: Spark Shell, SQL >Affects Versions: 1.6.0, 2.0.0, 2.1.1 > Environment: OS: Linux > Spark version 2.1.1 > JDBC: Download the latest google BigQuery JDBC Driver from Google >Reporter: Matthew Walton > > I'm trying to fetch back data in Spark using a JDBC connection to Google > BigQuery. Unfortunately, when I try to query data that resides in an INTEGER > column I get the following error: > java.sql.SQLDataException: [Simba][JDBC](10140) Error converting value to > long. 
> Steps to reproduce: > 1) On Google BigQuery console create a simple table with an INT column and > insert some data > 2) Copy the Google BigQuery JDBC driver to the machine where you will run > Spark Shell > 3) Start Spark shell loading the GoogleBigQuery JDBC driver jar files > ./spark-shell --jars > /home/ec2-user/jdbc/gbq/GoogleBigQueryJDBC42.jar,/home/ec2-user/jdbc/gbq/google-api-client-1.22.0.jar,/home/ec2-user/jdbc/gbq/google-api-services-bigquery-v2-rev320-1.22.0.jar,/home/ec2-user/jdbc/gbq/google-http-client-1.22.0.jar,/home/ec2-user/jdbc/gbq/google-http-client-jackson2-1.22.0.jar,/home/ec2-user/jdbc/gbq/google-oauth-client-1.22.0.jar,/home/ec2-user/jdbc/gbq/jackson-core-2.1.3.jar > 4) In Spark shell load the data from Google BigQuery using the JDBC driver > val gbq = spark.read.format("jdbc").options(Map("url" -> > "jdbc:bigquery://https://www.googleapis.com/bigquery/v2;ProjectId=your-project-name-here;OAuthType=0;OAuthPvtKeyPath=/usr/lib/spark/YourProjectPrivateKey.json;OAuthServiceAcctEmail=YourEmail@gmail.comAllowLargeResults=1;LargeResultDataset=_bqodbc_temp_tables;LargeResultTable=_matthew;Timeout=600","dbtable"; > -> > "test.lu_test_integer")).option("driver","com.simba.googlebigquery.jdbc42.Driver").option("user","").option("password","").load() > 5) In Spark shell try to display the data > gbq.show() > At this point you should see the error: > scala> gbq.show() > 17/06/22 19:34:57 WARN TaskSetManager: Lost task 0.0 in stage 3.0 (TID 6, > ip-172-31-37-165.ec2.internal, executor 3): java.sql.SQLDataException: > [Simba][JDBC](10140) Error converting value to long. 
> at com.simba.exceptions.ExceptionConverter.toSQLException(Unknown > Source) > at com.simba.utilities.conversion.TypeConverter.toLong(Unknown Source) > at com.simba.jdbc.common.SForwardResultSet.getLong(Unknown Source) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$8.apply(JdbcUtils.scala:365) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$8.apply(JdbcUtils.scala:364) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:286) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:268) > at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIt
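Editorial note: one untested workaround worth trying while the root cause is investigated is to push a cast into the JDBC subquery, so the driver hands the column back as a string rather than a value Spark asks it to convert to long. A hedged sketch (column and variable names are assumptions; the connection string is the one from the repro steps above):

{code:java}
// Hypothetical workaround: cast the INTEGER column inside the pushed-down
// subquery so getString is used instead of getLong on the driver side.
val gbq = spark.read.format("jdbc")
  .option("url", bigQueryUrl) // same connection string as in step 4 above
  .option("driver", "com.simba.googlebigquery.jdbc42.Driver")
  .option("dbtable", "(SELECT CAST(int_col AS STRING) AS int_col FROM test.lu_test_integer) AS t")
  .load()
gbq.show()
{code}

Whether BigQuery accepts the subquery alias form depends on the driver's SQL dialect handling, so treat this as a sketch, not a confirmed fix.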
[jira] [Commented] (SPARK-21199) Its not possible to impute Vector types
[ https://issues.apache.org/jira/browse/SPARK-21199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16063416#comment-16063416 ] Franklyn Dsouza commented on SPARK-21199: - For this particular scenario I have a table with two columns: one is a string `document_type` and the other is an array of tokens for the document. I want to do a TF-IDF. The IDF needs to be done per `document_type`, so I pivot on `document_type` and then do the IDF on the TF vectors. This pivoting introduces nulls for missing columns that need to be imputed. I can't impute the array type either, and fixing it at the token-generation step would involve a lot of left joins to align various data sources. > Its not possible to impute Vector types > --- > > Key: SPARK-21199 > URL: https://issues.apache.org/jira/browse/SPARK-21199 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0, 2.1.1 >Reporter: Franklyn Dsouza > > There are cases where nulls end up in vector columns in dataframes. Currently > there is no way to fill in these nulls because it's not possible to create a > literal vector column expression using lit(). > Also the entire pyspark ml api will fail when it encounters nulls, so this > makes it hard to work with the data. > I think that either vector support should be added to the imputer or vectors > should be supported in column expressions so they can be used in a coalesce. > [~mlnick] -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
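Editorial note: although `lit()` cannot build a `VectorUDT` literal, a zero-argument UDF can, which gives a possible interim workaround for the pivot-introduced nulls described above. A hedged sketch (`vocabSize`, `df`, and the column name "tf" are assumptions for illustration):

{code:java}
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.functions.{coalesce, col, udf}

// A zero-arg UDF produces a default (all-zero sparse) vector per row,
// which coalesce can then substitute for the nulls the pivot introduced.
val emptyVec = udf(() => Vectors.sparse(vocabSize, Array.empty[Int], Array.empty[Double]))
val filled = df.withColumn("tf", coalesce(col("tf"), emptyVec()))
{code}

This sidesteps the missing literal support rather than fixing it; the Imputer/lit() improvements proposed in the issue would still be the proper solution.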
[jira] [Comment Edited] (SPARK-21198) SparkSession catalog is terribly slow
[ https://issues.apache.org/jira/browse/SPARK-21198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16063406#comment-16063406 ] Saif Addin edited comment on SPARK-21198 at 6/26/17 5:03 PM: - Okay, I think there is something odd somewhere in between. It may be hard to tackle the issue, but I'll go through it line by line. (*spark-submit*, local[8]) 1. This line gets stuck forever; the program doesn't continue after waiting 2 minutes (the Spark task is stuck in collect()) {code:java} import spark.implicits._ private val databases: Array[String] = REFERENCEDB +: (spark.catalog.listDatabases.filter(_ != blacklistdb).map(_.name).collect) {code} 2. Changing the line to the following (notice the filter after collect) takes 8ms {code:java} import spark.implicits._ private val databases: Array[String] = REFERENCEDB +: (spark.catalog.listDatabases.map(_.name).collect.filterNot(_ == blacklistdb)) {code} 3. The following line instead takes 2ms {code:java} private val databases: Array[String] = (REFERENCEDB +: spark.sql("show databases").collect.map(_.getString(0))).filterNot(_ == blacklistdb) {code} Here's the weirdest of all: if I instead start *spark-shell* and run item number 1, it works (it takes 1ms, which is even faster than the other lines in my *spark-submit*; the other lines are also faster, down to 1ms as well) was (Author: revolucion09): Okay, I think there is something odd somewhere in between. It may be hard to tackle the issue, but I'll go through it line by line. (*spark-submit*, local[8]) 1. This line gets stuck forever; the program doesn't continue after waiting 2 minutes (the Spark task is stuck in collect()) {code:java} import spark.implicits._ private val databases: Array[String] = REFERENCEDB +: (spark.catalog.listDatabases.filter(_ != blacklistdb).map(_.name).collect) {code} 2. Changing the line to the following (notice the filter after collect) takes 8ms {code:java} import spark.implicits._ private val databases: Array[String] = REFERENCEDB +: (spark.catalog.listDatabases.map(_.name).collect.filterNot(_ == blacklistdb)) {code} 3. The following line instead takes 2ms {code:java} private val databases: Array[String] = (REFERENCEDB +: spark.sql("show databases").collect.map(_.getString(0))).filterNot(_ == blacklistdb) {code} Here's the weirdest of all: if I instead start *spark-shell* and run item number 1, it works (it takes 1ms, which is even faster than in my program; the other lines are also faster, down to 1ms as well) > SparkSession catalog is terribly slow > - > > Key: SPARK-21198 > URL: https://issues.apache.org/jira/browse/SPARK-21198 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Saif Addin > > We have a considerably large Hive metastore and a Spark program that goes > through Hive data availability. > In Spark 1.x, we were using sqlContext.tableNames, sqlContext.sql() and > sqlContext.isCached() to go through Hive metastore information. > Once migrated to Spark 2.x we switched over to SparkSession.catalog instead, but > it turns out that both listDatabases() and listTables() take between 5 to 20 > minutes, depending on the database, to return results, using operations such as > the following one: > spark.catalog.listTables(db).filter(__.isTemporary).map(__.name).collect > which made the program unbearably slow to return a list of tables. > I know we still have spark.sqlContext.tableNames as a workaround, but I am > assuming this is going to be deprecated anytime soon? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
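Editorial observation on variant 1 above (not from the thread): `spark.catalog.listDatabases` returns a `Dataset[Database]`, so the predicate `_ != blacklistdb` compares a `Database` object against a `String` and can never filter anything; it also runs as a distributed Dataset operation rather than a local one. Filtering the collected names, as variant 2 already does, avoids both problems:

{code:java}
// Compare names to names, and filter locally after collect():
val databases: Array[String] =
  REFERENCEDB +: spark.catalog.listDatabases.map(_.name).collect.filterNot(_ == blacklistdb)
{code}

This would explain the type-level wrongness of variant 1, though not by itself the hang under spark-submit versus spark-shell.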
[jira] [Commented] (SPARK-21183) Unable to return Google BigQuery INTEGER data type into Spark via google BigQuery JDBC driver: java.sql.SQLDataException: [Simba][JDBC](10140) Error converting value t
[ https://issues.apache.org/jira/browse/SPARK-21183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16063387#comment-16063387 ] Sean Owen commented on SPARK-21183: --- This looks like an error from simba, not Spark? > Unable to return Google BigQuery INTEGER data type into Spark via google > BigQuery JDBC driver: java.sql.SQLDataException: [Simba][JDBC](10140) Error > converting value to long. > -- > > Key: SPARK-21183 > URL: https://issues.apache.org/jira/browse/SPARK-21183 > Project: Spark > Issue Type: Bug > Components: Spark Shell, SQL >Affects Versions: 1.6.0, 2.0.0, 2.1.1 > Environment: OS: Linux > Spark version 2.1.1 > JDBC: Download the latest google BigQuery JDBC Driver from Google >Reporter: Matthew Walton > > I'm trying to fetch back data in Spark using a JDBC connection to Google > BigQuery. Unfortunately, when I try to query data that resides in an INTEGER > column I get the following error: > java.sql.SQLDataException: [Simba][JDBC](10140) Error converting value to > long. 
> Steps to reproduce: > 1) On Google BigQuery console create a simple table with an INT column and > insert some data > 2) Copy the Google BigQuery JDBC driver to the machine where you will run > Spark Shell > 3) Start Spark shell loading the GoogleBigQuery JDBC driver jar files > ./spark-shell --jars > /home/ec2-user/jdbc/gbq/GoogleBigQueryJDBC42.jar,/home/ec2-user/jdbc/gbq/google-api-client-1.22.0.jar,/home/ec2-user/jdbc/gbq/google-api-services-bigquery-v2-rev320-1.22.0.jar,/home/ec2-user/jdbc/gbq/google-http-client-1.22.0.jar,/home/ec2-user/jdbc/gbq/google-http-client-jackson2-1.22.0.jar,/home/ec2-user/jdbc/gbq/google-oauth-client-1.22.0.jar,/home/ec2-user/jdbc/gbq/jackson-core-2.1.3.jar > 4) In Spark shell load the data from Google BigQuery using the JDBC driver > val gbq = spark.read.format("jdbc").options(Map("url" -> > "jdbc:bigquery://https://www.googleapis.com/bigquery/v2;ProjectId=your-project-name-here;OAuthType=0;OAuthPvtKeyPath=/usr/lib/spark/YourProjectPrivateKey.json;OAuthServiceAcctEmail=YourEmail@gmail.comAllowLargeResults=1;LargeResultDataset=_bqodbc_temp_tables;LargeResultTable=_matthew;Timeout=600","dbtable"; > -> > "test.lu_test_integer")).option("driver","com.simba.googlebigquery.jdbc42.Driver").option("user","").option("password","").load() > 5) In Spark shell try to display the data > gbq.show() > At this point you should see the error: > scala> gbq.show() > 17/06/22 19:34:57 WARN TaskSetManager: Lost task 0.0 in stage 3.0 (TID 6, > ip-172-31-37-165.ec2.internal, executor 3): java.sql.SQLDataException: > [Simba][JDBC](10140) Error converting value to long. 
> at com.simba.exceptions.ExceptionConverter.toSQLException(Unknown > Source) > at com.simba.utilities.conversion.TypeConverter.toLong(Unknown Source) > at com.simba.jdbc.common.SForwardResultSet.getLong(Unknown Source) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$8.apply(JdbcUtils.scala:365) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$8.apply(JdbcUtils.scala:364) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:286) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:268) > at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.sca
[jira] [Commented] (SPARK-21212) Can't use Count(*) with Order Clause
[ https://issues.apache.org/jira/browse/SPARK-21212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16063376#comment-16063376 ] Sean Owen commented on SPARK-21212: --- You don't define 'value' anywhere, as it says. > Can't use Count(*) with Order Clause > > > Key: SPARK-21212 > URL: https://issues.apache.org/jira/browse/SPARK-21212 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 > Environment: Windows; external data provided through data source api >Reporter: Shawn Lavelle >Priority: Minor > > I don't think this should fail the query: > {code}jdbc:hive2://user:port/> select count(*) from table where value between > 1498240079000 and cast(now() as bigint)*1000 order by value; > {code} > {code} > Error: org.apache.spark.sql.AnalysisException: cannot resolve '`value`' given > input columns: [count(1)]; line 1 pos 113; > 'Sort ['value ASC NULLS FIRST], true > +- Aggregate [count(1) AS count(1)#718L] >+- Filter ((value#413L >= 1498240079000) && (value#413L <= > (cast(current_timestamp() as bigint) * cast(1000 as bigint > +- SubqueryAlias table > +- > Relation[field1#411L,field2#412,value#413L,field3#414,field4#415,field5#416,field6#417,field7#418,field8#419,field9#420] > com.redacted@16004579 (state=,code=0) > {code} > Arguably, the optimizer could ignore the "order by" clause, but I leave that > to more informed minds than my own. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
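Editorial note: as the analysis exception says, after the aggregation only `count(1)` is in scope, so the raw `value` column is no longer available to sort on. Since a global COUNT(*) returns a single row, the ORDER BY adds nothing; possible rewrites (sketch, table/column names from the report):

{code:sql}
-- Drop the ORDER BY: a global aggregate returns one row anyway.
SELECT COUNT(*) FROM table
WHERE value BETWEEN 1498240079000 AND CAST(now() AS BIGINT) * 1000;

-- Or, if per-value counts ordered by value were the intent, group first:
SELECT value, COUNT(*) FROM table
WHERE value BETWEEN 1498240079000 AND CAST(now() AS BIGINT) * 1000
GROUP BY value
ORDER BY value;
{code}

Whether the optimizer should silently ignore the redundant ORDER BY, as the reporter suggests, is a separate design question.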
[jira] [Commented] (SPARK-17129) Support statistics collection and cardinality estimation for partitioned tables
[ https://issues.apache.org/jira/browse/SPARK-17129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16063342#comment-16063342 ] Maria commented on SPARK-17129: --- [~ZenWzh], I opened SPARK-21213 and submitted PR https://github.com/apache/spark/pull/18421 . Would you help review it? > Support statistics collection and cardinality estimation for partitioned > tables > --- > > Key: SPARK-17129 > URL: https://issues.apache.org/jira/browse/SPARK-17129 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Zhenhua Wang > > I upgraded this JIRA because many tasks have been found that need to be done > here. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-13669) Job will always fail in the external shuffle service unavailable situation
[ https://issues.apache.org/jira/browse/SPARK-13669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves resolved SPARK-13669. --- Resolution: Fixed Fix Version/s: 2.3.0 > Job will always fail in the external shuffle service unavailable situation > -- > > Key: SPARK-13669 > URL: https://issues.apache.org/jira/browse/SPARK-13669 > Project: Spark > Issue Type: Bug > Components: Spark Core, YARN >Reporter: Saisai Shao >Assignee: Saisai Shao > Fix For: 2.3.0 > > > Currently we are running into an issue with YARN work preserving enabled + > external shuffle service. > In the work-preserving scenario, the failure of the NM does not lead to > the exit of executors, so executors can still accept and run tasks. The > problem is that when the NM fails, the external shuffle service is actually > inaccessible, so reduce tasks will always complain about “Fetch failure”, > and the failure of the reduce stage will make the parent stage (map stage) rerun. > The tricky thing here is that the Spark scheduler is not aware of the unavailability > of the external shuffle service, and will reschedule the map tasks on the > executor whose NM failed; the reduce stage will again fail with > “Fetch failure”, and after 4 retries the job fails. > So the main problem here is that we should avoid assigning tasks to those bad > executors (where the shuffle service is unavailable). Spark's current blacklist > mechanism can blacklist executors/nodes based on failed tasks, but it doesn't > handle this specific fetch-failure scenario. So this proposes improving the > current application blacklist mechanism to handle the fetch-failure issue > (especially the external-shuffle-service-unavailable case), blacklisting > the executors/nodes where shuffle fetch is unavailable. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13669) Job will always fail in the external shuffle service unavailable situation
[ https://issues.apache.org/jira/browse/SPARK-13669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves reassigned SPARK-13669: - Assignee: Saisai Shao > Job will always fail in the external shuffle service unavailable situation > -- > > Key: SPARK-13669 > URL: https://issues.apache.org/jira/browse/SPARK-13669 > Project: Spark > Issue Type: Bug > Components: Spark Core, YARN >Reporter: Saisai Shao >Assignee: Saisai Shao > Fix For: 2.3.0 > > > Currently we are running into an issue with YARN work preserving enabled + > external shuffle service. > In the work-preserving scenario, the failure of the NM does not lead to > the exit of executors, so executors can still accept and run tasks. The > problem is that when the NM fails, the external shuffle service is actually > inaccessible, so reduce tasks will always complain about “Fetch failure”, > and the failure of the reduce stage will make the parent stage (map stage) rerun. > The tricky thing here is that the Spark scheduler is not aware of the unavailability > of the external shuffle service, and will reschedule the map tasks on the > executor whose NM failed; the reduce stage will again fail with > “Fetch failure”, and after 4 retries the job fails. > So the main problem here is that we should avoid assigning tasks to those bad > executors (where the shuffle service is unavailable). Spark's current blacklist > mechanism can blacklist executors/nodes based on failed tasks, but it doesn't > handle this specific fetch-failure scenario. So this proposes improving the > current application blacklist mechanism to handle the fetch-failure issue > (especially the external-shuffle-service-unavailable case), blacklisting > the executors/nodes where shuffle fetch is unavailable. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
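Editorial note: the fix released in 2.3.0 surfaces as application-level fetch-failure blacklisting. A hedged sketch of how it would be enabled (property names as I understand them from the Spark 2.3 configuration docs; verify against the docs for your version):

{code}
# Blacklist executors/nodes whose shuffle service is unreachable,
# instead of retrying map stages on them until the job fails.
spark-submit \
  --conf spark.blacklist.enabled=true \
  --conf spark.blacklist.application.fetchFailure.enabled=true \
  ...
{code}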
[jira] [Assigned] (SPARK-20898) spark.blacklist.killBlacklistedExecutors doesn't work in YARN
[ https://issues.apache.org/jira/browse/SPARK-20898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves reassigned SPARK-20898: - Assignee: Saisai Shao > spark.blacklist.killBlacklistedExecutors doesn't work in YARN > - > > Key: SPARK-20898 > URL: https://issues.apache.org/jira/browse/SPARK-20898 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Thomas Graves >Assignee: Saisai Shao > Fix For: 2.3.0 > > > I was trying out the new spark.blacklist.killBlacklistedExecutors on YARN but > it doesn't appear to work. Every time I get: > 17/05/26 16:28:12 WARN BlacklistTracker: Not attempting to kill blacklisted > executor id 4 since allocation client is not defined > even though dynamic allocation is on. Taking a quick look, I think the way > it creates the BlacklistTracker and passes the allocation client is wrong. > The scheduler backend is > not set yet, so the allocation client is never passed to the BlacklistTracker > correctly. Thus it will never kill.
[jira] [Resolved] (SPARK-20898) spark.blacklist.killBlacklistedExecutors doesn't work in YARN
[ https://issues.apache.org/jira/browse/SPARK-20898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves resolved SPARK-20898. --- Resolution: Fixed Fix Version/s: 2.3.0 > spark.blacklist.killBlacklistedExecutors doesn't work in YARN > - > > Key: SPARK-20898 > URL: https://issues.apache.org/jira/browse/SPARK-20898 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Thomas Graves >Assignee: Saisai Shao > Fix For: 2.3.0 > > > I was trying out the new spark.blacklist.killBlacklistedExecutors on YARN but > it doesn't appear to work. Every time I get: > 17/05/26 16:28:12 WARN BlacklistTracker: Not attempting to kill blacklisted > executor id 4 since allocation client is not defined > even though dynamic allocation is on. Taking a quick look, I think the way > it creates the BlacklistTracker and passes the allocation client is wrong. > The scheduler backend is > not set yet, so the allocation client is never passed to the BlacklistTracker > correctly. Thus it will never kill.
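The root cause described in SPARK-20898 is an initialization-ordering problem: the allocation client is captured (as undefined) before the scheduler backend exists, so the kill path always sees no client. One generic way to sidestep that kind of bug — a sketch under assumed names, not the actual Spark fix — is to hand the tracker a lazy accessor instead of the client itself:

```python
# Sketch of avoiding an initialization-order bug: resolve a dependency
# lazily at use time instead of capturing it (as None) at construction.
# All names (Tracker, AllocationClient, Backend) are hypothetical.

class Tracker:
    def __init__(self, get_allocation_client):
        # Store a callable; the real client may not exist yet.
        self._get_client = get_allocation_client

    def kill(self, executor_id):
        client = self._get_client()  # resolved now, after backend setup
        if client is None:
            return "not attempting to kill: allocation client is not defined"
        return client.kill(executor_id)

class AllocationClient:
    def kill(self, executor_id):
        return f"killed {executor_id}"

class Backend:
    def __init__(self):
        self.client = None  # not yet initialized

backend = Backend()
tracker = Tracker(lambda: backend.client)  # tracker created before the backend is ready
backend.client = AllocationClient()        # backend finishes initialization later
```

Had `Tracker` stored `backend.client` eagerly at construction, it would hold `None` forever — which mirrors the "allocation client is not defined" warning in the report.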
[jira] [Commented] (SPARK-21213) Support collecting partition-level statistics: rowCount and sizeInBytes
[ https://issues.apache.org/jira/browse/SPARK-21213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16063338#comment-16063338 ] Apache Spark commented on SPARK-21213: -- User 'mbasmanova' has created a pull request for this issue: https://github.com/apache/spark/pull/18421 > Support collecting partition-level statistics: rowCount and sizeInBytes > --- > > Key: SPARK-21213 > URL: https://issues.apache.org/jira/browse/SPARK-21213 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.1.1 >Reporter: Maria > Original Estimate: 168h > Remaining Estimate: 168h > > Support the ANALYZE TABLE table PARTITION (key=value, ...) COMPUTE STATISTICS > [NOSCAN] SQL command to compute the number of rows and total size in bytes > for a single partition and store them in the Hive Metastore.
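What the two partition-level statistics mean can be shown with a small pure-Python sketch — this is only an illustration of rowCount and sizeInBytes, not Spark's ANALYZE TABLE implementation, and `partition_stats` is a hypothetical helper:

```python
# Illustrative computation of partition-level statistics (rowCount and
# sizeInBytes). Not Spark's implementation; the helper name is hypothetical.

def partition_stats(rows):
    """Compute rowCount and sizeInBytes for one partition,
    where each row is a bytes-like encoded record."""
    row_count = len(rows)
    size_in_bytes = sum(len(r) for r in rows)
    return {"rowCount": row_count, "sizeInBytes": size_in_bytes}

# One partition's rows as encoded records:
stats = partition_stats([b"a,1", b"b,22"])  # 2 rows, 3 + 4 bytes
```

With NOSCAN, an engine would typically record only sizeInBytes (available from file metadata) and skip the row scan; with a full scan it can record both.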
[jira] [Assigned] (SPARK-21213) Support collecting partition-level statistics: rowCount and sizeInBytes
[ https://issues.apache.org/jira/browse/SPARK-21213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21213: Assignee: (was: Apache Spark) > Support collecting partition-level statistics: rowCount and sizeInBytes > --- > > Key: SPARK-21213 > URL: https://issues.apache.org/jira/browse/SPARK-21213 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.1.1 >Reporter: Maria > Original Estimate: 168h > Remaining Estimate: 168h > > Support the ANALYZE TABLE table PARTITION (key=value, ...) COMPUTE STATISTICS > [NOSCAN] SQL command to compute the number of rows and total size in bytes > for a single partition and store them in the Hive Metastore.
[jira] [Assigned] (SPARK-21213) Support collecting partition-level statistics: rowCount and sizeInBytes
[ https://issues.apache.org/jira/browse/SPARK-21213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21213: Assignee: Apache Spark > Support collecting partition-level statistics: rowCount and sizeInBytes > --- > > Key: SPARK-21213 > URL: https://issues.apache.org/jira/browse/SPARK-21213 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.1.1 >Reporter: Maria >Assignee: Apache Spark > Original Estimate: 168h > Remaining Estimate: 168h > > Support the ANALYZE TABLE table PARTITION (key=value, ...) COMPUTE STATISTICS > [NOSCAN] SQL command to compute the number of rows and total size in bytes > for a single partition and store them in the Hive Metastore.
[jira] [Updated] (SPARK-21213) Support collecting partition-level statistics: rowCount and sizeInBytes
[ https://issues.apache.org/jira/browse/SPARK-21213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maria updated SPARK-21213: -- Summary: Support collecting partition-level statistics: rowCount and sizeInBytes (was: Support collecting partition-level statistics) > Support collecting partition-level statistics: rowCount and sizeInBytes > --- > > Key: SPARK-21213 > URL: https://issues.apache.org/jira/browse/SPARK-21213 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.1.1 >Reporter: Maria > Original Estimate: 168h > Remaining Estimate: 168h > > Support the ANALYZE TABLE table PARTITION (key=value, ...) COMPUTE STATISTICS > [NOSCAN] SQL command to compute the number of rows and total size in bytes > for a single partition and store them in the Hive Metastore.
[jira] [Commented] (SPARK-21063) Spark return an empty result from remote hadoop cluster
[ https://issues.apache.org/jira/browse/SPARK-21063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16063289#comment-16063289 ] Peter Bykov commented on SPARK-21063: - [~q79969786] I tried this solution, but got the same result (empty result set). [~srowen] Can you share a specific configuration from your cluster? > Spark return an empty result from remote hadoop cluster > --- > > Key: SPARK-21063 > URL: https://issues.apache.org/jira/browse/SPARK-21063 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.1.0, 2.1.1 >Reporter: Peter Bykov > > Spark returns an empty result when querying a remote Hadoop cluster. > All firewall settings have been removed. > Querying via JDBC works properly with the hive-jdbc driver, version 1.1.1. > Code snippet is: > {code:java} > val spark = SparkSession.builder > .appName("RemoteSparkTest") > .master("local") > .getOrCreate() > val df = spark.read > .option("url", "jdbc:hive2://remote.hive.local:1/default") > .option("user", "user") > .option("password", "pass") > .option("dbtable", "test_table") > .option("driver", "org.apache.hive.jdbc.HiveDriver") > .format("jdbc") > .load() > > df.show() > {code} > Result: > {noformat} > +---+ > |test_table.test_col| > +---+ > +---+ > {noformat} > All manipulations like: > {code:java} > df.select("*").show() > {code} > return an empty result too.
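One variation sometimes suggested for cases like this is to pass a parenthesized subquery with an alias as `dbtable`, so the remote engine resolves the columns itself; whether it fixes this particular report is unverified. The snippet below only builds the option map in pure Python (the reader call itself needs a live cluster and is left as a comment); the subquery string is an assumption, not a confirmed fix:

```python
# Build the JDBC reader options from the report above, swapping the plain
# table name for a subquery alias — a sometimes-suggested variation, not a
# confirmed fix for this issue. Pure Python; no cluster required to run this.

jdbc_options = {
    "url": "jdbc:hive2://remote.hive.local:1/default",
    "user": "user",
    "password": "pass",
    # dbtable may be either a table name or a parenthesized subquery
    # with an alias, e.g. "(SELECT * FROM test_table) t":
    "dbtable": "(SELECT * FROM test_table) t",
    "driver": "org.apache.hive.jdbc.HiveDriver",
}

# With PySpark this would be applied as (not run here):
# df = spark.read.format("jdbc").options(**jdbc_options).load()
```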
[jira] [Updated] (SPARK-21213) Support collecting partition-level statistics
[ https://issues.apache.org/jira/browse/SPARK-21213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maria updated SPARK-21213: -- Issue Type: Sub-task (was: New Feature) Parent: SPARK-17129 > Support collecting partition-level statistics > - > > Key: SPARK-21213 > URL: https://issues.apache.org/jira/browse/SPARK-21213 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.1.1 >Reporter: Maria > Original Estimate: 168h > Remaining Estimate: 168h > > Support the ANALYZE TABLE table PARTITION (key=value, ...) COMPUTE STATISTICS > [NOSCAN] SQL command to compute the number of rows and total size in bytes > for a single partition and store them in the Hive Metastore.
[jira] [Created] (SPARK-21213) Support collecting partition-level statistics
Maria created SPARK-21213: - Summary: Support collecting partition-level statistics Key: SPARK-21213 URL: https://issues.apache.org/jira/browse/SPARK-21213 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 2.1.1 Reporter: Maria Support the ANALYZE TABLE table PARTITION (key=value, ...) COMPUTE STATISTICS [NOSCAN] SQL command to compute the number of rows and total size in bytes for a single partition and store them in the Hive Metastore.