[jira] [Resolved] (SPARK-13962) spark.ml Evaluators should support other numeric types for label
[ https://issues.apache.org/jira/browse/SPARK-13962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nick Pentreath resolved SPARK-13962.
------------------------------------
    Resolution: Fixed
    Fix Version/s: 2.0.0

Issue resolved by pull request 12500
[https://github.com/apache/spark/pull/12500]

> spark.ml Evaluators should support other numeric types for label
> ----------------------------------------------------------------
>
>         Key: SPARK-13962
>         URL: https://issues.apache.org/jira/browse/SPARK-13962
>     Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>    Reporter: Nick Pentreath
>    Assignee: Benjamin Fradet
>    Priority: Minor
>     Fix For: 2.0.0

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14913) Simplify configuration API
[ https://issues.apache.org/jira/browse/SPARK-14913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-14913:
------------------------------------
    Assignee: Apache Spark  (was: Reynold Xin)

> Simplify configuration API
> --------------------------
>
>         Key: SPARK-14913
>         URL: https://issues.apache.org/jira/browse/SPARK-14913
>     Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>    Reporter: Reynold Xin
>    Assignee: Apache Spark
>
> We currently expose both the Hadoop configuration and the Spark SQL
> configuration in RuntimeConfig. I think we can remove the Hadoop
> configuration part and simply generate a Hadoop Configuration on the fly by
> passing all the SQL configurations into it. This way, there is a single
> interface (in Java/Scala/Python/SQL) for end-users.
[jira] [Commented] (SPARK-14913) Simplify configuration API
[ https://issues.apache.org/jira/browse/SPARK-14913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15257650#comment-15257650 ]

Apache Spark commented on SPARK-14913:
--------------------------------------

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/12689

> Simplify configuration API
> --------------------------
>
>         Key: SPARK-14913
>         URL: https://issues.apache.org/jira/browse/SPARK-14913
>     Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>    Reporter: Reynold Xin
>    Assignee: Reynold Xin
>
> We currently expose both the Hadoop configuration and the Spark SQL
> configuration in RuntimeConfig. I think we can remove the Hadoop
> configuration part and simply generate a Hadoop Configuration on the fly by
> passing all the SQL configurations into it. This way, there is a single
> interface (in Java/Scala/Python/SQL) for end-users.
[jira] [Assigned] (SPARK-14913) Simplify configuration API
[ https://issues.apache.org/jira/browse/SPARK-14913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-14913:
------------------------------------
    Assignee: Reynold Xin  (was: Apache Spark)

> Simplify configuration API
> --------------------------
>
>         Key: SPARK-14913
>         URL: https://issues.apache.org/jira/browse/SPARK-14913
>     Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>    Reporter: Reynold Xin
>    Assignee: Reynold Xin
>
> We currently expose both the Hadoop configuration and the Spark SQL
> configuration in RuntimeConfig. I think we can remove the Hadoop
> configuration part and simply generate a Hadoop Configuration on the fly by
> passing all the SQL configurations into it. This way, there is a single
> interface (in Java/Scala/Python/SQL) for end-users.
[jira] [Created] (SPARK-14913) Simplify configuration API
Reynold Xin created SPARK-14913:
-----------------------------------

     Summary: Simplify configuration API
         Key: SPARK-14913
         URL: https://issues.apache.org/jira/browse/SPARK-14913
     Project: Spark
  Issue Type: Sub-task
  Components: SQL
    Reporter: Reynold Xin
    Assignee: Reynold Xin

We currently expose both the Hadoop configuration and the Spark SQL configuration in RuntimeConfig. I think we can remove the Hadoop configuration part and simply generate a Hadoop Configuration on the fly by passing all the SQL configurations into it. This way, there is a single interface (in Java/Scala/Python/SQL) for end-users.
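The idea in the description above can be sketched in a few lines: keep one configuration store and materialize a Hadoop-style configuration from it on demand, rather than exposing two separate config interfaces. This is a hypothetical, simplified analogue only — plain maps stand in for Spark's `RuntimeConfig` and Hadoop's `Configuration`, and the names (`RuntimeConfSketch`, `newHadoopConf`) are illustrative, not Spark's actual API.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified sketch of the SPARK-14913 proposal: a single SQL config store,
// with the Hadoop-style configuration derived from it on the fly.
public class RuntimeConfSketch {
    private final Map<String, String> sqlConf = new HashMap<>();

    public void set(String key, String value) {
        sqlConf.put(key, value);
    }

    public String get(String key) {
        return sqlConf.get(key);
    }

    // Generate a fresh Hadoop-style configuration by copying every SQL
    // setting into it; nothing holds a separately mutated Hadoop config.
    public Map<String, String> newHadoopConf() {
        return new HashMap<>(sqlConf);
    }

    public static void main(String[] args) {
        RuntimeConfSketch conf = new RuntimeConfSketch();
        conf.set("spark.sql.shuffle.partitions", "200");
        // The derived Hadoop config reflects the single source of truth.
        System.out.println(conf.newHadoopConf().get("spark.sql.shuffle.partitions"));
    }
}
```

Because the Hadoop view is generated per call, later `set` calls are picked up automatically by the next `newHadoopConf()`, which is what makes a single end-user interface sufficient.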
[jira] [Assigned] (SPARK-14912) Propagate data source options to Hadoop configurations
[ https://issues.apache.org/jira/browse/SPARK-14912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-14912:
------------------------------------
    Assignee: Reynold Xin  (was: Apache Spark)

> Propagate data source options to Hadoop configurations
> ------------------------------------------------------
>
>         Key: SPARK-14912
>         URL: https://issues.apache.org/jira/browse/SPARK-14912
>     Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>    Reporter: Reynold Xin
>    Assignee: Reynold Xin
>
> We currently have no way for users to propagate options to the underlying
> libraries that rely on Hadoop configurations to work. For example, there are
> various options in parquet-mr that users might want to set, but the data
> source API does not expose a per-job way to set them.
> This patch also propagates the user-specified options into the Hadoop
> Configuration.
[jira] [Assigned] (SPARK-14912) Propagate data source options to Hadoop configurations
[ https://issues.apache.org/jira/browse/SPARK-14912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-14912:
------------------------------------
    Assignee: Apache Spark  (was: Reynold Xin)

> Propagate data source options to Hadoop configurations
> ------------------------------------------------------
>
>         Key: SPARK-14912
>         URL: https://issues.apache.org/jira/browse/SPARK-14912
>     Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>    Reporter: Reynold Xin
>    Assignee: Apache Spark
>
> We currently have no way for users to propagate options to the underlying
> libraries that rely on Hadoop configurations to work. For example, there are
> various options in parquet-mr that users might want to set, but the data
> source API does not expose a per-job way to set them.
> This patch also propagates the user-specified options into the Hadoop
> Configuration.
[jira] [Commented] (SPARK-14912) Propagate data source options to Hadoop configurations
[ https://issues.apache.org/jira/browse/SPARK-14912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15257589#comment-15257589 ]

Apache Spark commented on SPARK-14912:
--------------------------------------

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/12688

> Propagate data source options to Hadoop configurations
> ------------------------------------------------------
>
>         Key: SPARK-14912
>         URL: https://issues.apache.org/jira/browse/SPARK-14912
>     Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>    Reporter: Reynold Xin
>    Assignee: Reynold Xin
>
> We currently have no way for users to propagate options to the underlying
> libraries that rely on Hadoop configurations to work. For example, there are
> various options in parquet-mr that users might want to set, but the data
> source API does not expose a per-job way to set them.
> This patch also propagates the user-specified options into the Hadoop
> Configuration.
[jira] [Created] (SPARK-14912) Propagate data source options to Hadoop configurations
Reynold Xin created SPARK-14912:
-----------------------------------

     Summary: Propagate data source options to Hadoop configurations
         Key: SPARK-14912
         URL: https://issues.apache.org/jira/browse/SPARK-14912
     Project: Spark
  Issue Type: Improvement
  Components: SQL
    Reporter: Reynold Xin
    Assignee: Reynold Xin

We currently have no way for users to propagate options to the underlying libraries that rely on Hadoop configurations to work. For example, there are various options in parquet-mr that users might want to set, but the data source API does not expose a per-job way to set them.

This patch also propagates the user-specified options into the Hadoop Configuration.
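The propagation described above amounts to a merge with per-job precedence: copy the base Hadoop settings, then overlay the user-specified data source options so they win on conflict. The sketch below is a hypothetical analogue, not Spark's implementation — plain maps stand in for Hadoop's `Configuration`, and the method name is invented for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified sketch of the SPARK-14912 behavior: per-job user options are
// folded into the Hadoop configuration handed to the underlying library.
public class OptionPropagation {
    public static Map<String, String> hadoopConfWithOptions(
            Map<String, String> baseHadoopConf,
            Map<String, String> userOptions) {
        // Start from the session-level Hadoop settings...
        Map<String, String> merged = new HashMap<>(baseHadoopConf);
        // ...then overlay the per-job options; they take precedence.
        merged.putAll(userOptions);
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> base = new HashMap<>();
        base.put("parquet.block.size", "134217728");
        Map<String, String> opts = new HashMap<>();
        opts.put("parquet.block.size", "67108864"); // user override for this job
        System.out.println(
            hadoopConfWithOptions(base, opts).get("parquet.block.size"));
    }
}
```

Merging into a copy also keeps the base configuration untouched, so one job's options never leak into another job's Hadoop settings.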
[jira] [Commented] (SPARK-14313) AFTSurvivalRegression model persistence in SparkR
[ https://issues.apache.org/jira/browse/SPARK-14313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15257563#comment-15257563 ]

Apache Spark commented on SPARK-14313:
--------------------------------------

User 'yanboliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/12685

> AFTSurvivalRegression model persistence in SparkR
> -------------------------------------------------
>
>         Key: SPARK-14313
>         URL: https://issues.apache.org/jira/browse/SPARK-14313
>     Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SparkR
>    Reporter: Xiangrui Meng
>    Assignee: Yanbo Liang
[jira] [Assigned] (SPARK-14313) AFTSurvivalRegression model persistence in SparkR
[ https://issues.apache.org/jira/browse/SPARK-14313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-14313:
------------------------------------
    Assignee: Yanbo Liang  (was: Apache Spark)

> AFTSurvivalRegression model persistence in SparkR
> -------------------------------------------------
>
>         Key: SPARK-14313
>         URL: https://issues.apache.org/jira/browse/SPARK-14313
>     Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SparkR
>    Reporter: Xiangrui Meng
>    Assignee: Yanbo Liang
[jira] [Assigned] (SPARK-14313) AFTSurvivalRegression model persistence in SparkR
[ https://issues.apache.org/jira/browse/SPARK-14313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-14313:
------------------------------------
    Assignee: Apache Spark  (was: Yanbo Liang)

> AFTSurvivalRegression model persistence in SparkR
> -------------------------------------------------
>
>         Key: SPARK-14313
>         URL: https://issues.apache.org/jira/browse/SPARK-14313
>     Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SparkR
>    Reporter: Xiangrui Meng
>    Assignee: Apache Spark
[jira] [Updated] (SPARK-14874) Remove the obsolete Batch representation
[ https://issues.apache.org/jira/browse/SPARK-14874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Liwei Lin updated SPARK-14874:
------------------------------
    Description:
The Batch class, which had been used to indicate progress in a stream, was abandoned by SPARK-13985 and then became useless.

Let's:
- remove the Batch class
- -rename getBatch(...) to getData(...) for Source- (update: as discussed in the PR, this is not necessary)
- -rename addBatch(...) to addData(...) for Sink- (update: as discussed in the PR, this is not necessary)

  was:
The Batch class, which had been used to indicate progress in a stream, was abandoned by SPARK-13985 and then became useless.

Let's:
- remove the Batch class
- rename getBatch(...) to getData(...) for Source
- rename addBatch(...) to addData(...) for Sink

> Remove the obsolete Batch representation
> ----------------------------------------
>
>         Key: SPARK-14874
>         URL: https://issues.apache.org/jira/browse/SPARK-14874
>     Project: Spark
>  Issue Type: Improvement
>  Components: SQL
> Affects Versions: 2.0.0
>    Reporter: Liwei Lin
>    Priority: Minor
>
> The Batch class, which had been used to indicate progress in a stream, was
> abandoned by SPARK-13985 and then became useless.
> Let's:
> - remove the Batch class
> - -rename getBatch(...) to getData(...) for Source- (update: as discussed in
> the PR, this is not necessary)
> - -rename addBatch(...) to addData(...) for Sink- (update: as discussed in
> the PR, this is not necessary)
[jira] [Closed] (SPARK-14806) Alias original Hive options in Spark SQL conf
[ https://issues.apache.org/jira/browse/SPARK-14806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin closed SPARK-14806.
-------------------------------
    Resolution: Won't Fix

> Alias original Hive options in Spark SQL conf
> ---------------------------------------------
>
>         Key: SPARK-14806
>         URL: https://issues.apache.org/jira/browse/SPARK-14806
>     Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>    Reporter: Reynold Xin
>
> There are a couple of options we should alias: spark.sql.variable.substitute
> and spark.sql.variable.substitute.depth.
> The Hive config options are hive.variable.substitute and
> hive.variable.substitute.depth.
[jira] [Assigned] (SPARK-14315) GLMs model persistence in SparkR
[ https://issues.apache.org/jira/browse/SPARK-14315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-14315:
------------------------------------
    Assignee:  (was: Apache Spark)

> GLMs model persistence in SparkR
> --------------------------------
>
>         Key: SPARK-14315
>         URL: https://issues.apache.org/jira/browse/SPARK-14315
>     Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SparkR
>    Reporter: Xiangrui Meng
[jira] [Commented] (SPARK-14315) GLMs model persistence in SparkR
[ https://issues.apache.org/jira/browse/SPARK-14315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15257532#comment-15257532 ]

Apache Spark commented on SPARK-14315:
--------------------------------------

User 'GayathriMurali' has created a pull request for this issue:
https://github.com/apache/spark/pull/12683

> GLMs model persistence in SparkR
> --------------------------------
>
>         Key: SPARK-14315
>         URL: https://issues.apache.org/jira/browse/SPARK-14315
>     Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SparkR
>    Reporter: Xiangrui Meng
[jira] [Assigned] (SPARK-14315) GLMs model persistence in SparkR
[ https://issues.apache.org/jira/browse/SPARK-14315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-14315:
------------------------------------
    Assignee: Apache Spark

> GLMs model persistence in SparkR
> --------------------------------
>
>         Key: SPARK-14315
>         URL: https://issues.apache.org/jira/browse/SPARK-14315
>     Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SparkR
>    Reporter: Xiangrui Meng
>    Assignee: Apache Spark
[jira] [Resolved] (SPARK-14861) Replace internal usages of SQLContext with SparkSession
[ https://issues.apache.org/jira/browse/SPARK-14861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin resolved SPARK-14861.
---------------------------------
    Resolution: Fixed
    Fix Version/s: 2.0.0

> Replace internal usages of SQLContext with SparkSession
> -------------------------------------------------------
>
>         Key: SPARK-14861
>         URL: https://issues.apache.org/jira/browse/SPARK-14861
>     Project: Spark
>  Issue Type: Bug
>  Components: SQL
> Affects Versions: 2.0.0
>    Reporter: Andrew Or
>    Assignee: Andrew Or
>     Fix For: 2.0.0
>
> We should try to use SparkSession (the new thing) in as many places as
> possible. We should be careful not to break the public datasource API though.
[jira] [Commented] (SPARK-14904) Add back HiveContext in compatibility package
[ https://issues.apache.org/jira/browse/SPARK-14904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15257521#comment-15257521 ]

Apache Spark commented on SPARK-14904:
--------------------------------------

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/12682

> Add back HiveContext in compatibility package
> ---------------------------------------------
>
>         Key: SPARK-14904
>         URL: https://issues.apache.org/jira/browse/SPARK-14904
>     Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
> Affects Versions: 2.0.0
>    Reporter: Andrew Or
>    Assignee: Andrew Or
>     Fix For: 2.0.0
[jira] [Resolved] (SPARK-14904) Add back HiveContext in compatibility package
[ https://issues.apache.org/jira/browse/SPARK-14904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin resolved SPARK-14904.
---------------------------------
    Resolution: Fixed
    Fix Version/s: 2.0.0

> Add back HiveContext in compatibility package
> ---------------------------------------------
>
>         Key: SPARK-14904
>         URL: https://issues.apache.org/jira/browse/SPARK-14904
>     Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
> Affects Versions: 2.0.0
>    Reporter: Andrew Or
>    Assignee: Andrew Or
>     Fix For: 2.0.0
[jira] [Commented] (SPARK-14911) Fix a potential data race in TaskMemoryManager
[ https://issues.apache.org/jira/browse/SPARK-14911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15257497#comment-15257497 ]

Apache Spark commented on SPARK-14911:
--------------------------------------

User 'lw-lin' has created a pull request for this issue:
https://github.com/apache/spark/pull/12681

> Fix a potential data race in TaskMemoryManager
> ----------------------------------------------
>
>         Key: SPARK-14911
>         URL: https://issues.apache.org/jira/browse/SPARK-14911
>     Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
> Affects Versions: 2.0.0
>    Reporter: Liwei Lin
>    Priority: Minor
>
> SPARK-13210 introduced an `acquiredButNotUsed` field, but it might not be
> correctly synchronized:
> - the write `acquiredButNotUsed += acquired` is guarded by the `this` lock (see
> [here|https://github.com/apache/spark/blame/master/core/src/main/java/org/apache/spark/memory/TaskMemoryManager.java#L271]);
> - the read `memoryManager.releaseExecutionMemory(acquiredButNotUsed,
> taskAttemptId, tungstenMemoryMode)` (see
> [here|https://github.com/apache/spark/blame/master/core/src/main/java/org/apache/spark/memory/TaskMemoryManager.java#L400])
> might not be correctly synchronized, and so might not see
> `acquiredButNotUsed`'s newly written value.
[jira] [Assigned] (SPARK-14911) Fix a potential data race in TaskMemoryManager
[ https://issues.apache.org/jira/browse/SPARK-14911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-14911:
------------------------------------
    Assignee:  (was: Apache Spark)

> Fix a potential data race in TaskMemoryManager
> ----------------------------------------------
>
>         Key: SPARK-14911
>         URL: https://issues.apache.org/jira/browse/SPARK-14911
>     Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
> Affects Versions: 2.0.0
>    Reporter: Liwei Lin
>    Priority: Minor
>
> SPARK-13210 introduced an `acquiredButNotUsed` field, but it might not be
> correctly synchronized:
> - the write `acquiredButNotUsed += acquired` is guarded by the `this` lock (see
> [here|https://github.com/apache/spark/blame/master/core/src/main/java/org/apache/spark/memory/TaskMemoryManager.java#L271]);
> - the read `memoryManager.releaseExecutionMemory(acquiredButNotUsed,
> taskAttemptId, tungstenMemoryMode)` (see
> [here|https://github.com/apache/spark/blame/master/core/src/main/java/org/apache/spark/memory/TaskMemoryManager.java#L400])
> might not be correctly synchronized, and so might not see
> `acquiredButNotUsed`'s newly written value.
[jira] [Assigned] (SPARK-14911) Fix a potential data race in TaskMemoryManager
[ https://issues.apache.org/jira/browse/SPARK-14911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-14911:
------------------------------------
    Assignee: Apache Spark

> Fix a potential data race in TaskMemoryManager
> ----------------------------------------------
>
>         Key: SPARK-14911
>         URL: https://issues.apache.org/jira/browse/SPARK-14911
>     Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
> Affects Versions: 2.0.0
>    Reporter: Liwei Lin
>    Assignee: Apache Spark
>    Priority: Minor
>
> SPARK-13210 introduced an `acquiredButNotUsed` field, but it might not be
> correctly synchronized:
> - the write `acquiredButNotUsed += acquired` is guarded by the `this` lock (see
> [here|https://github.com/apache/spark/blame/master/core/src/main/java/org/apache/spark/memory/TaskMemoryManager.java#L271]);
> - the read `memoryManager.releaseExecutionMemory(acquiredButNotUsed,
> taskAttemptId, tungstenMemoryMode)` (see
> [here|https://github.com/apache/spark/blame/master/core/src/main/java/org/apache/spark/memory/TaskMemoryManager.java#L400])
> might not be correctly synchronized, and so might not see
> `acquiredButNotUsed`'s newly written value.
[jira] [Commented] (SPARK-13902) Make DAGScheduler.getAncestorShuffleDependencies() return in topological order to ensure building ancestor stages first.
[ https://issues.apache.org/jira/browse/SPARK-13902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15257496#comment-15257496 ]

Takuya Ueshin commented on SPARK-13902:
---------------------------------------

Yes, exactly.

> Make DAGScheduler.getAncestorShuffleDependencies() return in topological
> order to ensure building ancestor stages first.
> ------------------------------------------------------------------------
>
>         Key: SPARK-13902
>         URL: https://issues.apache.org/jira/browse/SPARK-13902
>     Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>    Reporter: Takuya Ueshin
>
> {{DAGScheduler}} sometimes generates an incorrect stage graph. Some stages
> are generated for the same shuffleId twice or more, and they are referenced
> by the child stages, because the building order of the graph is not correct.
> Here, we submit an RDD\[F\] having a lineage of RDDs as follows (please see
> this in {{monospaced}} font):
> {noformat}
>        <
>       / \
> [A] <--(1)-- [B] <--(2)-- [C] <--(3)-- [D] <--(4)-- [E] <--(5)-- [F]
>    \                      /
>     <
> {noformat}
> Note: \[\] means an RDD, () means a shuffle dependency.
> {{DAGScheduler}} generates the following stages and their parents for each
> shuffle:
> |     | stage             | parents                                    |
> | (1) | ShuffleMapStage 2 | List()                                     |
> | (2) | ShuffleMapStage 1 | List(ShuffleMapStage 0)                    |
> | (3) | ShuffleMapStage 3 | List(ShuffleMapStage 1)                    |
> | (4) | ShuffleMapStage 4 | List(ShuffleMapStage 2, ShuffleMapStage 3) |
> | (5) | ShuffleMapStage 5 | List(ShuffleMapStage 1, ShuffleMapStage 4) |
> | \-  | ResultStage 6     | List(ShuffleMapStage 5)                    |
> The stage for shuffle id {{0}} should be {{ShuffleMapStage 0}}, but the stage
> for shuffle id {{0}} is generated twice as {{ShuffleMapStage 2}}:
> {{ShuffleMapStage 0}} is overwritten by {{ShuffleMapStage 2}}, and the stage
> {{ShuffleMapStage 1}} keeps referring to the _old_ stage
> {{ShuffleMapStage 0}}.
[jira] [Created] (SPARK-14911) Fix a potential data race in TaskMemoryManager
Liwei Lin created SPARK-14911:
-----------------------------------

     Summary: Fix a potential data race in TaskMemoryManager
         Key: SPARK-14911
         URL: https://issues.apache.org/jira/browse/SPARK-14911
     Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.0.0
    Reporter: Liwei Lin
    Priority: Minor

SPARK-13210 introduced an `acquiredButNotUsed` field, but it might not be correctly synchronized:
- the write `acquiredButNotUsed += acquired` is guarded by the `this` lock (see [here|https://github.com/apache/spark/blame/master/core/src/main/java/org/apache/spark/memory/TaskMemoryManager.java#L271]);
- the read `memoryManager.releaseExecutionMemory(acquiredButNotUsed, taskAttemptId, tungstenMemoryMode)` (see [here|https://github.com/apache/spark/blame/master/core/src/main/java/org/apache/spark/memory/TaskMemoryManager.java#L400]) might not be correctly synchronized, and so might not see `acquiredButNotUsed`'s newly written value.
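The synchronization idiom at issue can be shown in miniature: when a write to a shared field is guarded by the instance lock, a read of that field must take the same lock (or the field must be `volatile`) to be guaranteed to observe the write. The class below is a hypothetical illustration of that fix pattern, not Spark's actual `TaskMemoryManager`; the names are invented.

```java
// Minimal illustration of the SPARK-14911 fix pattern: both the write and
// the read of the shared counter synchronize on the same monitor, so the
// Java memory model guarantees the reader sees the latest written value.
public class AcquiredCounter {
    private long acquiredButNotUsed = 0L;

    // Writer path: guarded by `this`, like the write in the issue.
    public synchronized void recordAcquired(long acquired) {
        acquiredButNotUsed += acquired;
    }

    // Reader path: an *unsynchronized* read here would be the bug described
    // above; taking the same lock establishes the happens-before edge.
    public synchronized long drain() {
        long value = acquiredButNotUsed;
        acquiredButNotUsed = 0L;
        return value;
    }

    public static void main(String[] args) {
        AcquiredCounter c = new AcquiredCounter();
        c.recordAcquired(64);
        c.recordAcquired(32);
        System.out.println(c.drain()); // total acquired-but-not-used memory
    }
}
```

Marking the field `volatile` would also make the plain read safe, but the read-then-reset in `drain` needs the lock anyway to stay atomic.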
[jira] [Updated] (SPARK-13902) Make DAGScheduler.getAncestorShuffleDependencies() return in topological order to ensure building ancestor stages first.
[ https://issues.apache.org/jira/browse/SPARK-13902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Takuya Ueshin updated SPARK-13902:
----------------------------------
    Description:
{{DAGScheduler}} sometimes generates an incorrect stage graph. Some stages are generated for the same shuffleId twice or more, and they are referenced by the child stages, because the building order of the graph is not correct.

Here, we submit an RDD\[F\] having a lineage of RDDs as follows (please see this in {{monospaced}} font):
{noformat}
       <
      / \
[A] <--(1)-- [B] <--(2)-- [C] <--(3)-- [D] <--(4)-- [E] <--(5)-- [F]
   \                      /
    <
{noformat}
Note: \[\] means an RDD, () means a shuffle dependency.

{{DAGScheduler}} generates the following stages and their parents for each shuffle:
|     | stage             | parents                                    |
| (1) | ShuffleMapStage 2 | List()                                     |
| (2) | ShuffleMapStage 1 | List(ShuffleMapStage 0)                    |
| (3) | ShuffleMapStage 3 | List(ShuffleMapStage 1)                    |
| (4) | ShuffleMapStage 4 | List(ShuffleMapStage 2, ShuffleMapStage 3) |
| (5) | ShuffleMapStage 5 | List(ShuffleMapStage 1, ShuffleMapStage 4) |
| \-  | ResultStage 6     | List(ShuffleMapStage 5)                    |

The stage for shuffle id {{0}} should be {{ShuffleMapStage 0}}, but the stage for shuffle id {{0}} is generated twice as {{ShuffleMapStage 2}}: {{ShuffleMapStage 0}} is overwritten by {{ShuffleMapStage 2}}, and the stage {{ShuffleMapStage 1}} keeps referring to the _old_ stage {{ShuffleMapStage 0}}.

  was:
{{DAGScheduler}} sometimes generates an incorrect stage graph. Some stages are generated for the same shuffleId twice or more, and they are referenced by the child stages, because the building order of the graph is not correct.

Here, we submit an RDD\[F\] having a lineage of RDDs as follows (please see this in {{monospaced}} font):
{noformat}
       <
      / \
[A] <--(1)-- [B] <--(2)-- [C] <--(3)-- [D] <--(4)-- [E] <--(5)-- [F]
   \                      /
    <
{noformat}
{{DAGScheduler}} generates the following stages and their parents for each shuffle id:
| shuffle id | stage             | parents                                    |
| 0          | ShuffleMapStage 2 | List()                                     |
| 1          | ShuffleMapStage 1 | List(ShuffleMapStage 0)                    |
| 2          | ShuffleMapStage 3 | List(ShuffleMapStage 1)                    |
| 3          | ShuffleMapStage 4 | List(ShuffleMapStage 2, ShuffleMapStage 3) |
| 4          | ShuffleMapStage 5 | List(ShuffleMapStage 1, ShuffleMapStage 4) |
| \-         | ResultStage 6     | List(ShuffleMapStage 5)                    |

The stage for shuffle id {{0}} should be {{ShuffleMapStage 0}}, but the stage for shuffle id {{0}} is generated twice as {{ShuffleMapStage 2}}: {{ShuffleMapStage 0}} is overwritten by {{ShuffleMapStage 2}}, and the stage {{ShuffleMapStage 1}} keeps referring to the _old_ stage {{ShuffleMapStage 0}}.

> Make DAGScheduler.getAncestorShuffleDependencies() return in topological
> order to ensure building ancestor stages first.
> ------------------------------------------------------------------------
>
>         Key: SPARK-13902
>         URL: https://issues.apache.org/jira/browse/SPARK-13902
>     Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>    Reporter: Takuya Ueshin
>
> {{DAGScheduler}} sometimes generates an incorrect stage graph. Some stages
> are generated for the same shuffleId twice or more, and they are referenced
> by the child stages, because the building order of the graph is not correct.
> Here, we submit an RDD\[F\] having a lineage of RDDs as follows (please see
> this in {{monospaced}} font):
> {noformat}
>        <
>       / \
> [A] <--(1)-- [B] <--(2)-- [C] <--(3)-- [D] <--(4)-- [E] <--(5)-- [F]
>    \                      /
>     <
> {noformat}
> Note: \[\] means an RDD, () means a shuffle dependency.
> {{DAGScheduler}} generates the following stages and their parents for each
> shuffle:
> |     | stage             | parents                                    |
> | (1) | ShuffleMapStage 2 | List()                                     |
> | (2) | ShuffleMapStage 1 | List(ShuffleMapStage 0)                    |
> | (3) | ShuffleMapStage 3 | List(ShuffleMapStage 1)                    |
> | (4) | ShuffleMapStage 4 | List(ShuffleMapStage 2, ShuffleMapStage 3) |
> | (5) | ShuffleMapStage 5 | List(ShuffleMapStage 1, ShuffleMapStage 4) |
> | \-  | ResultStage 6     | List(ShuffleMapStage 5)                    |
> The stage for shuffle id {{0}} should be {{ShuffleMapStage 0}}, but the stage
> for shuffle id {{0}} is generated twice as {{ShuffleMapStage 2}}:
> {{ShuffleMapStage 0}} is overwritten by {{ShuffleMapStage 2}}, and the stage
> {{ShuffleMapStage 1}} keeps referring to the _old_ stage
> {{ShuffleMapStage 0}}.
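The fix direction named in the issue title — return ancestor shuffle dependencies in topological order so each stage is built before any stage that depends on it — is a standard depth-first post-order traversal. The sketch below is a hypothetical, generic illustration over a string-keyed parent map, not DAGScheduler's real data structures; the class and method names are invented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Generic sketch of the SPARK-13902 idea: visit dependencies ancestors-first
// so every stage is created before the stages that reference it.
public class TopoOrder {
    // parents maps each stage to the stages it depends on.
    public static List<String> ancestorsFirst(Map<String, List<String>> parents) {
        List<String> order = new ArrayList<>();
        Set<String> visited = new HashSet<>();
        for (String node : parents.keySet()) {
            visit(node, parents, visited, order);
        }
        return order;
    }

    private static void visit(String node, Map<String, List<String>> parents,
                              Set<String> visited, List<String> order) {
        if (!visited.add(node)) return; // already emitted
        for (String p : parents.getOrDefault(node, Collections.emptyList())) {
            visit(p, parents, visited, order); // emit ancestors first
        }
        order.add(node); // post-order: node comes after all its parents
    }

    public static void main(String[] args) {
        // Shape of the stage graph from the description above (simplified).
        Map<String, List<String>> parents = new LinkedHashMap<>();
        parents.put("stage5", Arrays.asList("stage1", "stage4"));
        parents.put("stage4", Arrays.asList("stage2", "stage3"));
        parents.put("stage3", Arrays.asList("stage1"));
        parents.put("stage1", Arrays.asList("stage0"));
        List<String> order = ancestorsFirst(parents);
        // Every stage appears after all of its parents.
        System.out.println(order.indexOf("stage0") < order.indexOf("stage1"));
    }
}
```

With this ordering, a stage registry that creates stages in traversal order can never overwrite an already-created ancestor, which is the duplication the table above exhibits.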
[jira] [Commented] (SPARK-13902) Make DAGScheduler.getAncestorShuffleDependencies() return in topological order to ensure building ancestor stages first.
[ https://issues.apache.org/jira/browse/SPARK-13902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15257481#comment-15257481 ] Takuya Ueshin commented on SPARK-13902: --- I'm sorry, I made a mistake. I should have written the number in the diagram, but I wrote the actual shuffle id from DAGScheduler we can get when we run the test. I'll update it. So the rest of your questions are right. > Make DAGScheduler.getAncestorShuffleDependencies() return in topological > order to ensure building ancestor stages first. > > > Key: SPARK-13902 > URL: https://issues.apache.org/jira/browse/SPARK-13902 > Project: Spark > Issue Type: Bug > Components: Scheduler >Reporter: Takuya Ueshin > > {{DAGScheduler}} sometimes generate incorrect stage graph. > Some stages are generated for the same shuffleId twice or more and they are > referenced by the child stages because the building order of the graph is not > correct. > Here, we submit an RDD\[F\] having a linage of RDDs as follows (please see > this in {{monospaced}} font): > {noformat} > < > / \ > [A] <--(1)-- [B] <--(2)-- [C] <--(3)-- [D] <--(4)-- [E] <--(5)-- [F] >\ / > < > {noformat} > {{DAGScheduler}} generates the following stages and their parents for each > shuffle id: > | shuffle id | stage | parents | > | 0 | ShuffleMapStage 2 | List() | > | 1 | ShuffleMapStage 1 | List(ShuffleMapStage 0) | > | 2 | ShuffleMapStage 3 | List(ShuffleMapStage 1) | > | 3 | ShuffleMapStage 4 | List(ShuffleMapStage 2, ShuffleMapStage 3) | > | 4 | ShuffleMapStage 5 | List(ShuffleMapStage 1, ShuffleMapStage 4) | > | \- | ResultStage 6 | List(ShuffleMapStage 5) | > The stage for shuffle id {{0}} should be {{ShuffleMapStage 0}}, but the stage > for shuffle id {{0}} is generated twice as {{ShuffleMapStage 2}} and > {{ShuffleMapStage 0}} is overwritten by {{ShuffleMapStage 2}}, and the stage > {{ShuffleMap Stage1}} keeps referring the _old_ stage {{ShuffleMapStage 0}}. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
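The fix discussed in this thread amounts to visiting ancestor shuffle dependencies in topological order, so that every parent stage is registered before any stage that references it. As a minimal illustration (plain Python, not Spark's actual DAGScheduler code; node names follow the ticket's diagram, edge labels are hypothetical), a DFS postorder gives exactly that guarantee:

```python
# Illustrative sketch only: return ancestor dependencies in topological
# order so parent stages can be built before the stages that depend on
# them. This is NOT Spark's actual implementation.

def ancestor_deps_topological(dependencies, start):
    """DFS postorder over a dependency graph.

    dependencies maps a node to the list of nodes it directly depends on.
    Postorder guarantees every ancestor appears before any node that
    depends on it, so stages can be created parents-first.
    """
    visited, order = set(), []

    def visit(node):
        if node in visited:
            return
        visited.add(node)
        for parent in dependencies.get(node, []):
            visit(parent)
        order.append(node)  # emitted only after all of its ancestors

    visit(start)
    order.pop()  # drop the start node itself; we want only ancestors
    return order

# Lineage resembling the ticket's diagram: F depends on E, E on D (and B),
# D on C (and A), and so on. The extra edges stand in for the two arcs.
deps = {"F": ["E"], "E": ["D", "B"], "D": ["C", "A"], "C": ["B"], "B": ["A"]}
print(ancestor_deps_topological(deps, "F"))  # every parent precedes its children
```

Building stages in this order means the stage for the earliest shuffle id is created exactly once, instead of being regenerated and overwritten as described above.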
[jira] [Assigned] (SPARK-14314) K-means model persistence in SparkR
[ https://issues.apache.org/jira/browse/SPARK-14314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14314: Assignee: (was: Apache Spark) > K-means model persistence in SparkR > --- > > Key: SPARK-14314 > URL: https://issues.apache.org/jira/browse/SPARK-14314 > Project: Spark > Issue Type: Sub-task > Components: ML, SparkR >Reporter: Xiangrui Meng > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14314) K-means model persistence in SparkR
[ https://issues.apache.org/jira/browse/SPARK-14314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15257475#comment-15257475 ] Apache Spark commented on SPARK-14314: -- User 'GayathriMurali' has created a pull request for this issue: https://github.com/apache/spark/pull/12680 > K-means model persistence in SparkR > --- > > Key: SPARK-14314 > URL: https://issues.apache.org/jira/browse/SPARK-14314 > Project: Spark > Issue Type: Sub-task > Components: ML, SparkR >Reporter: Xiangrui Meng > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14314) K-means model persistence in SparkR
[ https://issues.apache.org/jira/browse/SPARK-14314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14314: Assignee: Apache Spark > K-means model persistence in SparkR > --- > > Key: SPARK-14314 > URL: https://issues.apache.org/jira/browse/SPARK-14314 > Project: Spark > Issue Type: Sub-task > Components: ML, SparkR >Reporter: Xiangrui Meng >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14910) Native DDL Command Support for Describe Function in Non-identifier Format
[ https://issues.apache.org/jira/browse/SPARK-14910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15257427#comment-15257427 ] Apache Spark commented on SPARK-14910: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/12679 > Native DDL Command Support for Describe Function in Non-identifier Format > - > > Key: SPARK-14910 > URL: https://issues.apache.org/jira/browse/SPARK-14910 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li > > The existing `Describe Function` only support the function name in > `identifier`. This is different from what Hive behaves. That is why many test > cases `udf_abc` in `HiveCompatibilitySuite` do not pass. For example, > - udf_not.q > - udf_bitwise_not.q > We need to support the command of `Describe Function` whose function names > are in the following formats that are not natively supported: > - `STRING` (e.g., `'func1'`) > - `comparisonOperator` (e.g,. `<`) > - `arithmeticOperator` (e.g., `+`) > - `predicateOperator` (e.g., `or`) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14910) Native DDL Command Support for Describe Function in Non-identifier Format
[ https://issues.apache.org/jira/browse/SPARK-14910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14910: Assignee: (was: Apache Spark) > Native DDL Command Support for Describe Function in Non-identifier Format > - > > Key: SPARK-14910 > URL: https://issues.apache.org/jira/browse/SPARK-14910 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li > > The existing `Describe Function` only support the function name in > `identifier`. This is different from what Hive behaves. That is why many test > cases `udf_abc` in `HiveCompatibilitySuite` do not pass. For example, > - udf_not.q > - udf_bitwise_not.q > We need to support the command of `Describe Function` whose function names > are in the following formats that are not natively supported: > - `STRING` (e.g., `'func1'`) > - `comparisonOperator` (e.g,. `<`) > - `arithmeticOperator` (e.g., `+`) > - `predicateOperator` (e.g., `or`) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14910) Native DDL Command Support for Describe Function in Non-identifier Format
[ https://issues.apache.org/jira/browse/SPARK-14910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14910: Assignee: Apache Spark > Native DDL Command Support for Describe Function in Non-identifier Format > - > > Key: SPARK-14910 > URL: https://issues.apache.org/jira/browse/SPARK-14910 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li >Assignee: Apache Spark > > The existing `Describe Function` only support the function name in > `identifier`. This is different from what Hive behaves. That is why many test > cases `udf_abc` in `HiveCompatibilitySuite` do not pass. For example, > - udf_not.q > - udf_bitwise_not.q > We need to support the command of `Describe Function` whose function names > are in the following formats that are not natively supported: > - `STRING` (e.g., `'func1'`) > - `comparisonOperator` (e.g,. `<`) > - `arithmeticOperator` (e.g., `+`) > - `predicateOperator` (e.g., `or`) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14910) Native DDL Command Support for Describe Function in Non-identifier Format
[ https://issues.apache.org/jira/browse/SPARK-14910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-14910: Issue Type: Sub-task (was: Improvement) Parent: SPARK-14118 > Native DDL Command Support for Describe Function in Non-identifier Format > - > > Key: SPARK-14910 > URL: https://issues.apache.org/jira/browse/SPARK-14910 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li > > The existing `Describe Function` only support the function name in > `identifier`. This is different from what Hive behaves. That is why many test > cases `udf_abc` in `HiveCompatibilitySuite` do not pass. For example, > - udf_not.q > - udf_bitwise_not.q > We need to support the command of `Describe Function` whose function names > are in the following formats that are not natively supported: > - `STRING` (e.g., `'func1'`) > - `comparisonOperator` (e.g,. `<`) > - `arithmeticOperator` (e.g., `+`) > - `predicateOperator` (e.g., `or`) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14910) Native DDL Command Support for Describe Function in Non-identifier Format
Xiao Li created SPARK-14910: --- Summary: Native DDL Command Support for Describe Function in Non-identifier Format Key: SPARK-14910 URL: https://issues.apache.org/jira/browse/SPARK-14910 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.0.0 Reporter: Xiao Li The existing `Describe Function` only supports function names in `identifier` format. This is different from how Hive behaves. That is why many test cases `udf_abc` in `HiveCompatibilitySuite` do not pass. For example, - udf_not.q - udf_bitwise_not.q We need to support `Describe Function` for function names in the following formats, which are not natively supported: - `STRING` (e.g., `'func1'`) - `comparisonOperator` (e.g., `<`) - `arithmeticOperator` (e.g., `+`) - `predicateOperator` (e.g., `or`) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
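For illustration, the unsupported forms listed in the ticket would correspond to statements like the following (a sketch only; the exact accepted syntax depends on the parser changes in the PR):

```sql
-- Already supported: function name as an identifier
DESCRIBE FUNCTION abs;

-- Forms this ticket proposes to support natively:
DESCRIBE FUNCTION 'func1';  -- STRING
DESCRIBE FUNCTION <;        -- comparisonOperator
DESCRIBE FUNCTION +;        -- arithmeticOperator
DESCRIBE FUNCTION or;       -- predicateOperator
```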
[jira] [Updated] (SPARK-14909) Spark UI submitted time is wrong
[ https://issues.apache.org/jira/browse/SPARK-14909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Christophe updated SPARK-14909: --- Attachment: time-spark3.png time-spark2.png time-spark1.png spark-submission.png spark-submission.png shows what is displayed on the main web UI. The submitted timestamps are not what I expect. Instead, in time-spark{123}.png I can see the correct timestamps. > Spark UI submitted time is wrong > > > Key: SPARK-14909 > URL: https://issues.apache.org/jira/browse/SPARK-14909 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.0 >Reporter: Christophe > Attachments: spark-submission.png, time-spark1.png, time-spark2.png, > time-spark3.png > > > There is something wrong with the "submitted time" reported on the main web > UI. > For example, I have jobs submitted every 5 minutes(00; 05; 10; 15 ...) > Under the "Completed applications", I can see my jobs with a submitted > timestamp of same value: 11:04 AM 26/04/2016 > But, if I click on the individual application and look at the submitted time > at the top, I get the expected values, for example: Submit Date: Tue Apr 26 > 01:05:03 UTC 2016 > I'll try to attach some screenshot -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14908) Provide support HDFS-located resources for "spark.executor.extraClasspath" on YARN
[ https://issues.apache.org/jira/browse/SPARK-14908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15257390#comment-15257390 ] Apache Spark commented on SPARK-14908: -- User 'mikhaildubkov' has created a pull request for this issue: https://github.com/apache/spark/pull/12678 > Provide support HDFS-located resources for "spark.executor.extraClasspath" > on YARN > --- > > Key: SPARK-14908 > URL: https://issues.apache.org/jira/browse/SPARK-14908 > Project: Spark > Issue Type: Improvement > Components: YARN >Reporter: Dubkov Mikhail >Priority: Minor > > On our project we use custom implementation of SparkSerializer and we found > that it loads serializer class when launch executor (SparkEnv.create()). So, > we were forced to use "spark.executor.extraClassPath" and custom serializer > class loads fine for now. But, it is not well for deployment process, because > currently, "spark.executor.ClassPath" does not support hdfs-based resoruces, > that means we should deploy artifact with serializer to each Hadoop node. We > would like to simplify deployment process. > We have tried make changes for this purpose and it works now for us. The > changes is relevant only for Hadoop/YARN deployment. > We didn't any workaround how we can avoid extra class path definition for > custom serializer implementation, please, let us know if we missed something. > I will create pull request for master branch, could you please look into > changes and go back with feedback? > We need these changes in master branch to simplify our future upgrade and I > hope this improvement can be helpful for other Spark users. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14908) Provide support HDFS-located resources for "spark.executor.extraClasspath" on YARN
[ https://issues.apache.org/jira/browse/SPARK-14908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14908: Assignee: Apache Spark > Provide support HDFS-located resources for "spark.executor.extraClasspath" > on YARN > --- > > Key: SPARK-14908 > URL: https://issues.apache.org/jira/browse/SPARK-14908 > Project: Spark > Issue Type: Improvement > Components: YARN >Reporter: Dubkov Mikhail >Assignee: Apache Spark >Priority: Minor > > On our project we use custom implementation of SparkSerializer and we found > that it loads serializer class when launch executor (SparkEnv.create()). So, > we were forced to use "spark.executor.extraClassPath" and custom serializer > class loads fine for now. But, it is not well for deployment process, because > currently, "spark.executor.ClassPath" does not support hdfs-based resoruces, > that means we should deploy artifact with serializer to each Hadoop node. We > would like to simplify deployment process. > We have tried make changes for this purpose and it works now for us. The > changes is relevant only for Hadoop/YARN deployment. > We didn't any workaround how we can avoid extra class path definition for > custom serializer implementation, please, let us know if we missed something. > I will create pull request for master branch, could you please look into > changes and go back with feedback? > We need these changes in master branch to simplify our future upgrade and I > hope this improvement can be helpful for other Spark users. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14908) Provide support HDFS-located resources for "spark.executor.extraClasspath" on YARN
[ https://issues.apache.org/jira/browse/SPARK-14908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14908: Assignee: (was: Apache Spark) > Provide support HDFS-located resources for "spark.executor.extraClasspath" > on YARN > --- > > Key: SPARK-14908 > URL: https://issues.apache.org/jira/browse/SPARK-14908 > Project: Spark > Issue Type: Improvement > Components: YARN >Reporter: Dubkov Mikhail >Priority: Minor > > On our project we use custom implementation of SparkSerializer and we found > that it loads serializer class when launch executor (SparkEnv.create()). So, > we were forced to use "spark.executor.extraClassPath" and custom serializer > class loads fine for now. But, it is not well for deployment process, because > currently, "spark.executor.ClassPath" does not support hdfs-based resoruces, > that means we should deploy artifact with serializer to each Hadoop node. We > would like to simplify deployment process. > We have tried make changes for this purpose and it works now for us. The > changes is relevant only for Hadoop/YARN deployment. > We didn't any workaround how we can avoid extra class path definition for > custom serializer implementation, please, let us know if we missed something. > I will create pull request for master branch, could you please look into > changes and go back with feedback? > We need these changes in master branch to simplify our future upgrade and I > hope this improvement can be helpful for other Spark users. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14909) Spark UI submitted time is wrong
Christophe created SPARK-14909: -- Summary: Spark UI submitted time is wrong Key: SPARK-14909 URL: https://issues.apache.org/jira/browse/SPARK-14909 Project: Spark Issue Type: Bug Affects Versions: 1.6.0 Reporter: Christophe There is something wrong with the "submitted time" reported on the main web UI. For example, I have jobs submitted every 5 minutes (00; 05; 10; 15 ...). Under "Completed applications", I can see my jobs with a submitted timestamp of the same value: 11:04 AM 26/04/2016. But if I click on the individual application and look at the submitted time at the top, I get the expected values, for example: Submit Date: Tue Apr 26 01:05:03 UTC 2016. I'll try to attach some screenshots -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-14772) Python ML Params.copy treats uid, paramMaps differently than Scala
[ https://issues.apache.org/jira/browse/SPARK-14772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hujiayin updated SPARK-14772: - Comment: was deleted (was: I can submit a code to fix this issue and I'm testing it.) > Python ML Params.copy treats uid, paramMaps differently than Scala > -- > > Key: SPARK-14772 > URL: https://issues.apache.org/jira/browse/SPARK-14772 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Reporter: Joseph K. Bradley > > In PySpark, {{ml.param.Params.copy}} does not quite match the Scala > implementation: > * It does not copy the UID > * It does not respect the difference between defaultParamMap and paramMap. > This is an issue with {{_copyValues}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-14902) Expose user-facing RuntimeConfig in SparkSession
[ https://issues.apache.org/jira/browse/SPARK-14902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-14902. - Resolution: Fixed Fix Version/s: 2.0.0 > Expose user-facing RuntimeConfig in SparkSession > > > Key: SPARK-14902 > URL: https://issues.apache.org/jira/browse/SPARK-14902 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or >Assignee: Andrew Or > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6339) Support creating temporary tables with DDL
[ https://issues.apache.org/jira/browse/SPARK-6339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-6339: Target Version/s: 2.0.0 > Support creating temporary tables with DDL > -- > > Key: SPARK-6339 > URL: https://issues.apache.org/jira/browse/SPARK-6339 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 1.3.0 >Reporter: Hossein Falaki > > It would be useful to support the following: > {code} > create temporary table counted as > select count(transactions), company from sales group by company > {code} > Right now this is possible through registerTempTable() -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14908) Provide support HDFS-located resources for "spark.executor.extraClasspath" on YARN
Dubkov Mikhail created SPARK-14908: -- Summary: Provide support HDFS-located resources for "spark.executor.extraClasspath" on YARN Key: SPARK-14908 URL: https://issues.apache.org/jira/browse/SPARK-14908 Project: Spark Issue Type: Improvement Components: YARN Reporter: Dubkov Mikhail Priority: Minor On our project we use a custom implementation of SparkSerializer, and we found that the serializer class is loaded when the executor launches (SparkEnv.create()). So, we were forced to use "spark.executor.extraClassPath", and the custom serializer class loads fine for now. However, this is not good for the deployment process, because "spark.executor.extraClassPath" does not currently support HDFS-based resources, which means we have to deploy the artifact with the serializer to each Hadoop node. We would like to simplify the deployment process. We have tried making changes for this purpose, and it works for us now. The changes are relevant only for Hadoop/YARN deployments. We didn't find any workaround to avoid the extra classpath definition for a custom serializer implementation; please let us know if we missed something. I will create a pull request for the master branch; could you please look into the changes and come back with feedback? We need these changes in the master branch to simplify our future upgrades, and I hope this improvement can be helpful for other Spark users. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
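To make the deployment pain concrete, here is a hypothetical spark-defaults.conf fragment; the hdfs:// form is what this ticket proposes, not something current Spark releases support, and the paths are invented for illustration:

```properties
# Today: this jar must be pre-deployed at the same local path on every YARN node.
spark.executor.extraClassPath  /opt/libs/custom-serializer.jar

# Proposed by this ticket (hypothetical): reference the artifact once in HDFS,
# so it no longer has to be copied to each Hadoop node.
spark.executor.extraClassPath  hdfs:///libs/custom-serializer.jar
```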
[jira] [Assigned] (SPARK-14907) Use repartition in GLMRegressionModel.save
[ https://issues.apache.org/jira/browse/SPARK-14907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14907: Assignee: (was: Apache Spark) > Use repartition in GLMRegressionModel.save > -- > > Key: SPARK-14907 > URL: https://issues.apache.org/jira/browse/SPARK-14907 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Dongjoon Hyun >Priority: Trivial > > This issue changes `GLMRegressionModel.save` function like the following code > that is similar to other algorithms' parquet write. > {code} > - val dataRDD: DataFrame = sc.parallelize(Seq(data), 1).toDF() > - // TODO: repartition with 1 partition after SPARK-5532 gets fixed > - dataRDD.write.parquet(Loader.dataPath(path)) > + > sqlContext.createDataFrame(Seq(data)).repartition(1).write.parquet(Loader.dataPath(path)) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14907) Use repartition in GLMRegressionModel.save
[ https://issues.apache.org/jira/browse/SPARK-14907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15257324#comment-15257324 ] Apache Spark commented on SPARK-14907: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/12676 > Use repartition in GLMRegressionModel.save > -- > > Key: SPARK-14907 > URL: https://issues.apache.org/jira/browse/SPARK-14907 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Dongjoon Hyun >Priority: Trivial > > This issue changes `GLMRegressionModel.save` function like the following code > that is similar to other algorithms' parquet write. > {code} > - val dataRDD: DataFrame = sc.parallelize(Seq(data), 1).toDF() > - // TODO: repartition with 1 partition after SPARK-5532 gets fixed > - dataRDD.write.parquet(Loader.dataPath(path)) > + > sqlContext.createDataFrame(Seq(data)).repartition(1).write.parquet(Loader.dataPath(path)) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14613) Add @Since into the matrix and vector classes in spark-mllib-local
[ https://issues.apache.org/jira/browse/SPARK-14613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai updated SPARK-14613: Assignee: Pravin Gadakh > Add @Since into the matrix and vector classes in spark-mllib-local > -- > > Key: SPARK-14613 > URL: https://issues.apache.org/jira/browse/SPARK-14613 > Project: Spark > Issue Type: Sub-task > Components: Build, ML >Reporter: DB Tsai >Assignee: Pravin Gadakh > > In spark-mllib-local, we're no longer able to use the @Since annotation. As > a result, we will switch to standard javadoc style using /** @Since */. This > task will add those new APIs as @Since 2.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14907) Use repartition in GLMRegressionModel.save
[ https://issues.apache.org/jira/browse/SPARK-14907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14907: Assignee: Apache Spark > Use repartition in GLMRegressionModel.save > -- > > Key: SPARK-14907 > URL: https://issues.apache.org/jira/browse/SPARK-14907 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Trivial > > This issue changes `GLMRegressionModel.save` function like the following code > that is similar to other algorithms' parquet write. > {code} > - val dataRDD: DataFrame = sc.parallelize(Seq(data), 1).toDF() > - // TODO: repartition with 1 partition after SPARK-5532 gets fixed > - dataRDD.write.parquet(Loader.dataPath(path)) > + > sqlContext.createDataFrame(Seq(data)).repartition(1).write.parquet(Loader.dataPath(path)) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14906) Move VectorUDT and MatrixUDT in PySpark to new ML package
[ https://issues.apache.org/jira/browse/SPARK-14906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liang-Chi Hsieh updated SPARK-14906: Issue Type: Sub-task (was: Improvement) Parent: SPARK-13944 > Move VectorUDT and MatrixUDT in PySpark to new ML package > - > > Key: SPARK-14906 > URL: https://issues.apache.org/jira/browse/SPARK-14906 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Reporter: Liang-Chi Hsieh > > As we move VectorUDT and MatrixUDT in Scala to new ml package, the PySpark > codes should be moved too. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14906) Move VectorUDT and MatrixUDT in PySpark to new ML package
[ https://issues.apache.org/jira/browse/SPARK-14906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liang-Chi Hsieh updated SPARK-14906: Description: As we move VectorUDT and MatrixUDT in Scala to new ml package, the PySpark codes should be moved too. > Move VectorUDT and MatrixUDT in PySpark to new ML package > - > > Key: SPARK-14906 > URL: https://issues.apache.org/jira/browse/SPARK-14906 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: Liang-Chi Hsieh > > As we move VectorUDT and MatrixUDT in Scala to new ml package, the PySpark > codes should be moved too. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14907) Use repartition in GLMRegressionModel.save
Dongjoon Hyun created SPARK-14907: - Summary: Use repartition in GLMRegressionModel.save Key: SPARK-14907 URL: https://issues.apache.org/jira/browse/SPARK-14907 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Dongjoon Hyun Priority: Trivial This issue changes the `GLMRegressionModel.save` function as in the following code, which is similar to other algorithms' parquet writes. {code} - val dataRDD: DataFrame = sc.parallelize(Seq(data), 1).toDF() - // TODO: repartition with 1 partition after SPARK-5532 gets fixed - dataRDD.write.parquet(Loader.dataPath(path)) + sqlContext.createDataFrame(Seq(data)).repartition(1).write.parquet(Loader.dataPath(path)) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14906) Move VectorUDT and MatrixUDT in PySpark to new ML package
Liang-Chi Hsieh created SPARK-14906: --- Summary: Move VectorUDT and MatrixUDT in PySpark to new ML package Key: SPARK-14906 URL: https://issues.apache.org/jira/browse/SPARK-14906 Project: Spark Issue Type: Improvement Components: ML, PySpark Reporter: Liang-Chi Hsieh -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14249) Change MLReader.read to be a property for PySpark
[ https://issues.apache.org/jira/browse/SPARK-14249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15257311#comment-15257311 ] Miao Wang commented on SPARK-14249: --- [~josephkb] Thanks! It is good to learn something new. Miao > Change MLReader.read to be a property for PySpark > - > > Key: SPARK-14249 > URL: https://issues.apache.org/jira/browse/SPARK-14249 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: Joseph K. Bradley >Priority: Minor > > To match MLWritable.write and SQLContext.read, it will be good to make the > PySpark MLReader classmethod {{read}} be a property. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
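The API change requested here (turning the {{read}} classmethod into a property) can be illustrated in plain Python. The {{classproperty}} descriptor and the class names below are an illustrative sketch, not the actual pyspark.ml implementation:

```python
# Sketch of exposing a reader as a property instead of a classmethod,
# so `Model.read.load(path)` mirrors `model.write.save(path)`.
# `classproperty`, `Reader`, and `Model` are illustrative names only.

class classproperty:
    """Minimal descriptor combining @classmethod and @property."""
    def __init__(self, fget):
        self.fget = fget

    def __get__(self, obj, owner):
        # Called for both class and instance access; always bind the class.
        return self.fget(owner)

class Reader:
    def __init__(self, cls):
        self.cls = cls

    def load(self, path):
        return "loaded %s from %s" % (self.cls.__name__, path)

class Model:
    @classmethod
    def read_old(cls):   # old style: must be called, Model.read_old()
        return Reader(cls)

    @classproperty
    def read(cls):       # new style: plain attribute access, Model.read
        return Reader(cls)

print(Model.read_old().load("/tmp/model"))
print(Model.read.load("/tmp/model"))  # no parentheses after `read`
```

A plain {{@property}} alone would not suffice here, since it only works on instances; a class-level descriptor like the one above is one way to keep {{Model.read}} usable directly on the class.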
[jira] [Commented] (SPARK-14894) Python GaussianMixture summary
[ https://issues.apache.org/jira/browse/SPARK-14894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15257307#comment-15257307 ] Apache Spark commented on SPARK-14894: -- User 'GayathriMurali' has created a pull request for this issue: https://github.com/apache/spark/pull/12675 > Python GaussianMixture summary > -- > > Key: SPARK-14894 > URL: https://issues.apache.org/jira/browse/SPARK-14894 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Reporter: Joseph K. Bradley >Priority: Minor > > In spark.ml, GaussianMixture includes a result summary. The Python API > should provide the same functionality. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14880) Parallel Gradient Descent with less map-reduce shuffle overhead
[ https://issues.apache.org/jira/browse/SPARK-14880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15257297#comment-15257297 ] Ahmed Mahran commented on SPARK-14880: -- I think Zinkevich's analysis requires the loss to be convex, regardless of whether it is smooth. In their evaluation, they used a Huber loss, which I think is not smooth. I'd like to highlight that this algorithm is different from Zinkevich's in two ways: - It uses mini-batch SGD instead of strict SGD - It applies higher-level iterations I don't have theoretical evidence about the effect of both modifications on convergence. However, they seem plausible for the following reasons. In less technical terms, the trick is to guarantee that the parallel partitions converge to limits as close to each other as possible. Imagine a bunch of climbers, one on each partition, climbing similar hills starting from similar points with the same rate and steps in similar directions; they would eventually end at similar limits. The following seem to be logically plausible guarantees: - Using the same initialization, the same step size per iteration and number of iterations - Using mini-batches with the same sampling distribution reduces stochasticity - Averaging and reiterating resynchronizes the possibly deviated climbers to the same point - Reshuffling helps produce new samples to learn from I would be interested in submitting it as a Spark package. I'd also be interested in carrying out experiments; suggestions would be much appreciated. > Parallel Gradient Descent with less map-reduce shuffle overhead > --- > > Key: SPARK-14880 > URL: https://issues.apache.org/jira/browse/SPARK-14880 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Ahmed Mahran > Labels: performance > > The current implementation of (Stochastic) Gradient Descent performs one > map-reduce shuffle per iteration. 
Moreover, when the sampling fraction gets > smaller, the algorithm becomes shuffle-bound instead of CPU-bound. > {code} > (1 to numIterations or convergence) { > rdd > .sample(fraction) > .map(Gradient) > .reduce(Update) > } > {code} > A more performant variation requires only one map-reduce shuffle regardless of the > number of iterations. A local mini-batch SGD could be run on each partition, > then the results could be averaged. This is based on (Zinkevich, Martin, > Markus Weimer, Lihong Li, and Alex J. Smola. "Parallelized stochastic > gradient descent." In Advances in neural information processing systems, > 2010, > http://www.research.rutgers.edu/~lihong/pub/Zinkevich11Parallelized.pdf). > {code} > rdd > .shuffle() > .mapPartitions((1 to numIterations or convergence) { >iter.sample(fraction).map(Gradient).reduce(Update) > }) > .reduce(Average) > {code} > A higher-level iteration could enclose the above variation: shuffling the > data before the local mini-batches and feeding back the average weights from > the last iteration. This allows more variability in the sampling of the > mini-batches, with the possibility of covering the whole dataset. Here is a Spark-based > implementation > https://github.com/mashin-io/rich-spark/blob/master/src/main/scala/org/apache/spark/mllib/optimization/ParallelSGD.scala > {code} > (1 to numIterations1 or convergence) { > rdd > .shuffle() > .mapPartitions((1 to numIterations2 or convergence) { > iter.sample(fraction).map(Gradient).reduce(Update) > }) > .reduce(Average) > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
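The scheme quoted above (local mini-batch SGD per partition, then averaging, repeated over outer rounds) can be sketched outside Spark as a small plain-Python simulation. This is not the ParallelSGD implementation linked in the ticket; it is an illustrative toy on a noiseless 1-D least-squares problem, with all function names and constants (learning rate, batch size, round counts) chosen here for demonstration.

```python
import random

def local_sgd(data, w0, lr, iters, batch):
    # Local mini-batch SGD on one "partition" for 1-D least squares:
    # loss(w) = mean over samples of (w*x - y)^2
    w = w0
    for _ in range(iters):
        sample = random.sample(data, batch)
        grad = sum(2 * (w * x - y) * x for x, y in sample) / batch
        w -= lr * grad
    return w

def parallel_sgd(data, partitions, w0=0.0, lr=0.01, iters=200, batch=4, rounds=3):
    # Outer rounds: reshuffle, run local SGD independently per partition,
    # then average the per-partition weights (the reduce(Average) step).
    w = w0
    data = list(data)
    for _ in range(rounds):
        random.shuffle(data)
        size = len(data) // partitions
        parts = [data[i * size:(i + 1) * size] for i in range(partitions)]
        weights = [local_sgd(p, w, lr, iters, batch) for p in parts]
        w = sum(weights) / len(weights)
    return w

random.seed(0)
# Noiseless data y = 2x, so every partition shares the same minimizer w = 2;
# the averaged result should land close to it.
data = [(i / 10, 2.0 * (i / 10)) for i in range(1, 41)]
w = parallel_sgd(data, partitions=4)
```

Because the toy data is noiseless, each "climber" descends toward the same optimum, which is exactly the condition the comment argues makes averaging safe; with noisy data or non-convex losses the averaged point can be worse than any single partition's solution.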
[jira] [Created] (SPARK-14905) create conda environments w/locked package versions
shane knapp created SPARK-14905: --- Summary: create conda environments w/locked package versions Key: SPARK-14905 URL: https://issues.apache.org/jira/browse/SPARK-14905 Project: Spark Issue Type: Improvement Components: Build Reporter: shane knapp right now, the package dependency story for the jenkins build system is... well... non-existent. packages are installed, and only rarely (if ever) updated. when a new anaconda or system python library is installed or updated for a specific user/build requirement, this can silently update and/or install other packages that may or may not be backwards compatible. we've survived for a number of years so far without dealing with the technical debt, but i don't see how this will remain manageable, especially as spark and other projects hosted on jenkins grow. example: currently, a non-spark amplab project (e-mission) needs scipy updated from 0.15.1 to 0.17.0 for their tests to pass. this simple upgrade adds three new python libraries (libgfortran, mkl, wheel) and updates eleven others (conda, conda-env, numpy, openssl, pip, python, pyyaml, requests, setuptools, sqlite, yaml). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
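One standard way to get the locked environments the ticket asks for is conda's explicit spec-file workflow: export the exact package set (including versions and builds) of a known-good environment, check that file into version control, and recreate identical environments from it. The environment name below is hypothetical; the two conda subcommands are the documented spec-file mechanism.

```shell
# Export the exact package set of an existing environment
# (package names, versions, and build strings) to a spec file.
conda list --explicit > spark-jenkins-locked.txt

# Recreate a byte-identical environment elsewhere from the locked spec;
# no dependency resolution runs, so nothing gets silently upgraded.
conda create --name spark-jenkins --file spark-jenkins-locked.txt
```

Because `conda create --file` with an explicit spec skips the solver entirely, an upgrade like the scipy 0.15.1 → 0.17.0 example would have to be made deliberately, by regenerating and reviewing the spec file, rather than happening as a side effect.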
[jira] [Commented] (SPARK-14870) NPE in generate aggregate
[ https://issues.apache.org/jira/browse/SPARK-14870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15257244#comment-15257244 ] Apache Spark commented on SPARK-14870: -- User 'sameeragarwal' has created a pull request for this issue: https://github.com/apache/spark/pull/12674 > NPE in generate aggregate > - > > Key: SPARK-14870 > URL: https://issues.apache.org/jira/browse/SPARK-14870 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Davies Liu >Assignee: Sameer Agarwal > Fix For: 2.0.0 > > > When ran TPCDS Q14a > {code} > org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in > stage 126.0 failed 1 times, most recent failure: Lost task 0.0 in stage 126.0 > (TID 234, localhost): java.lang.NullPointerException > at > org.apache.spark.sql.execution.vectorized.ColumnVector.putDecimal(ColumnVector.java:576) > at > org.apache.spark.sql.execution.vectorized.ColumnarBatch$Row.setDecimal(ColumnarBatch.java:325) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$7$$anon$1.hasNext(WholeStageCodegenExec.scala:361) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:254) > at > 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1450) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1438) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1437) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1437) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:809) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:809) > at scala.Option.foreach(Option.scala:257) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:809) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1659) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1618) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1607) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1780) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1793) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1806) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1820) > at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:880) > at > 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:357) > at org.apache.spark.rdd.RDD.collect(RDD.scala:879) > at > org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:453) > at > org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply$mcI$sp(Dataset.scala:2367) > at > org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply(Dataset.scala:2367) > at > org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply(Dataset.scala:2367) > at > org.apache.spark.sq
[jira] [Resolved] (SPARK-14888) UnresolvedFunction should use FunctionIdentifier rather than just a string for function name
[ https://issues.apache.org/jira/browse/SPARK-14888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-14888. - Resolution: Fixed Fix Version/s: 2.0.0 > UnresolvedFunction should use FunctionIdentifier rather than just a string > for function name > > > Key: SPARK-14888 > URL: https://issues.apache.org/jira/browse/SPARK-14888 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14511) Publish our forked genjavadoc for 2.12.0-M4 or stop using a forked version
[ https://issues.apache.org/jira/browse/SPARK-14511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15257217#comment-15257217 ] Jakob Odersky commented on SPARK-14511: --- Update: an issue was discovered during release-testing upstream. I just submitted a fix for it, tested against Akka and Spark. Javadoc in Spark emits a few error messages; however, these were already present and do not affect the final generated documentation. I'll report back when the release is out. > Publish our forked genjavadoc for 2.12.0-M4 or stop using a forked version > -- > > Key: SPARK-14511 > URL: https://issues.apache.org/jira/browse/SPARK-14511 > Project: Spark > Issue Type: Sub-task > Components: Build, Project Infra >Reporter: Josh Rosen > > Before we can move to 2.12, we need to publish our forked genjavadoc for > 2.12.0-M4 (or 2.12 final) or stop using a forked version of the plugin. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14555) Python API for methods introduced for Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-14555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15257211#comment-15257211 ] Apache Spark commented on SPARK-14555: -- User 'brkyvz' has created a pull request for this issue: https://github.com/apache/spark/pull/12673 > Python API for methods introduced for Structured Streaming > -- > > Key: SPARK-14555 > URL: https://issues.apache.org/jira/browse/SPARK-14555 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL, Streaming >Reporter: Burak Yavuz >Assignee: Burak Yavuz > Fix For: 2.0.0 > > > Methods added for Structured Streaming don't have a Python API yet. > We need to provide APIs for the new methods in: > - DataFrameReader > - DataFrameWriter > - ContinuousQuery > - Trigger -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8855) Python API for Association Rules
[ https://issues.apache.org/jira/browse/SPARK-8855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15257205#comment-15257205 ] Joseph K. Bradley commented on SPARK-8855: -- This may be significantly easier to add in the DataFrame-based API. I think we should prioritize getting AssociationRules into the DataFrame API, after which it should be much easier to add this Python wrapper. Here's the related issue: [SPARK-14501] > Python API for Association Rules > > > Key: SPARK-8855 > URL: https://issues.apache.org/jira/browse/SPARK-8855 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Feynman Liang >Priority: Minor > > A simple Python wrapper and doctests need to be written for Association > Rules. The relevant method is {{FPGrowthModel.generateAssociationRules}}. The > code will likely live in {{fpm.py}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14904) Add back HiveContext in compatibility package
[ https://issues.apache.org/jira/browse/SPARK-14904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14904: Assignee: Andrew Or (was: Apache Spark) > Add back HiveContext in compatibility package > - > > Key: SPARK-14904 > URL: https://issues.apache.org/jira/browse/SPARK-14904 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or >Assignee: Andrew Or > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14904) Add back HiveContext in compatibility package
[ https://issues.apache.org/jira/browse/SPARK-14904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15257191#comment-15257191 ] Apache Spark commented on SPARK-14904: -- User 'andrewor14' has created a pull request for this issue: https://github.com/apache/spark/pull/12672 > Add back HiveContext in compatibility package > - > > Key: SPARK-14904 > URL: https://issues.apache.org/jira/browse/SPARK-14904 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or >Assignee: Andrew Or > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14904) Add back HiveContext in compatibility package
[ https://issues.apache.org/jira/browse/SPARK-14904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14904: Assignee: Apache Spark (was: Andrew Or) > Add back HiveContext in compatibility package > - > > Key: SPARK-14904 > URL: https://issues.apache.org/jira/browse/SPARK-14904 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14903) Revert: Change MLWritable.write to be a property
[ https://issues.apache.org/jira/browse/SPARK-14903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15257182#comment-15257182 ] Apache Spark commented on SPARK-14903: -- User 'jkbradley' has created a pull request for this issue: https://github.com/apache/spark/pull/12671 > Revert: Change MLWritable.write to be a property > > > Key: SPARK-14903 > URL: https://issues.apache.org/jira/browse/SPARK-14903 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley > > Per discussion in [SPARK-14249], there is not a good way to support .read as > a property. We will therefore revert the change to write() to keep the API > consistent. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14903) Revert: Change MLWritable.write to be a property
[ https://issues.apache.org/jira/browse/SPARK-14903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14903: Assignee: Joseph K. Bradley (was: Apache Spark) > Revert: Change MLWritable.write to be a property > > > Key: SPARK-14903 > URL: https://issues.apache.org/jira/browse/SPARK-14903 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley > > Per discussion in [SPARK-14249], there is not a good way to support .read as > a property. We will therefore revert the change to write() to keep the API > consistent. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14903) Revert: Change MLWritable.write to be a property
[ https://issues.apache.org/jira/browse/SPARK-14903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14903: Assignee: Apache Spark (was: Joseph K. Bradley) > Revert: Change MLWritable.write to be a property > > > Key: SPARK-14903 > URL: https://issues.apache.org/jira/browse/SPARK-14903 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: Joseph K. Bradley >Assignee: Apache Spark > > Per discussion in [SPARK-14249], there is not a good way to support .read as > a property. We will therefore revert the change to write() to keep the API > consistent. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14071) Change MLWritable.write to be a property
[ https://issues.apache.org/jira/browse/SPARK-14071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15257183#comment-15257183 ] Apache Spark commented on SPARK-14071: -- User 'jkbradley' has created a pull request for this issue: https://github.com/apache/spark/pull/12671 > Change MLWritable.write to be a property > > > Key: SPARK-14071 > URL: https://issues.apache.org/jira/browse/SPARK-14071 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Reporter: Joseph K. Bradley >Assignee: Miao Wang >Priority: Trivial > Fix For: 2.0.0 > > > This will match the Scala API + the DataFrame Python API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14904) Add back HiveContext in compatibility package
Andrew Or created SPARK-14904: - Summary: Add back HiveContext in compatibility package Key: SPARK-14904 URL: https://issues.apache.org/jira/browse/SPARK-14904 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.0.0 Reporter: Andrew Or Assignee: Andrew Or -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14828) Start SparkSession in REPL instead of SQLContext
[ https://issues.apache.org/jira/browse/SPARK-14828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-14828: Issue Type: Sub-task (was: Bug) Parent: SPARK-13485 > Start SparkSession in REPL instead of SQLContext > > > Key: SPARK-14828 > URL: https://issues.apache.org/jira/browse/SPARK-14828 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or >Assignee: Andrew Or > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-14828) Start SparkSession in REPL instead of SQLContext
[ https://issues.apache.org/jira/browse/SPARK-14828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-14828. - Resolution: Fixed Fix Version/s: 2.0.0 > Start SparkSession in REPL instead of SQLContext > > > Key: SPARK-14828 > URL: https://issues.apache.org/jira/browse/SPARK-14828 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or >Assignee: Andrew Or > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14894) Python GaussianMixture summary
[ https://issues.apache.org/jira/browse/SPARK-14894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14894: Assignee: Apache Spark > Python GaussianMixture summary > -- > > Key: SPARK-14894 > URL: https://issues.apache.org/jira/browse/SPARK-14894 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Reporter: Joseph K. Bradley >Assignee: Apache Spark >Priority: Minor > > In spark.ml, GaussianMixture includes a result summary. The Python API > should provide the same functionality. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14894) Python GaussianMixture summary
[ https://issues.apache.org/jira/browse/SPARK-14894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15257161#comment-15257161 ] Apache Spark commented on SPARK-14894: -- User 'GayathriMurali' has created a pull request for this issue: https://github.com/apache/spark/pull/12670 > Python GaussianMixture summary > -- > > Key: SPARK-14894 > URL: https://issues.apache.org/jira/browse/SPARK-14894 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Reporter: Joseph K. Bradley >Priority: Minor > > In spark.ml, GaussianMixture includes a result summary. The Python API > should provide the same functionality. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14894) Python GaussianMixture summary
[ https://issues.apache.org/jira/browse/SPARK-14894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14894: Assignee: (was: Apache Spark) > Python GaussianMixture summary > -- > > Key: SPARK-14894 > URL: https://issues.apache.org/jira/browse/SPARK-14894 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Reporter: Joseph K. Bradley >Priority: Minor > > In spark.ml, GaussianMixture includes a result summary. The Python API > should provide the same functionality. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14903) Revert: Change MLWritable.write to be a property
Joseph K. Bradley created SPARK-14903: - Summary: Revert: Change MLWritable.write to be a property Key: SPARK-14903 URL: https://issues.apache.org/jira/browse/SPARK-14903 Project: Spark Issue Type: Improvement Components: ML, PySpark Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14903) Revert: Change MLWritable.write to be a property
[ https://issues.apache.org/jira/browse/SPARK-14903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-14903: -- Description: Per discussion in [SPARK-14249], there is not a good way to support .read as a property. We will therefore revert the change to write() to keep the API consistent. > Revert: Change MLWritable.write to be a property > > > Key: SPARK-14903 > URL: https://issues.apache.org/jira/browse/SPARK-14903 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley > > Per discussion in [SPARK-14249], there is not a good way to support .read as > a property. We will therefore revert the change to write() to keep the API > consistent. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-14249) Change MLReader.read to be a property for PySpark
[ https://issues.apache.org/jira/browse/SPARK-14249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley closed SPARK-14249. - Resolution: Won't Fix > Change MLReader.read to be a property for PySpark > - > > Key: SPARK-14249 > URL: https://issues.apache.org/jira/browse/SPARK-14249 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: Joseph K. Bradley >Priority: Minor > > To match MLWritable.write and SQLContext.read, it will be good to make the > PySpark MLReader classmethod {{read}} be a property. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14249) Change MLReader.read to be a property for PySpark
[ https://issues.apache.org/jira/browse/SPARK-14249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15257146#comment-15257146 ] Joseph K. Bradley commented on SPARK-14249: --- I don't see a good way to do this. The suggestion of {{PipelineModel.read = PipelineModelMLReader(PipelineModel)}} does not actually work since we need access to the JVM in the Reader init method, but it is not yet available since this is outside the constructor. I'm going to close this issue. We'll need to revert the change to write() to keep things consistent. Thanks regardless! > Change MLReader.read to be a property for PySpark > - > > Key: SPARK-14249 > URL: https://issues.apache.org/jira/browse/SPARK-14249 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: Joseph K. Bradley >Priority: Minor > > To match MLWritable.write and SQLContext.read, it will be good to make the > PySpark MLReader classmethod {{read}} be a property. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14902) Expose user-facing RuntimeConfig in SparkSession
[ https://issues.apache.org/jira/browse/SPARK-14902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14902: Assignee: Andrew Or (was: Apache Spark) > Expose user-facing RuntimeConfig in SparkSession > > > Key: SPARK-14902 > URL: https://issues.apache.org/jira/browse/SPARK-14902 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or >Assignee: Andrew Or > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14902) Expose user-facing RuntimeConfig in SparkSession
[ https://issues.apache.org/jira/browse/SPARK-14902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14902: Assignee: Apache Spark (was: Andrew Or) > Expose user-facing RuntimeConfig in SparkSession > > > Key: SPARK-14902 > URL: https://issues.apache.org/jira/browse/SPARK-14902 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14902) Expose user-facing RuntimeConfig in SparkSession
[ https://issues.apache.org/jira/browse/SPARK-14902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15257100#comment-15257100 ] Apache Spark commented on SPARK-14902: -- User 'andrewor14' has created a pull request for this issue: https://github.com/apache/spark/pull/12669 > Expose user-facing RuntimeConfig in SparkSession > > > Key: SPARK-14902 > URL: https://issues.apache.org/jira/browse/SPARK-14902 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or >Assignee: Andrew Or > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14902) Expose user-facing RuntimeConfig in SparkSession
Andrew Or created SPARK-14902: - Summary: Expose user-facing RuntimeConfig in SparkSession Key: SPARK-14902 URL: https://issues.apache.org/jira/browse/SPARK-14902 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.0.0 Reporter: Andrew Or Assignee: Andrew Or -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14313) AFTSurvivalRegression model persistence in SparkR
[ https://issues.apache.org/jira/browse/SPARK-14313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-14313: -- Assignee: Yanbo Liang > AFTSurvivalRegression model persistence in SparkR > - > > Key: SPARK-14313 > URL: https://issues.apache.org/jira/browse/SPARK-14313 > Project: Spark > Issue Type: Sub-task > Components: ML, SparkR >Reporter: Xiangrui Meng >Assignee: Yanbo Liang > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14313) AFTSurvivalRegression model persistence in SparkR
[ https://issues.apache.org/jira/browse/SPARK-14313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-14313: -- Target Version/s: 2.0.0 > AFTSurvivalRegression model persistence in SparkR > - > > Key: SPARK-14313 > URL: https://issues.apache.org/jira/browse/SPARK-14313 > Project: Spark > Issue Type: Sub-task > Components: ML, SparkR >Reporter: Xiangrui Meng >Assignee: Yanbo Liang > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-14312) NaiveBayes model persistence in SparkR
[ https://issues.apache.org/jira/browse/SPARK-14312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-14312. --- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 12573 [https://github.com/apache/spark/pull/12573] > NaiveBayes model persistence in SparkR > -- > > Key: SPARK-14312 > URL: https://issues.apache.org/jira/browse/SPARK-14312 > Project: Spark > Issue Type: Sub-task > Components: ML, SparkR >Reporter: Xiangrui Meng >Assignee: Yanbo Liang > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14900) spark.ml classification metrics should include accuracy
[ https://issues.apache.org/jira/browse/SPARK-14900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15257034#comment-15257034 ] Miao Wang commented on SPARK-14900: --- If no one takes this one, I will work on it. Thanks! Miao > spark.ml classification metrics should include accuracy > --- > > Key: SPARK-14900 > URL: https://issues.apache.org/jira/browse/SPARK-14900 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Joseph K. Bradley >Priority: Minor > > To compute "accuracy" (0/1 classification accuracy), users can use > {{precision}} in MulticlassMetrics and > MulticlassClassificationEvaluator.metricName. We should also support > "accuracy" directly as an alias to help users familiar with that name. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
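The identity the SPARK-14900 description relies on (for single-label multiclass predictions, micro-averaged precision collapses to plain 0/1 accuracy, since every prediction is "positive" for exactly one class) can be checked in a few lines of plain Python. The label vectors below are made up for illustration; this is not Spark's MulticlassMetrics code.

```python
def accuracy(y_true, y_pred):
    # 0/1 classification accuracy: fraction of exact matches.
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def micro_precision(y_true, y_pred):
    # Micro-averaged precision: total true positives over total predicted
    # positives, summed across all classes. In single-label multiclass,
    # total predicted positives == number of examples, so this equals accuracy.
    labels = set(y_true) | set(y_pred)
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    predicted = sum(sum(1 for p in y_pred if p == c) for c in labels)
    return tp / predicted

y_true = [0, 1, 2, 2, 1, 0, 2]
y_pred = [0, 2, 2, 2, 1, 0, 1]
```

This is why exposing "accuracy" as an alias, as the ticket proposes, costs nothing semantically: it names a quantity users are already computing via `precision`.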
[jira] [Commented] (SPARK-14894) Python GaussianMixture summary
[ https://issues.apache.org/jira/browse/SPARK-14894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15257021#comment-15257021 ] Miao Wang commented on SPARK-14894: --- If you have it ready now, please send the pull request. I will help review it. Thanks! Miao > Python GaussianMixture summary > -- > > Key: SPARK-14894 > URL: https://issues.apache.org/jira/browse/SPARK-14894 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Reporter: Joseph K. Bradley >Priority: Minor > > In spark.ml, GaussianMixture includes a result summary. The Python API > should provide the same functionality.
[jira] [Assigned] (SPARK-14853) Support LeftSemi/LeftAnti in SortMergeJoin
[ https://issues.apache.org/jira/browse/SPARK-14853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14853: Assignee: Apache Spark (was: Davies Liu) > Support LeftSemi/LeftAnti in SortMergeJoin > -- > > Key: SPARK-14853 > URL: https://issues.apache.org/jira/browse/SPARK-14853 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Davies Liu >Assignee: Apache Spark >
[jira] [Commented] (SPARK-14853) Support LeftSemi/LeftAnti in SortMergeJoin
[ https://issues.apache.org/jira/browse/SPARK-14853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15257009#comment-15257009 ] Apache Spark commented on SPARK-14853: -- User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/12668 > Support LeftSemi/LeftAnti in SortMergeJoin > -- > > Key: SPARK-14853 > URL: https://issues.apache.org/jira/browse/SPARK-14853 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Davies Liu >Assignee: Davies Liu >
[jira] [Commented] (SPARK-14894) Python GaussianMixture summary
[ https://issues.apache.org/jira/browse/SPARK-14894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15257006#comment-15257006 ] Gayathri Murali commented on SPARK-14894: - [~wangmiao1981] I have a PR ready for this. If you are okay, I can go ahead and submit that. > Python GaussianMixture summary > -- > > Key: SPARK-14894 > URL: https://issues.apache.org/jira/browse/SPARK-14894 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Reporter: Joseph K. Bradley >Priority: Minor > > In spark.ml, GaussianMixture includes a result summary. The Python API > should provide the same functionality.
[jira] [Assigned] (SPARK-14853) Support LeftSemi/LeftAnti in SortMergeJoin
[ https://issues.apache.org/jira/browse/SPARK-14853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14853: Assignee: Davies Liu (was: Apache Spark) > Support LeftSemi/LeftAnti in SortMergeJoin > -- > > Key: SPARK-14853 > URL: https://issues.apache.org/jira/browse/SPARK-14853 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Davies Liu >Assignee: Davies Liu >
[jira] [Commented] (SPARK-14467) Add async io in FileScanRDD
[ https://issues.apache.org/jira/browse/SPARK-14467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15257005#comment-15257005 ] Apache Spark commented on SPARK-14467: -- User 'sameeragarwal' has created a pull request for this issue: https://github.com/apache/spark/pull/12667 > Add async io in FileScanRDD > --- > > Key: SPARK-14467 > URL: https://issues.apache.org/jira/browse/SPARK-14467 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Nong Li > > Experiments running over Parquet data in S3 show poor interleaving of CPU > and IO. We should do more async IO in FileScanRDD to make better use of the machine > resources.
[jira] [Commented] (SPARK-14891) ALS in ML never validates input schema
[ https://issues.apache.org/jira/browse/SPARK-14891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15257004#comment-15257004 ] Joseph K. Bradley commented on SPARK-14891: --- For most use cases, Int should be used to save on memory. Supporting String in the future would be nice but would require internal indexing. I'd say we should validate the input for now and require Int types. Users who need Long can use the ALS.train API. +1 for better docs & data validation. For data validation, it could be nice to accept Long and other types but to make sure that the values are checked before casting to Int types. > ALS in ML never validates input schema > -- > > Key: SPARK-14891 > URL: https://issues.apache.org/jira/browse/SPARK-14891 > Project: Spark > Issue Type: Bug > Components: ML >Reporter: Nick Pentreath > > Currently, {{ALS.fit}} never validates the input schema. There is a > {{transformSchema}} impl that calls {{validateAndTransformSchema}}, but it is > never called in either {{ALS.fit}} or {{ALSModel.transform}}. > This was highlighted in SPARK-13857 (and failing PySpark tests > [here|https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56849/consoleFull]) when > adding a call to {{transformSchema}} in {{ALSModel.transform}} that actually > validates the input schema. The PySpark docstring tests result in Long inputs > by default, which fail validation as Int is required. > Currently, the inputs for user and item ids are cast to Int, with no input > type validation (or warning message). So users could pass in Long, Float, > Double, etc. It's also not made clear anywhere in the docs that only Int > types for user and item are supported. > Enforcing validation seems the best option but might break user code that > previously "just worked", especially in PySpark.
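The "check before casting" idea from the comment above can be sketched in plain Python (hypothetical helper names, not the actual ALS code): wider numeric ids are accepted, but each value must be an exact, in-range 32-bit integer before the cast.

```python
# Hypothetical validation helper: accept Long/Float/Double-style ids, but
# reject anything that a silent cast to a 32-bit Int would corrupt.
INT_MIN, INT_MAX = -(2 ** 31), 2 ** 31 - 1

def checked_int_id(value):
    as_int = int(value)
    if as_int != value:  # e.g. 1.5 -> the fractional part would be dropped
        raise ValueError("id %r is not an integral value" % (value,))
    if not INT_MIN <= as_int <= INT_MAX:  # e.g. a Long outside Int range
        raise ValueError("id %r overflows a 32-bit Int" % (value,))
    return as_int

print(checked_int_id(42.0))  # an exact Double-like value is fine -> 42
```

Values such as `2 ** 33` or `1.5` raise a `ValueError` instead of being silently truncated by the cast.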
[jira] [Commented] (SPARK-14880) Parallel Gradient Descent with less map-reduce shuffle overhead
[ https://issues.apache.org/jira/browse/SPARK-14880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15256996#comment-15256996 ] Joseph K. Bradley commented on SPARK-14880: --- Thanks for this suggestion. To get this feature merged, we would likely need (a) more theoretical evidence supporting the algorithm and (b) significant performance testing to demonstrate the improvements. For (a), as I recall, the Zinkevich work requires that the loss be smooth, which would rule out support for L1 regularization. Also, has the higher level iteration been analyzed to prove its effect on convergence? This could be a good algorithm to post as a Spark package. Would you be interested in doing that? I'm going to close this issue for now, but discussion can continue on the closed JIRA. > Parallel Gradient Descent with less map-reduce shuffle overhead > --- > > Key: SPARK-14880 > URL: https://issues.apache.org/jira/browse/SPARK-14880 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Ahmed Mahran > Labels: performance > > The current implementation of (Stochastic) Gradient Descent performs one > map-reduce shuffle per iteration. Moreover, when the sampling fraction gets > smaller, the algorithm becomes shuffle-bound instead of CPU-bound. > {code} > (1 to numIterations or convergence) { > rdd > .sample(fraction) > .map(Gradient) > .reduce(Update) > } > {code} > A more performant variation requires only one map-reduce regardless of the > number of iterations. A local mini-batch SGD could be run on each partition, > then the results could be averaged. This is based on (Zinkevich, Martin, > Markus Weimer, Lihong Li, and Alex J. Smola. "Parallelized stochastic > gradient descent." In Advances in neural information processing systems, > 2010, > http://www.research.rutgers.edu/~lihong/pub/Zinkevich11Parallelized.pdf).
> {code} > rdd > .shuffle() > .mapPartitions((1 to numIterations or convergence) { >iter.sample(fraction).map(Gradient).reduce(Update) > }) > .reduce(Average) > {code} > A higher level iteration could enclose the above variation; shuffling the > data before the local mini-batches and feeding back the average weights from > the last iteration. This allows more variability in the sampling of the > mini-batches with the possibility to cover the whole dataset. Here is a Spark > based implementation > https://github.com/mashin-io/rich-spark/blob/master/src/main/scala/org/apache/spark/mllib/optimization/ParallelSGD.scala > {code} > (1 to numIterations1 or convergence) { > rdd > .shuffle() > .mapPartitions((1 to numIterations2 or convergence) { > iter.sample(fraction).map(Gradient).reduce(Update) > }) > .reduce(Average) > } > {code}
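The proposed scheme can be simulated in plain Python (a sketch with hypothetical names, not the MLlib implementation): each "partition" runs local mini-batch SGD independently, and a single final average replaces the per-iteration shuffle. Here the loss is a simple mean squared error, whose minimizer is the global mean of the data.

```python
import random

# Sketch: local mini-batch SGD per partition, then one averaging reduce.
def local_sgd(partition, w, iterations, fraction, lr=0.1):
    for _ in range(iterations):
        # sample a mini-batch from this partition only (no shuffle needed)
        batch = random.sample(partition, max(1, int(len(partition) * fraction)))
        # gradient of mean (w - x)^2 over the batch
        grad = sum(2 * (w - x) for x in batch) / len(batch)
        w -= lr * grad
    return w

random.seed(0)
data = [float(x) for x in range(100)]        # global minimizer: mean = 49.5
partitions = [data[i::4] for i in range(4)]  # simulate 4 RDD partitions
weights = [local_sgd(p, w=0.0, iterations=200, fraction=0.2) for p in partitions]
w_avg = sum(weights) / len(weights)          # the single "reduce(Average)"
print(round(w_avg, 1))                       # lands near the global mean
```

As the quoted comment notes, the averaged result is only known to converge under smoothness assumptions on the loss; this sketch merely shows the communication pattern.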
[jira] [Resolved] (SPARK-13739) Predicate Push Down Through Window Operator
[ https://issues.apache.org/jira/browse/SPARK-13739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell resolved SPARK-13739. --- Resolution: Fixed Assignee: Xiao Li Target Version/s: 2.0.0 > Predicate Push Down Through Window Operator > --- > > Key: SPARK-13739 > URL: https://issues.apache.org/jira/browse/SPARK-13739 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li >Assignee: Xiao Li > > Push down the predicate through the Window operator. > In this JIRA, predicates are pushed through Window if and only if the > following conditions are satisfied: > - Predicate involves one and only one column that is part of window > partitioning key > - Window partitioning key is just a sequence of attributeReferences. (i.e., > none of them is an expression) > - Predicate must be deterministic
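The three conditions above can be stated as a tiny eligibility predicate (a plain-Python sketch with hypothetical names; the actual rule lives in Catalyst's optimizer):

```python
# Sketch of the eligibility test for pushing a filter below a Window operator.
def can_push_through_window(predicate_cols, partition_keys,
                            keys_are_plain_attrs, predicate_is_deterministic):
    """True iff the filter may be pushed below the Window."""
    return (
        predicate_is_deterministic               # predicate must be deterministic
        and keys_are_plain_attrs                 # keys are plain attribute references
        and len(predicate_cols) == 1             # predicate touches exactly one column...
        and predicate_cols[0] in partition_keys  # ...which is a partitioning key
    )

# A deterministic filter on the partitioning column qualifies:
print(can_push_through_window(["dept"], ["dept"], True, True))    # True
# A filter on a non-key column (e.g. the windowed value) does not:
print(can_push_through_window(["salary"], ["dept"], True, True))  # False
```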
[jira] [Closed] (SPARK-14880) Parallel Gradient Descent with less map-reduce shuffle overhead
[ https://issues.apache.org/jira/browse/SPARK-14880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley closed SPARK-14880. - Resolution: Won't Fix > Parallel Gradient Descent with less map-reduce shuffle overhead > --- > > Key: SPARK-14880 > URL: https://issues.apache.org/jira/browse/SPARK-14880 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Ahmed Mahran > Labels: performance > > The current implementation of (Stochastic) Gradient Descent performs one > map-reduce shuffle per iteration. Moreover, when the sampling fraction gets > smaller, the algorithm becomes shuffle-bound instead of CPU-bound. > {code} > (1 to numIterations or convergence) { > rdd > .sample(fraction) > .map(Gradient) > .reduce(Update) > } > {code} > A more performant variation requires only one map-reduce regardless of the > number of iterations. A local mini-batch SGD could be run on each partition, > then the results could be averaged. This is based on (Zinkevich, Martin, > Markus Weimer, Lihong Li, and Alex J. Smola. "Parallelized stochastic > gradient descent." In Advances in neural information processing systems, > 2010, > http://www.research.rutgers.edu/~lihong/pub/Zinkevich11Parallelized.pdf). > {code} > rdd > .shuffle() > .mapPartitions((1 to numIterations or convergence) { >iter.sample(fraction).map(Gradient).reduce(Update) > }) > .reduce(Average) > {code} > A higher level iteration could enclose the above variation; shuffling the > data before the local mini-batches and feeding back the average weights from > the last iteration. This allows more variability in the sampling of the > mini-batches with the possibility to cover the whole dataset.
Here is a Spark > based implementation > https://github.com/mashin-io/rich-spark/blob/master/src/main/scala/org/apache/spark/mllib/optimization/ParallelSGD.scala > {code} > (1 to numIterations1 or convergence) { > rdd > .shuffle() > .mapPartitions((1 to numIterations2 or convergence) { > iter.sample(fraction).map(Gradient).reduce(Update) > }) > .reduce(Average) > } > {code}
[jira] [Updated] (SPARK-14844) KMeansModel in spark.ml should allow to change featureCol and predictionCol
[ https://issues.apache.org/jira/browse/SPARK-14844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-14844: -- Priority: Trivial (was: Major) > KMeansModel in spark.ml should allow to change featureCol and predictionCol > --- > > Key: SPARK-14844 > URL: https://issues.apache.org/jira/browse/SPARK-14844 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 1.6.0, 1.6.1 >Reporter: Dominik Jastrzębski >Priority: Trivial > > We need to add setFeaturesCol, setPredictionCol methods in > org.apache.spark.ml.clustering.KMeansModel. > This will make it possible to: > * transform a DataFrame with a different feature column name than in the > DataFrame the model was fitted on. > * create a prediction column with a name other than the name that was set > during model fitting.
[jira] [Commented] (SPARK-14831) Make ML APIs in SparkR consistent
[ https://issues.apache.org/jira/browse/SPARK-14831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15256983#comment-15256983 ] Joseph K. Bradley commented on SPARK-14831: --- 2. {{spark.glm}}, etc. SGTM. For save/load, I'd prefer either {{spark.save/load}} (if that works for DataFrames too), or {{read.ml}} (rather than {{read.model}} since that leaves open the possibility of supporting Estimators and Pipelines in R someday). > Make ML APIs in SparkR consistent > - > > Key: SPARK-14831 > URL: https://issues.apache.org/jira/browse/SPARK-14831 > Project: Spark > Issue Type: Improvement > Components: ML, SparkR >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Critical > > In current master, we have 4 ML methods in SparkR: > {code:none} > glm(formula, family, data, ...) > kmeans(data, centers, ...) > naiveBayes(formula, data, ...) > survreg(formula, data, ...) > {code} > We tried to keep the signatures similar to existing ones in R. However, if we > put them together, they are not consistent. One example is k-means, which > doesn't accept a formula. Instead of looking at each method independently, we > might want to update the signature of kmeans to > {code:none} > kmeans(formula, data, centers, ...) > {code} > We can also discuss possible global changes here. For example, `glm` puts > `family` before `data` while `kmeans` puts `centers` after `data`. This is > not consistent. And logically, the formula doesn't mean anything without > associating with a DataFrame. So it makes more sense to me to have the > following signature: > {code:none} > algorithm(df, formula, [required params], [optional params]) > {code} > If we make this change, we might want to avoid name collisions because they > have different signatures. We can use `ml.kmeans`, `ml.glm`, etc. > Sorry for discussing API changes at the last minute. But I think it would be > better to have consistent signatures in SparkR.
> cc: [~shivaram] [~josephkb] [~yanboliang]
[jira] [Resolved] (SPARK-14721) Remove the HiveContext class
[ https://issues.apache.org/jira/browse/SPARK-14721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-14721. --- Resolution: Fixed Fix Version/s: 2.0.0 > Remove the HiveContext class > > > Key: SPARK-14721 > URL: https://issues.apache.org/jira/browse/SPARK-14721 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or >Assignee: Andrew Or > Fix For: 2.0.0 > >
[jira] [Created] (SPARK-14901) java exception when showing join
Brent Elmer created SPARK-14901: --- Summary: java exception when showing join Key: SPARK-14901 URL: https://issues.apache.org/jira/browse/SPARK-14901 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.6.1 Reporter: Brent Elmer I am using pyspark with netezza. I am getting a java exception when trying to show the first row of a join. I can show the first row of each of the two dataframes separately but not the result of a join. I get the same error for any action I take (first, collect, show). Am I doing something wrong? from pyspark.sql import SQLContext sqlContext = SQLContext(sc) dispute_df = sqlContext.read.format('com.ibm.spark.netezza').options(url='jdbc:netezza://***:5480/db', user='***', password='***', dbtable='table1', driver='com.ibm.spark.netezza').load() dispute_df.printSchema() comments_df = sqlContext.read.format('com.ibm.spark.netezza').options(url='jdbc:netezza://***:5480/db', user='***', password='***', dbtable='table2', driver='com.ibm.spark.netezza').load() comments_df.printSchema() dispute_df.join(comments_df, dispute_df.COMMENTID == comments_df.COMMENTID).first() root |-- COMMENTID: string (nullable = true) |-- EXPORTDATETIME: timestamp (nullable = true) |-- ARTAGS: string (nullable = true) |-- POTAGS: string (nullable = true) |-- INVTAG: string (nullable = true) |-- ACTIONTAG: string (nullable = true) |-- DISPUTEFLAG: string (nullable = true) |-- ACTIONFLAG: string (nullable = true) |-- CUSTOMFLAG1: string (nullable = true) |-- CUSTOMFLAG2: string (nullable = true) root |-- COUNTRY: string (nullable = true) |-- CUSTOMER: string (nullable = true) |-- INVNUMBER: string (nullable = true) |-- INVSEQNUMBER: string (nullable = true) |-- LEDGERCODE: string (nullable = true) |-- COMMENTTEXT: string (nullable = true) |-- COMMENTTIMESTAMP: timestamp (nullable = true) |-- COMMENTLENGTH: long (nullable = true) |-- FREEINDEX: long (nullable = true) |-- COMPLETEDFLAG: long (nullable = true) |-- ACTIONFLAG: long (nullable = true) |-- 
FREETEXT: string (nullable = true) |-- USERNAME: string (nullable = true) |-- ACTION: string (nullable = true) |-- COMMENTID: string (nullable = true) --- Py4JJavaError Traceback (most recent call last) in () 5 comments_df = sqlContext.read.format('com.ibm.spark.netezza').options(url='jdbc:netezza://***:5480/db', user='***', password='***', dbtable='table2', driver='com.ibm.spark.netezza').load() 6 comments_df.printSchema() > 7 dispute_df.join(comments_df, dispute_df.COMMENTID == comments_df.COMMENTID).first() /usr/local/src/spark/spark-1.6.1-bin-hadoop2.6/python/pyspark/sql/dataframe.pyc in first(self) 802 Row(age=2, name=u'Alice') 803 """ --> 804 return self.head() 805 806 @ignore_unicode_prefix /usr/local/src/spark/spark-1.6.1-bin-hadoop2.6/python/pyspark/sql/dataframe.pyc in head(self, n) 790 """ 791 if n is None: --> 792 rs = self.head(1) 793 return rs[0] if rs else None 794 return self.take(n) /usr/local/src/spark/spark-1.6.1-bin-hadoop2.6/python/pyspark/sql/dataframe.pyc in head(self, n) 792 rs = self.head(1) 793 return rs[0] if rs else None --> 794 return self.take(n) 795 796 @ignore_unicode_prefix /usr/local/src/spark/spark-1.6.1-bin-hadoop2.6/python/pyspark/sql/dataframe.pyc in take(self, num) 304 with SCCallSiteSync(self._sc) as css: 305 port = self._sc._jvm.org.apache.spark.sql.execution.EvaluatePython.takeAndServe( --> 306 self._jdf, num) 307 return list(_load_from_socket(port, BatchedSerializer(PickleSerializer( 308 /usr/local/src/spark/spark-1.6.1-bin-hadoop2.6/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py in __call__(self, *args) 811 answer = self.gateway_client.send_command(command) 812 return_value = get_return_value( --> 813 answer, self.gateway_client, self.target_id, self.name) 814 815 for temp_arg in temp_args: /usr/local/src/spark/spark-1.6.1-bin-hadoop2.6/python/pyspark/sql/utils.pyc in deco(*a, **kw) 43 def deco(*a, **kw): 44 try: ---> 45 return f(*a, **kw) 46 except py4j.protocol.Py4JJavaError as e: 47 s = e.java_exception.toString() 
/usr/local/src/spark/spark-1.6.1-bin-hadoop2.6/python/lib/py4j-0.9-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name) 306 raise Py4JJavaError( 307 "An error occurred while calling {0}{1}{2}.\n". --> 308 format(target_id, ".", name), value) 309