[jira] [Assigned] (SPARK-14912) Propagate data source options to Hadoop configurations

2016-04-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14912:


Assignee: Reynold Xin  (was: Apache Spark)

> Propagate data source options to Hadoop configurations
> --
>
> Key: SPARK-14912
> URL: https://issues.apache.org/jira/browse/SPARK-14912
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> We currently have no way for users to propagate options to the underlying 
> libraries that rely on Hadoop configurations. For example, there are various 
> options in parquet-mr that users might want to set, but the data source API 
> does not expose a per-job way to set them.
> This patch also propagates the user-specified options into the Hadoop 
> Configuration.






[jira] [Assigned] (SPARK-14912) Propagate data source options to Hadoop configurations

2016-04-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14912:


Assignee: Apache Spark  (was: Reynold Xin)

> Propagate data source options to Hadoop configurations
> --
>
> Key: SPARK-14912
> URL: https://issues.apache.org/jira/browse/SPARK-14912
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>
> We currently have no way for users to propagate options to the underlying 
> libraries that rely on Hadoop configurations. For example, there are various 
> options in parquet-mr that users might want to set, but the data source API 
> does not expose a per-job way to set them.
> This patch also propagates the user-specified options into the Hadoop 
> Configuration.






[jira] [Commented] (SPARK-14912) Propagate data source options to Hadoop configurations

2016-04-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15257589#comment-15257589
 ] 

Apache Spark commented on SPARK-14912:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/12688

> Propagate data source options to Hadoop configurations
> --
>
> Key: SPARK-14912
> URL: https://issues.apache.org/jira/browse/SPARK-14912
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> We currently have no way for users to propagate options to the underlying 
> libraries that rely on Hadoop configurations. For example, there are various 
> options in parquet-mr that users might want to set, but the data source API 
> does not expose a per-job way to set them.
> This patch also propagates the user-specified options into the Hadoop 
> Configuration.






[jira] [Created] (SPARK-14912) Propagate data source options to Hadoop configurations

2016-04-25 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-14912:
---

 Summary: Propagate data source options to Hadoop configurations
 Key: SPARK-14912
 URL: https://issues.apache.org/jira/browse/SPARK-14912
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin


We currently have no way for users to propagate options to the underlying 
libraries that rely on Hadoop configurations. For example, there are various 
options in parquet-mr that users might want to set, but the data source API does 
not expose a per-job way to set them.

This patch also propagates the user-specified options into the Hadoop Configuration.
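To make the intent concrete, here is a minimal sketch of the behavior being 
proposed, assuming a SQLContext named sqlContext is in scope; the option key and 
path are only illustrative examples, not part of the patch itself:

{code}
import org.apache.hadoop.conf.Configuration

// Per-job options supplied through the data source API; "parquet.block.size"
// is just one example of a parquet-mr setting a user might want to tune.
val options = Map("parquet.block.size" -> (64 * 1024 * 1024).toString)

// What the proposed change would do on the implementation side: copy the
// user-specified options into the job's Hadoop Configuration.
val hadoopConf = new Configuration(sqlContext.sparkContext.hadoopConfiguration)
options.foreach { case (k, v) => hadoopConf.set(k, v) }

// User-facing side: the same options passed per read through the data source API.
val df = sqlContext.read.options(options).parquet("/path/to/data")
{code}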







[jira] [Commented] (SPARK-14313) AFTSurvivalRegression model persistence in SparkR

2016-04-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15257563#comment-15257563
 ] 

Apache Spark commented on SPARK-14313:
--

User 'yanboliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/12685

> AFTSurvivalRegression model persistence in SparkR
> -
>
> Key: SPARK-14313
> URL: https://issues.apache.org/jira/browse/SPARK-14313
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SparkR
>Reporter: Xiangrui Meng
>Assignee: Yanbo Liang
>







[jira] [Assigned] (SPARK-14313) AFTSurvivalRegression model persistence in SparkR

2016-04-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14313:


Assignee: Yanbo Liang  (was: Apache Spark)

> AFTSurvivalRegression model persistence in SparkR
> -
>
> Key: SPARK-14313
> URL: https://issues.apache.org/jira/browse/SPARK-14313
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SparkR
>Reporter: Xiangrui Meng
>Assignee: Yanbo Liang
>







[jira] [Assigned] (SPARK-14313) AFTSurvivalRegression model persistence in SparkR

2016-04-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14313:


Assignee: Apache Spark  (was: Yanbo Liang)

> AFTSurvivalRegression model persistence in SparkR
> -
>
> Key: SPARK-14313
> URL: https://issues.apache.org/jira/browse/SPARK-14313
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SparkR
>Reporter: Xiangrui Meng
>Assignee: Apache Spark
>







[jira] [Updated] (SPARK-14874) Remove the obsolete Batch representation

2016-04-25 Thread Liwei Lin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liwei Lin updated SPARK-14874:
--
Description: 
The Batch class, which had been used to indicate progress in a stream, was 
abandoned by SPARK-13985 and then became useless.

Let's:
- remove the Batch class
- -rename getBatch(...) to getData(...) for Source- (update: as discussed in 
the PR, this is not necessary)
- -rename addBatch(...) to addData(...) for Sink- (update: as discussed in the 
PR, this is not necessary)

  was:
The Batch class, which had been used to indicate progress in a stream, was 
abandoned by SPARK-13985 and then became useless.

Let's:
- remove the Batch class
- rename getBatch(...) to getData(...) for Source
- rename addBatch(...) to addData(...) for Sink


> Remove the obsolete Batch representation
> 
>
> Key: SPARK-14874
> URL: https://issues.apache.org/jira/browse/SPARK-14874
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Liwei Lin
>Priority: Minor
>
> The Batch class, which had been used to indicate progress in a stream, was 
> abandoned by SPARK-13985 and then became useless.
> Let's:
> - remove the Batch class
> - -rename getBatch(...) to getData(...) for Source- (update: as discussed in 
> the PR, this is not necessary)
> - -rename addBatch(...) to addData(...) for Sink- (update: as discussed in 
> the PR, this is not necessary)






[jira] [Closed] (SPARK-14806) Alias original Hive options in Spark SQL conf

2016-04-25 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin closed SPARK-14806.
---
Resolution: Won't Fix

> Alias original Hive options in Spark SQL conf
> -
>
> Key: SPARK-14806
> URL: https://issues.apache.org/jira/browse/SPARK-14806
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> There are a couple of options we should alias: spark.sql.variable.substitute and 
> spark.sql.variable.substitute.depth.
> The Hive config options are hive.variable.substitute and 
> hive.variable.substitute.depth
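
For reference, a minimal sketch of the Spark-side keys named above; the values 
are illustrative, and the Hive names were only the proposed aliases (never added, 
since this issue was closed as Won't Fix):

{code}
// Existing Spark SQL keys; values here are just examples.
sqlContext.setConf("spark.sql.variable.substitute", "true")
sqlContext.setConf("spark.sql.variable.substitute.depth", "40")
// Proposed (not added) aliases: hive.variable.substitute, hive.variable.substitute.depth
{code}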






[jira] [Assigned] (SPARK-14315) GLMs model persistence in SparkR

2016-04-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14315:


Assignee: (was: Apache Spark)

> GLMs model persistence in SparkR
> 
>
> Key: SPARK-14315
> URL: https://issues.apache.org/jira/browse/SPARK-14315
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SparkR
>Reporter: Xiangrui Meng
>







[jira] [Commented] (SPARK-14315) GLMs model persistence in SparkR

2016-04-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15257532#comment-15257532
 ] 

Apache Spark commented on SPARK-14315:
--

User 'GayathriMurali' has created a pull request for this issue:
https://github.com/apache/spark/pull/12683

> GLMs model persistence in SparkR
> 
>
> Key: SPARK-14315
> URL: https://issues.apache.org/jira/browse/SPARK-14315
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SparkR
>Reporter: Xiangrui Meng
>







[jira] [Assigned] (SPARK-14315) GLMs model persistence in SparkR

2016-04-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14315:


Assignee: Apache Spark

> GLMs model persistence in SparkR
> 
>
> Key: SPARK-14315
> URL: https://issues.apache.org/jira/browse/SPARK-14315
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SparkR
>Reporter: Xiangrui Meng
>Assignee: Apache Spark
>







[jira] [Resolved] (SPARK-14861) Replace internal usages of SQLContext with SparkSession

2016-04-25 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-14861.
-
   Resolution: Fixed
Fix Version/s: 2.0.0

> Replace internal usages of SQLContext with SparkSession
> ---
>
> Key: SPARK-14861
> URL: https://issues.apache.org/jira/browse/SPARK-14861
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Andrew Or
>Assignee: Andrew Or
> Fix For: 2.0.0
>
>
> We should try to use SparkSession (the new thing) in as many places as 
> possible. We should be careful not to break the public datasource API though.






[jira] [Commented] (SPARK-14904) Add back HiveContext in compatibility package

2016-04-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15257521#comment-15257521
 ] 

Apache Spark commented on SPARK-14904:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/12682

> Add back HiveContext in compatibility package
> -
>
> Key: SPARK-14904
> URL: https://issues.apache.org/jira/browse/SPARK-14904
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Andrew Or
>Assignee: Andrew Or
> Fix For: 2.0.0
>
>







[jira] [Resolved] (SPARK-14904) Add back HiveContext in compatibility package

2016-04-25 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-14904.
-
   Resolution: Fixed
Fix Version/s: 2.0.0

> Add back HiveContext in compatibility package
> -
>
> Key: SPARK-14904
> URL: https://issues.apache.org/jira/browse/SPARK-14904
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Andrew Or
>Assignee: Andrew Or
> Fix For: 2.0.0
>
>







[jira] [Commented] (SPARK-14911) Fix a potential data race in TaskMemoryManager

2016-04-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15257497#comment-15257497
 ] 

Apache Spark commented on SPARK-14911:
--

User 'lw-lin' has created a pull request for this issue:
https://github.com/apache/spark/pull/12681

> Fix a potential data race in TaskMemoryManager
> --
>
> Key: SPARK-14911
> URL: https://issues.apache.org/jira/browse/SPARK-14911
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Liwei Lin
>Priority: Minor
>
> SPARK-13210 introduced an `acquiredButNotUsed` field, but it might not be 
> correctly synchronized:
> - the write `acquiredButNotUsed += acquired` is guarded by `this` lock (see 
> [here|https://github.com/apache/spark/blame/master/core/src/main/java/org/apache/spark/memory/TaskMemoryManager.java#L271]);
> - the read `memoryManager.releaseExecutionMemory(acquiredButNotUsed, 
> taskAttemptId, tungstenMemoryMode)` (see 
> [here|https://github.com/apache/spark/blame/master/core/src/main/java/org/apache/spark/memory/TaskMemoryManager.java#L400])
>  might not be correctly synchronized, and might not see 
> the newly written value of `acquiredButNotUsed`.






[jira] [Assigned] (SPARK-14911) Fix a potential data race in TaskMemoryManager

2016-04-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14911:


Assignee: (was: Apache Spark)

> Fix a potential data race in TaskMemoryManager
> --
>
> Key: SPARK-14911
> URL: https://issues.apache.org/jira/browse/SPARK-14911
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Liwei Lin
>Priority: Minor
>
> SPARK-13210 introduced an `acquiredButNotUsed` field, but it might not be 
> correctly synchronized:
> - the write `acquiredButNotUsed += acquired` is guarded by `this` lock (see 
> [here|https://github.com/apache/spark/blame/master/core/src/main/java/org/apache/spark/memory/TaskMemoryManager.java#L271]);
> - the read `memoryManager.releaseExecutionMemory(acquiredButNotUsed, 
> taskAttemptId, tungstenMemoryMode)` (see 
> [here|https://github.com/apache/spark/blame/master/core/src/main/java/org/apache/spark/memory/TaskMemoryManager.java#L400])
>  might not be correctly synchronized, and might not see 
> the newly written value of `acquiredButNotUsed`.






[jira] [Assigned] (SPARK-14911) Fix a potential data race in TaskMemoryManager

2016-04-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14911:


Assignee: Apache Spark

> Fix a potential data race in TaskMemoryManager
> --
>
> Key: SPARK-14911
> URL: https://issues.apache.org/jira/browse/SPARK-14911
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Liwei Lin
>Assignee: Apache Spark
>Priority: Minor
>
> SPARK-13210 introduced an `acquiredButNotUsed` field, but it might not be 
> correctly synchronized:
> - the write `acquiredButNotUsed += acquired` is guarded by `this` lock (see 
> [here|https://github.com/apache/spark/blame/master/core/src/main/java/org/apache/spark/memory/TaskMemoryManager.java#L271]);
> - the read `memoryManager.releaseExecutionMemory(acquiredButNotUsed, 
> taskAttemptId, tungstenMemoryMode)` (see 
> [here|https://github.com/apache/spark/blame/master/core/src/main/java/org/apache/spark/memory/TaskMemoryManager.java#L400])
>  might not be correctly synchronized, and might not see 
> the newly written value of `acquiredButNotUsed`.






[jira] [Commented] (SPARK-13902) Make DAGScheduler.getAncestorShuffleDependencies() return in topological order to ensure building ancestor stages first.

2016-04-25 Thread Takuya Ueshin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15257496#comment-15257496
 ] 

Takuya Ueshin commented on SPARK-13902:
---

Yes, exactly.

> Make DAGScheduler.getAncestorShuffleDependencies() return in topological 
> order to ensure building ancestor stages first.
> 
>
> Key: SPARK-13902
> URL: https://issues.apache.org/jira/browse/SPARK-13902
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Reporter: Takuya Ueshin
>
> {{DAGScheduler}} sometimes generates an incorrect stage graph.
> Some stages are generated for the same shuffleId twice or more and they are 
> referenced by the child stages because the building order of the graph is not 
> correct.
> Here, we submit an RDD\[F\] having a lineage of RDDs as follows (please see 
> this in {{monospaced}} font):
> {noformat}
>        <--------------------------------+
>       /                                  \
> [A] <--(1)-- [B] <--(2)-- [C] <--(3)-- [D] <--(4)-- [E] <--(5)-- [F]
>               \                                      /
>                <------------------------------------+
> {noformat}
> Note: \[\] means an RDD, () means a shuffle dependency.
> {{DAGScheduler}} generates the following stages and their parents for each 
> shuffle:
> |  | stage | parents |
> | (1) | ShuffleMapStage 2 | List() |
> | (2) | ShuffleMapStage 1 | List(ShuffleMapStage 0) |
> | (3) | ShuffleMapStage 3 | List(ShuffleMapStage 1) |
> | (4) | ShuffleMapStage 4 | List(ShuffleMapStage 2, ShuffleMapStage 3) |
> | (5) | ShuffleMapStage 5 | List(ShuffleMapStage 1, ShuffleMapStage 4) |
> | \- | ResultStage 6 | List(ShuffleMapStage 5) |
> The stage for shuffle id {{0}} should be {{ShuffleMapStage 0}}, but the stage 
> for shuffle id {{0}} is generated twice as {{ShuffleMapStage 2}} and 
> {{ShuffleMapStage 0}} is overwritten by {{ShuffleMapStage 2}}, and the stage 
> {{ShuffleMapStage 1}} keeps referring to the _old_ stage {{ShuffleMapStage 0}}.






[jira] [Created] (SPARK-14911) Fix a potential data race in TaskMemoryManager

2016-04-25 Thread Liwei Lin (JIRA)
Liwei Lin created SPARK-14911:
-

 Summary: Fix a potential data race in TaskMemoryManager
 Key: SPARK-14911
 URL: https://issues.apache.org/jira/browse/SPARK-14911
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.0.0
Reporter: Liwei Lin
Priority: Minor


SPARK-13210 introduced an `acquiredButNotUsed` field, but it might not be 
correctly synchronized:
- the write `acquiredButNotUsed += acquired` is guarded by `this` lock (see 
[here|https://github.com/apache/spark/blame/master/core/src/main/java/org/apache/spark/memory/TaskMemoryManager.java#L271]);
- the read `memoryManager.releaseExecutionMemory(acquiredButNotUsed, 
taskAttemptId, tungstenMemoryMode)` (see 
[here|https://github.com/apache/spark/blame/master/core/src/main/java/org/apache/spark/memory/TaskMemoryManager.java#L400])
 might not be correctly synchronized, and might not see the newly written value 
of `acquiredButNotUsed`.
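
A minimal sketch of the pattern at issue, not the actual TaskMemoryManager code: 
the conventional fix is to guard the read with the same lock that already guards 
the write, so the reading thread is guaranteed to observe the latest value.

{code}
// Illustrative only; field and method names are stand-ins.
class MemoryBookkeeping {
  private var acquiredButNotUsed: Long = 0L

  // Write path: guarded by `this`, as in the report above.
  def recordAcquired(acquired: Long): Unit = this.synchronized {
    acquiredButNotUsed += acquired
  }

  // Read path: without taking the same lock, this thread may not observe the
  // write above; synchronizing establishes the needed happens-before edge.
  def releaseUnused(release: Long => Unit): Unit = this.synchronized {
    release(acquiredButNotUsed)
    acquiredButNotUsed = 0L
  }
}
{code}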






[jira] [Updated] (SPARK-13902) Make DAGScheduler.getAncestorShuffleDependencies() return in topological order to ensure building ancestor stages first.

2016-04-25 Thread Takuya Ueshin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin updated SPARK-13902:
--
Description: 
{{DAGScheduler}} sometimes generates an incorrect stage graph.
Some stages are generated for the same shuffleId twice or more and they are 
referenced by the child stages because the building order of the graph is not 
correct.

Here, we submit an RDD\[F\] having a lineage of RDDs as follows (please see this 
in {{monospaced}} font):

{noformat}
       <--------------------------------+
      /                                  \
[A] <--(1)-- [B] <--(2)-- [C] <--(3)-- [D] <--(4)-- [E] <--(5)-- [F]
              \                                      /
               <------------------------------------+
{noformat}

Note: \[\] means an RDD, () means a shuffle dependency.

{{DAGScheduler}} generates the following stages and their parents for each 
shuffle:

|  | stage | parents |
| (1) | ShuffleMapStage 2 | List() |
| (2) | ShuffleMapStage 1 | List(ShuffleMapStage 0) |
| (3) | ShuffleMapStage 3 | List(ShuffleMapStage 1) |
| (4) | ShuffleMapStage 4 | List(ShuffleMapStage 2, ShuffleMapStage 3) |
| (5) | ShuffleMapStage 5 | List(ShuffleMapStage 1, ShuffleMapStage 4) |
| \- | ResultStage 6 | List(ShuffleMapStage 5) |

The stage for shuffle id {{0}} should be {{ShuffleMapStage 0}}, but the stage 
for shuffle id {{0}} is generated twice as {{ShuffleMapStage 2}} and 
{{ShuffleMapStage 0}} is overwritten by {{ShuffleMapStage 2}}, and the stage 
{{ShuffleMapStage 1}} keeps referring to the _old_ stage {{ShuffleMapStage 0}}.


  was:
{{DAGScheduler}} sometimes generates an incorrect stage graph.
Some stages are generated for the same shuffleId twice or more and they are 
referenced by the child stages because the building order of the graph is not 
correct.

Here, we submit an RDD\[F\] having a lineage of RDDs as follows (please see this 
in {{monospaced}} font):

{noformat}
       <--------------------------------+
      /                                  \
[A] <--(1)-- [B] <--(2)-- [C] <--(3)-- [D] <--(4)-- [E] <--(5)-- [F]
              \                                      /
               <------------------------------------+
{noformat}

{{DAGScheduler}} generates the following stages and their parents for each 
shuffle id:

| shuffle id | stage | parents |
| 0 | ShuffleMapStage 2 | List() |
| 1 | ShuffleMapStage 1 | List(ShuffleMapStage 0) |
| 2 | ShuffleMapStage 3 | List(ShuffleMapStage 1) |
| 3 | ShuffleMapStage 4 | List(ShuffleMapStage 2, ShuffleMapStage 3) |
| 4 | ShuffleMapStage 5 | List(ShuffleMapStage 1, ShuffleMapStage 4) |
| \- | ResultStage 6 | List(ShuffleMapStage 5) |

The stage for shuffle id {{0}} should be {{ShuffleMapStage 0}}, but the stage 
for shuffle id {{0}} is generated twice as {{ShuffleMapStage 2}} and 
{{ShuffleMapStage 0}} is overwritten by {{ShuffleMapStage 2}}, and the stage 
{{ShuffleMapStage 1}} keeps referring to the _old_ stage {{ShuffleMapStage 0}}.



> Make DAGScheduler.getAncestorShuffleDependencies() return in topological 
> order to ensure building ancestor stages first.
> 
>
> Key: SPARK-13902
> URL: https://issues.apache.org/jira/browse/SPARK-13902
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Reporter: Takuya Ueshin
>
> {{DAGScheduler}} sometimes generates an incorrect stage graph.
> Some stages are generated for the same shuffleId twice or more and they are 
> referenced by the child stages because the building order of the graph is not 
> correct.
> Here, we submit an RDD\[F\] having a lineage of RDDs as follows (please see 
> this in {{monospaced}} font):
> {noformat}
>        <--------------------------------+
>       /                                  \
> [A] <--(1)-- [B] <--(2)-- [C] <--(3)-- [D] <--(4)-- [E] <--(5)-- [F]
>               \                                      /
>                <------------------------------------+
> {noformat}
> Note: \[\] means an RDD, () means a shuffle dependency.
> {{DAGScheduler}} generates the following stages and their parents for each 
> shuffle:
> |  | stage | parents |
> | (1) | ShuffleMapStage 2 | List() |
> | (2) | ShuffleMapStage 1 | List(ShuffleMapStage 0) |
> | (3) | ShuffleMapStage 3 | List(ShuffleMapStage 1) |
> | (4) | ShuffleMapStage 4 | List(ShuffleMapStage 2, ShuffleMapStage 3) |
> | (5) | ShuffleMapStage 5 | List(ShuffleMapStage 1, ShuffleMapStage 4) |
> | \- | ResultStage 6 | List(ShuffleMapStage 5) |
> The stage for shuffle id {{0}} should be {{ShuffleMapStage 0}}, but the stage 
> for shuffle id {{0}} is generated twice as {{ShuffleMapStage 2}} and 
> {{ShuffleMapStage 0}} is overwritten by {{ShuffleMapStage 2}}, and the stage 
> {{ShuffleMapStage 1}} keeps referring to the _old_ stage {{ShuffleMapStage 0}}.
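
To illustrate why topological order matters here, a small self-contained sketch 
(not DAGScheduler code; Dep and the ids are stand-ins for shuffle dependencies) 
that orders the shuffles from the description so every ancestor is produced 
before its descendants:

{code}
case class Dep(id: Int, parents: Seq[Int])

def topologicalOrder(deps: Map[Int, Dep]): Seq[Int] = {
  val visited = scala.collection.mutable.LinkedHashSet.empty[Int]
  def visit(id: Int): Unit =
    if (!visited.contains(id)) {
      deps(id).parents.foreach(visit)   // build ancestors first
      visited += id
    }
  deps.keys.toSeq.sorted.foreach(visit)
  visited.toSeq
}

// Shuffles (1)..(5) from the lineage above: (4) depends on (1) and (3),
// (5) depends on (2) and (4).
val deps = Map(
  1 -> Dep(1, Nil), 2 -> Dep(2, Seq(1)), 3 -> Dep(3, Seq(2)),
  4 -> Dep(4, Seq(1, 3)), 5 -> Dep(5, Seq(2, 4)))
println(topologicalOrder(deps))   // List(1, 2, 3, 4, 5)
{code}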




[jira] [Commented] (SPARK-13902) Make DAGScheduler.getAncestorShuffleDependencies() return in topological order to ensure building ancestor stages first.

2016-04-25 Thread Takuya Ueshin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15257481#comment-15257481
 ] 

Takuya Ueshin commented on SPARK-13902:
---

I'm sorry, I made a mistake.
I should have written the numbers from the diagram, but I wrote the actual 
shuffle ids that DAGScheduler assigns when we run the test.
I'll update it.

So the rest of your questions are right.


> Make DAGScheduler.getAncestorShuffleDependencies() return in topological 
> order to ensure building ancestor stages first.
> 
>
> Key: SPARK-13902
> URL: https://issues.apache.org/jira/browse/SPARK-13902
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Reporter: Takuya Ueshin
>
> {{DAGScheduler}} sometimes generates an incorrect stage graph.
> Some stages are generated for the same shuffleId twice or more and they are 
> referenced by the child stages because the building order of the graph is not 
> correct.
> Here, we submit an RDD\[F\] having a lineage of RDDs as follows (please see 
> this in {{monospaced}} font):
> {noformat}
>        <--------------------------------+
>       /                                  \
> [A] <--(1)-- [B] <--(2)-- [C] <--(3)-- [D] <--(4)-- [E] <--(5)-- [F]
>               \                                      /
>                <------------------------------------+
> {noformat}
> {{DAGScheduler}} generates the following stages and their parents for each 
> shuffle id:
> | shuffle id | stage | parents |
> | 0 | ShuffleMapStage 2 | List() |
> | 1 | ShuffleMapStage 1 | List(ShuffleMapStage 0) |
> | 2 | ShuffleMapStage 3 | List(ShuffleMapStage 1) |
> | 3 | ShuffleMapStage 4 | List(ShuffleMapStage 2, ShuffleMapStage 3) |
> | 4 | ShuffleMapStage 5 | List(ShuffleMapStage 1, ShuffleMapStage 4) |
> | \- | ResultStage 6 | List(ShuffleMapStage 5) |
> The stage for shuffle id {{0}} should be {{ShuffleMapStage 0}}, but the stage 
> for shuffle id {{0}} is generated twice as {{ShuffleMapStage 2}} and 
> {{ShuffleMapStage 0}} is overwritten by {{ShuffleMapStage 2}}, and the stage 
> {{ShuffleMapStage 1}} keeps referring to the _old_ stage {{ShuffleMapStage 0}}.






[jira] [Assigned] (SPARK-14314) K-means model persistence in SparkR

2016-04-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14314:


Assignee: (was: Apache Spark)

> K-means model persistence in SparkR
> ---
>
> Key: SPARK-14314
> URL: https://issues.apache.org/jira/browse/SPARK-14314
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SparkR
>Reporter: Xiangrui Meng
>







[jira] [Commented] (SPARK-14314) K-means model persistence in SparkR

2016-04-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15257475#comment-15257475
 ] 

Apache Spark commented on SPARK-14314:
--

User 'GayathriMurali' has created a pull request for this issue:
https://github.com/apache/spark/pull/12680

> K-means model persistence in SparkR
> ---
>
> Key: SPARK-14314
> URL: https://issues.apache.org/jira/browse/SPARK-14314
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SparkR
>Reporter: Xiangrui Meng
>







[jira] [Assigned] (SPARK-14314) K-means model persistence in SparkR

2016-04-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14314:


Assignee: Apache Spark

> K-means model persistence in SparkR
> ---
>
> Key: SPARK-14314
> URL: https://issues.apache.org/jira/browse/SPARK-14314
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SparkR
>Reporter: Xiangrui Meng
>Assignee: Apache Spark
>







[jira] [Commented] (SPARK-14910) Native DDL Command Support for Describe Function in Non-identifier Format

2016-04-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15257427#comment-15257427
 ] 

Apache Spark commented on SPARK-14910:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/12679

> Native DDL Command Support for Describe Function in Non-identifier Format
> -
>
> Key: SPARK-14910
> URL: https://issues.apache.org/jira/browse/SPARK-14910
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>
> The existing `Describe Function` only supports function names in `identifier` 
> form. This is different from how Hive behaves, which is why many of the 
> `udf_abc` test cases in `HiveCompatibilitySuite` do not pass, for example:
> - udf_not.q
> - udf_bitwise_not.q
> We need to support `Describe Function` commands whose function names are in the 
> following formats, which are not natively supported:
> - `STRING` (e.g., `'func1'`)
> - `comparisonOperator` (e.g., `<`)
> - `arithmeticOperator` (e.g., `+`)
> - `predicateOperator` (e.g., `or`)






[jira] [Assigned] (SPARK-14910) Native DDL Command Support for Describe Function in Non-identifier Format

2016-04-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14910:


Assignee: (was: Apache Spark)

> Native DDL Command Support for Describe Function in Non-identifier Format
> -
>
> Key: SPARK-14910
> URL: https://issues.apache.org/jira/browse/SPARK-14910
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>
> The existing `Describe Function` only supports function names in `identifier` 
> form. This is different from how Hive behaves, which is why many of the 
> `udf_abc` test cases in `HiveCompatibilitySuite` do not pass, for example:
> - udf_not.q
> - udf_bitwise_not.q
> We need to support `Describe Function` commands whose function names are in the 
> following formats, which are not natively supported:
> - `STRING` (e.g., `'func1'`)
> - `comparisonOperator` (e.g., `<`)
> - `arithmeticOperator` (e.g., `+`)
> - `predicateOperator` (e.g., `or`)






[jira] [Assigned] (SPARK-14910) Native DDL Command Support for Describe Function in Non-identifier Format

2016-04-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14910:


Assignee: Apache Spark

> Native DDL Command Support for Describe Function in Non-identifier Format
> -
>
> Key: SPARK-14910
> URL: https://issues.apache.org/jira/browse/SPARK-14910
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>
> The existing `Describe Function` only supports function names in `identifier` 
> form. This is different from how Hive behaves, which is why many of the 
> `udf_abc` test cases in `HiveCompatibilitySuite` do not pass, for example:
> - udf_not.q
> - udf_bitwise_not.q
> We need to support `Describe Function` commands whose function names are in the 
> following formats, which are not natively supported:
> - `STRING` (e.g., `'func1'`)
> - `comparisonOperator` (e.g., `<`)
> - `arithmeticOperator` (e.g., `+`)
> - `predicateOperator` (e.g., `or`)






[jira] [Updated] (SPARK-14910) Native DDL Command Support for Describe Function in Non-identifier Format

2016-04-25 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-14910:

Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-14118

> Native DDL Command Support for Describe Function in Non-identifier Format
> -
>
> Key: SPARK-14910
> URL: https://issues.apache.org/jira/browse/SPARK-14910
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>
> The existing `Describe Function` only supports function names in `identifier` 
> form. This is different from how Hive behaves, which is why many of the 
> `udf_abc` test cases in `HiveCompatibilitySuite` do not pass, for example:
> - udf_not.q
> - udf_bitwise_not.q
> We need to support `Describe Function` commands whose function names are in the 
> following formats, which are not natively supported:
> - `STRING` (e.g., `'func1'`)
> - `comparisonOperator` (e.g., `<`)
> - `arithmeticOperator` (e.g., `+`)
> - `predicateOperator` (e.g., `or`)






[jira] [Created] (SPARK-14910) Native DDL Command Support for Describe Function in Non-identifier Format

2016-04-25 Thread Xiao Li (JIRA)
Xiao Li created SPARK-14910:
---

 Summary: Native DDL Command Support for Describe Function in 
Non-identifier Format
 Key: SPARK-14910
 URL: https://issues.apache.org/jira/browse/SPARK-14910
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.0.0
Reporter: Xiao Li


The existing `Describe Function` only supports function names in `identifier` 
form. This is different from how Hive behaves, which is why many of the `udf_abc` 
test cases in `HiveCompatibilitySuite` do not pass, for example:
- udf_not.q
- udf_bitwise_not.q

We need to support `Describe Function` commands whose function names are in the 
following formats, which are not natively supported:
- `STRING` (e.g., `'func1'`)
- `comparisonOperator` (e.g., `<`)
- `arithmeticOperator` (e.g., `+`)
- `predicateOperator` (e.g., `or`)
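
For illustration, the non-identifier forms listed above driven through SQL; the 
exact quoting the parser will accept depends on the change itself, so treat these 
statements as examples rather than final syntax:

{code}
Seq(
  "DESCRIBE FUNCTION 'func1'",  // STRING
  "DESCRIBE FUNCTION <",        // comparisonOperator
  "DESCRIBE FUNCTION +",        // arithmeticOperator
  "DESCRIBE FUNCTION or"        // predicateOperator
).foreach(stmt => sqlContext.sql(stmt).show())
{code}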







[jira] [Updated] (SPARK-14909) Spark UI submitted time is wrong

2016-04-25 Thread Christophe (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christophe updated SPARK-14909:
---
Attachment: time-spark3.png
time-spark2.png
time-spark1.png
spark-submission.png

spark-submission.png is what is displayed on the main web UI. The submitted 
timestamps are not what I expect. Instead, in the time-spark{123}.png screenshots 
I can see the correct timestamps.

> Spark UI submitted time is wrong
> 
>
> Key: SPARK-14909
> URL: https://issues.apache.org/jira/browse/SPARK-14909
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.0
>Reporter: Christophe
> Attachments: spark-submission.png, time-spark1.png, time-spark2.png, 
> time-spark3.png
>
>
> There is something wrong with the "submitted time" reported on the main web 
> UI.
> For example, I have jobs submitted every 5 minutes (00; 05; 10; 15 ...).
> Under the "Completed applications", I can see my jobs with a submitted 
> timestamp of the same value: 11:04 AM 26/04/2016.
> But, if I click on the individual application and look at the submitted time 
> at the top, I get the expected values, for example: Submit Date: Tue Apr 26 
> 01:05:03 UTC 2016
> I'll try to attach some screenshots.






[jira] [Assigned] (SPARK-14908) Provide support HDFS-located resources for "spark.executor.extraClasspath" on YARN

2016-04-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14908:


Assignee: Apache Spark

> Provide support  HDFS-located resources for "spark.executor.extraClasspath" 
> on YARN
> ---
>
> Key: SPARK-14908
> URL: https://issues.apache.org/jira/browse/SPARK-14908
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Reporter: Dubkov Mikhail
>Assignee: Apache Spark
>Priority: Minor
>
> On our project we use a custom implementation of SparkSerializer, and we found 
> that the serializer class is loaded when the executor launches 
> (SparkEnv.create()). So we were forced to use "spark.executor.extraClassPath", 
> and the custom serializer class loads fine for now. But this is not good for 
> the deployment process, because "spark.executor.extraClassPath" does not 
> currently support HDFS-based resources, which means we have to deploy the 
> artifact containing the serializer to each Hadoop node. We would like to 
> simplify the deployment process.
> We have tried making changes for this purpose and it works for us now. The 
> changes are relevant only for Hadoop/YARN deployments.
> We didn't find any workaround that avoids the extra class path definition for 
> a custom serializer implementation; please let us know if we missed something.
> I will create a pull request against the master branch; could you please look 
> into the changes and come back with feedback?
> We need these changes in the master branch to simplify our future upgrades, 
> and I hope this improvement can be helpful for other Spark users.






[jira] [Commented] (SPARK-14908) Provide support HDFS-located resources for "spark.executor.extraClasspath" on YARN

2016-04-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15257390#comment-15257390
 ] 

Apache Spark commented on SPARK-14908:
--

User 'mikhaildubkov' has created a pull request for this issue:
https://github.com/apache/spark/pull/12678

> Provide support  HDFS-located resources for "spark.executor.extraClasspath" 
> on YARN
> ---
>
> Key: SPARK-14908
> URL: https://issues.apache.org/jira/browse/SPARK-14908
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Reporter: Dubkov Mikhail
>Priority: Minor
>
> On our project we use a custom implementation of SparkSerializer, and we found 
> that the serializer class is loaded when the executor launches 
> (SparkEnv.create()). So we were forced to use "spark.executor.extraClassPath", 
> and the custom serializer class loads fine for now. But this is not good for 
> the deployment process, because "spark.executor.extraClassPath" does not 
> currently support HDFS-based resources, which means we have to deploy the 
> artifact containing the serializer to each Hadoop node. We would like to 
> simplify the deployment process.
> We have tried making changes for this purpose and it works for us now. The 
> changes are relevant only for Hadoop/YARN deployments.
> We didn't find any workaround that avoids the extra class path definition for 
> a custom serializer implementation; please let us know if we missed something.
> I will create a pull request against the master branch; could you please look 
> into the changes and come back with feedback?
> We need these changes in the master branch to simplify our future upgrades, 
> and I hope this improvement can be helpful for other Spark users.






[jira] [Assigned] (SPARK-14908) Provide support HDFS-located resources for "spark.executor.extraClasspath" on YARN

2016-04-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14908:


Assignee: (was: Apache Spark)

> Provide support  HDFS-located resources for "spark.executor.extraClasspath" 
> on YARN
> ---
>
> Key: SPARK-14908
> URL: https://issues.apache.org/jira/browse/SPARK-14908
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Reporter: Dubkov Mikhail
>Priority: Minor
>
> On our project we use a custom implementation of SparkSerializer, and we found 
> that the serializer class is loaded when the executor launches 
> (SparkEnv.create()). So we were forced to use "spark.executor.extraClassPath", 
> and the custom serializer class loads fine for now. But this is not good for 
> the deployment process, because "spark.executor.extraClassPath" does not 
> currently support HDFS-based resources, which means we have to deploy the 
> artifact containing the serializer to each Hadoop node. We would like to 
> simplify the deployment process.
> We have tried making changes for this purpose and it works for us now. The 
> changes are relevant only for Hadoop/YARN deployments.
> We didn't find any workaround that avoids the extra class path definition for 
> a custom serializer implementation; please let us know if we missed something.
> I will create a pull request against the master branch; could you please look 
> into the changes and come back with feedback?
> We need these changes in the master branch to simplify our future upgrades, 
> and I hope this improvement can be helpful for other Spark users.






[jira] [Created] (SPARK-14909) Spark UI submitted time is wrong

2016-04-25 Thread Christophe (JIRA)
Christophe created SPARK-14909:
--

 Summary: Spark UI submitted time is wrong
 Key: SPARK-14909
 URL: https://issues.apache.org/jira/browse/SPARK-14909
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.6.0
Reporter: Christophe


There is something wrong with the "submitted time" reported on the main web UI.

For example, I have jobs submitted every 5 minutes (00; 05; 10; 15 ...).
Under the "Completed applications", I can see my jobs with a submitted 
timestamp of the same value: 11:04 AM 26/04/2016.

But, if I click on the individual application and look at the submitted time at 
the top, I get the expected values, for example: Submit Date: Tue Apr 26 
01:05:03 UTC 2016

I'll try to attach some screenshots.






[jira] [Issue Comment Deleted] (SPARK-14772) Python ML Params.copy treats uid, paramMaps differently than Scala

2016-04-25 Thread hujiayin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hujiayin updated SPARK-14772:
-
Comment: was deleted

(was: I can submit a code to fix this issue and I'm testing it.)

> Python ML Params.copy treats uid, paramMaps differently than Scala
> --
>
> Key: SPARK-14772
> URL: https://issues.apache.org/jira/browse/SPARK-14772
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>
> In PySpark, {{ml.param.Params.copy}} does not quite match the Scala 
> implementation:
> * It does not copy the UID
> * It does not respect the difference between defaultParamMap and paramMap.  
> This is an issue with {{_copyValues}}.






[jira] [Resolved] (SPARK-14902) Expose user-facing RuntimeConfig in SparkSession

2016-04-25 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-14902.
-
   Resolution: Fixed
Fix Version/s: 2.0.0

> Expose user-facing RuntimeConfig in SparkSession
> 
>
> Key: SPARK-14902
> URL: https://issues.apache.org/jira/browse/SPARK-14902
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Andrew Or
>Assignee: Andrew Or
> Fix For: 2.0.0
>
>







[jira] [Updated] (SPARK-6339) Support creating temporary tables with DDL

2016-04-25 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-6339:

Target Version/s: 2.0.0

> Support creating temporary tables with DDL
> --
>
> Key: SPARK-6339
> URL: https://issues.apache.org/jira/browse/SPARK-6339
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: Hossein Falaki
>
> It would be useful to support the following:
> {code}
> create temporary table counted as
> select count(transactions), company from sales group by company
> {code}
> Right now this is possible through registerTempTable()
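
For comparison, a minimal sketch of the registerTempTable() route mentioned 
above, using the same example query (the table and column names come from that 
snippet):

{code}
sqlContext
  .sql("select count(transactions) as cnt, company from sales group by company")
  .registerTempTable("counted")
{code}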






[jira] [Created] (SPARK-14908) Provide support HDFS-located resources for "spark.executor.extraClasspath" on YARN

2016-04-25 Thread Dubkov Mikhail (JIRA)
Dubkov Mikhail created SPARK-14908:
--

 Summary: Provide support  HDFS-located resources for 
"spark.executor.extraClasspath" on YARN
 Key: SPARK-14908
 URL: https://issues.apache.org/jira/browse/SPARK-14908
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Reporter: Dubkov Mikhail
Priority: Minor


On our project we use a custom implementation of SparkSerializer, and we found 
that the serializer class is loaded when the executor launches (SparkEnv.create()). 
So we were forced to use "spark.executor.extraClassPath", and the custom 
serializer class loads fine for now. But this is not good for the deployment 
process, because "spark.executor.extraClassPath" does not currently support 
HDFS-based resources, which means we have to deploy the artifact containing the 
serializer to each Hadoop node. We would like to simplify the deployment process.

We have tried making changes for this purpose and it works for us now. The 
changes are relevant only for Hadoop/YARN deployments.
We didn't find any workaround that avoids the extra class path definition for a 
custom serializer implementation; please let us know if we missed something.

I will create a pull request against the master branch; could you please look 
into the changes and come back with feedback?

We need these changes in the master branch to simplify our future upgrades, and 
I hope this improvement can be helpful for other Spark users.
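
A minimal sketch of the configuration this request would enable, assuming the 
proposed HDFS support; the jar path and serializer class name below are purely 
illustrative:

{code}
val conf = new org.apache.spark.SparkConf()
  // Would only work on YARN once HDFS-located classpath entries are supported.
  .set("spark.executor.extraClassPath", "hdfs:///libs/custom-serializer.jar")
  .set("spark.serializer", "com.example.CustomSparkSerializer") // hypothetical class
{code}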






[jira] [Assigned] (SPARK-14907) Use repartition in GLMRegressionModel.save

2016-04-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14907:


Assignee: (was: Apache Spark)

> Use repartition in GLMRegressionModel.save
> --
>
> Key: SPARK-14907
> URL: https://issues.apache.org/jira/browse/SPARK-14907
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Dongjoon Hyun
>Priority: Trivial
>
> This issue changes the `GLMRegressionModel.save` function as in the following 
> code, which is similar to other algorithms' Parquet writes.
> {code}
> -  val dataRDD: DataFrame = sc.parallelize(Seq(data), 1).toDF()
> -  // TODO: repartition with 1 partition after SPARK-5532 gets fixed
> -  dataRDD.write.parquet(Loader.dataPath(path))
> +  sqlContext.createDataFrame(Seq(data)).repartition(1).write.parquet(Loader.dataPath(path))
> {code}






[jira] [Commented] (SPARK-14907) Use repartition in GLMRegressionModel.save

2016-04-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15257324#comment-15257324
 ] 

Apache Spark commented on SPARK-14907:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/12676

> Use repartition in GLMRegressionModel.save
> --
>
> Key: SPARK-14907
> URL: https://issues.apache.org/jira/browse/SPARK-14907
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Dongjoon Hyun
>Priority: Trivial
>
> This issue changes the `GLMRegressionModel.save` function as in the following 
> code, which is similar to other algorithms' Parquet writes.
> {code}
> -  val dataRDD: DataFrame = sc.parallelize(Seq(data), 1).toDF()
> -  // TODO: repartition with 1 partition after SPARK-5532 gets fixed
> -  dataRDD.write.parquet(Loader.dataPath(path))
> +  sqlContext.createDataFrame(Seq(data)).repartition(1).write.parquet(Loader.dataPath(path))
> {code}






[jira] [Updated] (SPARK-14613) Add @Since into the matrix and vector classes in spark-mllib-local

2016-04-25 Thread DB Tsai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai updated SPARK-14613:

Assignee: Pravin Gadakh

> Add @Since into the matrix and vector classes in spark-mllib-local
> --
>
> Key: SPARK-14613
> URL: https://issues.apache.org/jira/browse/SPARK-14613
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build, ML
>Reporter: DB Tsai
>Assignee: Pravin Gadakh
>
> In spark-mllib-local, we're no longer able to use the @Since annotation. As a 
> result, we will switch to the standard javadoc style using /* @Since */. This 
> task will annotate those new APIs as @Since 2.0.
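
A sketch of the javadoc-style marker being referred to; the trait and wording 
are illustrative, not the actual spark-mllib-local source:

{code}
/**
 * Represents a numeric vector of doubles.
 *
 * @since 2.0.0
 */
sealed trait Vector extends Serializable
{code}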






[jira] [Assigned] (SPARK-14907) Use repartition in GLMRegressionModel.save

2016-04-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14907:


Assignee: Apache Spark

> Use repartition in GLMRegressionModel.save
> --
>
> Key: SPARK-14907
> URL: https://issues.apache.org/jira/browse/SPARK-14907
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Trivial
>
> This issue changes the `GLMRegressionModel.save` function as in the following 
> code, which is similar to other algorithms' Parquet writes.
> {code}
> -  val dataRDD: DataFrame = sc.parallelize(Seq(data), 1).toDF()
> -  // TODO: repartition with 1 partition after SPARK-5532 gets fixed
> -  dataRDD.write.parquet(Loader.dataPath(path))
> +  sqlContext.createDataFrame(Seq(data)).repartition(1).write.parquet(Loader.dataPath(path))
> {code}






[jira] [Updated] (SPARK-14906) Move VectorUDT and MatrixUDT in PySpark to new ML package

2016-04-25 Thread Liang-Chi Hsieh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang-Chi Hsieh updated SPARK-14906:

Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-13944

> Move VectorUDT and MatrixUDT in PySpark to new ML package
> -
>
> Key: SPARK-14906
> URL: https://issues.apache.org/jira/browse/SPARK-14906
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Reporter: Liang-Chi Hsieh
>
> As we move VectorUDT and MatrixUDT in Scala to new ml package, the PySpark 
> codes should be moved too.






[jira] [Updated] (SPARK-14906) Move VectorUDT and MatrixUDT in PySpark to new ML package

2016-04-25 Thread Liang-Chi Hsieh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang-Chi Hsieh updated SPARK-14906:

Description: As we move VectorUDT and MatrixUDT in Scala to new ml package, 
the PySpark codes should be moved too.

> Move VectorUDT and MatrixUDT in PySpark to new ML package
> -
>
> Key: SPARK-14906
> URL: https://issues.apache.org/jira/browse/SPARK-14906
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Liang-Chi Hsieh
>
> As we move VectorUDT and MatrixUDT in Scala to new ml package, the PySpark 
> codes should be moved too.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14907) Use repartition in GLMRegressionModel.save

2016-04-25 Thread Dongjoon Hyun (JIRA)
Dongjoon Hyun created SPARK-14907:
-

 Summary: Use repartition in GLMRegressionModel.save
 Key: SPARK-14907
 URL: https://issues.apache.org/jira/browse/SPARK-14907
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Dongjoon Hyun
Priority: Trivial


This issue changes `GLMRegressionModel.save` function like the following code 
that is similar to other algorithms' parquet write.
{code}
-  val dataRDD: DataFrame = sc.parallelize(Seq(data), 1).toDF()
-  // TODO: repartition with 1 partition after SPARK-5532 gets fixed
-  dataRDD.write.parquet(Loader.dataPath(path))
+  sqlContext.createDataFrame(Seq(data)).repartition(1).write.parquet(Loader.dataPath(path))
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14906) Move VectorUDT and MatrixUDT in PySpark to new ML package

2016-04-25 Thread Liang-Chi Hsieh (JIRA)
Liang-Chi Hsieh created SPARK-14906:
---

 Summary: Move VectorUDT and MatrixUDT in PySpark to new ML package
 Key: SPARK-14906
 URL: https://issues.apache.org/jira/browse/SPARK-14906
 Project: Spark
  Issue Type: Improvement
  Components: ML, PySpark
Reporter: Liang-Chi Hsieh






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14249) Change MLReader.read to be a property for PySpark

2016-04-25 Thread Miao Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15257311#comment-15257311
 ] 

Miao Wang commented on SPARK-14249:
---

[~josephkb] Thanks! It is good to learn something new.

Miao

> Change MLReader.read to be a property for PySpark
> -
>
> Key: SPARK-14249
> URL: https://issues.apache.org/jira/browse/SPARK-14249
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> To match MLWritable.write and SQLContext.read, it will be good to make the 
> PySpark MLReader classmethod {{read}} be a property.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14894) Python GaussianMixture summary

2016-04-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15257307#comment-15257307
 ] 

Apache Spark commented on SPARK-14894:
--

User 'GayathriMurali' has created a pull request for this issue:
https://github.com/apache/spark/pull/12675

> Python GaussianMixture summary
> --
>
> Key: SPARK-14894
> URL: https://issues.apache.org/jira/browse/SPARK-14894
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> In spark.ml, GaussianMixture includes a result summary.  The Python API 
> should provide the same functionality.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14880) Parallel Gradient Descent with less map-reduce shuffle overhead

2016-04-25 Thread Ahmed Mahran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15257297#comment-15257297
 ] 

Ahmed Mahran commented on SPARK-14880:
--

I think Zinkevich's analysis requires the loss to be convex regardless of whether 
it is smooth or not. In their evaluation, they used a Huber loss, which I think is 
not smooth.

I'd like to highlight that this algorithm differs from Zinkevich's in two 
respects:
 - It uses mini-batch SGD instead of strict SGD
 - It applies higher level iterations

I don't have theoretical evidence about the effect of either modification on 
convergence. However, they seem plausible for the following reasons. In less 
technical terms, the trick is to guarantee that the parallel partitions converge 
to limits that are as close as possible. Imagine a bunch of climbers, one on each 
partition, climbing similar hills, starting from similar points, moving at the 
same rate and taking steps in similar directions; they would eventually end up at 
similar limits. The following seem to be logically plausible guarantees:
 - Using the same initialization, the same step size per iteration and number 
of iterations
 - Using mini-batches with the same sampling distribution reduces stochasticity
 - Averaging and reiterating resynchronizes the possibly deviated climbers to 
the same point
 - Reshuffling helps produce new samples to learn from

I would be interested in submitting it as a Spark package. I'd also be interested 
in carrying out experiments; suggestions would be much appreciated.
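
For concreteness, here is a rough Scala sketch of the partition-local mini-batch 
scheme being discussed (weights averaged after each outer pass; reshuffling between 
passes omitted). The {{Example}} type, the squared-loss gradient, and the constant 
step size are hypothetical stand-ins, not the MLlib implementation:
{code}
import scala.util.Random
import org.apache.spark.rdd.RDD

case class Example(label: Double, features: Array[Double])

def parallelSgd(data: RDD[Example], init: Array[Double],
                outerIters: Int, innerIters: Int,
                fraction: Double, step: Double): Array[Double] = {
  var weights = init
  for (_ <- 1 to outerIters) {
    val (sum, count) = data.mapPartitions { iter =>
      val examples = iter.toArray
      val w = weights.clone()             // every partition starts from the same point
      val rnd = new Random()
      for (_ <- 1 to innerIters) {
        val batch = examples.filter(_ => rnd.nextDouble() < fraction)  // mini-batch sample
        batch.foreach { ex =>
          val margin = w.zip(ex.features).map { case (wi, xi) => wi * xi }.sum
          val err = margin - ex.label     // gradient of squared loss: err * x
          var i = 0
          while (i < w.length) { w(i) -= step * err * ex.features(i); i += 1 }
        }
      }
      Iterator((w, 1L))
    }.reduce { case ((w1, c1), (w2, c2)) =>
      (w1.zip(w2).map { case (a, b) => a + b }, c1 + c2)
    }
    weights = sum.map(_ / count)          // average the per-partition models
  }
  weights
}
{code}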

> Parallel Gradient Descent with less map-reduce shuffle overhead
> ---
>
> Key: SPARK-14880
> URL: https://issues.apache.org/jira/browse/SPARK-14880
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Ahmed Mahran
>  Labels: performance
>
> The current implementation of (Stochastic) Gradient Descent performs one 
> map-reduce shuffle per iteration. Moreover, when the sampling fraction gets 
> smaller, the algorithm becomes shuffle-bound instead of CPU-bound.
> {code}
> (1 to numIterations or convergence) {
>  rdd
>   .sample(fraction)
>   .map(Gradient)
>   .reduce(Update)
> }
> {code}
> A more performant variation requires only one map-reduce regardless of the 
> number of iterations. A local mini-batch SGD could be run on each partition, 
> then the results could be averaged. This is based on (Zinkevich, Martin, 
> Markus Weimer, Lihong Li, and Alex J. Smola. "Parallelized stochastic 
> gradient descent." In Advances in neural information processing systems, 
> 2010, 
> http://www.research.rutgers.edu/~lihong/pub/Zinkevich11Parallelized.pdf).
> {code}
> rdd
>  .shuffle()
>  .mapPartitions((1 to numIterations or convergence) {
>iter.sample(fraction).map(Gradient).reduce(Update)
>  })
>  .reduce(Average)
> {code}
> A higher level iteration could enclose the above variation; shuffling the 
> data before the local mini-batches and feeding back the average weights from 
> the last iteration. This allows more variability in the sampling of the 
> mini-batches with the possibility to cover the whole dataset. Here is a Spark 
> based implementation 
> https://github.com/mashin-io/rich-spark/blob/master/src/main/scala/org/apache/spark/mllib/optimization/ParallelSGD.scala
> {code}
> (1 to numIterations1 or convergence) {
>  rdd
>   .shuffle()
>   .mapPartitions((1 to numIterations2 or convergence) {
> iter.sample(fraction).map(Gradient).reduce(Update)
>   })
>   .reduce(Average)
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14905) create conda environments w/locked package versions

2016-04-25 Thread shane knapp (JIRA)
shane knapp created SPARK-14905:
---

 Summary: create conda environments w/locked package versions
 Key: SPARK-14905
 URL: https://issues.apache.org/jira/browse/SPARK-14905
 Project: Spark
  Issue Type: Improvement
  Components: Build
Reporter: shane knapp


right now, the package dependency story for the jenkins build system is...  
well...  non-existent.

packages are installed, and only rarely (if ever) updated.  when a new anaconda 
or system python library is installed or updated for a specific user/build 
requirement, this will randomly update and/or install other packages that may 
or may not have backwards compatibility. 

we've survived for a number of years so far without looking to deal with the 
technical debt, but i don't see how this will remain manageable, especially as 
spark and other projects hosted on jenkins grow.

example:  currently, a non-spark amplab project (e-mission) needs scipy updated 
from 0.15.1 to 0.17.0 for their tests to pass.  this simple upgrade adds three 
new python libraries (libgfortran, mkl, wheel) and updates eleven others 
(conda, conda-env, numpy, openssl, pip, python, pyyaml, requests, setuptools, 
sqlite, yaml).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14870) NPE in generate aggregate

2016-04-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15257244#comment-15257244
 ] 

Apache Spark commented on SPARK-14870:
--

User 'sameeragarwal' has created a pull request for this issue:
https://github.com/apache/spark/pull/12674

> NPE in generate aggregate
> -
>
> Key: SPARK-14870
> URL: https://issues.apache.org/jira/browse/SPARK-14870
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Sameer Agarwal
> Fix For: 2.0.0
>
>
> When ran TPCDS Q14a
> {code}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 126.0 failed 1 times, most recent failure: Lost task 0.0 in stage 126.0 
> (TID 234, localhost): java.lang.NullPointerException
>   at 
> org.apache.spark.sql.execution.vectorized.ColumnVector.putDecimal(ColumnVector.java:576)
>   at 
> org.apache.spark.sql.execution.vectorized.ColumnarBatch$Row.setDecimal(ColumnarBatch.java:325)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$7$$anon$1.hasNext(WholeStageCodegenExec.scala:361)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:254)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Driver stacktrace:
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1450)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1438)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1437)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1437)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:809)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:809)
>   at scala.Option.foreach(Option.scala:257)
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:809)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1659)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1618)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1607)
>   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>   at 
> org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1780)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1793)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1806)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1820)
>   at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:880)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:357)
>   at org.apache.spark.rdd.RDD.collect(RDD.scala:879)
>   at 
> org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:453)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply$mcI$sp(Dataset.scala:2367)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply(Dataset.scala:2367)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply(Dataset.scala:2367)
>   at 
> 

[jira] [Resolved] (SPARK-14888) UnresolvedFunction should use FunctionIdentifier rather than just a string for function name

2016-04-25 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-14888.
-
   Resolution: Fixed
Fix Version/s: 2.0.0

> UnresolvedFunction should use FunctionIdentifier rather than just a string 
> for function name
> 
>
> Key: SPARK-14888
> URL: https://issues.apache.org/jira/browse/SPARK-14888
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14511) Publish our forked genjavadoc for 2.12.0-M4 or stop using a forked version

2016-04-25 Thread Jakob Odersky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15257217#comment-15257217
 ] 

Jakob Odersky commented on SPARK-14511:
---

Update: an issue was discovered during release-testing upstream. I just 
submitted a fix for it, tested against Akka and Spark.
Javadoc in Spark emits a few error messages; however, these were already present 
previously and do not affect the final generated documentation.
I'll get back when the release is out.

> Publish our forked genjavadoc for 2.12.0-M4 or stop using a forked version
> --
>
> Key: SPARK-14511
> URL: https://issues.apache.org/jira/browse/SPARK-14511
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build, Project Infra
>Reporter: Josh Rosen
>
> Before we can move to 2.12, we need to publish our forked genjavadoc for 
> 2.12.0-M4 (or 2.12 final) or stop using a forked version of the plugin.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14555) Python API for methods introduced for Structured Streaming

2016-04-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15257211#comment-15257211
 ] 

Apache Spark commented on SPARK-14555:
--

User 'brkyvz' has created a pull request for this issue:
https://github.com/apache/spark/pull/12673

> Python API for methods introduced for Structured Streaming
> --
>
> Key: SPARK-14555
> URL: https://issues.apache.org/jira/browse/SPARK-14555
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL, Streaming
>Reporter: Burak Yavuz
>Assignee: Burak Yavuz
> Fix For: 2.0.0
>
>
> Methods added for Structured Streaming don't have a Python API yet.
> We need to provide APIs for the new methods in:
>  - DataFrameReader
>  - DataFrameWriter
>  - ContinuousQuery
>  - Trigger



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8855) Python API for Association Rules

2016-04-25 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15257205#comment-15257205
 ] 

Joseph K. Bradley commented on SPARK-8855:
--

This may be significantly easier to add in the DataFrame-based API.  I think we 
should prioritize getting AssociationRules into the DataFrame API, after which 
it should be much easier to add this Python wrapper.  Here's the related issue: 
[SPARK-14501]

> Python API for Association Rules
> 
>
> Key: SPARK-8855
> URL: https://issues.apache.org/jira/browse/SPARK-8855
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Feynman Liang
>Priority: Minor
>
> A simple Python wrapper and doctests need to be written for Association 
> Rules. The relevant method is {{FPGrowthModel.generateAssociationRules}}. The 
> code will likely live in {{fpm.py}}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14904) Add back HiveContext in compatibility package

2016-04-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15257191#comment-15257191
 ] 

Apache Spark commented on SPARK-14904:
--

User 'andrewor14' has created a pull request for this issue:
https://github.com/apache/spark/pull/12672

> Add back HiveContext in compatibility package
> -
>
> Key: SPARK-14904
> URL: https://issues.apache.org/jira/browse/SPARK-14904
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14904) Add back HiveContext in compatibility package

2016-04-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14904:


Assignee: Andrew Or  (was: Apache Spark)

> Add back HiveContext in compatibility package
> -
>
> Key: SPARK-14904
> URL: https://issues.apache.org/jira/browse/SPARK-14904
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14904) Add back HiveContext in compatibility package

2016-04-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14904:


Assignee: Apache Spark  (was: Andrew Or)

> Add back HiveContext in compatibility package
> -
>
> Key: SPARK-14904
> URL: https://issues.apache.org/jira/browse/SPARK-14904
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Andrew Or
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14903) Revert: Change MLWritable.write to be a property

2016-04-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15257182#comment-15257182
 ] 

Apache Spark commented on SPARK-14903:
--

User 'jkbradley' has created a pull request for this issue:
https://github.com/apache/spark/pull/12671

> Revert: Change MLWritable.write to be a property
> 
>
> Key: SPARK-14903
> URL: https://issues.apache.org/jira/browse/SPARK-14903
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>
> Per discussion in [SPARK-14249], there is not a good way to support .read as 
> a property.  We will therefore revert the change to write() to keep the API 
> consistent.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14903) Revert: Change MLWritable.write to be a property

2016-04-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14903:


Assignee: Joseph K. Bradley  (was: Apache Spark)

> Revert: Change MLWritable.write to be a property
> 
>
> Key: SPARK-14903
> URL: https://issues.apache.org/jira/browse/SPARK-14903
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>
> Per discussion in [SPARK-14249], there is not a good way to support .read as 
> a property.  We will therefore revert the change to write() to keep the API 
> consistent.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14903) Revert: Change MLWritable.write to be a property

2016-04-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14903:


Assignee: Apache Spark  (was: Joseph K. Bradley)

> Revert: Change MLWritable.write to be a property
> 
>
> Key: SPARK-14903
> URL: https://issues.apache.org/jira/browse/SPARK-14903
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>Assignee: Apache Spark
>
> Per discussion in [SPARK-14249], there is not a good way to support .read as 
> a property.  We will therefore revert the change to write() to keep the API 
> consistent.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14071) Change MLWritable.write to be a property

2016-04-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15257183#comment-15257183
 ] 

Apache Spark commented on SPARK-14071:
--

User 'jkbradley' has created a pull request for this issue:
https://github.com/apache/spark/pull/12671

> Change MLWritable.write to be a property
> 
>
> Key: SPARK-14071
> URL: https://issues.apache.org/jira/browse/SPARK-14071
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>Assignee: Miao Wang
>Priority: Trivial
> Fix For: 2.0.0
>
>
> This will match the Scala API + the DataFrame Python API.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14904) Add back HiveContext in compatibility package

2016-04-25 Thread Andrew Or (JIRA)
Andrew Or created SPARK-14904:
-

 Summary: Add back HiveContext in compatibility package
 Key: SPARK-14904
 URL: https://issues.apache.org/jira/browse/SPARK-14904
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.0.0
Reporter: Andrew Or
Assignee: Andrew Or






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14828) Start SparkSession in REPL instead of SQLContext

2016-04-25 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-14828:

Issue Type: Sub-task  (was: Bug)
Parent: SPARK-13485

> Start SparkSession in REPL instead of SQLContext
> 
>
> Key: SPARK-14828
> URL: https://issues.apache.org/jira/browse/SPARK-14828
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Andrew Or
>Assignee: Andrew Or
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14828) Start SparkSession in REPL instead of SQLContext

2016-04-25 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-14828.
-
   Resolution: Fixed
Fix Version/s: 2.0.0

> Start SparkSession in REPL instead of SQLContext
> 
>
> Key: SPARK-14828
> URL: https://issues.apache.org/jira/browse/SPARK-14828
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Andrew Or
>Assignee: Andrew Or
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14894) Python GaussianMixture summary

2016-04-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14894:


Assignee: Apache Spark

> Python GaussianMixture summary
> --
>
> Key: SPARK-14894
> URL: https://issues.apache.org/jira/browse/SPARK-14894
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>Assignee: Apache Spark
>Priority: Minor
>
> In spark.ml, GaussianMixture includes a result summary.  The Python API 
> should provide the same functionality.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14894) Python GaussianMixture summary

2016-04-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15257161#comment-15257161
 ] 

Apache Spark commented on SPARK-14894:
--

User 'GayathriMurali' has created a pull request for this issue:
https://github.com/apache/spark/pull/12670

> Python GaussianMixture summary
> --
>
> Key: SPARK-14894
> URL: https://issues.apache.org/jira/browse/SPARK-14894
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> In spark.ml, GaussianMixture includes a result summary.  The Python API 
> should provide the same functionality.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14894) Python GaussianMixture summary

2016-04-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14894:


Assignee: (was: Apache Spark)

> Python GaussianMixture summary
> --
>
> Key: SPARK-14894
> URL: https://issues.apache.org/jira/browse/SPARK-14894
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> In spark.ml, GaussianMixture includes a result summary.  The Python API 
> should provide the same functionality.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14903) Revert: Change MLWritable.write to be a property

2016-04-25 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-14903:
-

 Summary: Revert: Change MLWritable.write to be a property
 Key: SPARK-14903
 URL: https://issues.apache.org/jira/browse/SPARK-14903
 Project: Spark
  Issue Type: Improvement
  Components: ML, PySpark
Reporter: Joseph K. Bradley
Assignee: Joseph K. Bradley






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14903) Revert: Change MLWritable.write to be a property

2016-04-25 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-14903:
--
Description: Per discussion in [SPARK-14249], there is not a good way to 
support .read as a property.  We will therefore revert the change to write() to 
keep the API consistent.

> Revert: Change MLWritable.write to be a property
> 
>
> Key: SPARK-14903
> URL: https://issues.apache.org/jira/browse/SPARK-14903
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>
> Per discussion in [SPARK-14249], there is not a good way to support .read as 
> a property.  We will therefore revert the change to write() to keep the API 
> consistent.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-14249) Change MLReader.read to be a property for PySpark

2016-04-25 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley closed SPARK-14249.
-
Resolution: Won't Fix

> Change MLReader.read to be a property for PySpark
> -
>
> Key: SPARK-14249
> URL: https://issues.apache.org/jira/browse/SPARK-14249
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> To match MLWritable.write and SQLContext.read, it will be good to make the 
> PySpark MLReader classmethod {{read}} be a property.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14249) Change MLReader.read to be a property for PySpark

2016-04-25 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15257146#comment-15257146
 ] 

Joseph K. Bradley commented on SPARK-14249:
---

I don't see a good way to do this.  The suggestion of {{PipelineModel.read = 
PipelineModelMLReader(PipelineModel)}} does not actually work: we need access to 
the JVM in the Reader's init method, but it is not yet available at that point 
because the assignment happens outside the constructor.

I'm going to close this issue.  We'll need to revert the change to write() to 
keep things consistent.  Thanks regardless!

> Change MLReader.read to be a property for PySpark
> -
>
> Key: SPARK-14249
> URL: https://issues.apache.org/jira/browse/SPARK-14249
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> To match MLWritable.write and SQLContext.read, it will be good to make the 
> PySpark MLReader classmethod {{read}} be a property.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14902) Expose user-facing RuntimeConfig in SparkSession

2016-04-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14902:


Assignee: Andrew Or  (was: Apache Spark)

> Expose user-facing RuntimeConfig in SparkSession
> 
>
> Key: SPARK-14902
> URL: https://issues.apache.org/jira/browse/SPARK-14902
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14902) Expose user-facing RuntimeConfig in SparkSession

2016-04-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14902:


Assignee: Apache Spark  (was: Andrew Or)

> Expose user-facing RuntimeConfig in SparkSession
> 
>
> Key: SPARK-14902
> URL: https://issues.apache.org/jira/browse/SPARK-14902
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Andrew Or
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14902) Expose user-facing RuntimeConfig in SparkSession

2016-04-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15257100#comment-15257100
 ] 

Apache Spark commented on SPARK-14902:
--

User 'andrewor14' has created a pull request for this issue:
https://github.com/apache/spark/pull/12669

> Expose user-facing RuntimeConfig in SparkSession
> 
>
> Key: SPARK-14902
> URL: https://issues.apache.org/jira/browse/SPARK-14902
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14902) Expose user-facing RuntimeConfig in SparkSession

2016-04-25 Thread Andrew Or (JIRA)
Andrew Or created SPARK-14902:
-

 Summary: Expose user-facing RuntimeConfig in SparkSession
 Key: SPARK-14902
 URL: https://issues.apache.org/jira/browse/SPARK-14902
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.0.0
Reporter: Andrew Or
Assignee: Andrew Or






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14313) AFTSurvivalRegression model persistence in SparkR

2016-04-25 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-14313:
--
Assignee: Yanbo Liang

> AFTSurvivalRegression model persistence in SparkR
> -
>
> Key: SPARK-14313
> URL: https://issues.apache.org/jira/browse/SPARK-14313
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SparkR
>Reporter: Xiangrui Meng
>Assignee: Yanbo Liang
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14313) AFTSurvivalRegression model persistence in SparkR

2016-04-25 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-14313:
--
Target Version/s: 2.0.0

> AFTSurvivalRegression model persistence in SparkR
> -
>
> Key: SPARK-14313
> URL: https://issues.apache.org/jira/browse/SPARK-14313
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SparkR
>Reporter: Xiangrui Meng
>Assignee: Yanbo Liang
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14312) NaiveBayes model persistence in SparkR

2016-04-25 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-14312.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12573
[https://github.com/apache/spark/pull/12573]

> NaiveBayes model persistence in SparkR
> --
>
> Key: SPARK-14312
> URL: https://issues.apache.org/jira/browse/SPARK-14312
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SparkR
>Reporter: Xiangrui Meng
>Assignee: Yanbo Liang
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14900) spark.ml classification metrics should include accuracy

2016-04-25 Thread Miao Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15257034#comment-15257034
 ] 

Miao Wang commented on SPARK-14900:
---

If no one takes this one, I will work on it.

Thanks!

Miao

> spark.ml classification metrics should include accuracy
> ---
>
> Key: SPARK-14900
> URL: https://issues.apache.org/jira/browse/SPARK-14900
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> To compute "accuracy" (0/1 classification accuracy), users can use 
> {{precision}} in MulticlassMetrics and 
> MulticlassClassificationEvaluator.metricName.  We should also support 
> "accuracy" directly as an alias to help users familiar with that name.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14894) Python GaussianMixture summary

2016-04-25 Thread Miao Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15257021#comment-15257021
 ] 

Miao Wang commented on SPARK-14894:
---

If you have it ready now, please send the pull request. I will help review 
it.

Thanks!

Miao

> Python GaussianMixture summary
> --
>
> Key: SPARK-14894
> URL: https://issues.apache.org/jira/browse/SPARK-14894
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> In spark.ml, GaussianMixture includes a result summary.  The Python API 
> should provide the same functionality.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14853) Support LeftSemi/LeftAnti in SortMergeJoin

2016-04-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14853:


Assignee: Apache Spark  (was: Davies Liu)

> Support LeftSemi/LeftAnti in SortMergeJoin
> --
>
> Key: SPARK-14853
> URL: https://issues.apache.org/jira/browse/SPARK-14853
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14853) Support LeftSemi/LeftAnti in SortMergeJoin

2016-04-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15257009#comment-15257009
 ] 

Apache Spark commented on SPARK-14853:
--

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/12668

> Support LeftSemi/LeftAnti in SortMergeJoin
> --
>
> Key: SPARK-14853
> URL: https://issues.apache.org/jira/browse/SPARK-14853
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14894) Python GaussianMixture summary

2016-04-25 Thread Gayathri Murali (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15257006#comment-15257006
 ] 

Gayathri Murali commented on SPARK-14894:
-

[~wangmiao1981] I have PR ready for this. If you are okay, I can go ahead and 
submit that.

> Python GaussianMixture summary
> --
>
> Key: SPARK-14894
> URL: https://issues.apache.org/jira/browse/SPARK-14894
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> In spark.ml, GaussianMixture includes a result summary.  The Python API 
> should provide the same functionality.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14853) Support LeftSemi/LeftAnti in SortMergeJoin

2016-04-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14853:


Assignee: Davies Liu  (was: Apache Spark)

> Support LeftSemi/LeftAnti in SortMergeJoin
> --
>
> Key: SPARK-14853
> URL: https://issues.apache.org/jira/browse/SPARK-14853
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14467) Add async io in FileScanRDD

2016-04-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15257005#comment-15257005
 ] 

Apache Spark commented on SPARK-14467:
--

User 'sameeragarwal' has created a pull request for this issue:
https://github.com/apache/spark/pull/12667

> Add async io in FileScanRDD
> ---
>
> Key: SPARK-14467
> URL: https://issues.apache.org/jira/browse/SPARK-14467
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Nong Li
>
> Experiments running over parquet data in s3 show poor interleaving of CPU 
> and IO. We should do more async IO in FileScanRDD to better use the machine 
> resources.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14891) ALS in ML never validates input schema

2016-04-25 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15257004#comment-15257004
 ] 

Joseph K. Bradley commented on SPARK-14891:
---

For most use cases, Int should be used to save on memory.  Supporting String in 
the future would be nice but would require internal indexing.  I'd say we 
should validate the input for now and require Int types.  Users who need Long 
can use the ALS.train API.

+1 for better docs & data validation.  For data validation, it could be nice to 
accept Long and other types but to make sure that the values are checked before 
casting to Int types.
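
As a rough sketch of the kind of pre-fit check being discussed (a standalone 
helper with assumed column handling, not the ALS implementation itself):
{code}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, max}
import org.apache.spark.sql.types.{IntegerType, LongType}

// Cast a Long id column down to Int only after checking that the values fit.
// Assumes a non-empty DataFrame; anything other than Int/Long is rejected.
def toIntId(df: DataFrame, idCol: String): DataFrame = df.schema(idCol).dataType match {
  case IntegerType => df
  case LongType =>
    val maxId = df.agg(max(col(idCol))).head().getLong(0)
    require(maxId <= Int.MaxValue, s"$idCol exceeds Int range; re-index ids first")
    df.withColumn(idCol, col(idCol).cast(IntegerType))
  case other =>
    throw new IllegalArgumentException(s"Unsupported id column type: $other")
}
{code}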

> ALS in ML never validates input schema
> --
>
> Key: SPARK-14891
> URL: https://issues.apache.org/jira/browse/SPARK-14891
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Nick Pentreath
>
> Currently, {{ALS.fit}} never validates the input schema. There is a 
> {{transformSchema}} impl that calls {{validateAndTransformSchema}}, but it is 
> never called in either {{ALS.fit}} or {{ALSModel.transform}}.
> This was highlighted in SPARK-13857 (and failing PySpark tests 
> [here|https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56849/consoleFull]) when
>  adding a call to {{transformSchema}} in {{ALSModel.transform}} that actually 
> validates the input schema. The PySpark docstring tests result in Long inputs 
> by default, which fail validation as Int is required.
> Currently, the inputs for user and item ids are cast to Int, with no input 
> type validation (or warning message). So users could pass in Long, Float, 
> Double, etc. It's also not made clear anywhere in the docs that only Int 
> types for user and item are supported.
> Enforcing validation seems the best option but might break user code that 
> previously "just worked" especially in PySpark. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14880) Parallel Gradient Descent with less map-reduce shuffle overhead

2016-04-25 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15256996#comment-15256996
 ] 

Joseph K. Bradley commented on SPARK-14880:
---

Thanks for this suggestion.  To get this feature merged, we would likely need 
(a) more theoretical evidence supporting the algorithm and (b) significant 
performance testing to demonstrate the improvements.  For (a), as I recall, the 
Zinkevich work requires that the loss be smooth, which would rule out support 
for L1 regularization.  Also, has the higher level iteration been analyzed to 
prove its effect on convergence?

This could be a good algorithm to post as a Spark package.  Would you be 
interested in doing that?

I'm going to close this issue for now, but discussion can continue on the 
closed JIRA.

> Parallel Gradient Descent with less map-reduce shuffle overhead
> ---
>
> Key: SPARK-14880
> URL: https://issues.apache.org/jira/browse/SPARK-14880
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Ahmed Mahran
>  Labels: performance
>
> The current implementation of (Stochastic) Gradient Descent performs one 
> map-reduce shuffle per iteration. Moreover, when the sampling fraction gets 
> smaller, the algorithm becomes shuffle-bound instead of CPU-bound.
> {code}
> (1 to numIterations or convergence) {
>  rdd
>   .sample(fraction)
>   .map(Gradient)
>   .reduce(Update)
> }
> {code}
> A more performant variation requires only one map-reduce regardless of the 
> number of iterations. A local mini-batch SGD could be run on each partition, 
> then the results could be averaged. This is based on (Zinkevich, Martin, 
> Markus Weimer, Lihong Li, and Alex J. Smola. "Parallelized stochastic 
> gradient descent." In Advances in neural information processing systems, 
> 2010, 
> http://www.research.rutgers.edu/~lihong/pub/Zinkevich11Parallelized.pdf).
> {code}
> rdd
>  .shuffle()
>  .mapPartitions((1 to numIterations or convergence) {
>iter.sample(fraction).map(Gradient).reduce(Update)
>  })
>  .reduce(Average)
> {code}
> A higher level iteration could enclose the above variation; shuffling the 
> data before the local mini-batches and feeding back the average weights from 
> the last iteration. This allows more variability in the sampling of the 
> mini-batches with the possibility to cover the whole dataset. Here is a Spark 
> based implementation 
> https://github.com/mashin-io/rich-spark/blob/master/src/main/scala/org/apache/spark/mllib/optimization/ParallelSGD.scala
> {code}
> (1 to numIterations1 or convergence) {
>  rdd
>   .shuffle()
>   .mapPartitions((1 to numIterations2 or convergence) {
> iter.sample(fraction).map(Gradient).reduce(Update)
>   })
>   .reduce(Average)
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13739) Predicate Push Down Through Window Operator

2016-04-25 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-13739.
---
  Resolution: Fixed
Assignee: Xiao Li
Target Version/s: 2.0.0

> Predicate Push Down Through Window Operator
> ---
>
> Key: SPARK-13739
> URL: https://issues.apache.org/jira/browse/SPARK-13739
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>
> Push down the predicate through the Window operator.
> In this JIRA, predicates are pushed through Window if and only if the 
> following conditions are satisfied:
> - Predicate involves one and only one column that is part of window 
> partitioning key
> - Window partitioning key is just a sequence of attributeReferences. (i.e., 
> none of them is an expression)
> - Predicate must be deterministic
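
As a small illustration of a predicate that satisfies these conditions (the 
{{employees}} DataFrame and the {{dept}}/{{salary}} column names are made up):
{code}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, rank}

val w = Window.partitionBy("dept").orderBy(col("salary").desc)
val ranked = employees.withColumn("r", rank().over(w))  // employees: assumed input DataFrame

ranked.filter(col("dept") === "eng")  // references only the partitioning key: eligible for push-down
ranked.filter(col("r") === 1)         // references the window output: cannot be pushed down
{code}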



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-14880) Parallel Gradient Descent with less map-reduce shuffle overhead

2016-04-25 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley closed SPARK-14880.
-
Resolution: Won't Fix

> Parallel Gradient Descent with less map-reduce shuffle overhead
> ---
>
> Key: SPARK-14880
> URL: https://issues.apache.org/jira/browse/SPARK-14880
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Ahmed Mahran
>  Labels: performance
>
> The current implementation of (Stochastic) Gradient Descent performs one 
> map-reduce shuffle per iteration. Moreover, when the sampling fraction gets 
> smaller, the algorithm becomes shuffle-bound instead of CPU-bound.
> {code}
> (1 to numIterations or convergence) {
>  rdd
>   .sample(fraction)
>   .map(Gradient)
>   .reduce(Update)
> }
> {code}
> A more performant variation requires only one map-reduce regardless of the 
> number of iterations. A local mini-batch SGD could be run on each partition, 
> then the results could be averaged. This is based on (Zinkevich, Martin, 
> Markus Weimer, Lihong Li, and Alex J. Smola. "Parallelized stochastic 
> gradient descent." In Advances in neural information processing systems, 
> 2010, 
> http://www.research.rutgers.edu/~lihong/pub/Zinkevich11Parallelized.pdf).
> {code}
> rdd
>  .shuffle()
>  .mapPartitions((1 to numIterations or convergence) {
>iter.sample(fraction).map(Gradient).reduce(Update)
>  })
>  .reduce(Average)
> {code}
> A higher level iteration could enclose the above variation; shuffling the 
> data before the local mini-batches and feeding back the average weights from 
> the last iteration. This allows more variability in the sampling of the 
> mini-batches with the possibility to cover the whole dataset. Here is a Spark 
> based implementation 
> https://github.com/mashin-io/rich-spark/blob/master/src/main/scala/org/apache/spark/mllib/optimization/ParallelSGD.scala
> {code}
> (1 to numIterations1 or convergence) {
>  rdd
>   .shuffle()
>   .mapPartitions((1 to numIterations2 or convergence) {
> iter.sample(fraction).map(Gradient).reduce(Update)
>   })
>   .reduce(Average)
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14844) KMeansModel in spark.ml should allow to change featureCol and predictionCol

2016-04-25 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-14844:
--
Priority: Trivial  (was: Major)

> KMeansModel in spark.ml should allow to change featureCol and predictionCol
> ---
>
> Key: SPARK-14844
> URL: https://issues.apache.org/jira/browse/SPARK-14844
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.6.0, 1.6.1
>Reporter: Dominik Jastrzębski
>Priority: Trivial
>
> We need to add setFeaturesCol, setPredictionCol methods in 
> org.apache.spark.ml.clustering.KMeansModel.
> This will allow us to:
> * transform a DataFrame with a different feature column name than in the 
> DataFrame the model was fitted on.
> * create a prediction column with a name other than the one that was set 
> during model fitting.
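
A sketch of the intended usage once such setters exist (the model setters shown 
are the ones proposed in this issue, not part of the current API; {{trainDF}} and 
{{otherDF}} are assumed DataFrames):
{code}
import org.apache.spark.ml.clustering.KMeans

val model = new KMeans()
  .setK(3)
  .setFeaturesCol("features")
  .fit(trainDF)

// Proposed: re-point the fitted model at differently named columns.
val predictions = model
  .setFeaturesCol("scaledFeatures")
  .setPredictionCol("cluster")
  .transform(otherDF)
{code}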



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14831) Make ML APIs in SparkR consistent

2016-04-25 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15256983#comment-15256983
 ] 

Joseph K. Bradley commented on SPARK-14831:
---

2. {{spark.glm}}, etc. SGTM.  For save/load, I'd prefer either 
{{spark.save/load}} (if that works for DataFrames too), or {{read.ml}} (rather 
than {{read.model}} since that leaves open the possibility of supporting 
Estimators and Pipelines in R someday).

> Make ML APIs in SparkR consistent
> -
>
> Key: SPARK-14831
> URL: https://issues.apache.org/jira/browse/SPARK-14831
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, SparkR
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Critical
>
> In current master, we have 4 ML methods in SparkR:
> {code:none}
> glm(formula, family, data, ...)
> kmeans(data, centers, ...)
> naiveBayes(formula, data, ...)
> survreg(formula, data, ...)
> {code}
> We tried to keep the signatures similar to existing ones in R. However, if we 
> put them together, they are not consistent. One example is k-means, which 
> doesn't accept a formula. Instead of looking at each method independently, we 
> might want to update the signature of kmeans to
> {code:none}
> kmeans(formula, data, centers, ...)
> {code}
> We can also discuss possible global changes here. For example, `glm` puts 
> `family` before `data` while `kmeans` puts `centers` after `data`. This is 
> not consistent. And logically, the formula doesn't mean anything without 
> associating with a DataFrame. So it makes more sense to me to have the 
> following signature:
> {code:none}
> algorithm(df, formula, [required params], [optional params])
> {code}
> If we make this change, we might want to avoid name collisions because they 
> have different signatures. We can use `ml.kmeans`, `ml.glm`, etc.
> Sorry for discussing API changes at the last minute. But I think it would be 
> better to have consistent signatures in SparkR.
> cc: [~shivaram] [~josephkb] [~yanboliang]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14721) Remove the HiveContext class

2016-04-25 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-14721.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

> Remove the HiveContext class
> 
>
> Key: SPARK-14721
> URL: https://issues.apache.org/jira/browse/SPARK-14721
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Andrew Or
>Assignee: Andrew Or
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14901) java exception when showing join

2016-04-25 Thread Brent Elmer (JIRA)
Brent Elmer created SPARK-14901:
---

 Summary: java exception when showing join
 Key: SPARK-14901
 URL: https://issues.apache.org/jira/browse/SPARK-14901
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.6.1
Reporter: Brent Elmer


I am using PySpark with Netezza. I am getting a Java exception when trying to 
show the first row of a join. I can show the first row of each of the two 
dataframes separately, but not the result of a join. I get the same error for 
any action I take (first, collect, show). Am I doing something wrong?

from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)
dispute_df = sqlContext.read.format('com.ibm.spark.netezza').options(
    url='jdbc:netezza://***:5480/db', user='***', password='***',
    dbtable='table1', driver='com.ibm.spark.netezza').load()
dispute_df.printSchema()
comments_df = sqlContext.read.format('com.ibm.spark.netezza').options(
    url='jdbc:netezza://***:5480/db', user='***', password='***',
    dbtable='table2', driver='com.ibm.spark.netezza').load()
comments_df.printSchema()
dispute_df.join(comments_df, dispute_df.COMMENTID == comments_df.COMMENTID).first()


root
 |-- COMMENTID: string (nullable = true)
 |-- EXPORTDATETIME: timestamp (nullable = true)
 |-- ARTAGS: string (nullable = true)
 |-- POTAGS: string (nullable = true)
 |-- INVTAG: string (nullable = true)
 |-- ACTIONTAG: string (nullable = true)
 |-- DISPUTEFLAG: string (nullable = true)
 |-- ACTIONFLAG: string (nullable = true)
 |-- CUSTOMFLAG1: string (nullable = true)
 |-- CUSTOMFLAG2: string (nullable = true)

root
 |-- COUNTRY: string (nullable = true)
 |-- CUSTOMER: string (nullable = true)
 |-- INVNUMBER: string (nullable = true)
 |-- INVSEQNUMBER: string (nullable = true)
 |-- LEDGERCODE: string (nullable = true)
 |-- COMMENTTEXT: string (nullable = true)
 |-- COMMENTTIMESTAMP: timestamp (nullable = true)
 |-- COMMENTLENGTH: long (nullable = true)
 |-- FREEINDEX: long (nullable = true)
 |-- COMPLETEDFLAG: long (nullable = true)
 |-- ACTIONFLAG: long (nullable = true)
 |-- FREETEXT: string (nullable = true)
 |-- USERNAME: string (nullable = true)
 |-- ACTION: string (nullable = true)
 |-- COMMENTID: string (nullable = true)

---
Py4JJavaError Traceback (most recent call last)
 in ()
  5 comments_df = 
sqlContext.read.format('com.ibm.spark.netezza').options(url='jdbc:netezza://***:5480/db',
 user='***', password='***', dbtable='table2', 
driver='com.ibm.spark.netezza').load()
  6 comments_df.printSchema()
> 7 dispute_df.join(comments_df, dispute_df.COMMENTID == 
comments_df.COMMENTID).first()

/usr/local/src/spark/spark-1.6.1-bin-hadoop2.6/python/pyspark/sql/dataframe.pyc 
in first(self)
802 Row(age=2, name=u'Alice')
803 """
--> 804 return self.head()
805 
806 @ignore_unicode_prefix

/usr/local/src/spark/spark-1.6.1-bin-hadoop2.6/python/pyspark/sql/dataframe.pyc 
in head(self, n)
790 """
791 if n is None:
--> 792 rs = self.head(1)
793 return rs[0] if rs else None
794 return self.take(n)

/usr/local/src/spark/spark-1.6.1-bin-hadoop2.6/python/pyspark/sql/dataframe.pyc 
in head(self, n)
792 rs = self.head(1)
793 return rs[0] if rs else None
--> 794 return self.take(n)
795 
796 @ignore_unicode_prefix

/usr/local/src/spark/spark-1.6.1-bin-hadoop2.6/python/pyspark/sql/dataframe.pyc 
in take(self, num)
304 with SCCallSiteSync(self._sc) as css:
305 port = 
self._sc._jvm.org.apache.spark.sql.execution.EvaluatePython.takeAndServe(
--> 306 self._jdf, num)
307 return list(_load_from_socket(port, 
BatchedSerializer(PickleSerializer(
308 

/usr/local/src/spark/spark-1.6.1-bin-hadoop2.6/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py
 in __call__(self, *args)
811 answer = self.gateway_client.send_command(command)
812 return_value = get_return_value(
--> 813 answer, self.gateway_client, self.target_id, self.name)
814 
815 for temp_arg in temp_args:

/usr/local/src/spark/spark-1.6.1-bin-hadoop2.6/python/pyspark/sql/utils.pyc in 
deco(*a, **kw)
 43 def deco(*a, **kw):
 44 try:
---> 45 return f(*a, **kw)
 46 except py4j.protocol.Py4JJavaError as e:
 47 s = e.java_exception.toString()

/usr/local/src/spark/spark-1.6.1-bin-hadoop2.6/python/lib/py4j-0.9-src.zip/py4j/protocol.py
 in get_return_value(answer, gateway_client, target_id, name)
306 raise Py4JJavaError(
307 "An error occurred while calling {0}{1}{2}.\n".
--> 308 format(target_id, ".", name), value)
309 

[jira] [Updated] (SPARK-11559) Make `runs` no effect in k-means

2016-04-25 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-11559:
--
Shepherd: Joseph K. Bradley  (was: Xiangrui Meng)

> Make `runs` no effect in k-means
> 
>
> Key: SPARK-11559
> URL: https://issues.apache.org/jira/browse/SPARK-11559
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.6.0
>Reporter: Xiangrui Meng
>Assignee: Yanbo Liang
>
> We deprecated `runs` in Spark 1.6 (SPARK-11358). In 2.0, we can either remove 
> `runs` or make it have no effect (with a warning message), so we can simplify 
> the implementation. I prefer the latter for better binary compatibility.
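A rough sketch of the latter option (keep the setter for binary compatibility 
but ignore the value); the annotation text and log message are assumptions:

{code}
// In org.apache.spark.mllib.clustering.KMeans (sketch only):
@deprecated("This has no effect and will be removed in a later release.", "2.0.0")
def setRuns(runs: Int): this.type = {
  logWarning("Setting the number of runs has no effect since Spark 2.0.0.")
  this
}
{code}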



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14900) spark.ml classification metrics should include accuracy

2016-04-25 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-14900:
-

 Summary: spark.ml classification metrics should include accuracy
 Key: SPARK-14900
 URL: https://issues.apache.org/jira/browse/SPARK-14900
 Project: Spark
  Issue Type: New Feature
  Components: ML
Reporter: Joseph K. Bradley
Priority: Minor


To compute "accuracy" (0/1 classification accuracy), users can use 
{{precision}} in MulticlassMetrics and 
MulticlassClassificationEvaluator.metricName.  We should also support 
"accuracy" directly as an alias to help users familiar with that name.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14829) Deprecate GLM APIs using SGD

2016-04-25 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-14829:
--
Assignee: zhengruifeng

> Deprecate GLM APIs using SGD
> 
>
> Key: SPARK-14829
> URL: https://issues.apache.org/jira/browse/SPARK-14829
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Joseph K. Bradley
>Assignee: zhengruifeng
>
> I don't know how many times I have heard someone run into issues with 
> LinearRegression or LogisticRegression, only to find that it is because they 
> are using the SGD implementations in spark.mllib.  We should deprecate these 
> SGD APIs in 2.0 to encourage users to use LBFGS and the spark.ml 
> implementations, which are significantly better.
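For illustration, the migration this deprecation would nudge users toward; the 
DataFrame and parameter values are placeholders:

{code}
// spark.mllib, SGD-based (the API proposed for deprecation):
// val model = LinearRegressionWithSGD.train(labeledPoints, numIterations)

// spark.ml replacement:
import org.apache.spark.ml.regression.LinearRegression

val lr = new LinearRegression()
  .setMaxIter(100)
  .setRegParam(0.1)
val lrModel = lr.fit(trainingDF)   // trainingDF has "label" and "features" columns
{code}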



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14829) Deprecate GLM APIs using SGD

2016-04-25 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-14829:
--
Shepherd: Joseph K. Bradley

> Deprecate GLM APIs using SGD
> 
>
> Key: SPARK-14829
> URL: https://issues.apache.org/jira/browse/SPARK-14829
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Joseph K. Bradley
>Assignee: zhengruifeng
>
> I don't know how many times I have heard someone run into issues with 
> LinearRegression or LogisticRegression, only to find that it is because they 
> are using the SGD implementations in spark.mllib.  We should deprecate these 
> SGD APIs in 2.0 to encourage users to use LBFGS and the spark.ml 
> implementations, which are significantly better.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14731) Revert SPARK-12130 to make 2.0 shuffle service compatible with 1.x

2016-04-25 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-14731.

   Resolution: Fixed
 Assignee: Lianhui Wang
Fix Version/s: 2.0.0

> Revert SPARK-12130 to make 2.0 shuffle service compatible with 1.x
> --
>
> Key: SPARK-14731
> URL: https://issues.apache.org/jira/browse/SPARK-14731
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 2.0.0
>Reporter: Mark Grover
>Assignee: Lianhui Wang
> Fix For: 2.0.0
>
>
> Discussion on the dev list on [this 
> thread|http://apache-spark-developers-list.1001551.n3.nabble.com/YARN-Shuffle-service-and-its-compatibility-td17222.html].
> Conclusion seems to be that we should try to maintain compatibility between 
> Spark 1.x and Spark 2.x's shuffle service so folks who may want to run Spark 
> 1 and Spark 2 on, say, the same YARN cluster can do that easily while running 
> only one shuffle service.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


