[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML

2017-02-23 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15882174#comment-15882174
 ] 

Nick Pentreath commented on SPARK-14409:


The other option, if Yong does not have time to update his PR, is to work with 
[~danilo.ascione]'s PR here: https://github.com/apache/spark/pull/16618.

> Investigate adding a RankingEvaluator to ML
> ---
>
> Key: SPARK-14409
> URL: https://issues.apache.org/jira/browse/SPARK-14409
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Nick Pentreath
>Priority: Minor
>
> {{mllib.evaluation}} contains a {{RankingMetrics}} class, while there is no 
> {{RankingEvaluator}} in {{ml.evaluation}}. Such an evaluator would be useful 
> for recommendation evaluation (and potentially in other settings).
> It should be thought about in conjunction with adding the "recommendAll" 
> methods in SPARK-13857, so that top-k ranking metrics can be used in 
> cross-validators.






[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML

2017-02-23 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15882163#comment-15882163
 ] 

Nick Pentreath commented on SPARK-14409:


[~roberto.mirizzi] the {{goodThreshold}} param seems pretty reasonable in this 
context for excluding irrelevant items. I think it would make a good 
{{expertParam}} addition.

Ok, I think a first pass at this should just aim to replicate what we have 
exposed in {{mllib}} and wrap {{RankingMetrics}}. Initially we can look at: (a) 
supporting numeric columns and using the windowing & {{collect_list}} approach 
to feed into {{RankingMetrics}}; (b) supporting Array columns and feeding 
directly into {{RankingMetrics}}; or (c) supporting both.

[~yongtang] already opened a PR here: https://github.com/apache/spark/pull/12461. 
It is fairly complete and also includes MRR. [~yongtang], are you able to work 
on reviving that PR? If so, [~roberto.mirizzi] [~danilo.ascione], are you able 
to help review it?
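
To make option (a) concrete, here is a minimal sketch of the windowing & 
{{collect_list}} approach feeding {{mllib}}'s {{RankingMetrics}}. The column 
names ({{user}}, {{item}}, {{prediction}}, {{label}}) and the top-10 cutoff are 
assumptions for illustration; this is not the proposed evaluator implementation.

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import org.apache.spark.mllib.evaluation.RankingMetrics

val spark = SparkSession.builder().appName("ranking-eval-sketch").getOrCreate()
import spark.implicits._

// One row per (user, item) with a predicted score and a relevance label.
val predictions = Seq(
  (1, 10, 0.9, 1.0), (1, 11, 0.8, 0.0), (1, 12, 0.1, 1.0),
  (2, 10, 0.7, 1.0), (2, 13, 0.6, 1.0)
).toDF("user", "item", "prediction", "label")

// Ordered window per user with an explicit full frame, so collect_list
// returns the items in descending score order.
val w = Window.partitionBy($"user").orderBy($"prediction".desc)
  .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

val perUser = predictions
  .withColumn("predictedItems", collect_list($"item").over(w))
  .withColumn("relevantItems", collect_list(when($"label" > 0, $"item")).over(w))
  .select("user", "predictedItems", "relevantItems")
  .dropDuplicates("user")

// RankingMetrics takes an RDD of (predicted items, relevant items) array pairs.
val metrics = new RankingMetrics(perUser.rdd.map { r =>
  (r.getAs[Seq[Int]]("predictedItems").take(10).toArray,
   r.getAs[Seq[Int]]("relevantItems").toArray)
})
println(s"MAP = ${metrics.meanAveragePrecision}, NDCG@10 = ${metrics.ndcgAt(10)}")
{code}

Option (b) would skip the windowing step and accept the two array columns 
directly.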

> Investigate adding a RankingEvaluator to ML
> ---
>
> Key: SPARK-14409
> URL: https://issues.apache.org/jira/browse/SPARK-14409
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Nick Pentreath
>Priority: Minor
>
> {{mllib.evaluation}} contains a {{RankingMetrics}} class, while there is no 
> {{RankingEvaluator}} in {{ml.evaluation}}. Such an evaluator would be useful 
> for recommendation evaluation (and potentially in other settings).
> It should be thought about in conjunction with adding the "recommendAll" 
> methods in SPARK-13857, so that top-k ranking metrics can be used in 
> cross-validators.






[jira] [Commented] (SPARK-17495) Hive hash implementation

2017-02-23 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15882161#comment-15882161
 ] 

Tejas Patil commented on SPARK-17495:
-

I am looking into using hive-hash when `hash()` is called in a Hive context. 
Before jumping to a PR, I wanted to discuss which model we should use.

Currently, `hash()` in SQL uses murmur3. For anyone porting from Hive to Spark, 
this will give different results.
- One easy option is to replace the `hash` impl in `FunctionRegistry` for a 
Hive-enabled context. Downside: there can be users who create a Hive-enabled 
context but still operate over Spark native tables; using hive-hash is not 
something they want.
- It's hard to detect whether a given query result will be written to a Hive or 
a Spark native table, e.g. one could cache / persist and later choose to write 
the output to both a Hive table and a Spark native table. We could push this 
decision to users by adding a config to use hive-hash. Note that this needs to 
be a static config, only allowed to be set when the session is created; letting 
users flip the config in the middle of a session is risky, as it can lead to 
undesired outputs.

I am open to comments on these two options. Unless there are any objections, I 
will move forward with the second approach of using a config.
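
As a rough illustration of the second option, the sketch below shows what a 
static, session-creation-time switch could look like from the user side. The 
config name "spark.sql.function.hiveHash" is purely hypothetical; no such 
config exists today.

{code}
import org.apache.spark.sql.SparkSession

// Hypothetical config name, standing in for the proposed static switch that is
// fixed for the lifetime of the session.
val spark = SparkSession.builder()
  .enableHiveSupport()
  .config("spark.sql.function.hiveHash", "true")
  .getOrCreate()

// With such a flag, hash() would dispatch to hive-hash instead of murmur3,
// matching the hash Hive uses for bucketing.
spark.sql("SELECT hash('abc')").show()
{code}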


> Hive hash implementation
> 
>
> Key: SPARK-17495
> URL: https://issues.apache.org/jira/browse/SPARK-17495
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Tejas Patil
>Assignee: Tejas Patil
>Priority: Minor
>
> Spark internally uses Murmur3Hash for partitioning. This is different from 
> the one used by Hive. For queries which use bucketing this leads to different 
> results if one tries the same query on both engines. For us, we want users to 
> have backward compatibility so that one can switch parts of applications 
> across the engines without observing regressions.






[jira] [Commented] (SPARK-17495) Hive hash implementation

2017-02-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15882152#comment-15882152
 ] 

Apache Spark commented on SPARK-17495:
--

User 'tejasapatil' has created a pull request for this issue:
https://github.com/apache/spark/pull/17056

> Hive hash implementation
> 
>
> Key: SPARK-17495
> URL: https://issues.apache.org/jira/browse/SPARK-17495
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Tejas Patil
>Assignee: Tejas Patil
>Priority: Minor
>
> Spark internally uses Murmur3Hash for partitioning. This is different from 
> the one used by Hive. For queries which use bucketing this leads to different 
> results if one tries the same query on both engines. For us, we want users to 
> have backward compatibility so that one can switch parts of applications 
> across the engines without observing regressions.






[jira] [Assigned] (SPARK-19723) create table for data source tables should work with an non-existent location

2017-02-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19723:


Assignee: (was: Apache Spark)

> create table for data source tables should work with an non-existent location
> -
>
> Key: SPARK-19723
> URL: https://issues.apache.org/jira/browse/SPARK-19723
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Song Jun
>
> This JIRA is follow-up work after SPARK-19583.
> As we discussed in that [PR|https://github.com/apache/spark/pull/16938], the 
> following DDL for a datasource table with a non-existent location should 
> work:
> {code}
> CREATE TABLE ... (PARTITIONED BY ...) LOCATION path
> {code}
> Currently it throws an exception that the path does not exist.






[jira] [Commented] (SPARK-19723) create table for data source tables should work with an non-existent location

2017-02-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15882147#comment-15882147
 ] 

Apache Spark commented on SPARK-19723:
--

User 'windpiger' has created a pull request for this issue:
https://github.com/apache/spark/pull/17055

> create table for data source tables should work with an non-existent location
> -
>
> Key: SPARK-19723
> URL: https://issues.apache.org/jira/browse/SPARK-19723
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Song Jun
>
> This JIRA is follow-up work after SPARK-19583.
> As we discussed in that [PR|https://github.com/apache/spark/pull/16938], the 
> following DDL for a datasource table with a non-existent location should 
> work:
> {code}
> CREATE TABLE ... (PARTITIONED BY ...) LOCATION path
> {code}
> Currently it throws an exception that the path does not exist.






[jira] [Assigned] (SPARK-19723) create table for data source tables should work with an non-existent location

2017-02-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19723:


Assignee: Apache Spark

> create table for data source tables should work with an non-existent location
> -
>
> Key: SPARK-19723
> URL: https://issues.apache.org/jira/browse/SPARK-19723
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Song Jun
>Assignee: Apache Spark
>
> This JIRA is follow-up work after SPARK-19583.
> As we discussed in that [PR|https://github.com/apache/spark/pull/16938], the 
> following DDL for a datasource table with a non-existent location should 
> work:
> {code}
> CREATE TABLE ... (PARTITIONED BY ...) LOCATION path
> {code}
> Currently it throws an exception that the path does not exist.






[jira] [Created] (SPARK-19723) create table for data source tables should work with an non-existent location

2017-02-23 Thread Song Jun (JIRA)
Song Jun created SPARK-19723:


 Summary: create table for data source tables should work with an 
non-existent location
 Key: SPARK-19723
 URL: https://issues.apache.org/jira/browse/SPARK-19723
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.2.0
Reporter: Song Jun


This JIRA is follow-up work after SPARK-19583.

As we discussed in that [PR|https://github.com/apache/spark/pull/16938], the 
following DDL for a datasource table with a non-existent location should work:
{code}
CREATE TABLE ... (PARTITIONED BY ...) LOCATION path
{code}
Currently it throws an exception that the path does not exist.






[jira] [Resolved] (SPARK-14084) Parallel training jobs in model selection

2017-02-23 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-14084.

  Resolution: Duplicate
Target Version/s:   (was: )

> Parallel training jobs in model selection
> -
>
> Key: SPARK-14084
> URL: https://issues.apache.org/jira/browse/SPARK-14084
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>
> In CrossValidator and TrainValidationSplit, we run training jobs one by one. 
> If users have a big cluster, they might see speed-ups if we parallelize the 
> job submission on the driver. The trade-off is that we might need to make 
> multiple copies of the training data, which could be expensive. It is worth 
> testing to figure out the best way to implement it.






[jira] [Commented] (SPARK-14084) Parallel training jobs in model selection

2017-02-23 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15882123#comment-15882123
 ] 

Nick Pentreath commented on SPARK-14084:


I guess we could have put SPARK-19071 into this ticket (sorry about that) - but 
since SPARK-19071 also covers a longer-term plan for further optimizing 
parallel CV, I'm going to close this as Superseded By. If watchers are still 
interested, please watch SPARK-19071. Thanks!

> Parallel training jobs in model selection
> -
>
> Key: SPARK-14084
> URL: https://issues.apache.org/jira/browse/SPARK-14084
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>
> In CrossValidator and TrainValidationSplit, we run training jobs one by one. 
> If users have a big cluster, they might see speed-ups if we parallelize the 
> job submission on the driver. The trade-off is that we might need to make 
> multiple copies of the training data, which could be expensive. It is worth 
> testing to figure out the best way to implement it.






[jira] [Comment Edited] (SPARK-3246) Support weighted SVMWithSGD for classification of unbalanced dataset

2017-02-23 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15882113#comment-15882113
 ] 

Nick Pentreath edited comment on SPARK-3246 at 2/24/17 7:15 AM:


Since {{mllib}} is in maintenance mode and {{LinearSVC}} was added in 
SPARK-14709 (and supports {{weightCol}}, I am going to close this as Wont Fix


was (Author: mlnick):
Since {{mllib}} is in maintenance mode and {{LinearSVC}} was added in 
SPARK-14709, I am going to close this as Wont Fix

> Support weighted SVMWithSGD for classification of unbalanced dataset
> 
>
> Key: SPARK-3246
> URL: https://issues.apache.org/jira/browse/SPARK-3246
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 0.9.0, 1.0.2
>Reporter: mahesh bhole
>
> Please support weighted SVMWithSGD for binary classification of unbalanced 
> datasets. Though other options like undersampling or oversampling can be 
> used, it would be good to have a way to assign weights to the minority 
> class.






[jira] [Comment Edited] (SPARK-3246) Support weighted SVMWithSGD for classification of unbalanced dataset

2017-02-23 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15882113#comment-15882113
 ] 

Nick Pentreath edited comment on SPARK-3246 at 2/24/17 7:16 AM:


Since {{mllib}} is in maintenance mode and {{LinearSVC}} was added in 
SPARK-14709 (and supports {{weightCol}}), I am going to close this as Wont Fix


was (Author: mlnick):
Since {{mllib}} is in maintenance mode and {{LinearSVC}} was added in 
SPARK-14709 (and supports {{weightCol}}, I am going to close this as Wont Fix

> Support weighted SVMWithSGD for classification of unbalanced dataset
> 
>
> Key: SPARK-3246
> URL: https://issues.apache.org/jira/browse/SPARK-3246
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 0.9.0, 1.0.2
>Reporter: mahesh bhole
>
> Please support weighted SVMWithSGD for binary classification of unbalanced 
> datasets. Though other options like undersampling or oversampling can be 
> used, it would be good to have a way to assign weights to the minority 
> class.






[jira] [Closed] (SPARK-3246) Support weighted SVMWithSGD for classification of unbalanced dataset

2017-02-23 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath closed SPARK-3246.
-
Resolution: Won't Fix

> Support weighted SVMWithSGD for classification of unbalanced dataset
> 
>
> Key: SPARK-3246
> URL: https://issues.apache.org/jira/browse/SPARK-3246
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 0.9.0, 1.0.2
>Reporter: mahesh bhole
>
> Please support weighted SVMWithSGD for binary classification of unbalanced 
> datasets. Though other options like undersampling or oversampling can be 
> used, it would be good to have a way to assign weights to the minority 
> class.






[jira] [Commented] (SPARK-3246) Support weighted SVMWithSGD for classification of unbalanced dataset

2017-02-23 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15882113#comment-15882113
 ] 

Nick Pentreath commented on SPARK-3246:
---

Since {{mllib}} is in maintenance mode and {{LinearSVC}} was added in 
SPARK-14709, I am going to close this as Won't Fix.

> Support weighted SVMWithSGD for classification of unbalanced dataset
> 
>
> Key: SPARK-3246
> URL: https://issues.apache.org/jira/browse/SPARK-3246
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 0.9.0, 1.0.2
>Reporter: mahesh bhole
>
> Please support weighted SVMWithSGD for binary classification of unbalanced 
> datasets. Though other options like undersampling or oversampling can be 
> used, it would be good to have a way to assign weights to the minority 
> class.






[jira] [Assigned] (SPARK-19664) put 'hive.metastore.warehouse.dir' in hadoopConf place

2017-02-23 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-19664:
---

Assignee: Song Jun

> put 'hive.metastore.warehouse.dir' in hadoopConf place
> --
>
> Key: SPARK-19664
> URL: https://issues.apache.org/jira/browse/SPARK-19664
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Song Jun
>Assignee: Song Jun
>Priority: Minor
> Fix For: 2.2.0
>
>
> In SPARK-15959, we brought back 'hive.metastore.warehouse.dir'. However, in 
> the current logic, when the value of 'spark.sql.warehouse.dir' is used to 
> overwrite 'hive.metastore.warehouse.dir', it is set in 'sparkContext.conf'. I 
> think it should be put in 'sparkContext.hadoopConfiguration' so that it 
> overwrites the original hadoopConf value.
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/SharedState.scala#L64
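
A minimal sketch of the change the description asks for, assuming a running 
SparkSession named {{spark}} and that the warehouse path has already been 
resolved from 'spark.sql.warehouse.dir' (an illustration, not the actual 
SharedState code):

{code}
// Propagate the resolved warehouse path into the Hadoop configuration so that
// Hive (and anything else reading hadoopConf) sees the overriding value.
val warehousePath = spark.conf.get("spark.sql.warehouse.dir")
spark.sparkContext.hadoopConfiguration.set("hive.metastore.warehouse.dir", warehousePath)
{code}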






[jira] [Resolved] (SPARK-19664) put 'hive.metastore.warehouse.dir' in hadoopConf place

2017-02-23 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-19664.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 16996
[https://github.com/apache/spark/pull/16996]

> put 'hive.metastore.warehouse.dir' in hadoopConf place
> --
>
> Key: SPARK-19664
> URL: https://issues.apache.org/jira/browse/SPARK-19664
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Song Jun
>Priority: Minor
> Fix For: 2.2.0
>
>
> In SPARK-15959, we brought back 'hive.metastore.warehouse.dir'. However, in 
> the current logic, when the value of 'spark.sql.warehouse.dir' is used to 
> overwrite 'hive.metastore.warehouse.dir', it is set in 'sparkContext.conf'. I 
> think it should be put in 'sparkContext.hadoopConfiguration' so that it 
> overwrites the original hadoopConf value.
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/SharedState.scala#L64






[jira] [Assigned] (SPARK-18939) Timezone support in partition values.

2017-02-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18939:


Assignee: (was: Apache Spark)

> Timezone support in partition values.
> -
>
> Key: SPARK-18939
> URL: https://issues.apache.org/jira/browse/SPARK-18939
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Takuya Ueshin
>
> We should also use session local timezone to interpret partition values.






[jira] [Assigned] (SPARK-18939) Timezone support in partition values.

2017-02-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18939:


Assignee: Apache Spark

> Timezone support in partition values.
> -
>
> Key: SPARK-18939
> URL: https://issues.apache.org/jira/browse/SPARK-18939
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Takuya Ueshin
>Assignee: Apache Spark
>
> We should also use session local timezone to interpret partition values.






[jira] [Commented] (SPARK-18939) Timezone support in partition values.

2017-02-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15882083#comment-15882083
 ] 

Apache Spark commented on SPARK-18939:
--

User 'ueshin' has created a pull request for this issue:
https://github.com/apache/spark/pull/17053

> Timezone support in partition values.
> -
>
> Key: SPARK-18939
> URL: https://issues.apache.org/jira/browse/SPARK-18939
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Takuya Ueshin
>
> We should also use session local timezone to interpret partition values.






[jira] [Commented] (SPARK-19690) Join a streaming DataFrame with a batch DataFrame may not work

2017-02-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15882073#comment-15882073
 ] 

Apache Spark commented on SPARK-19690:
--

User 'uncleGen' has created a pull request for this issue:
https://github.com/apache/spark/pull/17052

> Join a streaming DataFrame with a batch DataFrame may not work
> --
>
> Key: SPARK-19690
> URL: https://issues.apache.org/jira/browse/SPARK-19690
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.0.3, 2.1.0, 2.1.1
>Reporter: Shixiong Zhu
>
> When joining a streaming DataFrame with a batch DataFrame, if the batch 
> DataFrame has an aggregation, it will be converted to a streaming physical 
> aggregation. Then the query will crash.






[jira] [Commented] (SPARK-17075) Cardinality Estimation of Predicate Expressions

2017-02-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15882043#comment-15882043
 ] 

Apache Spark commented on SPARK-17075:
--

User 'lins05' has created a pull request for this issue:
https://github.com/apache/spark/pull/17051

> Cardinality Estimation of Predicate Expressions
> ---
>
> Key: SPARK-17075
> URL: https://issues.apache.org/jira/browse/SPARK-17075
> Project: Spark
>  Issue Type: Sub-task
>  Components: Optimizer
>Affects Versions: 2.0.0
>Reporter: Ron Hu
>Assignee: Ron Hu
> Fix For: 2.2.0
>
>
> A filter condition is the predicate expression specified in the WHERE clause 
> of a SQL select statement.  A predicate can be a compound logical expression 
> with logical AND, OR, NOT operators combining multiple single conditions.  A 
> single condition usually has comparison operators such as =, <, <=, >, >=, 
> ‘like’, etc.
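
For illustration, the following is the kind of compound predicate the 
estimator has to handle; the table and column names are made up, and {{spark}} 
is assumed to be an existing SparkSession:

{code}
spark.sql("""
  SELECT * FROM orders
  WHERE (status = 'OPEN' OR status = 'PENDING')
    AND NOT (amount >= 1000)
    AND customer_name LIKE 'A%'
""")
{code}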






[jira] [Commented] (SPARK-19721) Good error message for version mismatch in log files

2017-02-23 Thread Liwei Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15882034#comment-15882034
 ] 

Liwei Lin commented on SPARK-19721:
---

I'd like to work on this too. Thanks.

> Good error message for version mismatch in log files
> 
>
> Key: SPARK-19721
> URL: https://issues.apache.org/jira/browse/SPARK-19721
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Michael Armbrust
>Priority: Blocker
>
> There are several places where we write out version identifiers in various 
> logs for structured streaming (usually {{v1}}).  However, in the places where 
> we check for this, we throw a confusing error message.  Instead, we should do 
> the following:
>  - Find all of the places where we do this kind of check.
>  - for {{vN}} where {{n>1}} say "UnsupportedLogFormat: The file {{path}} was 
> produced by a newer version of Spark and cannot be read by this version.  
> Please upgrade"
>  - for anything else throw an error saying the file is malformed.
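
A small sketch of the kind of check described above; the method and message 
wording are illustrative, not the actual structured-streaming log classes:

{code}
import scala.util.Try

// Parse a log-format version marker such as "v1" and fail with a clear message.
def validateVersion(marker: String, maxSupportedVersion: Int, path: String): Int = {
  val parsed =
    if (marker.startsWith("v")) Try(marker.stripPrefix("v").toInt).toOption else None
  parsed match {
    case Some(v) if v <= maxSupportedVersion => v
    case Some(_) =>
      throw new IllegalStateException(
        s"UnsupportedLogFormat: $path was produced by a newer version of Spark " +
          "and cannot be read by this version. Please upgrade.")
    case None =>
      throw new IllegalStateException(
        s"Log file $path is malformed (bad version marker: '$marker').")
  }
}
{code}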






[jira] [Commented] (SPARK-19688) Spark on Yarn Credentials File set to different application directory

2017-02-23 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15882027#comment-15882027
 ] 

Saisai Shao commented on SPARK-19688:
-

According to my test, "spark.yarn.credentials.file" will be overwritten in 
yarn-client mode to point to the correct path when the application is launched 
(https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L737).
So even if the Spark Streaming checkpoint still keeps the old configuration, it 
will be overwritten when the new application is started. So I don't see an 
issue here beyond this odd-looking setting.

> Spark on Yarn Credentials File set to different application directory
> -
>
> Key: SPARK-19688
> URL: https://issues.apache.org/jira/browse/SPARK-19688
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams, YARN
>Affects Versions: 1.6.3
>Reporter: Devaraj Jonnadula
>Priority: Minor
>
> spark.yarn.credentials.file property is set to different application Id 
> instead of actual Application Id 






[jira] [Commented] (SPARK-19698) Race condition in stale attempt task completion vs current attempt task completion when task is doing persistent state changes

2017-02-23 Thread Mridul Muralidharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15881991#comment-15881991
 ] 

Mridul Muralidharan commented on SPARK-19698:
-


Depending on the ordering and semantics of task resubmission is not a very good 
design choice.
For the use case described, it would be better to use an external 
synchronization mechanism rather than depend on how Spark (re-)submits tasks: 
that behavior is an implementation detail and is subject to change without an 
explicit contract. The only explicit contract we have is with OutputCommitter.
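
As a rough sketch of that direction (the class name and behavior here are 
illustrative, not a drop-in solution), an OutputCommitter-style protocol defers 
the externally visible state change to commitTask, which is granted to only one 
attempt:

{code}
import org.apache.hadoop.mapreduce.{JobContext, OutputCommitter, TaskAttemptContext}

class MoveOnCommitCommitter extends OutputCommitter {
  override def setupJob(jobContext: JobContext): Unit = ()
  override def setupTask(taskContext: TaskAttemptContext): Unit = ()

  // Only attempts that actually produced temporary output ask to commit.
  override def needsTaskCommit(taskContext: TaskAttemptContext): Boolean = true

  // The rename/move happens here, never in the task body, so a killed or
  // duplicate attempt can never leave a half-applied state change behind.
  override def commitTask(taskContext: TaskAttemptContext): Unit = {
    // atomically promote this attempt's temporary output to its final location
  }

  override def abortTask(taskContext: TaskAttemptContext): Unit = {
    // delete this attempt's temporary output
  }
}
{code}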


> Race condition in stale attempt task completion vs current attempt task 
> completion when task is doing persistent state changes
> --
>
> Key: SPARK-19698
> URL: https://issues.apache.org/jira/browse/SPARK-19698
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos, Spark Core
>Affects Versions: 2.0.0
>Reporter: Charles Allen
>
> We have encountered a strange scenario in our production environment. Below 
> is the best guess we have right now as to what's going on.
> Potentially, the final stage of a job has a failure in one of the tasks (such 
> as OOME on the executor) which can cause tasks for that stage to be 
> relaunched in a second attempt.
> https://github.com/apache/spark/blob/v2.1.0/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1155
> keeps track of which tasks have been completed, but does NOT keep track of 
> which attempt those tasks were completed in. As such, we have encountered a 
> scenario where a particular task gets executed twice in different stage 
> attempts, and the DAGScheduler does not consider if the second attempt is 
> still running. This means if the first task attempt succeeded, the second 
> attempt can be cancelled part-way through its run cycle if all other tasks 
> (including the prior failed) are completed successfully.
> What this means is that if a task is manipulating some state somewhere (for 
> example: an upload-to-temporary-file-location, then delete-then-move on an 
> underlying s3n storage implementation) the driver can improperly shutdown the 
> running (2nd attempt) task between state manipulations, leaving the 
> persistent state in a bad state since the 2nd attempt never got to complete 
> its manipulations, and was terminated prematurely at some arbitrary point in 
> its state change logic (ex: finished the delete but not the move).
> This is using the mesos coarse grained executor. It is unclear if this 
> behavior is limited to the mesos coarse grained executor or not.






[jira] [Comment Edited] (SPARK-19698) Race condition in stale attempt task completion vs current attempt task completion when task is doing persistent state changes

2017-02-23 Thread Mridul Muralidharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15881991#comment-15881991
 ] 

Mridul Muralidharan edited comment on SPARK-19698 at 2/24/17 5:25 AM:
--

Depending on ordering and semantics of task resubmission is not a very good 
design choice.
For the usecase described, would be better to use an external synchronization 
mechanism - and not depend on how spark would (re-)submit tasks : not only is 
it not an implementation detail, but it is subject to change without an 
explicit contract - the only explicit contract we have is with OutputCommitter 
- which is something you can look at for modelling this ?



was (Author: mridulm80):

Depending on ordering and semantics of task resubmission is not a very good 
design choice.
For the usecase described, would be better to use an external synchronization 
mechanism - and not depend on how spark would (re-)submit tasks : not only is 
it not an implementation detail, but it is subject to change without an 
explicit contract - the only explicit contract we have is with OutputCommitter


> Race condition in stale attempt task completion vs current attempt task 
> completion when task is doing persistent state changes
> --
>
> Key: SPARK-19698
> URL: https://issues.apache.org/jira/browse/SPARK-19698
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos, Spark Core
>Affects Versions: 2.0.0
>Reporter: Charles Allen
>
> We have encountered a strange scenario in our production environment. Below 
> is the best guess we have right now as to what's going on.
> Potentially, the final stage of a job has a failure in one of the tasks (such 
> as OOME on the executor) which can cause tasks for that stage to be 
> relaunched in a second attempt.
> https://github.com/apache/spark/blob/v2.1.0/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1155
> keeps track of which tasks have been completed, but does NOT keep track of 
> which attempt those tasks were completed in. As such, we have encountered a 
> scenario where a particular task gets executed twice in different stage 
> attempts, and the DAGScheduler does not consider if the second attempt is 
> still running. This means if the first task attempt succeeded, the second 
> attempt can be cancelled part-way through its run cycle if all other tasks 
> (including the prior failed) are completed successfully.
> What this means is that if a task is manipulating some state somewhere (for 
> example: an upload-to-temporary-file-location, then delete-then-move on an 
> underlying s3n storage implementation) the driver can improperly shutdown the 
> running (2nd attempt) task between state manipulations, leaving the 
> persistent state in a bad state since the 2nd attempt never got to complete 
> its manipulations, and was terminated prematurely at some arbitrary point in 
> its state change logic (ex: finished the delete but not the move).
> This is using the mesos coarse grained executor. It is unclear if this 
> behavior is limited to the mesos coarse grained executor or not.






[jira] [Assigned] (SPARK-19722) Clean up the usage of EliminateSubqueryAliases

2017-02-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19722:


Assignee: Xiao Li  (was: Apache Spark)

> Clean up the usage of EliminateSubqueryAliases
> --
>
> Key: SPARK-19722
> URL: https://issues.apache.org/jira/browse/SPARK-19722
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Minor
>
> In the PR https://github.com/apache/spark/pull/11403, we introduced the 
> function `canonicalized` for eliminating useless subqueries. We can simply 
> replace calls to the rule `EliminateSubqueryAliases` with the function 
> `canonicalized`.
> After the changes to view resolution and management, the current reason for 
> keeping `EliminateSubqueryAliases` in the optimizer is outdated. Thus, we 
> should also update the stated reason to `eager analysis of Dataset`.






[jira] [Assigned] (SPARK-19722) Clean up the usage of EliminateSubqueryAliases

2017-02-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19722:


Assignee: Apache Spark  (was: Xiao Li)

> Clean up the usage of EliminateSubqueryAliases
> --
>
> Key: SPARK-19722
> URL: https://issues.apache.org/jira/browse/SPARK-19722
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>Priority: Minor
>
> In the PR https://github.com/apache/spark/pull/11403, we introduced the 
> function `canonicalized` for eliminating useless subqueries. We can simply 
> replace calls to the rule `EliminateSubqueryAliases` with the function 
> `canonicalized`.
> After the changes to view resolution and management, the current reason for 
> keeping `EliminateSubqueryAliases` in the optimizer is outdated. Thus, we 
> should also update the stated reason to `eager analysis of Dataset`.






[jira] [Commented] (SPARK-19722) Clean up the usage of EliminateSubqueryAliases

2017-02-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15881990#comment-15881990
 ] 

Apache Spark commented on SPARK-19722:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/17050

> Clean up the usage of EliminateSubqueryAliases
> --
>
> Key: SPARK-19722
> URL: https://issues.apache.org/jira/browse/SPARK-19722
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Minor
>
> In the PR https://github.com/apache/spark/pull/11403, we introduced the 
> function `canonicalized` for eliminating useless subqueries. We can simply 
> replace calls to the rule `EliminateSubqueryAliases` with the function 
> `canonicalized`.
> After the changes to view resolution and management, the current reason for 
> keeping `EliminateSubqueryAliases` in the optimizer is outdated. Thus, we 
> should also update the stated reason to `eager analysis of Dataset`.






[jira] [Created] (SPARK-19722) Clean up the usage of EliminateSubqueryAliases

2017-02-23 Thread Xiao Li (JIRA)
Xiao Li created SPARK-19722:
---

 Summary: Clean up the usage of EliminateSubqueryAliases
 Key: SPARK-19722
 URL: https://issues.apache.org/jira/browse/SPARK-19722
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.2.0
Reporter: Xiao Li
Assignee: Xiao Li
Priority: Minor


In the PR https://github.com/apache/spark/pull/11403, we introduced the 
function `canonicalized` for eliminating useless subqueries. We can simply 
replace calls to the rule `EliminateSubqueryAliases` with the function 
`canonicalized`.

After the changes to view resolution and management, the current reason for 
keeping `EliminateSubqueryAliases` in the optimizer is outdated. Thus, we 
should also update the stated reason to `eager analysis of Dataset`.






[jira] [Commented] (SPARK-14703) Spark uses SLF4J, but actually relies quite heavily on Log4J

2017-02-23 Thread Sheng Luo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15881981#comment-15881981
 ] 

Sheng Luo commented on SPARK-14703:
---

As a workaround, log4j-over-slf4j.jar can be used as a drop-in replacement for 
log4j.jar, which works for me. All calls to log4j will then be routed to SLF4J, 
which in turn routes to Logback in your case.
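
For example, a minimal sbt sketch of that workaround (the versions are 
assumptions chosen to match the Logback 1.1.3 mentioned in the environment; 
adjust as needed):

{code}
// build.sbt: route legacy log4j 1.x calls through SLF4J and on to Logback.
libraryDependencies ++= Seq(
  "org.slf4j" % "log4j-over-slf4j" % "1.7.25",
  "ch.qos.logback" % "logback-classic" % "1.1.3"
)

// Keep the real log4j off the classpath so the bridge is the only provider.
excludeDependencies += ExclusionRule("log4j", "log4j")
{code}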

> Spark uses SLF4J, but actually relies quite heavily on Log4J
> 
>
> Key: SPARK-14703
> URL: https://issues.apache.org/jira/browse/SPARK-14703
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, YARN
>Affects Versions: 1.6.0
> Environment: 1.6.0-cdh5.7.0, logback 1.1.3, yarn
>Reporter: Matthew Byng-Maddick
>Priority: Minor
>  Labels: log4j, logback, logging, slf4j
> Attachments: spark-logback.patch
>
>
> We've built a version of Hadoop CDH-5.7.0 in house with logback as the SLF4J 
> provider, in order to send hadoop logs straight to logstash (to handle with 
> logstash/elasticsearch), on top of our existing use of the logback backend.
> In trying to start spark-shell I discovered several points where the fact 
> that we weren't quite using a real L4J caused the sc not to be created or the 
> YARN module not to exist. There are many more places where we should probably 
> be wrapping the logging more sensibly, but I have a basic patch that fixes 
> some of the worst offenders (at least the ones that stop the sparkContext 
> being created properly).
> I'm prepared to accept that this is not a good solution and there probably 
> needs to be some sort of better wrapper, perhaps in the Logging.scala class 
> which handles this properly.






[jira] [Commented] (SPARK-17495) Hive hash implementation

2017-02-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15881980#comment-15881980
 ] 

Apache Spark commented on SPARK-17495:
--

User 'tejasapatil' has created a pull request for this issue:
https://github.com/apache/spark/pull/17049

> Hive hash implementation
> 
>
> Key: SPARK-17495
> URL: https://issues.apache.org/jira/browse/SPARK-17495
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Tejas Patil
>Assignee: Tejas Patil
>Priority: Minor
>
> Spark internally uses Murmur3Hash for partitioning. This is different from 
> the one used by Hive. For queries which use bucketing this leads to different 
> results if one tries the same query on both engines. For us, we want users to 
> have backward compatibility so that one can switch parts of applications 
> across the engines without observing regressions.






[jira] [Commented] (SPARK-19715) Option to Strip Paths in FileSource

2017-02-23 Thread Liwei Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15881965#comment-15881965
 ] 

Liwei Lin commented on SPARK-19715:
---

I'll work on this. Thanks!

> Option to Strip Paths in FileSource
> ---
>
> Key: SPARK-19715
> URL: https://issues.apache.org/jira/browse/SPARK-19715
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Michael Armbrust
>
> Today, we compare the whole path when deciding if a file is new in the 
> FileSource for structured streaming. However, this can cause false 
> negatives in the case where the path has changed in a cosmetic way (i.e. 
> changing s3n to s3a).  We should add an option {{fileNameOnly}} that causes 
> the new file check to be based only on the filename (but still store the 
> whole path in the log).
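
A minimal sketch of the proposed comparison; {{fileNameOnly}} is the option 
named above, but the helper itself is illustrative rather than 
FileStreamSource's actual code:

{code}
// Decide whether a file has been seen before, optionally ignoring everything
// but the file name so that cosmetic path changes (s3n:// vs s3a://) match.
def seenBefore(path: String, seenKeys: Set[String], fileNameOnly: Boolean): Boolean = {
  val key = if (fileNameOnly) path.substring(path.lastIndexOf('/') + 1) else path
  seenKeys.contains(key)
}

// "s3n://bucket/data/part-0001" and "s3a://bucket/data/part-0001" collapse to
// the same key ("part-0001") when fileNameOnly is enabled.
{code}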






[jira] [Commented] (SPARK-14772) Python ML Params.copy treats uid, paramMaps differently than Scala

2017-02-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15881927#comment-15881927
 ] 

Apache Spark commented on SPARK-14772:
--

User 'BryanCutler' has created a pull request for this issue:
https://github.com/apache/spark/pull/17048

> Python ML Params.copy treats uid, paramMaps differently than Scala
> --
>
> Key: SPARK-14772
> URL: https://issues.apache.org/jira/browse/SPARK-14772
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.1.0
>Reporter: Joseph K. Bradley
>Assignee: Bryan Cutler
> Fix For: 2.2.0
>
>
> In PySpark, {{ml.param.Params.copy}} does not quite match the Scala 
> implementation:
> * It does not copy the UID
> * It does not respect the difference between defaultParamMap and paramMap.  
> This is an issue with {{_copyValues}}.






[jira] [Assigned] (SPARK-17075) Cardinality Estimation of Predicate Expressions

2017-02-23 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-17075:
---

Assignee: Ron Hu

> Cardinality Estimation of Predicate Expressions
> ---
>
> Key: SPARK-17075
> URL: https://issues.apache.org/jira/browse/SPARK-17075
> Project: Spark
>  Issue Type: Sub-task
>  Components: Optimizer
>Affects Versions: 2.0.0
>Reporter: Ron Hu
>Assignee: Ron Hu
> Fix For: 2.2.0
>
>
> A filter condition is the predicate expression specified in the WHERE clause 
> of a SQL select statement.  A predicate can be a compound logical expression 
> with logical AND, OR, NOT operators combining multiple single conditions.  A 
> single condition usually has comparison operators such as =, <, <=, >, >=, 
> ‘like’, etc.






[jira] [Resolved] (SPARK-17075) Cardinality Estimation of Predicate Expressions

2017-02-23 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-17075.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 16395
[https://github.com/apache/spark/pull/16395]

> Cardinality Estimation of Predicate Expressions
> ---
>
> Key: SPARK-17075
> URL: https://issues.apache.org/jira/browse/SPARK-17075
> Project: Spark
>  Issue Type: Sub-task
>  Components: Optimizer
>Affects Versions: 2.0.0
>Reporter: Ron Hu
> Fix For: 2.2.0
>
>
> A filter condition is the predicate expression specified in the WHERE clause 
> of a SQL select statement.  A predicate can be a compound logical expression 
> with logical AND, OR, NOT operators combining multiple single conditions.  A 
> single condition usually has comparison operators such as =, <, <=, >, >=, 
> ‘like’, etc.






[jira] [Comment Edited] (SPARK-4563) Allow spark driver to bind to different ip then advertise ip

2017-02-23 Thread Danny Robinson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15881893#comment-15881893
 ] 

Danny Robinson edited comment on SPARK-4563 at 2/24/17 3:54 AM:


Updated my solution for Spark 1.6.3 which seemed to develop an issue with the 
port actually being accessible, possibly to do with tcp6.

{code}
#at container startup:
export SPARK_PUBLIC_DNS=IPADDR_OF_DOCKER_HOST_OR_PROXY
export SPARK_LOCAL_IP=IPADDR_OF_DOCKER_HOST_OR_PROXY
echo -e "0.0.0.0 ${HOSTNAME_OF_DOCKER_HOST_OR_PROXY}" >> /etc/hosts
{code}


was (Author: dannyjrobinson):
Updated my solution for Spark 1.6.3 which seemed to develop an issue with the 
port actually being accessible, possibly to do with tcp6.

{code}
export SPARK_PUBLIC_DNS=IPADDR_OF_DOCKER_HOST_OR_PROXY
export SPARK_LOCAL_IP=IPADDR_OF_DOCKER_HOST_OR_PROXY
at container startup I do this:
echo -e "0.0.0.0 ${HOSTNAME_OF_DOCKER_HOST_OR_PROXY}" >> /etc/hosts
{code}

> Allow spark driver to bind to different ip then advertise ip
> 
>
> Key: SPARK-4563
> URL: https://issues.apache.org/jira/browse/SPARK-4563
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Reporter: Long Nguyen
>Assignee: Marcelo Vanzin
>Priority: Minor
> Fix For: 2.1.0
>
>
> The Spark driver's bind IP and advertised IP are not separately configurable: 
> spark.driver.host sets only the bind IP, and SPARK_PUBLIC_DNS does not work 
> for the Spark driver. Allow an option to set the advertised IP/hostname.






[jira] [Commented] (SPARK-4563) Allow spark driver to bind to different ip then advertise ip

2017-02-23 Thread Danny Robinson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15881893#comment-15881893
 ] 

Danny Robinson commented on SPARK-4563:
---

Updated my solution for Spark 1.6.3 which seemed to develop an issue with the 
port actually being accessible, possibly to do with tcp6.

{code}
export SPARK_PUBLIC_DNS=IPADDR_OF_DOCKER_HOST_OR_PROXY
export SPARK_LOCAL_IP=IPADDR_OF_DOCKER_HOST_OR_PROXY
at container startup I do this:
echo -e "0.0.0.0 ${HOSTNAME_OF_DOCKER_HOST_OR_PROXY}" >> /etc/hosts
{code}

> Allow spark driver to bind to different ip then advertise ip
> 
>
> Key: SPARK-4563
> URL: https://issues.apache.org/jira/browse/SPARK-4563
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Reporter: Long Nguyen
>Assignee: Marcelo Vanzin
>Priority: Minor
> Fix For: 2.1.0
>
>
> The Spark driver's bind IP and advertised IP are not separately configurable: 
> spark.driver.host sets only the bind IP, and SPARK_PUBLIC_DNS does not work 
> for the Spark driver. Allow an option to set the advertised IP/hostname.






[jira] [Commented] (SPARK-4563) Allow spark driver to bind to different ip then advertise ip

2017-02-23 Thread Alex Hanson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15881819#comment-15881819
 ] 

Alex Hanson commented on SPARK-4563:


I have a similar issue where I'm running Spark v1.6.3 and wanted to add to what 
Danny suggested above for a Spark 1.6 solution. My Spark cluster is running 
standalone with nothing else running on those nodes. My solution uses iptables 
to forward traffic for the internal Docker IP (which I can't seem to configure 
the Spark cluster to not expose, at least as of v1.6) to the Docker Host IP, 
which works because the network on which the Docker host resides is a 
10.10.0.0/16 network, and the Docker internal network is 172.18.0.0/16. Since 
there are no other 172.18.*.* addresses, putting these rules in place on the 
Spark nodes doesn't collide with anything else on the host network.

I know I'm not the only one who can't make the jump from 1.6 to 2.0 (or 2.1) 
quite yet due to production applications, so I wanted to add my comment here 
that might help others through similar challenges.

> Allow spark driver to bind to different ip then advertise ip
> 
>
> Key: SPARK-4563
> URL: https://issues.apache.org/jira/browse/SPARK-4563
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Reporter: Long Nguyen
>Assignee: Marcelo Vanzin
>Priority: Minor
> Fix For: 2.1.0
>
>
> The Spark driver's bind IP and advertised IP are not separately configurable: 
> spark.driver.host sets only the bind IP, and SPARK_PUBLIC_DNS does not work 
> for the Spark driver. Allow an option to set the advertised IP/hostname.






[jira] [Updated] (SPARK-14772) Python ML Params.copy treats uid, paramMaps differently than Scala

2017-02-23 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-14772:
--
Fix Version/s: 2.2.0

> Python ML Params.copy treats uid, paramMaps differently than Scala
> --
>
> Key: SPARK-14772
> URL: https://issues.apache.org/jira/browse/SPARK-14772
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.1.0
>Reporter: Joseph K. Bradley
>Assignee: Bryan Cutler
> Fix For: 2.2.0
>
>
> In PySpark, {{ml.param.Params.copy}} does not quite match the Scala 
> implementation:
> * It does not copy the UID
> * It does not respect the difference between defaultParamMap and paramMap.  
> This is an issue with {{_copyValues}}.






[jira] [Assigned] (SPARK-19720) Redact sensitive information from SparkSubmit console output

2017-02-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19720:


Assignee: (was: Apache Spark)

> Redact sensitive information from SparkSubmit console output
> 
>
> Key: SPARK-19720
> URL: https://issues.apache.org/jira/browse/SPARK-19720
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.2.0
>Reporter: Mark Grover
>
> SPARK-18535 took care of redacting sensitive information from Spark event 
> logs and UI. However, it intentionally didn't bother redacting the same 
> sensitive information from SparkSubmit's console output because it was on the 
> client's machine, which already had the sensitive information on disk (in 
> spark-defaults.conf) or on terminal (spark-submit command line).
> However, it seems now that it's better to redact information from 
> SparkSubmit's console output as well because orchestration software like 
> Oozie usually exposes SparkSubmit's console output via a UI. To make matters 
> worse, Oozie, in particular, always sets the {{--verbose}} flag on 
> SparkSubmit invocation, making the sensitive information readily available in 
> its UI (see 
> [code|https://github.com/apache/oozie/blob/master/sharelib/spark/src/main/java/org/apache/oozie/action/hadoop/SparkMain.java#L248]
>  here).
> This is a JIRA for tracking redaction of sensitive information from 
> SparkSubmit's console output.






[jira] [Assigned] (SPARK-19720) Redact sensitive information from SparkSubmit console output

2017-02-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19720:


Assignee: Apache Spark

> Redact sensitive information from SparkSubmit console output
> 
>
> Key: SPARK-19720
> URL: https://issues.apache.org/jira/browse/SPARK-19720
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.2.0
>Reporter: Mark Grover
>Assignee: Apache Spark
>
> SPARK-18535 took care of redacting sensitive information from Spark event 
> logs and UI. However, it intentionally didn't bother redacting the same 
> sensitive information from SparkSubmit's console output because it was on the 
> client's machine, which already had the sensitive information on disk (in 
> spark-defaults.conf) or on terminal (spark-submit command line).
> However, it seems now that it's better to redact information from 
> SparkSubmit's console output as well because orchestration software like 
> Oozie usually expose SparkSubmit's console output via a UI. To make matters 
> worse, Oozie, in particular, always sets the {{--verbose}} flag on 
> SparkSubmit invocation, making the sensitive information readily available in 
> its UI (see 
> [code|https://github.com/apache/oozie/blob/master/sharelib/spark/src/main/java/org/apache/oozie/action/hadoop/SparkMain.java#L248]
>  here).
> This is a JIRA for tracking redaction of sensitive information from 
> SparkSubmit's console output.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19720) Redact sensitive information from SparkSubmit console output

2017-02-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15881773#comment-15881773
 ] 

Apache Spark commented on SPARK-19720:
--

User 'markgrover' has created a pull request for this issue:
https://github.com/apache/spark/pull/17047

> Redact sensitive information from SparkSubmit console output
> 
>
> Key: SPARK-19720
> URL: https://issues.apache.org/jira/browse/SPARK-19720
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.2.0
>Reporter: Mark Grover
>
> SPARK-18535 took care of redacting sensitive information from Spark event 
> logs and UI. However, it intentionally didn't bother redacting the same 
> sensitive information from SparkSubmit's console output because it was on the 
> client's machine, which already had the sensitive information on disk (in 
> spark-defaults.conf) or on terminal (spark-submit command line).
> However, it seems now that it's better to redact information from 
> SparkSubmit's console output as well because orchestration software like 
> Oozie usually expose SparkSubmit's console output via a UI. To make matters 
> worse, Oozie, in particular, always sets the {{--verbose}} flag on 
> SparkSubmit invocation, making the sensitive information readily available in 
> its UI (see 
> [code|https://github.com/apache/oozie/blob/master/sharelib/spark/src/main/java/org/apache/oozie/action/hadoop/SparkMain.java#L248]
>  here).
> This is a JIRA for tracking redaction of sensitive information from 
> SparkSubmit's console output.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19698) Race condition in stale attempt task completion vs current attempt task completion when task is doing persistent state changes

2017-02-23 Thread Jisoo Kim (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15881771#comment-15881771
 ] 

Jisoo Kim commented on SPARK-19698:
---

Ah, I see what you mean. I don't use Spark's speculation feature, so I wasn't 
aware that the running tasks won't be killed when their speculative copies get 
restarted. What is the reason behind not killing the stale tasks that were 
overridden? Is that for performance? 

I found that TaskSetManager will kill all the other attempts for the specific 
task when one of the attempts succeeds: 
https://github.com/apache/spark/blob/d9043092caf71d5fa6be18ae8c51a0158bc2218e/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L709

However, the above scenario still concerns me in case the task has some other 
long-running computation after modifying external state. In that case, Attempt 
1 can be launched after Attempt 0 finishes modifying external state (but is 
still doing some computation) and gets partway through its own modification. I 
think in this case if Attempt 1 gets killed or all other partitions are 
"finished" before Attempt 1 finishes, the same problem can happen. 

I wonder if this approach is a viable solution (a rough sketch follows below):
- Record additional information (the task attemptNumber from the task info) when adding 
the task index to speculatableTasks 
(https://github.com/apache/spark/blob/d9043092caf71d5fa6be18ae8c51a0158bc2218e/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L937)
- Have TaskSetManager notify the driver only when the completed task is not 
inside speculatableTasks 
(https://github.com/apache/spark/blob/d9043092caf71d5fa6be18ae8c51a0158bc2218e/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L706)
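A rough, hypothetical sketch of the shape of that proposal (names are illustrative, not the actual TaskSetManager fields):

{code}
import scala.collection.mutable

// Hypothetical: track (taskIndex, attemptNumber) rather than just the task index,
// so that a completion coming from a stale attempt can be recognised and ignored.
val speculatableAttempts = mutable.HashSet[(Int, Int)]()

def markSpeculatable(taskIndex: Int, attemptNumber: Int): Unit =
  speculatableAttempts += ((taskIndex, attemptNumber))

def shouldNotifyDriver(taskIndex: Int, attemptNumber: Int): Boolean =
  !speculatableAttempts.contains((taskIndex, attemptNumber))
{code}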



> Race condition in stale attempt task completion vs current attempt task 
> completion when task is doing persistent state changes
> --
>
> Key: SPARK-19698
> URL: https://issues.apache.org/jira/browse/SPARK-19698
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos, Spark Core
>Affects Versions: 2.0.0
>Reporter: Charles Allen
>
> We have encountered a strange scenario in our production environment. Below 
> is the best guess we have right now as to what's going on.
> Potentially, the final stage of a job has a failure in one of the tasks (such 
> as OOME on the executor) which can cause tasks for that stage to be 
> relaunched in a second attempt.
> https://github.com/apache/spark/blob/v2.1.0/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1155
> keeps track of which tasks have been completed, but does NOT keep track of 
> which attempt those tasks were completed in. As such, we have encountered a 
> scenario where a particular task gets executed twice in different stage 
> attempts, and the DAGScheduler does not consider if the second attempt is 
> still running. This means if the first task attempt succeeded, the second 
> attempt can be cancelled part-way through its run cycle if all other tasks 
> (including the prior failed) are completed successfully.
> What this means is that if a task is manipulating some state somewhere (for 
> example: an upload-to-temporary-file-location, then delete-then-move on an 
> underlying s3n storage implementation) the driver can improperly shutdown the 
> running (2nd attempt) task between state manipulations, leaving the 
> persistent state in a bad state since the 2nd attempt never got to complete 
> its manipulations, and was terminated prematurely at some arbitrary point in 
> its state change logic (ex: finished the delete but not the move).
> This is using the mesos coarse grained executor. It is unclear if this 
> behavior is limited to the mesos coarse grained executor or not.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14772) Python ML Params.copy treats uid, paramMaps differently than Scala

2017-02-23 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-14772:
--
 Shepherd: Joseph K. Bradley
Affects Version/s: 2.1.0
 Target Version/s: 2.1.1, 2.2.0

> Python ML Params.copy treats uid, paramMaps differently than Scala
> --
>
> Key: SPARK-14772
> URL: https://issues.apache.org/jira/browse/SPARK-14772
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.1.0
>Reporter: Joseph K. Bradley
>Assignee: Bryan Cutler
>
> In PySpark, {{ml.param.Params.copy}} does not quite match the Scala 
> implementation:
> * It does not copy the UID
> * It does not respect the difference between defaultParamMap and paramMap.  
> This is an issue with {{_copyValues}}.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14772) Python ML Params.copy treats uid, paramMaps differently than Scala

2017-02-23 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-14772:
-

Assignee: Bryan Cutler

> Python ML Params.copy treats uid, paramMaps differently than Scala
> --
>
> Key: SPARK-14772
> URL: https://issues.apache.org/jira/browse/SPARK-14772
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>Assignee: Bryan Cutler
>
> In PySpark, {{ml.param.Params.copy}} does not quite match the Scala 
> implementation:
> * It does not copy the UID
> * It does not respect the difference between defaultParamMap and paramMap.  
> This is an issue with {{_copyValues}}.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19721) Good error message for version mismatch in log files

2017-02-23 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-19721:


 Summary: Good error message for version mismatch in log files
 Key: SPARK-19721
 URL: https://issues.apache.org/jira/browse/SPARK-19721
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.1.0
Reporter: Michael Armbrust
Priority: Blocker


There are several places where we write out version identifiers in various logs 
for structured streaming (usually {{v1}}).  However, in the places where we 
check for this, we throw a confusing error message.  Instead, we should do the 
following:
 - Find all of the places where we do this kind of check.
 - for {{vN}} where {{N > 1}}, say "UnsupportedLogFormat: The file {{path}} was 
produced by a newer version of Spark and cannot be read by this version.  
Please upgrade"
 - for anything else throw an error saying the file is malformed.
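
A minimal sketch of the proposed check (a hypothetical helper, not the actual code path):

{code}
def validateVersion(versionText: String, path: String, maxSupportedVersion: Int): Int = {
  def malformed() = throw new IllegalStateException(
    s"The file $path is malformed: unrecognized version marker '$versionText'.")

  if (!versionText.startsWith("v")) malformed()
  val version =
    try versionText.stripPrefix("v").toInt
    catch { case _: NumberFormatException => malformed() }
  if (version > maxSupportedVersion) {
    throw new IllegalStateException(
      s"UnsupportedLogFormat: the file $path was produced by a newer version of Spark " +
        "and cannot be read by this version. Please upgrade.")
  }
  version
}
{code}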



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19720) Redact sensitive information from SparkSubmit console output

2017-02-23 Thread Mark Grover (JIRA)
Mark Grover created SPARK-19720:
---

 Summary: Redact sensitive information from SparkSubmit console 
output
 Key: SPARK-19720
 URL: https://issues.apache.org/jira/browse/SPARK-19720
 Project: Spark
  Issue Type: Bug
  Components: Spark Submit
Affects Versions: 2.2.0
Reporter: Mark Grover


SPARK-18535 took care of redacting sensitive information from Spark event logs 
and UI. However, it intentionally didn't bother redacting the same sensitive 
information from SparkSubmit's console output because it was on the client's 
machine, which already had the sensitive information on disk (in 
spark-defaults.conf) or on terminal (spark-submit command line).

However, it seems now that it's better to redact information from SparkSubmit's 
console output as well because orchestration software like Oozie usually expose 
SparkSubmit's console output via a UI. To make matters worse, Oozie, in 
particular, always sets the {{--verbose}} flag on SparkSubmit invocation, 
making the sensitive information readily available in its UI (see 
[code|https://github.com/apache/oozie/blob/master/sharelib/spark/src/main/java/org/apache/oozie/action/hadoop/SparkMain.java#L248]
 here).

This is a JIRA for tracking redaction of sensitive information from 
SparkSubmit's console output.
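
A minimal sketch of the kind of redaction intended here (illustrative only; the pattern and helper are hypothetical, loosely modelled on the event-log redaction added in SPARK-18535):

{code}
// Redact the value of any key=value pair whose key looks sensitive before printing it.
val redactionPattern = "(?i)secret|password".r

def redact(kvs: Seq[(String, String)]): Seq[(String, String)] = kvs.map {
  case (k, v) if redactionPattern.findFirstIn(k).isDefined => (k, "*********(redacted)")
  case kv => kv
}

redact(Seq("spark.executor.memory" -> "4g", "spark.hadoop.fs.s3a.secret.key" -> "hunter2"))
// -> Seq(("spark.executor.memory", "4g"), ("spark.hadoop.fs.s3a.secret.key", "*********(redacted)"))
{code}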



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19715) Option to Strip Paths in FileSource

2017-02-23 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15881731#comment-15881731
 ] 

Michael Armbrust commented on SPARK-19715:
--

[~lwlin] here is another file source feature you might want to work on.

> Option to Strip Paths in FileSource
> ---
>
> Key: SPARK-19715
> URL: https://issues.apache.org/jira/browse/SPARK-19715
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Michael Armbrust
>
> Today, we compare the whole path when deciding if a file is new in the 
> FileSource for structured streaming.  However, this can cause false 
> negatives in the case where the path has changed in a cosmetic way (i.e. 
> changing s3n to s3a).  We should add an option {{fileNameOnly}} that causes 
> the new file check to be based only on the filename (but still store the 
> whole path in the log).
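
A hedged sketch of what such an option could boil down to (hypothetical helper; assumes Hadoop's {{Path}} is on the classpath):

{code}
import org.apache.hadoop.fs.Path

// Key "seen" files by file name only when fileNameOnly is enabled, so cosmetic
// path changes (e.g. s3n:// vs s3a://) do not make already-processed files look new.
def seenKey(fullPath: String, fileNameOnly: Boolean): String =
  if (fileNameOnly) new Path(fullPath).getName else fullPath

seenKey("s3n://bucket/in/part-00000.json", fileNameOnly = true)  // "part-00000.json"
seenKey("s3a://bucket/in/part-00000.json", fileNameOnly = true)  // "part-00000.json"
{code}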



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19596) After a Stage is completed, all Tasksets for the stage should be marked as zombie

2017-02-23 Thread Kay Ousterhout (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15881726#comment-15881726
 ] 

Kay Ousterhout commented on SPARK-19596:


I agree that this is an issue (although it would be implicitly fixed if we 
cancel running tasks in zombie stages, because that would mean that a task 
attempt from an earlier, still-running stage attempt can't cause a stage to be 
marked as complete)

> After a Stage is completed, all Tasksets for the stage should be marked as 
> zombie
> -
>
> Key: SPARK-19596
> URL: https://issues.apache.org/jira/browse/SPARK-19596
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 2.2.0
>Reporter: Imran Rashid
>
> Fetch Failures can lead to multiple simultaneous tasksets for one stage.  The 
> stage may eventually be finished by task completions from a prior stage 
> attempt.  When this happens, the most recent taskset is not marked as a 
> zombie.  This means that taskset may continue to submit new tasks even after 
> the stage is complete.
> This is not a correctness issue, but it will affect performance, as cluster 
> resources will get tied up running tasks that are not needed.
> This is a follow up to https://issues.apache.org/jira/browse/SPARK-19565.  
> See some discussion in the pr for that issue: 
> https://github.com/apache/spark/pull/16901



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19714) Bucketizer Bug Regarding Handling Unbucketed Inputs

2017-02-23 Thread Wojciech Szymanski (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15881723#comment-15881723
 ] 

Wojciech Szymanski commented on SPARK-19714:


I fully agree with you Bill, that "invalid" is unfortunate name in this 
context, so at least docs should be updated.
[~yanboliang] could you please advise if additional rework is needed? If so I 
can take it as well.

> Bucketizer Bug Regarding Handling Unbucketed Inputs
> ---
>
> Key: SPARK-19714
> URL: https://issues.apache.org/jira/browse/SPARK-19714
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 2.1.0
>Reporter: Bill Chambers
>
> {code}
> val contDF = spark.range(500).selectExpr("cast(id as double) as id")
> import org.apache.spark.ml.feature.Bucketizer
> val splits = Array(5.0, 10.0, 250.0, 500.0)
> val bucketer = new Bucketizer()
>   .setSplits(splits)
>   .setInputCol("id")
>   .setHandleInvalid("skip")
> bucketer.transform(contDF).show()
> {code}
> You would expect that this would handle the invalid buckets. However it fails
> {code}
> Caused by: org.apache.spark.SparkException: Feature value 0.0 out of 
> Bucketizer bounds [5.0, 500.0].  Check your features, or loosen the 
> lower/upper bound constraints.
> {code} 
> It seems strange that handleInvalid doesn't actually handle invalid inputs.
> Thoughts anyone?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19691) Calculating percentile of decimal column fails with ClassCastException

2017-02-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15881694#comment-15881694
 ] 

Apache Spark commented on SPARK-19691:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/17046

> Calculating percentile of decimal column fails with ClassCastException
> --
>
> Key: SPARK-19691
> URL: https://issues.apache.org/jira/browse/SPARK-19691
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Josh Rosen
>Assignee: Takeshi Yamamuro
> Fix For: 2.2.0
>
>
> Running
> {code}
> spark.range(10).selectExpr("cast (id as decimal) as 
> x").selectExpr("percentile(x, 0.5)").collect()
> {code}
> results in a ClassCastException:
> {code}
>  java.lang.ClassCastException: org.apache.spark.sql.types.Decimal cannot be 
> cast to java.lang.Number
>   at 
> org.apache.spark.sql.catalyst.expressions.aggregate.Percentile.update(Percentile.scala:141)
>   at 
> org.apache.spark.sql.catalyst.expressions.aggregate.Percentile.update(Percentile.scala:58)
>   at 
> org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate.update(interfaces.scala:514)
>   at 
> org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$1$$anonfun$applyOrElse$1.apply(AggregationIterator.scala:171)
>   at 
> org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$1$$anonfun$applyOrElse$1.apply(AggregationIterator.scala:171)
>   at 
> org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateProcessRow$1.apply(AggregationIterator.scala:187)
>   at 
> org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateProcessRow$1.apply(AggregationIterator.scala:181)
>   at 
> org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.processInputs(ObjectAggregationIterator.scala:151)
>   at 
> org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.(ObjectAggregationIterator.scala:78)
>   at 
> org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1$$anonfun$2.apply(ObjectHashAggregateExec.scala:109)
>   at 
> org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1$$anonfun$2.apply(ObjectHashAggregateExec.scala:101)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
>   at org.apache.spark.scheduler.Task.run(Task.scala:113)
> {code}
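
Until a fix lands, a hedged workaround sketch is to cast the decimal column to double before calling {{percentile}} (assuming a SparkSession named {{spark}}):

{code}
spark.range(10)
  .selectExpr("cast(id as decimal) as x")
  .selectExpr("percentile(cast(x as double), 0.5)")
  .collect()
{code}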



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19698) Race condition in stale attempt task completion vs current attempt task completion when task is doing persistent state changes

2017-02-23 Thread Kay Ousterhout (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15881688#comment-15881688
 ] 

Kay Ousterhout commented on SPARK-19698:


My concern is that there are other cases in Spark where this issue could arise 
(so Spark tasks need to be very careful about how they modify external state).  
Here's another scenario:

- Attempt 0 of a task starts and takes a long time to run
- A second, speculative copy of the task is started (attempt 1)
- Attempt 0 finishes successfully, but attempt 1 is still running
- Attempt 1 gets partway through modifying the external state, but then gets 
killed because of an OOM on the machine
- Attempt 1 won't get re-started, because a copy of the task already finished 
successfully

This seems like it will have the same issue you mentioned in the JIRA, right?

> Race condition in stale attempt task completion vs current attempt task 
> completion when task is doing persistent state changes
> --
>
> Key: SPARK-19698
> URL: https://issues.apache.org/jira/browse/SPARK-19698
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos, Spark Core
>Affects Versions: 2.0.0
>Reporter: Charles Allen
>
> We have encountered a strange scenario in our production environment. Below 
> is the best guess we have right now as to what's going on.
> Potentially, the final stage of a job has a failure in one of the tasks (such 
> as OOME on the executor) which can cause tasks for that stage to be 
> relaunched in a second attempt.
> https://github.com/apache/spark/blob/v2.1.0/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1155
> keeps track of which tasks have been completed, but does NOT keep track of 
> which attempt those tasks were completed in. As such, we have encountered a 
> scenario where a particular task gets executed twice in different stage 
> attempts, and the DAGScheduler does not consider if the second attempt is 
> still running. This means if the first task attempt succeeded, the second 
> attempt can be cancelled part-way through its run cycle if all other tasks 
> (including the prior failed) are completed successfully.
> What this means is that if a task is manipulating some state somewhere (for 
> example: an upload-to-temporary-file-location, then delete-then-move on an 
> underlying s3n storage implementation) the driver can improperly shutdown the 
> running (2nd attempt) task between state manipulations, leaving the 
> persistent state in a bad state since the 2nd attempt never got to complete 
> its manipulations, and was terminated prematurely at some arbitrary point in 
> its state change logic (ex: finished the delete but not the move).
> This is using the mesos coarse grained executor. It is unclear if this 
> behavior is limited to the mesos coarse grained executor or not.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19714) Bucketizer Bug Regarding Handling Unbucketed Inputs

2017-02-23 Thread Bill Chambers (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15881676#comment-15881676
 ] 

Bill Chambers commented on SPARK-19714:
---

"Invalid" is a poor descriptor IMO. Invalid should be defined as "not defined 
in this range". If it's null, why isn't it just "handleNull" or something since 
it only applies to null/missing values?

A doc update would definitely help. I've got my own opinions about how this 
should work but I'll leave it up to you. I'd be curious if anyone else has 
thoughts; maybe I'm the only one, in which case... whatever :)

> Bucketizer Bug Regarding Handling Unbucketed Inputs
> ---
>
> Key: SPARK-19714
> URL: https://issues.apache.org/jira/browse/SPARK-19714
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 2.1.0
>Reporter: Bill Chambers
>
> {code}
> val contDF = spark.range(500).selectExpr("cast(id as double) as id")
> import org.apache.spark.ml.feature.Bucketizer
> val splits = Array(5.0, 10.0, 250.0, 500.0)
> val bucketer = new Bucketizer()
>   .setSplits(splits)
>   .setInputCol("id")
>   .setHandleInvalid("skip")
> bucketer.transform(contDF).show()
> {code}
> You would expect that this would handle the invalid buckets. However it fails
> {code}
> Caused by: org.apache.spark.SparkException: Feature value 0.0 out of 
> Bucketizer bounds [5.0, 500.0].  Check your features, or loosen the 
> lower/upper bound constraints.
> {code} 
> It seems strange that handleInvalid doesn't actually handle invalid inputs.
> Thoughts anyone?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16122) Spark History Server REST API missing an environment endpoint per application

2017-02-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15881674#comment-15881674
 ] 

Apache Spark commented on SPARK-16122:
--

User 'uncleGen' has created a pull request for this issue:
https://github.com/apache/spark/pull/17033

> Spark History Server REST API missing an environment endpoint per application
> -
>
> Key: SPARK-16122
> URL: https://issues.apache.org/jira/browse/SPARK-16122
> Project: Spark
>  Issue Type: New Feature
>  Components: Documentation, Web UI
>Affects Versions: 1.6.1
>Reporter: Neelesh Srinivas Salian
>Assignee: Genmao Yu
>Priority: Minor
>  Labels: Docs, WebUI
> Fix For: 2.2.0
>
>
> The WebUI for the Spark History Server has the Environment tab that allows 
> you to view the Environment for that job.
> With Runtime, Spark properties, etc.
> How about adding an endpoint to the REST API that looks and points to this 
> environment tab for that application?
> /applications/[app-id]/environment
> Added Docs too so that we can spawn a subsequent Documentation addition to 
> get it included in the API.
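
A hedged usage sketch of the proposed endpoint (the URL shape follows the existing {{/api/v1}} REST API; host, port and application id are placeholders):

{code}
val appId = "app-20170223000000-0000"  // placeholder
val url = s"http://localhost:18080/api/v1/applications/$appId/environment"
val json = scala.io.Source.fromURL(url).mkString  // raw JSON payload of the environment tab
{code}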



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19709) CSV datasource fails to read empty file

2017-02-23 Thread Wojciech Szymanski (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15881672#comment-15881672
 ] 

Wojciech Szymanski commented on SPARK-19709:


Thanks, I will try to fix it soon.

> CSV datasource fails to read empty file
> ---
>
> Key: SPARK-19709
> URL: https://issues.apache.org/jira/browse/SPARK-19709
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> I just created an empty file with {{touch a}} and then ran the code below:
> {code}
> scala> spark.read.csv("a")
> java.util.NoSuchElementException: next on empty iterator
>   at scala.collection.Iterator$$anon$2.next(Iterator.scala:39)
>   at scala.collection.Iterator$$anon$2.next(Iterator.scala:37)
>   at scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.
> {code}
> It seems we should produce an empty dataframe consistently with 
> `spark.read.json("a")`.
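
A hedged sketch of the behaviour one would expect after a fix (assuming an empty file {{a}} and a SparkSession named {{spark}}):

{code}
val df = spark.read.csv("a")
assert(df.schema.isEmpty)  // an empty schema, consistent with spark.read.json("a")
assert(df.count() == 0L)   // and no rows, instead of a NoSuchElementException
{code}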



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19709) CSV datasource fails to read empty file

2017-02-23 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15881648#comment-15881648
 ] 

Hyukjin Kwon commented on SPARK-19709:
--

Please go ahead. (but I _personally_ recommend you open a PR in a few days just 
to avoid potential conflicts because, for example, if 
https://github.com/apache/spark/pull/16976 gets merged, the code path will be 
changed rapidly).

> CSV datasource fails to read empty file
> ---
>
> Key: SPARK-19709
> URL: https://issues.apache.org/jira/browse/SPARK-19709
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> I just created an empty file with {{touch a}} and then ran the code below:
> {code}
> scala> spark.read.csv("a")
> java.util.NoSuchElementException: next on empty iterator
>   at scala.collection.Iterator$$anon$2.next(Iterator.scala:39)
>   at scala.collection.Iterator$$anon$2.next(Iterator.scala:37)
>   at scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.
> {code}
> It seems we should produce an empty dataframe consistently with 
> `spark.read.json("a")`.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19698) Race condition in stale attempt task completion vs current attempt task completion when task is doing persistent state changes

2017-02-23 Thread Jisoo Kim (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15881644#comment-15881644
 ] 

Jisoo Kim commented on SPARK-19698:
---

[~kayousterhout] If the failed task gets re-tried, then as long as the driver doesn't 
shut down before the next attempt finishes, it should be OK, because the next 
attempt will upload the file as intended. That's actually similar to what 
happened in my workload: an executor was lost due to an OOME and the stage was 
eventually resubmitted. If the driver hadn't considered the job done, things 
would have been fine. The driver didn't mark the partition that the failed 
task was responsible for as "finished", so in the next attempt that task 
finished successfully (there was no race condition for this specific task 
because the executor running it was lost), but one of the other 
tasks hit this problem. One thing I am not sure about with my solution is a 
possible performance regression, but I think it might be better than ending up with 
an "incorrect" external state, unless having a task modify external state is 
simply not recommended practice.

> Race condition in stale attempt task completion vs current attempt task 
> completion when task is doing persistent state changes
> --
>
> Key: SPARK-19698
> URL: https://issues.apache.org/jira/browse/SPARK-19698
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos, Spark Core
>Affects Versions: 2.0.0
>Reporter: Charles Allen
>
> We have encountered a strange scenario in our production environment. Below 
> is the best guess we have right now as to what's going on.
> Potentially, the final stage of a job has a failure in one of the tasks (such 
> as OOME on the executor) which can cause tasks for that stage to be 
> relaunched in a second attempt.
> https://github.com/apache/spark/blob/v2.1.0/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1155
> keeps track of which tasks have been completed, but does NOT keep track of 
> which attempt those tasks were completed in. As such, we have encountered a 
> scenario where a particular task gets executed twice in different stage 
> attempts, and the DAGScheduler does not consider if the second attempt is 
> still running. This means if the first task attempt succeeded, the second 
> attempt can be cancelled part-way through its run cycle if all other tasks 
> (including the prior failed) are completed successfully.
> What this means is that if a task is manipulating some state somewhere (for 
> example: an upload-to-temporary-file-location, then delete-then-move on an 
> underlying s3n storage implementation) the driver can improperly shutdown the 
> running (2nd attempt) task between state manipulations, leaving the 
> persistent state in a bad state since the 2nd attempt never got to complete 
> its manipulations, and was terminated prematurely at some arbitrary point in 
> its state change logic (ex: finished the delete but not the move).
> This is using the mesos coarse grained executor. It is unclear if this 
> behavior is limited to the mesos coarse grained executor or not.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19709) CSV datasource fails to read empty file

2017-02-23 Thread Wojciech Szymanski (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15881633#comment-15881633
 ] 

Wojciech Szymanski commented on SPARK-19709:


[~hyukjin.kwon] I can also look at this if you don't mind. It seems it's very 
easy to reproduce.

> CSV datasource fails to read empty file
> ---
>
> Key: SPARK-19709
> URL: https://issues.apache.org/jira/browse/SPARK-19709
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> I just created an empty file with {{touch a}} and then ran the code below:
> {code}
> scala> spark.read.csv("a")
> java.util.NoSuchElementException: next on empty iterator
>   at scala.collection.Iterator$$anon$2.next(Iterator.scala:39)
>   at scala.collection.Iterator$$anon$2.next(Iterator.scala:37)
>   at scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.
> {code}
> It seems we should produce an empty dataframe consistently with 
> `spark.read.json("a")`.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18057) Update structured streaming kafka from 10.0.1 to 10.2.0

2017-02-23 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-18057:
-
Summary: Update structured streaming kafka from 10.0.1 to 10.2.0  (was: 
Update structured streaming kafka from 10.0.1 to 10.1.0)

> Update structured streaming kafka from 10.0.1 to 10.2.0
> ---
>
> Key: SPARK-18057
> URL: https://issues.apache.org/jira/browse/SPARK-18057
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Cody Koeninger
>
> There are a couple of relevant KIPs here, 
> https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7354) Flaky test: o.a.s.deploy.SparkSubmitSuite --jars

2017-02-23 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15881593#comment-15881593
 ] 

Andrew Ash commented on SPARK-7354:
---

We saw a flake for this test in the k8s repo's Travis builds too: 
https://github.com/apache-spark-on-k8s/spark/issues/110#issuecomment-281837162

> Flaky test: o.a.s.deploy.SparkSubmitSuite --jars
> 
>
> Key: SPARK-7354
> URL: https://issues.apache.org/jira/browse/SPARK-7354
> Project: Spark
>  Issue Type: Test
>  Components: Spark Core, Tests
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Tathagata Das
>Assignee: Andrew Or
>Priority: Critical
>  Labels: flaky-test
>
> https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/2271/AMPLAB_JENKINS_BUILD_PROFILE=hadoop1.0,label=centos/testReport/



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19373) Mesos implementation of spark.scheduler.minRegisteredResourcesRatio looks at acquired cores rather than registered cores

2017-02-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19373:


Assignee: (was: Apache Spark)

> Mesos implementation of spark.scheduler.minRegisteredResourcesRatio looks at 
> acquired cores rather than registered cores
> ---
>
> Key: SPARK-19373
> URL: https://issues.apache.org/jira/browse/SPARK-19373
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 2.1.0
>Reporter: Michael Gummelt
>
> We're currently using `totalCoresAcquired` to account for registered 
> resources, which is incorrect.  That variable measures the number of cores 
> the scheduler has accepted.  We should be using `totalCoreCount` like the 
> other schedulers do.
> Fixing this is important for locality, since users often want to wait for all 
> executors to come up before scheduling tasks to ensure they get a node-local 
> placement. 
> original PR to add support: https://github.com/apache/spark/pull/8672/files
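
A hedged sketch of the intended readiness check (hypothetical names mirroring the description, not the actual scheduler fields):

{code}
// The ratio should be evaluated against cores that have actually registered,
// not against cores merely accepted from Mesos offers.
def sufficientResourcesRegistered(totalRegisteredCores: Int,
                                  maxCoresOption: Option[Int],
                                  minRegisteredRatio: Double): Boolean =
  maxCoresOption.forall(maxCores => totalRegisteredCores >= maxCores * minRegisteredRatio)
{code}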



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19373) Mesos implementation of spark.scheduler.minRegisteredResourcesRatio looks at acquired cores rather than registered cores

2017-02-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19373:


Assignee: Apache Spark

> Mesos implementation of spark.scheduler.minRegisteredResourcesRatio looks at 
> acquired cores rather than registered cores
> ---
>
> Key: SPARK-19373
> URL: https://issues.apache.org/jira/browse/SPARK-19373
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 2.1.0
>Reporter: Michael Gummelt
>Assignee: Apache Spark
>
> We're currently using `totalCoresAcquired` to account for registered 
> resources, which is incorrect.  That variable measures the number of cores 
> the scheduler has accepted.  We should be using `totalCoreCount` like the 
> other schedulers do.
> Fixing this is important for locality, since users often want to wait for all 
> executors to come up before scheduling tasks to ensure they get a node-local 
> placement. 
> original PR to add support: https://github.com/apache/spark/pull/8672/files



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19373) Mesos implementation of spark.scheduler.minRegisteredResourcesRatio looks at acquired cores rather than registered cores

2017-02-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15881572#comment-15881572
 ] 

Apache Spark commented on SPARK-19373:
--

User 'mgummelt' has created a pull request for this issue:
https://github.com/apache/spark/pull/17045

> Mesos implementation of spark.scheduler.minRegisteredResourcesRatio looks at 
> acquired cores rather than registered cores
> ---
>
> Key: SPARK-19373
> URL: https://issues.apache.org/jira/browse/SPARK-19373
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 2.1.0
>Reporter: Michael Gummelt
>
> We're currently using `totalCoresAcquired` to account for registered 
> resources, which is incorrect.  That variable measures the number of cores 
> the scheduler has accepted.  We should be using `totalCoreCount` like the 
> other schedulers do.
> Fixing this is important for locality, since users often want to wait for all 
> executors to come up before scheduling tasks to ensure they get a node-local 
> placement. 
> original PR to add support: https://github.com/apache/spark/pull/8672/files



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19714) Bucketizer Bug Regarding Handling Unbucketed Inputs

2017-02-23 Thread Wojciech Szymanski (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15881568#comment-15881568
 ] 

Wojciech Szymanski commented on SPARK-19714:


IMHO Bucketizer works as expected. I guess that from your point of view an 
invalid value is a number out of the splits range (e.g. 0, 1, 2, 3, 4 here), but from 
Spark's point of view an invalid value is NaN (not a number).
{code}
if (getHandleInvalid == Bucketizer.SKIP_INVALID) {
  // "skip" NaN option is set, will filter out NaN values in the dataset
  (dataset.na.drop().toDF(), false)
}
{code}

I fully agree that the docs for handleInvalid might be confusing, since the 
definition of invalid values is missing:
{code}
/**
 * Param for how to handle invalid entries. Options are 'skip' (filter out rows with
 * invalid values), 'error' (throw an error), or 'keep' (keep invalid values in a special
 * additional bucket).
 * Default: "error"
 * @group param
 */
val handleInvalid: Param[String]
{code}

I would suggest updating the docs to clarify what kind of invalid values will be 
filtered out when the 'skip' strategy is used.
I am not sure whether introducing a new strategy for handling values out of range 
would be welcomed by the community.
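
A hedged sketch (assuming a SparkSession named {{spark}}) showing that 'skip' currently drops NaN rows rather than out-of-range values:

{code}
import org.apache.spark.ml.feature.Bucketizer

val df = spark.createDataFrame(Seq((0, 7.0), (1, Double.NaN))).toDF("id", "feature")
val bucketizer = new Bucketizer()
  .setSplits(Array(5.0, 10.0, 250.0, 500.0))
  .setInputCol("feature")
  .setOutputCol("bucket")
  .setHandleInvalid("skip")

bucketizer.transform(df).show()  // the NaN row is filtered out; 7.0 lands in the first bucket
{code}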

> Bucketizer Bug Regarding Handling Unbucketed Inputs
> ---
>
> Key: SPARK-19714
> URL: https://issues.apache.org/jira/browse/SPARK-19714
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 2.1.0
>Reporter: Bill Chambers
>
> {code}
> val contDF = spark.range(500).selectExpr("cast(id as double) as id")
> import org.apache.spark.ml.feature.Bucketizer
> val splits = Array(5.0, 10.0, 250.0, 500.0)
> val bucketer = new Bucketizer()
>   .setSplits(splits)
>   .setInputCol("id")
>   .setHandleInvalid("skip")
> bucketer.transform(contDF).show()
> {code}
> You would expect that this would handle the invalid buckets. However it fails
> {code}
> Caused by: org.apache.spark.SparkException: Feature value 0.0 out of 
> Bucketizer bounds [5.0, 500.0].  Check your features, or loosen the 
> lower/upper bound constraints.
> {code} 
> It seems strange that handleInvalid doesn't actually handle invalid inputs.
> Thoughts anyone?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-14523) Feature parity for Statistics ML with MLlib

2017-02-23 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley closed SPARK-14523.
-
Resolution: Done

> Feature parity for Statistics ML with MLlib
> ---
>
> Key: SPARK-14523
> URL: https://issues.apache.org/jira/browse/SPARK-14523
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: yuhao yang
>
> Some statistics functions have been supported by DataFrame directly. Use this 
> jira to discuss/design the statistics package in Spark.ML and its function 
> scope. Hypothesis test and correlation computation may still need to expose 
> independent interfaces.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14523) Feature parity for Statistics ML with MLlib

2017-02-23 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15881567#comment-15881567
 ] 

Joseph K. Bradley commented on SPARK-14523:
---

Alright, given that there are now 3 more subtasks for stats, I'll close this 
one in favor of those other 3.

> Feature parity for Statistics ML with MLlib
> ---
>
> Key: SPARK-14523
> URL: https://issues.apache.org/jira/browse/SPARK-14523
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: yuhao yang
>
> Some statistics functions have been supported by DataFrame directly. Use this 
> jira to discuss/design the statistics package in Spark.ML and its function 
> scope. Hypothesis test and correlation computation may still need to expose 
> independent interfaces.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19698) Race condition in stale attempt task completion vs current attempt task completion when task is doing persistent state changes

2017-02-23 Thread Kay Ousterhout (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15881564#comment-15881564
 ] 

Kay Ousterhout commented on SPARK-19698:


I see -- I agree that everything in your description is correct.  The driver 
will allow all tasks to finish if it's still running (e.g., if other tasks are 
being submitted), but you're right it will shut down the workers while some 
tasks are still in progress if the Driver shuts down.

To think about how to fix this, let me ask you a question about your workload: 
suppose a task is in the middle of manipulating some external state (as you 
described in the JIRA description) and it gets killed suddenly because the JVM 
runs out of memory (e.g., because another concurrently running task used up all 
of the memory).  In that case, the job listener won't be told about the failed 
task, and it will be re-tried.  Does that pose a problem in the same way that 
the behavior described in the PR is problematic?

> Race condition in stale attempt task completion vs current attempt task 
> completion when task is doing persistent state changes
> --
>
> Key: SPARK-19698
> URL: https://issues.apache.org/jira/browse/SPARK-19698
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos, Spark Core
>Affects Versions: 2.0.0
>Reporter: Charles Allen
>
> We have encountered a strange scenario in our production environment. Below 
> is the best guess we have right now as to what's going on.
> Potentially, the final stage of a job has a failure in one of the tasks (such 
> as OOME on the executor) which can cause tasks for that stage to be 
> relaunched in a second attempt.
> https://github.com/apache/spark/blob/v2.1.0/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1155
> keeps track of which tasks have been completed, but does NOT keep track of 
> which attempt those tasks were completed in. As such, we have encountered a 
> scenario where a particular task gets executed twice in different stage 
> attempts, and the DAGScheduler does not consider if the second attempt is 
> still running. This means if the first task attempt succeeded, the second 
> attempt can be cancelled part-way through its run cycle if all other tasks 
> (including the prior failed) are completed successfully.
> What this means is that if a task is manipulating some state somewhere (for 
> example: an upload-to-temporary-file-location, then delete-then-move on an 
> underlying s3n storage implementation) the driver can improperly shutdown the 
> running (2nd attempt) task between state manipulations, leaving the 
> persistent state in a bad state since the 2nd attempt never got to complete 
> its manipulations, and was terminated prematurely at some arbitrary point in 
> its state change logic (ex: finished the delete but not the move).
> This is using the mesos coarse grained executor. It is unclear if this 
> behavior is limited to the mesos coarse grained executor or not.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16920) Investigate and fix issues introduced in SPARK-15858

2017-02-23 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15881537#comment-15881537
 ] 

Joseph K. Bradley commented on SPARK-16920:
---

Thanks for adding that gist!  I agree with your argument that it's O(N), and 
the numbers look good in that respect.
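
As a side note, a hedged illustration (not the actual GBT code) of the linear-versus-quadratic accumulation question:

{code}
// Summing tree contributions in one running pass is O(N) in the number of trees;
// re-summing the prefix for every tree would be O(N^2). Both yield the same values.
val treePredictions = Array(0.4, -0.1, 0.05, 0.2)
val linear = treePredictions.scanLeft(0.0)(_ + _).tail
val quadratic = treePredictions.indices.map(i => treePredictions.take(i + 1).sum)
assert(linear.sameElements(quadratic))
{code}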

I'm going to say we're done here and close this JIRA.  Thanks [~mahmoudr] and 
[~vladimir.feinberg]!

> Investigate and fix issues introduced in SPARK-15858
> 
>
> Key: SPARK-16920
> URL: https://issues.apache.org/jira/browse/SPARK-16920
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Vladimir Feinberg
> Fix For: 2.2.0
>
>
> There were several issues regarding the PR resolving SPARK-15858, my comments 
> are available here:
> https://github.com/apache/spark/commit/393db655c3c43155305fbba1b2f8c48a95f18d93
> The two most important issues are:
> 1. The PR did not add a stress test proving it resolved the issue it was 
> supposed to (though I have no doubt the optimization made is indeed correct).
> 2. The PR introduced quadratic prediction time in terms of the number of 
> trees, which was previously linear. This issue needs to be investigated for 
> whether it causes problems for large numbers of trees (say, 1000), an 
> appropriate test should be added, and then fixed.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16920) Investigate and fix issues introduced in SPARK-15858

2017-02-23 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-16920.
---
  Resolution: Done
   Fix Version/s: 2.2.0
Target Version/s: 2.2.0

> Investigate and fix issues introduced in SPARK-15858
> 
>
> Key: SPARK-16920
> URL: https://issues.apache.org/jira/browse/SPARK-16920
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Vladimir Feinberg
>Assignee: Mahmoud Rawas
> Fix For: 2.2.0
>
>
> There were several issues regarding the PR resolving SPARK-15858, my comments 
> are available here:
> https://github.com/apache/spark/commit/393db655c3c43155305fbba1b2f8c48a95f18d93
> The two most important issues are:
> 1. The PR did not add a stress test proving it resolved the issue it was 
> supposed to (though I have no doubt the optimization made is indeed correct).
> 2. The PR introduced quadratic prediction time in terms of the number of 
> trees, which was previously linear. This issue needs to be investigated for 
> whether it causes problems for large numbers of trees (say, 1000), an 
> appropriate test should be added, and then fixed.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16920) Investigate and fix issues introduced in SPARK-15858

2017-02-23 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-16920:
-

Assignee: Mahmoud Rawas

> Investigate and fix issues introduced in SPARK-15858
> 
>
> Key: SPARK-16920
> URL: https://issues.apache.org/jira/browse/SPARK-16920
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Vladimir Feinberg
>Assignee: Mahmoud Rawas
> Fix For: 2.2.0
>
>
> There were several issues regarding the PR resolving SPARK-15858, my comments 
> are available here:
> https://github.com/apache/spark/commit/393db655c3c43155305fbba1b2f8c48a95f18d93
> The two most important issues are:
> 1. The PR did not add a stress test proving it resolved the issue it was 
> supposed to (though I have no doubt the optimization made is indeed correct).
> 2. The PR introduced quadratic prediction time in terms of the number of 
> trees, which was previously linear. This issue needs to be investigated for 
> whether it causes problems for large numbers of trees (say, 1000), an 
> appropriate test should be added, and then fixed.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16920) Investigate and fix issues introduced in SPARK-15858

2017-02-23 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-16920:
--
Target Version/s:   (was: 2.2.0)

> Investigate and fix issues introduced in SPARK-15858
> 
>
> Key: SPARK-16920
> URL: https://issues.apache.org/jira/browse/SPARK-16920
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Vladimir Feinberg
> Fix For: 2.2.0
>
>
> There were several issues regarding the PR resolving SPARK-15858, my comments 
> are available here:
> https://github.com/apache/spark/commit/393db655c3c43155305fbba1b2f8c48a95f18d93
> The two most important issues are:
> 1. The PR did not add a stress test proving it resolved the issue it was 
> supposed to (though I have no doubt the optimization made is indeed correct).
> 2. The PR introduced quadratic prediction time in terms of the number of 
> trees, which was previously linear. This issue needs to be investigated for 
> whether it causes problems for large numbers of trees (say, 1000), an 
> appropriate test should be added, and then fixed.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19698) Race condition in stale attempt task completion vs current attempt task completion when task is doing persistent state changes

2017-02-23 Thread Jisoo Kim (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15881525#comment-15881525
 ] 

Jisoo Kim commented on SPARK-19698:
---

[~kayousterhout] Thanks for linking the JIRA ticket, I agree that the ticket 
describes a very similar problem that I had. However, I don't think that fixes 
the problem because the PR only deals with a problem in ShuffleMapStage and 
doesn't check the attempt Id in case of ResultStage. In my case, it was 
ResultStage that had the problem. I had run my test with a fix from 
(https://github.com/apache/spark/pull/16620) but it still failed. 

Could you point me to where the driver waits until all tasks finish? I tried 
finding that part but wasn't successful. I don't think the driver shuts down all 
tasks when a job is done; however, the DAGScheduler signals the JobWaiter every 
time it receives a completion event for a task that is responsible for an unfinished 
partition 
(https://github.com/apache/spark/blob/ba8912e5f3d5c5a366cb3d1f6be91f2471d048d2/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1171).
 As a result, the JobWaiter will call success() on the job promise 
(https://github.com/jinxing64/spark/blob/6809d1ff5d09693e961087da35c8f6b3b50fe53c/core/src/main/scala/org/apache/spark/scheduler/JobWaiter.scala#L61)
 before the 2nd task attempt finishes. This would not be a problem if the driver 
waited until all tasks finish and SparkContext didn't return results before then, 
but I haven't found that it does (please correct me if I am 
missing something). I call SparkContext.stop() after I get the result from the 
application, to clean up and upload the event logs so I can view the Spark history 
from the history server. When SparkContext stops, AFAIK it stops the 
driver as well, which shuts down the task scheduler and the executors, and I 
don't think the executors wait to finish their tasks before they shut 
down. Hence, if this happens, the 2nd task attempt will get shut down as well, I 
think.

> Race condition in stale attempt task completion vs current attempt task 
> completion when task is doing persistent state changes
> --
>
> Key: SPARK-19698
> URL: https://issues.apache.org/jira/browse/SPARK-19698
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos, Spark Core
>Affects Versions: 2.0.0
>Reporter: Charles Allen
>
> We have encountered a strange scenario in our production environment. Below 
> is the best guess we have right now as to what's going on.
> Potentially, the final stage of a job has a failure in one of the tasks (such 
> as OOME on the executor) which can cause tasks for that stage to be 
> relaunched in a second attempt.
> https://github.com/apache/spark/blob/v2.1.0/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1155
> keeps track of which tasks have been completed, but does NOT keep track of 
> which attempt those tasks were completed in. As such, we have encountered a 
> scenario where a particular task gets executed twice in different stage 
> attempts, and the DAGScheduler does not consider if the second attempt is 
> still running. This means if the first task attempt succeeded, the second 
> attempt can be cancelled part-way through its run cycle if all other tasks 
> (including the prior failed one) are completed successfully.
> What this means is that if a task is manipulating some state somewhere (for 
> example: an upload-to-temporary-file-location, then delete-then-move on an 
> underlying s3n storage implementation) the driver can improperly shutdown the 
> running (2nd attempt) task between state manipulations, leaving the 
> persistent state in a bad state since the 2nd attempt never got to complete 
> its manipulations, and was terminated prematurely at some arbitrary point in 
> its state change logic (ex: finished the delete but not the move).
> This is using the mesos coarse grained executor. It is unclear if this 
> behavior is limited to the mesos coarse grained executor or not.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19459) ORC tables cannot be read when they contain char/varchar columns

2017-02-23 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-19459:

Fix Version/s: 2.1.1

> ORC tables cannot be read when they contain char/varchar columns
> 
>
> Key: SPARK-19459
> URL: https://issues.apache.org/jira/browse/SPARK-19459
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Herman van Hovell
>Assignee: Herman van Hovell
> Fix For: 2.1.1, 2.2.0
>
>
> Reading from an ORC table which contains char/varchar columns can fail if the 
> table has been created using Spark. This is caused by the fact that Spark 
> internally replaces char and varchar columns with a string column; this 
> causes the ORC reader to use the wrong reader, which eventually causes a 
> ClassCastException.
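
A minimal repro sketch of the scenario in the description (assuming a 
Hive-enabled SparkSession in spark-shell; table and column names are 
illustrative):

{code}
spark.sql("CREATE TABLE orc_char_repro (c CHAR(10), v VARCHAR(20)) STORED AS ORC")
spark.sql("INSERT INTO orc_char_repro VALUES ('abc', 'def')")

// On affected versions this read fails with a ClassCastException, because the
// char/varchar columns were stored as string when the table was written by Spark.
spark.sql("SELECT * FROM orc_char_repro").show()
{code}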



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19716) Dataset should allow by-name resolution for struct type elements in array

2017-02-23 Thread Simeon Simeonov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15881490#comment-15881490
 ] 

Simeon Simeonov commented on SPARK-19716:
-

This is an important issue because it prevents, for datasets, the kind of schema 
evolution that is {{mergeSchema=true}} compatible for dataframes. This means two 
things:

1. Customers currently using dataframes with non-trivial schemas may not be able 
to migrate to datasets.
2. Customers that migrate to (or start with) datasets may be stuck, unable to 
evolve their schemas.
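
For concreteness, a minimal sketch of the failing case from the issue 
description below (assuming a spark-shell session; {{Full}} is a hypothetical 
helper class used only to build the nested schema):

{code}
import spark.implicits._

case class Full(a: Int, b: Int, c: Int)  // hypothetical: matches the struct inside the array
case class Data(a: Int, c: Int)          // subset of the struct's fields, as in the description
case class ComplexData(arr: Seq[Data])

// Schema: arr: array<struct<a: int, b: int, c: int>>
val df = Seq(Tuple1(Seq(Full(1, 2, 3)))).toDF("arr")

// Fails on affected versions: by-name resolution is not applied to the struct
// elements inside the array.
df.as[ComplexData].collect()
{code}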

> Dataset should allow by-name resolution for struct type elements in array
> -
>
> Key: SPARK-19716
> URL: https://issues.apache.org/jira/browse/SPARK-19716
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>
> if we have a DataFrame with schema {{a: int, b: int, c: int}}, and convert it 
> to Dataset with {{case class Data(a: Int, c: Int)}}, it works and we will 
> extract the `a` and `c` columns to build the Data.
> However, if the struct is inside array, e.g. schema is {{arr: array<struct<a: int, b: int, c: int>>}}, and we wanna convert it to Dataset with {{case class 
> ComplexData(arr: Seq[Data])}}, we will fail. The reason is, to allow 
> compatible types, e.g. convert {{a: int}} to {{case class A(a: Long)}}, we 
> will add cast for each field, except struct type field, because struct type 
> is flexible, the number of columns can mismatch. We should probably also skip 
> cast for array and map type.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19635) Feature parity for Chi-square hypothesis testing in MLlib

2017-02-23 Thread Timothy Hunter (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15881471#comment-15881471
 ] 

Timothy Hunter commented on SPARK-19635:


After working on it, I realized that Column operations do not fit the requested 
operations very well. Hypothesis testing requires chaining a UDAF with a UDF and 
then with another UDAF, which is not something that can be expressed inside 
Catalyst by doing {{dataframe.select(test("features"))}}. I am going to propose a 
simpler interface instead (see the design doc above).
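
To illustrate the shape of the computation in plain Scala (not the proposed 
API): a chi-square test is an aggregation to a contingency table, a per-cell 
transform, and then another aggregation, which is exactly the UDAF -> UDF -> 
UDAF chain mentioned above.

{code}
// (label, feature) pairs for a single categorical feature.
val pairs: Seq[(Int, Int)] = Seq((0, 0), (0, 1), (1, 0), (1, 1), (0, 0), (1, 1))

// Stage 1 (aggregate): contingency counts per (label, feature) cell.
val observed: Map[(Int, Int), Double] =
  pairs.groupBy(identity).map { case (cell, xs) => cell -> xs.size.toDouble }

val n = pairs.size.toDouble
val rowTotals = observed.groupBy(_._1._1).map { case (l, m) => l -> m.values.sum }
val colTotals = observed.groupBy(_._1._2).map { case (f, m) => f -> m.values.sum }

// Stage 2 (per-cell transform): (observed - expected)^2 / expected.
val contributions = observed.map { case ((l, f), o) =>
  val e = rowTotals(l) * colTotals(f) / n
  math.pow(o - e, 2) / e
}

// Stage 3 (aggregate again): the chi-square statistic.
val chiSquare = contributions.sum
{code}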

> Feature parity for Chi-square hypothesis testing in MLlib
> -
>
> Key: SPARK-19635
> URL: https://issues.apache.org/jira/browse/SPARK-19635
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Timothy Hunter
>
> This ticket tracks porting the functionality of 
> spark.mllib.Statistics.chiSqTest over to spark.ml.
> Here is a design doc:
> https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit#



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19636) Feature parity for correlation statistics in MLlib

2017-02-23 Thread Timothy Hunter (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15881457#comment-15881457
 ] 

Timothy Hunter commented on SPARK-19636:


After working on it, I realized that Column operations do not fit the requested 
operations very well. Correlations require chaining a UDAF with a UDF and then 
with another UDAF, which is not something that can be expressed inside Catalyst 
by doing {{dataframe.select(corr("features"))}}. I am going to propose a simpler 
interface instead (see the design doc above).

> Feature parity for correlation statistics in MLlib
> --
>
> Key: SPARK-19636
> URL: https://issues.apache.org/jira/browse/SPARK-19636
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Timothy Hunter
>
> This ticket tracks porting the functionality of spark.mllib.Statistics.corr() 
> over to spark.ml.
> Here is a design doc:
> https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit#



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14658) when executor lost DagScheduer may submit one stage twice even if the first running taskset for this stage is not finished

2017-02-23 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout updated SPARK-14658:
---
Fix Version/s: 2.2.0

> when executor lost DagScheduer may submit one stage twice even if the first 
> running taskset for this stage is not finished
> --
>
> Key: SPARK-14658
> URL: https://issues.apache.org/jira/browse/SPARK-14658
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 1.6.1, 2.0.0, 2.1.0, 2.2.0
> Environment: spark1.6.1  hadoop-2.6.0-cdh5.4.2
>Reporter: yixiaohua
> Fix For: 2.2.0
>
>
> {code}
> 16/04/14 15:35:22 ERROR DAGSchedulerEventProcessLoop: 
> DAGSchedulerEventProcessLoop failed; shutting down SparkContext
> java.lang.IllegalStateException: more than one active taskSet for stage 57: 
> 57.2,57.1
> at 
> org.apache.spark.scheduler.TaskSchedulerImpl.submitTasks(TaskSchedulerImpl.scala:173)
> at 
> org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1052)
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:921)
> at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1214)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1637)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> {code}
> First Time:
> {code}
> 16/04/14 15:35:20 INFO DAGScheduler: Resubmitting ShuffleMapStage 57 (run at 
> AccessController.java:-2) because some of its tasks had failed: 5, 8, 9, 12, 
> 13, 16, 17, 18, 19, 23, 26, 27, 28, 29, 30, 31, 40, 42, 43, 48, 49, 50, 51, 
> 52, 53, 55, 56, 57, 59, 60, 61, 67, 70, 71, 84, 85, 86, 87, 98, 99, 100, 101, 
> 108, 109, 110, 111, 112, 113, 114, 115, 126, 127, 134, 136, 137, 146, 147, 
> 150, 151, 154, 155, 158, 159, 162, 163, 164, 165, 166, 167, 170, 171, 172, 
> 173, 174, 175, 176, 177, 178, 179, 180, 181, 188, 189, 190, 191, 198, 199, 
> 204, 206, 207, 208, 218, 219, 222, 223, 230, 231, 236, 238, 239
> 16/04/14 15:35:20 DEBUG DAGScheduler: submitStage(ShuffleMapStage 57)
> 16/04/14 15:35:20 DEBUG DAGScheduler: missing: List()
> 16/04/14 15:35:20 INFO DAGScheduler: Submitting ShuffleMapStage 57 
> (MapPartitionsRDD[7887] at run at AccessController.java:-2), which has no 
> missing parents
> 16/04/14 15:35:20 DEBUG DAGScheduler: submitMissingTasks(ShuffleMapStage 57)
> 16/04/14 15:35:20 INFO DAGScheduler: Submitting 100 missing tasks from 
> ShuffleMapStage 57 (MapPartitionsRDD[7887] at run at AccessController.java:-2)
> 16/04/14 15:35:20 DEBUG DAGScheduler: New pending partitions: Set(206, 177, 
> 127, 98, 48, 27, 23, 163, 238, 188, 159, 28, 109, 59, 9, 176, 126, 207, 174, 
> 43, 170, 208, 158, 108, 29, 8, 204, 154, 223, 173, 219, 190, 111, 61, 40, 
> 136, 115, 86, 57, 155, 55, 230, 222, 180, 172, 151, 101, 18, 166, 56, 137, 
> 87, 52, 171, 71, 42, 167, 198, 67, 17, 236, 165, 13, 5, 53, 178, 99, 70, 49, 
> 218, 147, 164, 114, 85, 60, 31, 179, 150, 19, 100, 50, 175, 146, 134, 113, 
> 84, 51, 30, 199, 26, 16, 191, 162, 112, 12, 239, 231, 189, 181, 110)
> {code}
> Second Time:
> {code}
> 16/04/14 15:35:22 INFO DAGScheduler: Resubmitting ShuffleMapStage 57 (run at 
> AccessController.java:-2) because some of its tasks had failed: 26
> 16/04/14 15:35:22 DEBUG DAGScheduler: submitStage(ShuffleMapStage 57)
> 16/04/14 15:35:22 DEBUG DAGScheduler: missing: List()
> 16/04/14 15:35:22 INFO DAGScheduler: Submitting ShuffleMapStage 57 
> (MapPartitionsRDD[7887] at run at AccessController.java:-2), which has no 
> missing parents
> 16/04/14 15:35:22 DEBUG DAGScheduler: submitMissingTasks(ShuffleMapStage 57)
> 16/04/14 15:35:22 INFO DAGScheduler: Submitting 1 missing tasks from 
> ShuffleMapStage 57 (MapPartitionsRDD[7887] at run at AccessController.java:-2)
> 16/04/14 15:35:22 DEBUG DAGScheduler: New pending partitions: Set(26)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19674) Ignore driver accumulator updates don't belong to the execution when merging all accumulator updates

2017-02-23 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-19674:
---

Assignee: Carson Wang

> Ignore driver accumulator updates don't belong to the execution when merging 
> all accumulator updates
> 
>
> Key: SPARK-19674
> URL: https://issues.apache.org/jira/browse/SPARK-19674
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Carson Wang
>Assignee: Carson Wang
>Priority: Minor
> Fix For: 2.2.0
>
>
> In SQLListener.getExecutionMetrics, driver accumulator updates that don't 
> belong to the execution should be ignored when merging all accumulator 
> updates, to prevent a NoSuchElementException.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19674) Ignore driver accumulator updates don't belong to the execution when merging all accumulator updates

2017-02-23 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-19674.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 17009
[https://github.com/apache/spark/pull/17009]

> Ignore driver accumulator updates don't belong to the execution when merging 
> all accumulator updates
> 
>
> Key: SPARK-19674
> URL: https://issues.apache.org/jira/browse/SPARK-19674
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Carson Wang
>Priority: Minor
> Fix For: 2.2.0
>
>
> In SQLListener.getExecutionMetrics, driver accumulator updates that don't 
> belong to the execution should be ignored when merging all accumulator 
> updates, to prevent a NoSuchElementException.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19718) Fix flaky test: org.apache.spark.sql.kafka010.KafkaSourceStressForDontFailOnDataLossSuite: stress test for failOnDataLoss=false

2017-02-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15881418#comment-15881418
 ] 

Apache Spark commented on SPARK-19718:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/17044

> Fix flaky test: 
> org.apache.spark.sql.kafka010.KafkaSourceStressForDontFailOnDataLossSuite: 
> stress test for failOnDataLoss=false
> ---
>
> Key: SPARK-19718
> URL: https://issues.apache.org/jira/browse/SPARK-19718
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Shixiong Zhu
>
> SPARK-19617 changed HDFSMetadataLog to enable interrupts when using the local 
> file system. However, now we hit HADOOP-12074: `Shell.runCommand` converts 
> `InterruptedException` to `new IOException(ie.toString())` before Hadoop 2.8.
> Test failure: 
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.6/2504/consoleFull
> {code}
> [info] - stress test for failOnDataLoss=false *** FAILED *** (1 minute, 1 
> second)
> [info]   org.apache.spark.sql.streaming.StreamingQueryException: Query [id = 
> 27d45f4f-14dc-4c74-8b52-4bbd4f2b9bec, runId = 
> 23b8c1ea-4da9-4096-967a-692933e4b319] terminated with exception: 
> java.lang.InterruptedException
> [info]   at 
> org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches(StreamExecution.scala:304)
> [info]   at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:190)
> [info]   Cause: java.io.IOException: java.lang.InterruptedException
> [info]   at org.apache.hadoop.util.Shell.runCommand(Shell.java:578)
> [info]   at org.apache.hadoop.util.Shell.run(Shell.java:478)
> [info]   at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:766)
> [info]   at org.apache.hadoop.util.Shell.execCommand(Shell.java:859)
> [info]   at org.apache.hadoop.util.Shell.execCommand(Shell.java:842)
> [info]   at 
> org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:661)
> [info]   at 
> org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:300)
> [info]   at 
> org.apache.hadoop.fs.FileSystem.primitiveCreate(FileSystem.java:1014)
> [info]   at 
> org.apache.hadoop.fs.DelegateToFileSystem.createInternal(DelegateToFileSystem.java:85)
> [info]   at 
> org.apache.hadoop.fs.ChecksumFs$ChecksumFSOutputSummer.<init>(ChecksumFs.java:354)
> [info]   at 
> org.apache.hadoop.fs.ChecksumFs.createInternal(ChecksumFs.java:394)
> [info]   at 
> org.apache.hadoop.fs.AbstractFileSystem.create(AbstractFileSystem.java:577)
> [info]   at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:680)
> [info]   at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:676)
> [info]   at 
> org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
> [info]   at org.apache.hadoop.fs.FileContext.create(FileContext.java:676)
> {code}
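
As a side note, one possible way for calling code to recognize the wrapped 
interrupt (a hypothetical helper for illustration, not the approach taken in the 
pull request above):

{code}
import java.io.IOException

// Before Hadoop 2.8, Shell.runCommand rethrows an interrupt as
// new IOException(ie.toString()), so the remaining signals are the exception
// message and the thread's interrupt flag.
def looksLikeWrappedInterrupt(t: Throwable): Boolean = t match {
  case _: InterruptedException => true
  case e: IOException =>
    Thread.currentThread().isInterrupted ||
      Option(e.getMessage).exists(_.contains("InterruptedException"))
  case _ => false
}
{code}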



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19718) Fix flaky test: org.apache.spark.sql.kafka010.KafkaSourceStressForDontFailOnDataLossSuite: stress test for failOnDataLoss=false

2017-02-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19718:


Assignee: Apache Spark

> Fix flaky test: 
> org.apache.spark.sql.kafka010.KafkaSourceStressForDontFailOnDataLossSuite: 
> stress test for failOnDataLoss=false
> ---
>
> Key: SPARK-19718
> URL: https://issues.apache.org/jira/browse/SPARK-19718
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>
> SPARK-19617 changed HDFSMetadataLog to enable interrupts when using the local 
> file system. However, now we hit HADOOP-12074: `Shell.runCommand` converts 
> `InterruptedException` to `new IOException(ie.toString())` before Hadoop 2.8.
> Test failure: 
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.6/2504/consoleFull
> {code}
> [info] - stress test for failOnDataLoss=false *** FAILED *** (1 minute, 1 
> second)
> [info]   org.apache.spark.sql.streaming.StreamingQueryException: Query [id = 
> 27d45f4f-14dc-4c74-8b52-4bbd4f2b9bec, runId = 
> 23b8c1ea-4da9-4096-967a-692933e4b319] terminated with exception: 
> java.lang.InterruptedException
> [info]   at 
> org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches(StreamExecution.scala:304)
> [info]   at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:190)
> [info]   Cause: java.io.IOException: java.lang.InterruptedException
> [info]   at org.apache.hadoop.util.Shell.runCommand(Shell.java:578)
> [info]   at org.apache.hadoop.util.Shell.run(Shell.java:478)
> [info]   at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:766)
> [info]   at org.apache.hadoop.util.Shell.execCommand(Shell.java:859)
> [info]   at org.apache.hadoop.util.Shell.execCommand(Shell.java:842)
> [info]   at 
> org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:661)
> [info]   at 
> org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:300)
> [info]   at 
> org.apache.hadoop.fs.FileSystem.primitiveCreate(FileSystem.java:1014)
> [info]   at 
> org.apache.hadoop.fs.DelegateToFileSystem.createInternal(DelegateToFileSystem.java:85)
> [info]   at 
> org.apache.hadoop.fs.ChecksumFs$ChecksumFSOutputSummer.<init>(ChecksumFs.java:354)
> [info]   at 
> org.apache.hadoop.fs.ChecksumFs.createInternal(ChecksumFs.java:394)
> [info]   at 
> org.apache.hadoop.fs.AbstractFileSystem.create(AbstractFileSystem.java:577)
> [info]   at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:680)
> [info]   at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:676)
> [info]   at 
> org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
> [info]   at org.apache.hadoop.fs.FileContext.create(FileContext.java:676)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19718) Fix flaky test: org.apache.spark.sql.kafka010.KafkaSourceStressForDontFailOnDataLossSuite: stress test for failOnDataLoss=false

2017-02-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19718:


Assignee: (was: Apache Spark)

> Fix flaky test: 
> org.apache.spark.sql.kafka010.KafkaSourceStressForDontFailOnDataLossSuite: 
> stress test for failOnDataLoss=false
> ---
>
> Key: SPARK-19718
> URL: https://issues.apache.org/jira/browse/SPARK-19718
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Shixiong Zhu
>
> SPARK-19617 changed HDFSMetadataLog to enable interrupts when using the local 
> file system. However, now we hit HADOOP-12074: `Shell.runCommand` converts 
> `InterruptedException` to `new IOException(ie.toString())` before Hadoop 2.8.
> Test failure: 
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.6/2504/consoleFull
> {code}
> [info] - stress test for failOnDataLoss=false *** FAILED *** (1 minute, 1 
> second)
> [info]   org.apache.spark.sql.streaming.StreamingQueryException: Query [id = 
> 27d45f4f-14dc-4c74-8b52-4bbd4f2b9bec, runId = 
> 23b8c1ea-4da9-4096-967a-692933e4b319] terminated with exception: 
> java.lang.InterruptedException
> [info]   at 
> org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches(StreamExecution.scala:304)
> [info]   at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:190)
> [info]   Cause: java.io.IOException: java.lang.InterruptedException
> [info]   at org.apache.hadoop.util.Shell.runCommand(Shell.java:578)
> [info]   at org.apache.hadoop.util.Shell.run(Shell.java:478)
> [info]   at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:766)
> [info]   at org.apache.hadoop.util.Shell.execCommand(Shell.java:859)
> [info]   at org.apache.hadoop.util.Shell.execCommand(Shell.java:842)
> [info]   at 
> org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:661)
> [info]   at 
> org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:300)
> [info]   at 
> org.apache.hadoop.fs.FileSystem.primitiveCreate(FileSystem.java:1014)
> [info]   at 
> org.apache.hadoop.fs.DelegateToFileSystem.createInternal(DelegateToFileSystem.java:85)
> [info]   at 
> org.apache.hadoop.fs.ChecksumFs$ChecksumFSOutputSummer.<init>(ChecksumFs.java:354)
> [info]   at 
> org.apache.hadoop.fs.ChecksumFs.createInternal(ChecksumFs.java:394)
> [info]   at 
> org.apache.hadoop.fs.AbstractFileSystem.create(AbstractFileSystem.java:577)
> [info]   at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:680)
> [info]   at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:676)
> [info]   at 
> org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
> [info]   at org.apache.hadoop.fs.FileContext.create(FileContext.java:676)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14658) when executor lost DagScheduer may submit one stage twice even if the first running taskset for this stage is not finished

2017-02-23 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout resolved SPARK-14658.

Resolution: Duplicate

I'm fairly sure this duplicates SPARK-19263, as Mark mentioned on the PR.  
Check out this comment for a description of what's going on: 
https://github.com/apache/spark/pull/16620#issuecomment-279125227

Josh, feel free to re-open if you think this is a different issue.

> when executor lost DagScheduer may submit one stage twice even if the first 
> running taskset for this stage is not finished
> --
>
> Key: SPARK-14658
> URL: https://issues.apache.org/jira/browse/SPARK-14658
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 1.6.1, 2.0.0, 2.1.0, 2.2.0
> Environment: spark1.6.1  hadoop-2.6.0-cdh5.4.2
>Reporter: yixiaohua
>
> {code}
> 16/04/14 15:35:22 ERROR DAGSchedulerEventProcessLoop: 
> DAGSchedulerEventProcessLoop failed; shutting down SparkContext
> java.lang.IllegalStateException: more than one active taskSet for stage 57: 
> 57.2,57.1
> at 
> org.apache.spark.scheduler.TaskSchedulerImpl.submitTasks(TaskSchedulerImpl.scala:173)
> at 
> org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1052)
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:921)
> at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1214)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1637)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> {code}
> First Time:
> {code}
> 16/04/14 15:35:20 INFO DAGScheduler: Resubmitting ShuffleMapStage 57 (run at 
> AccessController.java:-2) because some of its tasks had failed: 5, 8, 9, 12, 
> 13, 16, 17, 18, 19, 23, 26, 27, 28, 29, 30, 31, 40, 42, 43, 48, 49, 50, 51, 
> 52, 53, 55, 56, 57, 59, 60, 61, 67, 70, 71, 84, 85, 86, 87, 98, 99, 100, 101, 
> 108, 109, 110, 111, 112, 113, 114, 115, 126, 127, 134, 136, 137, 146, 147, 
> 150, 151, 154, 155, 158, 159, 162, 163, 164, 165, 166, 167, 170, 171, 172, 
> 173, 174, 175, 176, 177, 178, 179, 180, 181, 188, 189, 190, 191, 198, 199, 
> 204, 206, 207, 208, 218, 219, 222, 223, 230, 231, 236, 238, 239
> 16/04/14 15:35:20 DEBUG DAGScheduler: submitStage(ShuffleMapStage 57)
> 16/04/14 15:35:20 DEBUG DAGScheduler: missing: List()
> 16/04/14 15:35:20 INFO DAGScheduler: Submitting ShuffleMapStage 57 
> (MapPartitionsRDD[7887] at run at AccessController.java:-2), which has no 
> missing parents
> 16/04/14 15:35:20 DEBUG DAGScheduler: submitMissingTasks(ShuffleMapStage 57)
> 16/04/14 15:35:20 INFO DAGScheduler: Submitting 100 missing tasks from 
> ShuffleMapStage 57 (MapPartitionsRDD[7887] at run at AccessController.java:-2)
> 16/04/14 15:35:20 DEBUG DAGScheduler: New pending partitions: Set(206, 177, 
> 127, 98, 48, 27, 23, 163, 238, 188, 159, 28, 109, 59, 9, 176, 126, 207, 174, 
> 43, 170, 208, 158, 108, 29, 8, 204, 154, 223, 173, 219, 190, 111, 61, 40, 
> 136, 115, 86, 57, 155, 55, 230, 222, 180, 172, 151, 101, 18, 166, 56, 137, 
> 87, 52, 171, 71, 42, 167, 198, 67, 17, 236, 165, 13, 5, 53, 178, 99, 70, 49, 
> 218, 147, 164, 114, 85, 60, 31, 179, 150, 19, 100, 50, 175, 146, 134, 113, 
> 84, 51, 30, 199, 26, 16, 191, 162, 112, 12, 239, 231, 189, 181, 110)
> {code}
> Second Time:
> {code}
> 16/04/14 15:35:22 INFO DAGScheduler: Resubmitting ShuffleMapStage 57 (run at 
> AccessController.java:-2) because some of its tasks had failed: 26
> 16/04/14 15:35:22 DEBUG DAGScheduler: submitStage(ShuffleMapStage 57)
> 16/04/14 15:35:22 DEBUG DAGScheduler: missing: List()
> 16/04/14 15:35:22 INFO DAGScheduler: Submitting ShuffleMapStage 57 
> (MapPartitionsRDD[7887] at run at AccessController.java:-2), which has no 
> missing parents
> 16/04/14 15:35:22 DEBUG DAGScheduler: submitMissingTasks(ShuffleMapStage 57)
> 16/04/14 15:35:22 INFO DAGScheduler: Submitting 1 missing tasks from 
> ShuffleMapStage 57 (MapPartitionsRDD[7887] at run at AccessController.java:-2)
> 16/04/14 15:35:22 DEBUG DAGScheduler: New pending partitions: Set(26)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19263) DAGScheduler should avoid sending conflicting task set.

2017-02-23 Thread Kay Ousterhout (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15881406#comment-15881406
 ] 

Kay Ousterhout commented on SPARK-19263:


Just noting that this was fixed by https://github.com/apache/spark/pull/16620 
(the other PR was accidentally created with the same JIRA ID)

> DAGScheduler should avoid sending conflicting task set.
> ---
>
> Key: SPARK-19263
> URL: https://issues.apache.org/jira/browse/SPARK-19263
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: jin xing
>Assignee: jin xing
> Fix For: 2.2.0
>
>
> In the current *DAGScheduler.handleTaskCompletion* code, when *event.reason* is 
> *Success*, it will first do *stage.pendingPartitions -= task.partitionId*, 
> which may be a bug when *FetchFailed* happens. Consider the scenario below:
> # Stage 0 runs and generates shuffle output data.
> # Stage 1 reads the output from stage 0 and generates more shuffle data. It 
> has two tasks: ShuffleMapTask1 and ShuffleMapTask2, and these tasks are 
> launched on executorA.
> # ShuffleMapTask1 fails to fetch blocks locally and sends a FetchFailed to 
> the driver. The driver marks executorA as lost and updates failedEpoch;
> # The driver resubmits stage 0 so the missing output can be re-generated, and 
> then once it completes, resubmits stage 1 with ShuffleMapTask1x and 
> ShuffleMapTask2x.
> # ShuffleMapTask2 (from the original attempt of stage 1) successfully 
> finishes on executorA and sends Success back to driver. This causes 
> DAGScheduler::handleTaskCompletion to remove partition 2 from 
> stage.pendingPartitions (line 1149), but it does not add the partition to the 
> set of output locations (line 1192), because the task’s epoch is less than 
> the failure epoch for the executor (because of the earlier failure on 
> executor A)
> # ShuffleMapTask1x successfully finishes on executorB, causing the driver to 
> remove partition 1 from stage.pendingPartitions. Combined with the previous 
> step, this means that there are no more pending partitions for the stage, so 
> the DAGScheduler marks the stage as finished (line 1196). However, the 
> shuffle stage is not available (line 1215) because the completion for 
> ShuffleMapTask2 was ignored because of its epoch, so the DAGScheduler 
> resubmits the stage.
> # ShuffleMapTask2x is still running, so when TaskSchedulerImpl::submitTasks 
> is called for the re-submitted stage, it throws an error, because there’s an 
> existing active task set
> To reproduce the bug:
> 1. We need to do some modification in *ShuffleBlockFetcherIterator*: check 
> whether the task's index in *TaskSetManager* and stage attempt equal to 0 at 
> the same time, if so, throw FetchFailedException;
> 2. Rebuild spark then submit following job:
> {code}
> val rdd = sc.parallelize(List((0, 1), (1, 1), (2, 1), (3, 1), (1, 2), (0, 
> 3), (2, 1), (3, 1)), 2)
> rdd.reduceByKey {
>   (v1, v2) => {
> Thread.sleep(1)
> v1 + v2
>   }
> }.map {
>   keyAndValue => {
> (keyAndValue._1 % 2, keyAndValue._2)
>   }
> }.reduceByKey {
>   (v1, v2) => {
> Thread.sleep(1)
> v1 + v2
>   }
> }.collect
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19718) Fix flaky test: org.apache.spark.sql.kafka010.KafkaSourceStressForDontFailOnDataLossSuite: stress test for failOnDataLoss=false

2017-02-23 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-19718:
-
Description: 
SPARK-19617 changed HDFSMetadataLog to enable interrupts when using the local 
file system. However, now we hit HADOOP-12074: `Shell.runCommand` converts 
`InterruptedException` to `new IOException(ie.toString())` before Hadoop 2.8.

Test failure: 
https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.6/2504/consoleFull

{code}
[info] - stress test for failOnDataLoss=false *** FAILED *** (1 minute, 1 
second)
[info]   org.apache.spark.sql.streaming.StreamingQueryException: Query [id = 
27d45f4f-14dc-4c74-8b52-4bbd4f2b9bec, runId = 
23b8c1ea-4da9-4096-967a-692933e4b319] terminated with exception: 
java.lang.InterruptedException
[info]   at 
org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches(StreamExecution.scala:304)
[info]   at 
org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:190)
[info]   Cause: java.io.IOException: java.lang.InterruptedException
[info]   at org.apache.hadoop.util.Shell.runCommand(Shell.java:578)
[info]   at org.apache.hadoop.util.Shell.run(Shell.java:478)
[info]   at 
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:766)
[info]   at org.apache.hadoop.util.Shell.execCommand(Shell.java:859)
[info]   at org.apache.hadoop.util.Shell.execCommand(Shell.java:842)
[info]   at 
org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:661)
[info]   at 
org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:300)
[info]   at 
org.apache.hadoop.fs.FileSystem.primitiveCreate(FileSystem.java:1014)
[info]   at 
org.apache.hadoop.fs.DelegateToFileSystem.createInternal(DelegateToFileSystem.java:85)
[info]   at 
org.apache.hadoop.fs.ChecksumFs$ChecksumFSOutputSummer.<init>(ChecksumFs.java:354)
[info]   at org.apache.hadoop.fs.ChecksumFs.createInternal(ChecksumFs.java:394)
[info]   at 
org.apache.hadoop.fs.AbstractFileSystem.create(AbstractFileSystem.java:577)
[info]   at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:680)
[info]   at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:676)
[info]   at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
[info]   at org.apache.hadoop.fs.FileContext.create(FileContext.java:676)
{code}

> Fix flaky test: 
> org.apache.spark.sql.kafka010.KafkaSourceStressForDontFailOnDataLossSuite: 
> stress test for failOnDataLoss=false
> ---
>
> Key: SPARK-19718
> URL: https://issues.apache.org/jira/browse/SPARK-19718
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Shixiong Zhu
>
> SPARK-19617 changed HDFSMetadataLog to enable interrupts when using the local 
> file system. However, now we hit HADOOP-12074: `Shell.runCommand` converts 
> `InterruptedException` to `new IOException(ie.toString())` before Hadoop 2.8.
> Test failure: 
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.6/2504/consoleFull
> {code}
> [info] - stress test for failOnDataLoss=false *** FAILED *** (1 minute, 1 
> second)
> [info]   org.apache.spark.sql.streaming.StreamingQueryException: Query [id = 
> 27d45f4f-14dc-4c74-8b52-4bbd4f2b9bec, runId = 
> 23b8c1ea-4da9-4096-967a-692933e4b319] terminated with exception: 
> java.lang.InterruptedException
> [info]   at 
> org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches(StreamExecution.scala:304)
> [info]   at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:190)
> [info]   Cause: java.io.IOException: java.lang.InterruptedException
> [info]   at org.apache.hadoop.util.Shell.runCommand(Shell.java:578)
> [info]   at org.apache.hadoop.util.Shell.run(Shell.java:478)
> [info]   at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:766)
> [info]   at org.apache.hadoop.util.Shell.execCommand(Shell.java:859)
> [info]   at org.apache.hadoop.util.Shell.execCommand(Shell.java:842)
> [info]   at 
> org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:661)
> [info]   at 
> org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:300)
> [info]   at 
> org.apache.hadoop.fs.FileSystem.primitiveCreate(FileSystem.java:1014)
> [info]   at 
> org.apache.hadoop.fs.DelegateToFileSystem.createInternal(DelegateToFileSystem.java:85)
> [info]   at 
> org.apache.hadoop.fs.ChecksumFs$ChecksumFSOutputSummer.<init>(ChecksumFs.java:354)
> [info]   at 
> 

[jira] [Comment Edited] (SPARK-19698) Race condition in stale attempt task completion vs current attempt task completion when task is doing persistent state changes

2017-02-23 Thread Kay Ousterhout (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15881358#comment-15881358
 ] 

Kay Ousterhout edited comment on SPARK-19698 at 2/23/17 9:57 PM:
-

I think this is the same issue as SPARK-19263 -- can you check to see if that 
fixes the problem / have you looked at that JIRA?  I wrote a super long 
description of the problem towards the end of the associated PR.

One more note is that right now, Spark won't cancel running task attempts 
(although there's a JIRA to fix this), even when a stage is marked as failed.  
So the exact scenario you described, where the 2nd task attempt gets shut down, 
shouldn't occur (the driver will wait for the 2nd task attempt to complete, but 
will ignore the result).


was (Author: kayousterhout):
I think this is the same issue as SPARK-19263 -- can you check to see if that 
fixes the problem / have you looked at that JIRA?  I wrote a super long 
description of the problem towards the end of the associated PR.

> Race condition in stale attempt task completion vs current attempt task 
> completion when task is doing persistent state changes
> --
>
> Key: SPARK-19698
> URL: https://issues.apache.org/jira/browse/SPARK-19698
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos, Spark Core
>Affects Versions: 2.0.0
>Reporter: Charles Allen
>
> We have encountered a strange scenario in our production environment. Below 
> is the best guess we have right now as to what's going on.
> Potentially, the final stage of a job has a failure in one of the tasks (such 
> as OOME on the executor) which can cause tasks for that stage to be 
> relaunched in a second attempt.
> https://github.com/apache/spark/blob/v2.1.0/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1155
> keeps track of which tasks have been completed, but does NOT keep track of 
> which attempt those tasks were completed in. As such, we have encountered a 
> scenario where a particular task gets executed twice in different stage 
> attempts, and the DAGScheduler does not consider if the second attempt is 
> still running. This means if the first task attempt succeeded, the second 
> attempt can be cancelled part-way through its run cycle if all other tasks 
> (including the prior failed one) are completed successfully.
> What this means is that if a task is manipulating some state somewhere (for 
> example: an upload-to-temporary-file-location, then delete-then-move on an 
> underlying s3n storage implementation) the driver can improperly shutdown the 
> running (2nd attempt) task between state manipulations, leaving the 
> persistent state in a bad state since the 2nd attempt never got to complete 
> its manipulations, and was terminated prematurely at some arbitrary point in 
> its state change logic (ex: finished the delete but not the move).
> This is using the mesos coarse grained executor. It is unclear if this 
> behavior is limited to the mesos coarse grained executor or not.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19698) Race condition in stale attempt task completion vs current attempt task completion when task is doing persistent state changes

2017-02-23 Thread Kay Ousterhout (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15881358#comment-15881358
 ] 

Kay Ousterhout commented on SPARK-19698:


I think this is the same issue as SPARK-19263 -- can you check to see if that 
fixes the problem / have you looked at that JIRA?  I wrote a super long 
description of the problem towards the end of the associated PR.

> Race condition in stale attempt task completion vs current attempt task 
> completion when task is doing persistent state changes
> --
>
> Key: SPARK-19698
> URL: https://issues.apache.org/jira/browse/SPARK-19698
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos, Spark Core
>Affects Versions: 2.0.0
>Reporter: Charles Allen
>
> We have encountered a strange scenario in our production environment. Below 
> is the best guess we have right now as to what's going on.
> Potentially, the final stage of a job has a failure in one of the tasks (such 
> as OOME on the executor) which can cause tasks for that stage to be 
> relaunched in a second attempt.
> https://github.com/apache/spark/blob/v2.1.0/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1155
> keeps track of which tasks have been completed, but does NOT keep track of 
> which attempt those tasks were completed in. As such, we have encountered a 
> scenario where a particular task gets executed twice in different stage 
> attempts, and the DAGScheduler does not consider if the second attempt is 
> still running. This means if the first task attempt succeeded, the second 
> attempt can be cancelled part-way through its run cycle if all other tasks 
> (including the prior failed one) are completed successfully.
> What this means is that if a task is manipulating some state somewhere (for 
> example: an upload-to-temporary-file-location, then delete-then-move on an 
> underlying s3n storage implementation) the driver can improperly shutdown the 
> running (2nd attempt) task between state manipulations, leaving the 
> persistent state in a bad state since the 2nd attempt never got to complete 
> its manipulations, and was terminated prematurely at some arbitrary point in 
> its state change logic (ex: finished the delete but not the move).
> This is using the mesos coarse grained executor. It is unclear if this 
> behavior is limited to the mesos coarse grained executor or not.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19719) Structured Streaming write to Kafka

2017-02-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15881346#comment-15881346
 ] 

Apache Spark commented on SPARK-19719:
--

User 'tcondie' has created a pull request for this issue:
https://github.com/apache/spark/pull/17043
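
For context, a rough sketch of the intended usage (based on the linked pull 
request; the source below and the option values are placeholders, and the final 
option names may differ):

{code}
// Any streaming DataFrame works as a source; a socket source keeps the example small.
val lines = spark.readStream
  .format("socket").option("host", "localhost").option("port", 9999)
  .load()  // yields a single string column named "value"

val query = lines
  .selectExpr("CAST(value AS STRING) AS value")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("topic", "events")
  .option("checkpointLocation", "/tmp/kafka-sink-checkpoint")
  .start()
{code}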

> Structured Streaming write to Kafka
> ---
>
> Key: SPARK-19719
> URL: https://issues.apache.org/jira/browse/SPARK-19719
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Tyson Condie
>
> This issue deals with writing to Apache Kafka for both streaming and batch 
> queries. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19719) Structured Streaming write to Kafka

2017-02-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19719:


Assignee: Apache Spark

> Structured Streaming write to Kafka
> ---
>
> Key: SPARK-19719
> URL: https://issues.apache.org/jira/browse/SPARK-19719
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Tyson Condie
>Assignee: Apache Spark
>
> This issue deals with writing to Apache Kafka for both streaming and batch 
> queries. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19719) Structured Streaming write to Kafka

2017-02-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19719:


Assignee: (was: Apache Spark)

> Structured Streaming write to Kafka
> ---
>
> Key: SPARK-19719
> URL: https://issues.apache.org/jira/browse/SPARK-19719
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Tyson Condie
>
> This issue deals with writing to Apache Kafka for both streaming and batch 
> queries. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19719) Structured Streaming write to Kafka

2017-02-23 Thread Tyson Condie (JIRA)
Tyson Condie created SPARK-19719:


 Summary: Structured Streaming write to Kafka
 Key: SPARK-19719
 URL: https://issues.apache.org/jira/browse/SPARK-19719
 Project: Spark
  Issue Type: New Feature
  Components: Structured Streaming
Affects Versions: 2.1.0
Reporter: Tyson Condie


This issue deals with writing to Apache Kafka for both streaming and batch 
queries. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19717) Expanding Spark ML under Different Namespace

2017-02-23 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-19717.
---
Resolution: Duplicate

> Expanding Spark ML under Different Namespace
> 
>
> Key: SPARK-19717
> URL: https://issues.apache.org/jira/browse/SPARK-19717
> Project: Spark
>  Issue Type: Wish
>  Components: ML, MLlib
>Affects Versions: 2.1.0
>Reporter: Shouheng Yi
>Priority: Minor
>
> This ticket corresponds to a previous email thread on the dev list: 
> [Spark Namespace]: Expanding Spark ML under Different Namespace?
> The concern is about changing the access modifiers of some classes/traits from 
> "private [spark]" to public.
> Right now, the options for dealing with this issue are:
> 1. write or copy your own implementations
> 2. work under org.apache, but with comments that address this issue
> Details can be found in the dev list email archive.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-19717) Expanding Spark ML under Different Namespace

2017-02-23 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reopened SPARK-19717:
---

> Expanding Spark ML under Different Namespace
> 
>
> Key: SPARK-19717
> URL: https://issues.apache.org/jira/browse/SPARK-19717
> Project: Spark
>  Issue Type: Wish
>  Components: ML, MLlib
>Affects Versions: 2.1.0
>Reporter: Shouheng Yi
>Priority: Minor
>
> This ticket corresponds to a previous email thread on the dev list: 
> [Spark Namespace]: Expanding Spark ML under Different Namespace?
> The concern is about changing the access modifiers of some classes/traits from 
> "private [spark]" to public.
> Right now, the options for dealing with this issue are:
> 1. write or copy your own implementations
> 2. work under org.apache, but with comments that address this issue
> Details can be found in the dev list email archive.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-19717) Expanding Spark ML under Different Namespace

2017-02-23 Thread Shouheng Yi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shouheng Yi closed SPARK-19717.
---
Resolution: Fixed

Duplicate of https://issues.apache.org/jira/browse/SPARK-19498

> Expanding Spark ML under Different Namespace
> 
>
> Key: SPARK-19717
> URL: https://issues.apache.org/jira/browse/SPARK-19717
> Project: Spark
>  Issue Type: Wish
>  Components: ML, MLlib
>Affects Versions: 2.1.0
>Reporter: Shouheng Yi
>Priority: Minor
>
> This ticket corresponds to a previous email thread on the dev list: 
> [Spark Namespace]: Expanding Spark ML under Different Namespace?
> The concern is about changing the access modifiers of some classes/traits from 
> "private [spark]" to public.
> Right now, the options for dealing with this issue are:
> 1. write or copy your own implementations
> 2. work under org.apache, but with comments that address this issue
> Details can be found in the dev list email archive.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19717) Expanding Spark ML under Different Namespace

2017-02-23 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15881331#comment-15881331
 ] 

Sean Owen commented on SPARK-19717:
---

I don't know that this should be a JIRA. What are you specifically asking to 
open up, and why? Those sorts of details need to be here.

> Expanding Spark ML under Different Namespace
> 
>
> Key: SPARK-19717
> URL: https://issues.apache.org/jira/browse/SPARK-19717
> Project: Spark
>  Issue Type: Wish
>  Components: ML, MLlib
>Affects Versions: 2.1.0
>Reporter: Shouheng Yi
>Priority: Minor
>
> This ticket corresponds to a previous email thread on the dev list: 
> [Spark Namespace]: Expanding Spark ML under Different Namespace?
> The concern is about changing the access modifiers of some classes/traits from 
> "private [spark]" to public.
> Right now, the options for dealing with this issue are:
> 1. write or copy your own implementations
> 2. work under org.apache, but with comments that address this issue
> Details can be found in the dev list email archive.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19718) Fix flaky test: org.apache.spark.sql.kafka010.KafkaSourceStressForDontFailOnDataLossSuite: stress test for failOnDataLoss=false

2017-02-23 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-19718:


 Summary: Fix flaky test: 
org.apache.spark.sql.kafka010.KafkaSourceStressForDontFailOnDataLossSuite: 
stress test for failOnDataLoss=false
 Key: SPARK-19718
 URL: https://issues.apache.org/jira/browse/SPARK-19718
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.2.0
Reporter: Shixiong Zhu






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19716) Dataset should allow by-name resolution for struct type elements in array

2017-02-23 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-19716:

Description: 
if we have a DataFrame with schema {{a: int, b: int, c: int}}, and convert it 
to Dataset with {{case class Data(a: Int, c: Int)}}, it works and we will 
extract the `a` and `c` columns to build the Data.

However, if the struct is inside array, e.g. schema is {{arr: array<struct<a: int, b: int, c: int>>}}, and we wanna convert it to Dataset with {{case class 
ComplexData(arr: Seq[Data])}}, we will fail. The reason is, to allow compatible 
types, e.g. convert {{a: int}} to {{case class A(a: Long)}}, we will add cast 
for each field, except struct type field, because struct type is flexible, the 
number of columns can mismatch. We should probably also skip cast for array and 
map type.

  was:
if we have a DataFrame with schema {{a: int, b: int, c: int}}, and convert it 
to Dataset with {{case class Data(a: Int, c: Int)}}, it works and we will 
extract the `a` and `c` columns to build the Data.

However, if the struct is inside array, e.g. schema is {{arr: array<struct<a: int, b: int, c: int>>}}, and we wanna convert it to Dataset with {{case class 
ComplexData(arr: Seq[Data])}}, we will fail. The reason is, we will add cast 
for each field, except struct type field, because struct type is flexible, the 
number of columns can mismatch. We should probably also skip cast for array and 
map type.


> Dataset should allow by-name resolution for struct type elements in array
> -
>
> Key: SPARK-19716
> URL: https://issues.apache.org/jira/browse/SPARK-19716
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>
> if we have a DataFrame with schema {{a: int, b: int, c: int}}, and convert it 
> to Dataset with {{case class Data(a: Int, c: Int)}}, it works and we will 
> extract the `a` and `c` columns to build the Data.
> However, if the struct is inside array, e.g. schema is {{arr: array<struct<a: int, b: int, c: int>>}}, and we wanna convert it to Dataset with {{case class 
> ComplexData(arr: Seq[Data])}}, we will fail. The reason is, to allow 
> compatible types, e.g. convert {{a: int}} to {{case class A(a: Long)}}, we 
> will add cast for each field, except struct type field, because struct type 
> is flexible, the number of columns can mismatch. We should probably also skip 
> cast for array and map type.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19716) Dataset should allow by-name resolution for struct type elements in array

2017-02-23 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-19716:

Description: 
if we have a DataFrame with schema {{a: int, b: int, c: int}}, and convert it 
to Dataset with {{case class Data(a: Int, c: Int)}}, it works and we will 
extract the `a` and `c` columns to build the Data.

However, if the struct is inside array, e.g. schema is {{arr: array<struct<a: int, b: int, c: int>>}}, and we wanna convert it to Dataset with {{case class 
ComplexData(arr: Seq[Data])}}, we will fail. The reason is, we will add cast 
for each field, except struct type field, because struct type is flexible, the 
number of columns can mismatch. We should probably also skip cast for array and 
map type.

  was:
if we have a DataFrame with schema {{a: int, b: int, c: int}}, and convert it 
to Dataset with {{case class Data(a: Int, c: Int)}}, it works and we will 
extract the `a` and `c` columns to build the Data.

However, if the struct is inside array, e.g. schema is {{arr: array<struct<a: int, b: int, c: int>>}}, and we wanna convert it to Dataset with {{case class 
ComplexData(arr: Seq[Data])}}, we will fail. we should support this case.


> Dataset should allow by-name resolution for struct type elements in array
> -
>
> Key: SPARK-19716
> URL: https://issues.apache.org/jira/browse/SPARK-19716
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>
> if we have a DataFrame with schema {{a: int, b: int, c: int}}, and convert it 
> to Dataset with {{case class Data(a: Int, c: Int)}}, it works and we will 
> extract the `a` and `c` columns to build the Data.
> However, if the struct is inside array, e.g. schema is {{arr: array<struct<a: int, b: int, c: int>>}}, and we wanna convert it to Dataset with {{case class 
> ComplexData(arr: Seq[Data])}}, we will fail. The reason is, we will add cast 
> for each field, except struct type field, because struct type is flexible, 
> the number of columns can mismatch. We should probably also skip cast for 
> array and map type.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19717) Expanding Spark ML under Different Namespace

2017-02-23 Thread Shouheng Yi (JIRA)
Shouheng Yi created SPARK-19717:
---

 Summary: Expanding Spark ML under Different Namespace
 Key: SPARK-19717
 URL: https://issues.apache.org/jira/browse/SPARK-19717
 Project: Spark
  Issue Type: Wish
  Components: ML, MLlib
Affects Versions: 2.1.0
Reporter: Shouheng Yi
Priority: Minor


This ticket corresponds to a previous email thread on the dev list: [Spark 
Namespace]: Expanding Spark ML under Different Namespace?

The concern is about changing the access modifiers of some classes/traits from 
"private [spark]" to public.

Right now, the options for dealing with this issue are:
1. write or copy your own implementations
2. work under org.apache, but with comments that address this issue

Details can be found in the dev list email archive.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19684) Move info about running specific tests to developer website

2017-02-23 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout resolved SPARK-19684.

   Resolution: Fixed
Fix Version/s: 2.2.0

> Move info about running specific tests to developer website
> ---
>
> Key: SPARK-19684
> URL: https://issues.apache.org/jira/browse/SPARK-19684
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 1.1.1
>Reporter: Kay Ousterhout
>Assignee: Kay Ousterhout
>Priority: Minor
> Fix For: 2.2.0
>
>
> This JIRA accompanies this change to the website: 
> https://github.com/apache/spark-website/pull/33.
> Running individual tests is not something that changes with new versions of 
> the project, and is primarily used by developers (not users) so should be 
> moved to the developer-tools page of the main website (with a link from the 
> building-spark page on the release-specific docs).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


