[jira] [Commented] (SPARK-15888) Python UDF over aggregate fails

2016-06-10 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15325733#comment-15325733
 ] 

Davies Liu commented on SPARK-15888:


After some investigation, it turned out that a Python UDF over an aggregate 
function cannot be extracted and inserted BEFORE the aggregate; it should be 
inserted AFTER the aggregate.

A logical aggregate will become multiple physical aggregates, so maybe it's 
better to add another rule for the logical plan (and keep the current rule for 
the physical plan).
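
For illustration, a minimal sketch of the failing pattern (assumed from the issue 
title, not taken from the linked notebook): a Python UDF applied to the result of 
an aggregate expression.

{code}
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

# Hypothetical data; `spark` is an existing SparkSession (or SQLContext in 1.6).
df = spark.createDataFrame([(1, 1.0), (1, 2.0), (2, 3.0)], ["key", "value"])

plus_one = F.udf(lambda x: x + 1.0, DoubleType())

# The Python UDF wraps sum("value"). Extracting the UDF BEFORE the aggregate
# cannot work, because the aggregate's output does not exist yet at that point;
# it has to be evaluated AFTER the aggregate, as described above.
df.groupBy("key").agg(plus_one(F.sum("value")).alias("s")).show()
{code}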

> Python UDF over aggregate fails
> ---
>
> Key: SPARK-15888
> URL: https://issues.apache.org/jira/browse/SPARK-15888
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.0.0
>Reporter: Vladimir Feinberg
>
> This looks like a regression from 1.6.1.
> The following notebook runs without error in a Spark 1.6.1 cluster, but fails 
> in 2.0.0:
> https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/6001574963454425/3194562079278586/1653464426712019/latest.html






[jira] [Updated] (SPARK-15888) Python UDF over aggregate fails

2016-06-10 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-15888:
---
Summary: Python UDF over aggregate fails  (was: UDF fails in Python)

> Python UDF over aggregate fails
> ---
>
> Key: SPARK-15888
> URL: https://issues.apache.org/jira/browse/SPARK-15888
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.0.0
>Reporter: Vladimir Feinberg
>
> This looks like a regression from 1.6.1.
> The following notebook runs without error in a Spark 1.6.1 cluster, but fails 
> in 2.0.0:
> https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/6001574963454425/3194562079278586/1653464426712019/latest.html






[jira] [Commented] (SPARK-15894) Add doc to control #partition for input files

2016-06-10 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15325729#comment-15325729
 ] 

Takeshi Yamamuro commented on SPARK-15894:
--

cc: [~rxin] [~davies]

> Add doc to control #partition for input files
> -
>
> Key: SPARK-15894
> URL: https://issues.apache.org/jira/browse/SPARK-15894
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Takeshi Yamamuro
>Priority: Trivial
>
> I recently saw some users ask questions on spark-users about how to control 
> #partitions for input files without DataFrame#repartition.
> https://www.mail-archive.com/user@spark.apache.org/msg51603.html
> https://www.mail-archive.com/user@spark.apache.org/msg51742.html
> Although the two parameters `spark.sql.files.maxPartitionBytes` and 
> `spark.sql.files.openCostInBytes` are internal ones, they seem useful for 
> users that sometimes need to control this.
> So, trivial as it is, I think we'd be better off adding documentation for the 
> two parameters in `sql-programming-guide`.
> Thoughts?






[jira] [Commented] (SPARK-15894) Add doc to control #partition for input files

2016-06-10 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15325726#comment-15325726
 ] 

Takeshi Yamamuro commented on SPARK-15894:
--

The patch looks like this: 
https://github.com/apache/spark/compare/master...maropu:SPARK-15894.

> Add doc to control #partition for input files
> -
>
> Key: SPARK-15894
> URL: https://issues.apache.org/jira/browse/SPARK-15894
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Takeshi Yamamuro
>Priority: Trivial
>
> I recently saw some users ask questions on spark-users about how to control 
> #partitions for input files without DataFrame#repartition.
> https://www.mail-archive.com/user@spark.apache.org/msg51603.html
> https://www.mail-archive.com/user@spark.apache.org/msg51742.html
> Although the two parameters `spark.sql.files.maxPartitionBytes` and 
> `spark.sql.files.openCostInBytes` are internal ones, they seem useful for 
> users that sometimes need to control this.
> So, trivial as it is, I think we'd be better off adding documentation for the 
> two parameters in `sql-programming-guide`.
> Thoughts?






[jira] [Created] (SPARK-15894) Add doc to control #partition for input files

2016-06-10 Thread Takeshi Yamamuro (JIRA)
Takeshi Yamamuro created SPARK-15894:


 Summary: Add doc to control #partition for input files
 Key: SPARK-15894
 URL: https://issues.apache.org/jira/browse/SPARK-15894
 Project: Spark
  Issue Type: Documentation
  Components: SQL
Affects Versions: 1.6.1
Reporter: Takeshi Yamamuro
Priority: Trivial


I recently saw some users ask questions on spark-users about how to control 
#partitions for input files without DataFrame#repartition.

https://www.mail-archive.com/user@spark.apache.org/msg51603.html
https://www.mail-archive.com/user@spark.apache.org/msg51742.html

Although the two parameters `spark.sql.files.maxPartitionBytes` and 
`spark.sql.files.openCostInBytes` are internal ones, they seem useful for 
users that sometimes need to control this.
So, trivial as it is, I think we'd be better off adding documentation for the 
two parameters in `sql-programming-guide`.
Thoughts?
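
For illustration, a minimal sketch of how these two settings could be used from 
PySpark (the values and the input path are assumptions for the example, not 
recommendations):

{code}
# Assumes a Spark 2.0 SparkSession named `spark`.
spark.conf.set("spark.sql.files.maxPartitionBytes", 64 * 1024 * 1024)  # target split size
spark.conf.set("spark.sql.files.openCostInBytes", 4 * 1024 * 1024)     # estimated cost of opening a file

df = spark.read.parquet("/path/to/input")  # hypothetical input path
print(df.rdd.getNumPartitions())           # partition count now follows the settings above
{code}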






[jira] [Commented] (SPARK-13207) _SUCCESS should not break partition discovery

2016-06-10 Thread Simeon Simeonov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15325718#comment-15325718
 ] 

Simeon Simeonov commented on SPARK-13207:
-

[~yhuai] The PR associated with that ticket explicitly calls out {{_metadata}} 
and {{_common_metadata}} as not excluded. I am wondering why that PR will fix 
this issue... Can you add a test to demonstrate that this is fixed?

> _SUCCESS should not break partition discovery
> -
>
> Key: SPARK-13207
> URL: https://issues.apache.org/jira/browse/SPARK-13207
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
>  Labels: backport-needed
> Fix For: 1.6.2, 2.0.0
>
>
> Partitioning discovery will fail with the following case
> {code}
> test("_SUCCESS should not break partitioning discovery") {
>   withTempPath { dir =>
>     val tablePath = new File(dir, "table")
>     val df = (1 to 3).map(i => (i, i, i, i)).toDF("a", "b", "c", "d")
>     df.write
>       .format("parquet")
>       .partitionBy("b", "c", "d")
>       .save(tablePath.getCanonicalPath)
>     Files.touch(new File(s"${tablePath.getCanonicalPath}/b=1", "_SUCCESS"))
>     Files.touch(new File(s"${tablePath.getCanonicalPath}/b=1/c=1", "_SUCCESS"))
>     Files.touch(new File(s"${tablePath.getCanonicalPath}/b=1/c=1/d=1", "_SUCCESS"))
>     checkAnswer(sqlContext.read.format("parquet").load(tablePath.getCanonicalPath), df)
>   }
> }
> {code}
> Because {{_SUCCESS}} is in the inner partitioning dirs, partitioning 
> discovery will fail.






[jira] [Resolved] (SPARK-15759) Fallback to non-codegen if fail to compile generated code

2016-06-10 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-15759.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 13501
[https://github.com/apache/spark/pull/13501]

> Fallback to non-codegen if fail to compile generated code
> -
>
> Key: SPARK-15759
> URL: https://issues.apache.org/jira/browse/SPARK-15759
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Davies Liu
>Assignee: Davies Liu
> Fix For: 2.0.0
>
>
> If anything goes wrong in whole-stage codegen, we should temporarily disable 
> it for that part of the query to make sure that the query can still run.






[jira] [Assigned] (SPARK-15639) Try to push down filter at RowGroups level for parquet reader

2016-06-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15639:


Assignee: Apache Spark  (was: Liang-Chi Hsieh)

> Try to push down filter at RowGroups level for parquet reader
> -
>
> Key: SPARK-15639
> URL: https://issues.apache.org/jira/browse/SPARK-15639
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Liang-Chi Hsieh
>Assignee: Apache Spark
> Fix For: 2.0.0
>
>
> When we use the vectorized parquet reader, although the base reader (i.e., 
> SpecificParquetRecordReaderBase) will retrieve pushed-down filters for 
> RowGroups-level filtering, it seems we never actually set up the filters to be 
> pushed down.






[jira] [Assigned] (SPARK-15639) Try to push down filter at RowGroups level for parquet reader

2016-06-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15639:


Assignee: Liang-Chi Hsieh  (was: Apache Spark)

> Try to push down filter at RowGroups level for parquet reader
> -
>
> Key: SPARK-15639
> URL: https://issues.apache.org/jira/browse/SPARK-15639
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
> Fix For: 2.0.0
>
>
> When we use the vectorized parquet reader, although the base reader (i.e., 
> SpecificParquetRecordReaderBase) will retrieve pushed-down filters for 
> RowGroups-level filtering, it seems we never actually set up the filters to be 
> pushed down.






[jira] [Commented] (SPARK-15585) Don't use null in data source options to indicate default value

2016-06-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15325686#comment-15325686
 ] 

Apache Spark commented on SPARK-15585:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/13616

> Don't use null in data source options to indicate default value
> ---
>
> Key: SPARK-15585
> URL: https://issues.apache.org/jira/browse/SPARK-15585
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Priority: Critical
>
> See email: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/changed-behavior-for-csv-datasource-and-quoting-in-spark-2-0-0-SNAPSHOT-td17704.html
> We'd need to change DataFrameReader/DataFrameWriter in Python's 
> csv/json/parquet/... functions to put the actual default option values as 
> function parameters, rather than setting them to None. We can then make 
> CSVOptions.getChar (and JSONOptions, etc.) actually return null if the 
> value is null, rather than setting it to the default value.
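
For illustration, a rough sketch of the direction described above, using the csv 
reader in Python (trimmed down; the default values shown are assumptions for the 
example, not the values the eventual fix settles on):

{code}
# Sketch only: the real DataFrameReader.csv takes many more options.
# Today these parameters default to None and the JVM side fills in the real
# defaults, so an explicit None cannot be distinguished from "not set". With
# real defaults in the Python signature, None can genuinely mean "no such
# character" when CSVOptions.getChar reads the option.
def csv(self, path, sep=',', quote='"', escape='\\', header=False):
    pass
{code}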






[jira] [Updated] (SPARK-15678) Not use cache on appends and overwrites

2016-06-10 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-15678:
---
Assignee: Sameer Agarwal

> Not use cache on appends and overwrites
> ---
>
> Key: SPARK-15678
> URL: https://issues.apache.org/jira/browse/SPARK-15678
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Sameer Agarwal
>Assignee: Sameer Agarwal
> Fix For: 2.0.0
>
>
> SparkSQL currently doesn't drop caches if the underlying data is overwritten.
> {code}
> val dir = "/tmp/test"
> sqlContext.range(1000).write.mode("overwrite").parquet(dir)
> val df = sqlContext.read.parquet(dir).cache()
> df.count() // outputs 1000
> sqlContext.range(10).write.mode("overwrite").parquet(dir)
> sqlContext.read.parquet(dir).count() // outputs 1000 instead of 10 <-- We 
> are still using the cached dataset
> {code}






[jira] [Resolved] (SPARK-15678) Not use cache on appends and overwrites

2016-06-10 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-15678.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 13566
[https://github.com/apache/spark/pull/13566]

> Not use cache on appends and overwrites
> ---
>
> Key: SPARK-15678
> URL: https://issues.apache.org/jira/browse/SPARK-15678
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Sameer Agarwal
> Fix For: 2.0.0
>
>
> SparkSQL currently doesn't drop caches if the underlying data is overwritten.
> {code}
> val dir = "/tmp/test"
> sqlContext.range(1000).write.mode("overwrite").parquet(dir)
> val df = sqlContext.read.parquet(dir).cache()
> df.count() // outputs 1000
> sqlContext.range(10).write.mode("overwrite").parquet(dir)
> sqlContext.read.parquet(dir).count() // outputs 1000 instead of 10 <-- We 
> are still using the cached dataset
> {code}






[jira] [Reopened] (SPARK-15639) Try to push down filter at RowGroups level for parquet reader

2016-06-10 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian reopened SPARK-15639:


We've decided to revert the merged PR, so reopening it.

> Try to push down filter at RowGroups level for parquet reader
> -
>
> Key: SPARK-15639
> URL: https://issues.apache.org/jira/browse/SPARK-15639
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
> Fix For: 2.0.0
>
>
> When we use the vectorized parquet reader, although the base reader (i.e., 
> SpecificParquetRecordReaderBase) will retrieve pushed-down filters for 
> RowGroups-level filtering, it seems we never actually set up the filters to be 
> pushed down.






[jira] [Updated] (SPARK-15369) Investigate selectively using Jython for parts of PySpark

2016-06-10 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk updated SPARK-15369:

Description: Transferring data from the JVM to the Python executor can be a 
substantial bottleneck. While Jython is not suitable for all UDFs or map 
functions, it may be suitable for some simple ones. We should investigate the 
option of using Jython to accelerate these small functions.  (was: Transfering 
data from the JVM to the Python executor can be a substantial bottleneck. While 
JYthon is not suitable for all UDFs or map functions, it may be suitable for 
some simple ones. We should investigate the option of using JYthon to 
accelerate these small functions.)

> Investigate selectively using Jython for parts of PySpark
> -
>
> Key: SPARK-15369
> URL: https://issues.apache.org/jira/browse/SPARK-15369
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: holdenk
>Priority: Minor
>
> Transferring data from the JVM to the Python executor can be a substantial 
> bottleneck. While Jython is not suitable for all UDFs or map functions, it 
> may be suitable for some simple ones. We should investigate the option of 
> using Jython to accelerate these small functions.






[jira] [Commented] (SPARK-12661) Drop Python 2.6 support in PySpark

2016-06-10 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15325678#comment-15325678
 ] 

holdenk commented on SPARK-12661:
-

What are we missing to drop 2.6 support? We could keep the legacy 2.6 support 
code while still dropping it from the supported list and jenkins.

> Drop Python 2.6 support in PySpark
> --
>
> Key: SPARK-12661
> URL: https://issues.apache.org/jira/browse/SPARK-12661
> Project: Spark
>  Issue Type: Task
>  Components: PySpark
>Reporter: Davies Liu
>  Labels: releasenotes
>
> 1. stop testing with 2.6
> 2. remove the code for Python 2.6
> see discussion: 
> https://www.mail-archive.com/user@spark.apache.org/msg43423.html
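
For context, the kind of Python 2.6 compatibility shim this would allow removing 
(an illustrative example of the pattern; the exact occurrences in the PySpark 
code base may differ):

{code}
import sys

# Python 2.6 lacks several unittest features, so tests fall back to the
# unittest2 backport there; dropping 2.6 support lets shims like this go away.
if sys.version_info[:2] <= (2, 6):
    try:
        import unittest2 as unittest
    except ImportError:
        sys.stderr.write('Please install unittest2 to test with Python 2.6 or earlier\n')
        sys.exit(1)
else:
    import unittest
{code}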






[jira] [Updated] (SPARK-15819) Add KMeanSummary in KMeans of PySpark

2016-06-10 Thread Jeff Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang updated SPARK-15819:
---
Component/s: PySpark
 ML

> Add KMeanSummary in KMeans of PySpark
> -
>
> Key: SPARK-15819
> URL: https://issues.apache.org/jira/browse/SPARK-15819
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.0.0
>Reporter: Jeff Zhang
>
> There's no corresponding Python API for KMeansSummary; it would be nice to 
> have it. 
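
For illustration, a rough sketch of how the desired Python API might be used, 
mirroring the Scala-side KMeansSummary (the summary attribute names here are 
assumptions about the API to be added, not an existing PySpark API):

{code}
from pyspark.ml.clustering import KMeans

kmeans = KMeans(k=2, seed=1)
model = kmeans.fit(dataset)  # `dataset` is an existing DataFrame of features

# Hypothetical summary accessors, mirroring the Scala KMeansSummary.
if model.hasSummary:
    summary = model.summary
    print(summary.k)             # number of clusters
    print(summary.clusterSizes)  # number of rows assigned to each cluster
{code}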






[jira] [Closed] (SPARK-15751) Add generateAssociationRules in fpm in pyspark

2016-06-10 Thread Jeff Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang closed SPARK-15751.
--
Resolution: Won't Fix

> Add generateAssociationRules in fpm in pyspark
> --
>
> Key: SPARK-15751
> URL: https://issues.apache.org/jira/browse/SPARK-15751
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, PySpark
>Affects Versions: 2.0.0
>Reporter: Jeff Zhang
>
> There's no API for generating association rules in PySpark for now. Please 
> close this if there's already an existing JIRA tracking it. 






[jira] [Updated] (SPARK-15639) Try to push down filter at RowGroups level for parquet reader

2016-06-10 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-15639:
---
Assignee: Liang-Chi Hsieh

> Try to push down filter at RowGroups level for parquet reader
> -
>
> Key: SPARK-15639
> URL: https://issues.apache.org/jira/browse/SPARK-15639
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
> Fix For: 2.0.0
>
>
> When we use the vectorized parquet reader, although the base reader (i.e., 
> SpecificParquetRecordReaderBase) will retrieve pushed-down filters for 
> RowGroups-level filtering, it seems we never actually set up the filters to be 
> pushed down.






[jira] [Updated] (SPARK-15639) Try to push down filter at RowGroups level for parquet reader

2016-06-10 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-15639:
---
Affects Version/s: 2.0.0

> Try to push down filter at RowGroups level for parquet reader
> -
>
> Key: SPARK-15639
> URL: https://issues.apache.org/jira/browse/SPARK-15639
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
> Fix For: 2.0.0
>
>
> When we use the vectorized parquet reader, although the base reader (i.e., 
> SpecificParquetRecordReaderBase) will retrieve pushed-down filters for 
> RowGroups-level filtering, it seems we never actually set up the filters to be 
> pushed down.






[jira] [Resolved] (SPARK-15639) Try to push down filter at RowGroups level for parquet reader

2016-06-10 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-15639.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 13371
[https://github.com/apache/spark/pull/13371]

> Try to push down filter at RowGroups level for parquet reader
> -
>
> Key: SPARK-15639
> URL: https://issues.apache.org/jira/browse/SPARK-15639
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Liang-Chi Hsieh
> Fix For: 2.0.0
>
>
> When we use the vectorized parquet reader, although the base reader (i.e., 
> SpecificParquetRecordReaderBase) will retrieve pushed-down filters for 
> RowGroups-level filtering, it seems we never actually set up the filters to be 
> pushed down.






[jira] [Created] (SPARK-15893) spark.createDataFrame raises an exception in Spark 2.0 tests on Windows

2016-06-10 Thread Alexander Ulanov (JIRA)
Alexander Ulanov created SPARK-15893:


 Summary: spark.createDataFrame raises an exception in Spark 2.0 
tests on Windows
 Key: SPARK-15893
 URL: https://issues.apache.org/jira/browse/SPARK-15893
 Project: Spark
  Issue Type: Bug
  Components: Tests
Affects Versions: 2.0.0
Reporter: Alexander Ulanov


spark.createDataFrame raises an exception in Spark 2.0 tests on Windows

For example, LogisticRegressionSuite fails at Line 46:
Exception encountered when invoking run on a nested suite - 
java.net.URISyntaxException: Relative path in absolute URI: 
file:C:/dev/spark/external/flume-assembly/spark-warehouse
java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path 
in absolute URI: file:C:/dev/spark/external/flume-assembly/spark-warehouse
at org.apache.hadoop.fs.Path.initialize(Path.java:206)
at org.apache.hadoop.fs.Path.<init>(Path.java:172)
at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog.makeQualifiedPath(SessionCatalog.scala:109)


Another example, DataFrameSuite raises:
java.net.URISyntaxException: Relative path in absolute URI: 
file:C:/dev/spark/external/flume-assembly/spark-warehouse
java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path 
in absolute URI: file:C:/dev/spark/external/flume-assembly/spark-warehouse
at org.apache.hadoop.fs.Path.initialize(Path.java:206)
at org.apache.hadoop.fs.Path.<init>(Path.java:172)








[jira] [Commented] (SPARK-15822) segmentation violation in o.a.s.unsafe.types.UTF8String

2016-06-10 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15325567#comment-15325567
 ] 

Herman van Hovell commented on SPARK-15822:
---

[~robbinspg] I have tried to reproduce the problem on my side using your 
code/dataset with 2 workers, but I unfortunately could not reproduce the issue. 
I do have a few follow-up questions:
* What version of 2.0 are you on (git commit hash)?
* Could you post the plan? I am curious which side of the join is producing 
these illegal values.


> segmentation violation in o.a.s.unsafe.types.UTF8String 
> 
>
> Key: SPARK-15822
> URL: https://issues.apache.org/jira/browse/SPARK-15822
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
> Environment: linux amd64
> openjdk version "1.8.0_91"
> OpenJDK Runtime Environment (build 1.8.0_91-b14)
> OpenJDK 64-Bit Server VM (build 25.91-b14, mixed mode)
>Reporter: Pete Robbins
>Assignee: Herman van Hovell
>Priority: Blocker
>
> Executors fail with segmentation violation while running application with
> spark.memory.offHeap.enabled true
> spark.memory.offHeap.size 512m
> Also now reproduced with 
> spark.memory.offHeap.enabled false
> {noformat}
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  SIGSEGV (0xb) at pc=0x7f4559b4d4bd, pid=14182, tid=139935319750400
> #
> # JRE version: OpenJDK Runtime Environment (8.0_91-b14) (build 1.8.0_91-b14)
> # Java VM: OpenJDK 64-Bit Server VM (25.91-b14 mixed mode linux-amd64 
> compressed oops)
> # Problematic frame:
> # J 4816 C2 
> org.apache.spark.unsafe.types.UTF8String.compareTo(Lorg/apache/spark/unsafe/types/UTF8String;)I
>  (64 bytes) @ 0x7f4559b4d4bd [0x7f4559b4d460+0x5d]
> {noformat}
> We initially saw this with IBM Java on a PowerPC box, but it is recreatable on 
> Linux with OpenJDK. On Linux with IBM Java 8 we see a null pointer exception at 
> the same code point:
> {noformat}
> 16/06/08 11:14:58 ERROR Executor: Exception in task 1.0 in stage 5.0 (TID 48)
> java.lang.NullPointerException
>   at 
> org.apache.spark.unsafe.types.UTF8String.compareTo(UTF8String.java:831)
>   at org.apache.spark.unsafe.types.UTF8String.compare(UTF8String.java:844)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.findNextInnerJoinRows$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$doExecute$2$$anon$2.hasNext(WholeStageCodegenExec.scala:377)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> scala.collection.convert.Wrappers$IteratorWrapper.hasNext(Wrappers.scala:30)
>   at org.spark_project.guava.collect.Ordering.leastOf(Ordering.java:664)
>   at org.apache.spark.util.collection.Utils$.takeOrdered(Utils.scala:37)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1$$anonfun$30.apply(RDD.scala:1365)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1$$anonfun$30.apply(RDD.scala:1362)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:757)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:757)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1153)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>   at java.lang.Thread.run(Thread.java:785)
> {noformat}






[jira] [Created] (SPARK-15892) aft_survival_regression.py example fails in branch-2.0

2016-06-10 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-15892:
-

 Summary: aft_survival_regression.py example fails in branch-2.0
 Key: SPARK-15892
 URL: https://issues.apache.org/jira/browse/SPARK-15892
 Project: Spark
  Issue Type: Bug
  Components: Examples, ML, PySpark
Affects Versions: 2.0.0
Reporter: Joseph K. Bradley


Running the example (after the fix in 
[https://github.com/apache/spark/pull/13393]) causes this failure:

{code}
Traceback (most recent call last):  
  File 
"/Users/josephkb/spark/examples/src/main/python/ml/aft_survival_regression.py", 
line 49, in <module>
model = aft.fit(training)
  File "/Users/josephkb/spark/python/lib/pyspark.zip/pyspark/ml/base.py", line 
64, in fit
  File "/Users/josephkb/spark/python/lib/pyspark.zip/pyspark/ml/wrapper.py", 
line 213, in _fit
  File "/Users/josephkb/spark/python/lib/pyspark.zip/pyspark/ml/wrapper.py", 
line 210, in _fit_java
  File 
"/Users/josephkb/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", 
line 933, in __call__
  File "/Users/josephkb/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", 
line 79, in deco
pyspark.sql.utils.IllegalArgumentException: u'requirement failed: The number of 
instances should be greater than 0.0, but got 0.'
{code}







[jira] [Commented] (SPARK-15888) UDF fails in Python

2016-06-10 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15325558#comment-15325558
 ] 

Xiao Li commented on SPARK-15888:
-

Sure, will do it. Thanks!

> UDF fails in Python
> ---
>
> Key: SPARK-15888
> URL: https://issues.apache.org/jira/browse/SPARK-15888
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.0.0
>Reporter: Vladimir Feinberg
>
> This looks like a regression from 1.6.1.
> The following notebook runs without error in a Spark 1.6.1 cluster, but fails 
> in 2.0.0:
> https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/6001574963454425/3194562079278586/1653464426712019/latest.html






[jira] [Updated] (SPARK-15888) UDF fails in Python

2016-06-10 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-15888:

Component/s: SQL

> UDF fails in Python
> ---
>
> Key: SPARK-15888
> URL: https://issues.apache.org/jira/browse/SPARK-15888
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.0.0
>Reporter: Vladimir Feinberg
>
> This looks like a regression from 1.6.1.
> The following notebook runs without error in a Spark 1.6.1 cluster, but fails 
> in 2.0.0:
> https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/6001574963454425/3194562079278586/1653464426712019/latest.html






[jira] [Created] (SPARK-15891) Make YARN logs less noisy

2016-06-10 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-15891:
--

 Summary: Make YARN logs less noisy
 Key: SPARK-15891
 URL: https://issues.apache.org/jira/browse/SPARK-15891
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 2.0.0
Reporter: Marcelo Vanzin
Priority: Minor


Spark can generate a lot of logs when running in YARN mode. The problem is 
already a little bad with a normal configuration, but it is even worse with 
dynamic allocation on.

The first problem is that for every executor Spark launches, it will print the 
whole command and all the env variables it's setting, even though those are 
exactly the same for every executor. That's not too bad with a handful of 
executors, but it gets annoying pretty soon after that. Dynamic allocation makes 
that problem worse since executors are constantly being started and then 
going away.

Also, there's a lot of logging generated by the dynamic allocation backend code 
in the YARN module. We should audit those messages, make sure they all make 
sense, and decide whether / how to reduce the amount of logging.






[jira] [Resolved] (SPARK-15884) Override stringArgs method in MapPartitionsInR case class in order to avoid Out Of Memory exceptions when calling toString

2016-06-10 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-15884.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 13610
[https://github.com/apache/spark/pull/13610]

> Override stringArgs method in MapPartitionsInR case class in order to avoid 
> Out Of Memory exceptions when calling toString
> ---
>
> Key: SPARK-15884
> URL: https://issues.apache.org/jira/browse/SPARK-15884
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR, SQL
>Reporter: Narine Kokhlikyan
>Assignee: Narine Kokhlikyan
> Fix For: 2.0.0
>
>
> As discussed in https://github.com/apache/spark/pull/12836
> we need to override the stringArgs method in MapPartitionsInR in order to 
> avoid excessively large strings being generated by "stringArgs" from the 
> input arguments. 
> In this case we exclude some of the input arguments: the serialized R objects.






[jira] [Updated] (SPARK-15884) Override stringArgs method in MapPartitionsInR case class in order to avoid Out Of Memory exceptions when calling toString

2016-06-10 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-15884:
---
Assignee: Narine Kokhlikyan

> Override stringArgs method in MapPartitionsInR case class in order to avoid 
> Out Of Memory exceptions when calling toString
> ---
>
> Key: SPARK-15884
> URL: https://issues.apache.org/jira/browse/SPARK-15884
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR, SQL
>Reporter: Narine Kokhlikyan
>Assignee: Narine Kokhlikyan
> Fix For: 2.0.0
>
>
> As discussed in https://github.com/apache/spark/pull/12836
> we need to override the stringArgs method in MapPartitionsInR in order to 
> avoid excessively large strings being generated by "stringArgs" from the 
> input arguments. 
> In this case we exclude some of the input arguments: the serialized R objects.






[jira] [Assigned] (SPARK-15889) Add a unique id to ContinuousQuery

2016-06-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15889:


Assignee: Tathagata Das  (was: Apache Spark)

> Add a unique id to ContinuousQuery
> --
>
> Key: SPARK-15889
> URL: https://issues.apache.org/jira/browse/SPARK-15889
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Streaming
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>
> ContinuousQueries have names that are unique across all the active ones. 
> However, when queries are rapidly restarted with the same name, it causes race 
> conditions with the listener. A listener event from a stopped query can 
> arrive after the query has been restarted, leading to complexities in 
> monitoring infrastructure.






[jira] [Assigned] (SPARK-15889) Add a unique id to ContinuousQuery

2016-06-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15889:


Assignee: Apache Spark  (was: Tathagata Das)

> Add a unique id to ContinuousQuery
> --
>
> Key: SPARK-15889
> URL: https://issues.apache.org/jira/browse/SPARK-15889
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Streaming
>Reporter: Tathagata Das
>Assignee: Apache Spark
>
> ContinuousQueries have names that are unique across all the active ones. 
> However, when queries are rapidly restarted with the same name, it causes race 
> conditions with the listener. A listener event from a stopped query can 
> arrive after the query has been restarted, leading to complexities in 
> monitoring infrastructure.






[jira] [Commented] (SPARK-15889) Add a unique id to ContinuousQuery

2016-06-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15325518#comment-15325518
 ] 

Apache Spark commented on SPARK-15889:
--

User 'tdas' has created a pull request for this issue:
https://github.com/apache/spark/pull/13613

> Add a unique id to ContinuousQuery
> --
>
> Key: SPARK-15889
> URL: https://issues.apache.org/jira/browse/SPARK-15889
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Streaming
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>
> ContinuousQueries have names that are unique across all the active ones. 
> However, when queries are rapidly restarted with the same name, it causes race 
> conditions with the listener. A listener event from a stopped query can 
> arrive after the query has been restarted, leading to complexities in 
> monitoring infrastructure.






[jira] [Commented] (SPARK-14501) spark.ml parity for fpm - frequent items

2016-06-10 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15325515#comment-15325515
 ] 

Jeff Zhang commented on SPARK-14501:


working on it. 

> spark.ml parity for fpm - frequent items
> 
>
> Key: SPARK-14501
> URL: https://issues.apache.org/jira/browse/SPARK-14501
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML
>Reporter: Joseph K. Bradley
>
> This is an umbrella for porting the spark.mllib.fpm subpackage to spark.ml.
> I am initially creating a single subtask, which will require a brief design 
> doc for the DataFrame-based API.






[jira] [Created] (SPARK-15890) Support Stata-like tabulation of values in a single column, optionally with weights

2016-06-10 Thread Shafique Jamal (JIRA)
Shafique Jamal created SPARK-15890:
--

 Summary: Support Stata-like tabulation of values in a single 
column, optionally with weights
 Key: SPARK-15890
 URL: https://issues.apache.org/jira/browse/SPARK-15890
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Shafique Jamal
Priority: Minor


In Stata, one can tabulate the values in a single column of a dataset, and 
provide weights. For example if your data looks like this:

     +------------------+
     | id   gender   w  |
     |------------------|
  1. |  1      M     2  |
  2. |  2      M     4  |
  3. |  3      M     1  |
  4. |  4      F     1  |
  5. |  5      F     3  |
     +------------------+

(where w is weight), you can tabulate the values of gender and get this result:

. tab gender

     gender |      Freq.     Percent        Cum.
------------+-----------------------------------
          F |          2       40.00       40.00
          M |          3       60.00      100.00
------------+-----------------------------------
      Total |          5      100.00

You can apply weights to this tabulation as follows:

. tab gender [aw=w]

     gender |      Freq.     Percent        Cum.
------------+-----------------------------------
          F | 1.81818182       36.36       36.36
          M | 3.18181818       63.64      100.00
------------+-----------------------------------
      Total |          5      100.00

I would like to have the same capability with Spark dataframes. Here is what I 
have done:

https://github.com/shafiquejamal/spark/commit/24ed3151db1ed2188ad67b2b5ccbf2883adf7af2

This allows me to do the following:

val obs1 = ("1", "M", 10, "P", 2d)
val obs2 = ("2", "M", 12, "S", 4d)
val obs3 = ("3", "M", 13, "B", 1d)
val obs4 = ("4", "F", 11, "P", 1d)
val obs5 = ("5", "F", 13, "M", 3d)
val df = Seq(obs1, obs2, obs3, obs4, obs5).toDF("id", "gender", "age", 
"educ", "w")

val tabWithoutWeights = df.stat.tab("gender")
val tabWithWeights = df.stat.tab("gender", "w")

tabWithoutWeights.show()
tabWithWeights.show()

This yields the following:

+------+-------------+---------+----------+
|gender|count(gender)|Frequency|Proportion|
+------+-------------+---------+----------+
|     F|            2|      2.0|       0.4|
|     M|            3|      3.0|       0.6|
+------+-------------+---------+----------+

+------+-------------+------------------+-------------------+
|gender|count(gender)|         Frequency|         Proportion|
+------+-------------+------------------+-------------------+
|     F|            2|1.8181818181818181|0.36363636363636365|
|     M|            3|3.1818181818181817| 0.6363636363636364|
+------+-------------+------------------+-------------------+









[jira] [Created] (SPARK-15889) Add a unique id to ContinuousQuery

2016-06-10 Thread Tathagata Das (JIRA)
Tathagata Das created SPARK-15889:
-

 Summary: Add a unique id to ContinuousQuery
 Key: SPARK-15889
 URL: https://issues.apache.org/jira/browse/SPARK-15889
 Project: Spark
  Issue Type: Sub-task
Reporter: Tathagata Das
Assignee: Tathagata Das


ContinuousQueries have names that are unique across all the active ones. 
However, when queries are rapidly restarted with the same name, it causes race 
conditions with the listener. A listener event from a stopped query can arrive 
after the query has been restarted, leading to complexities in monitoring 
infrastructure.






[jira] [Commented] (SPARK-15851) Spark 2.0 does not compile in Windows 7

2016-06-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15325470#comment-15325470
 ] 

Apache Spark commented on SPARK-15851:
--

User 'avulanov' has created a pull request for this issue:
https://github.com/apache/spark/pull/13612

> Spark 2.0 does not compile in Windows 7
> ---
>
> Key: SPARK-15851
> URL: https://issues.apache.org/jira/browse/SPARK-15851
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.0.0
> Environment: Windows 7
>Reporter: Alexander Ulanov
>
> Spark does not compile in Windows 7.
> "mvn compile" fails on spark-core due to trying to execute a bash script 
> spark-build-info.
> Workaround:
> 1) Install win-bash and put it on the PATH
> 2) Change line 350 of core/pom.xml
> 
>   
>   
>   
> 
> Error trace:
> [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-antrun-plugin:1.8:run (default) on project 
> spark-core_2.11: An Ant BuildException has occured: Execute failed: 
> java.io.IOException: Cannot run program 
> "C:\dev\spark\core\..\build\spark-build-info" (in directory 
> "C:\dev\spark\core"): CreateProcess error=193, %1 is not a valid Win32 
> application
> [ERROR] around Ant part ... executable="C:\dev\spark\core/../build/spark-build-info">... @ 4:73 in 
> C:\dev\spark\core\target\antrun\build-main.xml






[jira] [Assigned] (SPARK-15851) Spark 2.0 does not compile in Windows 7

2016-06-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15851:


Assignee: Apache Spark

> Spark 2.0 does not compile in Windows 7
> ---
>
> Key: SPARK-15851
> URL: https://issues.apache.org/jira/browse/SPARK-15851
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.0.0
> Environment: Windows 7
>Reporter: Alexander Ulanov
>Assignee: Apache Spark
>
> Spark does not compile in Windows 7.
> "mvn compile" fails on spark-core due to trying to execute a bash script 
> spark-build-info.
> Workaround:
> 1) Install win-bash and put it on the PATH
> 2) Change line 350 of core/pom.xml
> 
>   
>   
>   
> 
> Error trace:
> [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-antrun-plugin:1.8:run (default) on project 
> spark-core_2.11: An Ant BuildException has occured: Execute failed: 
> java.io.IOException: Cannot run program 
> "C:\dev\spark\core\..\build\spark-build-info" (in directory 
> "C:\dev\spark\core"): CreateProcess error=193, %1 is not a valid Win32 
> application
> [ERROR] around Ant part ... executable="C:\dev\spark\core/../build/spark-build-info">... @ 4:73 in 
> C:\dev\spark\core\target\antrun\build-main.xml






[jira] [Assigned] (SPARK-15851) Spark 2.0 does not compile in Windows 7

2016-06-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15851:


Assignee: (was: Apache Spark)

> Spark 2.0 does not compile in Windows 7
> ---
>
> Key: SPARK-15851
> URL: https://issues.apache.org/jira/browse/SPARK-15851
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.0.0
> Environment: Windows 7
>Reporter: Alexander Ulanov
>
> Spark does not compile in Windows 7.
> "mvn compile" fails on spark-core due to trying to execute a bash script 
> spark-build-info.
> Workaround:
> 1) Install win-bash and put it on the PATH
> 2) Change line 350 of core/pom.xml
> 
>   
>   
>   
> 
> Error trace:
> [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-antrun-plugin:1.8:run (default) on project 
> spark-core_2.11: An Ant BuildException has occured: Execute failed: 
> java.io.IOException: Cannot run program 
> "C:\dev\spark\core\..\build\spark-build-info" (in directory 
> "C:\dev\spark\core"): CreateProcess error=193, %1 is not a valid Win32 
> application
> [ERROR] around Ant part ... executable="C:\dev\spark\core/../build/spark-build-info">... @ 4:73 in 
> C:\dev\spark\core\target\antrun\build-main.xml






[jira] [Commented] (SPARK-15822) segmentation violation in o.a.s.unsafe.types.UTF8String

2016-06-10 Thread Adam Roberts (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15325465#comment-15325465
 ] 

Adam Roberts commented on SPARK-15822:
--

I added a link above to the dataset; it's 658 MB when extracted from the bz2, 
which is 109 MB. I can reproduce the problem by doing head -100 2008.csv > 
lessrows.csv and using this as the first argument (then we only deal with a 
94 MB shortened version of the original file).

> segmentation violation in o.a.s.unsafe.types.UTF8String 
> 
>
> Key: SPARK-15822
> URL: https://issues.apache.org/jira/browse/SPARK-15822
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
> Environment: linux amd64
> openjdk version "1.8.0_91"
> OpenJDK Runtime Environment (build 1.8.0_91-b14)
> OpenJDK 64-Bit Server VM (build 25.91-b14, mixed mode)
>Reporter: Pete Robbins
>Assignee: Herman van Hovell
>Priority: Blocker
>
> Executors fail with segmentation violation while running application with
> spark.memory.offHeap.enabled true
> spark.memory.offHeap.size 512m
> Also now reproduced with 
> spark.memory.offHeap.enabled false
> {noformat}
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  SIGSEGV (0xb) at pc=0x7f4559b4d4bd, pid=14182, tid=139935319750400
> #
> # JRE version: OpenJDK Runtime Environment (8.0_91-b14) (build 1.8.0_91-b14)
> # Java VM: OpenJDK 64-Bit Server VM (25.91-b14 mixed mode linux-amd64 
> compressed oops)
> # Problematic frame:
> # J 4816 C2 
> org.apache.spark.unsafe.types.UTF8String.compareTo(Lorg/apache/spark/unsafe/types/UTF8String;)I
>  (64 bytes) @ 0x7f4559b4d4bd [0x7f4559b4d460+0x5d]
> {noformat}
> We initially saw this with IBM Java on a PowerPC box, but it is recreatable on 
> Linux with OpenJDK. On Linux with IBM Java 8 we see a null pointer exception at 
> the same code point:
> {noformat}
> 16/06/08 11:14:58 ERROR Executor: Exception in task 1.0 in stage 5.0 (TID 48)
> java.lang.NullPointerException
>   at 
> org.apache.spark.unsafe.types.UTF8String.compareTo(UTF8String.java:831)
>   at org.apache.spark.unsafe.types.UTF8String.compare(UTF8String.java:844)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.findNextInnerJoinRows$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$doExecute$2$$anon$2.hasNext(WholeStageCodegenExec.scala:377)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> scala.collection.convert.Wrappers$IteratorWrapper.hasNext(Wrappers.scala:30)
>   at org.spark_project.guava.collect.Ordering.leastOf(Ordering.java:664)
>   at org.apache.spark.util.collection.Utils$.takeOrdered(Utils.scala:37)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1$$anonfun$30.apply(RDD.scala:1365)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1$$anonfun$30.apply(RDD.scala:1362)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:757)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:757)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1153)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>   at java.lang.Thread.run(Thread.java:785)
> {noformat}






[jira] [Updated] (SPARK-15888) UDF fails in Python

2016-06-10 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-15888:
-
Shepherd: Davies Liu

> UDF fails in Python
> ---
>
> Key: SPARK-15888
> URL: https://issues.apache.org/jira/browse/SPARK-15888
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Vladimir Feinberg
>
> This looks like a regression from 1.6.1.
> The following notebook runs without error in a Spark 1.6.1 cluster, but fails 
> in 2.0.0:
> https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/6001574963454425/3194562079278586/1653464426712019/latest.html






[jira] [Commented] (SPARK-15888) UDF fails in Python

2016-06-10 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15325432#comment-15325432
 ] 

Yin Huai commented on SPARK-15888:
--

[~davies] I am putting you as the shepherd.

> UDF fails in Python
> ---
>
> Key: SPARK-15888
> URL: https://issues.apache.org/jira/browse/SPARK-15888
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Vladimir Feinberg
>
> This looks like a regression from 1.6.1.
> The following notebook runs without error in a Spark 1.6.1 cluster, but fails 
> in 2.0.0:
> https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/6001574963454425/3194562079278586/1653464426712019/latest.html






[jira] [Resolved] (SPARK-15773) Avoid creating local variable `sc` in examples if possible

2016-06-10 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-15773.
-
   Resolution: Fixed
 Assignee: Dongjoon Hyun
Fix Version/s: 2.0.0

> Avoid creating local variable `sc` in examples if possible
> --
>
> Key: SPARK-15773
> URL: https://issues.apache.org/jira/browse/SPARK-15773
> Project: Spark
>  Issue Type: Task
>  Components: Examples
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 2.0.0
>
>
> Instead of using a local variable `sc` as in the following example, this issue 
> uses `spark.sparkContext`. This makes the examples more concise, and also fixes 
> a misleading pattern, i.e., creating a SparkContext from a SparkSession.
> {code}
> -println("Creating SparkContext")
> -val sc = spark.sparkContext
> -
>  println("Writing local file to DFS")
>  val dfsFilename = dfsDirPath + "/dfs_read_write_test"
> -val fileRDD = sc.parallelize(fileContents)
> +val fileRDD = spark.sparkContext.parallelize(fileContents)
> {code}
> This will change 12 files (+30 lines, -52 lines).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15822) segmentation violation in o.a.s.unsafe.types.UTF8String

2016-06-10 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15325417#comment-15325417
 ] 

Davies Liu commented on SPARK-15822:


The latest stack trace is different from the previous one; it seems that the 
UnsafeRow in the aggregate hash map is corrupted.

How large is your dataset? It would be great if we could reproduce it (no luck 
so far).

> segmentation violation in o.a.s.unsafe.types.UTF8String 
> 
>
> Key: SPARK-15822
> URL: https://issues.apache.org/jira/browse/SPARK-15822
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
> Environment: linux amd64
> openjdk version "1.8.0_91"
> OpenJDK Runtime Environment (build 1.8.0_91-b14)
> OpenJDK 64-Bit Server VM (build 25.91-b14, mixed mode)
>Reporter: Pete Robbins
>Assignee: Herman van Hovell
>Priority: Blocker
>
> Executors fail with a segmentation violation while running an application with
> spark.memory.offHeap.enabled true
> spark.memory.offHeap.size 512m
> Also now reproduced with 
> spark.memory.offHeap.enabled false
> {noformat}
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  SIGSEGV (0xb) at pc=0x7f4559b4d4bd, pid=14182, tid=139935319750400
> #
> # JRE version: OpenJDK Runtime Environment (8.0_91-b14) (build 1.8.0_91-b14)
> # Java VM: OpenJDK 64-Bit Server VM (25.91-b14 mixed mode linux-amd64 
> compressed oops)
> # Problematic frame:
> # J 4816 C2 
> org.apache.spark.unsafe.types.UTF8String.compareTo(Lorg/apache/spark/unsafe/types/UTF8String;)I
>  (64 bytes) @ 0x7f4559b4d4bd [0x7f4559b4d460+0x5d]
> {noformat}
> We initially saw this with IBM Java on a PowerPC box, but it is recreatable on 
> Linux with OpenJDK. On Linux with IBM Java 8 we see a null pointer exception at 
> the same code point:
> {noformat}
> 16/06/08 11:14:58 ERROR Executor: Exception in task 1.0 in stage 5.0 (TID 48)
> java.lang.NullPointerException
>   at 
> org.apache.spark.unsafe.types.UTF8String.compareTo(UTF8String.java:831)
>   at org.apache.spark.unsafe.types.UTF8String.compare(UTF8String.java:844)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.findNextInnerJoinRows$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$doExecute$2$$anon$2.hasNext(WholeStageCodegenExec.scala:377)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> scala.collection.convert.Wrappers$IteratorWrapper.hasNext(Wrappers.scala:30)
>   at org.spark_project.guava.collect.Ordering.leastOf(Ordering.java:664)
>   at org.apache.spark.util.collection.Utils$.takeOrdered(Utils.scala:37)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1$$anonfun$30.apply(RDD.scala:1365)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1$$anonfun$30.apply(RDD.scala:1362)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:757)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:757)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1153)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>   at java.lang.Thread.run(Thread.java:785)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15888) UDF fails in Python

2016-06-10 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15325413#comment-15325413
 ] 

Yin Huai commented on SPARK-15888:
--

[~smilegator] anyone from your side has time to take a look at this?

> UDF fails in Python
> ---
>
> Key: SPARK-15888
> URL: https://issues.apache.org/jira/browse/SPARK-15888
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Vladimir Feinberg
>
> This looks like a regression from 1.6.1.
> The following notebook runs without error in a Spark 1.6.1 cluster, but fails 
> in 2.0.0:
> https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/6001574963454425/3194562079278586/1653464426712019/latest.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15888) UDF fails in Python

2016-06-10 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-15888:
-
Target Version/s: 2.0.0

> UDF fails in Python
> ---
>
> Key: SPARK-15888
> URL: https://issues.apache.org/jira/browse/SPARK-15888
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Vladimir Feinberg
>
> This looks like a regression from 1.6.1.
> The following notebook runs without error in a Spark 1.6.1 cluster, but fails 
> in 2.0.0:
> https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/6001574963454425/3194562079278586/1653464426712019/latest.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15888) UDF fails in Python

2016-06-10 Thread Vladimir Feinberg (JIRA)
Vladimir Feinberg created SPARK-15888:
-

 Summary: UDF fails in Python
 Key: SPARK-15888
 URL: https://issues.apache.org/jira/browse/SPARK-15888
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.0.0
Reporter: Vladimir Feinberg


This looks like a regression from 1.6.1.

The following notebook runs without error in a Spark 1.6.1 cluster, but fails 
in 2.0.0:

https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/6001574963454425/3194562079278586/1653464426712019/latest.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15581) MLlib 2.1 Roadmap

2016-06-10 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15325402#comment-15325402
 ] 

Reynold Xin commented on SPARK-15581:
-

Note that there is a big, non-ML factor for Breeze: it is a large dependency 
and difficult to upgrade for newer Scala versions (or to maintain support for 
very old Scala versions). cc [~joshrosen].



> MLlib 2.1 Roadmap
> -
>
> Key: SPARK-15581
> URL: https://issues.apache.org/jira/browse/SPARK-15581
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, MLlib
>Reporter: Joseph K. Bradley
>Priority: Blocker
>  Labels: roadmap
>
> This is a master list for MLlib improvements we are working on for the next 
> release. Please view this as a wish list rather than a definite plan, for we 
> don't have an accurate estimate of available resources. Due to limited review 
> bandwidth, features appearing on this list will get higher priority during 
> code review. But feel free to suggest new items to the list in comments. We 
> are experimenting with this process. Your feedback would be greatly 
> appreciated.
> h1. Instructions
> h2. For contributors:
> * Please read 
> https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark 
> carefully. Code style, documentation, and unit tests are important.
> * If you are a first-time Spark contributor, please always start with a 
> [starter task|https://issues.apache.org/jira/issues/?filter=12333209] rather 
> than a medium/big feature. Based on our experience, mixing the development 
> process with a big feature usually causes long delays in code review.
> * Never work silently. Let everyone know on the corresponding JIRA page when 
> you start working on some features. This is to avoid duplicate work. For 
> small features, you don't need to wait to get JIRA assigned.
> * For medium/big features or features with dependencies, please get assigned 
> first before coding and keep the ETA updated on the JIRA. If there is no 
> activity on the JIRA page for a certain amount of time, the JIRA should be 
> released for other contributors.
> * Do not claim multiple (>3) JIRAs at the same time. Try to finish them one 
> after another.
> * Remember to add the `@Since("VERSION")` annotation to new public APIs.
> * Please review others' PRs (https://spark-prs.appspot.com/#mllib). Code 
> review greatly helps to improve others' code as well as yours.
> h2. For committers:
> * Try to break down big features into small and specific JIRA tasks and link 
> them properly.
> * Add a "starter" label to starter tasks.
> * Put a rough estimate for medium/big features and track the progress.
> * If you start reviewing a PR, please add yourself to the Shepherd field on 
> JIRA.
> * If the code looks good to you, please comment "LGTM". For non-trivial PRs, 
> please ping a maintainer to make a final pass.
> * After merging a PR, create and link JIRAs for Python, example code, and 
> documentation if applicable.
> h1. Roadmap (*WIP*)
> This is NOT [a complete list of MLlib JIRAs for 2.1| 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20component%20in%20(ML%2C%20MLlib%2C%20SparkR%2C%20GraphX)%20AND%20%22Target%20Version%2Fs%22%20%3D%202.1.0%20AND%20(fixVersion%20is%20EMPTY%20OR%20fixVersion%20!%3D%202.1.0)%20AND%20(Resolution%20is%20EMPTY%20OR%20Resolution%20in%20(Done%2C%20Fixed%2C%20Implemented))%20ORDER%20BY%20priority].
>  We only include umbrella JIRAs and high-level tasks.
> Major efforts in this release:
> * Feature parity for the DataFrames-based API (`spark.ml`), relative to the 
> RDD-based API
> * ML persistence
> * Python API feature parity and test coverage
> * R API expansion and improvements
> * Note about new features: As usual, we expect to expand the feature set of 
> MLlib.  However, we will prioritize API parity, bug fixes, and improvements 
> over new features.
> Note `spark.mllib` is in maintenance mode now.  We will accept bug fixes for 
> it, but new features, APIs, and improvements will only be added to `spark.ml`.
> h2. Critical feature parity in DataFrame-based API
> * Umbrella JIRA: [SPARK-4591]
> h2. Persistence
> * Complete persistence within MLlib
> ** Python tuning (SPARK-13786)
> * MLlib in R format: compatibility with other languages (SPARK-15572)
> * Impose backwards compatibility for persistence (SPARK-15573)
> h2. Python API
> * Standardize unit tests for Scala and Python to improve and consolidate test 
> coverage for Params, persistence, and other common functionality (SPARK-15571)
> * Improve Python API handling of Params, persistence (SPARK-14771) 
> (SPARK-14706)
> ** Note: The linked JIRAs for this are incomplete.  More to be created...
> ** Related: Implement Python meta-algorithms in Scala (to simplify 
> persistence) (SPARK-15574)
> * Feature parity: The main 

[jira] [Assigned] (SPARK-15887) Bring back the hive-site.xml support for Spark 2.0

2016-06-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15887:


Assignee: Wenchen Fan  (was: Apache Spark)

> Bring back the hive-site.xml support for Spark 2.0
> --
>
> Key: SPARK-15887
> URL: https://issues.apache.org/jira/browse/SPARK-15887
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>
> Right now, Spark 2.0 does not load hive-site.xml. Based on users' feedback, 
> it seems to make sense to still load this conf file.
> Originally, this file was loaded when we loaded the HiveConf class, and all 
> settings could be retrieved after we created a HiveConf instance. Let's avoid 
> using that mechanism to load hive-site.xml. Instead, since hive-site.xml is a 
> normal Hadoop conf file, we can first find its URL using the classloader and 
> then use Hadoop Configuration's addResource (or add hive-site.xml as a default 
> resource through Configuration.addDefaultResource) to load the confs.
> Please note that hive-site.xml needs to be loaded into the Hadoop conf used 
> to create metadataHive.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15887) Bring back the hive-site.xml support for Spark 2.0

2016-06-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15887:


Assignee: Apache Spark  (was: Wenchen Fan)

> Bring back the hive-site.xml support for Spark 2.0
> --
>
> Key: SPARK-15887
> URL: https://issues.apache.org/jira/browse/SPARK-15887
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>
> Right now, Spark 2.0 does not load hive-site.xml. Based on users' feedback, 
> it seems to make sense to still load this conf file.
> Originally, this file was loaded when we loaded the HiveConf class, and all 
> settings could be retrieved after we created a HiveConf instance. Let's avoid 
> using that mechanism to load hive-site.xml. Instead, since hive-site.xml is a 
> normal Hadoop conf file, we can first find its URL using the classloader and 
> then use Hadoop Configuration's addResource (or add hive-site.xml as a default 
> resource through Configuration.addDefaultResource) to load the confs.
> Please note that hive-site.xml needs to be loaded into the Hadoop conf used 
> to create metadataHive.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15887) Bring back the hive-site.xml support for Spark 2.0

2016-06-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15325386#comment-15325386
 ] 

Apache Spark commented on SPARK-15887:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/13611

> Bring back the hive-site.xml support for Spark 2.0
> --
>
> Key: SPARK-15887
> URL: https://issues.apache.org/jira/browse/SPARK-15887
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>
> Right now, Spark 2.0 does not load hive-site.xml. Based on users' feedback, 
> it seems to make sense to still load this conf file.
> Originally, this file was loaded when we loaded the HiveConf class, and all 
> settings could be retrieved after we created a HiveConf instance. Let's avoid 
> using that mechanism to load hive-site.xml. Instead, since hive-site.xml is a 
> normal Hadoop conf file, we can first find its URL using the classloader and 
> then use Hadoop Configuration's addResource (or add hive-site.xml as a default 
> resource through Configuration.addDefaultResource) to load the confs.
> Please note that hive-site.xml needs to be loaded into the Hadoop conf used 
> to create metadataHive.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15581) MLlib 2.1 Roadmap

2016-06-10 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15325377#comment-15325377
 ] 

Alexander Ulanov commented on SPARK-15581:
--

I would like to comment on the Breeze and deep learning parts, because I have 
been implementing the multilayer perceptron for Spark and have used Breeze a lot.

Breeze provides convenient abstractions for dense and sparse vectors and 
matrices and allows performing linear algebra backed by netlib-java and native 
BLAS. At the same time, Spark's "linalg" has its own abstractions for that. This 
might be confusing to users and developers. Obviously, Spark should have a 
single library for linear algebra. Having said that, Breeze is more convenient 
and flexible than linalg, though it misses some features such as in-place 
matrix multiplication and multidimensional arrays. Breeze cannot be removed 
from Spark because "linalg" does not have enough functionality to fully replace 
it. To address this, I have implemented a Scala tensor library on top of 
netlib-java, which "linalg" can be wrapped around. It also provides functions 
similar to Breeze and allows working with multi-dimensional arrays. [~mengxr], 
[~dbtsai] and I were planning to discuss this after the 2.0 release, and I am 
posting these considerations here since you raised the question too. Could you 
take a look at this library and tell us what you think? The source code is here: 
https://github.com/avulanov/scala-tensor

With regard to deep learning, I believe that having deep learning within 
Spark's ML library is a question of convenience. Spark has broad analytic 
capabilities, and it is useful to have deep learning as one of these tools at 
hand. Deep learning is the model of choice for several important modern 
use cases, and Spark ML might want to cover them. After all, it is hard to 
explain why we have PCA in ML but do not provide an autoencoder. To summarize, 
I think that Spark should have at least the most widely used deep learning 
models, such as the fully connected artificial neural network, the 
convolutional network, and the autoencoder. Advanced and experimental deep 
learning features might reside in packages or as pluggable external tools. 
Spark ML already has fully connected networks in place. A stacked autoencoder 
is implemented but not merged yet. The only thing that remains is the 
convolutional network. These three would provide a comprehensive deep learning 
set for Spark ML.

> MLlib 2.1 Roadmap
> -
>
> Key: SPARK-15581
> URL: https://issues.apache.org/jira/browse/SPARK-15581
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, MLlib
>Reporter: Joseph K. Bradley
>Priority: Blocker
>  Labels: roadmap
>
> This is a master list for MLlib improvements we are working on for the next 
> release. Please view this as a wish list rather than a definite plan, for we 
> don't have an accurate estimate of available resources. Due to limited review 
> bandwidth, features appearing on this list will get higher priority during 
> code review. But feel free to suggest new items to the list in comments. We 
> are experimenting with this process. Your feedback would be greatly 
> appreciated.
> h1. Instructions
> h2. For contributors:
> * Please read 
> https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark 
> carefully. Code style, documentation, and unit tests are important.
> * If you are a first-time Spark contributor, please always start with a 
> [starter task|https://issues.apache.org/jira/issues/?filter=12333209] rather 
> than a medium/big feature. Based on our experience, mixing the development 
> process with a big feature usually causes long delays in code review.
> * Never work silently. Let everyone know on the corresponding JIRA page when 
> you start working on some features. This is to avoid duplicate work. For 
> small features, you don't need to wait to get JIRA assigned.
> * For medium/big features or features with dependencies, please get assigned 
> first before coding and keep the ETA updated on the JIRA. If there is no 
> activity on the JIRA page for a certain amount of time, the JIRA should be 
> released for other contributors.
> * Do not claim multiple (>3) JIRAs at the same time. Try to finish them one 
> after another.
> * Remember to add the `@Since("VERSION")` annotation to new public APIs.
> * Please review others' PRs (https://spark-prs.appspot.com/#mllib). Code 
> review greatly helps to improve others' code as well as yours.
> h2. For committers:
> * Try to break down big features into small and specific JIRA tasks and link 
> them properly.
> * Add a "starter" label to starter tasks.
> * Put a rough estimate for medium/big features and track the progress.
> * If you start reviewing a PR, please add yourself to the Shepherd field on 
> JIRA.
> * If 

[jira] [Created] (SPARK-15887) Bring back the hive-site.xml support for Spark 2.0

2016-06-10 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-15887:
---

 Summary: Bring back the hive-site.xml support for Spark 2.0
 Key: SPARK-15887
 URL: https://issues.apache.org/jira/browse/SPARK-15887
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.0.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan


Right now, Spark 2.0 does not load hive-site.xml. Based on users' feedback, it 
seems to make sense to still load this conf file.
Originally, this file was loaded when we loaded the HiveConf class, and all 
settings could be retrieved after we created a HiveConf instance. Let's avoid 
using that mechanism to load hive-site.xml. Instead, since hive-site.xml is a 
normal Hadoop conf file, we can first find its URL using the classloader and 
then use Hadoop Configuration's addResource (or add hive-site.xml as a default 
resource through Configuration.addDefaultResource) to load the confs.
Please note that hive-site.xml needs to be loaded into the Hadoop conf used to 
create metadataHive.
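
As an illustration of the idea only (this is not the actual Spark patch, and the 
variable names are made up), a minimal Scala sketch of the classloader-based 
lookup could look roughly like this:

{code}
import org.apache.hadoop.conf.Configuration

// Find hive-site.xml on the classpath, if it is there, and fold it into the
// Hadoop configuration that will later be used to create metadataHive.
val hadoopConf = new Configuration()
val hiveSiteUrl = Thread.currentThread().getContextClassLoader.getResource("hive-site.xml")
if (hiveSiteUrl != null) {
  // Configuration.addResource accepts a URL, so the file can live anywhere on the classpath.
  hadoopConf.addResource(hiveSiteUrl)
}
{code}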



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15884) Override stringArgs method in MapPartitionsInR case class in order to avoid Out Of Memory exceptions when calling toString

2016-06-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15884:


Assignee: (was: Apache Spark)

> Override stringArgs method in MapPartitionsInR case class in order to avoid 
> Out Of Memory exceptions when calling toString
> ---
>
> Key: SPARK-15884
> URL: https://issues.apache.org/jira/browse/SPARK-15884
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR, SQL
>Reporter: Narine Kokhlikyan
>
> As discussed in https://github.com/apache/spark/pull/12836
> we need to override the stringArgs method in MapPartitionsInR in order to 
> avoid overly large strings generated by the "stringArgs" method from the input 
> arguments.
> In this case we exclude some of the input arguments: the serialized R objects.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15884) Override stringArgs method in MapPartitionsInR case class in order to avoid Out Of Memory exceptions when calling toString

2016-06-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15325360#comment-15325360
 ] 

Apache Spark commented on SPARK-15884:
--

User 'NarineK' has created a pull request for this issue:
https://github.com/apache/spark/pull/13610

> Override stringArgs method in MapPartitionsInR case class in order to avoid 
> Out Of Memory exceptions when calling toString
> ---
>
> Key: SPARK-15884
> URL: https://issues.apache.org/jira/browse/SPARK-15884
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR, SQL
>Reporter: Narine Kokhlikyan
>
> As discussed in https://github.com/apache/spark/pull/12836
> we need to override the stringArgs method in MapPartitionsInR in order to 
> avoid overly large strings generated by the "stringArgs" method from the input 
> arguments.
> In this case we exclude some of the input arguments: the serialized R objects.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15884) Override stringArgs method in MapPartitionsInR case class in order to avoid Out Of Memory exceptions when calling toString

2016-06-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15884:


Assignee: Apache Spark

> Override stringArgs method in MapPartitionsInR case class in order to avoid 
> Out Of Memory exceptions when calling toString
> ---
>
> Key: SPARK-15884
> URL: https://issues.apache.org/jira/browse/SPARK-15884
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR, SQL
>Reporter: Narine Kokhlikyan
>Assignee: Apache Spark
>
> As discussed in https://github.com/apache/spark/pull/12836
> we need to override the stringArgs method in MapPartitionsInR in order to 
> avoid overly large strings generated by the "stringArgs" method from the input 
> arguments.
> In this case we exclude some of the input arguments: the serialized R objects.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15886) PySpark ML examples should use local linear algebra

2016-06-10 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-15886:
-

 Summary: PySpark ML examples should use local linear algebra
 Key: SPARK-15886
 URL: https://issues.apache.org/jira/browse/SPARK-15886
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, ML, PySpark
Affects Versions: 2.0.0
Reporter: Joseph K. Bradley
Assignee: Hyukjin Kwon


Fix the Python examples to use the new ML Vector and Matrix APIs in the ML 
pipeline-based algorithms.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-15886) PySpark ML examples should use local linear algebra

2016-06-10 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley closed SPARK-15886.
-
Resolution: Duplicate

Oh, just saw in the PR that you'd send a follow-up since the original should 
include this.  I'll close this JIRA.

> PySpark ML examples should use local linear algebra
> ---
>
> Key: SPARK-15886
> URL: https://issues.apache.org/jira/browse/SPARK-15886
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML, PySpark
>Affects Versions: 2.0.0
>Reporter: Joseph K. Bradley
>Assignee: Hyukjin Kwon
>
> Fix the Python examples to use the new ML Vector and Matrix APIs in the ML 
> pipeline-based algorithms.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15862) Better Error Message When Having Database Name in CACHE TABLE AS SELECT

2016-06-10 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-15862:
---
Assignee: Xiao Li

> Better Error Message When Having Database Name in CACHE TABLE AS SELECT
> ---
>
> Key: SPARK-15862
> URL: https://issues.apache.org/jira/browse/SPARK-15862
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Minor
>
> The table name in CACHE TABLE AS SELECT should NOT contain a database prefix 
> like "database.table". Thus, this PR catches this case in the parser and 
> outputs a better error message, instead of reporting that the view already exists.
> In addition, in this JIRA, we have a few issues that need to be addressed: 1) 
> refactor the parser to generate table identifiers instead of returning the 
> table name as a string; 2) add test cases for caching and uncaching qualified 
> table names; 3) fix a few test cases that do not drop their temp tables at the end.
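
For illustration only (the table names are made up, and `spark` is assumed to be 
the spark-shell SparkSession), the kind of statement that should now fail in the 
parser with a clearer message is:

{code}
// A database-qualified name in CACHE TABLE ... AS SELECT should be rejected by
// the parser with an explicit error, rather than a misleading
// "view already exists" error.
spark.sql("CACHE TABLE somedb.cached_tab AS SELECT * FROM somedb.src")
{code}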



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15885) Provide to executor logs from stage details page in UI

2016-06-10 Thread Tom Magrino (JIRA)
Tom Magrino created SPARK-15885:
---

 Summary: Provide to executor logs from stage details page in UI
 Key: SPARK-15885
 URL: https://issues.apache.org/jira/browse/SPARK-15885
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Reporter: Tom Magrino
Priority: Trivial


Currently, the stage details page lists information about executors but does 
not readily provide links to the log output for the given executor.  It would 
be useful to have a link to the log directly from the stage details page, 
rather than navigating to the executors tab and finding the appropriate 
executor id there.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15885) Provide links to executor logs from stage details page in UI

2016-06-10 Thread Tom Magrino (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tom Magrino updated SPARK-15885:

Summary: Provide links to executor logs from stage details page in UI  
(was: Provide to executor logs from stage details page in UI)

> Provide links to executor logs from stage details page in UI
> 
>
> Key: SPARK-15885
> URL: https://issues.apache.org/jira/browse/SPARK-15885
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Reporter: Tom Magrino
>Priority: Trivial
>
> Currently, the stage details page lists information about executors but does 
> not readily provide links to the log output for the given executor.  It would 
> be useful to have a link to the log directly from the stage details page, 
> rather than navigating to the executors tab and finding the appropriate 
> executor id there.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13207) _SUCCESS should not break partition discovery

2016-06-10 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15325337#comment-15325337
 ] 

Yin Huai commented on SPARK-13207:
--

Hey [~simeons], sorry for the late reply. SPARK-15454 has fixed this issue.

> _SUCCESS should not break partition discovery
> -
>
> Key: SPARK-13207
> URL: https://issues.apache.org/jira/browse/SPARK-13207
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
>  Labels: backport-needed
> Fix For: 1.6.2, 2.0.0
>
>
> Partitioning discovery will fail with the following case
> {code}
> test("_SUCCESS should not break partitioning discovery") {
> withTempPath { dir =>
>   val tablePath = new File(dir, "table")
>   val df = (1 to 3).map(i => (i, i, i, i)).toDF("a", "b", "c", "d")
>   df.write
>     .format("parquet")
>     .partitionBy("b", "c", "d")
>     .save(tablePath.getCanonicalPath)
>   Files.touch(new File(s"${tablePath.getCanonicalPath}/b=1", "_SUCCESS"))
>   Files.touch(new File(s"${tablePath.getCanonicalPath}/b=1/c=1", "_SUCCESS"))
>   Files.touch(new File(s"${tablePath.getCanonicalPath}/b=1/c=1/d=1", "_SUCCESS"))
>   checkAnswer(sqlContext.read.format("parquet").load(tablePath.getCanonicalPath), df)
> }
>   }
> {code}
> Because {{_SUCCESS}} is the in the inner partitioning dirs, partitioning 
> discovery will fail.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15884) Override stringArgs method in MapPartitionsInR case class in order to avoid Out Of Memory exceptions when calling toString

2016-06-10 Thread Narine Kokhlikyan (JIRA)
Narine Kokhlikyan created SPARK-15884:
-

 Summary: Override stringArgs method in MapPartitionsInR case class 
in order to avoid Out Of Memory exceptions when calling toString
 Key: SPARK-15884
 URL: https://issues.apache.org/jira/browse/SPARK-15884
 Project: Spark
  Issue Type: Bug
  Components: SparkR, SQL
Reporter: Narine Kokhlikyan


As discussed in https://github.com/apache/spark/pull/12836
we need to override the stringArgs method in MapPartitionsInR in order to avoid 
overly large strings generated by the "stringArgs" method from the input arguments. 

In this case we exclude some of the input arguments: the serialized R objects.
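
As a rough sketch of the pattern only (the trait and field names below are 
invented; the real MapPartitionsInR node extends Catalyst's TreeNode and has its 
own fields), overriding stringArgs keeps bulky arguments out of toString:

{code}
// A tiny stand-in for TreeNode: by default every constructor argument is printed.
trait NodeLike extends Product {
  protected def stringArgs: Iterator[Any] = productIterator
  override def toString: String = s"$productPrefix(${stringArgs.mkString(", ")})"
}

// Hypothetical plan node carrying serialized R payloads.
case class MapPartitionsInRSketch(
    func: Array[Byte],          // serialized R closure -- potentially very large
    packageNames: Array[Byte],  // serialized R package list
    outputSchema: String) extends NodeLike {

  // Only expose the small, human-readable argument, so printing the plan
  // cannot build an enormous string and exhaust the heap.
  override protected def stringArgs: Iterator[Any] = Iterator(outputSchema)
}
{code}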



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15688) RelationalGroupedDataset.toDF should not add group by expressions that are already added in the aggregate expressions.

2016-06-10 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-15688.
--
Resolution: Won't Fix

https://github.com/apache/spark/pull/13483#issuecomment-224758653

> RelationalGroupedDataset.toDF should not add group by expressions that are 
> already added in the aggregate expressions.
> --
>
> Key: SPARK-15688
> URL: https://issues.apache.org/jira/browse/SPARK-15688
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>
> For {{df.groupBy("col").agg($"col", count("*"))}}, it is kind of weird to 
> have col appearing twice in the result. It seems we can avoid outputting the 
> group-by expressions twice if they are already part of {{agg}}. Looks like 
> RelationalGroupedDataset.toDF is the place to change.
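
To make the report concrete, a minimal hypothetical example (the data is made 
up, and the issue was ultimately resolved as Won't Fix):

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.count

val spark = SparkSession.builder().master("local[1]").appName("groupby-dup").getOrCreate()
import spark.implicits._

val df = Seq(("a", 1), ("a", 2), ("b", 3)).toDF("col", "v")

// "col" shows up twice in the result: once for the group-by expression added by
// RelationalGroupedDataset.toDF and once for the explicit $"col" inside agg().
df.groupBy("col").agg($"col", count("*")).show()
{code}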



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-15822) segmentation violation in o.a.s.unsafe.types.UTF8String

2016-06-10 Thread Adam Roberts (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15324475#comment-15324475
 ] 

Adam Roberts edited comment on SPARK-15822 at 6/10/16 9:39 PM:
---

Herman, here's the application; note that my HashedRelation comment is only a 
theory at this stage (edit: it now looks irrelevant).

{code}
import org.apache.spark.SparkConf
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

object SQLFlights {
  def displayTop(title: String, df: DataFrame) {
    println(title)
    df.sort(desc("rank")).take(10).foreach(println)
  }

  def main(args: Array[String]) {
    val inputfile = args(0)
    val airport = args(1)

    val conf = new SparkConf().setAppName("SQL Flights")
    val sqlContext = org.apache.spark.sql.SparkSession.builder.config(conf).getOrCreate()

    val df = sqlContext.read.format("com.databricks.spark.csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load(inputfile)
      .cache()

    val arrivals = df.filter(s"Dest = '$airport'").cache()
    val departures = df.filter(s"Origin = '$airport'").cache()
    val departuresByCarrier = departures.groupBy("Dest", "UniqueCarrier").count()
      .withColumnRenamed("count", "total")
    val a = departures.filter("Cancelled != 0 and CancellationCode = 'A'")
    println("done a")
    val b = a.groupBy("Dest", "UniqueCarrier").count()
    println("done b")
    val c = b.join(departuresByCarrier, Seq("Dest", "UniqueCarrier"))
    println("done c")
    val d = c.selectExpr("Dest", "UniqueCarrier", "round(count * 100 / total, 2) as rank")
    println("done d")
    displayTop("Top Departure Carrier Cancellations:", d)
  }
}
{code}

in conf/spark-env.sh:
{code}
export SPARK_WORKER_CORES=2
export SPARK_WORKER_INSTANCES=2
{code}

in conf/spark-defaults.conf:
{code}
spark.sql.warehouse.dir /home/aroberts/sql-flights
{code}

Submit including --packages com.databricks:spark-csv_2.11:1.4.0

The job will complete, but if you look in the $SPARK_HOME/work dir you'll see 
that after our queries are done the executors die due to the segv, and by 
looking at the stderr files we can see the problem.

The data set to use as the first arg can be downloaded at 
http://stat-computing.org/dataexpo/2009/2008.csv.bz2 (after extracting, we can 
do head -100 to create a smaller file and still hit the problem without 
waiting so long).

As the second arg you can use ORD as the airport name.


was (Author: aroberts):
Herman, here's the application, note my HashedRelation comment is only a theory 
at this stage.

{code}
import org.apache.spark.SparkConf
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

object SQLFlights {
  def displayTop(title: String, df: DataFrame) {
println(title);
df.sort(desc("rank")).take(10).foreach(println)
  }

  def main(args: Array[String]) {
val inputfile = args(0)
val airport = args(1)

val conf = new SparkConf().setAppName("SQL Flights")
val sqlContext = 
org.apache.spark.sql.SparkSession.builder.config(conf).getOrCreate()

val df = sqlContext.read.format("com.databricks.spark.csv")
.option("header", "true")
.option("inferSchema", "true")
.load(inputfile)
.cache()

val arrivals = df.filter(s"Dest = '$airport'").cache();
val departures = df.filter(s"Origin = '$airport'").cache();
val departuresByCarrier = departures.groupBy("Dest", 
"UniqueCarrier").count().withColumnRenamed("count", "total")
val a = departures.filter("Cancelled != 0 and CancellationCode = 'A'")
println("done a")
val b = a.groupBy("Dest", "UniqueCarrier").count()
println("done b")
val c = b.join(departuresByCarrier, Seq("Dest", "UniqueCarrier"))
println("done c")
val d = c.selectExpr("Dest", "UniqueCarrier", "round(count * 100 / total, 
2) as rank")
println("done d")
displayTop("Top Departure Carrier Cancellations:", d)
  }
}
{code}

in conf/spark-env.sh:
{code}
export SPARK_WORKER_CORES=2
export SPARK_WORKER_INSTANCES=2
{code}

in conf/spark-defaults.conf:
{code}
spark.sql.warehouse.dir /home/aroberts/sql-flights
{code}

Submit including --packages com.databricks:spark-csv_2.11:1.4.0

The job will complete but if you look in the $SPARK_HOME/work dir you'll see 
that after our queries are done, executors will die due to the segv and by 
looking in the stderr files we can see the problem.

> segmentation violation in o.a.s.unsafe.types.UTF8String 
> 
>
> Key: SPARK-15822
> URL: https://issues.apache.org/jira/browse/SPARK-15822
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
> Environment: linux amd64
> openjdk version "1.8.0_91"
> OpenJDK Runtime Environment (build 1.8.0_91-b14)
> OpenJDK 64-Bit Server VM (build 25.91-b14, mixed mode)
>   

[jira] [Resolved] (SPARK-15489) Dataset kryo encoder won't load custom user settings

2016-06-10 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-15489.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 13424
[https://github.com/apache/spark/pull/13424]

> Dataset kryo encoder won't load custom user settings 
> -
>
> Key: SPARK-15489
> URL: https://issues.apache.org/jira/browse/SPARK-15489
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Amit Sela
> Fix For: 2.0.0
>
>
> When setting a custom "spark.kryo.registrator" (or any other configuration 
> for that matter) through the API, this configuration will not propagate to 
> the encoder that uses a KryoSerializer since it instantiates with "new 
> SparkConf()".
> See:  
> https://github.com/apache/spark/blob/07c36a2f07fcf5da6fb395f830ebbfc10eb27dcc/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala#L554
> This could be hacked by providing those configurations as System properties, 
> but this probably should be passed to the encoder and set in the 
> SerializerInstance after creation.
> Example:
> When using Encoders with kryo to encode generically typed Objects in the 
> following manner:
> public static <T> Encoder<T> encoder() {
>   return Encoders.kryo((Class<T>) Object.class);
> }
> I get a decoding exception when trying to decode 
> `java.util.Collections$UnmodifiableCollection`, which probably comes from 
> Guava's `ImmutableList`.
> This happens when running with master = local[1]. Same code had no problems 
> with RDD api.
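
A hedged Scala sketch of the system-property workaround mentioned above 
("com.example.MyRegistrator" is a made-up class name); it relies on the fact 
that "new SparkConf()" picks up spark.* system properties:

{code}
import org.apache.spark.sql.{Encoder, Encoders}

// Because the kryo-backed encoder builds its serializer from "new SparkConf()",
// a registrator set only on the application's SparkConf/SparkSession is not seen.
// Setting it as a system property before the encoder is used works around that.
System.setProperty("spark.kryo.registrator", "com.example.MyRegistrator")

// Scala equivalent of the generic encoder() helper from the description above.
def kryoEncoder[T]: Encoder[T] = Encoders.kryo(classOf[AnyRef].asInstanceOf[Class[T]])
{code}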



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15784) Add Power Iteration Clustering to spark.ml

2016-06-10 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-15784:
--
Issue Type: New Feature  (was: Improvement)

> Add Power Iteration Clustering to spark.ml
> --
>
> Key: SPARK-15784
> URL: https://issues.apache.org/jira/browse/SPARK-15784
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xinh Huynh
>
> Adding this algorithm is required as part of SPARK-4591: Algorithm/model 
> parity for spark.ml. The review JIRA for clustering is SPARK-14380.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15654) Reading gzipped files results in duplicate rows

2016-06-10 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-15654.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 13531
[https://github.com/apache/spark/pull/13531]

> Reading gzipped files results in duplicate rows
> ---
>
> Key: SPARK-15654
> URL: https://issues.apache.org/jira/browse/SPARK-15654
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Jurriaan Pruis
>Assignee: Davies Liu
>Priority: Blocker
> Fix For: 2.0.0
>
>
> When gzipped files are larger than {{spark.sql.files.maxPartitionBytes}}, 
> reading the file will result in duplicate rows in the dataframe.
> Given an example gzipped wordlist (of 740K bytes):
> {code}
> $ gzcat words.gz |wc -l
> 235886
> {code}
> Reading it using spark results in the following output:
> {code}
> >>> sqlContext.setConf('spark.sql.files.maxPartitionBytes', '1000')
> >>> sqlContext.read.text("/Users/jurriaanpruis/spark/words.gz").count()
> 81244093
> >>> sqlContext.setConf('spark.sql.files.maxPartitionBytes', '1')
> >>> sqlContext.read.text("/Users/jurriaanpruis/spark/words.gz").count()
> 8348566
> >>> sqlContext.setConf('spark.sql.files.maxPartitionBytes', '10')
> >>> sqlContext.read.text("/Users/jurriaanpruis/spark/words.gz").count()
> 1051469
> >>> sqlContext.setConf('spark.sql.files.maxPartitionBytes', '100')
> >>> sqlContext.read.text("/Users/jurriaanpruis/spark/words.gz").count()
> 235886
> {code}
> You can clearly see how the number of rows scales with the number of 
> partitions.
> Somehow the data is duplicated when the number of partitions exceeds one 
> (which as seen above approximately scales with the partition size). 
> When using distinct you'll get the correct answer:
> {code}
> >>> sqlContext.setConf('spark.sql.files.maxPartitionBytes', '1')
> >>> sqlContext.read.text("/Users/jurriaanpruis/spark/words.gz").distinct().count()
> 235886
> {code}
> This looks like a pretty serious bug.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15825) sort-merge-join gives invalid results when joining on a tupled key

2016-06-10 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-15825.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 13589
[https://github.com/apache/spark/pull/13589]

> sort-merge-join gives invalid results when joining on a tupled key
> --
>
> Key: SPARK-15825
> URL: https://issues.apache.org/jira/browse/SPARK-15825
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
> Environment: spark 2.0.0-SNAPSHOT
>Reporter: Andres Perez
>Assignee: Herman van Hovell
> Fix For: 2.0.0
>
>
> {noformat}
>   import org.apache.spark.sql.functions
>   val left = List("0", "1", "2").toDS()
>     .map{ k => ((k, 0), "l") }
>   val right = List("0", "1", "2").toDS()
>     .map{ k => ((k, 0), "r") }
>   val result = left.toDF("k", "v").as[((String, Int), String)].alias("left")
>     .joinWith(right.toDF("k", "v").as[((String, Int), String)].alias("right"),
>       functions.col("left.k") === functions.col("right.k"), "inner")
>     .as[(((String, Int), String), ((String, Int), String))]
> {noformat}
> When broadcast joins are enabled, we get the expected output:
> {noformat}
> (((0,0),l),((0,0),r))
> (((1,0),l),((1,0),r))
> (((2,0),l),((2,0),r))
> {noformat}
> However, when broadcast joins are disabled (i.e. setting 
> spark.sql.autoBroadcastJoinThreshold to -1), the result is incorrect:
> {noformat}
> (((2,0),l),((2,-1),))
> (((0,0),l),((0,-313907893),))
> (((1,0),l),((null,-313907893),))
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15790) Audit @Since annotations in ML

2016-06-10 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15325317#comment-15325317
 ] 

Joseph K. Bradley commented on SPARK-15790:
---

Linking existing umbrella.

Also, I want to note: there are a lot of functions in abstract classes to which 
we can't add Since annotations since they are used in concrete classes added at 
different times.  We could override those functions in concrete classes, but 
that does not seem worthwhile.


> Audit @Since annotations in ML
> --
>
> Key: SPARK-15790
> URL: https://issues.apache.org/jira/browse/SPARK-15790
> Project: Spark
>  Issue Type: Documentation
>  Components: ML, PySpark
>Reporter: Nick Pentreath
>Assignee: Nick Pentreath
>
> Many classes & methods in ML are missing {{@Since}} annotations. Audit what's 
> missing and add annotations to public API constructors, vals and methods.
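
For reference, a minimal illustration of the annotation pattern as it appears 
inside Spark's own source tree (the package, class, method, and version below 
are invented; the annotation is only usable within Spark's packages):

{code}
package org.apache.spark.ml.feature   // hypothetical location inside Spark's source tree

import org.apache.spark.annotation.Since

// @Since marks when a public API element first appeared; it is applied to the
// class, the primary constructor, vals, and methods of the public API.
@Since("2.0.0")
class ExampleFeatureScaler @Since("2.0.0") (@Since("2.0.0") val uid: String) {

  @Since("2.0.0")
  def describe(): String = s"ExampleFeatureScaler(uid=$uid)"
}
{code}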



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15879) Update logo in UI and docs to add "Apache"

2016-06-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15879:


Assignee: (was: Apache Spark)

> Update logo in UI and docs to add "Apache"
> --
>
> Key: SPARK-15879
> URL: https://issues.apache.org/jira/browse/SPARK-15879
> Project: Spark
>  Issue Type: Task
>  Components: Documentation, Web UI
>Reporter: Matei Zaharia
>
> We recently added "Apache" to the Spark logo on the website 
> (http://spark.apache.org/images/spark-logo.eps) to have it be the full 
> project name, and we should do the same in the web UI and docs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15879) Update logo in UI and docs to add "Apache"

2016-06-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15325313#comment-15325313
 ] 

Apache Spark commented on SPARK-15879:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/13609

> Update logo in UI and docs to add "Apache"
> --
>
> Key: SPARK-15879
> URL: https://issues.apache.org/jira/browse/SPARK-15879
> Project: Spark
>  Issue Type: Task
>  Components: Documentation, Web UI
>Reporter: Matei Zaharia
>
> We recently added "Apache" to the Spark logo on the website 
> (http://spark.apache.org/images/spark-logo.eps) to have it be the full 
> project name, and we should do the same in the web UI and docs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15879) Update logo in UI and docs to add "Apache"

2016-06-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15879:


Assignee: Apache Spark

> Update logo in UI and docs to add "Apache"
> --
>
> Key: SPARK-15879
> URL: https://issues.apache.org/jira/browse/SPARK-15879
> Project: Spark
>  Issue Type: Task
>  Components: Documentation, Web UI
>Reporter: Matei Zaharia
>Assignee: Apache Spark
>
> We recently added "Apache" to the Spark logo on the website 
> (http://spark.apache.org/images/spark-logo.eps) to have it be the full 
> project name, and we should do the same in the web UI and docs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15751) Add generateAssociationRules in fpm in pyspark

2016-06-10 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15325307#comment-15325307
 ] 

Joseph K. Bradley commented on SPARK-15751:
---

There isn't a JIRA for this AFAIK, but I think we should focus on porting FPM 
to spark.ml rather than adding to the spark.mllib API, even for PySpark 
wrappers.  Would you mind closing this?  It would be great to get your help 
with the port to spark.ml: [SPARK-14501]
Thank you!

> Add generateAssociationRules in fpm in pyspark
> --
>
> Key: SPARK-15751
> URL: https://issues.apache.org/jira/browse/SPARK-15751
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, PySpark
>Affects Versions: 2.0.0
>Reporter: Jeff Zhang
>
> There's no api for generating association rules in pyspark for now. Please 
> close it if there's already an existing jira tracking this. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15881) Update microbenchmark results

2016-06-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15881:


Assignee: Apache Spark

> Update microbenchmark results
> -
>
> Key: SPARK-15881
> URL: https://issues.apache.org/jira/browse/SPARK-15881
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Eric Liang
>Assignee: Apache Spark
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15746) SchemaUtils.checkColumnType with VectorUDT prints instance details in error message

2016-06-10 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15325294#comment-15325294
 ] 

Joseph K. Bradley commented on SPARK-15746:
---

Either fix seems fine to me.  Modifying checkColumnType seems like a good thing 
to do regardless since it may improve messages for other types too.  I'm OK 
with the case object VectorUDT as a nicety.

> SchemaUtils.checkColumnType with VectorUDT prints instance details in error 
> message
> ---
>
> Key: SPARK-15746
> URL: https://issues.apache.org/jira/browse/SPARK-15746
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Nick Pentreath
>Priority: Minor
>
> Currently, many feature transformers in {{ml}} use 
> {{SchemaUtils.checkColumnType(schema, ..., new VectorUDT)}} to check the 
> column type is a ({{ml.linalg}}) vector.
> The resulting error message contains "instance" info for the {{VectorUDT}}, 
> i.e. something like this:
> {code}
> java.lang.IllegalArgumentException: requirement failed: Column features must 
> be of type org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 but was actually 
> StringType.
> {code}
> A solution would either be to amend {{SchemaUtils.checkColumnType}} to print 
> the error message using {{getClass.getName}}, or to create a {{private[spark] 
> case object VectorUDT extends VectorUDT}} for convenience, since it is used 
> so often (and incidentally this would make it easier to put {{VectorUDT}} 
> into lists of data types e.g. schema validation, UDAFs etc).
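
To make the first option concrete, a rough sketch (not the actual SchemaUtils 
code; the helper name is made up):

{code}
import org.apache.spark.sql.types.{DataType, StructType}

// Report the required type via getClass.getName rather than the UDT instance's
// default toString, so the message no longer ends in a "@3bfc3ba7"-style suffix.
def checkColumnTypeSketch(schema: StructType, colName: String, dataType: DataType): Unit = {
  val actual = schema(colName).dataType
  require(actual.equals(dataType),
    s"Column $colName must be of type ${dataType.getClass.getName} " +
      s"but was actually ${actual.simpleString}.")
}
{code}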



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15883) Fix broken links in MLlib documentation

2016-06-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15325295#comment-15325295
 ] 

Apache Spark commented on SPARK-15883:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/13608

> Fix broken links in MLlib documentation
> 
>
> Key: SPARK-15883
> URL: https://issues.apache.org/jira/browse/SPARK-15883
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Reporter: Dongjoon Hyun
>Priority: Trivial
>
> This issue fixes all broken links in the Spark 2.0 preview MLlib documents. Also, 
> this contains some editorial changes.
> **Fix broken links**
>   * mllib-data-types.md
>   * mllib-decision-tree.md
>   * mllib-ensembles.md
>   * mllib-feature-extraction.md
>   * mllib-pmml-model-export.md
>   * mllib-statistics.md
> **Fix malformed section header and scala coding style**
>   * mllib-linear-methods.md
> **Replace indirect forward links with direct one**
>   * ml-classification-regression.md



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15883) Fix broken links in MLlib documentation

2016-06-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15883:


Assignee: (was: Apache Spark)

> Fix broken links on MLLIB documentations
> 
>
> Key: SPARK-15883
> URL: https://issues.apache.org/jira/browse/SPARK-15883
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Reporter: Dongjoon Hyun
>Priority: Trivial
>
> This issue fixes all broken links on Spark 2.0 preview MLLib documents. Also, 
> this contains some editorial change.
> **Fix broken links**
>   * mllib-data-types.md
>   * mllib-decision-tree.md
>   * mllib-ensembles.md
>   * mllib-feature-extraction.md
>   * mllib-pmml-model-export.md
>   * mllib-statistics.md
> **Fix malformed section header and scala coding style**
>   * mllib-linear-methods.md
> **Replace indirect forward links with direct one**
>   * ml-classification-regression.md



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15883) Fix broken links on MLLIB documentations

2016-06-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15883:


Assignee: Apache Spark

> Fix broken links on MLLIB documentations
> 
>
> Key: SPARK-15883
> URL: https://issues.apache.org/jira/browse/SPARK-15883
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Trivial
>
> This issue fixes all broken links on Spark 2.0 preview MLLib documents. Also, 
> this contains some editorial change.
> **Fix broken links**
>   * mllib-data-types.md
>   * mllib-decision-tree.md
>   * mllib-ensembles.md
>   * mllib-feature-extraction.md
>   * mllib-pmml-model-export.md
>   * mllib-statistics.md
> **Fix malformed section header and scala coding style**
>   * mllib-linear-methods.md
> **Replace indirect forward links with direct one**
>   * ml-classification-regression.md



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15628) pyspark.ml.evaluation module

2016-06-10 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-15628.
---
Resolution: Done
  Assignee: holdenk

[~holdenk] OK, so no missing items?  I'll close this.  Thanks!

> pyspark.ml.evaluation module
> 
>
> Key: SPARK-15628
> URL: https://issues.apache.org/jira/browse/SPARK-15628
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: holdenk
>Assignee: holdenk
>
> See parent task SPARK-14813.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15883) Fix broken links on MLLIB documentations

2016-06-10 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-15883:
--
Description: 
This issue fixes all broken links on Spark 2.0 preview MLLib documents. Also, 
this contains some editorial change.

* Fix broken links
  * mllib-data-types.md
  * mllib-decision-tree.md
  * mllib-ensembles.md
  * mllib-feature-extraction.md
  * mllib-pmml-model-export.md
  * mllib-statistics.md

* Fix malformed section header and scala coding style.
  * mllib-linear-methods.md

* ml-classification-regression.md
  * Replace indirect forward links with direct one.


  was:
This issue fixes all broken links on Spark 2.0 preview MLLib documents. Also, 
this contains some editorial change.

* Fix broken links
  * mllib-data-types.md
  * mllib-decision-tree.md
  * mllib-ensembles.md
  * mllib-feature-extraction.md
  * mllib-pmml-model-export.md
  * mllib-statistics.md
* Fix malformed section header and scala coding style.
  * mllib-linear-methods.md
* ml-classification-regression.md
  - Replace indirect forward links with direct one.



> Fix broken links on MLLIB documentations
> 
>
> Key: SPARK-15883
> URL: https://issues.apache.org/jira/browse/SPARK-15883
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Reporter: Dongjoon Hyun
>Priority: Trivial
>
> This issue fixes all broken links on Spark 2.0 preview MLLib documents. Also, 
> this contains some editorial change.
> * Fix broken links
>   * mllib-data-types.md
>   * mllib-decision-tree.md
>   * mllib-ensembles.md
>   * mllib-feature-extraction.md
>   * mllib-pmml-model-export.md
>   * mllib-statistics.md
> * Fix malformed section header and scala coding style.
>   * mllib-linear-methods.md
> * ml-classification-regression.md
>   * Replace indirect forward links with direct one.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15883) Fix broken links on MLLIB documentations

2016-06-10 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-15883:
--
Description: 
This issue fixes all broken links on Spark 2.0 preview MLLib documents. Also, 
this contains some editorial change.

**Fix broken links**
  * mllib-data-types.md
  * mllib-decision-tree.md
  * mllib-ensembles.md
  * mllib-feature-extraction.md
  * mllib-pmml-model-export.md
  * mllib-statistics.md

**Fix malformed section header and scala coding style**
  * mllib-linear-methods.md

**Replace indirect forward links with direct one**
  * ml-classification-regression.md


  was:
This issue fixes all broken links on Spark 2.0 preview MLLib documents. Also, 
this contains some editorial change.

* Fix broken links
  * mllib-data-types.md
  * mllib-decision-tree.md
  * mllib-ensembles.md
  * mllib-feature-extraction.md
  * mllib-pmml-model-export.md
  * mllib-statistics.md

* Fix malformed section header and scala coding style.
  * mllib-linear-methods.md

* ml-classification-regression.md
  * Replace indirect forward links with direct one.



> Fix broken links on MLLIB documentations
> 
>
> Key: SPARK-15883
> URL: https://issues.apache.org/jira/browse/SPARK-15883
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Reporter: Dongjoon Hyun
>Priority: Trivial
>
> This issue fixes all broken links on Spark 2.0 preview MLLib documents. Also, 
> this contains some editorial change.
> **Fix broken links**
>   * mllib-data-types.md
>   * mllib-decision-tree.md
>   * mllib-ensembles.md
>   * mllib-feature-extraction.md
>   * mllib-pmml-model-export.md
>   * mllib-statistics.md
> **Fix malformed section header and scala coding style**
>   * mllib-linear-methods.md
> **Replace indirect forward links with direct one**
>   * ml-classification-regression.md



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15883) Fix broken links on MLLIB documentations

2016-06-10 Thread Dongjoon Hyun (JIRA)
Dongjoon Hyun created SPARK-15883:
-

 Summary: Fix broken links on MLLIB documentations
 Key: SPARK-15883
 URL: https://issues.apache.org/jira/browse/SPARK-15883
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Reporter: Dongjoon Hyun
Priority: Trivial


This issue fixes all broken links on Spark 2.0 preview MLLib documents. Also, 
this contains some editorial change.

* Fix broken links
  * mllib-data-types.md
  * mllib-decision-tree.md
  * mllib-ensembles.md
  * mllib-feature-extraction.md
  * mllib-pmml-model-export.md
  * mllib-statistics.md
* Fix malformed section header and scala coding style.
  * mllib-linear-methods.md
* ml-classification-regression.md
  - Replace indirect forward links with direct one.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15699) Add chi-squared test statistic as a split quality metric for decision trees

2016-06-10 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15325286#comment-15325286
 ] 

Joseph K. Bradley commented on SPARK-15699:
---

[~eje] Just a warning: There are a lot of doc fixes for 2.0 + high priority 
issues, so it may take a while for this to get looked at.

> Add chi-squared test statistic as a split quality metric for decision trees
> ---
>
> Key: SPARK-15699
> URL: https://issues.apache.org/jira/browse/SPARK-15699
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, MLlib
>Affects Versions: 2.0.0
>Reporter: Erik Erlandson
>Priority: Minor
>
> Using test statistics as a measure of decision tree split quality is a useful 
> split halting measure that can yield improved model quality.  I am proposing 
> to add the chi-squared test statistic as a new impurity option (in addition 
> to "gini" and "entropy") for classification decision trees and ensembles.
> I wrote a blog post that explains some useful properties of test-statistics 
> for measuring split quality, with some example results:
> http://erikerlandson.github.io/blog/2016/05/26/measuring-decision-tree-split-quality-with-test-statistic-p-values/
> (Other test statistics are also possible, for example using the Welch's 
> t-test variant for regression trees, but they could be addressed separately)
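
As a rough illustration of the proposed metric (a minimal, self-contained sketch; the 
object and method names are made up for this example and are not part of Spark's 
impurity API), the statistic for one candidate split can be computed from the 
per-class counts on each side:

{code}
object ChiSquaredSplitSketch {
  // Chi-squared statistic of the 2 x K contingency table induced by a candidate
  // split: rows are the left/right partitions, columns are the class labels.
  def chiSquaredStatistic(leftCounts: Array[Double], rightCounts: Array[Double]): Double = {
    require(leftCounts.length == rightCounts.length, "per-class count arrays must align")
    val leftTotal = leftCounts.sum
    val rightTotal = rightCounts.sum
    val total = leftTotal + rightTotal
    leftCounts.indices.map { k =>
      val classTotal = leftCounts(k) + rightCounts(k)
      val expectedLeft = classTotal * leftTotal / total
      val expectedRight = classTotal * rightTotal / total
      val left =
        if (expectedLeft > 0) math.pow(leftCounts(k) - expectedLeft, 2) / expectedLeft else 0.0
      val right =
        if (expectedRight > 0) math.pow(rightCounts(k) - expectedRight, 2) / expectedRight else 0.0
      left + right
    }.sum
  }
}
{code}

A larger statistic (equivalently, a smaller p-value under the chi-squared distribution 
with K - 1 degrees of freedom) indicates a more informative split, which is what makes 
it usable as a split-halting criterion.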



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15881) Update microbenchmark results

2016-06-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15881:


Assignee: (was: Apache Spark)

> Update microbenchmark results
> -
>
> Key: SPARK-15881
> URL: https://issues.apache.org/jira/browse/SPARK-15881
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Eric Liang
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15881) Update microbenchmark results

2016-06-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15325282#comment-15325282
 ] 

Apache Spark commented on SPARK-15881:
--

User 'ericl' has created a pull request for this issue:
https://github.com/apache/spark/pull/13607

> Update microbenchmark results
> -
>
> Key: SPARK-15881
> URL: https://issues.apache.org/jira/browse/SPARK-15881
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Eric Liang
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-15628) pyspark.ml.evaluation module

2016-06-10 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15325278#comment-15325278
 ] 

Joseph K. Bradley edited comment on SPARK-15628 at 6/10/16 9:05 PM:


[~holdenk] I'll close this.  Thanks!


was (Author: josephkb):
[~holdenk] OK, so no missing items?  I'll close this.  Thanks!

> pyspark.ml.evaluation module
> 
>
> Key: SPARK-15628
> URL: https://issues.apache.org/jira/browse/SPARK-15628
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: holdenk
>Assignee: holdenk
>
> See parent task SPARK-14813.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15882) Discuss distributed linear algebra in spark.ml package

2016-06-10 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-15882:
-

 Summary: Discuss distributed linear algebra in spark.ml package
 Key: SPARK-15882
 URL: https://issues.apache.org/jira/browse/SPARK-15882
 Project: Spark
  Issue Type: Brainstorming
  Components: ML
Reporter: Joseph K. Bradley


This JIRA is for discussing how org.apache.spark.mllib.linalg.distributed.* 
should be migrated to org.apache.spark.ml.

Initial questions:
* Should we use Datasets or RDDs underneath?
* If Datasets, are there missing features needed for the migration?
* Do we want to redesign any aspects of the distributed matrices during this 
move?




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15881) Update microbenchmark results

2016-06-10 Thread Eric Liang (JIRA)
Eric Liang created SPARK-15881:
--

 Summary: Update microbenchmark results
 Key: SPARK-15881
 URL: https://issues.apache.org/jira/browse/SPARK-15881
 Project: Spark
  Issue Type: Sub-task
Reporter: Eric Liang






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15086) Update Java API once the Scala one is finalized

2016-06-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15086:


Assignee: Apache Spark

> Update Java API once the Scala one is finalized
> ---
>
> Key: SPARK-15086
> URL: https://issues.apache.org/jira/browse/SPARK-15086
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Reynold Xin
>Assignee: Apache Spark
>Priority: Blocker
>
> We should make sure we update the Java API once the Scala one is finalized. 
> This includes adding the equivalent API in Java as well as deprecating the 
> old ones.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15086) Update Java API once the Scala one is finalized

2016-06-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15086:


Assignee: (was: Apache Spark)

> Update Java API once the Scala one is finalized
> ---
>
> Key: SPARK-15086
> URL: https://issues.apache.org/jira/browse/SPARK-15086
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Reynold Xin
>Priority: Blocker
>
> We should make sure we update the Java API once the Scala one is finalized. 
> This includes adding the equivalent API in Java as well as deprecating the 
> old ones.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15086) Update Java API once the Scala one is finalized

2016-06-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15325254#comment-15325254
 ] 

Apache Spark commented on SPARK-15086:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/13606

> Update Java API once the Scala one is finalized
> ---
>
> Key: SPARK-15086
> URL: https://issues.apache.org/jira/browse/SPARK-15086
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Reynold Xin
>Priority: Blocker
>
> We should make sure we update the Java API once the Scala one is finalized. 
> This includes adding the equivalent API in Java as well as deprecating the 
> old ones.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15738) PySpark ml.feature RFormula missing string representation displaying formula

2016-06-10 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang resolved SPARK-15738.
-
   Resolution: Fixed
Fix Version/s: 2.0.0

> PySpark ml.feature RFormula missing string representation displaying formula
> 
>
> Key: SPARK-15738
> URL: https://issues.apache.org/jira/browse/SPARK-15738
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
> Fix For: 2.0.0
>
>
> From 2.0 Python api coverage
> RFormula and model are missing string representations that show formula like 
> in the Scala api



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15738) PySpark ml.feature RFormula missing string representation displaying formula

2016-06-10 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-15738:

Assignee: Bryan Cutler

> PySpark ml.feature RFormula missing string representation displaying formula
> 
>
> Key: SPARK-15738
> URL: https://issues.apache.org/jira/browse/SPARK-15738
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>
> From 2.0 Python api coverage
> RFormula and model are missing string representations that show formula like 
> in the Scala api



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15782) --packages doesn't work with the spark-shell

2016-06-10 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-15782:
---
Target Version/s: 2.0.0
Priority: Blocker  (was: Major)
 Component/s: (was: Spark Core)
  YARN

This is a regression, so raising the priority; also fixing the component since, 
as far as I understand, this only affects YARN.

> --packages doesn't work with the spark-shell
> 
>
> Key: SPARK-15782
> URL: https://issues.apache.org/jira/browse/SPARK-15782
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.0.0
>Reporter: Nezih Yigitbasi
>Priority: Blocker
>
> When {{--packages}} is specified with {{spark-shell}} the classes from those 
> packages cannot be found, which I think is due to some of the changes in 
> {{SPARK-12343}}. In particular {{SPARK-12343}} removes a line that sets the 
> {{spark.jars}} system property in client mode, which is used by the repl main 
> class to set the classpath.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15875) Avoid using Seq.length == 0 and Seq.length > 0. Use Seq.isEmpty and Seq.nonEmpty instead.

2016-06-10 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-15875.
-
   Resolution: Fixed
 Assignee: Yang Wang
Fix Version/s: 2.0.0

> Avoid using Seq.length == 0 and Seq.length > 0. Use Seq.isEmpty and 
> Seq.nonEmpty instead.
> 
>
> Key: SPARK-15875
> URL: https://issues.apache.org/jira/browse/SPARK-15875
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 2.0.0
>Reporter: Yang Wang
>Assignee: Yang Wang
>Priority: Minor
> Fix For: 2.0.0
>
>
> In Scala, immutable.List.length is an expensive operation, so we should
> avoid using Seq.length == 0 and Seq.length > 0; use Seq.isEmpty and 
> Seq.nonEmpty instead.
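
A minimal standalone illustration of the change being proposed (not taken from the 
actual patch):

{code}
val xs: Seq[Int] = List(1, 2, 3)

// Preferred: isEmpty/nonEmpty are O(1) even when the underlying Seq is a List,
// whereas List.length traverses the whole list.
if (xs.nonEmpty) println("has elements")   // instead of: if (xs.length > 0)
if (xs.isEmpty)  println("empty")          // instead of: if (xs.length == 0)
{code}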



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6320) Adding new query plan strategy to SQLContext

2016-06-10 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-6320.
-
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 13147
[https://github.com/apache/spark/pull/13147]

> Adding new query plan strategy to SQLContext
> 
>
> Key: SPARK-6320
> URL: https://issues.apache.org/jira/browse/SPARK-6320
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: Youssef Hatem
>Priority: Minor
> Fix For: 2.0.0
>
>
> Hi,
> I would like to add a new strategy to {{SQLContext}}. To do this I created a 
> new class which extends {{Strategy}}. In my new class I need to call 
> {{planLater}} function. However this method is defined in {{SparkPlanner}} 
> (which itself inherits the method from {{QueryPlanner}}).
> To my knowledge the only way to make {{planLater}} function visible to my new 
> strategy is to define my strategy inside another class that extends 
> {{SparkPlanner}} and inherits {{planLater}} as a result, by doing so I will 
> have to extend the {{SQLContext}} such that I can override the {{planner}} 
> field with the new {{Planner}} class I created.
> It seems that this is a design problem because adding a new strategy seems to 
> require extending {{SQLContext}} (unless I am doing it wrong and there is a 
> better way to do it).
> Thanks a lot,
> Youssef
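
For context, here is a schematic of the coupling the reporter describes, written with 
toy classes (ToyQueryPlanner, ToyStrategy, and MyPlanner are all made up for 
illustration and are not Spark's actual planner API):

{code}
// planLater lives on the planner, so a strategy that needs it has to be defined
// inside a planner (sub)class; a freestanding strategy cannot call it. In the real
// QueryPlanner the method is additionally protected, which is the crux of the issue.
abstract class ToyQueryPlanner {
  def strategies: Seq[ToyStrategy]

  // Stand-in for QueryPlanner.planLater: defer planning of a subtree.
  def planLater(plan: String): String = s"PlanLater($plan)"

  abstract class ToyStrategy {
    def apply(plan: String): Seq[String]
  }
}

class MyPlanner extends ToyQueryPlanner {
  // The custom strategy is nested here so it can reach planLater on the planner.
  object MyStrategy extends ToyStrategy {
    override def apply(plan: String): Seq[String] = Seq(s"MyExec(${planLater(plan)})")
  }
  override def strategies: Seq[ToyStrategy] = Seq(MyStrategy)
}
{code}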



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15871) Add assertNotPartitioned check in DataFrameWriter

2016-06-10 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-15871.
--
   Resolution: Fixed
 Assignee: Liwei Lin
Fix Version/s: 2.0.0

> Add assertNotPartitioned check in DataFrameWriter
> -
>
> Key: SPARK-15871
> URL: https://issues.apache.org/jira/browse/SPARK-15871
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Liwei Lin
>Assignee: Liwei Lin
> Fix For: 2.0.0
>
>
> Sometimes it doesn't make sense to specify partitioning parameters, e.g. when 
>  we write data out from Datasets/DataFrames into jdbc tables or streaming 
> {{ForeachWriters}}. We probably should add checks against this in 
> {{DataFrameWriter}}.
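
A minimal sketch of the kind of guard being discussed (standalone illustration; the 
helper name mirrors the issue title, but the body is hypothetical and a plain 
IllegalArgumentException stands in for whatever exception the real check would throw):

{code}
object WriterChecksSketch {
  // Fail fast when partitioning columns were specified for a sink that cannot use
  // them, e.g. a jdbc table or a streaming ForeachWriter.
  def assertNotPartitioned(partitioningColumns: Option[Seq[String]], operation: String): Unit = {
    if (partitioningColumns.isDefined) {
      throw new IllegalArgumentException(s"'$operation' does not support partitioning")
    }
  }
}
{code}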



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14485) Task finished cause fetch failure when its executor has already been removed by driver

2016-06-10 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout resolved SPARK-14485.

Resolution: Won't Fix

> Task finished cause fetch failure when its executor has already been removed 
> by driver 
> ---
>
> Key: SPARK-14485
> URL: https://issues.apache.org/jira/browse/SPARK-14485
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.1, 1.5.2
>Reporter: iward
>Assignee: iward
>
> Currently, when an executor is removed by the driver because of a heartbeat 
> timeout, the driver re-queues the tasks that were running on that executor 
> and sends a kill command to the cluster to kill the executor.
> However, a running task on that executor may finish and return its result to 
> the driver before the executor is actually killed. In that case the driver 
> accepts the task-finished event and ignores the speculative and re-queued 
> copies of the task. But because the executor has already been removed, the 
> result of the finished task cannot be registered with the driver: its 
> *BlockManagerId* has also been removed from *BlockManagerMaster*. The output 
> of the stage is therefore incomplete, which later causes a fetch failure.
> For example, the following is the task log:
> {noformat}
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 WARN HeartbeatReceiver: Removing 
> executor 322 with no recent heartbeats: 256015 ms exceeds timeout 25 ms
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 ERROR YarnScheduler: Lost executor 
> 322 on BJHC-HERA-16168.hadoop.jd.local: Executor heartbeat timed out after 
> 256015 ms
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 INFO TaskSetManager: Re-queueing 
> tasks for 322 from TaskSet 107.0
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 WARN TaskSetManager: Lost task 
> 229.0 in stage 107.0 (TID 10384, BJHC-HERA-16168.hadoop.jd.local): 
> ExecutorLostFailure (executor 322 lost)
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 INFO DAGScheduler: Executor lost: 
> 322 (epoch 11)
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 INFO BlockManagerMasterEndpoint: 
> Trying to remove executor 322 from BlockManagerMaster.
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 INFO BlockManagerMaster: Removed 
> 322 successfully in removeExecutor
> {noformat}
> {noformat}
> 2015-12-31 04:38:52 INFO 15/12/31 04:38:52 INFO TaskSetManager: Finished task 
> 229.0 in stage 107.0 (TID 10384) in 272315 ms on 
> BJHC-HERA-16168.hadoop.jd.local (579/700)
> 2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO TaskSetManager: Ignoring 
> task-finished event for 229.1 in stage 107.0 because task 229 has already 
> completed successfully
> {noformat}
> {noformat}
> 2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO DAGScheduler: Submitting 3 
> missing tasks from ShuffleMapStage 107 (MapPartitionsRDD[263] at 
> mapPartitions at Exchange.scala:137)
> 2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO YarnScheduler: Adding task 
> set 107.1 with 3 tasks
> 2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO TaskSetManager: Starting task 
> 0.0 in stage 107.1 (TID 10863, BJHC-HERA-18043.hadoop.jd.local, 
> PROCESS_LOCAL, 3745 bytes)
> 2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO TaskSetManager: Starting task 
> 1.0 in stage 107.1 (TID 10864, BJHC-HERA-9291.hadoop.jd.local, PROCESS_LOCAL, 
> 3745 bytes)
> 2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO TaskSetManager: Starting task 
> 2.0 in stage 107.1 (TID 10865, BJHC-HERA-16047.hadoop.jd.local, 
> PROCESS_LOCAL, 3745 bytes)
> {noformat}
> The driver detects that the stage's output is incomplete and resubmits the 
> missing tasks, but by then the next stage has already started, because the 
> previous stage was marked finished once all of its tasks completed, even 
> though its output was incomplete.
> {noformat}
> 2015-12-31 04:40:13 INFO 15/12/31 04:40:13 WARN TaskSetManager: Lost task 
> 39.0 in stage 109.0 (TID 10905, BJHC-HERA-9357.hadoop.jd.local): 
> FetchFailed(null, shuffleId=11, mapId=-1, reduceId=39, message=
> 2015-12-31 04:40:13 INFO 
> org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output 
> location for shuffle 11
> 2015-12-31 04:40:13 INFO at 
> org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:385)
> 2015-12-31 04:40:13 INFO at 
> org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:382)
> 2015-12-31 04:40:13 INFO at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> 2015-12-31 04:40:13 INFO at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> 2015-12-31 04:40:13 INFO at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> 

[jira] [Commented] (SPARK-14485) Task finished cause fetch failure when its executor has already been removed by driver

2016-06-10 Thread Kay Ousterhout (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15325175#comment-15325175
 ] 

Kay Ousterhout commented on SPARK-14485:


Reverted this and re-opened the JIRA to mark this as "won't fix".

> Task finished cause fetch failure when its executor has already been removed 
> by driver 
> ---
>
> Key: SPARK-14485
> URL: https://issues.apache.org/jira/browse/SPARK-14485
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.1, 1.5.2
>Reporter: iward
>Assignee: iward
>
> Currently, when an executor is removed by the driver because of a heartbeat 
> timeout, the driver re-queues the tasks that were running on that executor 
> and sends a kill command to the cluster to kill the executor.
> However, a running task on that executor may finish and return its result to 
> the driver before the executor is actually killed. In that case the driver 
> accepts the task-finished event and ignores the speculative and re-queued 
> copies of the task. But because the executor has already been removed, the 
> result of the finished task cannot be registered with the driver: its 
> *BlockManagerId* has also been removed from *BlockManagerMaster*. The output 
> of the stage is therefore incomplete, which later causes a fetch failure.
> For example, the following is the task log:
> {noformat}
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 WARN HeartbeatReceiver: Removing 
> executor 322 with no recent heartbeats: 256015 ms exceeds timeout 25 ms
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 ERROR YarnScheduler: Lost executor 
> 322 on BJHC-HERA-16168.hadoop.jd.local: Executor heartbeat timed out after 
> 256015 ms
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 INFO TaskSetManager: Re-queueing 
> tasks for 322 from TaskSet 107.0
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 WARN TaskSetManager: Lost task 
> 229.0 in stage 107.0 (TID 10384, BJHC-HERA-16168.hadoop.jd.local): 
> ExecutorLostFailure (executor 322 lost)
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 INFO DAGScheduler: Executor lost: 
> 322 (epoch 11)
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 INFO BlockManagerMasterEndpoint: 
> Trying to remove executor 322 from BlockManagerMaster.
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 INFO BlockManagerMaster: Removed 
> 322 successfully in removeExecutor
> {noformat}
> {noformat}
> 2015-12-31 04:38:52 INFO 15/12/31 04:38:52 INFO TaskSetManager: Finished task 
> 229.0 in stage 107.0 (TID 10384) in 272315 ms on 
> BJHC-HERA-16168.hadoop.jd.local (579/700)
> 2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO TaskSetManager: Ignoring 
> task-finished event for 229.1 in stage 107.0 because task 229 has already 
> completed successfully
> {noformat}
> {noformat}
> 2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO DAGScheduler: Submitting 3 
> missing tasks from ShuffleMapStage 107 (MapPartitionsRDD[263] at 
> mapPartitions at Exchange.scala:137)
> 2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO YarnScheduler: Adding task 
> set 107.1 with 3 tasks
> 2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO TaskSetManager: Starting task 
> 0.0 in stage 107.1 (TID 10863, BJHC-HERA-18043.hadoop.jd.local, 
> PROCESS_LOCAL, 3745 bytes)
> 2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO TaskSetManager: Starting task 
> 1.0 in stage 107.1 (TID 10864, BJHC-HERA-9291.hadoop.jd.local, PROCESS_LOCAL, 
> 3745 bytes)
> 2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO TaskSetManager: Starting task 
> 2.0 in stage 107.1 (TID 10865, BJHC-HERA-16047.hadoop.jd.local, 
> PROCESS_LOCAL, 3745 bytes)
> {noformat}
> The driver detects that the stage's output is incomplete and resubmits the 
> missing tasks, but by then the next stage has already started, because the 
> previous stage was marked finished once all of its tasks completed, even 
> though its output was incomplete.
> {noformat}
> 2015-12-31 04:40:13 INFO 15/12/31 04:40:13 WARN TaskSetManager: Lost task 
> 39.0 in stage 109.0 (TID 10905, BJHC-HERA-9357.hadoop.jd.local): 
> FetchFailed(null, shuffleId=11, mapId=-1, reduceId=39, message=
> 2015-12-31 04:40:13 INFO 
> org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output 
> location for shuffle 11
> 2015-12-31 04:40:13 INFO at 
> org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:385)
> 2015-12-31 04:40:13 INFO at 
> org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:382)
> 2015-12-31 04:40:13 INFO at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> 2015-12-31 04:40:13 INFO at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> 2015-12-31 04:40:13 INFO at 
> 

[jira] [Reopened] (SPARK-14485) Task finished cause fetch failure when its executor has already been removed by driver

2016-06-10 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout reopened SPARK-14485:


> Task finished cause fetch failure when its executor has already been removed 
> by driver 
> ---
>
> Key: SPARK-14485
> URL: https://issues.apache.org/jira/browse/SPARK-14485
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.1, 1.5.2
>Reporter: iward
>Assignee: iward
>
> Currently, when an executor is removed by the driver because of a heartbeat 
> timeout, the driver re-queues the tasks that were running on that executor 
> and sends a kill command to the cluster to kill the executor.
> However, a running task on that executor may finish and return its result to 
> the driver before the executor is actually killed. In that case the driver 
> accepts the task-finished event and ignores the speculative and re-queued 
> copies of the task. But because the executor has already been removed, the 
> result of the finished task cannot be registered with the driver: its 
> *BlockManagerId* has also been removed from *BlockManagerMaster*. The output 
> of the stage is therefore incomplete, which later causes a fetch failure.
> For example, the following is the task log:
> {noformat}
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 WARN HeartbeatReceiver: Removing 
> executor 322 with no recent heartbeats: 256015 ms exceeds timeout 25 ms
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 ERROR YarnScheduler: Lost executor 
> 322 on BJHC-HERA-16168.hadoop.jd.local: Executor heartbeat timed out after 
> 256015 ms
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 INFO TaskSetManager: Re-queueing 
> tasks for 322 from TaskSet 107.0
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 WARN TaskSetManager: Lost task 
> 229.0 in stage 107.0 (TID 10384, BJHC-HERA-16168.hadoop.jd.local): 
> ExecutorLostFailure (executor 322 lost)
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 INFO DAGScheduler: Executor lost: 
> 322 (epoch 11)
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 INFO BlockManagerMasterEndpoint: 
> Trying to remove executor 322 from BlockManagerMaster.
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 INFO BlockManagerMaster: Removed 
> 322 successfully in removeExecutor
> {noformat}
> {noformat}
> 2015-12-31 04:38:52 INFO 15/12/31 04:38:52 INFO TaskSetManager: Finished task 
> 229.0 in stage 107.0 (TID 10384) in 272315 ms on 
> BJHC-HERA-16168.hadoop.jd.local (579/700)
> 2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO TaskSetManager: Ignoring 
> task-finished event for 229.1 in stage 107.0 because task 229 has already 
> completed successfully
> {noformat}
> {noformat}
> 2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO DAGScheduler: Submitting 3 
> missing tasks from ShuffleMapStage 107 (MapPartitionsRDD[263] at 
> mapPartitions at Exchange.scala:137)
> 2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO YarnScheduler: Adding task 
> set 107.1 with 3 tasks
> 2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO TaskSetManager: Starting task 
> 0.0 in stage 107.1 (TID 10863, BJHC-HERA-18043.hadoop.jd.local, 
> PROCESS_LOCAL, 3745 bytes)
> 2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO TaskSetManager: Starting task 
> 1.0 in stage 107.1 (TID 10864, BJHC-HERA-9291.hadoop.jd.local, PROCESS_LOCAL, 
> 3745 bytes)
> 2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO TaskSetManager: Starting task 
> 2.0 in stage 107.1 (TID 10865, BJHC-HERA-16047.hadoop.jd.local, 
> PROCESS_LOCAL, 3745 bytes)
> {noformat}
> The driver detects that the stage's output is incomplete and resubmits the 
> missing tasks, but by then the next stage has already started, because the 
> previous stage was marked finished once all of its tasks completed, even 
> though its output was incomplete.
> {noformat}
> 2015-12-31 04:40:13 INFO 15/12/31 04:40:13 WARN TaskSetManager: Lost task 
> 39.0 in stage 109.0 (TID 10905, BJHC-HERA-9357.hadoop.jd.local): 
> FetchFailed(null, shuffleId=11, mapId=-1, reduceId=39, message=
> 2015-12-31 04:40:13 INFO 
> org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output 
> location for shuffle 11
> 2015-12-31 04:40:13 INFO at 
> org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:385)
> 2015-12-31 04:40:13 INFO at 
> org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:382)
> 2015-12-31 04:40:13 INFO at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> 2015-12-31 04:40:13 INFO at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> 2015-12-31 04:40:13 INFO at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> 2015-12-31 04:40:13 INFO at 
> 

[jira] [Updated] (SPARK-14485) Task finished cause fetch failure when its executor has already been removed by driver

2016-06-10 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout updated SPARK-14485:
---
Fix Version/s: (was: 2.0.0)

> Task finished cause fetch failure when its executor has already been removed 
> by driver 
> ---
>
> Key: SPARK-14485
> URL: https://issues.apache.org/jira/browse/SPARK-14485
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.1, 1.5.2
>Reporter: iward
>Assignee: iward
>
> Currently, when an executor is removed by the driver because of a heartbeat 
> timeout, the driver re-queues the tasks that were running on that executor 
> and sends a kill command to the cluster to kill the executor.
> However, a running task on that executor may finish and return its result to 
> the driver before the executor is actually killed. In that case the driver 
> accepts the task-finished event and ignores the speculative and re-queued 
> copies of the task. But because the executor has already been removed, the 
> result of the finished task cannot be registered with the driver: its 
> *BlockManagerId* has also been removed from *BlockManagerMaster*. The output 
> of the stage is therefore incomplete, which later causes a fetch failure.
> For example, the following is the task log:
> {noformat}
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 WARN HeartbeatReceiver: Removing 
> executor 322 with no recent heartbeats: 256015 ms exceeds timeout 25 ms
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 ERROR YarnScheduler: Lost executor 
> 322 on BJHC-HERA-16168.hadoop.jd.local: Executor heartbeat timed out after 
> 256015 ms
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 INFO TaskSetManager: Re-queueing 
> tasks for 322 from TaskSet 107.0
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 WARN TaskSetManager: Lost task 
> 229.0 in stage 107.0 (TID 10384, BJHC-HERA-16168.hadoop.jd.local): 
> ExecutorLostFailure (executor 322 lost)
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 INFO DAGScheduler: Executor lost: 
> 322 (epoch 11)
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 INFO BlockManagerMasterEndpoint: 
> Trying to remove executor 322 from BlockManagerMaster.
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 INFO BlockManagerMaster: Removed 
> 322 successfully in removeExecutor
> {noformat}
> {noformat}
> 2015-12-31 04:38:52 INFO 15/12/31 04:38:52 INFO TaskSetManager: Finished task 
> 229.0 in stage 107.0 (TID 10384) in 272315 ms on 
> BJHC-HERA-16168.hadoop.jd.local (579/700)
> 2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO TaskSetManager: Ignoring 
> task-finished event for 229.1 in stage 107.0 because task 229 has already 
> completed successfully
> {noformat}
> {noformat}
> 2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO DAGScheduler: Submitting 3 
> missing tasks from ShuffleMapStage 107 (MapPartitionsRDD[263] at 
> mapPartitions at Exchange.scala:137)
> 2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO YarnScheduler: Adding task 
> set 107.1 with 3 tasks
> 2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO TaskSetManager: Starting task 
> 0.0 in stage 107.1 (TID 10863, BJHC-HERA-18043.hadoop.jd.local, 
> PROCESS_LOCAL, 3745 bytes)
> 2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO TaskSetManager: Starting task 
> 1.0 in stage 107.1 (TID 10864, BJHC-HERA-9291.hadoop.jd.local, PROCESS_LOCAL, 
> 3745 bytes)
> 2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO TaskSetManager: Starting task 
> 2.0 in stage 107.1 (TID 10865, BJHC-HERA-16047.hadoop.jd.local, 
> PROCESS_LOCAL, 3745 bytes)
> {noformat}
> The driver detects that the stage's output is incomplete and resubmits the 
> missing tasks, but by then the next stage has already started, because the 
> previous stage was marked finished once all of its tasks completed, even 
> though its output was incomplete.
> {noformat}
> 2015-12-31 04:40:13 INFO 15/12/31 04:40:13 WARN TaskSetManager: Lost task 
> 39.0 in stage 109.0 (TID 10905, BJHC-HERA-9357.hadoop.jd.local): 
> FetchFailed(null, shuffleId=11, mapId=-1, reduceId=39, message=
> 2015-12-31 04:40:13 INFO 
> org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output 
> location for shuffle 11
> 2015-12-31 04:40:13 INFO at 
> org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:385)
> 2015-12-31 04:40:13 INFO at 
> org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:382)
> 2015-12-31 04:40:13 INFO at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> 2015-12-31 04:40:13 INFO at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> 2015-12-31 04:40:13 INFO at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> 
