[jira] [Commented] (SPARK-8489) Add regression tests for SPARK-8470
[ https://issues.apache.org/jira/browse/SPARK-8489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15271974#comment-15271974 ]

Apache Spark commented on SPARK-8489:
-------------------------------------

User 'dilipbiswal' has created a pull request for this issue:
https://github.com/apache/spark/pull/12924

> Add regression tests for SPARK-8470
> -----------------------------------
>
>                 Key: SPARK-8489
>                 URL: https://issues.apache.org/jira/browse/SPARK-8489
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL, Tests
>    Affects Versions: 1.4.0
>            Reporter: Andrew Or
>            Assignee: Andrew Or
>            Priority: Critical
>             Fix For: 1.4.1, 1.5.0
>
> See SPARK-8470 for more detail. Basically the Spark Hive code silently
> overwrites the context class loader populated in SparkSubmit, resulting in
> certain classes missing when we do reflection in `SQLContext#createDataFrame`.
> That issue is already resolved in https://github.com/apache/spark/pull/6891,
> but we should add a regression test for the specific manifestation of the bug
> in SPARK-8470.
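[Editor's note] The shape of such a regression test might look like the following minimal sketch. This is an assumption about the test's structure, not the contents of the actual PR; it presumes an existing SparkContext {{sc}} with Hive support on the classpath:

{code}
// The context class loader installed by SparkSubmit must survive Hive initialization.
val originalLoader = Thread.currentThread().getContextClassLoader

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
hiveContext.sql("SHOW TABLES").collect()

// SPARK-8470: the Hive code path used to silently replace this loader,
// breaking later reflection in SQLContext#createDataFrame.
assert(Thread.currentThread().getContextClassLoader eq originalLoader)
{code}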
[jira] [Commented] (SPARK-14893) Re-enable HiveSparkSubmitSuite SPARK-8489 test after HiveContext is removed
[ https://issues.apache.org/jira/browse/SPARK-14893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15271973#comment-15271973 ]

Apache Spark commented on SPARK-14893:
--------------------------------------

User 'dilipbiswal' has created a pull request for this issue:
https://github.com/apache/spark/pull/12924

> Re-enable HiveSparkSubmitSuite SPARK-8489 test after HiveContext is removed
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-14893
>                 URL: https://issues.apache.org/jira/browse/SPARK-14893
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL, Tests
>    Affects Versions: 2.0.0
>            Reporter: Andrew Or
>
> The test was disabled in https://github.com/apache/spark/pull/12585.
> To re-enable it we need to rebuild the jar using the updated source code.
[jira] [Assigned] (SPARK-14893) Re-enable HiveSparkSubmitSuite SPARK-8489 test after HiveContext is removed
[ https://issues.apache.org/jira/browse/SPARK-14893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-14893:
------------------------------------

    Assignee: Apache Spark

> Re-enable HiveSparkSubmitSuite SPARK-8489 test after HiveContext is removed
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-14893
>                 URL: https://issues.apache.org/jira/browse/SPARK-14893
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL, Tests
>    Affects Versions: 2.0.0
>            Reporter: Andrew Or
>            Assignee: Apache Spark
>
> The test was disabled in https://github.com/apache/spark/pull/12585.
> To re-enable it we need to rebuild the jar using the updated source code.
[jira] [Assigned] (SPARK-14893) Re-enable HiveSparkSubmitSuite SPARK-8489 test after HiveContext is removed
[ https://issues.apache.org/jira/browse/SPARK-14893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-14893:
------------------------------------

    Assignee: (was: Apache Spark)

> Re-enable HiveSparkSubmitSuite SPARK-8489 test after HiveContext is removed
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-14893
>                 URL: https://issues.apache.org/jira/browse/SPARK-14893
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL, Tests
>    Affects Versions: 2.0.0
>            Reporter: Andrew Or
>
> The test was disabled in https://github.com/apache/spark/pull/12585.
> To re-enable it we need to rebuild the jar using the updated source code.
[jira] [Commented] (SPARK-15114) Column name generated by typed aggregate is super verbose
[ https://issues.apache.org/jira/browse/SPARK-15114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15271972#comment-15271972 ]

Dilip Biswal commented on SPARK-15114:
--------------------------------------

[~yhuai] Sure Yin. I will give it a try.

> Column name generated by typed aggregate is super verbose
> ----------------------------------------------------------
>
>                 Key: SPARK-15114
>                 URL: https://issues.apache.org/jira/browse/SPARK-15114
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>            Reporter: Yin Huai
>            Priority: Critical
>
> {code}
> case class Person(name: String, email: String, age: Long)
> val ds = spark.read.json("/tmp/person.json").as[Person]
> import org.apache.spark.sql.expressions.scala.typed._
>
> ds.groupByKey(_ => 0).agg(sum(_.age))
> // org.apache.spark.sql.Dataset[(Int, Double)] = [value: int,
> // typedsumdouble(unresolveddeserializer(newInstance(class Person), age#0L,
> // email#1, name#2), upcast(value)): double]
>
> ds.groupByKey(_ => 0).agg(sum(_.age)).explain
> == Physical Plan ==
> WholeStageCodegen
> :  +- TungstenAggregate(key=[value#84],
>       functions=[(TypedSumDouble($line15.$read$$iw$$iw$Person),mode=Final,isDistinct=false)],
>       output=[value#84,typedsumdouble(unresolveddeserializer(newInstance(class
>       $line15.$read$$iw$$iw$Person), age#0L, email#1, name#2), upcast(value))#91])
> :     +- INPUT
> +- Exchange hashpartitioning(value#84, 200), None
>    +- WholeStageCodegen
>       :  +- TungstenAggregate(key=[value#84],
>             functions=[(TypedSumDouble($line15.$read$$iw$$iw$Person),mode=Partial,isDistinct=false)],
>             output=[value#84,value#97])
>       :     +- INPUT
>       +- AppendColumns , newInstance(class $line15.$read$$iw$$iw$Person),
>          [input[0, int] AS value#84]
>          +- WholeStageCodegen
>             :  +- Scan HadoopFiles[age#0L,email#1,name#2] Format: JSON,
>                   PushedFilters: [], ReadSchema: struct
> {code}
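[Editor's note] While the generated name is being sorted out, one way to get readable column names today is to rename the aggregate output after the fact. This is a sketch of a possible workaround, not something taken from the JIRA; it reuses {{ds}} from the snippet above:

{code}
import org.apache.spark.sql.expressions.scala.typed._

// toDF on a Dataset[(Int, Double)] assigns new column names positionally,
// replacing both the grouping column name and the verbose aggregate name.
val summed = ds.groupByKey(_ => 0).agg(sum(_.age)).toDF("key", "sum_age")
// summed: org.apache.spark.sql.DataFrame = [key: int, sum_age: double]
{code}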
[jira] [Assigned] (SPARK-15148) Upgrade Univocity library from 2.0.2 to 2.1.0
[ https://issues.apache.org/jira/browse/SPARK-15148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-15148:
------------------------------------

    Assignee: Apache Spark

> Upgrade Univocity library from 2.0.2 to 2.1.0
> ----------------------------------------------
>
>                 Key: SPARK-15148
>                 URL: https://issues.apache.org/jira/browse/SPARK-15148
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Hyukjin Kwon
>            Assignee: Apache Spark
>            Priority: Minor
>
> It looks like a new release of the Univocity CSV library was published,
> https://github.com/uniVocity/univocity-parsers/releases.
> It contains the following improvements:
> {quote}
> 1. Performance improvements for parsing/writing CSV and TSV. CSV writing and
> parsing got 30-40% faster.
> 2. Deprecated the methods setParseUnescapedQuotes and
> setParseUnescapedQuotesUntilDelimiter of class CsvParserSettings in favor of
> the new setUnescapedQuoteHandling method, which takes values from the
> UnescapedQuoteHandling enumeration.
> 3. The default behavior of the CSV parser when unescaped quotes are found on
> the input changed to parsing until a delimiter character is found, i.e.
> UnescapedQuoteHandling.STOP_AT_DELIMITER. The old default of trying to find a
> closing quote (i.e. UnescapedQuoteHandling.STOP_AT_CLOSING_QUOTE) can be
> problematic when no closing quote is found, making the parser accumulate all
> characters into the same value until the end of the input.
> {quote}
> This matters for Spark in two ways. First, Spark uses this library for the
> CSV data source, so the performance improvements apply directly. Second,
> Spark calls {{setParseUnescapedQuotesUntilDelimiter}}, which is deprecated in
> this version in favor of the richer unescaped-quote handling. That is not
> directly a problem for Spark today, but we might have to consider moving to
> the new API in the future.
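[Editor's note] For reference, the replacement API described in item 2 of the release notes would be used roughly like this. This is a sketch against univocity-parsers 2.1.0 as described above; the package name is an assumption:

{code}
import com.univocity.parsers.csv.{CsvParserSettings, UnescapedQuoteHandling}

val settings = new CsvParserSettings()
// Replaces the deprecated settings.setParseUnescapedQuotesUntilDelimiter(true),
// and matches the new 2.1.0 default behavior:
settings.setUnescapedQuoteHandling(UnescapedQuoteHandling.STOP_AT_DELIMITER)
{code}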
[jira] [Assigned] (SPARK-15148) Upgrade Univocity library from 2.0.2 to 2.1.0
[ https://issues.apache.org/jira/browse/SPARK-15148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-15148:
------------------------------------

    Assignee: (was: Apache Spark)

> Upgrade Univocity library from 2.0.2 to 2.1.0
> ----------------------------------------------
>
>                 Key: SPARK-15148
>                 URL: https://issues.apache.org/jira/browse/SPARK-15148
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Hyukjin Kwon
>            Priority: Minor
>
> It looks like a new release of the Univocity CSV library was published,
> https://github.com/uniVocity/univocity-parsers/releases.
> It contains the following improvements:
> {quote}
> 1. Performance improvements for parsing/writing CSV and TSV. CSV writing and
> parsing got 30-40% faster.
> 2. Deprecated the methods setParseUnescapedQuotes and
> setParseUnescapedQuotesUntilDelimiter of class CsvParserSettings in favor of
> the new setUnescapedQuoteHandling method, which takes values from the
> UnescapedQuoteHandling enumeration.
> 3. The default behavior of the CSV parser when unescaped quotes are found on
> the input changed to parsing until a delimiter character is found, i.e.
> UnescapedQuoteHandling.STOP_AT_DELIMITER. The old default of trying to find a
> closing quote (i.e. UnescapedQuoteHandling.STOP_AT_CLOSING_QUOTE) can be
> problematic when no closing quote is found, making the parser accumulate all
> characters into the same value until the end of the input.
> {quote}
> This matters for Spark in two ways. First, Spark uses this library for the
> CSV data source, so the performance improvements apply directly. Second,
> Spark calls {{setParseUnescapedQuotesUntilDelimiter}}, which is deprecated in
> this version in favor of the richer unescaped-quote handling. That is not
> directly a problem for Spark today, but we might have to consider moving to
> the new API in the future.
[jira] [Commented] (SPARK-15148) Upgrade Univocity library from 2.0.2 to 2.1.0
[ https://issues.apache.org/jira/browse/SPARK-15148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15271964#comment-15271964 ]

Apache Spark commented on SPARK-15148:
--------------------------------------

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/12923

> Upgrade Univocity library from 2.0.2 to 2.1.0
> ----------------------------------------------
>
>                 Key: SPARK-15148
>                 URL: https://issues.apache.org/jira/browse/SPARK-15148
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Hyukjin Kwon
>            Priority: Minor
>
> It looks like a new release of the Univocity CSV library was published,
> https://github.com/uniVocity/univocity-parsers/releases.
> It contains the following improvements:
> {quote}
> 1. Performance improvements for parsing/writing CSV and TSV. CSV writing and
> parsing got 30-40% faster.
> 2. Deprecated the methods setParseUnescapedQuotes and
> setParseUnescapedQuotesUntilDelimiter of class CsvParserSettings in favor of
> the new setUnescapedQuoteHandling method, which takes values from the
> UnescapedQuoteHandling enumeration.
> 3. The default behavior of the CSV parser when unescaped quotes are found on
> the input changed to parsing until a delimiter character is found, i.e.
> UnescapedQuoteHandling.STOP_AT_DELIMITER. The old default of trying to find a
> closing quote (i.e. UnescapedQuoteHandling.STOP_AT_CLOSING_QUOTE) can be
> problematic when no closing quote is found, making the parser accumulate all
> characters into the same value until the end of the input.
> {quote}
> This matters for Spark in two ways. First, Spark uses this library for the
> CSV data source, so the performance improvements apply directly. Second,
> Spark calls {{setParseUnescapedQuotesUntilDelimiter}}, which is deprecated in
> this version in favor of the richer unescaped-quote handling. That is not
> directly a problem for Spark today, but we might have to consider moving to
> the new API in the future.
[jira] [Commented] (SPARK-15146) Allow specifying kafka parameters through configurations
[ https://issues.apache.org/jira/browse/SPARK-15146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15271962#comment-15271962 ]

Saisai Shao commented on SPARK-15146:
-------------------------------------

[~c...@koeninger.org], what is your opinion on the JIRA? Thanks a lot.

> Allow specifying kafka parameters through configurations
> ---------------------------------------------------------
>
>                 Key: SPARK-15146
>                 URL: https://issues.apache.org/jira/browse/SPARK-15146
>             Project: Spark
>          Issue Type: Improvement
>          Components: Streaming
>            Reporter: Saisai Shao
>            Priority: Minor
>
> The current Spark Streaming Kafka connector can only accept consumer
> parameters through the {{kafkaParams}} argument, which is not convenient for
> end users: they have to recompile their code each time they change a
> configuration.
> So here I propose to allow specifying Kafka consumer parameters through
> configurations, similar to what we do for Hadoop configurations.
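[Editor's note] For context, this is the programmatic path the proposal wants to supplement, sketched against the Spark Streaming Kafka 0.8 direct-stream API. The broker address and topic name are placeholders, and an existing StreamingContext {{ssc}} is assumed:

{code}
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

// Today, every consumer parameter is baked into the compiled application:
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
val topics = Set("events")

val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics)
{code}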
[jira] [Comment Edited] (SPARK-14495) Distinct aggregation cannot be used in the having clause
[ https://issues.apache.org/jira/browse/SPARK-14495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15271956#comment-15271956 ]

Xin Wu edited comment on SPARK-14495 at 5/5/16 6:25 AM:
--------------------------------------------------------

[~smilegator] I got the fix and running regtest now. Will submit the PR once it is done.

was (Author: xwu0226):
[~smilegator] I got the fix and running regtest now. Will submit the PR one it is done.

> Distinct aggregation cannot be used in the having clause
> ---------------------------------------------------------
>
>                 Key: SPARK-14495
>                 URL: https://issues.apache.org/jira/browse/SPARK-14495
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.1
>            Reporter: Yin Huai
>
> {code}
> select date, count(distinct id)
> from (select '2010-01-01' as date, 1 as id) tmp
> group by date
> having count(distinct id) > 0;
>
> org.apache.spark.sql.AnalysisException: resolved attribute(s) gid#558,id#559
> missing from date#554,id#555 in operator !Expand [List(date#554, null, 0,
> if ((gid#558 = 1)) id#559 else null),List(date#554, id#555, 1, null)],
> [date#554,id#561,gid#560,if ((gid = 1)) id else null#562];
>   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38)
>   at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
>   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:183)
>   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:50)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:121)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:120)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:120)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:120)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:120)
>   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:50)
>   at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
>   at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:34)
>   at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:133)
>   at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52)
>   at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:816)
> {code}
[jira] [Created] (SPARK-15148) Upgrade Univocity library from 2.0.2 to 2.1.0
Hyukjin Kwon created SPARK-15148:
------------------------------------

             Summary: Upgrade Univocity library from 2.0.2 to 2.1.0
                 Key: SPARK-15148
                 URL: https://issues.apache.org/jira/browse/SPARK-15148
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.0.0
            Reporter: Hyukjin Kwon
            Priority: Minor

It looks like a new release of the Univocity CSV library was published,
https://github.com/uniVocity/univocity-parsers/releases.
It contains the following improvements:

{quote}
1. Performance improvements for parsing/writing CSV and TSV. CSV writing and
parsing got 30-40% faster.
2. Deprecated the methods setParseUnescapedQuotes and
setParseUnescapedQuotesUntilDelimiter of class CsvParserSettings in favor of
the new setUnescapedQuoteHandling method, which takes values from the
UnescapedQuoteHandling enumeration.
3. The default behavior of the CSV parser when unescaped quotes are found on
the input changed to parsing until a delimiter character is found, i.e.
UnescapedQuoteHandling.STOP_AT_DELIMITER. The old default of trying to find a
closing quote (i.e. UnescapedQuoteHandling.STOP_AT_CLOSING_QUOTE) can be
problematic when no closing quote is found, making the parser accumulate all
characters into the same value until the end of the input.
{quote}

This matters for Spark in two ways. First, Spark uses this library for the CSV
data source, so the performance improvements apply directly. Second, Spark
calls {{setParseUnescapedQuotesUntilDelimiter}}, which is deprecated in this
version in favor of the richer unescaped-quote handling. That is not directly
a problem for Spark today, but we might have to consider moving to the new API
in the future.
[jira] [Commented] (SPARK-14495) Distinct aggregation cannot be used in the having clause
[ https://issues.apache.org/jira/browse/SPARK-14495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15271956#comment-15271956 ]

Xin Wu commented on SPARK-14495:
--------------------------------

[~smilegator] I got the fix and running regtest now. Will submit the PR one it is done.

> Distinct aggregation cannot be used in the having clause
> ---------------------------------------------------------
>
>                 Key: SPARK-14495
>                 URL: https://issues.apache.org/jira/browse/SPARK-14495
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.1
>            Reporter: Yin Huai
>
> {code}
> select date, count(distinct id)
> from (select '2010-01-01' as date, 1 as id) tmp
> group by date
> having count(distinct id) > 0;
>
> org.apache.spark.sql.AnalysisException: resolved attribute(s) gid#558,id#559
> missing from date#554,id#555 in operator !Expand [List(date#554, null, 0,
> if ((gid#558 = 1)) id#559 else null),List(date#554, id#555, 1, null)],
> [date#554,id#561,gid#560,if ((gid = 1)) id else null#562];
>   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38)
>   at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
>   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:183)
>   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:50)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:121)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:120)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:120)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:120)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:120)
>   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:50)
>   at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
>   at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:34)
>   at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:133)
>   at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52)
>   at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:816)
> {code}
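[Editor's note] Until the fix lands, a possible rewrite of the failing query is to alias the distinct count in a subquery and filter on the alias instead of repeating the aggregate in HAVING. This is a sketch, not taken from the JIRA; it assumes the usual {{sqlContext}} of Spark 1.6:

{code}
sqlContext.sql("""
  SELECT date, cnt FROM (
    SELECT date, COUNT(DISTINCT id) AS cnt
    FROM (SELECT '2010-01-01' AS date, 1 AS id) tmp
    GROUP BY date
  ) t
  WHERE cnt > 0
""")
{code}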
[jira] [Created] (SPARK-15147) Catalog should have a property to indicate case-sensitivity
Cheng Lian created SPARK-15147:
----------------------------------

             Summary: Catalog should have a property to indicate case-sensitivity
                 Key: SPARK-15147
                 URL: https://issues.apache.org/jira/browse/SPARK-15147
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.0.0
            Reporter: Cheng Lian

We are moving from the Hive metastore catalog to a more general, extensible
catalog design. One problem that hasn't been taken care of in the current
Spark 2.0 interfaces is case sensitivity. More specifically, the Hive
metastore is case insensitive: it simply stores column names, table names,
struct field names, and function names in lower case, and thus isn't even
case-preserving. However, case sensitivity in Spark SQL is configurable. We
need to add a property (or properties) to the {{Catalog}} interface to
indicate the case-sensitivity of underlying catalog implementations.
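[Editor's note] One hypothetical shape for the property this issue asks for, covering both of the behaviors the description distinguishes. The trait layout and member names here are illustrative assumptions, not the actual Spark 2.0 interface:

{code}
// A catalog implementation would mix this in and report its own semantics,
// so the analyzer can reconcile them with spark.sql.caseSensitive.
trait CatalogCaseSensitivity {
  /** Whether identifier lookups in the underlying catalog are case-sensitive. */
  def isCaseSensitive: Boolean

  /** Whether the catalog preserves the case of stored identifiers
   *  (Hive metastore lower-cases them, so it is neither). */
  def isCasePreserving: Boolean
}
{code}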
[jira] [Comment Edited] (SPARK-15114) Column name generated by typed aggregate is super verbose
[ https://issues.apache.org/jira/browse/SPARK-15114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15271943#comment-15271943 ]

Yin Huai edited comment on SPARK-15114 at 5/5/16 6:22 AM:
----------------------------------------------------------

I think at there, we should use a UnresolvedAlias. Will you have time to try it out and see what will be a good way to generate the alias?

was (Author: yhuai):
I think at here, we should use a UnresolvedAlias. Will you have time to try it out and see what will be a good way to generate the alias?

> Column name generated by typed aggregate is super verbose
> ----------------------------------------------------------
>
>                 Key: SPARK-15114
>                 URL: https://issues.apache.org/jira/browse/SPARK-15114
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>            Reporter: Yin Huai
>            Priority: Critical
>
> {code}
> case class Person(name: String, email: String, age: Long)
> val ds = spark.read.json("/tmp/person.json").as[Person]
> import org.apache.spark.sql.expressions.scala.typed._
>
> ds.groupByKey(_ => 0).agg(sum(_.age))
> // org.apache.spark.sql.Dataset[(Int, Double)] = [value: int,
> // typedsumdouble(unresolveddeserializer(newInstance(class Person), age#0L,
> // email#1, name#2), upcast(value)): double]
>
> ds.groupByKey(_ => 0).agg(sum(_.age)).explain
> == Physical Plan ==
> WholeStageCodegen
> :  +- TungstenAggregate(key=[value#84],
>       functions=[(TypedSumDouble($line15.$read$$iw$$iw$Person),mode=Final,isDistinct=false)],
>       output=[value#84,typedsumdouble(unresolveddeserializer(newInstance(class
>       $line15.$read$$iw$$iw$Person), age#0L, email#1, name#2), upcast(value))#91])
> :     +- INPUT
> +- Exchange hashpartitioning(value#84, 200), None
>    +- WholeStageCodegen
>       :  +- TungstenAggregate(key=[value#84],
>             functions=[(TypedSumDouble($line15.$read$$iw$$iw$Person),mode=Partial,isDistinct=false)],
>             output=[value#84,value#97])
>       :     +- INPUT
>       +- AppendColumns , newInstance(class $line15.$read$$iw$$iw$Person),
>          [input[0, int] AS value#84]
>          +- WholeStageCodegen
>             :  +- Scan HadoopFiles[age#0L,email#1,name#2] Format: JSON,
>                   PushedFilters: [], ReadSchema: struct
> {code}
[jira] [Created] (SPARK-15146) Allow specifying kafka parameters through configurations
Saisai Shao created SPARK-15146:
-----------------------------------

             Summary: Allow specifying kafka parameters through configurations
                 Key: SPARK-15146
                 URL: https://issues.apache.org/jira/browse/SPARK-15146
             Project: Spark
          Issue Type: Improvement
          Components: Streaming
            Reporter: Saisai Shao
            Priority: Minor

The current Spark Streaming Kafka connector can only accept consumer
parameters through the {{kafkaParams}} argument, which is not convenient for
end users: they have to recompile their code each time they change a
configuration.

So here I propose to allow specifying Kafka consumer parameters through
configurations, similar to what we do for Hadoop configurations.
[jira] [Commented] (SPARK-15114) Column name generated by typed aggregate is super verbose
[ https://issues.apache.org/jira/browse/SPARK-15114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15271943#comment-15271943 ]

Yin Huai commented on SPARK-15114:
----------------------------------

I think at here, we should use a UnresolvedAlias. Will you have time to try it out and see what will be a good way to generate the alias?

> Column name generated by typed aggregate is super verbose
> ----------------------------------------------------------
>
>                 Key: SPARK-15114
>                 URL: https://issues.apache.org/jira/browse/SPARK-15114
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>            Reporter: Yin Huai
>            Priority: Critical
>
> {code}
> case class Person(name: String, email: String, age: Long)
> val ds = spark.read.json("/tmp/person.json").as[Person]
> import org.apache.spark.sql.expressions.scala.typed._
>
> ds.groupByKey(_ => 0).agg(sum(_.age))
> // org.apache.spark.sql.Dataset[(Int, Double)] = [value: int,
> // typedsumdouble(unresolveddeserializer(newInstance(class Person), age#0L,
> // email#1, name#2), upcast(value)): double]
>
> ds.groupByKey(_ => 0).agg(sum(_.age)).explain
> == Physical Plan ==
> WholeStageCodegen
> :  +- TungstenAggregate(key=[value#84],
>       functions=[(TypedSumDouble($line15.$read$$iw$$iw$Person),mode=Final,isDistinct=false)],
>       output=[value#84,typedsumdouble(unresolveddeserializer(newInstance(class
>       $line15.$read$$iw$$iw$Person), age#0L, email#1, name#2), upcast(value))#91])
> :     +- INPUT
> +- Exchange hashpartitioning(value#84, 200), None
>    +- WholeStageCodegen
>       :  +- TungstenAggregate(key=[value#84],
>             functions=[(TypedSumDouble($line15.$read$$iw$$iw$Person),mode=Partial,isDistinct=false)],
>             output=[value#84,value#97])
>       :     +- INPUT
>       +- AppendColumns , newInstance(class $line15.$read$$iw$$iw$Person),
>          [input[0, int] AS value#84]
>          +- WholeStageCodegen
>             :  +- Scan HadoopFiles[age#0L,email#1,name#2] Format: JSON,
>                   PushedFilters: [], ReadSchema: struct
> {code}
[jira] [Commented] (SPARK-15144) option nullValue for CSV data source not working for several types.
[ https://issues.apache.org/jira/browse/SPARK-15144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15271927#comment-15271927 ]

Apache Spark commented on SPARK-15144:
--------------------------------------

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/12921

> option nullValue for CSV data source not working for several types.
> ---------------------------------------------------------------------
>
>                 Key: SPARK-15144
>                 URL: https://issues.apache.org/jira/browse/SPARK-15144
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Hyukjin Kwon
>
> The {{nullValue}} option does not work for the types {{BooleanType}},
> {{TimestampType}}, {{DateType}}, and {{StringType}}.
> So currently there is no way to read {{null}} for those types.
[jira] [Assigned] (SPARK-15144) option nullValue for CSV data source not working for several types.
[ https://issues.apache.org/jira/browse/SPARK-15144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-15144:
------------------------------------

    Assignee: (was: Apache Spark)

> option nullValue for CSV data source not working for several types.
> ---------------------------------------------------------------------
>
>                 Key: SPARK-15144
>                 URL: https://issues.apache.org/jira/browse/SPARK-15144
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Hyukjin Kwon
>
> The {{nullValue}} option does not work for the types {{BooleanType}},
> {{TimestampType}}, {{DateType}}, and {{StringType}}.
> So currently there is no way to read {{null}} for those types.
[jira] [Assigned] (SPARK-15144) option nullValue for CSV data source not working for several types.
[ https://issues.apache.org/jira/browse/SPARK-15144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-15144:
------------------------------------

    Assignee: Apache Spark

> option nullValue for CSV data source not working for several types.
> ---------------------------------------------------------------------
>
>                 Key: SPARK-15144
>                 URL: https://issues.apache.org/jira/browse/SPARK-15144
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Hyukjin Kwon
>            Assignee: Apache Spark
>
> The {{nullValue}} option does not work for the types {{BooleanType}},
> {{TimestampType}}, {{DateType}}, and {{StringType}}.
> So currently there is no way to read {{null}} for those types.
[jira] [Assigned] (SPARK-15143) CSV data source is not being tested as HadoopFsRelation
[ https://issues.apache.org/jira/browse/SPARK-15143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-15143:
------------------------------------

    Assignee: Apache Spark

> CSV data source is not being tested as HadoopFsRelation
> ---------------------------------------------------------
>
>                 Key: SPARK-15143
>                 URL: https://issues.apache.org/jira/browse/SPARK-15143
>             Project: Spark
>          Issue Type: Test
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Hyukjin Kwon
>            Assignee: Apache Spark
>
> JSON, Parquet, Text, and ORC are tested by extending {{HadoopFsRelationTest}},
> which covers roughly 60 tests. CSV is not tested this way.
[jira] [Assigned] (SPARK-15145) spark.ml binary classification should include accuracy
[ https://issues.apache.org/jira/browse/SPARK-15145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-15145:
------------------------------------

    Assignee: Apache Spark

> spark.ml binary classification should include accuracy
> --------------------------------------------------------
>
>                 Key: SPARK-15145
>                 URL: https://issues.apache.org/jira/browse/SPARK-15145
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML
>            Reporter: Miao Wang
>            Assignee: Apache Spark
>            Priority: Minor
>
> spark.ml binary classification should include accuracy. This JIRA is related
> to SPARK-14900.
[jira] [Assigned] (SPARK-15143) CSV data source is not being tested as HadoopFsRelation
[ https://issues.apache.org/jira/browse/SPARK-15143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-15143:
------------------------------------

    Assignee: (was: Apache Spark)

> CSV data source is not being tested as HadoopFsRelation
> ---------------------------------------------------------
>
>                 Key: SPARK-15143
>                 URL: https://issues.apache.org/jira/browse/SPARK-15143
>             Project: Spark
>          Issue Type: Test
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Hyukjin Kwon
>
> JSON, Parquet, Text, and ORC are tested by extending {{HadoopFsRelationTest}},
> which covers roughly 60 tests. CSV is not tested this way.
[jira] [Commented] (SPARK-15143) CSV data source is not being tested as HadoopFsRelation
[ https://issues.apache.org/jira/browse/SPARK-15143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15271926#comment-15271926 ]

Apache Spark commented on SPARK-15143:
--------------------------------------

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/12921

> CSV data source is not being tested as HadoopFsRelation
> ---------------------------------------------------------
>
>                 Key: SPARK-15143
>                 URL: https://issues.apache.org/jira/browse/SPARK-15143
>             Project: Spark
>          Issue Type: Test
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Hyukjin Kwon
>
> JSON, Parquet, Text, and ORC are tested by extending {{HadoopFsRelationTest}},
> which covers roughly 60 tests. CSV is not tested this way.
[jira] [Commented] (SPARK-15145) spark.ml binary classification should include accuracy
[ https://issues.apache.org/jira/browse/SPARK-15145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15271925#comment-15271925 ]

Apache Spark commented on SPARK-15145:
--------------------------------------

User 'wangmiao1981' has created a pull request for this issue:
https://github.com/apache/spark/pull/12922

> spark.ml binary classification should include accuracy
> --------------------------------------------------------
>
>                 Key: SPARK-15145
>                 URL: https://issues.apache.org/jira/browse/SPARK-15145
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML
>            Reporter: Miao Wang
>            Priority: Minor
>
> spark.ml binary classification should include accuracy. This JIRA is related
> to SPARK-14900.
[jira] [Assigned] (SPARK-15145) spark.ml binary classification should include accuracy
[ https://issues.apache.org/jira/browse/SPARK-15145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-15145:
------------------------------------

    Assignee: (was: Apache Spark)

> spark.ml binary classification should include accuracy
> --------------------------------------------------------
>
>                 Key: SPARK-15145
>                 URL: https://issues.apache.org/jira/browse/SPARK-15145
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML
>            Reporter: Miao Wang
>            Priority: Minor
>
> spark.ml binary classification should include accuracy. This JIRA is related
> to SPARK-14900.
[jira] [Created] (SPARK-15145) spark.ml binary classification should include accuracy
Miao Wang created SPARK-15145:
---------------------------------

             Summary: spark.ml binary classification should include accuracy
                 Key: SPARK-15145
                 URL: https://issues.apache.org/jira/browse/SPARK-15145
             Project: Spark
          Issue Type: New Feature
          Components: ML
            Reporter: Miao Wang
            Priority: Minor

spark.ml binary classification should include accuracy. This JIRA is related
to SPARK-14900.
[jira] [Commented] (SPARK-15144) option nullValue for CSV data source not working for several types.
[ https://issues.apache.org/jira/browse/SPARK-15144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15271922#comment-15271922 ]

Abhinav Gupta commented on SPARK-15144:
---------------------------------------

Can you explain with an example, so that I can reproduce it?

> option nullValue for CSV data source not working for several types.
> ---------------------------------------------------------------------
>
>                 Key: SPARK-15144
>                 URL: https://issues.apache.org/jira/browse/SPARK-15144
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Hyukjin Kwon
>
> The {{nullValue}} option does not work for the types {{BooleanType}},
> {{TimestampType}}, {{DateType}}, and {{StringType}}.
> So currently there is no way to read {{null}} for those types.
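[Editor's note] A minimal reproduction might look like the following, sketched against Spark 2.0's built-in CSV reader; the file path, column names, and the "NA" marker are made up for illustration. The expectation is that every "NA" cell becomes null, but per this issue that only happens for numeric columns:

{code}
import org.apache.spark.sql.types._

// /tmp/people.csv:
//   name,active,born
//   alice,true,2010-01-01
//   NA,NA,NA
val schema = StructType(Seq(
  StructField("name", StringType),
  StructField("active", BooleanType),
  StructField("born", DateType)))

val df = spark.read
  .option("header", "true")
  .option("nullValue", "NA")   // should map "NA" to null for every type
  .schema(schema)
  .csv("/tmp/people.csv")

df.show()  // the string/boolean/date "NA" cells are not read back as null
{code}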
[jira] [Created] (SPARK-15144) option nullValue for CSV data source not working for several types.
Hyukjin Kwon created SPARK-15144:
------------------------------------

             Summary: option nullValue for CSV data source not working for several types.
                 Key: SPARK-15144
                 URL: https://issues.apache.org/jira/browse/SPARK-15144
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.0.0
            Reporter: Hyukjin Kwon

The {{nullValue}} option does not work for the types {{BooleanType}},
{{TimestampType}}, {{DateType}}, and {{StringType}}.
So currently there is no way to read {{null}} for those types.
[jira] [Created] (SPARK-15143) CSV data source is not being tested as HadoopFsRelation
Hyukjin Kwon created SPARK-15143:
------------------------------------

             Summary: CSV data source is not being tested as HadoopFsRelation
                 Key: SPARK-15143
                 URL: https://issues.apache.org/jira/browse/SPARK-15143
             Project: Spark
          Issue Type: Test
          Components: SQL
    Affects Versions: 2.0.0
            Reporter: Hyukjin Kwon

JSON, Parquet, Text, and ORC are tested by extending {{HadoopFsRelationTest}},
which covers roughly 60 tests. CSV is not tested this way.
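[Editor's note] Hooking CSV into the shared suite could look roughly like this, modeled on how the existing JSON/Parquet suites extend the harness. The exact abstract members of {{HadoopFsRelationTest}} are an assumption here:

{code}
class CSVHadoopFsRelationSuite extends HadoopFsRelationTest {
  // Registers the ~60 shared read/write tests in HadoopFsRelationTest
  // against the CSV data source.
  override val dataSourceName: String = "csv"
}
{code}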
[jira] [Updated] (SPARK-15045) Remove dead code in TaskMemoryManager.cleanUpAllAllocatedMemory for pageTable
[ https://issues.apache.org/jira/browse/SPARK-15045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Davies Liu updated SPARK-15045:
-------------------------------
    Assignee: Jacek Lewandowski

> Remove dead code in TaskMemoryManager.cleanUpAllAllocatedMemory for pageTable
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-15045
>                 URL: https://issues.apache.org/jira/browse/SPARK-15045
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 2.0.0
>            Reporter: Jacek Laskowski
>            Assignee: Jacek Lewandowski
>             Fix For: 2.0.0
>
> Unless my eyes trick me, {{TaskMemoryManager}} first clears up {{pageTable}}
> in a synchronized block, and right after the block it does it again. I think
> the outside cleaning is dead code.
> See
> https://github.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/memory/TaskMemoryManager.java#L382-L397
> with the relevant snippet pasted below:
> {code}
> public long cleanUpAllAllocatedMemory() {
>   synchronized (this) {
>     Arrays.fill(pageTable, null);
>     ...
>   }
>
>   for (MemoryBlock page : pageTable) {
>     if (page != null) {
>       memoryManager.tungstenMemoryAllocator().free(page);
>     }
>   }
>   Arrays.fill(pageTable, null);
>   ...
> {code}
[jira] [Resolved] (SPARK-15045) Remove dead code in TaskMemoryManager.cleanUpAllAllocatedMemory for pageTable
[ https://issues.apache.org/jira/browse/SPARK-15045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Davies Liu resolved SPARK-15045.
--------------------------------
       Resolution: Fixed
    Fix Version/s: 2.0.0

Issue resolved by pull request 12829
[https://github.com/apache/spark/pull/12829]

> Remove dead code in TaskMemoryManager.cleanUpAllAllocatedMemory for pageTable
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-15045
>                 URL: https://issues.apache.org/jira/browse/SPARK-15045
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 2.0.0
>            Reporter: Jacek Laskowski
>             Fix For: 2.0.0
>
> Unless my eyes trick me, {{TaskMemoryManager}} first clears up {{pageTable}}
> in a synchronized block, and right after the block it does it again. I think
> the outside cleaning is dead code.
> See
> https://github.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/memory/TaskMemoryManager.java#L382-L397
> with the relevant snippet pasted below:
> {code}
> public long cleanUpAllAllocatedMemory() {
>   synchronized (this) {
>     Arrays.fill(pageTable, null);
>     ...
>   }
>
>   for (MemoryBlock page : pageTable) {
>     if (page != null) {
>       memoryManager.tungstenMemoryAllocator().free(page);
>     }
>   }
>   Arrays.fill(pageTable, null);
>   ...
> {code}
[jira] [Created] (SPARK-15142) Spark Mesos dispatcher becomes unusable when the Mesos master restarts
Devaraj K created SPARK-15142:
---------------------------------

             Summary: Spark Mesos dispatcher becomes unusable when the Mesos master restarts
                 Key: SPARK-15142
                 URL: https://issues.apache.org/jira/browse/SPARK-15142
             Project: Spark
          Issue Type: Bug
          Components: Deploy, Mesos
            Reporter: Devaraj K
            Priority: Minor

If the Mesos master gets restarted while the Spark Mesos dispatcher is
running, the dispatcher keeps running but queues up all subsequently
submitted applications without ever launching them.
[jira] [Resolved] (SPARK-15132) Debug log for generated code should be printed with proper indentation
[ https://issues.apache.org/jira/browse/SPARK-15132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin resolved SPARK-15132.
---------------------------------
       Resolution: Fixed
         Assignee: Kousuke Saruta
    Fix Version/s: 2.0.0

> Debug log for generated code should be printed with proper indentation
> ------------------------------------------------------------------------
>
>                 Key: SPARK-15132
>                 URL: https://issues.apache.org/jira/browse/SPARK-15132
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Kousuke Saruta
>            Assignee: Kousuke Saruta
>            Priority: Trivial
>             Fix For: 2.0.0
>
> Similar to SPARK-14185, GenerateOrdering and GenerateColumnAccessor should
> print the debug log for generated code with proper indentation.
[jira] [Assigned] (SPARK-15141) Add python example for OneVsRest
[ https://issues.apache.org/jira/browse/SPARK-15141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-15141:
------------------------------------

    Assignee: (was: Apache Spark)

> Add python example for OneVsRest
> ---------------------------------
>
>                 Key: SPARK-15141
>                 URL: https://issues.apache.org/jira/browse/SPARK-15141
>             Project: Spark
>          Issue Type: Documentation
>          Components: Documentation
>            Reporter: zhengruifeng
>
> Add the missing Python example for OvR.
[jira] [Assigned] (SPARK-15141) Add python example for OneVsRest
[ https://issues.apache.org/jira/browse/SPARK-15141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-15141:
------------------------------------

    Assignee: Apache Spark

> Add python example for OneVsRest
> ---------------------------------
>
>                 Key: SPARK-15141
>                 URL: https://issues.apache.org/jira/browse/SPARK-15141
>             Project: Spark
>          Issue Type: Documentation
>          Components: Documentation
>            Reporter: zhengruifeng
>            Assignee: Apache Spark
>
> Add the missing Python example for OvR.
[jira] [Commented] (SPARK-15141) Add python example for OneVsRest
[ https://issues.apache.org/jira/browse/SPARK-15141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15271891#comment-15271891 ]

Apache Spark commented on SPARK-15141:
--------------------------------------

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/12920

> Add python example for OneVsRest
> ---------------------------------
>
>                 Key: SPARK-15141
>                 URL: https://issues.apache.org/jira/browse/SPARK-15141
>             Project: Spark
>          Issue Type: Documentation
>          Components: Documentation
>            Reporter: zhengruifeng
>
> Add the missing Python example for OvR.
[jira] [Created] (SPARK-15141) Add python example for OneVsRest
zhengruifeng created SPARK-15141:
------------------------------------

             Summary: Add python example for OneVsRest
                 Key: SPARK-15141
                 URL: https://issues.apache.org/jira/browse/SPARK-15141
             Project: Spark
          Issue Type: Documentation
          Components: Documentation
            Reporter: zhengruifeng

Add the missing Python example for OvR.
[jira] [Commented] (SPARK-15140) ensure input object of encoder is not null
[ https://issues.apache.org/jira/browse/SPARK-15140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15271885#comment-15271885 ]

Wenchen Fan commented on SPARK-15140:
-------------------------------------

cc [~marmbrus] [~lian cheng]

> ensure input object of encoder is not null
> -------------------------------------------
>
>                 Key: SPARK-15140
>                 URL: https://issues.apache.org/jira/browse/SPARK-15140
>             Project: Spark
>          Issue Type: Improvement
>            Reporter: Wenchen Fan
>
> Currently we assume the input object for an encoder won't be null, but we
> don't check it. For example, in 1.6 `Seq("a", null).toDS.collect` throws an
> NPE, while in 2.0 it returns Array("a", null).
> We should define this behaviour more clearly.
[jira] [Created] (SPARK-15140) ensure input object of encoder is not null
Wenchen Fan created SPARK-15140:
-----------------------------------

             Summary: ensure input object of encoder is not null
                 Key: SPARK-15140
                 URL: https://issues.apache.org/jira/browse/SPARK-15140
             Project: Spark
          Issue Type: Improvement
            Reporter: Wenchen Fan

Currently we assume the input object for an encoder won't be null, but we
don't check it. For example, in 1.6 `Seq("a", null).toDS.collect` throws an
NPE, while in 2.0 it returns Array("a", null).
We should define this behaviour more clearly.
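[Editor's note] The behavior difference described above, as runnable lines (a SparkSession {{spark}} is assumed for the 2.0 case; the 1.6/2.0 outcomes are those stated in the issue):

{code}
import spark.implicits._

Seq("a", null).toDS().collect()
// Spark 1.6: throws a NullPointerException while encoding the null element.
// Spark 2.0: returns Array("a", null).
{code}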
[jira] [Resolved] (SPARK-15131) StateStore management thread does not stop after SparkContext is shutdown
[ https://issues.apache.org/jira/browse/SPARK-15131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shixiong Zhu resolved SPARK-15131.
----------------------------------
       Resolution: Fixed
    Fix Version/s: 2.0.0

> StateStore management thread does not stop after SparkContext is shutdown
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-15131
>                 URL: https://issues.apache.org/jira/browse/SPARK-15131
>             Project: Spark
>          Issue Type: Bug
>            Reporter: Tathagata Das
>            Assignee: Tathagata Das
>             Fix For: 2.0.0
[jira] [Commented] (SPARK-10713) SPARK_DIST_CLASSPATH ignored on Mesos executors
[ https://issues.apache.org/jira/browse/SPARK-10713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15271875#comment-15271875 ]

Devaraj K commented on SPARK-10713:
-----------------------------------

bq. However, on Mesos, SPARK_DIST_CLASSPATH is missing from executors and jar is not in the classpath. It is present on YARN. Am I missing something? Do you see different behavior?

In my case, I see that the jars/path provided in SPARK_DIST_CLASSPATH are
included in the executors' classpath as well as in the driver's classpath.

> SPARK_DIST_CLASSPATH ignored on Mesos executors
> ------------------------------------------------
>
>                 Key: SPARK-10713
>                 URL: https://issues.apache.org/jira/browse/SPARK-10713
>             Project: Spark
>          Issue Type: Bug
>          Components: Deploy, Mesos
>    Affects Versions: 1.5.0
>            Reporter: Dara Adib
>            Priority: Minor
>
> If I set the environment variable SPARK_DIST_CLASSPATH, the jars are included
> on the driver, but not on Mesos executors. Docs:
> https://spark.apache.org/docs/latest/hadoop-provided.html
>
> I see SPARK_DIST_CLASSPATH mentioned in these files:
> launcher/src/main/java/org/apache/spark/launcher/AbstractCommandBuilder.java
> project/SparkBuild.scala
> yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala
>
> But not the Mesos executor (or should it be included by the launcher
> library?):
> spark/core/src/main/scala/org/apache/spark/executor/Executor.scala
[jira] [Commented] (SPARK-15122) TPC-DS Query 41 fails with The correlated scalar subquery can only contain equality predicates
[ https://issues.apache.org/jira/browse/SPARK-15122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15271874#comment-15271874 ]

Frederick Reiss commented on SPARK-15122:
-----------------------------------------

In the official version of the query, the expression {{i_manufact = i1.i_manufact}} appears twice: once on either side of an {{OR}}. The optimizer needs to normalize the expression enough to factor that subexpression out of the two sides of the disjunction. Also, the error checking code in {{CheckAnalysis.scala}} that triggers the problem needs to trigger *after* that normalization. It looks like that check happens before the call to {{Optimizer.execute}}.

> TPC-DS Query 41 fails with The correlated scalar subquery can only contain equality predicates
> ------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-15122
>                 URL: https://issues.apache.org/jira/browse/SPARK-15122
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: JESSE CHEN
>            Priority: Critical
>
> The official TPC-DS query 41 fails with the following error:
> {noformat}
> Error in query: The correlated scalar subquery can only contain equality
> predicates: (((i_manufact#38 = i_manufact#16) && (i_category#36 = Women)
> && ((i_color#41 = powder) || (i_color#41 = khaki))) && (((i_units#42 = Ounce)
> || (i_units#42 = Oz)) && ((i_size#39 = medium) || (i_size#39 = extra
> large || (((i_category#36 = Women) && ((i_color#41 = brown) ||
> (i_color#41 = honeydew))) && (((i_units#42 = Bunch) || (i_units#42 = Ton)) &&
> ((i_size#39 = N/A) || (i_size#39 = small) || i_category#36 = Men) &&
> ((i_color#41 = floral) || (i_color#41 = deep))) && (((i_units#42 = N/A) ||
> (i_units#42 = Dozen)) && ((i_size#39 = petite) || (i_size#39 = large ||
> (((i_category#36 = Men) && ((i_color#41 = light) || (i_color#41 =
> cornflower))) && (((i_units#42 = Box) || (i_units#42 = Pound)) && ((i_size#39
> = medium) || (i_size#39 = extra large))) || ((i_manufact#38 =
> i_manufact#16) && (i_category#36 = Women) && ((i_color#41 = midnight) ||
> (i_color#41 = snow))) && (((i_units#42 = Pallet) || (i_units#42 = Gross)) &&
> ((i_size#39 = medium) || (i_size#39 = extra large || (((i_category#36 =
> Women) && ((i_color#41 = cyan) || (i_color#41 = papaya))) && (((i_units#42 =
> Cup) || (i_units#42 = Dram)) && ((i_size#39 = N/A) || (i_size#39 = small)
> || i_category#36 = Men) && ((i_color#41 = orange) || (i_color#41 =
> frosted))) && (((i_units#42 = Each) || (i_units#42 = Tbl)) && ((i_size#39 =
> petite) || (i_size#39 = large || (((i_category#36 = Men) && ((i_color#41
> = forest) || (i_color#41 = ghost))) && (((i_units#42 = Lb) || (i_units#42 =
> Bundle)) && ((i_size#39 = medium) || (i_size#39 = extra large;
> {noformat}
> The output plans showed the following errors:
> {noformat}
> == Parsed Logical Plan ==
> 'GlobalLimit 100
> +- 'LocalLimit 100
>    +- 'Sort ['i_product_name ASC], true
>       +- 'Distinct
>          +- 'Project ['i_product_name]
>             +- 'Filter ((('i_manufact_id >= 738) && ('i_manufact_id <= (738 + 40))) && (scalar-subquery#1 [] > 0))
>                :  +- 'SubqueryAlias scalar-subquery#1 []
>                :     +- 'Project ['count(1) AS item_cnt#0]
>                :        +- 'Filter ((('i_manufact = 'i1.i_manufact) &&
>                              ('i_category = Women) && (('i_color = powder) || ('i_color = khaki))) &&
>                              ((('i_units = Ounce) || ('i_units = Oz)) && (('i_size = medium) || ('i_size =
>                              extra large || ((('i_category = Women) && (('i_color = brown) ||
>                              ('i_color = honeydew))) && ((('i_units = Bunch) || ('i_units = Ton)) &&
>                              (('i_size = N/A) || ('i_size = small) || 'i_category = Men) &&
>                              (('i_color = floral) || ('i_color = deep))) && ((('i_units = N/A) ||
>                              ('i_units = Dozen)) && (('i_size = petite) || ('i_size = large ||
>                              ((('i_category = Men) && (('i_color = light) || ('i_color = cornflower))) &&
>                              ((('i_units = Box) || ('i_units = Pound)) && (('i_size = medium) || ('i_size
>                              = extra large))) || (('i_manufact = 'i1.i_manufact) && ('i_category =
>                              Women) && (('i_color = midnight) || ('i_color = snow))) && ((('i_units =
>                              Pallet) || ('i_units = Gross)) && (('i_size = medium) || ('i_size = extra
>                              large || ((('i_category = Women) && (('i_color = cyan) || ('i_color =
>                              papaya))) && ((('i_units = Cup) || ('i_units = Dram)) && (('i_size = N/A) ||
>                              ('i_size = small) || 'i_category = Men) && (('i_color = orange) ||
>                              ('i_color = frosted))) && ((('i_units = Each) || ('i_units = Tbl)) &&
>                              (('i_size = petite) || ('i_size = large || ((('i_category = Men) &&
>                              (('i_color = forest) || ('i_color = ghost))) && ((('i_units = Lb) ||
>                              ('i_units = Bundle)) && (('i_size = medium) || ('i_size = extra large))
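[Editor's note] A toy illustration of the normalization step the comment above describes, not Catalyst's actual rule: factor a conjunct that appears on both sides of a disjunction, i.e. rewrite {{(a AND b) OR (a AND c)}} to {{a AND (b OR c)}} so the correlated equality predicate surfaces at the top level:

{code}
sealed trait Expr
case class Pred(sql: String) extends Expr
case class And(left: Expr, right: Expr) extends Expr
case class Or(left: Expr, right: Expr) extends Expr

// (a && b) || (a && c)  ==>  a && (b || c)
def factorCommonConjunct(e: Expr): Expr = e match {
  case Or(And(a, b), And(c, d)) if a == c => And(a, Or(b, d))
  case other => other
}

val m = Pred("i_manufact = i1.i_manufact")
factorCommonConjunct(Or(And(m, Pred("p1")), And(m, Pred("p2"))))
// => And(Pred(i_manufact = i1.i_manufact), Or(Pred(p1), Pred(p2)))
{code}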
[jira] [Assigned] (SPARK-15092) toDebugString missing from ML DecisionTreeClassifier
[ https://issues.apache.org/jira/browse/SPARK-15092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-15092:
------------------------------------

    Assignee: Apache Spark

> toDebugString missing from ML DecisionTreeClassifier
> ------------------------------------------------------
>
>                 Key: SPARK-15092
>                 URL: https://issues.apache.org/jira/browse/SPARK-15092
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 1.6.0
>         Environment: HDP 2.3.4, Red Hat 6.7
>            Reporter: Ivan SPM
>            Assignee: Apache Spark
>            Priority: Minor
>              Labels: features
>
> The attribute toDebugString is missing from the DecisionTreeClassifier and
> DecisionTreeClassifierModel from ML. The attribute exists on the MLlib
> DecisionTree model. There's no way to check or print the model tree
> structure from ML.
> The basic code for it is this:
> {code}
> from pyspark.ml import Pipeline
> from pyspark.ml.feature import VectorAssembler, StringIndexer
> from pyspark.ml.classification import DecisionTreeClassifier
>
> cl = DecisionTreeClassifier(labelCol='target_idx', featuresCol='features')
> pipe = Pipeline(stages=[target_index, assembler, cl])
> model = pipe.fit(df_train)
>
> # Prediction and model evaluation
> predictions = model.transform(df_test)
> mc_evaluator = MulticlassClassificationEvaluator(
>     labelCol="target_idx", predictionCol="prediction", metricName="precision")
> accuracy = mc_evaluator.evaluate(predictions)
> print("Test Error = {}".format(1.0 - accuracy))
> {code}
> Now it would be great to be able to do what is being done on the MLlib model:
> {code}
> print model.toDebugString(),  # it already has a newline
> DecisionTreeModel classifier of depth 1 with 3 nodes
>   If (feature 0 <= 0.0)
>    Predict: 0.0
>   Else (feature 0 > 0.0)
>    Predict: 1.0
> {code}
> But there's no toDebugString attribute on either the pipeline model or the
> DecisionTreeClassifier model:
> {code}
> cl.toDebugString()
> AttributeError
> {code}
> https://spark.apache.org/docs/1.6.0/api/python/_modules/pyspark/mllib/tree.html
[jira] [Commented] (SPARK-15092) toDebugString missing from ML DecisionTreeClassifier
[ https://issues.apache.org/jira/browse/SPARK-15092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15271873#comment-15271873 ] holdenk commented on SPARK-15092: - Ah yes, it is present in Java, so it is a simple fix. I've created a PR for this as part of our API audit of Python ML for 2.0, so hopefully we can get something in soon. > toDebugString missing from ML DecisionTreeClassifier > > > Key: SPARK-15092 > URL: https://issues.apache.org/jira/browse/SPARK-15092 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 1.6.0 > Environment: HDP 2.3.4, Red Hat 6.7 >Reporter: Ivan SPM >Priority: Minor > Labels: features > > The attribute toDebugString is missing from the DecisionTreeClassifier and > DecisionTreeClassifierModel from ML. The attribute exists on the MLLib > DecisionTree model. > There's no way to check or print the model tree structure from the ML API. > The basic code for it is this: > from pyspark.ml import Pipeline > from pyspark.ml.feature import VectorAssembler, StringIndexer > from pyspark.ml.classification import DecisionTreeClassifier > from pyspark.ml.evaluation import MulticlassClassificationEvaluator > cl = DecisionTreeClassifier(labelCol='target_idx', featuresCol='features') > pipe = Pipeline(stages=[target_index, assembler, cl]) > model = pipe.fit(df_train) > # Prediction and model evaluation > predictions = model.transform(df_test) > mc_evaluator = MulticlassClassificationEvaluator( > labelCol="target_idx", predictionCol="prediction", metricName="precision") > accuracy = mc_evaluator.evaluate(predictions) > print("Test Error = {}".format(1.0 - accuracy)) > Now it would be great to be able to do what is being done on the MLLib model: > print model.toDebugString(), # it already has newline > DecisionTreeModel classifier of depth 1 with 3 nodes > If (feature 0 <= 0.0) >Predict: 0.0 > Else (feature 0 > 0.0) >Predict: 1.0 > but there's no toDebugString attribute on either the pipeline model or the > DecisionTreeClassifier model: > cl.toDebugString() > Attribute Error > https://spark.apache.org/docs/1.6.0/api/python/_modules/pyspark/mllib/tree.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
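For comparison, the Scala API already exposes the tree dump holdenk mentions; a minimal sketch, assuming {{pipelineModel}} is a fitted {{PipelineModel}} whose last stage is the decision tree (as in the snippet above):
{code}
import org.apache.spark.ml.classification.DecisionTreeClassificationModel

// The last pipeline stage is assumed to be the fitted tree model.
val treeModel = pipelineModel.stages.last
  .asInstanceOf[DecisionTreeClassificationModel]
println(treeModel.toDebugString) // prints the If/Else tree structure shown above
{code}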
[jira] [Assigned] (SPARK-15139) PySpark TreeEnsemble missing methods
[ https://issues.apache.org/jira/browse/SPARK-15139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15139: Assignee: (was: Apache Spark) > PySpark TreeEnsemble missing methods > > > Key: SPARK-15139 > URL: https://issues.apache.org/jira/browse/SPARK-15139 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: holdenk >Priority: Minor > > TreeEnsemble class is missing some accessor methods compared to Scala API -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15139) PySpark TreeEnsemble missing methods
[ https://issues.apache.org/jira/browse/SPARK-15139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15271871#comment-15271871 ] Apache Spark commented on SPARK-15139: -- User 'holdenk' has created a pull request for this issue: https://github.com/apache/spark/pull/12919 > PySpark TreeEnsemble missing methods > > > Key: SPARK-15139 > URL: https://issues.apache.org/jira/browse/SPARK-15139 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: holdenk >Priority: Minor > > TreeEnsemble class is missing some accessor methods compared to Scala API -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15092) toDebugString missing from ML DecisionTreeClassifier
[ https://issues.apache.org/jira/browse/SPARK-15092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15271872#comment-15271872 ] Apache Spark commented on SPARK-15092: -- User 'holdenk' has created a pull request for this issue: https://github.com/apache/spark/pull/12919 > toDebugString missing from ML DecisionTreeClassifier > > > Key: SPARK-15092 > URL: https://issues.apache.org/jira/browse/SPARK-15092 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 1.6.0 > Environment: HDP 2.3.4, Red Hat 6.7 >Reporter: Ivan SPM >Priority: Minor > Labels: features > > The attribute toDebugString is missing from the DecisionTreeClassifier and > DecisionTreeClassifierModel from ML. The attribute exists on the MLLib > DecisionTree model. > There's no way to check or print the model tree structure from the ML API. > The basic code for it is this: > from pyspark.ml import Pipeline > from pyspark.ml.feature import VectorAssembler, StringIndexer > from pyspark.ml.classification import DecisionTreeClassifier > from pyspark.ml.evaluation import MulticlassClassificationEvaluator > cl = DecisionTreeClassifier(labelCol='target_idx', featuresCol='features') > pipe = Pipeline(stages=[target_index, assembler, cl]) > model = pipe.fit(df_train) > # Prediction and model evaluation > predictions = model.transform(df_test) > mc_evaluator = MulticlassClassificationEvaluator( > labelCol="target_idx", predictionCol="prediction", metricName="precision") > accuracy = mc_evaluator.evaluate(predictions) > print("Test Error = {}".format(1.0 - accuracy)) > Now it would be great to be able to do what is being done on the MLLib model: > print model.toDebugString(), # it already has newline > DecisionTreeModel classifier of depth 1 with 3 nodes > If (feature 0 <= 0.0) >Predict: 0.0 > Else (feature 0 > 0.0) >Predict: 1.0 > but there's no toDebugString attribute on either the pipeline model or the > DecisionTreeClassifier model: > cl.toDebugString() > Attribute Error > https://spark.apache.org/docs/1.6.0/api/python/_modules/pyspark/mllib/tree.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15139) PySpark TreeEnsemble missing methods
[ https://issues.apache.org/jira/browse/SPARK-15139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15139: Assignee: Apache Spark > PySpark TreeEnsemble missing methods > > > Key: SPARK-15139 > URL: https://issues.apache.org/jira/browse/SPARK-15139 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: holdenk >Assignee: Apache Spark >Priority: Minor > > TreeEnsemble class is missing some accessor methods compared to Scala API -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15092) toDebugString missing from ML DecisionTreeClassifier
[ https://issues.apache.org/jira/browse/SPARK-15092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15092: Assignee: (was: Apache Spark) > toDebugString missing from ML DecisionTreeClassifier > > > Key: SPARK-15092 > URL: https://issues.apache.org/jira/browse/SPARK-15092 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 1.6.0 > Environment: HDP 2.3.4, Red Hat 6.7 >Reporter: Ivan SPM >Priority: Minor > Labels: features > > The attribute toDebugString is missing from the DecisionTreeClassifier and > DecisionTreeClassifierModel from ML. The attribute exists on the MLLib > DecisionTree model. > There's no way to check or print the model tree structure from the ML API. > The basic code for it is this: > from pyspark.ml import Pipeline > from pyspark.ml.feature import VectorAssembler, StringIndexer > from pyspark.ml.classification import DecisionTreeClassifier > from pyspark.ml.evaluation import MulticlassClassificationEvaluator > cl = DecisionTreeClassifier(labelCol='target_idx', featuresCol='features') > pipe = Pipeline(stages=[target_index, assembler, cl]) > model = pipe.fit(df_train) > # Prediction and model evaluation > predictions = model.transform(df_test) > mc_evaluator = MulticlassClassificationEvaluator( > labelCol="target_idx", predictionCol="prediction", metricName="precision") > accuracy = mc_evaluator.evaluate(predictions) > print("Test Error = {}".format(1.0 - accuracy)) > Now it would be great to be able to do what is being done on the MLLib model: > print model.toDebugString(), # it already has newline > DecisionTreeModel classifier of depth 1 with 3 nodes > If (feature 0 <= 0.0) >Predict: 0.0 > Else (feature 0 > 0.0) >Predict: 1.0 > but there's no toDebugString attribute on either the pipeline model or the > DecisionTreeClassifier model: > cl.toDebugString() > Attribute Error > https://spark.apache.org/docs/1.6.0/api/python/_modules/pyspark/mllib/tree.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15139) PySpark TreeEnsemble missing methods
[ https://issues.apache.org/jira/browse/SPARK-15139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15271851#comment-15271851 ] holdenk commented on SPARK-15139: - This is related to SPARK-15092 > PySpark TreeEnsemble missing methods > > > Key: SPARK-15139 > URL: https://issues.apache.org/jira/browse/SPARK-15139 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: holdenk >Priority: Minor > > TreeEnsemble class is missing some accessor methods compared to Scala API -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15139) PySpark TreeEnsemble missing methods
[ https://issues.apache.org/jira/browse/SPARK-15139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] holdenk updated SPARK-15139: Description: TreeEnsemble class is missing some accessor methods compared to Scala API > PySpark TreeEnsemble missing methods > > > Key: SPARK-15139 > URL: https://issues.apache.org/jira/browse/SPARK-15139 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: holdenk >Priority: Minor > > TreeEnsemble class is missing some accessor methods compared to Scala API -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15139) PySpark TreeEnsemble missing methods
holdenk created SPARK-15139: --- Summary: PySpark TreeEnsemble missing methods Key: SPARK-15139 URL: https://issues.apache.org/jira/browse/SPARK-15139 Project: Spark Issue Type: Improvement Components: ML, PySpark Reporter: holdenk Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15137) Linkify ML PyDoc classification
[ https://issues.apache.org/jira/browse/SPARK-15137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15137: Assignee: Apache Spark > Linkify ML PyDoc classification > --- > > Key: SPARK-15137 > URL: https://issues.apache.org/jira/browse/SPARK-15137 > Project: Spark > Issue Type: Sub-task >Reporter: holdenk >Assignee: Apache Spark >Priority: Minor > > PyDoc links in ml are in non-standard format. Switch to standard sphinx link > format for better formatted documentation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15137) Linkify ML PyDoc classification
[ https://issues.apache.org/jira/browse/SPARK-15137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15271839#comment-15271839 ] Apache Spark commented on SPARK-15137: -- User 'holdenk' has created a pull request for this issue: https://github.com/apache/spark/pull/12918 > Linkify ML PyDoc classification > --- > > Key: SPARK-15137 > URL: https://issues.apache.org/jira/browse/SPARK-15137 > Project: Spark > Issue Type: Sub-task >Reporter: holdenk >Priority: Minor > > PyDoc links in ml are in non-standard format. Switch to standard sphinx link > format for better formatted documentation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15137) Linkify ML PyDoc classification
[ https://issues.apache.org/jira/browse/SPARK-15137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15137: Assignee: (was: Apache Spark) > Linkify ML PyDoc classification > --- > > Key: SPARK-15137 > URL: https://issues.apache.org/jira/browse/SPARK-15137 > Project: Spark > Issue Type: Sub-task >Reporter: holdenk >Priority: Minor > > PyDoc links in ml are in non-standard format. Switch to standard sphinx link > format for better formatted documentation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15137) Linkify ML PyDoc classification
[ https://issues.apache.org/jira/browse/SPARK-15137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] holdenk updated SPARK-15137: Description: PyDoc links in ml are in non-standard format. Switch to standard sphinx link format for better formatted documentation. > Linkify ML PyDoc classification > --- > > Key: SPARK-15137 > URL: https://issues.apache.org/jira/browse/SPARK-15137 > Project: Spark > Issue Type: Sub-task >Reporter: holdenk >Priority: Minor > > PyDoc links in ml are in non-standard format. Switch to standard sphinx link > format for better formatted documentation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15101) Audit: ml.clustering and ml.recommendation
[ https://issues.apache.org/jira/browse/SPARK-15101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15271824#comment-15271824 ] zhengruifeng commented on SPARK-15101: -- [~josephkb] The user doc and Scala example for ml.BisectingKMeans are currently missing. I have made a corresponding PR. > Audit: ml.clustering and ml.recommendation > -- > > Key: SPARK-15101 > URL: https://issues.apache.org/jira/browse/SPARK-15101 > Project: Spark > Issue Type: Documentation > Components: Documentation, ML >Reporter: Joseph K. Bradley > > Audit this sub-package for new algorithms which do not have corresponding > sections & examples in the user guide. > See parent issue for more details. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
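For context, a minimal sketch of the kind of Scala example the ticket asks for, assuming {{dataset}} is a DataFrame with a "features" vector column (the actual example lives in the PR):
{code}
import org.apache.spark.ml.clustering.BisectingKMeans

// Fit a bisecting k-means model with two clusters.
val bkm = new BisectingKMeans().setK(2).setSeed(1L)
val model = bkm.fit(dataset)

println(s"Within set sum of squared errors = ${model.computeCost(dataset)}")
model.clusterCenters.foreach(println)
{code}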
[jira] [Created] (SPARK-15138) Linkify ML PyDoc regression
holdenk created SPARK-15138: --- Summary: Linkify ML PyDoc regression Key: SPARK-15138 URL: https://issues.apache.org/jira/browse/SPARK-15138 Project: Spark Issue Type: Sub-task Reporter: holdenk Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15137) Linkify ML PyDoc classification
holdenk created SPARK-15137: --- Summary: Linkify ML PyDoc classification Key: SPARK-15137 URL: https://issues.apache.org/jira/browse/SPARK-15137 Project: Spark Issue Type: Sub-task Reporter: holdenk Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15136) Linkify ML PyDoc
holdenk created SPARK-15136: --- Summary: Linkify ML PyDoc Key: SPARK-15136 URL: https://issues.apache.org/jira/browse/SPARK-15136 Project: Spark Issue Type: Improvement Reporter: holdenk Priority: Minor PyDoc links in ml are in non-standard format. Switch to standard sphinx link format for better formatted documentation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14837) Add support in file stream source for reading new files added to subdirs
[ https://issues.apache.org/jira/browse/SPARK-14837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-14837: -- Target Version/s: 2.0.0 > Add support in file stream source for reading new files added to subdirs > > > Key: SPARK-14837 > URL: https://issues.apache.org/jira/browse/SPARK-14837 > Project: Spark > Issue Type: Sub-task > Components: SQL, Streaming >Reporter: Tathagata Das >Assignee: Tathagata Das > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15131) StateStore management thread does not stop after SparkContext is shutdown
[ https://issues.apache.org/jira/browse/SPARK-15131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-15131: -- Fix Version/s: (was: 2.0.0) > StateStore management thread does not stop after SparkContext is shutdown > - > > Key: SPARK-15131 > URL: https://issues.apache.org/jira/browse/SPARK-15131 > Project: Spark > Issue Type: Bug >Reporter: Tathagata Das >Assignee: Tathagata Das > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15131) StateStore management thread does not stop after SparkContext is shutdown
[ https://issues.apache.org/jira/browse/SPARK-15131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-15131: -- Target Version/s: 2.0.0 > StateStore management thread does not stop after SparkContext is shutdown > - > > Key: SPARK-15131 > URL: https://issues.apache.org/jira/browse/SPARK-15131 > Project: Spark > Issue Type: Bug >Reporter: Tathagata Das >Assignee: Tathagata Das > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14340) Add Scala Example and User DOC for ml.BisectingKMeans
[ https://issues.apache.org/jira/browse/SPARK-14340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-14340: - Summary: Add Scala Example and User DOC for ml.BisectingKMeans (was: Add Scala Example and Description for ml.BisectingKMeans) > Add Scala Example and User DOC for ml.BisectingKMeans > - > > Key: SPARK-14340 > URL: https://issues.apache.org/jira/browse/SPARK-14340 > Project: Spark > Issue Type: Improvement >Reporter: zhengruifeng >Priority: Minor > > 1, add BisectingKMeans to ml-clustering.md > 2, add the missing Scala BisectingKMeansExample -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14993) Inconsistent behavior of partitioning discovery
[ https://issues.apache.org/jira/browse/SPARK-14993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-14993: - Assignee: Xiao Li > Inconsistent behavior of partitioning discovery > --- > > Key: SPARK-14993 > URL: https://issues.apache.org/jira/browse/SPARK-14993 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Assignee: Xiao Li >Priority: Critical > Fix For: 2.0.0 > > > When we load a dataset, if we set the path to {{/path/a=1}}, we will not take > a as the partitioning column. However, if we set the path to > {{/path/a=1/file.parquet}}, we take a as the partitioning column and it shows > up in the schema. We should make the behaviors of these two cases consistent > by not putting a into the schema for the second case. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-14993) Inconsistent behavior of partitioning discovery
[ https://issues.apache.org/jira/browse/SPARK-14993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-14993. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 12828 [https://github.com/apache/spark/pull/12828] > Inconsistent behavior of partitioning discovery > --- > > Key: SPARK-14993 > URL: https://issues.apache.org/jira/browse/SPARK-14993 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Priority: Critical > Fix For: 2.0.0 > > > When we load a dataset, if we set the path to {{/path/a=1}}, we will not take > a as the partitioning column. However, if we set the path to > {{/path/a=1/file.parquet}}, we take a as the partitioning column and it shows > up in the schema. We should make the behaviors of these two cases consistent > by not putting a into the schema for the second case. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
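Restating the inconsistency in code (paths are hypothetical):
{code}
// Directory path: `a` is not treated as a partitioning column.
val df1 = sqlContext.read.parquet("/path/a=1")

// File path: before this fix, `a` showed up in the schema as a partitioning
// column; after the fix, both cases agree and `a` is left out of the schema.
val df2 = sqlContext.read.parquet("/path/a=1/file.parquet")
{code}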
[jira] [Updated] (SPARK-6339) Support creating temporary views with DDL
[ https://issues.apache.org/jira/browse/SPARK-6339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-6339: Assignee: Sean Zhong > Support creating temporary views with DDL > - > > Key: SPARK-6339 > URL: https://issues.apache.org/jira/browse/SPARK-6339 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 1.3.0 >Reporter: Hossein Falaki >Assignee: Sean Zhong > Fix For: 2.0.0 > > > It would be useful to support the following: > {code} > create temporary view counted as > select count(transactions), company from sales group by company > {code} > Right now this is possible through registerTempTable() -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6339) Support creating temporary tables with DDL
[ https://issues.apache.org/jira/browse/SPARK-6339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-6339. - Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 12872 [https://github.com/apache/spark/pull/12872] > Support creating temporary tables with DDL > -- > > Key: SPARK-6339 > URL: https://issues.apache.org/jira/browse/SPARK-6339 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 1.3.0 >Reporter: Hossein Falaki > Fix For: 2.0.0 > > > It would be useful to support the following: > {code} > create temporary table counted as > select count(transactions), company from sales group by company > {code} > Right now this is possible through registerTempTable() -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6339) Support creating temporary views with DDL
[ https://issues.apache.org/jira/browse/SPARK-6339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-6339: Summary: Support creating temporary views with DDL (was: Support creating temporary tables with DDL) > Support creating temporary views with DDL > - > > Key: SPARK-6339 > URL: https://issues.apache.org/jira/browse/SPARK-6339 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 1.3.0 >Reporter: Hossein Falaki > Fix For: 2.0.0 > > > It would be useful to support the following: > {code} > create temporary table counted as > select count(transactions), company from sales group by company > {code} > Right now this is possible through registerTempTable() -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6339) Support creating temporary views with DDL
[ https://issues.apache.org/jira/browse/SPARK-6339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-6339: Description: It would be useful to support the following: {code} create temporary view counted as select count(transactions), company from sales group by company {code} Right now this is possible through registerTempTable() was: It would useful to support following: {code} create temporary table counted as select count(transactions), company from sales group by company {code} Right now this is possible through registerTempTable() > Support creating temporary views with DDL > - > > Key: SPARK-6339 > URL: https://issues.apache.org/jira/browse/SPARK-6339 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 1.3.0 >Reporter: Hossein Falaki > Fix For: 2.0.0 > > > It would be useful to support the following: > {code} > create temporary view counted as > select count(transactions), company from sales group by company > {code} > Right now this is possible through registerTempTable() -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
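For reference, the two routes the ticket contrasts, as a sketch against a Spark shell ({{sales}} is a hypothetical registered table; a SparkSession's {{sql}} works the same way):
{code}
// Old programmatic route via registerTempTable():
val counted = sqlContext.sql(
  "select count(transactions), company from sales group by company")
counted.registerTempTable("counted")

// New DDL route enabled by this change:
sqlContext.sql(
  "create temporary view counted as " +
  "select count(transactions), company from sales group by company")
{code}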
[jira] [Resolved] (SPARK-14896) Deprecate HiveContext in Python
[ https://issues.apache.org/jira/browse/SPARK-14896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-14896. --- Resolution: Fixed Fix Version/s: 2.0.0 > Deprecate HiveContext in Python > --- > > Key: SPARK-14896 > URL: https://issues.apache.org/jira/browse/SPARK-14896 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or >Assignee: Andrew Or >Priority: Critical > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-15112) Dataset filter returns garbage
[ https://issues.apache.org/jira/browse/SPARK-15112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15271025#comment-15271025 ] Cheng Lian edited comment on SPARK-15112 at 5/5/16 12:28 AM: - The following Spark shell session illustrates this issue: {noformat} scala> case class T(a: String, b: Int) defined class T scala> val ds = Seq(1 -> "foo", 2 -> "bar").toDF("b", "a").as[T] ds: org.apache.spark.sql.Dataset[T] = [b: int, a: string] scala> ds.show() +---+---+ | b| a| +---+---+ |foo| 1| |bar| 2| +---+---+ scala> ds.filter(_.b > 1).show() +---+---+ | a| b| +---+---+ | | 3| +---+---+ {noformat} Dataset encoders don't actually require the order of input columns to be exactly the same as their own schema. Essentially, the encoder performs a projection to adjust column order during the analysis phase. This can be quite helpful for data sources that support schema evolution, where the column order of the merged schema may be non-deterministic. The JSON data source falls into this category, and it always sorts all input columns by name. This leads to the following facts, for a Dataset {{ds}}: # {{ds.resolvedTEncoder.schema}} may differ from {{ds.logicalPlan.schema}}, and # {{ds.schema}} should conform to {{ds.resolvedTEncoder.schema}}, and # {{ds.toDF()}} uses a {{RowEncoder}} to convert user space Scala objects to {{InternalRow}} instances, and this {{RowEncoder}} should be initialized using {{ds.logicalPlan.schema}}. Spark 1.6 conforms to the above requirements. For example: {noformat} scala> case class T(a: String, b: Int) defined class T scala> val ds = Seq(1 -> "foo", 2 -> "bar").toDF("b", "a").as[T] ds: org.apache.spark.sql.Dataset[T] = [b: int, a: string] scala> ds.show() +---+---+ | b| a| +---+---+ |foo| 1| |bar| 2| +---+---+ scala> ds.toDF().show() +---+---+ | a| b| +---+---+ | 1|foo| | 2|bar| +---+---+ {noformat} However, while merging the DataFrame/Dataset APIs in Spark 2.0, requirement 2 was broken by accident, and we are using {{ds.logicalPlan.schema}} as {{ds.schema}}, which leads to this bug. Working on a fix for it. was (Author: lian cheng): The following Spark shell session illustrates this issue: {noformat} scala> case class T(a: String, b: Int) defined class T scala> val ds = Seq(1 -> "foo", 2 -> "bar").toDF("b", "a").as[T] ds: org.apache.spark.sql.Dataset[T] = [b: int, a: string] scala> ds.show() +---+---+ | b| a| +---+---+ |foo| 1| |bar| 2| +---+---+ scala> ds.filter(_.b > 1).show() +---+---+ | a| b| +---+---+ | | 3| +---+---+ {noformat} Dataset encoders actually doesn't require the order of input columns to be exactly the same as its own schema. Essentially it performs a projection to adjust column order during analysis phase. This is can be quite helpful for data sources that support schema evolution, where the column order of merged schema may be non-deterministic. The JSON data source falls into this category, and it always sorts all input columns by name. This leads to the following facts, for a Dataset {{ds}}: # {{ds.resolvedTEncoder.schema}} may differ from {{ds.logicalPlan.schema}}, and # {{ds.schema}} should conform to {{ds.resolvedTEncoder.schema}}, and # {{ds.toDF()}} uses a {{RowEncoder}} to convert user space Scala objects to {{InternalRow}}s, and this {{RowEncoder}} should be initialized using {{ds.logicalPlan.schema}}. Spark 1.6 conforms to the above requirements. 
For example: {noformat} scala> case class T(a: String, b: Int) defined class T scala> val ds = Seq(1 -> "foo", 2 -> "bar").toDF("b", "a").as[T] ds: org.apache.spark.sql.Dataset[T] = [b: int, a: string] scala> ds.show() +---+---+ | b| a| +---+---+ |foo| 1| |bar| 2| +---+---+ scala> ds.toDF().show() +---+---+ | a| b| +---+---+ | 1|foo| | 2|bar| +---+---+ {noformat} However, while merging DF/DF API in Spark 2.0, requirement 2 was broken by accident, and we are using {{ds.logicalPlan.schema}} as {{ds.schema}}, which leads to this bug. Working on a fix for it. > Dataset filter returns garbage > -- > > Key: SPARK-15112 > URL: https://issues.apache.org/jira/browse/SPARK-15112 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Reynold Xin >Priority: Blocker > Attachments: demo 1 dataset - Databricks.htm > > > See the following notebook: > https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/6122906529858466/2727501386611535/5382278320999420/latest.html > I think it happens only when using JSON. I'm also going to attach it to the > ticket. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14896) Deprecate HiveContext in Python
[ https://issues.apache.org/jira/browse/SPARK-14896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15271685#comment-15271685 ] Apache Spark commented on SPARK-14896: -- User 'andrewor14' has created a pull request for this issue: https://github.com/apache/spark/pull/12917 > Deprecate HiveContext in Python > --- > > Key: SPARK-14896 > URL: https://issues.apache.org/jira/browse/SPARK-14896 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or >Assignee: Andrew Or >Priority: Critical > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14896) Deprecate HiveContext in Python
[ https://issues.apache.org/jira/browse/SPARK-14896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14896: Assignee: Andrew Or (was: Apache Spark) > Deprecate HiveContext in Python > --- > > Key: SPARK-14896 > URL: https://issues.apache.org/jira/browse/SPARK-14896 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or >Assignee: Andrew Or >Priority: Critical > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14896) Deprecate HiveContext in Python
[ https://issues.apache.org/jira/browse/SPARK-14896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14896: Assignee: Apache Spark (was: Andrew Or) > Deprecate HiveContext in Python > --- > > Key: SPARK-14896 > URL: https://issues.apache.org/jira/browse/SPARK-14896 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or >Assignee: Apache Spark >Priority: Critical > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14897) Upgrade Jetty to latest version of 8/9
[ https://issues.apache.org/jira/browse/SPARK-14897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15271669#comment-15271669 ] Apache Spark commented on SPARK-14897: -- User 'bomeng' has created a pull request for this issue: https://github.com/apache/spark/pull/12916 > Upgrade Jetty to latest version of 8/9 > -- > > Key: SPARK-14897 > URL: https://issues.apache.org/jira/browse/SPARK-14897 > Project: Spark > Issue Type: Improvement >Reporter: Adam Kramer > Labels: web-ui > > It looks like the head/master branch of Spark uses quite an old version of > Jetty: 8.1.14.v20131031 > There have been some announcements of security vulnerabilities, notably in > 2015, and there are versions of both 8 and 9 that address those. We recently > left a web-ui port open and had the server compromised within days. While > this upgrade shouldn't be the only security improvement made, the current > version is clearly vulnerable as-is. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15037) Use SparkSession instead of SQLContext in testsuites
[ https://issues.apache.org/jira/browse/SPARK-15037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-15037: -- Assignee: Sandeep Singh > Use SparkSession instead of SQLContext in testsuites > > > Key: SPARK-15037 > URL: https://issues.apache.org/jira/browse/SPARK-15037 > Project: Spark > Issue Type: Sub-task >Reporter: Dongjoon Hyun >Assignee: Sandeep Singh > > This issue aims to update the existing testsuites to use `SparkSession` > instead of `SQLContext` since `SQLContext` exists just for backward > compatibility. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15045) Remove dead code in TaskMemoryManager.cleanUpAllAllocatedMemory for pageTable
[ https://issues.apache.org/jira/browse/SPARK-15045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-15045: -- Priority: Major (was: Trivial) > Remove dead code in TaskMemoryManager.cleanUpAllAllocatedMemory for pageTable > - > > Key: SPARK-15045 > URL: https://issues.apache.org/jira/browse/SPARK-15045 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Jacek Laskowski > > Unless my eyes trick me, {{TaskMemoryManager}} first clears {{pageTable}} > in a synchronized block and then does it again right after the block. I think > the outside cleanup is dead code. > See > https://github.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/memory/TaskMemoryManager.java#L382-L397 > with the relevant snippet pasted below: > {code} > public long cleanUpAllAllocatedMemory() { > synchronized (this) { > Arrays.fill(pageTable, null); > ... > } > for (MemoryBlock page : pageTable) { > if (page != null) { > memoryManager.tungstenMemoryAllocator().free(page); > } > } > Arrays.fill(pageTable, null); >... > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15135) Make sure SparkSession thread safe
[ https://issues.apache.org/jira/browse/SPARK-15135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15271621#comment-15271621 ] Shixiong Zhu commented on SPARK-15135: -- https://github.com/apache/spark/pull/12915 > Make sure SparkSession thread safe > -- > > Key: SPARK-15135 > URL: https://issues.apache.org/jira/browse/SPARK-15135 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > > Fixed non-thread-safe classes used by SparkSession. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15135) Make sure SparkSession thread safe
[ https://issues.apache.org/jira/browse/SPARK-15135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15271622#comment-15271622 ] Apache Spark commented on SPARK-15135: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/12915 > Make sure SparkSession thread safe > -- > > Key: SPARK-15135 > URL: https://issues.apache.org/jira/browse/SPARK-15135 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > > Fixed non-thread-safe classes used by SparkSession. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15135) Make sure SparkSession thread safe
Shixiong Zhu created SPARK-15135: Summary: Make sure SparkSession thread safe Key: SPARK-15135 URL: https://issues.apache.org/jira/browse/SPARK-15135 Project: Spark Issue Type: Bug Components: SQL Reporter: Shixiong Zhu Assignee: Shixiong Zhu Fixed non-thread-safe classes used by SparkSession. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10713) SPARK_DIST_CLASSPATH ignored on Mesos executors
[ https://issues.apache.org/jira/browse/SPARK-10713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15271613#comment-15271613 ] Dara Adib commented on SPARK-10713: --- [~devaraj.k] Thanks for trying to reproduce. I'm not using the Hadoop-free builds anymore, so I tried testing with a random jar (in this case spark-streaming-kafka-assembly) on Spark 1.6.1. I'm using PySpark but here is a Scala example that seems to work too: {code} // Get classpath, taken from https://gist.github.com/jessitron/8376139. def urlses(cl: ClassLoader): Array[java.net.URL] = cl match { case null => Array() case u: java.net.URLClassLoader => u.getURLs() ++ urlses(cl.getParent) case _ => urlses(cl.getParent) } // driver println(sys.env.get("SPARK_DIST_CLASSPATH")) println(urlses(getClass.getClassLoader).mkString(":")) // executor println(sc.parallelize(Vector(0)).map(_ => sys.env.get("SPARK_DIST_CLASSPATH")).collect()(0)) println(sc.parallelize(Vector(0)).map(_ => urlses(getClass.getClassLoader).mkString(":")).collect()(0)) {code} On both Mesos and YARN, SPARK_DIST_CLASSPATH is defined on the driver and jar is included in classpath. However, on Mesos, SPARK_DIST_CLASSPATH is missing from executors and jar is not in the classpath. It is present on YARN. Am I missing something? Do you see different behavior? > SPARK_DIST_CLASSPATH ignored on Mesos executors > --- > > Key: SPARK-10713 > URL: https://issues.apache.org/jira/browse/SPARK-10713 > Project: Spark > Issue Type: Bug > Components: Deploy, Mesos >Affects Versions: 1.5.0 >Reporter: Dara Adib >Priority: Minor > > If I set the environment variable SPARK_DIST_CLASSPATH, the jars are included > on the driver, but not on Mesos executors. Docs: > https://spark.apache.org/docs/latest/hadoop-provided.html > I see SPARK_DIST_CLASSPATH mentioned in these files: > launcher/src/main/java/org/apache/spark/launcher/AbstractCommandBuilder.java > project/SparkBuild.scala > yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala > But not the Mesos executor (or should it be included by the launcher > library?): > spark/core/src/main/scala/org/apache/spark/executor/Executor.scala -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-10713) SPARK_DIST_CLASSPATH ignored on Mesos executors
[ https://issues.apache.org/jira/browse/SPARK-10713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15271613#comment-15271613 ] Dara Adib edited comment on SPARK-10713 at 5/4/16 11:18 PM: [~devaraj.k] Thanks for trying to reproduce. I'm not using the Hadoop-free builds anymore, so I tried testing with a random jar (in this case spark-streaming-kafka-assembly) on Spark 1.6.1. I'm using PySpark but here is a Scala example that seems to work too: {code} // Get classpath, taken from https://gist.github.com/jessitron/8376139. def urlses(cl: ClassLoader): Array[java.net.URL] = cl match { case null => Array() case u: java.net.URLClassLoader => u.getURLs() ++ urlses(cl.getParent) case _ => urlses(cl.getParent) } // driver println(sys.env.get("SPARK_DIST_CLASSPATH")) println(urlses(getClass.getClassLoader).mkString(":")) // executor println(sc.parallelize(Vector(0)).map(_ => sys.env.get("SPARK_DIST_CLASSPATH")).collect()(0)) println(sc.parallelize(Vector(0)).map(_ => urlses(getClass.getClassLoader).mkString(":")).collect()(0)) {code} On both Mesos and YARN, SPARK_DIST_CLASSPATH is defined on the driver and jar is included in classpath. However, on Mesos, SPARK_DIST_CLASSPATH is missing from executors and jar is not in the classpath. It is present on YARN. Am I missing something? Do you see different behavior? was (Author: daradib): [~devaraj.k] Thanks for trying to reproduce. I'm not using the Hadoop-free builds anymore, so I tried testing with a random jar (in this case spark-streaming-kafka-assembly) on Spark 1.6.1. I'm using PySpark but here is a Scala example that seems to work too: {code} # Get classpath, taken from https://gist.github.com/jessitron/8376139. def urlses(cl: ClassLoader): Array[java.net.URL] = cl match { case null => Array() case u: java.net.URLClassLoader => u.getURLs() ++ urlses(cl.getParent) case _ => urlses(cl.getParent) } # driver println(sys.env.get("SPARK_DIST_CLASSPATH")) println(urlses(getClass.getClassLoader).mkString(":")) # executor println(sc.parallelize(Vector(0)).map(_ => sys.env.get("SPARK_DIST_CLASSPATH")).collect()(0)) println(sc.parallelize(Vector(0)).map(_ => urlses(getClass.getClassLoader).mkString(":")).collect()(0)) {code} On both Mesos and YARN, SPARK_DIST_CLASSPATH is defined on the driver and jar is included in classpath. However, on Mesos, SPARK_DIST_CLASSPATH is missing from executors and jar is not in the classpath. It is present on YARN. Am I missing something? Do you see different behavior? > SPARK_DIST_CLASSPATH ignored on Mesos executors > --- > > Key: SPARK-10713 > URL: https://issues.apache.org/jira/browse/SPARK-10713 > Project: Spark > Issue Type: Bug > Components: Deploy, Mesos >Affects Versions: 1.5.0 >Reporter: Dara Adib >Priority: Minor > > If I set the environment variable SPARK_DIST_CLASSPATH, the jars are included > on the driver, but not on Mesos executors. Docs: > https://spark.apache.org/docs/latest/hadoop-provided.html > I see SPARK_DIST_CLASSPATH mentioned in these files: > launcher/src/main/java/org/apache/spark/launcher/AbstractCommandBuilder.java > project/SparkBuild.scala > yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala > But not the Mesos executor (or should it be included by the launcher > library?): > spark/core/src/main/scala/org/apache/spark/executor/Executor.scala -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15130) PySpark shared params should include default values to match Scala
[ https://issues.apache.org/jira/browse/SPARK-15130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] holdenk updated SPARK-15130: Summary: PySpark shared params should include default values to match Scala (was: PySpark decision tree params should include default values to match Scala) > PySpark shared params should include default values to match Scala > -- > > Key: SPARK-15130 > URL: https://issues.apache.org/jira/browse/SPARK-15130 > Project: Spark > Issue Type: Improvement > Components: Documentation, ML, PySpark >Reporter: holdenk >Priority: Minor > > As part of checking the documentation in SPARK-14813, PySpark decision tree > params do not include the default values (unlike the Scala ones). While the > existing Scala default values will have been used, this information is likely > worth exposing in the docs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15130) PySpark decision tree params should include default values to match Scala
[ https://issues.apache.org/jira/browse/SPARK-15130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15130: Assignee: (was: Apache Spark) > PySpark decision tree params should include default values to match Scala > - > > Key: SPARK-15130 > URL: https://issues.apache.org/jira/browse/SPARK-15130 > Project: Spark > Issue Type: Improvement > Components: Documentation, ML, PySpark >Reporter: holdenk >Priority: Minor > > As part of checking the documentation in SPARK-14813, PySpark decision tree > params do not include the default values (unlike the Scala ones). While the > existing Scala default values will have been used, this information is likely > worth exposing in the docs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15130) PySpark decision tree params should include default values to match Scala
[ https://issues.apache.org/jira/browse/SPARK-15130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15130: Assignee: Apache Spark > PySpark decision tree params should include default values to match Scala > - > > Key: SPARK-15130 > URL: https://issues.apache.org/jira/browse/SPARK-15130 > Project: Spark > Issue Type: Improvement > Components: Documentation, ML, PySpark >Reporter: holdenk >Assignee: Apache Spark >Priority: Minor > > As part of checking the documentation in SPARK-14813, PySpark decision tree > params do not include the default values (unlike the Scala ones). While the > existing Scala default values will have been used, this information is likely > worth exposing in the docs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15130) PySpark decision tree params should include default values to match Scala
[ https://issues.apache.org/jira/browse/SPARK-15130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15271611#comment-15271611 ] Apache Spark commented on SPARK-15130: -- User 'holdenk' has created a pull request for this issue: https://github.com/apache/spark/pull/12914 > PySpark decision tree params should include default values to match Scala > - > > Key: SPARK-15130 > URL: https://issues.apache.org/jira/browse/SPARK-15130 > Project: Spark > Issue Type: Improvement > Components: Documentation, ML, PySpark >Reporter: holdenk >Priority: Minor > > As part of checking the documentation in SPARK-14813, PySpark decision tree > params do not include the default values (unlike the Scala ones). While the > existing Scala default values will have been used, this information is likely > worth exposing in the docs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15130) PySpark decision tree params should include default values to match Scala
[ https://issues.apache.org/jira/browse/SPARK-15130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15271585#comment-15271585 ] Holden Karau commented on SPARK-15130: -- I mean that the pydocs should include what the default value is. I'm working on a PR for this; I'll cc you when it's up. -- Cell : 425-233-8271 Twitter: https://twitter.com/holdenkarau > PySpark decision tree params should include default values to match Scala > - > > Key: SPARK-15130 > URL: https://issues.apache.org/jira/browse/SPARK-15130 > Project: Spark > Issue Type: Improvement > Components: Documentation, ML, PySpark >Reporter: holdenk >Priority: Minor > > As part of checking the documentation in SPARK-14813, PySpark decision tree > params do not include the default values (unlike the Scala ones). While the > existing Scala default values will have been used, this information is likely > worth exposing in the docs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15112) Dataset filter returns garbage
[ https://issues.apache.org/jira/browse/SPARK-15112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15271576#comment-15271576 ] Suresh Thalamati commented on SPARK-15112: -- I ran into similar issue SPARK-14218 > Dataset filter returns garbage > -- > > Key: SPARK-15112 > URL: https://issues.apache.org/jira/browse/SPARK-15112 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Reynold Xin >Priority: Blocker > Attachments: demo 1 dataset - Databricks.htm > > > See the following notebook: > https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/6122906529858466/2727501386611535/5382278320999420/latest.html > I think it happens only when using JSON. I'm also going to attach it to the > ticket. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-928) Add support for Unsafe-based serializer in Kryo 2.22
[ https://issues.apache.org/jira/browse/SPARK-928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15265805#comment-15265805 ] Sandeep Singh edited comment on SPARK-928 at 5/4/16 10:43 PM: -- [~joshrosen] I would like to work on this. I tried benchmarking the difference between unsafe Kryo and our current implementation; we can then add a spark.kryo.useUnsafe flag, as Matei mentioned. {code:title=Benchmarking results|borderStyle=solid} Java HotSpot(TM) 64-Bit Server VM 1.8.0_60-b27 on Mac OS X 10.11.4 Intel(R) Core(TM) i7-4770HQ CPU @ 2.20GHz Benchmark Kryo Unsafe vs safe Serialization: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative basicTypes: Int unsafe:false 2 /4 8988.0 0.1 1.0X basicTypes: Long unsafe:false1 /1 13981.3 0.1 1.6X basicTypes: Float unsafe:false 1 /1 14460.6 0.1 1.6X basicTypes: Double unsafe:false 1 /1 15876.9 0.1 1.8X Array: Int unsafe:false 33 / 44474.8 2.1 0.1X Array: Long unsafe:false18 / 25888.6 1.1 0.1X Array: Float unsafe:false 10 / 16 1627.4 0.6 0.2X Array: Double unsafe:false 10 / 13 1523.1 0.7 0.2X Map of string->Double unsafe:false 413 / 447 38.1 26.3 0.0X basicTypes: Int unsafe:true 1 /1 16402.6 0.1 1.8X basicTypes: Long unsafe:true 1 /1 19732.1 0.1 2.2X basicTypes: Float unsafe:true1 /1 19752.9 0.1 2.2X basicTypes: Double unsafe:true 1 /1 23111.4 0.0 2.6X Array: Int unsafe:true 7 /8 2239.9 0.4 0.2X Array: Long unsafe:true 8 /9 2000.1 0.5 0.2X Array: Float unsafe:true 7 /8 2191.5 0.5 0.2X Array: Double unsafe:true9 / 10 1841.2 0.5 0.2X Map of string->Double unsafe:true 387 / 407 40.7 24.6 0.0X {code} You can find the code for benchmarking here (https://github.com/techaddict/spark/commit/46fa44141c849ca15bbd6136cea2fa52bd927da2); it's very ugly right now but I will improve it (add more benchmarks) before creating a PR. was (Author: techaddict): [~joshrosen] I would like to work on this. I tried benchmarking the difference between unsafe kryo and our current impl. and then we can have a spark.kryo.useUnsafe flag as Matei has mentioned. {code:title=Without Kryo UnSafe|borderStyle=solid} Java HotSpot(TM) 64-Bit Server VM 1.8.0_60-b27 on Mac OS X 10.11.4 Intel(R) Core(TM) i7-4770HQ CPU @ 2.20GHz Serialize and then deserialize: Best/Avg Time(ms)Rate(M/s) Per Row(ns) Relative --- primitive:Long 1 /4 11223.1 0.1 1.0X primitive:Double1 /1 19409.0 0.1 1.7X Array:Long 38 / 49412.4 2.4 0.0X Array:Double 25 / 35631.4 1.6 0.1X Map of string->Double2651 / 2766 5.9 168.6 0.0X {code} {code:title=With Kryo UnSafe|borderStyle=solid} Java HotSpot(TM) 64-Bit Server VM 1.8.0_60-b27 on Mac OS X 10.11.4 Intel(R) Core(TM) i7-4770HQ CPU @ 2.20GHz Serialize and then deserialize: Best/Avg Time(ms)Rate(M/s) Per Row(ns) Relative --- primitive:Long 1 /3 15872.0 0.1 1.0X primitive:Double1 /1 17769.7 0.1 1.1X Array:Long 24 / 42642.3 1.6 0.0X Array:Double 22 / 26719.4 1.4 0.0X Map of string->Double2560 / 2582 6.1 162.8 0.0X {code} You can find the code for benchmarking here (https://github.com/techaddict/spark/commit/46fa44141c849ca15bbd6136cea2fa52bd927da2), very ugly right now but will improve it(add more benchmarks) before creating a PR.
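At the Kryo level, the unsafe variant under benchmark swaps the stream classes; a minimal round-trip sketch, assuming Kryo 3.x where {{UnsafeOutput}}/{{UnsafeInput}} are available (this is an illustration, not Spark's actual integration):
{code}
import com.esotericsoftware.kryo.Kryo
import com.esotericsoftware.kryo.io.{UnsafeInput, UnsafeOutput}

val kryo = new Kryo()

// Unsafe streams copy primitives in bulk via sun.misc.Unsafe, which is
// where the speedup on primitive-heavy data comes from.
val out = new UnsafeOutput(4096)
kryo.writeObject(out, Array.fill(1024)(42L))
out.flush()

val in = new UnsafeInput(out.toBytes)
val restored = kryo.readObject(in, classOf[Array[Long]])
println(restored.length) // 1024
{code}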
[jira] [Assigned] (SPARK-928) Add support for Unsafe-based serializer in Kryo 2.22
[ https://issues.apache.org/jira/browse/SPARK-928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-928: -- Assignee: Apache Spark > Add support for Unsafe-based serializer in Kryo 2.22 > > > Key: SPARK-928 > URL: https://issues.apache.org/jira/browse/SPARK-928 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Matei Zaharia >Assignee: Apache Spark >Priority: Minor > Labels: starter > > This can reportedly be quite a bit faster, but it also requires Chill to > update its Kryo dependency. Once that happens we should add a > spark.kryo.useUnsafe flag. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-928) Add support for Unsafe-based serializer in Kryo 2.22
[ https://issues.apache.org/jira/browse/SPARK-928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15271570#comment-15271570 ] Apache Spark commented on SPARK-928: User 'techaddict' has created a pull request for this issue: https://github.com/apache/spark/pull/12913 > Add support for Unsafe-based serializer in Kryo 2.22 > > > Key: SPARK-928 > URL: https://issues.apache.org/jira/browse/SPARK-928 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Matei Zaharia >Priority: Minor > Labels: starter > > This can reportedly be quite a bit faster, but it also requires Chill to > update its Kryo dependency. Once that happens we should add a > spark.kryo.useUnsafe flag. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-928) Add support for Unsafe-based serializer in Kryo 2.22
[ https://issues.apache.org/jira/browse/SPARK-928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-928: -- Assignee: (was: Apache Spark) > Add support for Unsafe-based serializer in Kryo 2.22 > > > Key: SPARK-928 > URL: https://issues.apache.org/jira/browse/SPARK-928 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Matei Zaharia >Priority: Minor > Labels: starter > > This can reportedly be quite a bit faster, but it also requires Chill to > update its Kryo dependency. Once that happens we should add a > spark.kryo.useUnsafe flag. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15134) Indent SparkSession builder patterns and update binary_classification_metrics_example.py
[ https://issues.apache.org/jira/browse/SPARK-15134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15134: Assignee: Apache Spark > Indent SparkSession builder patterns and update > binary_classification_metrics_example.py > > > Key: SPARK-15134 > URL: https://issues.apache.org/jira/browse/SPARK-15134 > Project: Spark > Issue Type: Task > Components: Examples >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Minor > > This issue addresses the comments in SPARK-15031 and also fixes java-linter > errors. > - Use multiline format in SparkSession builder patterns. > - Update `binary_classification_metrics_example.py` to use `SparkSession`. > - Fix Java Linter errors (in SPARK-13745, SPARK-15031, and so far) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15134) Indent SparkSession builder patterns and update binary_classification_metrics_example.py
[ https://issues.apache.org/jira/browse/SPARK-15134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15271541#comment-15271541 ] Apache Spark commented on SPARK-15134: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/12911 > Indent SparkSession builder patterns and update > binary_classification_metrics_example.py > > > Key: SPARK-15134 > URL: https://issues.apache.org/jira/browse/SPARK-15134 > Project: Spark > Issue Type: Task > Components: Examples >Reporter: Dongjoon Hyun >Priority: Minor > > This issue addresses the comments in SPARK-15031 and also fixes java-linter > errors. > - Use multiline format in SparkSession builder patterns. > - Update `binary_classification_metrics_example.py` to use `SparkSession`. > - Fix Java Linter errors (in SPARK-13745, SPARK-15031, and so far) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15134) Indent SparkSession builder patterns and update binary_classification_metrics_example.py
[ https://issues.apache.org/jira/browse/SPARK-15134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15134: Assignee: (was: Apache Spark) > Indent SparkSession builder patterns and update > binary_classification_metrics_example.py > > > Key: SPARK-15134 > URL: https://issues.apache.org/jira/browse/SPARK-15134 > Project: Spark > Issue Type: Task > Components: Examples >Reporter: Dongjoon Hyun >Priority: Minor > > This issue addresses the comments in SPARK-15031 and also fixes java-linter > errors. > - Use multiline format in SparkSession builder patterns. > - Update `binary_classification_metrics_example.py` to use `SparkSession`. > - Fix Java Linter errors (in SPARK-13745, SPARK-15031, and so far) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15134) Indent SparkSession builder patterns and update binary_classification_metrics_example.py
[ https://issues.apache.org/jira/browse/SPARK-15134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-15134: -- Description: This issue addresses the comments in SPARK-15031 and also fixes java-linter errors. - Use multiline format in SparkSession builder patterns. - Update `binary_classification_metrics_example.py` to use `SparkSession`. - Fix Java Linter errors (in SPARK-13745, SPARK-15031, and so far) was: This issue addresses the comments in SPARK-15031 and also fixes java-linter errors. - Use multiline format in SparkSession builder patterns. - Update `binary_classification_metrics_example.py` to use `SparkSession`. - Fix Java Linter errors (in SPARK-13745 and so far) > Indent SparkSession builder patterns and update > binary_classification_metrics_example.py > > > Key: SPARK-15134 > URL: https://issues.apache.org/jira/browse/SPARK-15134 > Project: Spark > Issue Type: Task > Components: Examples >Reporter: Dongjoon Hyun >Priority: Minor > > This issue addresses the comments in SPARK-15031 and also fixes java-linter > errors. > - Use multiline format in SparkSession builder patterns. > - Update `binary_classification_metrics_example.py` to use `SparkSession`. > - Fix Java Linter errors (in SPARK-13745, SPARK-15031, and so far) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15134) Indent SparkSession builder patterns and update binary_classification_metrics_example.py
Dongjoon Hyun created SPARK-15134: - Summary: Indent SparkSession builder patterns and update binary_classification_metrics_example.py Key: SPARK-15134 URL: https://issues.apache.org/jira/browse/SPARK-15134 Project: Spark Issue Type: Task Components: Examples Reporter: Dongjoon Hyun Priority: Minor This issue addresses the comments in SPARK-15031 and also fixes java-linter errors. - Use multiline format in SparkSession builder patterns. - Update `binary_classification_metrics_example.py` to use `SparkSession`. - Fix Java Linter errors (in SPARK-13745 and so far) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
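For readers unfamiliar with the convention the SPARK-15134 items above refer to: the "multiline format" simply puts one chained builder call per line. A small illustrative sketch (object name, master URL, and app name are placeholders; the Python examples in the linked PR follow the same shape):
{code}
import org.apache.spark.sql.SparkSession

object BuilderIndentationSketch {
  def main(args: Array[String]): Unit = {
    // Multiline builder pattern: one chained call per line, indented
    // under the receiver. (The master URL is normally supplied by
    // spark-submit; it is set here only so the sketch runs standalone.)
    val spark = SparkSession
      .builder()
      .master("local[2]")
      .appName("BinaryClassificationMetricsExample")
      .getOrCreate()

    spark.stop()
  }
}
{code}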
[jira] [Commented] (SPARK-15130) PySpark decision tree params should include default values to match Scala
[ https://issues.apache.org/jira/browse/SPARK-15130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15271531#comment-15271531 ] Xin Ren commented on SPARK-15130: - Hi, I just found that in PySpark's DecisionTreeClassifier class there is a setParams method which roughly matches the Scala one. Do you mean we should create a separate "Param" class?
{code}
@keyword_only
@since("1.4.0")
def setParams(self, featuresCol="features", labelCol="label", predictionCol="prediction",
              probabilityCol="probability", rawPredictionCol="rawPrediction",
              maxDepth=5, maxBins=32, minInstancesPerNode=1, minInfoGain=0.0,
              maxMemoryInMB=256, cacheNodeIds=False, checkpointInterval=10, impurity="gini",
              seed=None):
    """
    setParams(self, featuresCol="features", labelCol="label", predictionCol="prediction", \
              probabilityCol="probability", rawPredictionCol="rawPrediction", \
              maxDepth=5, maxBins=32, minInstancesPerNode=1, minInfoGain=0.0, \
              maxMemoryInMB=256, cacheNodeIds=False, checkpointInterval=10, impurity="gini", \
              seed=None)
    Sets params for the DecisionTreeClassifier.
    """
    kwargs = self.setParams._input_kwargs
    return self._set(**kwargs)
{code}
> PySpark decision tree params should include default values to match Scala > - > > Key: SPARK-15130 > URL: https://issues.apache.org/jira/browse/SPARK-15130 > Project: Spark > Issue Type: Improvement > Components: Documentation, ML, PySpark >Reporter: holdenk >Priority: Minor > > As found while checking the documentation in SPARK-14813, PySpark decision tree > params do not include the default values (unlike the Scala ones). While the > existing Scala default values will still be used, this information is likely > worth exposing in the docs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
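For contrast with the PySpark snippet above: on the Scala side the defaults are registered once via setDefault, which is why the Scala docs can show them. Below is a condensed, hypothetical stand-in (the class name and trimmed parameter list are illustrative; the real code lives in spark.ml's DecisionTreeParams trait):
{code}
import org.apache.spark.ml.param.{IntParam, ParamMap, Params}
import org.apache.spark.ml.util.Identifiable

// Trimmed-down sketch of where the Scala-side defaults live.
class DecisionTreeParamsSketch(override val uid: String) extends Params {
  def this() = this(Identifiable.randomUID("dtSketch"))

  val maxDepth = new IntParam(this, "maxDepth", "Maximum depth of the tree. (>= 0)")
  val maxBins = new IntParam(this, "maxBins", "Max number of bins for discretizing continuous features.")

  // setDefault is what makes explainParams() print lines such as
  // "maxDepth: Maximum depth of the tree. (>= 0) (default: 5)" --
  // the piece of information the PySpark doc strings leave out.
  setDefault(maxDepth -> 5, maxBins -> 32)

  override def copy(extra: ParamMap): DecisionTreeParamsSketch = defaultCopy(extra)
}

object DecisionTreeParamsSketch {
  def main(args: Array[String]): Unit = {
    println(new DecisionTreeParamsSketch().explainParams())
  }
}
{code}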
[jira] [Assigned] (SPARK-13670) spark-class doesn't bubble up error from launcher command
[ https://issues.apache.org/jira/browse/SPARK-13670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13670: Assignee: Apache Spark > spark-class doesn't bubble up error from launcher command > - > > Key: SPARK-13670 > URL: https://issues.apache.org/jira/browse/SPARK-13670 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 2.0.0 >Reporter: Mark Grover >Assignee: Apache Spark >Priority: Minor > > There's a particular snippet in spark-class [here|https://github.com/apache/spark/blob/master/bin/spark-class#L86] that > runs the spark-launcher code in a subshell.
{code}
# The launcher library will print arguments separated by a NULL character, to allow arguments with
# characters that would be otherwise interpreted by the shell. Read that in a while loop, populating
# an array that will be used to exec the final command.
CMD=()
while IFS= read -d '' -r ARG; do
  CMD+=("$ARG")
done < <("$RUNNER" -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@")
{code}
> The problem is that if the launcher Main fails, this code still returns success and continues, even though the top-level script is marked {{set -e}}. This is because launcher.Main runs within a process substitution (a subshell), whose exit status the parent shell never sees. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
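Since the failure is swallowed by the process substitution, one remedy, sketched below in the same shell idiom as the snippet above (illustrative, not necessarily the exact patch that landed; $RUNNER and $LAUNCH_CLASSPATH are reused from spark-class), is to have the launcher invocation print its own exit status as a final NUL-separated token and check it in the parent:
{code}
# Emit the launcher's exit code as the last NUL-separated token,
# so the parent shell can inspect it after the process substitution.
build_command() {
  "$RUNNER" -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@"
  printf "%d\0" $?
}

CMD=()
while IFS= read -d '' -r ARG; do
  CMD+=("$ARG")
done < <(build_command "$@")

# The last array element is now the exit code; fail fast if it is non-zero.
COUNT=${#CMD[@]}
LAST=$((COUNT - 1))
LAUNCHER_EXIT_CODE=${CMD[$LAST]}
if [ "$LAUNCHER_EXIT_CODE" != 0 ]; then
  exit "$LAUNCHER_EXIT_CODE"
fi

# Strip the exit-code token and exec the real command.
CMD=("${CMD[@]:0:$LAST}")
exec "${CMD[@]}"
{code}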