[jira] [Commented] (SPARK-8489) Add regression tests for SPARK-8470
[ https://issues.apache.org/jira/browse/SPARK-8489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15271974#comment-15271974 ]

Apache Spark commented on SPARK-8489:
-------------------------------------

User 'dilipbiswal' has created a pull request for this issue:
https://github.com/apache/spark/pull/12924

> Add regression tests for SPARK-8470
> -----------------------------------
>
>                 Key: SPARK-8489
>                 URL: https://issues.apache.org/jira/browse/SPARK-8489
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL, Tests
>    Affects Versions: 1.4.0
>            Reporter: Andrew Or
>            Assignee: Andrew Or
>            Priority: Critical
>             Fix For: 1.4.1, 1.5.0
>
> See SPARK-8470 for more detail. Basically the Spark Hive code silently
> overwrites the context class loader populated in SparkSubmit, resulting in
> certain classes missing when we do reflection in `SQLContext#createDataFrame`.
> That issue is already resolved in https://github.com/apache/spark/pull/6891,
> but we should add a regression test for the specific manifestation of the bug
> in SPARK-8470.
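[Editor's note] The shape of such a regression test might look like the following minimal sketch. This is an assumption about the test's structure, not the contents of the actual PR; it presumes an existing SparkContext {{sc}} with Hive support on the classpath:

{code}
// The context class loader installed by SparkSubmit must survive Hive initialization.
val originalLoader = Thread.currentThread().getContextClassLoader

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
hiveContext.sql("SHOW TABLES").collect()

// SPARK-8470: the Hive code path used to silently replace this loader,
// breaking later reflection in SQLContext#createDataFrame.
assert(Thread.currentThread().getContextClassLoader eq originalLoader)
{code}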
[jira] [Commented] (SPARK-14893) Re-enable HiveSparkSubmitSuite SPARK-8489 test after HiveContext is removed
[ https://issues.apache.org/jira/browse/SPARK-14893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15271973#comment-15271973 ]

Apache Spark commented on SPARK-14893:
--------------------------------------

User 'dilipbiswal' has created a pull request for this issue:
https://github.com/apache/spark/pull/12924

> Re-enable HiveSparkSubmitSuite SPARK-8489 test after HiveContext is removed
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-14893
>                 URL: https://issues.apache.org/jira/browse/SPARK-14893
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL, Tests
>    Affects Versions: 2.0.0
>            Reporter: Andrew Or
>
> The test was disabled in https://github.com/apache/spark/pull/12585.
> To re-enable it we need to rebuild the jar using the updated source code.
[jira] [Assigned] (SPARK-14893) Re-enable HiveSparkSubmitSuite SPARK-8489 test after HiveContext is removed
[ https://issues.apache.org/jira/browse/SPARK-14893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-14893:
------------------------------------

    Assignee: Apache Spark

> Re-enable HiveSparkSubmitSuite SPARK-8489 test after HiveContext is removed
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-14893
>                 URL: https://issues.apache.org/jira/browse/SPARK-14893
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL, Tests
>    Affects Versions: 2.0.0
>            Reporter: Andrew Or
>            Assignee: Apache Spark
>
> The test was disabled in https://github.com/apache/spark/pull/12585.
> To re-enable it we need to rebuild the jar using the updated source code.
[jira] [Assigned] (SPARK-14893) Re-enable HiveSparkSubmitSuite SPARK-8489 test after HiveContext is removed
[ https://issues.apache.org/jira/browse/SPARK-14893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-14893:
------------------------------------

    Assignee: (was: Apache Spark)

> Re-enable HiveSparkSubmitSuite SPARK-8489 test after HiveContext is removed
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-14893
>                 URL: https://issues.apache.org/jira/browse/SPARK-14893
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL, Tests
>    Affects Versions: 2.0.0
>            Reporter: Andrew Or
>
> The test was disabled in https://github.com/apache/spark/pull/12585.
> To re-enable it we need to rebuild the jar using the updated source code.
[jira] [Commented] (SPARK-15114) Column name generated by typed aggregate is super verbose
[ https://issues.apache.org/jira/browse/SPARK-15114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15271972#comment-15271972 ]

Dilip Biswal commented on SPARK-15114:
--------------------------------------

[~yhuai] Sure Yin. I will give it a try.

> Column name generated by typed aggregate is super verbose
> ----------------------------------------------------------
>
>                 Key: SPARK-15114
>                 URL: https://issues.apache.org/jira/browse/SPARK-15114
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>            Reporter: Yin Huai
>            Priority: Critical
>
> {code}
> case class Person(name: String, email: String, age: Long)
> val ds = spark.read.json("/tmp/person.json").as[Person]
> import org.apache.spark.sql.expressions.scala.typed._
>
> ds.groupByKey(_ => 0).agg(sum(_.age))
> // org.apache.spark.sql.Dataset[(Int, Double)] = [value: int,
> // typedsumdouble(unresolveddeserializer(newInstance(class Person), age#0L,
> // email#1, name#2), upcast(value)): double]
>
> ds.groupByKey(_ => 0).agg(sum(_.age)).explain
> == Physical Plan ==
> WholeStageCodegen
> :  +- TungstenAggregate(key=[value#84],
>       functions=[(TypedSumDouble($line15.$read$$iw$$iw$Person),mode=Final,isDistinct=false)],
>       output=[value#84,typedsumdouble(unresolveddeserializer(newInstance(class
>       $line15.$read$$iw$$iw$Person), age#0L, email#1, name#2), upcast(value))#91])
> :     +- INPUT
> +- Exchange hashpartitioning(value#84, 200), None
>    +- WholeStageCodegen
>       :  +- TungstenAggregate(key=[value#84],
>             functions=[(TypedSumDouble($line15.$read$$iw$$iw$Person),mode=Partial,isDistinct=false)],
>             output=[value#84,value#97])
>       :     +- INPUT
>       +- AppendColumns , newInstance(class $line15.$read$$iw$$iw$Person),
>          [input[0, int] AS value#84]
>          +- WholeStageCodegen
>             :  +- Scan HadoopFiles[age#0L,email#1,name#2] Format: JSON,
>                   PushedFilters: [], ReadSchema: struct
> {code}
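[Editor's note] While the generated name is being sorted out, one way to get readable column names today is to rename the aggregate output after the fact. This is a sketch of a possible workaround, not something taken from the JIRA; it reuses {{ds}} from the snippet above:

{code}
import org.apache.spark.sql.expressions.scala.typed._

// toDF on a Dataset[(Int, Double)] assigns new column names positionally,
// replacing both the grouping column name and the verbose aggregate name.
val summed = ds.groupByKey(_ => 0).agg(sum(_.age)).toDF("key", "sum_age")
// summed: org.apache.spark.sql.DataFrame = [key: int, sum_age: double]
{code}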
[jira] [Assigned] (SPARK-15148) Upgrade Univocity library from 2.0.2 to 2.1.0
[ https://issues.apache.org/jira/browse/SPARK-15148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-15148:
------------------------------------

    Assignee: Apache Spark

> Upgrade Univocity library from 2.0.2 to 2.1.0
> ----------------------------------------------
>
>                 Key: SPARK-15148
>                 URL: https://issues.apache.org/jira/browse/SPARK-15148
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Hyukjin Kwon
>            Assignee: Apache Spark
>            Priority: Minor
>
> It looks like a new release of the Univocity CSV library was published,
> https://github.com/uniVocity/univocity-parsers/releases.
> It contains the following improvements:
> {quote}
> 1. Performance improvements for parsing/writing CSV and TSV. CSV writing and
> parsing got 30-40% faster.
> 2. Deprecated the methods setParseUnescapedQuotes and
> setParseUnescapedQuotesUntilDelimiter of class CsvParserSettings in favor of
> the new setUnescapedQuoteHandling method, which takes values from the
> UnescapedQuoteHandling enumeration.
> 3. The default behavior of the CSV parser when unescaped quotes are found on
> the input changed to parsing until a delimiter character is found, i.e.
> UnescapedQuoteHandling.STOP_AT_DELIMITER. The old default of trying to find a
> closing quote (i.e. UnescapedQuoteHandling.STOP_AT_CLOSING_QUOTE) can be
> problematic when no closing quote is found, making the parser accumulate all
> characters into the same value until the end of the input.
> {quote}
> This matters for Spark in two ways. First, Spark uses this library for the
> CSV data source, so the performance improvements apply directly. Second,
> Spark calls {{setParseUnescapedQuotesUntilDelimiter}}, which is deprecated in
> this version in favor of the richer unescaped-quote handling. That is not
> directly a problem for Spark today, but we might have to consider moving to
> the new API in the future.
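[Editor's note] For reference, the replacement API described in item 2 of the release notes would be used roughly like this. This is a sketch against univocity-parsers 2.1.0 as described above; the package name is an assumption:

{code}
import com.univocity.parsers.csv.{CsvParserSettings, UnescapedQuoteHandling}

val settings = new CsvParserSettings()
// Replaces the deprecated settings.setParseUnescapedQuotesUntilDelimiter(true),
// and matches the new 2.1.0 default behavior:
settings.setUnescapedQuoteHandling(UnescapedQuoteHandling.STOP_AT_DELIMITER)
{code}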
[jira] [Assigned] (SPARK-15148) Upgrade Univocity library from 2.0.2 to 2.1.0
[ https://issues.apache.org/jira/browse/SPARK-15148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-15148:
------------------------------------

    Assignee: (was: Apache Spark)

> Upgrade Univocity library from 2.0.2 to 2.1.0
> ----------------------------------------------
>
>                 Key: SPARK-15148
>                 URL: https://issues.apache.org/jira/browse/SPARK-15148
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Hyukjin Kwon
>            Priority: Minor
>
> It looks like a new release of the Univocity CSV library was published,
> https://github.com/uniVocity/univocity-parsers/releases.
> It contains the following improvements:
> {quote}
> 1. Performance improvements for parsing/writing CSV and TSV. CSV writing and
> parsing got 30-40% faster.
> 2. Deprecated the methods setParseUnescapedQuotes and
> setParseUnescapedQuotesUntilDelimiter of class CsvParserSettings in favor of
> the new setUnescapedQuoteHandling method, which takes values from the
> UnescapedQuoteHandling enumeration.
> 3. The default behavior of the CSV parser when unescaped quotes are found on
> the input changed to parsing until a delimiter character is found, i.e.
> UnescapedQuoteHandling.STOP_AT_DELIMITER. The old default of trying to find a
> closing quote (i.e. UnescapedQuoteHandling.STOP_AT_CLOSING_QUOTE) can be
> problematic when no closing quote is found, making the parser accumulate all
> characters into the same value until the end of the input.
> {quote}
> This matters for Spark in two ways. First, Spark uses this library for the
> CSV data source, so the performance improvements apply directly. Second,
> Spark calls {{setParseUnescapedQuotesUntilDelimiter}}, which is deprecated in
> this version in favor of the richer unescaped-quote handling. That is not
> directly a problem for Spark today, but we might have to consider moving to
> the new API in the future.
[jira] [Commented] (SPARK-15148) Upgrade Univocity library from 2.0.2 to 2.1.0
[ https://issues.apache.org/jira/browse/SPARK-15148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15271964#comment-15271964 ]

Apache Spark commented on SPARK-15148:
--------------------------------------

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/12923

> Upgrade Univocity library from 2.0.2 to 2.1.0
> ----------------------------------------------
>
>                 Key: SPARK-15148
>                 URL: https://issues.apache.org/jira/browse/SPARK-15148
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Hyukjin Kwon
>            Priority: Minor
>
> It looks like a new release of the Univocity CSV library was published,
> https://github.com/uniVocity/univocity-parsers/releases.
> It contains the following improvements:
> {quote}
> 1. Performance improvements for parsing/writing CSV and TSV. CSV writing and
> parsing got 30-40% faster.
> 2. Deprecated the methods setParseUnescapedQuotes and
> setParseUnescapedQuotesUntilDelimiter of class CsvParserSettings in favor of
> the new setUnescapedQuoteHandling method, which takes values from the
> UnescapedQuoteHandling enumeration.
> 3. The default behavior of the CSV parser when unescaped quotes are found on
> the input changed to parsing until a delimiter character is found, i.e.
> UnescapedQuoteHandling.STOP_AT_DELIMITER. The old default of trying to find a
> closing quote (i.e. UnescapedQuoteHandling.STOP_AT_CLOSING_QUOTE) can be
> problematic when no closing quote is found, making the parser accumulate all
> characters into the same value until the end of the input.
> {quote}
> This matters for Spark in two ways. First, Spark uses this library for the
> CSV data source, so the performance improvements apply directly. Second,
> Spark calls {{setParseUnescapedQuotesUntilDelimiter}}, which is deprecated in
> this version in favor of the richer unescaped-quote handling. That is not
> directly a problem for Spark today, but we might have to consider moving to
> the new API in the future.
[jira] [Commented] (SPARK-15146) Allow specifying kafka parameters through configurations
[ https://issues.apache.org/jira/browse/SPARK-15146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15271962#comment-15271962 ]

Saisai Shao commented on SPARK-15146:
-------------------------------------

[~c...@koeninger.org], what is your opinion on the JIRA? Thanks a lot.

> Allow specifying kafka parameters through configurations
> ---------------------------------------------------------
>
>                 Key: SPARK-15146
>                 URL: https://issues.apache.org/jira/browse/SPARK-15146
>             Project: Spark
>          Issue Type: Improvement
>          Components: Streaming
>            Reporter: Saisai Shao
>            Priority: Minor
>
> The current Spark Streaming Kafka connector can only accept consumer
> parameters through the {{kafkaParams}} argument, which is not convenient for
> end users: they have to recompile their code each time they change a
> configuration.
> So here I propose to allow specifying Kafka consumer parameters through
> configurations, similar to what we do for Hadoop configurations.
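[Editor's note] For context, this is the programmatic path the proposal wants to supplement, sketched against the Spark Streaming Kafka 0.8 direct-stream API. The broker address and topic name are placeholders, and an existing StreamingContext {{ssc}} is assumed:

{code}
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

// Today, every consumer parameter is baked into the compiled application:
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
val topics = Set("events")

val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics)
{code}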
[jira] [Comment Edited] (SPARK-14495) Distinct aggregation cannot be used in the having clause
[ https://issues.apache.org/jira/browse/SPARK-14495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15271956#comment-15271956 ]

Xin Wu edited comment on SPARK-14495 at 5/5/16 6:25 AM:
--------------------------------------------------------

[~smilegator] I got the fix and running regtest now. Will submit the PR once it is done.

was (Author: xwu0226):
[~smilegator] I got the fix and running regtest now. Will submit the PR one it is done.

> Distinct aggregation cannot be used in the having clause
> ---------------------------------------------------------
>
>                 Key: SPARK-14495
>                 URL: https://issues.apache.org/jira/browse/SPARK-14495
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.1
>            Reporter: Yin Huai
>
> {code}
> select date, count(distinct id)
> from (select '2010-01-01' as date, 1 as id) tmp
> group by date
> having count(distinct id) > 0;
>
> org.apache.spark.sql.AnalysisException: resolved attribute(s) gid#558,id#559
> missing from date#554,id#555 in operator !Expand [List(date#554, null, 0,
> if ((gid#558 = 1)) id#559 else null),List(date#554, id#555, 1, null)],
> [date#554,id#561,gid#560,if ((gid = 1)) id else null#562];
>   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38)
>   at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
>   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:183)
>   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:50)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:121)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:120)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:120)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:120)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:120)
>   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:50)
>   at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
>   at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:34)
>   at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:133)
>   at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52)
>   at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:816)
> {code}
[jira] [Created] (SPARK-15148) Upgrade Univocity library from 2.0.2 to 2.1.0
Hyukjin Kwon created SPARK-15148:
------------------------------------

             Summary: Upgrade Univocity library from 2.0.2 to 2.1.0
                 Key: SPARK-15148
                 URL: https://issues.apache.org/jira/browse/SPARK-15148
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.0.0
            Reporter: Hyukjin Kwon
            Priority: Minor

It looks like a new release of the Univocity CSV library was published,
https://github.com/uniVocity/univocity-parsers/releases.
It contains the following improvements:

{quote}
1. Performance improvements for parsing/writing CSV and TSV. CSV writing and
parsing got 30-40% faster.
2. Deprecated the methods setParseUnescapedQuotes and
setParseUnescapedQuotesUntilDelimiter of class CsvParserSettings in favor of
the new setUnescapedQuoteHandling method, which takes values from the
UnescapedQuoteHandling enumeration.
3. The default behavior of the CSV parser when unescaped quotes are found on
the input changed to parsing until a delimiter character is found, i.e.
UnescapedQuoteHandling.STOP_AT_DELIMITER. The old default of trying to find a
closing quote (i.e. UnescapedQuoteHandling.STOP_AT_CLOSING_QUOTE) can be
problematic when no closing quote is found, making the parser accumulate all
characters into the same value until the end of the input.
{quote}

This matters for Spark in two ways. First, Spark uses this library for the CSV
data source, so the performance improvements apply directly. Second, Spark
calls {{setParseUnescapedQuotesUntilDelimiter}}, which is deprecated in this
version in favor of the richer unescaped-quote handling. That is not directly
a problem for Spark today, but we might have to consider moving to the new API
in the future.
[jira] [Commented] (SPARK-14495) Distinct aggregation cannot be used in the having clause
[ https://issues.apache.org/jira/browse/SPARK-14495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15271956#comment-15271956 ]

Xin Wu commented on SPARK-14495:
--------------------------------

[~smilegator] I got the fix and running regtest now. Will submit the PR one it is done.

> Distinct aggregation cannot be used in the having clause
> ---------------------------------------------------------
>
>                 Key: SPARK-14495
>                 URL: https://issues.apache.org/jira/browse/SPARK-14495
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.1
>            Reporter: Yin Huai
>
> {code}
> select date, count(distinct id)
> from (select '2010-01-01' as date, 1 as id) tmp
> group by date
> having count(distinct id) > 0;
>
> org.apache.spark.sql.AnalysisException: resolved attribute(s) gid#558,id#559
> missing from date#554,id#555 in operator !Expand [List(date#554, null, 0,
> if ((gid#558 = 1)) id#559 else null),List(date#554, id#555, 1, null)],
> [date#554,id#561,gid#560,if ((gid = 1)) id else null#562];
>   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38)
>   at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
>   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:183)
>   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:50)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:121)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:120)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:120)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:120)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:120)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:120)
>   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:50)
>   at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
>   at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:34)
>   at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:133)
>   at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52)
>   at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:816)
> {code}
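[Editor's note] Until the fix lands, a possible rewrite of the failing query is to alias the distinct count in a subquery and filter on the alias instead of repeating the aggregate in HAVING. This is a sketch, not taken from the JIRA; it assumes the usual {{sqlContext}} of Spark 1.6:

{code}
sqlContext.sql("""
  SELECT date, cnt FROM (
    SELECT date, COUNT(DISTINCT id) AS cnt
    FROM (SELECT '2010-01-01' AS date, 1 AS id) tmp
    GROUP BY date
  ) t
  WHERE cnt > 0
""")
{code}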
[jira] [Created] (SPARK-15147) Catalog should have a property to indicate case-sensitivity
Cheng Lian created SPARK-15147:
----------------------------------

             Summary: Catalog should have a property to indicate case-sensitivity
                 Key: SPARK-15147
                 URL: https://issues.apache.org/jira/browse/SPARK-15147
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.0.0
            Reporter: Cheng Lian

We are moving from the Hive metastore catalog to a more general, extensible
catalog design. One problem that hasn't been taken care of in the current
Spark 2.0 interfaces is case sensitivity. More specifically, the Hive
metastore is case insensitive: it simply stores column names, table names,
struct field names, and function names in lower case, and thus isn't even
case-preserving. However, case sensitivity in Spark SQL is configurable. We
need to add a property (or properties) to the {{Catalog}} interface to
indicate the case-sensitivity of underlying catalog implementations.
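[Editor's note] One hypothetical shape for the property this issue asks for, covering both of the behaviors the description distinguishes. The trait layout and member names here are illustrative assumptions, not the actual Spark 2.0 interface:

{code}
// A catalog implementation would mix this in and report its own semantics,
// so the analyzer can reconcile them with spark.sql.caseSensitive.
trait CatalogCaseSensitivity {
  /** Whether identifier lookups in the underlying catalog are case-sensitive. */
  def isCaseSensitive: Boolean

  /** Whether the catalog preserves the case of stored identifiers
   *  (Hive metastore lower-cases them, so it is neither). */
  def isCasePreserving: Boolean
}
{code}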
[jira] [Comment Edited] (SPARK-15114) Column name generated by typed aggregate is super verbose
[ https://issues.apache.org/jira/browse/SPARK-15114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15271943#comment-15271943 ]

Yin Huai edited comment on SPARK-15114 at 5/5/16 6:22 AM:
----------------------------------------------------------

I think at there, we should use a UnresolvedAlias. Will you have time to try it out and see what will be a good way to generate the alias?

was (Author: yhuai):
I think at here, we should use a UnresolvedAlias. Will you have time to try it out and see what will be a good way to generate the alias?

> Column name generated by typed aggregate is super verbose
> ----------------------------------------------------------
>
>                 Key: SPARK-15114
>                 URL: https://issues.apache.org/jira/browse/SPARK-15114
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>            Reporter: Yin Huai
>            Priority: Critical
>
> {code}
> case class Person(name: String, email: String, age: Long)
> val ds = spark.read.json("/tmp/person.json").as[Person]
> import org.apache.spark.sql.expressions.scala.typed._
>
> ds.groupByKey(_ => 0).agg(sum(_.age))
> // org.apache.spark.sql.Dataset[(Int, Double)] = [value: int,
> // typedsumdouble(unresolveddeserializer(newInstance(class Person), age#0L,
> // email#1, name#2), upcast(value)): double]
>
> ds.groupByKey(_ => 0).agg(sum(_.age)).explain
> == Physical Plan ==
> WholeStageCodegen
> :  +- TungstenAggregate(key=[value#84],
>       functions=[(TypedSumDouble($line15.$read$$iw$$iw$Person),mode=Final,isDistinct=false)],
>       output=[value#84,typedsumdouble(unresolveddeserializer(newInstance(class
>       $line15.$read$$iw$$iw$Person), age#0L, email#1, name#2), upcast(value))#91])
> :     +- INPUT
> +- Exchange hashpartitioning(value#84, 200), None
>    +- WholeStageCodegen
>       :  +- TungstenAggregate(key=[value#84],
>             functions=[(TypedSumDouble($line15.$read$$iw$$iw$Person),mode=Partial,isDistinct=false)],
>             output=[value#84,value#97])
>       :     +- INPUT
>       +- AppendColumns , newInstance(class $line15.$read$$iw$$iw$Person),
>          [input[0, int] AS value#84]
>          +- WholeStageCodegen
>             :  +- Scan HadoopFiles[age#0L,email#1,name#2] Format: JSON,
>                   PushedFilters: [], ReadSchema: struct
> {code}
[jira] [Created] (SPARK-15146) Allow specifying kafka parameters through configurations
Saisai Shao created SPARK-15146:
-----------------------------------

             Summary: Allow specifying kafka parameters through configurations
                 Key: SPARK-15146
                 URL: https://issues.apache.org/jira/browse/SPARK-15146
             Project: Spark
          Issue Type: Improvement
          Components: Streaming
            Reporter: Saisai Shao
            Priority: Minor

The current Spark Streaming Kafka connector can only accept consumer
parameters through the {{kafkaParams}} argument, which is not convenient for
end users: they have to recompile their code each time they change a
configuration.

So here I propose to allow specifying Kafka consumer parameters through
configurations, similar to what we do for Hadoop configurations.
[jira] [Commented] (SPARK-15114) Column name generated by typed aggregate is super verbose
[ https://issues.apache.org/jira/browse/SPARK-15114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15271943#comment-15271943 ]

Yin Huai commented on SPARK-15114:
----------------------------------

I think at here, we should use a UnresolvedAlias. Will you have time to try it out and see what will be a good way to generate the alias?

> Column name generated by typed aggregate is super verbose
> ----------------------------------------------------------
>
>                 Key: SPARK-15114
>                 URL: https://issues.apache.org/jira/browse/SPARK-15114
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>            Reporter: Yin Huai
>            Priority: Critical
>
> {code}
> case class Person(name: String, email: String, age: Long)
> val ds = spark.read.json("/tmp/person.json").as[Person]
> import org.apache.spark.sql.expressions.scala.typed._
>
> ds.groupByKey(_ => 0).agg(sum(_.age))
> // org.apache.spark.sql.Dataset[(Int, Double)] = [value: int,
> // typedsumdouble(unresolveddeserializer(newInstance(class Person), age#0L,
> // email#1, name#2), upcast(value)): double]
>
> ds.groupByKey(_ => 0).agg(sum(_.age)).explain
> == Physical Plan ==
> WholeStageCodegen
> :  +- TungstenAggregate(key=[value#84],
>       functions=[(TypedSumDouble($line15.$read$$iw$$iw$Person),mode=Final,isDistinct=false)],
>       output=[value#84,typedsumdouble(unresolveddeserializer(newInstance(class
>       $line15.$read$$iw$$iw$Person), age#0L, email#1, name#2), upcast(value))#91])
> :     +- INPUT
> +- Exchange hashpartitioning(value#84, 200), None
>    +- WholeStageCodegen
>       :  +- TungstenAggregate(key=[value#84],
>             functions=[(TypedSumDouble($line15.$read$$iw$$iw$Person),mode=Partial,isDistinct=false)],
>             output=[value#84,value#97])
>       :     +- INPUT
>       +- AppendColumns , newInstance(class $line15.$read$$iw$$iw$Person),
>          [input[0, int] AS value#84]
>          +- WholeStageCodegen
>             :  +- Scan HadoopFiles[age#0L,email#1,name#2] Format: JSON,
>                   PushedFilters: [], ReadSchema: struct
> {code}
[jira] [Commented] (SPARK-15144) option nullValue for CSV data source not working for several types.
[ https://issues.apache.org/jira/browse/SPARK-15144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15271927#comment-15271927 ]

Apache Spark commented on SPARK-15144:
--------------------------------------

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/12921

> option nullValue for CSV data source not working for several types.
> ---------------------------------------------------------------------
>
>                 Key: SPARK-15144
>                 URL: https://issues.apache.org/jira/browse/SPARK-15144
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Hyukjin Kwon
>
> The {{nullValue}} option does not work for the types {{BooleanType}},
> {{TimestampType}}, {{DateType}}, and {{StringType}}.
> So currently there is no way to read {{null}} for those types.
[jira] [Assigned] (SPARK-15144) option nullValue for CSV data source not working for several types.
[ https://issues.apache.org/jira/browse/SPARK-15144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-15144:
------------------------------------

    Assignee: (was: Apache Spark)

> option nullValue for CSV data source not working for several types.
> ---------------------------------------------------------------------
>
>                 Key: SPARK-15144
>                 URL: https://issues.apache.org/jira/browse/SPARK-15144
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Hyukjin Kwon
>
> The {{nullValue}} option does not work for the types {{BooleanType}},
> {{TimestampType}}, {{DateType}}, and {{StringType}}.
> So currently there is no way to read {{null}} for those types.
[jira] [Assigned] (SPARK-15144) option nullValue for CSV data source not working for several types.
[ https://issues.apache.org/jira/browse/SPARK-15144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-15144:
------------------------------------

    Assignee: Apache Spark

> option nullValue for CSV data source not working for several types.
> ---------------------------------------------------------------------
>
>                 Key: SPARK-15144
>                 URL: https://issues.apache.org/jira/browse/SPARK-15144
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Hyukjin Kwon
>            Assignee: Apache Spark
>
> The {{nullValue}} option does not work for the types {{BooleanType}},
> {{TimestampType}}, {{DateType}}, and {{StringType}}.
> So currently there is no way to read {{null}} for those types.
[jira] [Assigned] (SPARK-15143) CSV data source is not being tested as HadoopFsRelation
[ https://issues.apache.org/jira/browse/SPARK-15143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-15143:
------------------------------------

    Assignee: Apache Spark

> CSV data source is not being tested as HadoopFsRelation
> ---------------------------------------------------------
>
>                 Key: SPARK-15143
>                 URL: https://issues.apache.org/jira/browse/SPARK-15143
>             Project: Spark
>          Issue Type: Test
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Hyukjin Kwon
>            Assignee: Apache Spark
>
> JSON, Parquet, Text, and ORC are tested by extending {{HadoopFsRelationTest}},
> which covers roughly 60 tests. CSV is not tested this way.
[jira] [Assigned] (SPARK-15145) spark.ml binary classification should include accuracy
[ https://issues.apache.org/jira/browse/SPARK-15145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-15145:
------------------------------------

    Assignee: Apache Spark

> spark.ml binary classification should include accuracy
> --------------------------------------------------------
>
>                 Key: SPARK-15145
>                 URL: https://issues.apache.org/jira/browse/SPARK-15145
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML
>            Reporter: Miao Wang
>            Assignee: Apache Spark
>            Priority: Minor
>
> spark.ml binary classification should include accuracy. This JIRA is related
> to SPARK-14900.
[jira] [Assigned] (SPARK-15143) CSV data source is not being tested as HadoopFsRelation
[ https://issues.apache.org/jira/browse/SPARK-15143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-15143:
------------------------------------

    Assignee: (was: Apache Spark)

> CSV data source is not being tested as HadoopFsRelation
> ---------------------------------------------------------
>
>                 Key: SPARK-15143
>                 URL: https://issues.apache.org/jira/browse/SPARK-15143
>             Project: Spark
>          Issue Type: Test
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Hyukjin Kwon
>
> JSON, Parquet, Text, and ORC are tested by extending {{HadoopFsRelationTest}},
> which covers roughly 60 tests. CSV is not tested this way.
[jira] [Commented] (SPARK-15143) CSV data source is not being tested as HadoopFsRelation
[ https://issues.apache.org/jira/browse/SPARK-15143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15271926#comment-15271926 ]

Apache Spark commented on SPARK-15143:
--------------------------------------

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/12921

> CSV data source is not being tested as HadoopFsRelation
> ---------------------------------------------------------
>
>                 Key: SPARK-15143
>                 URL: https://issues.apache.org/jira/browse/SPARK-15143
>             Project: Spark
>          Issue Type: Test
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Hyukjin Kwon
>
> JSON, Parquet, Text, and ORC are tested by extending {{HadoopFsRelationTest}},
> which covers roughly 60 tests. CSV is not tested this way.
[jira] [Commented] (SPARK-15145) spark.ml binary classification should include accuracy
[ https://issues.apache.org/jira/browse/SPARK-15145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15271925#comment-15271925 ]

Apache Spark commented on SPARK-15145:
--------------------------------------

User 'wangmiao1981' has created a pull request for this issue:
https://github.com/apache/spark/pull/12922

> spark.ml binary classification should include accuracy
> --------------------------------------------------------
>
>                 Key: SPARK-15145
>                 URL: https://issues.apache.org/jira/browse/SPARK-15145
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML
>            Reporter: Miao Wang
>            Priority: Minor
>
> spark.ml binary classification should include accuracy. This JIRA is related
> to SPARK-14900.
[jira] [Assigned] (SPARK-15145) spark.ml binary classification should include accuracy
[ https://issues.apache.org/jira/browse/SPARK-15145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-15145:
------------------------------------

    Assignee: (was: Apache Spark)

> spark.ml binary classification should include accuracy
> --------------------------------------------------------
>
>                 Key: SPARK-15145
>                 URL: https://issues.apache.org/jira/browse/SPARK-15145
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML
>            Reporter: Miao Wang
>            Priority: Minor
>
> spark.ml binary classification should include accuracy. This JIRA is related
> to SPARK-14900.
[jira] [Created] (SPARK-15145) spark.ml binary classification should include accuracy
Miao Wang created SPARK-15145:
---------------------------------

             Summary: spark.ml binary classification should include accuracy
                 Key: SPARK-15145
                 URL: https://issues.apache.org/jira/browse/SPARK-15145
             Project: Spark
          Issue Type: New Feature
          Components: ML
            Reporter: Miao Wang
            Priority: Minor

spark.ml binary classification should include accuracy. This JIRA is related
to SPARK-14900.
[jira] [Commented] (SPARK-15144) option nullValue for CSV data source not working for several types.
[ https://issues.apache.org/jira/browse/SPARK-15144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15271922#comment-15271922 ]

Abhinav Gupta commented on SPARK-15144:
---------------------------------------

Can you explain with an example, so that I can reproduce it?

> option nullValue for CSV data source not working for several types.
> ---------------------------------------------------------------------
>
>                 Key: SPARK-15144
>                 URL: https://issues.apache.org/jira/browse/SPARK-15144
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Hyukjin Kwon
>
> The {{nullValue}} option does not work for the types {{BooleanType}},
> {{TimestampType}}, {{DateType}}, and {{StringType}}.
> So currently there is no way to read {{null}} for those types.
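[Editor's note] A minimal reproduction might look like the following, sketched against Spark 2.0's built-in CSV reader; the file path, column names, and the "NA" marker are made up for illustration. The expectation is that every "NA" cell becomes null, but per this issue that only happens for numeric columns:

{code}
import org.apache.spark.sql.types._

// /tmp/people.csv:
//   name,active,born
//   alice,true,2010-01-01
//   NA,NA,NA
val schema = StructType(Seq(
  StructField("name", StringType),
  StructField("active", BooleanType),
  StructField("born", DateType)))

val df = spark.read
  .option("header", "true")
  .option("nullValue", "NA")   // should map "NA" to null for every type
  .schema(schema)
  .csv("/tmp/people.csv")

df.show()  // the string/boolean/date "NA" cells are not read back as null
{code}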
[jira] [Created] (SPARK-15144) option nullValue for CSV data source not working for several types.
Hyukjin Kwon created SPARK-15144:
------------------------------------

             Summary: option nullValue for CSV data source not working for several types.
                 Key: SPARK-15144
                 URL: https://issues.apache.org/jira/browse/SPARK-15144
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.0.0
            Reporter: Hyukjin Kwon

The {{nullValue}} option does not work for the types {{BooleanType}},
{{TimestampType}}, {{DateType}}, and {{StringType}}.
So currently there is no way to read {{null}} for those types.
[jira] [Created] (SPARK-15143) CSV data source is not being tested as HadoopFsRelation
Hyukjin Kwon created SPARK-15143:
------------------------------------

             Summary: CSV data source is not being tested as HadoopFsRelation
                 Key: SPARK-15143
                 URL: https://issues.apache.org/jira/browse/SPARK-15143
             Project: Spark
          Issue Type: Test
          Components: SQL
    Affects Versions: 2.0.0
            Reporter: Hyukjin Kwon

JSON, Parquet, Text, and ORC are tested by extending {{HadoopFsRelationTest}},
which covers roughly 60 tests. CSV is not tested this way.
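[Editor's note] Hooking CSV into the shared suite could look roughly like this, modeled on how the existing JSON/Parquet suites extend the harness. The exact abstract members of {{HadoopFsRelationTest}} are an assumption here:

{code}
class CSVHadoopFsRelationSuite extends HadoopFsRelationTest {
  // Registers the ~60 shared read/write tests in HadoopFsRelationTest
  // against the CSV data source.
  override val dataSourceName: String = "csv"
}
{code}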
[jira] [Updated] (SPARK-15045) Remove dead code in TaskMemoryManager.cleanUpAllAllocatedMemory for pageTable
[ https://issues.apache.org/jira/browse/SPARK-15045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Davies Liu updated SPARK-15045:
-------------------------------
    Assignee: Jacek Lewandowski

> Remove dead code in TaskMemoryManager.cleanUpAllAllocatedMemory for pageTable
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-15045
>                 URL: https://issues.apache.org/jira/browse/SPARK-15045
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 2.0.0
>            Reporter: Jacek Laskowski
>            Assignee: Jacek Lewandowski
>             Fix For: 2.0.0
>
> Unless my eyes trick me, {{TaskMemoryManager}} first clears up {{pageTable}}
> in a synchronized block, and right after the block it does it again. I think
> the outside cleaning is dead code.
> See
> https://github.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/memory/TaskMemoryManager.java#L382-L397
> with the relevant snippet pasted below:
> {code}
> public long cleanUpAllAllocatedMemory() {
>   synchronized (this) {
>     Arrays.fill(pageTable, null);
>     ...
>   }
>
>   for (MemoryBlock page : pageTable) {
>     if (page != null) {
>       memoryManager.tungstenMemoryAllocator().free(page);
>     }
>   }
>   Arrays.fill(pageTable, null);
>   ...
> {code}
[jira] [Resolved] (SPARK-15045) Remove dead code in TaskMemoryManager.cleanUpAllAllocatedMemory for pageTable
[ https://issues.apache.org/jira/browse/SPARK-15045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Davies Liu resolved SPARK-15045.
--------------------------------
       Resolution: Fixed
    Fix Version/s: 2.0.0

Issue resolved by pull request 12829
[https://github.com/apache/spark/pull/12829]

> Remove dead code in TaskMemoryManager.cleanUpAllAllocatedMemory for pageTable
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-15045
>                 URL: https://issues.apache.org/jira/browse/SPARK-15045
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 2.0.0
>            Reporter: Jacek Laskowski
>             Fix For: 2.0.0
>
> Unless my eyes trick me, {{TaskMemoryManager}} first clears up {{pageTable}}
> in a synchronized block, and right after the block it does it again. I think
> the outside cleaning is dead code.
> See
> https://github.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/memory/TaskMemoryManager.java#L382-L397
> with the relevant snippet pasted below:
> {code}
> public long cleanUpAllAllocatedMemory() {
>   synchronized (this) {
>     Arrays.fill(pageTable, null);
>     ...
>   }
>
>   for (MemoryBlock page : pageTable) {
>     if (page != null) {
>       memoryManager.tungstenMemoryAllocator().free(page);
>     }
>   }
>   Arrays.fill(pageTable, null);
>   ...
> {code}
[jira] [Created] (SPARK-15142) Spark Mesos dispatcher becomes unusable when the Mesos master restarts
Devaraj K created SPARK-15142:
---------------------------------

             Summary: Spark Mesos dispatcher becomes unusable when the Mesos master restarts
                 Key: SPARK-15142
                 URL: https://issues.apache.org/jira/browse/SPARK-15142
             Project: Spark
          Issue Type: Bug
          Components: Deploy, Mesos
            Reporter: Devaraj K
            Priority: Minor

If the Mesos master gets restarted while the Spark Mesos dispatcher is
running, the dispatcher keeps running but queues up all subsequently
submitted applications without ever launching them.
[jira] [Resolved] (SPARK-15132) Debug log for generated code should be printed with proper indentation
[ https://issues.apache.org/jira/browse/SPARK-15132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin resolved SPARK-15132.
---------------------------------
       Resolution: Fixed
         Assignee: Kousuke Saruta
    Fix Version/s: 2.0.0

> Debug log for generated code should be printed with proper indentation
> ------------------------------------------------------------------------
>
>                 Key: SPARK-15132
>                 URL: https://issues.apache.org/jira/browse/SPARK-15132
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Kousuke Saruta
>            Assignee: Kousuke Saruta
>            Priority: Trivial
>             Fix For: 2.0.0
>
> Similar to SPARK-14185, GenerateOrdering and GenerateColumnAccessor should
> print the debug log for generated code with proper indentation.
[jira] [Assigned] (SPARK-15141) Add python example for OneVsRest
[ https://issues.apache.org/jira/browse/SPARK-15141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-15141:
------------------------------------

    Assignee: (was: Apache Spark)

> Add python example for OneVsRest
> ---------------------------------
>
>                 Key: SPARK-15141
>                 URL: https://issues.apache.org/jira/browse/SPARK-15141
>             Project: Spark
>          Issue Type: Documentation
>          Components: Documentation
>            Reporter: zhengruifeng
>
> Add the missing Python example for OvR.
[jira] [Assigned] (SPARK-15141) Add python example for OneVsRest
[ https://issues.apache.org/jira/browse/SPARK-15141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-15141:
------------------------------------

    Assignee: Apache Spark

> Add python example for OneVsRest
> ---------------------------------
>
>                 Key: SPARK-15141
>                 URL: https://issues.apache.org/jira/browse/SPARK-15141
>             Project: Spark
>          Issue Type: Documentation
>          Components: Documentation
>            Reporter: zhengruifeng
>            Assignee: Apache Spark
>
> Add the missing Python example for OvR.
[jira] [Commented] (SPARK-15141) Add python example for OneVsRest
[ https://issues.apache.org/jira/browse/SPARK-15141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15271891#comment-15271891 ]

Apache Spark commented on SPARK-15141:
--------------------------------------

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/12920

> Add python example for OneVsRest
> ---------------------------------
>
>                 Key: SPARK-15141
>                 URL: https://issues.apache.org/jira/browse/SPARK-15141
>             Project: Spark
>          Issue Type: Documentation
>          Components: Documentation
>            Reporter: zhengruifeng
>
> Add the missing Python example for OvR.
[jira] [Created] (SPARK-15141) Add python example for OneVsRest
zhengruifeng created SPARK-15141:
------------------------------------

             Summary: Add python example for OneVsRest
                 Key: SPARK-15141
                 URL: https://issues.apache.org/jira/browse/SPARK-15141
             Project: Spark
          Issue Type: Documentation
          Components: Documentation
            Reporter: zhengruifeng

Add the missing Python example for OvR.
[jira] [Commented] (SPARK-15140) ensure input object of encoder is not null
[ https://issues.apache.org/jira/browse/SPARK-15140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15271885#comment-15271885 ]

Wenchen Fan commented on SPARK-15140:
-------------------------------------

cc [~marmbrus] [~lian cheng]

> ensure input object of encoder is not null
> -------------------------------------------
>
>                 Key: SPARK-15140
>                 URL: https://issues.apache.org/jira/browse/SPARK-15140
>             Project: Spark
>          Issue Type: Improvement
>            Reporter: Wenchen Fan
>
> Currently we assume the input object for an encoder won't be null, but we
> don't check it. For example, in 1.6 `Seq("a", null).toDS.collect` throws an
> NPE, while in 2.0 it returns Array("a", null).
> We should define this behaviour more clearly.
[jira] [Created] (SPARK-15140) ensure input object of encoder is not null
Wenchen Fan created SPARK-15140:
-----------------------------------

             Summary: ensure input object of encoder is not null
                 Key: SPARK-15140
                 URL: https://issues.apache.org/jira/browse/SPARK-15140
             Project: Spark
          Issue Type: Improvement
            Reporter: Wenchen Fan

Currently we assume the input object for an encoder won't be null, but we
don't check it. For example, in 1.6 `Seq("a", null).toDS.collect` throws an
NPE, while in 2.0 it returns Array("a", null).
We should define this behaviour more clearly.
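[Editor's note] The behavior difference described above, as runnable lines (a SparkSession {{spark}} is assumed for the 2.0 case; the 1.6/2.0 outcomes are those stated in the issue):

{code}
import spark.implicits._

Seq("a", null).toDS().collect()
// Spark 1.6: throws a NullPointerException while encoding the null element.
// Spark 2.0: returns Array("a", null).
{code}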
[jira] [Resolved] (SPARK-15131) StateStore management thread does not stop after SparkContext is shutdown
[ https://issues.apache.org/jira/browse/SPARK-15131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shixiong Zhu resolved SPARK-15131.
----------------------------------
       Resolution: Fixed
    Fix Version/s: 2.0.0

> StateStore management thread does not stop after SparkContext is shutdown
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-15131
>                 URL: https://issues.apache.org/jira/browse/SPARK-15131
>             Project: Spark
>          Issue Type: Bug
>            Reporter: Tathagata Das
>            Assignee: Tathagata Das
>             Fix For: 2.0.0
[jira] [Commented] (SPARK-10713) SPARK_DIST_CLASSPATH ignored on Mesos executors
[ https://issues.apache.org/jira/browse/SPARK-10713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15271875#comment-15271875 ]

Devaraj K commented on SPARK-10713:
-----------------------------------

bq. However, on Mesos, SPARK_DIST_CLASSPATH is missing from executors and jar is not in the classpath. It is present on YARN. Am I missing something? Do you see different behavior?

In my case, I see that the jars/path provided in SPARK_DIST_CLASSPATH are
included in the executors' classpath as well as in the driver's classpath.

> SPARK_DIST_CLASSPATH ignored on Mesos executors
> ------------------------------------------------
>
>                 Key: SPARK-10713
>                 URL: https://issues.apache.org/jira/browse/SPARK-10713
>             Project: Spark
>          Issue Type: Bug
>          Components: Deploy, Mesos
>    Affects Versions: 1.5.0
>            Reporter: Dara Adib
>            Priority: Minor
>
> If I set the environment variable SPARK_DIST_CLASSPATH, the jars are included
> on the driver, but not on Mesos executors. Docs:
> https://spark.apache.org/docs/latest/hadoop-provided.html
>
> I see SPARK_DIST_CLASSPATH mentioned in these files:
> launcher/src/main/java/org/apache/spark/launcher/AbstractCommandBuilder.java
> project/SparkBuild.scala
> yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala
>
> But not the Mesos executor (or should it be included by the launcher
> library?):
> spark/core/src/main/scala/org/apache/spark/executor/Executor.scala
[jira] [Commented] (SPARK-15122) TPC-DS Query 41 fails with The correlated scalar subquery can only contain equality predicates
[ https://issues.apache.org/jira/browse/SPARK-15122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15271874#comment-15271874 ]

Frederick Reiss commented on SPARK-15122:
-----------------------------------------

In the official version of the query, the expression {{i_manufact = i1.i_manufact}} appears twice: once on either side of an {{OR}}. The optimizer needs to normalize the expression enough to factor that subexpression out of the two sides of the disjunction. Also, the error checking code in {{CheckAnalysis.scala}} that triggers the problem needs to trigger *after* that normalization. It looks like that check happens before the call to {{Optimizer.execute}}.

> TPC-DS Query 41 fails with The correlated scalar subquery can only contain equality predicates
> ------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-15122
>                 URL: https://issues.apache.org/jira/browse/SPARK-15122
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: JESSE CHEN
>            Priority: Critical
>
> The official TPC-DS query 41 fails with the following error:
> {noformat}
> Error in query: The correlated scalar subquery can only contain equality
> predicates: (((i_manufact#38 = i_manufact#16) && (i_category#36 = Women)
> && ((i_color#41 = powder) || (i_color#41 = khaki))) && (((i_units#42 = Ounce)
> || (i_units#42 = Oz)) && ((i_size#39 = medium) || (i_size#39 = extra
> large || (((i_category#36 = Women) && ((i_color#41 = brown) ||
> (i_color#41 = honeydew))) && (((i_units#42 = Bunch) || (i_units#42 = Ton)) &&
> ((i_size#39 = N/A) || (i_size#39 = small) || i_category#36 = Men) &&
> ((i_color#41 = floral) || (i_color#41 = deep))) && (((i_units#42 = N/A) ||
> (i_units#42 = Dozen)) && ((i_size#39 = petite) || (i_size#39 = large ||
> (((i_category#36 = Men) && ((i_color#41 = light) || (i_color#41 =
> cornflower))) && (((i_units#42 = Box) || (i_units#42 = Pound)) && ((i_size#39
> = medium) || (i_size#39 = extra large))) || ((i_manufact#38 =
> i_manufact#16) && (i_category#36 = Women) && ((i_color#41 = midnight) ||
> (i_color#41 = snow))) && (((i_units#42 = Pallet) || (i_units#42 = Gross)) &&
> ((i_size#39 = medium) || (i_size#39 = extra large || (((i_category#36 =
> Women) && ((i_color#41 = cyan) || (i_color#41 = papaya))) && (((i_units#42 =
> Cup) || (i_units#42 = Dram)) && ((i_size#39 = N/A) || (i_size#39 = small)
> || i_category#36 = Men) && ((i_color#41 = orange) || (i_color#41 =
> frosted))) && (((i_units#42 = Each) || (i_units#42 = Tbl)) && ((i_size#39 =
> petite) || (i_size#39 = large || (((i_category#36 = Men) && ((i_color#41
> = forest) || (i_color#41 = ghost))) && (((i_units#42 = Lb) || (i_units#42 =
> Bundle)) && ((i_size#39 = medium) || (i_size#39 = extra large;
> {noformat}
> The output plans showed the following errors:
> {noformat}
> == Parsed Logical Plan ==
> 'GlobalLimit 100
> +- 'LocalLimit 100
>    +- 'Sort ['i_product_name ASC], true
>       +- 'Distinct
>          +- 'Project ['i_product_name]
>             +- 'Filter ((('i_manufact_id >= 738) && ('i_manufact_id <= (738 + 40))) && (scalar-subquery#1 [] > 0))
>                :  +- 'SubqueryAlias scalar-subquery#1 []
>                :     +- 'Project ['count(1) AS item_cnt#0]
>                :        +- 'Filter ((('i_manufact = 'i1.i_manufact) &&
>                              ('i_category = Women) && (('i_color = powder) || ('i_color = khaki))) &&
>                              ((('i_units = Ounce) || ('i_units = Oz)) && (('i_size = medium) || ('i_size =
>                              extra large || ((('i_category = Women) && (('i_color = brown) ||
>                              ('i_color = honeydew))) && ((('i_units = Bunch) || ('i_units = Ton)) &&
>                              (('i_size = N/A) || ('i_size = small) || 'i_category = Men) &&
>                              (('i_color = floral) || ('i_color = deep))) && ((('i_units = N/A) ||
>                              ('i_units = Dozen)) && (('i_size = petite) || ('i_size = large ||
>                              ((('i_category = Men) && (('i_color = light) || ('i_color = cornflower))) &&
>                              ((('i_units = Box) || ('i_units = Pound)) && (('i_size = medium) || ('i_size
>                              = extra large))) || (('i_manufact = 'i1.i_manufact) && ('i_category =
>                              Women) && (('i_color = midnight) || ('i_color = snow))) && ((('i_units =
>                              Pallet) || ('i_units = Gross)) && (('i_size = medium) || ('i_size = extra
>                              large || ((('i_category = Women) && (('i_color = cyan) || ('i_color =
>                              papaya))) && ((('i_units = Cup) || ('i_units = Dram)) && (('i_size = N/A) ||
>                              ('i_size = small) || 'i_category = Men) && (('i_color = orange) ||
>                              ('i_color = frosted))) && ((('i_units = Each) || ('i_units = Tbl)) &&
>                              (('i_size = petite) || ('i_size = large || ((('i_category = Men) &&
>                              (('i_color = forest) || ('i_color = ghost))) && ((('i_units = Lb) ||
>                              ('i_units = Bundle)) && (('i_size = medium) || ('i_size = extra large))
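[Editor's note] A toy illustration of the normalization step the comment above describes, not Catalyst's actual rule: factor a conjunct that appears on both sides of a disjunction, i.e. rewrite {{(a AND b) OR (a AND c)}} to {{a AND (b OR c)}} so the correlated equality predicate surfaces at the top level:

{code}
sealed trait Expr
case class Pred(sql: String) extends Expr
case class And(left: Expr, right: Expr) extends Expr
case class Or(left: Expr, right: Expr) extends Expr

// (a && b) || (a && c)  ==>  a && (b || c)
def factorCommonConjunct(e: Expr): Expr = e match {
  case Or(And(a, b), And(c, d)) if a == c => And(a, Or(b, d))
  case other => other
}

val m = Pred("i_manufact = i1.i_manufact")
factorCommonConjunct(Or(And(m, Pred("p1")), And(m, Pred("p2"))))
// => And(Pred(i_manufact = i1.i_manufact), Or(Pred(p1), Pred(p2)))
{code}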
[jira] [Assigned] (SPARK-15092) toDebugString missing from ML DecisionTreeClassifier
[ https://issues.apache.org/jira/browse/SPARK-15092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-15092:
------------------------------------

    Assignee: Apache Spark

> toDebugString missing from ML DecisionTreeClassifier
> ------------------------------------------------------
>
>                 Key: SPARK-15092
>                 URL: https://issues.apache.org/jira/browse/SPARK-15092
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 1.6.0
>         Environment: HDP 2.3.4, Red Hat 6.7
>            Reporter: Ivan SPM
>            Assignee: Apache Spark
>            Priority: Minor
>              Labels: features
>
> The attribute toDebugString is missing from the DecisionTreeClassifier and
> DecisionTreeClassifierModel from ML. The attribute exists on the MLlib
> DecisionTree model. There's no way to check or print the model tree
> structure from ML.
> The basic code for it is this:
> {code}
> from pyspark.ml import Pipeline
> from pyspark.ml.feature import VectorAssembler, StringIndexer
> from pyspark.ml.classification import DecisionTreeClassifier
>
> cl = DecisionTreeClassifier(labelCol='target_idx', featuresCol='features')
> pipe = Pipeline(stages=[target_index, assembler, cl])
> model = pipe.fit(df_train)
>
> # Prediction and model evaluation
> predictions = model.transform(df_test)
> mc_evaluator = MulticlassClassificationEvaluator(
>     labelCol="target_idx", predictionCol="prediction", metricName="precision")
> accuracy = mc_evaluator.evaluate(predictions)
> print("Test Error = {}".format(1.0 - accuracy))
> {code}
> Now it would be great to be able to do what is being done on the MLlib model:
> {code}
> print model.toDebugString(),  # it already has a newline
> DecisionTreeModel classifier of depth 1 with 3 nodes
>   If (feature 0 <= 0.0)
>    Predict: 0.0
>   Else (feature 0 > 0.0)
>    Predict: 1.0
> {code}
> But there's no toDebugString attribute on either the pipeline model or the
> DecisionTreeClassifier model:
> {code}
> cl.toDebugString()
> AttributeError
> {code}
> https://spark.apache.org/docs/1.6.0/api/python/_modules/pyspark/mllib/tree.html
[jira] [Commented] (SPARK-15092) toDebugString missing from ML DecisionTreeClassifier
[ https://issues.apache.org/jira/browse/SPARK-15092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15271873#comment-15271873 ] holdenk commented on SPARK-15092: - Ah yes, it is present in Java, so it is a simple fix. I've created a PR for this as part of our API audit of Python ML for 2.0, so hopefully we can get something in soon. > toDebugString missing from ML DecisionTreeClassifier > > > Key: SPARK-15092 > URL: https://issues.apache.org/jira/browse/SPARK-15092 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 1.6.0 > Environment: HDP 2.3.4, Red Hat 6.7 >Reporter: Ivan SPM >Priority: Minor > Labels: features > > The attribute toDebugString is missing from the DecisionTreeClassifier and > DecisionTreeClassifierModel from ML. The attribute exists on the MLLib > DecisionTree model. > There's no way to check or print the model tree structure from the ML API. > The basic code for it is this: > from pyspark.ml import Pipeline > from pyspark.ml.feature import VectorAssembler, StringIndexer > from pyspark.ml.classification import DecisionTreeClassifier > from pyspark.ml.evaluation import MulticlassClassificationEvaluator > cl = DecisionTreeClassifier(labelCol='target_idx', featuresCol='features') > pipe = Pipeline(stages=[target_index, assembler, cl]) > model = pipe.fit(df_train) > # Prediction and model evaluation > predictions = model.transform(df_test) > mc_evaluator = MulticlassClassificationEvaluator( > labelCol="target_idx", predictionCol="prediction", metricName="precision") > accuracy = mc_evaluator.evaluate(predictions) > print("Test Error = {}".format(1.0 - accuracy)) > Now it would be great to be able to do what is being done on the MLLib model: > print model.toDebugString(), # it already has newline > DecisionTreeModel classifier of depth 1 with 3 nodes > If (feature 0 <= 0.0) >Predict: 0.0 > Else (feature 0 > 0.0) >Predict: 1.0 > but there's no toDebugString attribute on either the pipeline model or the > DecisionTreeClassifier model: > cl.toDebugString() > Attribute Error > https://spark.apache.org/docs/1.6.0/api/python/_modules/pyspark/mllib/tree.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
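For comparison, the Scala API already exposes the tree dump holdenk mentions; a minimal sketch, assuming {{pipelineModel}} is a fitted {{PipelineModel}} whose last stage is the decision tree (as in the snippet above):
{code}
import org.apache.spark.ml.classification.DecisionTreeClassificationModel

// The last pipeline stage is assumed to be the fitted tree model.
val treeModel = pipelineModel.stages.last
  .asInstanceOf[DecisionTreeClassificationModel]
println(treeModel.toDebugString) // prints the If/Else tree structure shown above
{code}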
[jira] [Assigned] (SPARK-15139) PySpark TreeEnsemble missing methods
[ https://issues.apache.org/jira/browse/SPARK-15139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15139: Assignee: (was: Apache Spark) > PySpark TreeEnsemble missing methods > > > Key: SPARK-15139 > URL: https://issues.apache.org/jira/browse/SPARK-15139 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: holdenk >Priority: Minor > > TreeEnsemble class is missing some accessor methods compared to Scala API -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15139) PySpark TreeEnsemble missing methods
[ https://issues.apache.org/jira/browse/SPARK-15139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15271871#comment-15271871 ] Apache Spark commented on SPARK-15139: -- User 'holdenk' has created a pull request for this issue: https://github.com/apache/spark/pull/12919 > PySpark TreeEnsemble missing methods > > > Key: SPARK-15139 > URL: https://issues.apache.org/jira/browse/SPARK-15139 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: holdenk >Priority: Minor > > TreeEnsemble class is missing some accessor methods compared to Scala API -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15092) toDebugString missing from ML DecisionTreeClassifier
[ https://issues.apache.org/jira/browse/SPARK-15092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15271872#comment-15271872 ] Apache Spark commented on SPARK-15092: -- User 'holdenk' has created a pull request for this issue: https://github.com/apache/spark/pull/12919 > toDebugString missing from ML DecisionTreeClassifier > > > Key: SPARK-15092 > URL: https://issues.apache.org/jira/browse/SPARK-15092 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 1.6.0 > Environment: HDP 2.3.4, Red Hat 6.7 >Reporter: Ivan SPM >Priority: Minor > Labels: features > > The attribute toDebugString is missing from the DecisionTreeClassifier and > DecisionTreeClassifierModel from ML. The attribute exists on the MLLib > DecisionTree model. > There's no way to check or print the model tree structure from the ML API. > The basic code for it is this: > from pyspark.ml import Pipeline > from pyspark.ml.feature import VectorAssembler, StringIndexer > from pyspark.ml.classification import DecisionTreeClassifier > from pyspark.ml.evaluation import MulticlassClassificationEvaluator > cl = DecisionTreeClassifier(labelCol='target_idx', featuresCol='features') > pipe = Pipeline(stages=[target_index, assembler, cl]) > model = pipe.fit(df_train) > # Prediction and model evaluation > predictions = model.transform(df_test) > mc_evaluator = MulticlassClassificationEvaluator( > labelCol="target_idx", predictionCol="prediction", metricName="precision") > accuracy = mc_evaluator.evaluate(predictions) > print("Test Error = {}".format(1.0 - accuracy)) > Now it would be great to be able to do what is being done on the MLLib model: > print model.toDebugString(), # it already has newline > DecisionTreeModel classifier of depth 1 with 3 nodes > If (feature 0 <= 0.0) >Predict: 0.0 > Else (feature 0 > 0.0) >Predict: 1.0 > but there's no toDebugString attribute on either the pipeline model or the > DecisionTreeClassifier model: > cl.toDebugString() > Attribute Error > https://spark.apache.org/docs/1.6.0/api/python/_modules/pyspark/mllib/tree.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15139) PySpark TreeEnsemble missing methods
[ https://issues.apache.org/jira/browse/SPARK-15139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15139: Assignee: Apache Spark > PySpark TreeEnsemble missing methods > > > Key: SPARK-15139 > URL: https://issues.apache.org/jira/browse/SPARK-15139 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: holdenk >Assignee: Apache Spark >Priority: Minor > > TreeEnsemble class is missing some accessor methods compared to Scala API -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15092) toDebugString missing from ML DecisionTreeClassifier
[ https://issues.apache.org/jira/browse/SPARK-15092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15092: Assignee: (was: Apache Spark) > toDebugString missing from ML DecisionTreeClassifier > > > Key: SPARK-15092 > URL: https://issues.apache.org/jira/browse/SPARK-15092 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 1.6.0 > Environment: HDP 2.3.4, Red Hat 6.7 >Reporter: Ivan SPM >Priority: Minor > Labels: features > > The attribute toDebugString is missing from the DecisionTreeClassifier and > DecisionTreeClassifierModel from ML. The attribute exists on the MLLib > DecisionTree model. > There's no way to check or print the model tree structure from the ML API. > The basic code for it is this: > from pyspark.ml import Pipeline > from pyspark.ml.feature import VectorAssembler, StringIndexer > from pyspark.ml.classification import DecisionTreeClassifier > from pyspark.ml.evaluation import MulticlassClassificationEvaluator > cl = DecisionTreeClassifier(labelCol='target_idx', featuresCol='features') > pipe = Pipeline(stages=[target_index, assembler, cl]) > model = pipe.fit(df_train) > # Prediction and model evaluation > predictions = model.transform(df_test) > mc_evaluator = MulticlassClassificationEvaluator( > labelCol="target_idx", predictionCol="prediction", metricName="precision") > accuracy = mc_evaluator.evaluate(predictions) > print("Test Error = {}".format(1.0 - accuracy)) > Now it would be great to be able to do what is being done on the MLLib model: > print model.toDebugString(), # it already has newline > DecisionTreeModel classifier of depth 1 with 3 nodes > If (feature 0 <= 0.0) >Predict: 0.0 > Else (feature 0 > 0.0) >Predict: 1.0 > but there's no toDebugString attribute on either the pipeline model or the > DecisionTreeClassifier model: > cl.toDebugString() > Attribute Error > https://spark.apache.org/docs/1.6.0/api/python/_modules/pyspark/mllib/tree.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15139) PySpark TreeEnsemble missing methods
[ https://issues.apache.org/jira/browse/SPARK-15139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15271851#comment-15271851 ] holdenk commented on SPARK-15139: - This is related to SPARK-15092 > PySpark TreeEnsemble missing methods > > > Key: SPARK-15139 > URL: https://issues.apache.org/jira/browse/SPARK-15139 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: holdenk >Priority: Minor > > TreeEnsemble class is missing some accessor methods compared to Scala API -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15139) PySpark TreeEnsemble missing methods
[ https://issues.apache.org/jira/browse/SPARK-15139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] holdenk updated SPARK-15139: Description: TreeEnsemble class is missing some accessor methods compared to Scala API > PySpark TreeEnsemble missing methods > > > Key: SPARK-15139 > URL: https://issues.apache.org/jira/browse/SPARK-15139 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: holdenk >Priority: Minor > > TreeEnsemble class is missing some accessor methods compared to Scala API -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15139) PySpark TreeEnsemble missing methods
holdenk created SPARK-15139: --- Summary: PySpark TreeEnsemble missing methods Key: SPARK-15139 URL: https://issues.apache.org/jira/browse/SPARK-15139 Project: Spark Issue Type: Improvement Components: ML, PySpark Reporter: holdenk Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15137) Linkify ML PyDoc classification
[ https://issues.apache.org/jira/browse/SPARK-15137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15137: Assignee: Apache Spark > Linkify ML PyDoc classification > --- > > Key: SPARK-15137 > URL: https://issues.apache.org/jira/browse/SPARK-15137 > Project: Spark > Issue Type: Sub-task >Reporter: holdenk >Assignee: Apache Spark >Priority: Minor > > PyDoc links in ml are in non-standard format. Switch to standard sphinx link > format for better formatted documentation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15137) Linkify ML PyDoc classification
[ https://issues.apache.org/jira/browse/SPARK-15137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15271839#comment-15271839 ] Apache Spark commented on SPARK-15137: -- User 'holdenk' has created a pull request for this issue: https://github.com/apache/spark/pull/12918 > Linkify ML PyDoc classification > --- > > Key: SPARK-15137 > URL: https://issues.apache.org/jira/browse/SPARK-15137 > Project: Spark > Issue Type: Sub-task >Reporter: holdenk >Priority: Minor > > PyDoc links in ml are in non-standard format. Switch to standard sphinx link > format for better formatted documentation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15137) Linkify ML PyDoc classification
[ https://issues.apache.org/jira/browse/SPARK-15137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15137: Assignee: (was: Apache Spark) > Linkify ML PyDoc classification > --- > > Key: SPARK-15137 > URL: https://issues.apache.org/jira/browse/SPARK-15137 > Project: Spark > Issue Type: Sub-task >Reporter: holdenk >Priority: Minor > > PyDoc links in ml are in non-standard format. Switch to standard sphinx link > format for better formatted documentation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15137) Linkify ML PyDoc classification
[ https://issues.apache.org/jira/browse/SPARK-15137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] holdenk updated SPARK-15137: Description: PyDoc links in ml are in non-standard format. Switch to standard sphinx link format for better formatted documentation. > Linkify ML PyDoc classification > --- > > Key: SPARK-15137 > URL: https://issues.apache.org/jira/browse/SPARK-15137 > Project: Spark > Issue Type: Sub-task >Reporter: holdenk >Priority: Minor > > PyDoc links in ml are in non-standard format. Switch to standard sphinx link > format for better formatted documentation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15101) Audit: ml.clustering and ml.recommendation
[ https://issues.apache.org/jira/browse/SPARK-15101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15271824#comment-15271824 ] zhengruifeng commented on SPARK-15101: -- [~josephkb] The user doc and Scala example for ml.BisectingKMeans are currently missing. I have made a corresponding PR. > Audit: ml.clustering and ml.recommendation > -- > > Key: SPARK-15101 > URL: https://issues.apache.org/jira/browse/SPARK-15101 > Project: Spark > Issue Type: Documentation > Components: Documentation, ML >Reporter: Joseph K. Bradley > > Audit this sub-package for new algorithms which do not have corresponding > sections & examples in the user guide. > See parent issue for more details. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
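For context, a minimal sketch of the kind of Scala example the ticket asks for, assuming {{dataset}} is a DataFrame with a "features" vector column (the actual example lives in the PR):
{code}
import org.apache.spark.ml.clustering.BisectingKMeans

// Fit a bisecting k-means model with two clusters.
val bkm = new BisectingKMeans().setK(2).setSeed(1L)
val model = bkm.fit(dataset)

println(s"Within set sum of squared errors = ${model.computeCost(dataset)}")
model.clusterCenters.foreach(println)
{code}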
[jira] [Created] (SPARK-15138) Linkify ML PyDoc regression
holdenk created SPARK-15138: --- Summary: Linkify ML PyDoc regression Key: SPARK-15138 URL: https://issues.apache.org/jira/browse/SPARK-15138 Project: Spark Issue Type: Sub-task Reporter: holdenk Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15137) Linkify ML PyDoc classification
holdenk created SPARK-15137: --- Summary: Linkify ML PyDoc classification Key: SPARK-15137 URL: https://issues.apache.org/jira/browse/SPARK-15137 Project: Spark Issue Type: Sub-task Reporter: holdenk Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15136) Linkify ML PyDoc
holdenk created SPARK-15136: --- Summary: Linkify ML PyDoc Key: SPARK-15136 URL: https://issues.apache.org/jira/browse/SPARK-15136 Project: Spark Issue Type: Improvement Reporter: holdenk Priority: Minor PyDoc links in ml are in non-standard format. Switch to standard sphinx link format for better formatted documentation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14837) Add support in file stream source for reading new files added to subdirs
[ https://issues.apache.org/jira/browse/SPARK-14837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-14837: -- Target Version/s: 2.0.0 > Add support in file stream source for reading new files added to subdirs > > > Key: SPARK-14837 > URL: https://issues.apache.org/jira/browse/SPARK-14837 > Project: Spark > Issue Type: Sub-task > Components: SQL, Streaming >Reporter: Tathagata Das >Assignee: Tathagata Das > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15131) StateStore management thread does not stop after SparkContext is shutdown
[ https://issues.apache.org/jira/browse/SPARK-15131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-15131: -- Fix Version/s: (was: 2.0.0) > StateStore management thread does not stop after SparkContext is shutdown > - > > Key: SPARK-15131 > URL: https://issues.apache.org/jira/browse/SPARK-15131 > Project: Spark > Issue Type: Bug >Reporter: Tathagata Das >Assignee: Tathagata Das > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15131) StateStore management thread does not stop after SparkContext is shutdown
[ https://issues.apache.org/jira/browse/SPARK-15131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-15131: -- Target Version/s: 2.0.0 > StateStore management thread does not stop after SparkContext is shutdown > - > > Key: SPARK-15131 > URL: https://issues.apache.org/jira/browse/SPARK-15131 > Project: Spark > Issue Type: Bug >Reporter: Tathagata Das >Assignee: Tathagata Das > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14340) Add Scala Example and User DOC for ml.BisectingKMeans
[ https://issues.apache.org/jira/browse/SPARK-14340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-14340: - Summary: Add Scala Example and User DOC for ml.BisectingKMeans (was: Add Scala Example and Description for ml.BisectingKMeans) > Add Scala Example and User DOC for ml.BisectingKMeans > - > > Key: SPARK-14340 > URL: https://issues.apache.org/jira/browse/SPARK-14340 > Project: Spark > Issue Type: Improvement >Reporter: zhengruifeng >Priority: Minor > > 1, add BisectingKMeans to ml-clustering.md > 2, add the missing Scala BisectingKMeansExample -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14993) Inconsistent behavior of partitioning discovery
[ https://issues.apache.org/jira/browse/SPARK-14993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-14993: - Assignee: Xiao Li > Inconsistent behavior of partitioning discovery > --- > > Key: SPARK-14993 > URL: https://issues.apache.org/jira/browse/SPARK-14993 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Assignee: Xiao Li >Priority: Critical > Fix For: 2.0.0 > > > When we load a dataset, if we set the path to {{/path/a=1}}, we will not take > a as the partitioning column. However, if we set the path to > {{/path/a=1/file.parquet}}, we take a as the partitioning column and it shows > up in the schema. We should make the behaviors of these two cases consistent > by not putting a into the schema for the second case. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-14993) Inconsistent behavior of partitioning discovery
[ https://issues.apache.org/jira/browse/SPARK-14993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-14993. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 12828 [https://github.com/apache/spark/pull/12828] > Inconsistent behavior of partitioning discovery > --- > > Key: SPARK-14993 > URL: https://issues.apache.org/jira/browse/SPARK-14993 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Priority: Critical > Fix For: 2.0.0 > > > When we load a dataset, if we set the path to {{/path/a=1}}, we will not take > a as the partitioning column. However, if we set the path to > {{/path/a=1/file.parquet}}, we take a as the partitioning column and it shows > up in the schema. We should make the behaviors of these two cases consistent > by not putting a into the schema for the second case. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
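Restating the inconsistency in code (paths are hypothetical):
{code}
// Directory path: `a` is not treated as a partitioning column.
val df1 = sqlContext.read.parquet("/path/a=1")

// File path: before this fix, `a` showed up in the schema as a partitioning
// column; after the fix, both cases agree and `a` is left out of the schema.
val df2 = sqlContext.read.parquet("/path/a=1/file.parquet")
{code}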
[jira] [Updated] (SPARK-6339) Support creating temporary views with DDL
[ https://issues.apache.org/jira/browse/SPARK-6339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-6339: Assignee: Sean Zhong > Support creating temporary views with DDL > - > > Key: SPARK-6339 > URL: https://issues.apache.org/jira/browse/SPARK-6339 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 1.3.0 >Reporter: Hossein Falaki >Assignee: Sean Zhong > Fix For: 2.0.0 > > > It would be useful to support the following: > {code} > create temporary view counted as > select count(transactions), company from sales group by company > {code} > Right now this is possible through registerTempTable() -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6339) Support creating temporary tables with DDL
[ https://issues.apache.org/jira/browse/SPARK-6339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-6339. - Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 12872 [https://github.com/apache/spark/pull/12872] > Support creating temporary tables with DDL > -- > > Key: SPARK-6339 > URL: https://issues.apache.org/jira/browse/SPARK-6339 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 1.3.0 >Reporter: Hossein Falaki > Fix For: 2.0.0 > > > It would be useful to support the following: > {code} > create temporary table counted as > select count(transactions), company from sales group by company > {code} > Right now this is possible through registerTempTable() -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6339) Support creating temporary views with DDL
[ https://issues.apache.org/jira/browse/SPARK-6339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-6339: Summary: Support creating temporary views with DDL (was: Support creating temporary tables with DDL) > Support creating temporary views with DDL > - > > Key: SPARK-6339 > URL: https://issues.apache.org/jira/browse/SPARK-6339 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 1.3.0 >Reporter: Hossein Falaki > Fix For: 2.0.0 > > > It would be useful to support the following: > {code} > create temporary table counted as > select count(transactions), company from sales group by company > {code} > Right now this is possible through registerTempTable() -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6339) Support creating temporary views with DDL
[ https://issues.apache.org/jira/browse/SPARK-6339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-6339: Description: It would be useful to support the following: {code} create temporary view counted as select count(transactions), company from sales group by company {code} Right now this is possible through registerTempTable() was: It would useful to support following: {code} create temporary table counted as select count(transactions), company from sales group by company {code} Right now this is possible through registerTempTable() > Support creating temporary views with DDL > - > > Key: SPARK-6339 > URL: https://issues.apache.org/jira/browse/SPARK-6339 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 1.3.0 >Reporter: Hossein Falaki > Fix For: 2.0.0 > > > It would be useful to support the following: > {code} > create temporary view counted as > select count(transactions), company from sales group by company > {code} > Right now this is possible through registerTempTable() -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
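For reference, the two routes the ticket contrasts, as a sketch against a Spark shell ({{sales}} is a hypothetical registered table; a SparkSession's {{sql}} works the same way):
{code}
// Old programmatic route via registerTempTable():
val counted = sqlContext.sql(
  "select count(transactions), company from sales group by company")
counted.registerTempTable("counted")

// New DDL route enabled by this change:
sqlContext.sql(
  "create temporary view counted as " +
  "select count(transactions), company from sales group by company")
{code}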
[jira] [Resolved] (SPARK-14896) Deprecate HiveContext in Python
[ https://issues.apache.org/jira/browse/SPARK-14896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-14896. --- Resolution: Fixed Fix Version/s: 2.0.0 > Deprecate HiveContext in Python > --- > > Key: SPARK-14896 > URL: https://issues.apache.org/jira/browse/SPARK-14896 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or >Assignee: Andrew Or >Priority: Critical > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-15112) Dataset filter returns garbage
[ https://issues.apache.org/jira/browse/SPARK-15112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15271025#comment-15271025 ] Cheng Lian edited comment on SPARK-15112 at 5/5/16 12:28 AM: - The following Spark shell session illustrates this issue: {noformat} scala> case class T(a: String, b: Int) defined class T scala> val ds = Seq(1 -> "foo", 2 -> "bar").toDF("b", "a").as[T] ds: org.apache.spark.sql.Dataset[T] = [b: int, a: string] scala> ds.show() +---+---+ | b| a| +---+---+ |foo| 1| |bar| 2| +---+---+ scala> ds.filter(_.b > 1).show() +---+---+ | a| b| +---+---+ | | 3| +---+---+ {noformat} Dataset encoders don't actually require the order of input columns to be exactly the same as their own schema. Essentially, the encoder performs a projection to adjust column order during the analysis phase. This can be quite helpful for data sources that support schema evolution, where the column order of the merged schema may be non-deterministic. The JSON data source falls into this category, and it always sorts all input columns by name. This leads to the following facts, for a Dataset {{ds}}: # {{ds.resolvedTEncoder.schema}} may differ from {{ds.logicalPlan.schema}}, and # {{ds.schema}} should conform to {{ds.resolvedTEncoder.schema}}, and # {{ds.toDF()}} uses a {{RowEncoder}} to convert user space Scala objects to {{InternalRow}} instances, and this {{RowEncoder}} should be initialized using {{ds.logicalPlan.schema}}. Spark 1.6 conforms to the above requirements. For example: {noformat} scala> case class T(a: String, b: Int) defined class T scala> val ds = Seq(1 -> "foo", 2 -> "bar").toDF("b", "a").as[T] ds: org.apache.spark.sql.Dataset[T] = [b: int, a: string] scala> ds.show() +---+---+ | b| a| +---+---+ |foo| 1| |bar| 2| +---+---+ scala> ds.toDF().show() +---+---+ | a| b| +---+---+ | 1|foo| | 2|bar| +---+---+ {noformat} However, while merging the DataFrame/Dataset APIs in Spark 2.0, requirement 2 was broken by accident, and we are using {{ds.logicalPlan.schema}} as {{ds.schema}}, which leads to this bug. Working on a fix for it. was (Author: lian cheng): The following Spark shell session illustrates this issue: {noformat} scala> case class T(a: String, b: Int) defined class T scala> val ds = Seq(1 -> "foo", 2 -> "bar").toDF("b", "a").as[T] ds: org.apache.spark.sql.Dataset[T] = [b: int, a: string] scala> ds.show() +---+---+ | b| a| +---+---+ |foo| 1| |bar| 2| +---+---+ scala> ds.filter(_.b > 1).show() +---+---+ | a| b| +---+---+ | | 3| +---+---+ {noformat} Dataset encoders actually doesn't require the order of input columns to be exactly the same as its own schema. Essentially it performs a projection to adjust column order during analysis phase. This is can be quite helpful for data sources that support schema evolution, where the column order of merged schema may be non-deterministic. The JSON data source falls into this category, and it always sorts all input columns by name. This leads to the following facts, for a Dataset {{ds}}: # {{ds.resolvedTEncoder.schema}} may differ from {{ds.logicalPlan.schema}}, and # {{ds.schema}} should conform to {{ds.resolvedTEncoder.schema}}, and # {{ds.toDF()}} uses a {{RowEncoder}} to convert user space Scala objects to {{InternalRow}}s, and this {{RowEncoder}} should be initialized using {{ds.logicalPlan.schema}}. Spark 1.6 conforms to the above requirements. 
For example: {noformat} scala> case class T(a: String, b: Int) defined class T scala> val ds = Seq(1 -> "foo", 2 -> "bar").toDF("b", "a").as[T] ds: org.apache.spark.sql.Dataset[T] = [b: int, a: string] scala> ds.show() +---+---+ | b| a| +---+---+ |foo| 1| |bar| 2| +---+---+ scala> ds.toDF().show() +---+---+ | a| b| +---+---+ | 1|foo| | 2|bar| +---+---+ {noformat} However, while merging DF/DF API in Spark 2.0, requirement 2 was broken by accident, and we are using {{ds.logicalPlan.schema}} as {{ds.schema}}, which leads to this bug. Working on a fix for it. > Dataset filter returns garbage > -- > > Key: SPARK-15112 > URL: https://issues.apache.org/jira/browse/SPARK-15112 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Reynold Xin >Priority: Blocker > Attachments: demo 1 dataset - Databricks.htm > > > See the following notebook: > https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/6122906529858466/2727501386611535/5382278320999420/latest.html > I think it happens only when using JSON. I'm also going to attach it to the > ticket. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14896) Deprecate HiveContext in Python
[ https://issues.apache.org/jira/browse/SPARK-14896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15271685#comment-15271685 ] Apache Spark commented on SPARK-14896: -- User 'andrewor14' has created a pull request for this issue: https://github.com/apache/spark/pull/12917 > Deprecate HiveContext in Python > --- > > Key: SPARK-14896 > URL: https://issues.apache.org/jira/browse/SPARK-14896 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or >Assignee: Andrew Or >Priority: Critical > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14896) Deprecate HiveContext in Python
[ https://issues.apache.org/jira/browse/SPARK-14896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14896: Assignee: Andrew Or (was: Apache Spark) > Deprecate HiveContext in Python > --- > > Key: SPARK-14896 > URL: https://issues.apache.org/jira/browse/SPARK-14896 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or >Assignee: Andrew Or >Priority: Critical > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14896) Deprecate HiveContext in Python
[ https://issues.apache.org/jira/browse/SPARK-14896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14896: Assignee: Apache Spark (was: Andrew Or) > Deprecate HiveContext in Python > --- > > Key: SPARK-14896 > URL: https://issues.apache.org/jira/browse/SPARK-14896 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or >Assignee: Apache Spark >Priority: Critical > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14897) Upgrade Jetty to latest version of 8/9
[ https://issues.apache.org/jira/browse/SPARK-14897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15271669#comment-15271669 ] Apache Spark commented on SPARK-14897: -- User 'bomeng' has created a pull request for this issue: https://github.com/apache/spark/pull/12916 > Upgrade Jetty to latest version of 8/9 > -- > > Key: SPARK-14897 > URL: https://issues.apache.org/jira/browse/SPARK-14897 > Project: Spark > Issue Type: Improvement >Reporter: Adam Kramer > Labels: web-ui > > It looks like the head/master branch of Spark uses quite an old version of > Jetty: 8.1.14.v20131031 > There have been some announcements of security vulnerabilities, notably in > 2015, and there are versions of both 8 and 9 that address those. We recently > left a web-ui port open and had the server compromised within days. While > this upgrade shouldn't be the only security improvement made, the current > version is clearly vulnerable as-is. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15037) Use SparkSession instead of SQLContext in testsuites
[ https://issues.apache.org/jira/browse/SPARK-15037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-15037: -- Assignee: Sandeep Singh > Use SparkSession instead of SQLContext in testsuites > > > Key: SPARK-15037 > URL: https://issues.apache.org/jira/browse/SPARK-15037 > Project: Spark > Issue Type: Sub-task >Reporter: Dongjoon Hyun >Assignee: Sandeep Singh > > This issue aims to update the existing testsuites to use `SparkSession` > instead of `SQLContext` since `SQLContext` exists just for backward > compatibility. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15045) Remove dead code in TaskMemoryManager.cleanUpAllAllocatedMemory for pageTable
[ https://issues.apache.org/jira/browse/SPARK-15045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-15045: -- Priority: Major (was: Trivial) > Remove dead code in TaskMemoryManager.cleanUpAllAllocatedMemory for pageTable > - > > Key: SPARK-15045 > URL: https://issues.apache.org/jira/browse/SPARK-15045 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Jacek Laskowski > > Unless my eyes trick me, {{TaskMemoryManager}} first clears {{pageTable}} > in a synchronized block and then does it again right after the block. I think > the outside cleanup is dead code. > See > https://github.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/memory/TaskMemoryManager.java#L382-L397 > with the relevant snippet pasted below: > {code} > public long cleanUpAllAllocatedMemory() { > synchronized (this) { > Arrays.fill(pageTable, null); > ... > } > for (MemoryBlock page : pageTable) { > if (page != null) { > memoryManager.tungstenMemoryAllocator().free(page); > } > } > Arrays.fill(pageTable, null); >... > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15135) Make sure SparkSession thread safe
[ https://issues.apache.org/jira/browse/SPARK-15135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15271621#comment-15271621 ] Shixiong Zhu commented on SPARK-15135: -- https://github.com/apache/spark/pull/12915 > Make sure SparkSession thread safe > -- > > Key: SPARK-15135 > URL: https://issues.apache.org/jira/browse/SPARK-15135 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > > Fixed non-thread-safe classes used by SparkSession. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15135) Make sure SparkSession thread safe
[ https://issues.apache.org/jira/browse/SPARK-15135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15271622#comment-15271622 ] Apache Spark commented on SPARK-15135: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/12915 > Make sure SparkSession thread safe > -- > > Key: SPARK-15135 > URL: https://issues.apache.org/jira/browse/SPARK-15135 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > > Fixed non-thread-safe classes used by SparkSession. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15135) Make sure SparkSession thread safe
Shixiong Zhu created SPARK-15135: Summary: Make sure SparkSession thread safe Key: SPARK-15135 URL: https://issues.apache.org/jira/browse/SPARK-15135 Project: Spark Issue Type: Bug Components: SQL Reporter: Shixiong Zhu Assignee: Shixiong Zhu Fixed non-thread-safe classes used by SparkSession. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10713) SPARK_DIST_CLASSPATH ignored on Mesos executors
[ https://issues.apache.org/jira/browse/SPARK-10713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15271613#comment-15271613 ] Dara Adib commented on SPARK-10713: --- [~devaraj.k] Thanks for trying to reproduce. I'm not using the Hadoop-free builds anymore, so I tried testing with a random jar (in this case spark-streaming-kafka-assembly) on Spark 1.6.1. I'm using PySpark but here is a Scala example that seems to work too: {code} // Get classpath, taken from https://gist.github.com/jessitron/8376139. def urlses(cl: ClassLoader): Array[java.net.URL] = cl match { case null => Array() case u: java.net.URLClassLoader => u.getURLs() ++ urlses(cl.getParent) case _ => urlses(cl.getParent) } // driver println(sys.env.get("SPARK_DIST_CLASSPATH")) println(urlses(getClass.getClassLoader).mkString(":")) // executor println(sc.parallelize(Vector(0)).map(_ => sys.env.get("SPARK_DIST_CLASSPATH")).collect()(0)) println(sc.parallelize(Vector(0)).map(_ => urlses(getClass.getClassLoader).mkString(":")).collect()(0)) {code} On both Mesos and YARN, SPARK_DIST_CLASSPATH is defined on the driver and jar is included in classpath. However, on Mesos, SPARK_DIST_CLASSPATH is missing from executors and jar is not in the classpath. It is present on YARN. Am I missing something? Do you see different behavior? > SPARK_DIST_CLASSPATH ignored on Mesos executors > --- > > Key: SPARK-10713 > URL: https://issues.apache.org/jira/browse/SPARK-10713 > Project: Spark > Issue Type: Bug > Components: Deploy, Mesos >Affects Versions: 1.5.0 >Reporter: Dara Adib >Priority: Minor > > If I set the environment variable SPARK_DIST_CLASSPATH, the jars are included > on the driver, but not on Mesos executors. Docs: > https://spark.apache.org/docs/latest/hadoop-provided.html > I see SPARK_DIST_CLASSPATH mentioned in these files: > launcher/src/main/java/org/apache/spark/launcher/AbstractCommandBuilder.java > project/SparkBuild.scala > yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala > But not the Mesos executor (or should it be included by the launcher > library?): > spark/core/src/main/scala/org/apache/spark/executor/Executor.scala -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-10713) SPARK_DIST_CLASSPATH ignored on Mesos executors
[ https://issues.apache.org/jira/browse/SPARK-10713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15271613#comment-15271613 ] Dara Adib edited comment on SPARK-10713 at 5/4/16 11:18 PM: [~devaraj.k] Thanks for trying to reproduce. I'm not using the Hadoop-free builds anymore, so I tried testing with a random jar (in this case spark-streaming-kafka-assembly) on Spark 1.6.1. I'm using PySpark but here is a Scala example that seems to work too: {code} // Get classpath, taken from https://gist.github.com/jessitron/8376139. def urlses(cl: ClassLoader): Array[java.net.URL] = cl match { case null => Array() case u: java.net.URLClassLoader => u.getURLs() ++ urlses(cl.getParent) case _ => urlses(cl.getParent) } // driver println(sys.env.get("SPARK_DIST_CLASSPATH")) println(urlses(getClass.getClassLoader).mkString(":")) // executor println(sc.parallelize(Vector(0)).map(_ => sys.env.get("SPARK_DIST_CLASSPATH")).collect()(0)) println(sc.parallelize(Vector(0)).map(_ => urlses(getClass.getClassLoader).mkString(":")).collect()(0)) {code} On both Mesos and YARN, SPARK_DIST_CLASSPATH is defined on the driver and jar is included in classpath. However, on Mesos, SPARK_DIST_CLASSPATH is missing from executors and jar is not in the classpath. It is present on YARN. Am I missing something? Do you see different behavior? was (Author: daradib): [~devaraj.k] Thanks for trying to reproduce. I'm not using the Hadoop-free builds anymore, so I tried testing with a random jar (in this case spark-streaming-kafka-assembly) on Spark 1.6.1. I'm using PySpark but here is a Scala example that seems to work too: {code} # Get classpath, taken from https://gist.github.com/jessitron/8376139. def urlses(cl: ClassLoader): Array[java.net.URL] = cl match { case null => Array() case u: java.net.URLClassLoader => u.getURLs() ++ urlses(cl.getParent) case _ => urlses(cl.getParent) } # driver println(sys.env.get("SPARK_DIST_CLASSPATH")) println(urlses(getClass.getClassLoader).mkString(":")) # executor println(sc.parallelize(Vector(0)).map(_ => sys.env.get("SPARK_DIST_CLASSPATH")).collect()(0)) println(sc.parallelize(Vector(0)).map(_ => urlses(getClass.getClassLoader).mkString(":")).collect()(0)) {code} On both Mesos and YARN, SPARK_DIST_CLASSPATH is defined on the driver and jar is included in classpath. However, on Mesos, SPARK_DIST_CLASSPATH is missing from executors and jar is not in the classpath. It is present on YARN. Am I missing something? Do you see different behavior? > SPARK_DIST_CLASSPATH ignored on Mesos executors > --- > > Key: SPARK-10713 > URL: https://issues.apache.org/jira/browse/SPARK-10713 > Project: Spark > Issue Type: Bug > Components: Deploy, Mesos >Affects Versions: 1.5.0 >Reporter: Dara Adib >Priority: Minor > > If I set the environment variable SPARK_DIST_CLASSPATH, the jars are included > on the driver, but not on Mesos executors. Docs: > https://spark.apache.org/docs/latest/hadoop-provided.html > I see SPARK_DIST_CLASSPATH mentioned in these files: > launcher/src/main/java/org/apache/spark/launcher/AbstractCommandBuilder.java > project/SparkBuild.scala > yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala > But not the Mesos executor (or should it be included by the launcher > library?): > spark/core/src/main/scala/org/apache/spark/executor/Executor.scala -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15130) PySpark shared params should include default values to match Scala
[ https://issues.apache.org/jira/browse/SPARK-15130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] holdenk updated SPARK-15130: Summary: PySpark shared params should include default values to match Scala (was: PySpark decision tree params should include default values to match Scala) > PySpark shared params should include default values to match Scala > -- > > Key: SPARK-15130 > URL: https://issues.apache.org/jira/browse/SPARK-15130 > Project: Spark > Issue Type: Improvement > Components: Documentation, ML, PySpark >Reporter: holdenk >Priority: Minor > > As part of checking the documentation in SPARK-14813, PySpark decision tree > params do not include the default values (unlike the Scala ones). While the > existing Scala default values will have been used, this information is likely > worth exposing in the docs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15130) PySpark decision tree params should include default values to match Scala
[ https://issues.apache.org/jira/browse/SPARK-15130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15130: Assignee: (was: Apache Spark) > PySpark decision tree params should include default values to match Scala > - > > Key: SPARK-15130 > URL: https://issues.apache.org/jira/browse/SPARK-15130 > Project: Spark > Issue Type: Improvement > Components: Documentation, ML, PySpark >Reporter: holdenk >Priority: Minor > > As part of checking the documentation in SPARK-14813, PySpark decision tree > params do not include the default values (unlike the Scala ones). While the > existing Scala default values will have been used, this information is likely > worth exposing in the docs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15130) PySpark decision tree params should include default values to match Scala
[ https://issues.apache.org/jira/browse/SPARK-15130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15130: Assignee: Apache Spark > PySpark decision tree params should include default values to match Scala > - > > Key: SPARK-15130 > URL: https://issues.apache.org/jira/browse/SPARK-15130 > Project: Spark > Issue Type: Improvement > Components: Documentation, ML, PySpark >Reporter: holdenk >Assignee: Apache Spark >Priority: Minor > > As part of checking the documentation in SPARK-14813, PySpark decision tree > params do not include the default values (unlike the Scala ones). While the > existing Scala default values will have been used, this information is likely > worth exposing in the docs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15130) PySpark decision tree params should include default values to match Scala
[ https://issues.apache.org/jira/browse/SPARK-15130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15271611#comment-15271611 ] Apache Spark commented on SPARK-15130: -- User 'holdenk' has created a pull request for this issue: https://github.com/apache/spark/pull/12914 > PySpark decision tree params should include default values to match Scala > - > > Key: SPARK-15130 > URL: https://issues.apache.org/jira/browse/SPARK-15130 > Project: Spark > Issue Type: Improvement > Components: Documentation, ML, PySpark >Reporter: holdenk >Priority: Minor > > As part of checking the documentation in SPARK-14813, PySpark decision tree > params do not include the default values (unlike the Scala ones). While the > existing Scala default values will have been used, this information is likely > worth exposing in the docs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15130) PySpark decision tree params should include default values to match Scala
[ https://issues.apache.org/jira/browse/SPARK-15130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15271585#comment-15271585 ] Holden Karau commented on SPARK-15130: -- I mean that the pydocs should include what the default value is. I'm working on a PR for this; I'll cc you when it's up. -- Cell : 425-233-8271 Twitter: https://twitter.com/holdenkarau > PySpark decision tree params should include default values to match Scala > - > > Key: SPARK-15130 > URL: https://issues.apache.org/jira/browse/SPARK-15130 > Project: Spark > Issue Type: Improvement > Components: Documentation, ML, PySpark >Reporter: holdenk >Priority: Minor > > As part of checking the documentation in SPARK-14813, PySpark decision tree > params do not include the default values (unlike the Scala ones). While the > existing Scala default values will have been used, this information is likely > worth exposing in the docs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15112) Dataset filter returns garbage
[ https://issues.apache.org/jira/browse/SPARK-15112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15271576#comment-15271576 ] Suresh Thalamati commented on SPARK-15112: -- I ran into similar issue SPARK-14218 > Dataset filter returns garbage > -- > > Key: SPARK-15112 > URL: https://issues.apache.org/jira/browse/SPARK-15112 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Reynold Xin >Priority: Blocker > Attachments: demo 1 dataset - Databricks.htm > > > See the following notebook: > https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/6122906529858466/2727501386611535/5382278320999420/latest.html > I think it happens only when using JSON. I'm also going to attach it to the > ticket. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-928) Add support for Unsafe-based serializer in Kryo 2.22
[ https://issues.apache.org/jira/browse/SPARK-928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15265805#comment-15265805 ] Sandeep Singh edited comment on SPARK-928 at 5/4/16 10:43 PM: -- [~joshrosen] I would like to work on this. I tried benchmarking the difference between unsafe Kryo and our current implementation; we can then add a spark.kryo.useUnsafe flag, as Matei mentioned. {code:title=Benchmarking results|borderStyle=solid} Java HotSpot(TM) 64-Bit Server VM 1.8.0_60-b27 on Mac OS X 10.11.4 Intel(R) Core(TM) i7-4770HQ CPU @ 2.20GHz Benchmark Kryo Unsafe vs safe Serialization: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative basicTypes: Int unsafe:false 2 /4 8988.0 0.1 1.0X basicTypes: Long unsafe:false1 /1 13981.3 0.1 1.6X basicTypes: Float unsafe:false 1 /1 14460.6 0.1 1.6X basicTypes: Double unsafe:false 1 /1 15876.9 0.1 1.8X Array: Int unsafe:false 33 / 44474.8 2.1 0.1X Array: Long unsafe:false18 / 25888.6 1.1 0.1X Array: Float unsafe:false 10 / 16 1627.4 0.6 0.2X Array: Double unsafe:false 10 / 13 1523.1 0.7 0.2X Map of string->Double unsafe:false 413 / 447 38.1 26.3 0.0X basicTypes: Int unsafe:true 1 /1 16402.6 0.1 1.8X basicTypes: Long unsafe:true 1 /1 19732.1 0.1 2.2X basicTypes: Float unsafe:true1 /1 19752.9 0.1 2.2X basicTypes: Double unsafe:true 1 /1 23111.4 0.0 2.6X Array: Int unsafe:true 7 /8 2239.9 0.4 0.2X Array: Long unsafe:true 8 /9 2000.1 0.5 0.2X Array: Float unsafe:true 7 /8 2191.5 0.5 0.2X Array: Double unsafe:true9 / 10 1841.2 0.5 0.2X Map of string->Double unsafe:true 387 / 407 40.7 24.6 0.0X {code} You can find the code for benchmarking here (https://github.com/techaddict/spark/commit/46fa44141c849ca15bbd6136cea2fa52bd927da2); it's very ugly right now but I will improve it (add more benchmarks) before creating a PR. was (Author: techaddict): [~joshrosen] I would like to work on this. I tried benchmarking the difference between unsafe kryo and our current impl. and then we can have a spark.kryo.useUnsafe flag as Matei has mentioned. {code:title=Without Kryo UnSafe|borderStyle=solid} Java HotSpot(TM) 64-Bit Server VM 1.8.0_60-b27 on Mac OS X 10.11.4 Intel(R) Core(TM) i7-4770HQ CPU @ 2.20GHz Serialize and then deserialize: Best/Avg Time(ms)Rate(M/s) Per Row(ns) Relative --- primitive:Long 1 /4 11223.1 0.1 1.0X primitive:Double1 /1 19409.0 0.1 1.7X Array:Long 38 / 49412.4 2.4 0.0X Array:Double 25 / 35631.4 1.6 0.1X Map of string->Double2651 / 2766 5.9 168.6 0.0X {code} {code:title=With Kryo UnSafe|borderStyle=solid} Java HotSpot(TM) 64-Bit Server VM 1.8.0_60-b27 on Mac OS X 10.11.4 Intel(R) Core(TM) i7-4770HQ CPU @ 2.20GHz Serialize and then deserialize: Best/Avg Time(ms)Rate(M/s) Per Row(ns) Relative --- primitive:Long 1 /3 15872.0 0.1 1.0X primitive:Double1 /1 17769.7 0.1 1.1X Array:Long 24 / 42642.3 1.6 0.0X Array:Double 22 / 26719.4 1.4 0.0X Map of string->Double2560 / 2582 6.1 162.8 0.0X {code} You can find the code for benchmarking here (https://github.com/techaddict/spark/commit/46fa44141c849ca15bbd6136cea2fa52bd927da2), very ugly right now but will improve it(add more benchmarks) before creating a PR.
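At the Kryo level, the unsafe variant under benchmark swaps the stream classes; a minimal round-trip sketch, assuming Kryo 3.x where {{UnsafeOutput}}/{{UnsafeInput}} are available (this is an illustration, not Spark's actual integration):
{code}
import com.esotericsoftware.kryo.Kryo
import com.esotericsoftware.kryo.io.{UnsafeInput, UnsafeOutput}

val kryo = new Kryo()

// Unsafe streams copy primitives in bulk via sun.misc.Unsafe, which is
// where the speedup on primitive-heavy data comes from.
val out = new UnsafeOutput(4096)
kryo.writeObject(out, Array.fill(1024)(42L))
out.flush()

val in = new UnsafeInput(out.toBytes)
val restored = kryo.readObject(in, classOf[Array[Long]])
println(restored.length) // 1024
{code}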
[jira] [Assigned] (SPARK-928) Add support for Unsafe-based serializer in Kryo 2.22
[ https://issues.apache.org/jira/browse/SPARK-928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-928: -- Assignee: Apache Spark > Add support for Unsafe-based serializer in Kryo 2.22 > > > Key: SPARK-928 > URL: https://issues.apache.org/jira/browse/SPARK-928 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Matei Zaharia >Assignee: Apache Spark >Priority: Minor > Labels: starter > > This can reportedly be quite a bit faster, but it also requires Chill to > update its Kryo dependency. Once that happens we should add a > spark.kryo.useUnsafe flag. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-928) Add support for Unsafe-based serializer in Kryo 2.22
[ https://issues.apache.org/jira/browse/SPARK-928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15271570#comment-15271570 ] Apache Spark commented on SPARK-928: User 'techaddict' has created a pull request for this issue: https://github.com/apache/spark/pull/12913 > Add support for Unsafe-based serializer in Kryo 2.22 > > > Key: SPARK-928 > URL: https://issues.apache.org/jira/browse/SPARK-928 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Matei Zaharia >Priority: Minor > Labels: starter > > This can reportedly be quite a bit faster, but it also requires Chill to > update its Kryo dependency. Once that happens we should add a > spark.kryo.useUnsafe flag. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-928) Add support for Unsafe-based serializer in Kryo 2.22
[ https://issues.apache.org/jira/browse/SPARK-928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-928: -- Assignee: (was: Apache Spark) > Add support for Unsafe-based serializer in Kryo 2.22 > > > Key: SPARK-928 > URL: https://issues.apache.org/jira/browse/SPARK-928 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Matei Zaharia >Priority: Minor > Labels: starter > > This can reportedly be quite a bit faster, but it also requires Chill to > update its Kryo dependency. Once that happens we should add a > spark.kryo.useUnsafe flag. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15134) Indent SparkSession builder patterns and update binary_classification_metrics_example.py
[ https://issues.apache.org/jira/browse/SPARK-15134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15134: Assignee: Apache Spark > Indent SparkSession builder patterns and update > binary_classification_metrics_example.py > > > Key: SPARK-15134 > URL: https://issues.apache.org/jira/browse/SPARK-15134 > Project: Spark > Issue Type: Task > Components: Examples >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Minor > > This issue addresses the comments in SPARK-15031 and also fixes java-linter > errors. > - Use multiline format in SparkSession builder patterns. > - Update `binary_classification_metrics_example.py` to use `SparkSession`. > - Fix Java Linter errors (in SPARK-13745, SPARK-15031, and so far) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15134) Indent SparkSession builder patterns and update binary_classification_metrics_example.py
[ https://issues.apache.org/jira/browse/SPARK-15134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15271541#comment-15271541 ] Apache Spark commented on SPARK-15134: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/12911 > Indent SparkSession builder patterns and update > binary_classification_metrics_example.py > > > Key: SPARK-15134 > URL: https://issues.apache.org/jira/browse/SPARK-15134 > Project: Spark > Issue Type: Task > Components: Examples >Reporter: Dongjoon Hyun >Priority: Minor > > This issue addresses the comments in SPARK-15031 and also fixes java-linter > errors. > - Use multiline format in SparkSession builder patterns. > - Update `binary_classification_metrics_example.py` to use `SparkSession`. > - Fix Java Linter errors (in SPARK-13745, SPARK-15031, and so far) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15134) Indent SparkSession builder patterns and update binary_classification_metrics_example.py
[ https://issues.apache.org/jira/browse/SPARK-15134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15134: Assignee: (was: Apache Spark) > Indent SparkSession builder patterns and update > binary_classification_metrics_example.py > > > Key: SPARK-15134 > URL: https://issues.apache.org/jira/browse/SPARK-15134 > Project: Spark > Issue Type: Task > Components: Examples >Reporter: Dongjoon Hyun >Priority: Minor > > This issue addresses the comments in SPARK-15031 and also fixes java-linter > errors. > - Use multiline format in SparkSession builder patterns. > - Update `binary_classification_metrics_example.py` to use `SparkSession`. > - Fix Java Linter errors (in SPARK-13745, SPARK-15031, and so far) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15134) Indent SparkSession builder patterns and update binary_classification_metrics_example.py
[ https://issues.apache.org/jira/browse/SPARK-15134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-15134: -- Description: This issue addresses the comments in SPARK-15031 and also fixes java-linter errors. - Use multiline format in SparkSession builder patterns. - Update `binary_classification_metrics_example.py` to use `SparkSession`. - Fix Java Linter errors (in SPARK-13745, SPARK-15031, and so far) was: This issue addresses the comments in SPARK-15031 and also fixes java-linter errors. - Use multiline format in SparkSession builder patterns. - Update `binary_classification_metrics_example.py` to use `SparkSession`. - Fix Java Linter errors (in SPARK-13745 and so far) > Indent SparkSession builder patterns and update > binary_classification_metrics_example.py > > > Key: SPARK-15134 > URL: https://issues.apache.org/jira/browse/SPARK-15134 > Project: Spark > Issue Type: Task > Components: Examples >Reporter: Dongjoon Hyun >Priority: Minor > > This issue addresses the comments in SPARK-15031 and also fixes java-linter > errors. > - Use multiline format in SparkSession builder patterns. > - Update `binary_classification_metrics_example.py` to use `SparkSession`. > - Fix Java Linter errors (in SPARK-13745, SPARK-15031, and so far) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15134) Indent SparkSession builder patterns and update binary_classification_metrics_example.py
Dongjoon Hyun created SPARK-15134: - Summary: Indent SparkSession builder patterns and update binary_classification_metrics_example.py Key: SPARK-15134 URL: https://issues.apache.org/jira/browse/SPARK-15134 Project: Spark Issue Type: Task Components: Examples Reporter: Dongjoon Hyun Priority: Minor This issue addresses the comments in SPARK-15031 and also fixes java-linter errors. - Use multiline format in SparkSession builder patterns. - Update `binary_classification_metrics_example.py` to use `SparkSession`. - Fix Java Linter errors (in SPARK-13745 and so far) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
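For readers unfamiliar with the convention the SPARK-15134 items above refer to: the "multiline format" simply puts one chained builder call per line. A small illustrative sketch (object name, master URL, and app name are placeholders; the Python examples in the linked PR follow the same shape):
{code}
import org.apache.spark.sql.SparkSession

object BuilderIndentationSketch {
  def main(args: Array[String]): Unit = {
    // Multiline builder pattern: one chained call per line, indented
    // under the receiver. (The master URL is normally supplied by
    // spark-submit; it is set here only so the sketch runs standalone.)
    val spark = SparkSession
      .builder()
      .master("local[2]")
      .appName("BinaryClassificationMetricsExample")
      .getOrCreate()

    spark.stop()
  }
}
{code}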
[jira] [Commented] (SPARK-15130) PySpark decision tree params should include default values to match Scala
[ https://issues.apache.org/jira/browse/SPARK-15130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15271531#comment-15271531 ] Xin Ren commented on SPARK-15130: - Hi, I just found that in PySpark's DecisionTreeClassifier class there is a setParams method which roughly matches the Scala one. Do you mean we should create a separate "Param" class?
{code}
@keyword_only
@since("1.4.0")
def setParams(self, featuresCol="features", labelCol="label", predictionCol="prediction",
              probabilityCol="probability", rawPredictionCol="rawPrediction",
              maxDepth=5, maxBins=32, minInstancesPerNode=1, minInfoGain=0.0,
              maxMemoryInMB=256, cacheNodeIds=False, checkpointInterval=10, impurity="gini",
              seed=None):
    """
    setParams(self, featuresCol="features", labelCol="label", predictionCol="prediction", \
              probabilityCol="probability", rawPredictionCol="rawPrediction", \
              maxDepth=5, maxBins=32, minInstancesPerNode=1, minInfoGain=0.0, \
              maxMemoryInMB=256, cacheNodeIds=False, checkpointInterval=10, impurity="gini", \
              seed=None)
    Sets params for the DecisionTreeClassifier.
    """
    kwargs = self.setParams._input_kwargs
    return self._set(**kwargs)
{code}
> PySpark decision tree params should include default values to match Scala > - > > Key: SPARK-15130 > URL: https://issues.apache.org/jira/browse/SPARK-15130 > Project: Spark > Issue Type: Improvement > Components: Documentation, ML, PySpark >Reporter: holdenk >Priority: Minor > > As found while checking the documentation in SPARK-14813, PySpark decision tree > params do not include the default values (unlike the Scala ones). While the > existing Scala default values will still be used, this information is likely > worth exposing in the docs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
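For contrast with the PySpark snippet above: on the Scala side the defaults are registered once via setDefault, which is why the Scala docs can show them. Below is a condensed, hypothetical stand-in (the class name and trimmed parameter list are illustrative; the real code lives in spark.ml's DecisionTreeParams trait):
{code}
import org.apache.spark.ml.param.{IntParam, ParamMap, Params}
import org.apache.spark.ml.util.Identifiable

// Trimmed-down sketch of where the Scala-side defaults live.
class DecisionTreeParamsSketch(override val uid: String) extends Params {
  def this() = this(Identifiable.randomUID("dtSketch"))

  val maxDepth = new IntParam(this, "maxDepth", "Maximum depth of the tree. (>= 0)")
  val maxBins = new IntParam(this, "maxBins", "Max number of bins for discretizing continuous features.")

  // setDefault is what makes explainParams() print lines such as
  // "maxDepth: Maximum depth of the tree. (>= 0) (default: 5)" --
  // the piece of information the PySpark doc strings leave out.
  setDefault(maxDepth -> 5, maxBins -> 32)

  override def copy(extra: ParamMap): DecisionTreeParamsSketch = defaultCopy(extra)
}

object DecisionTreeParamsSketch {
  def main(args: Array[String]): Unit = {
    println(new DecisionTreeParamsSketch().explainParams())
  }
}
{code}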
[jira] [Assigned] (SPARK-13670) spark-class doesn't bubble up error from launcher command
[ https://issues.apache.org/jira/browse/SPARK-13670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13670: Assignee: Apache Spark > spark-class doesn't bubble up error from launcher command > - > > Key: SPARK-13670 > URL: https://issues.apache.org/jira/browse/SPARK-13670 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 2.0.0 >Reporter: Mark Grover >Assignee: Apache Spark >Priority: Minor > > There's a particular snippet in spark-class [here|https://github.com/apache/spark/blob/master/bin/spark-class#L86] that > runs the spark-launcher code in a subshell.
{code}
# The launcher library will print arguments separated by a NULL character, to allow arguments with
# characters that would be otherwise interpreted by the shell. Read that in a while loop, populating
# an array that will be used to exec the final command.
CMD=()
while IFS= read -d '' -r ARG; do
  CMD+=("$ARG")
done < <("$RUNNER" -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@")
{code}
> The problem is that if the launcher Main fails, this code still returns success and continues, even though the top-level script is marked {{set -e}}. This is because launcher.Main runs within a process substitution (a subshell), whose exit status the parent shell never sees. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
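Since the failure is swallowed by the process substitution, one remedy, sketched below in the same shell idiom as the snippet above (illustrative, not necessarily the exact patch that landed; $RUNNER and $LAUNCH_CLASSPATH are reused from spark-class), is to have the launcher invocation print its own exit status as a final NUL-separated token and check it in the parent:
{code}
# Emit the launcher's exit code as the last NUL-separated token,
# so the parent shell can inspect it after the process substitution.
build_command() {
  "$RUNNER" -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@"
  printf "%d\0" $?
}

CMD=()
while IFS= read -d '' -r ARG; do
  CMD+=("$ARG")
done < <(build_command "$@")

# The last array element is now the exit code; fail fast if it is non-zero.
COUNT=${#CMD[@]}
LAST=$((COUNT - 1))
LAUNCHER_EXIT_CODE=${CMD[$LAST]}
if [ "$LAUNCHER_EXIT_CODE" != 0 ]; then
  exit "$LAUNCHER_EXIT_CODE"
fi

# Strip the exit-code token and exec the real command.
CMD=("${CMD[@]:0:$LAST}")
exec "${CMD[@]}"
{code}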