[jira] [Resolved] (SPARK-17186) remove catalog table type INDEX
[ https://issues.apache.org/jira/browse/SPARK-17186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-17186. - Resolution: Fixed Fix Version/s: 2.1.0 2.0.1 > remove catalog table type INDEX > --- > > Key: SPARK-17186 > URL: https://issues.apache.org/jira/browse/SPARK-17186 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 2.0.1, 2.1.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16785) dapply doesn't return array or raw columns
[ https://issues.apache.org/jira/browse/SPARK-16785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15434253#comment-15434253 ] Apache Spark commented on SPARK-16785: -- User 'clarkfitzg' has created a pull request for this issue: https://github.com/apache/spark/pull/14783 > dapply doesn't return array or raw columns > -- > > Key: SPARK-16785 > URL: https://issues.apache.org/jira/browse/SPARK-16785 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.0.0 > Environment: Mac OS X >Reporter: Clark Fitzgerald >Priority: Minor > > Calling SparkR::dapplyCollect with R functions that return data frames > produces an error. This comes up when returning columns of binary data, i.e. > serialized fitted models. It also happens when functions return columns > containing vectors. > The error message: > R computation failed with > Error in (function (..., deparse.level = 1, make.row.names = TRUE, > stringsAsFactors = default.stringsAsFactors()) : > invalid list argument: all variables should have the same length > Reproducible example: > https://github.com/clarkfitzg/phd_research/blob/master/ddR/spark/sparkR_dapplyCollect7.R > Relates to SPARK-16611 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16785) dapply doesn't return array or raw columns
[ https://issues.apache.org/jira/browse/SPARK-16785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16785: Assignee: Apache Spark > dapply doesn't return array or raw columns > -- > > Key: SPARK-16785 > URL: https://issues.apache.org/jira/browse/SPARK-16785 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.0.0 > Environment: Mac OS X >Reporter: Clark Fitzgerald >Assignee: Apache Spark >Priority: Minor > > Calling SparkR::dapplyCollect with R functions that return data frames > produces an error. This comes up when returning columns of binary data, i.e. > serialized fitted models. It also happens when functions return columns > containing vectors. > The error message: > R computation failed with > Error in (function (..., deparse.level = 1, make.row.names = TRUE, > stringsAsFactors = default.stringsAsFactors()) : > invalid list argument: all variables should have the same length > Reproducible example: > https://github.com/clarkfitzg/phd_research/blob/master/ddR/spark/sparkR_dapplyCollect7.R > Relates to SPARK-16611 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16785) dapply doesn't return array or raw columns
[ https://issues.apache.org/jira/browse/SPARK-16785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16785: Assignee: (was: Apache Spark) > dapply doesn't return array or raw columns > -- > > Key: SPARK-16785 > URL: https://issues.apache.org/jira/browse/SPARK-16785 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.0.0 > Environment: Mac OS X >Reporter: Clark Fitzgerald >Priority: Minor > > Calling SparkR::dapplyCollect with R functions that return data frames > produces an error. This comes up when returning columns of binary data, i.e. > serialized fitted models. It also happens when functions return columns > containing vectors. > The error message: > R computation failed with > Error in (function (..., deparse.level = 1, make.row.names = TRUE, > stringsAsFactors = default.stringsAsFactors()) : > invalid list argument: all variables should have the same length > Reproducible example: > https://github.com/clarkfitzg/phd_research/blob/master/ddR/spark/sparkR_dapplyCollect7.R > Relates to SPARK-16611 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-17207) Comparing Vector in relative tolerance or absolute tolerance in UnitTests error
[ https://issues.apache.org/jira/browse/SPARK-17207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15434241#comment-15434241 ] Peng Meng edited comment on SPARK-17207 at 8/24/16 6:20 AM: This is caused by a problem when zipping two Vectors: def absTol(eps: Double): CompareVectorRightSide = CompareVectorRightSide( (x: Vector, y: Vector, eps: Double) => { x.toArray.zip(y.toArray).forall(x => x._1 ~= x._2 absTol eps) }, x, eps, ABS_TOL_MSG) // forall() always returns true if x or y is a zero-element Vector val a = Vectors.dense(Array(1.0, 2.0)) val b = Vectors.dense(Array(1.0)) a ~== b absTol 1e-1 // this also returns true, because a.toArray.zip(b.toArray) = Array((1.0, 1.0)) was (Author: peng.m...@intel.com): This is caused by a problem when zipping two Vectors: def absTol(eps: Double): CompareVectorRightSide = CompareVectorRightSide( (x: Vector, y: Vector, eps: Double) => { x.toArray.zip(y.toArray).forall(x => x._1 ~= x._2 absTol eps) }, x, eps, ABS_TOL_MSG) // forall() always returns true if x or y is a zero-element Vector > Comparing Vector in relative tolerance or absolute tolerance in UnitTests > error > > > Key: SPARK-17207 > URL: https://issues.apache.org/jira/browse/SPARK-17207 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Reporter: Peng Meng > > The result of comparing two vectors using UnitTests > (org.apache.spark.mllib.util.TestingUtils) is not right sometimes. > For example: > val a = Vectors.dense(Array(1.0, 2.0)) > val b = Vectors.zeros(0) > a ~== b absTol 1e-1 // the result is true. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
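The truncation bug described in this comment can be reproduced outside Spark. Below is a minimal sketch in plain Python (the helper names are hypothetical; the real check is the Scala `absTol` in org.apache.spark.mllib.util.TestingUtils): `zip` silently drops elements of the longer sequence, so `forall`/`all` over the zipped pairs passes even when the vectors have different lengths. A length check fixes it.

```python
def abs_tol_buggy(x, y, eps):
    # zip truncates to the shorter sequence, so extra elements are never compared
    return all(abs(a - b) <= eps for a, b in zip(x, y))

def abs_tol_fixed(x, y, eps):
    # vectors of different lengths can never be approximately equal
    if len(x) != len(y):
        return False
    return all(abs(a - b) <= eps for a, b in zip(x, y))

print(abs_tol_buggy([1.0, 2.0], [1.0], 0.1))  # True: the 2.0 is silently ignored
print(abs_tol_buggy([1.0, 2.0], [], 0.1))     # True: all() over an empty sequence
print(abs_tol_fixed([1.0, 2.0], [1.0], 0.1))  # False
```

The same length guard, added to the comparator functions in TestingUtils, would make both the `absTol` and `relTol` checks fail fast on mismatched sizes.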
[jira] [Assigned] (SPARK-17208) Build Outer Join Test Cases in File-based Testing Framework
[ https://issues.apache.org/jira/browse/SPARK-17208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17208: Assignee: (was: Apache Spark) > Build Outer Join Test Cases in File-based Testing Framework > --- > > Key: SPARK-17208 > URL: https://issues.apache.org/jira/browse/SPARK-17208 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li > > Based on file-based SQL end-to-end testing framework in `SQLQueryTestSuite`, > this JIRA is to create test cases for outer joins. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17208) Build Outer Join Test Cases in File-based Testing Framework
[ https://issues.apache.org/jira/browse/SPARK-17208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17208: Assignee: Apache Spark > Build Outer Join Test Cases in File-based Testing Framework > --- > > Key: SPARK-17208 > URL: https://issues.apache.org/jira/browse/SPARK-17208 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li >Assignee: Apache Spark > > Based on file-based SQL end-to-end testing framework in `SQLQueryTestSuite`, > this JIRA is to create test cases for outer joins. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17208) Build Outer Join Test Cases in File-based Testing Framework
[ https://issues.apache.org/jira/browse/SPARK-17208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15434249#comment-15434249 ] Apache Spark commented on SPARK-17208: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/14782 > Build Outer Join Test Cases in File-based Testing Framework > --- > > Key: SPARK-17208 > URL: https://issues.apache.org/jira/browse/SPARK-17208 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li > > Based on file-based SQL end-to-end testing framework in `SQLQueryTestSuite`, > this JIRA is to create test cases for outer joins. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17208) Build Outer Join Test Cases in File-based Testing Framework
Xiao Li created SPARK-17208: --- Summary: Build Outer Join Test Cases in File-based Testing Framework Key: SPARK-17208 URL: https://issues.apache.org/jira/browse/SPARK-17208 Project: Spark Issue Type: Test Components: SQL Affects Versions: 2.0.0 Reporter: Xiao Li Based on file-based SQL end-to-end testing framework in `SQLQueryTestSuite`, this JIRA is to create test cases for outer joins. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17207) Comparing Vector in relative tolerance or absolute tolerance in UnitTests error
[ https://issues.apache.org/jira/browse/SPARK-17207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15434241#comment-15434241 ] Peng Meng commented on SPARK-17207: --- This is caused by a problem when zipping two Vectors: def absTol(eps: Double): CompareVectorRightSide = CompareVectorRightSide( (x: Vector, y: Vector, eps: Double) => { x.toArray.zip(y.toArray).forall(x => x._1 ~= x._2 absTol eps) }, x, eps, ABS_TOL_MSG) // forall() always returns true if x or y is a zero-element Vector > Comparing Vector in relative tolerance or absolute tolerance in UnitTests > error > > > Key: SPARK-17207 > URL: https://issues.apache.org/jira/browse/SPARK-17207 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Reporter: Peng Meng > > The result of comparing two vectors using UnitTests > (org.apache.spark.mllib.util.TestingUtils) is not right sometimes. > For example: > val a = Vectors.dense(Array(1.0, 2.0)) > val b = Vectors.zeros(0) > a ~== b absTol 1e-1 // the result is true. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17207) Comparing Vector in relative tolerance or absolute tolerance in UnitTests error
Peng Meng created SPARK-17207: - Summary: Comparing Vector in relative tolerance or absolute tolerance in UnitTests error Key: SPARK-17207 URL: https://issues.apache.org/jira/browse/SPARK-17207 Project: Spark Issue Type: Bug Components: ML, MLlib Reporter: Peng Meng The result of comparing two vectors using UnitTests (org.apache.spark.mllib.util.TestingUtils) is not right sometimes. For example: val a = Vectors.dense(Array(1.0, 2.0)) val b = Vectors.zeros(0) a ~== b absTol 1e-1 // the result is true. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17174) Provide support for Timestamp type Column in add_months function to return HH:mm:ss
[ https://issues.apache.org/jira/browse/SPARK-17174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15434195#comment-15434195 ] Hyukjin Kwon commented on SPARK-17174: -- [~cloud_fan] I realised that {{Returns date ...}} might imply truncating the time part, so I closed the documentation change. BTW, do you think those functions should change the return type according to the input type? It seems some DBMSes do. > Provide support for Timestamp type Column in add_months function to return > HH:mm:ss > --- > > Key: SPARK-17174 > URL: https://issues.apache.org/jira/browse/SPARK-17174 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 2.0.0 >Reporter: Amit Baghel >Priority: Minor > > The add_months function currently supports Date types. If the Column is > Timestamp type then it adds the months to the date but doesn't return the > timestamp part (HH:mm:ss). See the code below. > {code} > import java.util.Calendar > val now = Calendar.getInstance().getTime() > val df = sc.parallelize((0 to 3).map(i => {now.setMonth(i); (i, new > java.sql.Timestamp(now.getTime))}).toSeq).toDF("ID", "DateWithTS") > df.withColumn("NewDateWithTS", add_months(df("DateWithTS"),1)).show > {code} > The above code gives the following response. Note that HH:mm:ss is missing > from the NewDateWithTS column. > {code} > +---+--------------------+-------------+ > | ID| DateWithTS|NewDateWithTS| > +---+--------------------+-------------+ > | 0|2016-01-21 09:38:...| 2016-02-21| > | 1|2016-02-21 09:38:...| 2016-03-21| > | 2|2016-03-21 09:38:...| 2016-04-21| > | 3|2016-04-21 09:38:...| 2016-05-21| > +---+--------------------+-------------+ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
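The behavior the report asks for (add months, keep HH:mm:ss) can be sketched with the Python standard library alone. This is a hypothetical illustration of the semantics, not Spark's implementation; the only subtlety is clamping the day-of-month when the target month is shorter.

```python
import calendar
import datetime

def add_months(ts: datetime.datetime, n: int) -> datetime.datetime:
    """Add n months to ts, preserving the time-of-day (HH:mm:ss)."""
    month_index = ts.month - 1 + n
    year = ts.year + month_index // 12
    month = month_index % 12 + 1
    # clamp the day, e.g. Jan 31 + 1 month -> Feb 28/29
    day = min(ts.day, calendar.monthrange(year, month)[1])
    return ts.replace(year=year, month=month, day=day)

t = datetime.datetime(2016, 1, 21, 9, 38, 12)
print(add_months(t, 1))  # 2016-02-21 09:38:12, time part preserved
```

For a Timestamp input, add_months could return this full value and cast to Date only when the input is a Date, which is the return-type question raised in the comment above.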
[jira] [Commented] (SPARK-17198) ORC fixed char literal filter does not work
[ https://issues.apache.org/jira/browse/SPARK-17198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15434191#comment-15434191 ] Dongjoon Hyun commented on SPARK-17198: --- Hi, [~tuming]. The reported error scenario in HIVE-11312 seems to work without problems in Spark 2.0, like the following. {code} scala> sql("create table orc_test( col1 string, col2 char(10)) stored as orc tblproperties ('orc.compress'='NONE')") scala> sql("insert into orc_test values ('val1', '1')") scala> sql("select * from orc_test where col2='1'").show +----+----+ |col1|col2| +----+----+ |val1| 1| +----+----+ scala> spark.version res3: String = 2.0.0 {code} Could you give us some reproducible examples? > ORC fixed char literal filter does not work > --- > > Key: SPARK-17198 > URL: https://issues.apache.org/jira/browse/SPARK-17198 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.1 >Reporter: tuming > > I got a wrong result when I ran the following query in Spark SQL: > select * from orc_table where char_col ='5LZS'; > Table orc_table is an ORC format table. > Column char_col is defined as char(6). > The Hive record reader will return a char(6) string to Spark, and Spark > has no fixed char type. All fixed char type attributes are converted to > String by default. Meanwhile the constant literal is parsed to a string > Literal. So the equality comparison will never return true. > For instance: '5LZS'=='5LZS '. > But I can get the correct result in Hive using the same data and SQL string, > because Hive appends spaces to those constant literals. Please refer to: > https://issues.apache.org/jira/browse/HIVE-11312 > I found there is no such patch for Spark. > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
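The padding mismatch at the heart of this report is easy to see outside any database. A hypothetical Python sketch: a CHAR(6) column stores values blank-padded to 6 characters, so the value read back never equals the unpadded literal; the Hive-style fix referenced in HIVE-11312 is, in effect, to pad the literal to the column width before comparing.

```python
def char_pad(s: str, width: int) -> str:
    # CHAR(n) semantics: values are stored blank-padded to n characters
    return s.ljust(width)

stored = char_pad("5LZS", 6)           # what a fixed-char reader hands back: '5LZS  '
print(stored == "5LZS")                # False: trailing pad spaces break equality
print(stored == char_pad("5LZS", 6))   # True once the literal is padded to char(6)
```

Since Spark converts fixed-char attributes to plain String, the comparison side never learns the declared width, which is why the literal is not padded there.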
[jira] [Commented] (SPARK-17167) Issue Exceptions when Analyze Table on In-Memory Cataloged Tables
[ https://issues.apache.org/jira/browse/SPARK-17167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15434181#comment-15434181 ] Apache Spark commented on SPARK-17167: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/14781 > Issue Exceptions when Analyze Table on In-Memory Cataloged Tables > - > > Key: SPARK-17167 > URL: https://issues.apache.org/jira/browse/SPARK-17167 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li > > Currently, `Analyze Table` is only for Hive-serde tables. We should issue > exceptions in all the other cases. When the tables are data source tables, we > already issue an exception. However, when the tables are In-Memory Cataloged > tables, we do not issue any exception. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3162) Train DecisionTree locally when possible
[ https://issues.apache.org/jira/browse/SPARK-3162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15434154#comment-15434154 ] Siddharth Murching commented on SPARK-3162: --- Here's a design doc with proposed changes - any comments/feedback are much appreciated :) https://docs.google.com/document/d/1baU5KeorrmLpC4EZoqLuG-E8sUJqmdELLbr8o6wdbVM/ > Train DecisionTree locally when possible > > > Key: SPARK-3162 > URL: https://issues.apache.org/jira/browse/SPARK-3162 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Priority: Critical > > Improvement: communication > Currently, every level of a DecisionTree is trained in a distributed manner. > However, at deeper levels in the tree, it is possible that a small set of > training data will be matched with any given node. If the node’s training > data can fit on one machine’s memory, it may be more efficient to shuffle the > data and do local training for the rest of the subtree rooted at that node. > Note: It is possible that local training would become possible at different > levels in different branches of the tree. There are multiple options for > handling this case: > (1) Train in a distributed fashion until all remaining nodes can be trained > locally. This would entail training multiple levels at once (locally). > (2) Train branches locally when possible, and interleave this with > distributed training of the other branches. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
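The scheduling rule described in this ticket (switch a subtree to local training once the rows reaching its root fit on one machine) can be sketched in a few lines. The names and the per-row/budget constants below are hypothetical illustrations, not values from the design doc.

```python
ROW_BYTES = 64                      # assumed per-row memory footprint
LOCAL_BUDGET = 512 * 1024 * 1024    # assumed per-executor memory budget

def plan(node_row_counts):
    """Map node id -> 'local' or 'distributed' for the next training pass.

    Nodes whose matched training data fits under the local budget are
    shuffled to one machine and their whole subtree is trained there;
    the rest continue level-by-level distributed training.
    """
    return {
        nid: ("local" if n * ROW_BYTES <= LOCAL_BUDGET else "distributed")
        for nid, n in node_row_counts.items()
    }

print(plan({0: 100_000_000, 1: 10_000}))  # {0: 'distributed', 1: 'local'}
```

Interleaving the two modes per branch, rather than waiting until every remaining node qualifies, corresponds to option (2) in the issue description.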
[jira] [Closed] (SPARK-16822) Support latex in scaladoc with MathJax
[ https://issues.apache.org/jira/browse/SPARK-16822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jagadeesan A S closed SPARK-16822. -- Resolution: Fixed > Support latex in scaladoc with MathJax > -- > > Key: SPARK-16822 > URL: https://issues.apache.org/jira/browse/SPARK-16822 > Project: Spark > Issue Type: Improvement > Components: Documentation >Reporter: Shuai Lin >Assignee: Shuai Lin >Priority: Minor > Fix For: 2.1.0 > > > The scaladoc of some classes (mainly ml/mllib classes) include math formulas, > but currently it renders very ugly, e.g. [the doc of the LogisticGradient > class|https://spark.apache.org/docs/2.0.0-preview/api/scala/index.html#org.apache.spark.mllib.optimization.LogisticGradient]. > We can improve this by including MathJax javascripts in the scaladocs page, > much like what we do for the markdown docs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17206) Support ANALYZE TABLE on analyzable temporary table/view
[ https://issues.apache.org/jira/browse/SPARK-17206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15434131#comment-15434131 ] Apache Spark commented on SPARK-17206: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/14780 > Support ANALYZE TABLE on analyzable temporary table/view > > > Key: SPARK-17206 > URL: https://issues.apache.org/jira/browse/SPARK-17206 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Liang-Chi Hsieh > > Currently the ANALYZE TABLE DDL command can't work on temporary views. > However, for the specific type of temporary view which is analyzable, we can > support the DDL command for it, so the CBO can work with temporary views too. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17206) Support ANALYZE TABLE on analyzable temporary table/view
[ https://issues.apache.org/jira/browse/SPARK-17206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17206: Assignee: Apache Spark > Support ANALYZE TABLE on analyzable temporary table/view > > > Key: SPARK-17206 > URL: https://issues.apache.org/jira/browse/SPARK-17206 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Liang-Chi Hsieh >Assignee: Apache Spark > > Currently the ANALYZE TABLE DDL command can't work on temporary views. > However, for the specific type of temporary view which is analyzable, we can > support the DDL command for it, so the CBO can work with temporary views too. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17206) Support ANALYZE TABLE on analyzable temporary table/view
[ https://issues.apache.org/jira/browse/SPARK-17206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17206: Assignee: (was: Apache Spark) > Support ANALYZE TABLE on analyzable temporary table/view > > > Key: SPARK-17206 > URL: https://issues.apache.org/jira/browse/SPARK-17206 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Liang-Chi Hsieh > > Currently the ANALYZE TABLE DDL command can't work on temporary views. > However, for the specific type of temporary view which is analyzable, we can > support the DDL command for it, so the CBO can work with temporary views too. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17206) Support ANALYZE TABLE on analyzable temporary table/view
Liang-Chi Hsieh created SPARK-17206: --- Summary: Support ANALYZE TABLE on analyzable temporary table/view Key: SPARK-17206 URL: https://issues.apache.org/jira/browse/SPARK-17206 Project: Spark Issue Type: Improvement Components: SQL Reporter: Liang-Chi Hsieh Currently the ANALYZE TABLE DDL command can't work on temporary views. However, for the specific type of temporary view which is analyzable, we can support the DDL command for it, so the CBO can work with temporary views too. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6235) Address various 2G limits
[ https://issues.apache.org/jira/browse/SPARK-6235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guoqiang Li updated SPARK-6235: --- Attachment: SPARK-6235_Design_V0.01.pdf Preliminary Design Document. > Address various 2G limits > - > > Key: SPARK-6235 > URL: https://issues.apache.org/jira/browse/SPARK-6235 > Project: Spark > Issue Type: Umbrella > Components: Shuffle, Spark Core >Reporter: Reynold Xin > Attachments: SPARK-6235_Design_V0.01.pdf > > > An umbrella ticket to track the various 2G limit we have in Spark, due to the > use of byte arrays and ByteBuffers. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-3162) Train DecisionTree locally when possible
[ https://issues.apache.org/jira/browse/SPARK-3162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Murching updated SPARK-3162: -- Comment: was deleted (was: Here's a design doc with proposed changes - any comments/feedback are much appreciated :) https://docs.google.com/document/d/1baU5KeorrmLpC4EZoqLuG-E8sUJqmdELLbr8o6wdbVM/edit?usp=sharing ) > Train DecisionTree locally when possible > > > Key: SPARK-3162 > URL: https://issues.apache.org/jira/browse/SPARK-3162 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Priority: Critical > > Improvement: communication > Currently, every level of a DecisionTree is trained in a distributed manner. > However, at deeper levels in the tree, it is possible that a small set of > training data will be matched with any given node. If the node’s training > data can fit on one machine’s memory, it may be more efficient to shuffle the > data and do local training for the rest of the subtree rooted at that node. > Note: It is possible that local training would become possible at different > levels in different branches of the tree. There are multiple options for > handling this case: > (1) Train in a distributed fashion until all remaining nodes can be trained > locally. This would entail training multiple levels at once (locally). > (2) Train branches locally when possible, and interleave this with > distributed training of the other branches. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17201) Investigate numerical instability for MLOR without regularization
[ https://issues.apache.org/jira/browse/SPARK-17201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15434073#comment-15434073 ] Weichen Xu commented on SPARK-17201: Yeah, you are right... I searched for a proof of this, such as: http://math.stackexchange.com/questions/209237/why-does-the-standard-bfgs-update-rule-preserve-positive-definiteness For BFGS, it can be proven that the approximate Hessian stays positive semi-definite. Thanks for looking into this problem deeply! > Investigate numerical instability for MLOR without regularization > - > > Key: SPARK-17201 > URL: https://issues.apache.org/jira/browse/SPARK-17201 > Project: Spark > Issue Type: Sub-task > Components: ML, MLlib >Reporter: Seth Hendrickson > > As mentioned > [here|http://ufldl.stanford.edu/wiki/index.php/Softmax_Regression], when no > regularization is applied in Softmax regression, second order Newton solvers > may run into numerical instability problems. We should investigate this in > practice and find a solution, possibly by implementing pivoting when no > regularization is applied. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
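The positive-definiteness claim in this comment can be checked numerically. Below is a hypothetical NumPy sketch (not Spark or Breeze code) of one standard BFGS inverse-Hessian update on a convex quadratic: given the curvature condition s·y > 0, the updated approximation remains positive definite.

```python
import numpy as np

A = np.array([[3.0, 1.0], [1.0, 2.0]])   # SPD Hessian of f(x) = 0.5 * x^T A x
grad = lambda x: A @ x

x0 = np.array([1.0, 1.0])
x1 = np.array([0.4, 0.7])
s = x1 - x0                               # step
y = grad(x1) - grad(x0)                   # gradient change
assert s @ y > 0                          # curvature condition holds

# standard BFGS update of the inverse-Hessian approximation H
H = np.eye(2)                             # initial approximation (PD)
I = np.eye(2)
rho = 1.0 / (y @ s)
H = (I - rho * np.outer(s, y)) @ H @ (I - rho * np.outer(y, s)) + rho * np.outer(s, s)

print(np.all(np.linalg.eigvalsh(H) > 0))  # True: update preserves positive definiteness
```

When s·y > 0 fails (possible without regularization, or under non-convexity), implementations skip or damp the update for exactly this reason.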
[jira] [Resolved] (SPARK-16862) Configurable buffer size in `UnsafeSorterSpillReader`
[ https://issues.apache.org/jira/browse/SPARK-16862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-16862. - Resolution: Fixed Assignee: Tejas Patil Fix Version/s: 2.1.0 > Configurable buffer size in `UnsafeSorterSpillReader` > - > > Key: SPARK-16862 > URL: https://issues.apache.org/jira/browse/SPARK-16862 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Tejas Patil >Assignee: Tejas Patil >Priority: Minor > Fix For: 2.1.0 > > > `BufferedInputStream` used in `UnsafeSorterSpillReader` uses the default 8k > buffer to read data off disk. This could be made configurable to improve on > disk reads. > https://github.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeSorterSpillReader.java#L53 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
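The change in this ticket is analogous to choosing a buffer size for a buffered reader. A hypothetical plain-Python sketch (Python's `open(..., buffering=...)` plays the role of Java's `BufferedInputStream(stream, size)`; the 8 KiB default mirrors Java's): the spill file and sizes below are made up for illustration.

```python
import io
import os
import tempfile

tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "spill.bin")
with open(path, "wb") as f:
    f.write(b"x" * (1 << 20))  # a 1 MiB stand-in for a sorter spill file

# default read buffer (io.DEFAULT_BUFFER_SIZE, typically 8 KiB) vs. a tuned 1 MiB buffer;
# a larger buffer means fewer, larger disk reads when streaming a big spill file
with open(path, "rb", buffering=1 << 20) as f:
    data = f.read()
print(len(data))  # 1048576
```

Exposing the size as a configuration knob, as the PR does for UnsafeSorterSpillReader, lets operators trade memory for fewer disk seeks on large spills.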
[jira] [Assigned] (SPARK-17174) Provide support for Timestamp type Column in add_months function to return HH:mm:ss
[ https://issues.apache.org/jira/browse/SPARK-17174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17174: Assignee: Apache Spark > Provide support for Timestamp type Column in add_months function to return > HH:mm:ss > --- > > Key: SPARK-17174 > URL: https://issues.apache.org/jira/browse/SPARK-17174 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 2.0.0 >Reporter: Amit Baghel >Assignee: Apache Spark >Priority: Minor > > The add_months function currently supports Date types. If the Column is > Timestamp type then it adds the months to the date but doesn't return the > timestamp part (HH:mm:ss). See the code below. > {code} > import java.util.Calendar > val now = Calendar.getInstance().getTime() > val df = sc.parallelize((0 to 3).map(i => {now.setMonth(i); (i, new > java.sql.Timestamp(now.getTime))}).toSeq).toDF("ID", "DateWithTS") > df.withColumn("NewDateWithTS", add_months(df("DateWithTS"),1)).show > {code} > The above code gives the following response. Note that HH:mm:ss is missing > from the NewDateWithTS column. > {code} > +---+--------------------+-------------+ > | ID| DateWithTS|NewDateWithTS| > +---+--------------------+-------------+ > | 0|2016-01-21 09:38:...| 2016-02-21| > | 1|2016-02-21 09:38:...| 2016-03-21| > | 2|2016-03-21 09:38:...| 2016-04-21| > | 3|2016-04-21 09:38:...| 2016-05-21| > +---+--------------------+-------------+ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17174) Provide support for Timestamp type Column in add_months function to return HH:mm:ss
[ https://issues.apache.org/jira/browse/SPARK-17174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15434038#comment-15434038 ] Apache Spark commented on SPARK-17174: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/14778 > Provide support for Timestamp type Column in add_months function to return > HH:mm:ss > --- > > Key: SPARK-17174 > URL: https://issues.apache.org/jira/browse/SPARK-17174 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 2.0.0 >Reporter: Amit Baghel >Priority: Minor > > The add_months function currently supports Date types. If the Column is > Timestamp type then it adds the months to the date but doesn't return the > timestamp part (HH:mm:ss). See the code below. > {code} > import java.util.Calendar > val now = Calendar.getInstance().getTime() > val df = sc.parallelize((0 to 3).map(i => {now.setMonth(i); (i, new > java.sql.Timestamp(now.getTime))}).toSeq).toDF("ID", "DateWithTS") > df.withColumn("NewDateWithTS", add_months(df("DateWithTS"),1)).show > {code} > The above code gives the following response. Note that HH:mm:ss is missing > from the NewDateWithTS column. > {code} > +---+--------------------+-------------+ > | ID| DateWithTS|NewDateWithTS| > +---+--------------------+-------------+ > | 0|2016-01-21 09:38:...| 2016-02-21| > | 1|2016-02-21 09:38:...| 2016-03-21| > | 2|2016-03-21 09:38:...| 2016-04-21| > | 3|2016-04-21 09:38:...| 2016-05-21| > +---+--------------------+-------------+ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17174) Provide support for Timestamp type Column in add_months function to return HH:mm:ss
[ https://issues.apache.org/jira/browse/SPARK-17174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17174: Assignee: (was: Apache Spark) > Provide support for Timestamp type Column in add_months function to return > HH:mm:ss > --- > > Key: SPARK-17174 > URL: https://issues.apache.org/jira/browse/SPARK-17174 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 2.0.0 >Reporter: Amit Baghel >Priority: Minor > > The add_months function currently supports Date types. If the Column is > Timestamp type, it adds the months to the date but drops the time part > (HH:mm:ss). See the code below. > {code} > import java.util.Calendar > val now = Calendar.getInstance().getTime() > val df = sc.parallelize((0 to 3).map(i => {now.setMonth(i); (i, new > java.sql.Timestamp(now.getTime))}).toSeq).toDF("ID", "DateWithTS") > df.withColumn("NewDateWithTS", add_months(df("DateWithTS"), 1)).show > {code} > The above code gives the following output. Note that the HH:mm:ss part is > missing from the NewDateWithTS column. > {code} > +---+--------------------+-------------+ > | ID| DateWithTS|NewDateWithTS| > +---+--------------------+-------------+ > | 0|2016-01-21 09:38:...| 2016-02-21| > | 1|2016-02-21 09:38:...| 2016-03-21| > | 2|2016-03-21 09:38:...| 2016-04-21| > | 3|2016-04-21 09:38:...| 2016-05-21| > +---+--------------------+-------------+ > {code}
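The time-preserving semantics the reporter is asking for can be sketched outside Spark. The helper below is a plain-Python illustration only; the name add_months_keep_time is made up here and is not a Spark or pandas API. It carries HH:mm:ss through and clamps the day of month, which is what a Timestamp-aware add_months would plausibly do:

```python
from datetime import datetime
import calendar

def add_months_keep_time(ts, months):
    """Add `months` to `ts`, keeping HH:mm:ss and clamping the day of month."""
    total = ts.month - 1 + months
    year = ts.year + total // 12
    month = total % 12 + 1
    # Clamp to the last valid day of the target month (e.g. Jan 31 + 1 month).
    day = min(ts.day, calendar.monthrange(year, month)[1])
    return ts.replace(year=year, month=month, day=day)

print(add_months_keep_time(datetime(2016, 1, 21, 9, 38, 12), 1))
# 2016-02-21 09:38:12
```

Compare with the table in the report: the result keeps 09:38:12 instead of truncating to a bare date.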
[jira] [Comment Edited] (SPARK-3162) Train DecisionTree locally when possible
[ https://issues.apache.org/jira/browse/SPARK-3162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15434031#comment-15434031 ] Siddharth Murching edited comment on SPARK-3162 at 8/24/16 1:37 AM: Here's a design doc with proposed changes - any comments/feedback are much appreciated :) Design doc link: [Link|https://docs.google.com/document/d/1baU5KeorrmLpC4EZoqLuG-E8sUJqmdELLbr8o6wdbVM/edit?usp=sharing] was (Author: siddharth murching): Here's a design doc with proposed changes - any comments/feedback are much appreciated :) [Link|https://docs.google.com/document/d/1baU5KeorrmLpC4EZoqLuG-E8sUJqmdELLbr8o6wdbVM/edit?usp=sharing] > Train DecisionTree locally when possible > > > Key: SPARK-3162 > URL: https://issues.apache.org/jira/browse/SPARK-3162 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Priority: Critical > > Improvement: communication > Currently, every level of a DecisionTree is trained in a distributed manner. > However, at deeper levels in the tree, it is possible that a small set of > training data will be matched with any given node. If the node’s training > data can fit on one machine’s memory, it may be more efficient to shuffle the > data and do local training for the rest of the subtree rooted at that node. > Note: It is possible that local training would become possible at different > levels in different branches of the tree. There are multiple options for > handling this case: > (1) Train in a distributed fashion until all remaining nodes can be trained > locally. This would entail training multiple levels at once (locally). > (2) Train branches locally when possible, and interleave this with > distributed training of the other branches. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-3162) Train DecisionTree locally when possible
[ https://issues.apache.org/jira/browse/SPARK-3162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15434031#comment-15434031 ] Siddharth Murching edited comment on SPARK-3162 at 8/24/16 1:37 AM: Here's a design doc with proposed changes - any comments/feedback are much appreciated :) [Link|https://docs.google.com/document/d/1baU5KeorrmLpC4EZoqLuG-E8sUJqmdELLbr8o6wdbVM/edit?usp=sharing] was (Author: siddharth murching): Here's a design doc with proposed changes - any comments/feedback are much appreciated :) [Link](https://docs.google.com/document/d/1baU5KeorrmLpC4EZoqLuG-E8sUJqmdELLbr8o6wdbVM/edit?usp=sharing) > Train DecisionTree locally when possible > > > Key: SPARK-3162 > URL: https://issues.apache.org/jira/browse/SPARK-3162 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Priority: Critical > > Improvement: communication > Currently, every level of a DecisionTree is trained in a distributed manner. > However, at deeper levels in the tree, it is possible that a small set of > training data will be matched with any given node. If the node’s training > data can fit on one machine’s memory, it may be more efficient to shuffle the > data and do local training for the rest of the subtree rooted at that node. > Note: It is possible that local training would become possible at different > levels in different branches of the tree. There are multiple options for > handling this case: > (1) Train in a distributed fashion until all remaining nodes can be trained > locally. This would entail training multiple levels at once (locally). > (2) Train branches locally when possible, and interleave this with > distributed training of the other branches. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-3162) Train DecisionTree locally when possible
[ https://issues.apache.org/jira/browse/SPARK-3162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15434031#comment-15434031 ] Siddharth Murching edited comment on SPARK-3162 at 8/24/16 1:37 AM: Here's a design doc with proposed changes - any comments/feedback are much appreciated :) https://docs.google.com/document/d/1baU5KeorrmLpC4EZoqLuG-E8sUJqmdELLbr8o6wdbVM/edit?usp=sharing was (Author: siddharth murching): Here's a design doc with proposed changes - any comments/feedback are much appreciated :) Design doc link: [Link|https://docs.google.com/document/d/1baU5KeorrmLpC4EZoqLuG-E8sUJqmdELLbr8o6wdbVM/edit?usp=sharing] > Train DecisionTree locally when possible > > > Key: SPARK-3162 > URL: https://issues.apache.org/jira/browse/SPARK-3162 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Priority: Critical > > Improvement: communication > Currently, every level of a DecisionTree is trained in a distributed manner. > However, at deeper levels in the tree, it is possible that a small set of > training data will be matched with any given node. If the node’s training > data can fit on one machine’s memory, it may be more efficient to shuffle the > data and do local training for the rest of the subtree rooted at that node. > Note: It is possible that local training would become possible at different > levels in different branches of the tree. There are multiple options for > handling this case: > (1) Train in a distributed fashion until all remaining nodes can be trained > locally. This would entail training multiple levels at once (locally). > (2) Train branches locally when possible, and interleave this with > distributed training of the other branches. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3162) Train DecisionTree locally when possible
[ https://issues.apache.org/jira/browse/SPARK-3162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15434031#comment-15434031 ] Siddharth Murching commented on SPARK-3162: --- Here's a design doc with proposed changes - any comments/feedback are much appreciated :) [Link](https://docs.google.com/document/d/1baU5KeorrmLpC4EZoqLuG-E8sUJqmdELLbr8o6wdbVM/edit?usp=sharing) > Train DecisionTree locally when possible > > > Key: SPARK-3162 > URL: https://issues.apache.org/jira/browse/SPARK-3162 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Priority: Critical > > Improvement: communication > Currently, every level of a DecisionTree is trained in a distributed manner. > However, at deeper levels in the tree, it is possible that a small set of > training data will be matched with any given node. If the node’s training > data can fit on one machine’s memory, it may be more efficient to shuffle the > data and do local training for the rest of the subtree rooted at that node. > Note: It is possible that local training would become possible at different > levels in different branches of the tree. There are multiple options for > handling this case: > (1) Train in a distributed fashion until all remaining nodes can be trained > locally. This would entail training multiple levels at once (locally). > (2) Train branches locally when possible, and interleave this with > distributed training of the other branches. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
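Both options in the issue hinge on the same per-node decision: can this node's slice of the training data fit in one machine's memory? A hedged sketch of that bookkeeping follows; the function name, the (node_id, row_count) representation, and the byte estimates are all illustrative, not Spark MLlib internals:

```python
def plan_training(frontier, bytes_per_row, local_budget_bytes):
    """Split frontier tree nodes into locally trainable vs still-distributed.

    A node whose estimated data size fits under the budget would have its rows
    shuffled to one executor and its whole subtree trained there (option 2);
    the rest keep training level-by-level in a distributed fashion (option 1).
    """
    local, distributed = [], []
    for node_id, row_count in frontier:
        if row_count * bytes_per_row <= local_budget_bytes:
            local.append(node_id)
        else:
            distributed.append(node_id)
    return local, distributed

# 100 rows * 64 B fits in a 1 MiB budget; 10M rows does not.
local, dist = plan_training([("a", 100), ("b", 10_000_000)], 64, 1 << 20)
# local == ["a"], dist == ["b"]
```

This is exactly the mixed case the Note describes: different branches of the tree cross the threshold at different depths, so the frontier splits.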
[jira] [Commented] (SPARK-16581) Making JVM backend calling functions public
[ https://issues.apache.org/jira/browse/SPARK-16581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15434030#comment-15434030 ] Xusen Yin commented on SPARK-16581: --- Sure, no problem. > Making JVM backend calling functions public > --- > > Key: SPARK-16581 > URL: https://issues.apache.org/jira/browse/SPARK-16581 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Shivaram Venkataraman > > As described in the design doc in SPARK-15799, to help packages that need to > call into the JVM, it will be good to expose some of the R -> JVM functions > we have. > As a part of this we could also rename, reformat the functions to make them > more user friendly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17205) Literal.sql does not properly convert NaN and Infinity literals
[ https://issues.apache.org/jira/browse/SPARK-17205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17205: Assignee: Josh Rosen (was: Apache Spark) > Literal.sql does not properly convert NaN and Infinity literals > --- > > Key: SPARK-17205 > URL: https://issues.apache.org/jira/browse/SPARK-17205 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Minor > > {{Literal.sql}} mishandles NaN and Infinity literals: the handling of these > needs to be special-cased instead of simply appending a suffix to the string > representation of the value -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17205) Literal.sql does not properly convert NaN and Infinity literals
[ https://issues.apache.org/jira/browse/SPARK-17205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15433962#comment-15433962 ] Apache Spark commented on SPARK-17205: -- User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/14777 > Literal.sql does not properly convert NaN and Infinity literals > --- > > Key: SPARK-17205 > URL: https://issues.apache.org/jira/browse/SPARK-17205 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Minor > > {{Literal.sql}} mishandles NaN and Infinity literals: the handling of these > needs to be special-cased instead of simply appending a suffix to the string > representation of the value -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17205) Literal.sql does not properly convert NaN and Infinity literals
[ https://issues.apache.org/jira/browse/SPARK-17205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17205: Assignee: Apache Spark (was: Josh Rosen) > Literal.sql does not properly convert NaN and Infinity literals > --- > > Key: SPARK-17205 > URL: https://issues.apache.org/jira/browse/SPARK-17205 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Josh Rosen >Assignee: Apache Spark >Priority: Minor > > {{Literal.sql}} mishandles NaN and Infinity literals: the handling of these > needs to be special-cased instead of simply appending a suffix to the string > representation of the value -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17205) Literal.sql does not properly convert NaN and Infinity literals
[ https://issues.apache.org/jira/browse/SPARK-17205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-17205: --- Description: {{Literal.sql}} mishandles NaN and Infinity literals: the handling of these needs to be special-cased instead of simply appending a suffix to the string representation of the value (was: {{Literal.sql}} mishandles NaN and Infinity literals: the handling of these needs to be special-cased.) > Literal.sql does not properly convert NaN and Infinity literals > --- > > Key: SPARK-17205 > URL: https://issues.apache.org/jira/browse/SPARK-17205 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Minor > > {{Literal.sql}} mishandles NaN and Infinity literals: the handling of these > needs to be special-cased instead of simply appending a suffix to the string > representation of the value -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17205) Literal.sql does not properly convert NaN and Infinity literals
Josh Rosen created SPARK-17205: -- Summary: Literal.sql does not properly convert NaN and Infinity literals Key: SPARK-17205 URL: https://issues.apache.org/jira/browse/SPARK-17205 Project: Spark Issue Type: Bug Components: SQL Reporter: Josh Rosen Assignee: Josh Rosen Priority: Minor {{Literal.sql}} mishandles NaN and Infinity literals: the handling of these needs to be special-cased. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
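The failure mode is that appending a type suffix to the value's string form turns NaN into something unparseable (e.g. "NaND"). A special-cased converter might look like the sketch below; this is plain Python for illustration, and the exact SQL spelling (CAST('NaN' AS DOUBLE)) is an assumption, not necessarily the wording of the actual fix:

```python
import math

def double_literal_sql(v):
    """Render a double as a SQL literal, special-casing non-finite values."""
    if math.isnan(v):
        return "CAST('NaN' AS DOUBLE)"        # naive str(v) + "D" would give "nanD"
    if math.isinf(v):
        sign = "-" if v < 0 else ""
        return f"CAST('{sign}Infinity' AS DOUBLE)"
    return f"{v!r}D"                          # finite values: the suffix is fine

print(double_literal_sql(float("nan")))   # CAST('NaN' AS DOUBLE)
print(double_literal_sql(1.5))            # 1.5D
```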
[jira] [Updated] (SPARK-17204) Spark 2.0 off heap RDD persistence with replication factor 2 leads to in-memory data corruption
[ https://issues.apache.org/jira/browse/SPARK-17204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Allman updated SPARK-17204: --- Summary: Spark 2.0 off heap RDD persistence with replication factor 2 leads to in-memory data corruption (was: Spark 2.0 off heap RDD persistence with replication factor 2 leads to data corruption) > Spark 2.0 off heap RDD persistence with replication factor 2 leads to > in-memory data corruption > --- > > Key: SPARK-17204 > URL: https://issues.apache.org/jira/browse/SPARK-17204 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Michael Allman > > We use the OFF_HEAP storage level extensively. We've tried off-heap storage > with replication factor 2 and have always received exceptions on the executor > side very shortly after starting the job. For example: > {code} > com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: > 9086 > at > com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:137) > at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:670) > at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:781) > at > org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229) > at > org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169) > at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificColumnarIterator.hasNext(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown > 
Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} > or > {code} > java.lang.IndexOutOfBoundsException: Index: 6, Size: 0 > at java.util.ArrayList.rangeCheck(ArrayList.java:653) > at java.util.ArrayList.get(ArrayList.java:429) > at > com.esotericsoftware.kryo.util.MapReferenceResolver.getReadObject(MapReferenceResolver.java:60) > at com.esotericsoftware.kryo.Kryo.readReferenceOrNull(Kryo.java:834) > at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:788) > at > org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229) > at > org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169) > at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461) > at 
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificColumnarIterator.hasNext(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.
[jira] [Updated] (SPARK-17204) Spark 2.0 off heap RDD persistence with replication factor 2 leads to data corruption
[ https://issues.apache.org/jira/browse/SPARK-17204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Allman updated SPARK-17204: --- Description: We use the OFF_HEAP storage level extensively. We've tried off-heap storage with replication factor 2 and have always received exceptions on the executor side very shortly after starting the job. For example: {code} com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: 9086 at com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:137) at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:670) at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:781) at org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229) at org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169) at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificColumnarIterator.hasNext(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125) at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) at org.apache.spark.scheduler.Task.run(Task.scala:85) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) {code} or {code} java.lang.IndexOutOfBoundsException: Index: 6, Size: 0 at java.util.ArrayList.rangeCheck(ArrayList.java:653) at java.util.ArrayList.get(ArrayList.java:429) at com.esotericsoftware.kryo.util.MapReferenceResolver.getReadObject(MapReferenceResolver.java:60) at com.esotericsoftware.kryo.Kryo.readReferenceOrNull(Kryo.java:834) at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:788) at org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229) at org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169) at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificColumnarIterator.hasNext(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) at 
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) at org.apache.spark.scheduler.Task.run(Task.scala:85) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPo
[jira] [Commented] (SPARK-17204) Spark 2.0 off heap RDD persistence with replication factor 2 leads to data corruption
[ https://issues.apache.org/jira/browse/SPARK-17204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15433878#comment-15433878 ] Michael Allman commented on SPARK-17204: [~rxin] I rebuilt from master as of commit 8fd63e808e15c8a7e78fef847183c86f332daa91 (which includes https://github.com/apache/spark/commit/8e223ea67acf5aa730ccf688802f17f6fc10907c) and am still experiencing this issue. I'll work on instructions to reproduce next. > Spark 2.0 off heap RDD persistence with replication factor 2 leads to data > corruption > - > > Key: SPARK-17204 > URL: https://issues.apache.org/jira/browse/SPARK-17204 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Michael Allman > > We use the OFF_HEAP storage level extensively. We've tried off-heap storage > with replication factor 2 and have always received exceptions on the executor > side very shortly after starting the job. For example: > {code} > com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: > 9086 > at > com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:137) > at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:670) > at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:781) > at > org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229) > at > org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169) > at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificColumnarIterator.hasNext(Unknown > Source) > at > 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} > or > {code} > java.lang.IndexOutOfBoundsException: Index: 6, Size: 0 > at java.util.ArrayList.rangeCheck(ArrayList.java:653) > at java.util.ArrayList.get(ArrayList.java:429) > at > com.esotericsoftware.kryo.util.MapReferenceResolver.getReadObject(MapReferenceResolver.java:60) > at com.esotericsoftware.kryo.Kryo.readReferenceOrNull(Kryo.java:834) > at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:788) > at > org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229) > at > org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169) > at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificColumnarIterator.hasNext(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.s
[jira] [Commented] (SPARK-17099) Incorrect result when HAVING clause is added to group by query
[ https://issues.apache.org/jira/browse/SPARK-17099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15433849#comment-15433849 ] Herman van Hovell commented on SPARK-17099: --- A small update. Disregard my previous diagnoses. This is caused by a bug in the {{EliminateOuterJoin}} rule. This converts the Right Outer join into an Inner join, see the following optimizer log: {noformat} === Applying Rule org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin === Project [sum(coalesce(int_col_5, int_col_2))#34L, (coalesce(int_col_5, int_col_2) * 2)#32] Project [sum(coalesce(int_col_5, int_col_2))#34L, (coalesce(int_col_5, int_col_2) * 2)#32] +- Filter (isnotnull(sum(cast(coalesce(int_col_5#4, int_col_2#13) as bigint))#37L) && (sum(cast(coalesce(int_col_5#4, int_col_2#13) as bigint))#37L > cast((coalesce(int_col_5#4, int_col_2#13)#38 * 2) as bigint))) +- Filter (isnotnull(sum(cast(coalesce(int_col_5#4, int_col_2#13) as bigint))#37L) && (sum(cast(coalesce(int_col_5#4, int_col_2#13) as bigint))#37L > cast((coalesce(int_col_5#4, int_col_2#13)#38 * 2) as bigint))) +- Aggregate [greatest(coalesce(int_col_5#14, 109), coalesce(int_col_5#4, -449)), coalesce(int_col_5#4, int_col_2#13)], [sum(cast(coalesce(int_col_5#4, int_col_2#13) as bigint)) AS sum(coalesce(int_col_5, int_col_2))#34L, (coalesce(int_col_5#4, int_col_2#13) * 2) AS (coalesce(int_col_5, int_col_2) * 2)#32, sum(cast(coalesce(int_col_5#4, int_col_2#13) as bigint)) AS sum(cast(coalesce(int_col_5#4, int_col_2#13) as bigint))#37L, coalesce(int_col_5#4, int_col_2#13) AS coalesce(int_col_5#4, int_col_2#13)#38] +- Aggregate [greatest(coalesce(int_col_5#14, 109), coalesce(int_col_5#4, -449)), coalesce(int_col_5#4, int_col_2#13)], [sum(cast(coalesce(int_col_5#4, int_col_2#13) as bigint)) AS sum(coalesce(int_col_5, int_col_2))#34L, (coalesce(int_col_5#4, int_col_2#13) * 2) AS (coalesce(int_col_5, int_col_2) * 2)#32, sum(cast(coalesce(int_col_5#4, int_col_2#13) as bigint)) AS 
sum(cast(coalesce(int_col_5#4, int_col_2#13) as bigint))#37L, coalesce(int_col_5#4, int_col_2#13) AS coalesce(int_col_5#4, int_col_2#13)#38] +- Filter isnotnull(coalesce(int_col_5#4, int_col_2#13)) +- Filter isnotnull(coalesce(int_col_5#4, int_col_2#13)) ! +- Join RightOuter, (int_col_2#13 = int_col_5#4) +- Join Inner, (int_col_2#13 = int_col_5#4) :- Project [value#2 AS int_col_5#4] :- Project [value#2 AS int_col_5#4] : +- SerializeFromObject [input[0, int, true] AS value#2]
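To see why this rewrite is unsafe, the join can be modeled in a few lines of plain Python (a toy, not Spark code; the single-element tables are illustrative): when the right-outer join's condition never matches, the null-extended row survives the later isnotnull(coalesce(...)) filter, so the outer and inner plans disagree.

```python
left = [5]    # stand-in for int_col_5 values
right = [7]   # stand-in for int_col_2 values; 5 != 7, so the condition never matches

def right_outer(l, r):
    out = []
    for b in r:
        matches = [(a, b) for a in l if a == b]
        out.extend(matches if matches else [(None, b)])  # null-extend unmatched right rows
    return out

def inner(l, r):
    return [(a, b) for b in r for a in l if a == b]

def coalesce(x, y):
    return x if x is not None else y

# The filter isnotnull(coalesce(...)) does NOT reject the null-extended row,
# because coalesce falls back to the non-null right-side value.
outer_rows = [t for t in right_outer(left, right) if coalesce(*t) is not None]
inner_rows = [t for t in inner(left, right) if coalesce(*t) is not None]
print(outer_rows, inner_rows)  # [(None, 7)] []
```

The rewrite is only valid when the filter would discard every null-extended row; coalesce over both sides breaks that assumption.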
[jira] [Commented] (SPARK-17120) Analyzer incorrectly optimizes plan to empty LocalRelation
[ https://issues.apache.org/jira/browse/SPARK-17120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15433833#comment-15433833 ] Herman van Hovell commented on SPARK-17120: --- PR https://github.com/apache/spark/pull/14661 fixes this > Analyzer incorrectly optimizes plan to empty LocalRelation > -- > > Key: SPARK-17120 > URL: https://issues.apache.org/jira/browse/SPARK-17120 > Project: Spark > Issue Type: Bug >Affects Versions: 2.1.0 >Reporter: Josh Rosen >Priority: Blocker > > Consider the following query: > {code} > sc.parallelize(Seq(97)).toDF("int_col_6").createOrReplaceTempView("table_3") > sc.parallelize(Seq(0)).toDF("int_col_1").createOrReplaceTempView("table_4") > println(sql(""" > SELECT > * > FROM ( > SELECT > COALESCE(t2.int_col_1, t1.int_col_6) AS int_col > FROM table_3 t1 > LEFT JOIN table_4 t2 ON false > ) t where (t.int_col) is not null > """).collect().toSeq) > {code} > In the innermost query, the LEFT JOIN's condition is {{false}} but > nevertheless the number of rows produced should equal the number of rows in > {{table_3}} (which is non-empty). Since no values are {{null}}, the outer > {{where}} should retain all rows, so the overall result of this query should > contain a single row with the value '97'. > Instead, the current Spark master (as of > 12a89e55cbd630fa2986da984e066cd07d3bf1f7 at least) returns no rows. Looking > at {{explain}}, it appears that the logical plan is optimizing to > {{LocalRelation }}, so Spark doesn't even run the query. My suspicion > is that there's a bug in constraint propagation or filter pushdown. > This issue doesn't seem to affect Spark 2.0, so I think it's a regression in > master. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
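The repro's expected semantics can be modeled in a few lines of plain Python (a toy, not Spark code): a LEFT JOIN whose condition is literally false null-extends every left row, and coalesce then recovers the non-null left value, so the correct answer is one row with 97; an inner join ON false returns nothing, which matches the empty-LocalRelation misoptimization.

```python
table_3 = [97]  # int_col_6
table_4 = [0]   # int_col_1

# LEFT JOIN table_4 t2 ON false: no right row ever matches, so every left row
# is kept and null-extended. An inner join ON false would produce no rows.
left_join = [(a, None) for a in table_3]
inner_join = []

def coalesce(x, y):
    return x if x is not None else y

result = [coalesce(b, a) for a, b in left_join if coalesce(b, a) is not None]
print(result)  # [97], the row the misoptimized plan drops
```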
[jira] [Comment Edited] (SPARK-17120) Analyzer incorrectly optimizes plan to empty LocalRelation
[ https://issues.apache.org/jira/browse/SPARK-17120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15433780#comment-15433780 ]

Herman van Hovell edited comment on SPARK-17120 at 8/23/16 11:02 PM:
---------------------------------------------------------------------

TL;DR the {{EliminateOuterJoin}} rule converts the outer join into an Inner join:

{noformat}
16/08/24 00:55:46 TRACE SparkOptimizer:
=== Applying Rule org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin ===
 Project [coalesce(int_col_1#12, int_col_6#4) AS int_col#16]     Project [coalesce(int_col_1#12, int_col_6#4) AS int_col#16]
 +- Filter isnotnull(coalesce(int_col_1#12, int_col_6#4))        +- Filter isnotnull(coalesce(int_col_1#12, int_col_6#4))
!   +- Join LeftOuter, false                                        +- Join Inner, false
    :- Project [value#2 AS int_col_6#4]                             :- Project [value#2 AS int_col_6#4]
    :  +- SerializeFromObject [input[0, int, true] AS value#2]      :  +- SerializeFromObject [input[0, int, true] AS value#2]
    :     +- ExternalRDD [obj#1]                                    :     +- ExternalRDD [obj#1]
    +- Project [value#10 AS int_col_1#12]                           +- Project [value#10 AS int_col_1#12]
       +- SerializeFromObject [input[0, int, true] AS value#10]        +- SerializeFromObject [input[0, int, true] AS value#10]
          +- ExternalRDD [obj#9]                                       +- ExternalRDD [obj#9]
{noformat}

It correctly assumes that a non-null literal cannot be, well, null, and then converts the join.

BTW: set {{spark.sql.crossJoin.enabled}} to {{true}} if you want to run this. Also use {{sc.setLogLevel("TRACE")}} to see what the optimizer is doing.

(updated this: my first attempt at a diagnosis was way off).
was (Author: hvanhovell):
TL;DR the {{EliminateOuterJoin}} rule converts the outer join into an Inner join:

{noformat}
16/08/24 00:55:46 TRACE SparkOptimizer:
=== Applying Rule org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin ===
 Project [coalesce(int_col_1#12, int_col_6#4) AS int_col#16]     Project [coalesce(int_col_1#12, int_col_6#4) AS int_col#16]
 +- Filter isnotnull(coalesce(int_col_1#12, int_col_6#4))        +- Filter isnotnull(coalesce(int_col_1#12, int_col_6#4))
!   +- Join LeftOuter, false                                        +- Join Inner, false
    :- Project [value#2 AS int_col_6#4]                             :- Project [value#2 AS int_col_6#4]
    :  +- SerializeFromObject [input[0, int, true] AS value#2]      :  +- SerializeFromObject [input[0, int, true] AS value#2]
    :     +- ExternalRDD [obj#1]                                    :     +- ExternalRDD [obj#1]
    +- Project [value#10 AS int_col_1#12]                           +- Project [value#10 AS int_col_1#12]
       +- SerializeFromObject [input[0, int, true] AS value#10]        +- SerializeFromObject [input[0, int, true] AS value#10]
          +- ExternalRDD [obj#9]                                       +- ExternalRDD [obj#9]
{noformat}

It correctly assumes that a non-null literal cannot be, well, null, and then converts the join.

BTW: set {{spark.sql.crossJoin.enabled}} to {{true}} if you want to run this. Also use {{sc.setLogLevel("TRACE")}} to see what the optimizer is doing.
> Analyzer incorrectly optimizes plan to empty LocalRelation > -- > > Key: SPARK-17120 > URL: https://issues.apache.org/jira/browse/SPARK-17120 > Project: Spark > Issue Type: Bug >Affects Versions: 2.1.0 >Reporter: Josh Rosen >Priority: Blocker > > Consider the following query: > {code} > sc.parallelize(Seq(97)).toDF("int_col_6").createOrReplaceTempView("table_3") > sc.parallelize(Seq(0)).toDF("int_col_1").createOrReplaceTempView("table_4") > println(sql(""" > SELECT > * > FROM ( > SELECT > COALESCE(t2.int_col_1, t1.int_col_6) AS int_col > FROM table_3 t1 > LEFT JOIN table_4 t2 ON false > ) t where (t.int_col) is not null > """).collect().toSeq) > {code} > In the innermost query, the LEFT JOIN's condition is {{false}} but > nevertheless the number of rows produced should equal the number of rows in > {{table_3}} (which is non-empty). Since no values are {{null}}, the outer > {{where}} should retain all rows, so the overall result of this query should > contain a single row with the value '97'. > Instead, the current Spark master (as of > 12a89e55cbd630fa2986da984e066cd07d3bf1f7 at least) returns no rows. Looking > at {{explain}}, it appears that the logical plan is optimizing to > {{LocalRelation }}, so Spark doesn't even run the query. My suspicion > is that there's a bug in constraint propagation or filter pushdown. >
[jira] [Comment Edited] (SPARK-17120) Analyzer incorrectly optimizes plan to empty LocalRelation
[ https://issues.apache.org/jira/browse/SPARK-17120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15433780#comment-15433780 ]

Herman van Hovell edited comment on SPARK-17120 at 8/23/16 11:01 PM:
---------------------------------------------------------------------

TL;DR the {{EliminateOuterJoin}} rule converts the outer join into an Inner join:

{noformat}
16/08/24 00:55:46 TRACE SparkOptimizer:
=== Applying Rule org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin ===
 Project [coalesce(int_col_1#12, int_col_6#4) AS int_col#16]     Project [coalesce(int_col_1#12, int_col_6#4) AS int_col#16]
 +- Filter isnotnull(coalesce(int_col_1#12, int_col_6#4))        +- Filter isnotnull(coalesce(int_col_1#12, int_col_6#4))
!   +- Join LeftOuter, false                                        +- Join Inner, false
    :- Project [value#2 AS int_col_6#4]                             :- Project [value#2 AS int_col_6#4]
    :  +- SerializeFromObject [input[0, int, true] AS value#2]      :  +- SerializeFromObject [input[0, int, true] AS value#2]
    :     +- ExternalRDD [obj#1]                                    :     +- ExternalRDD [obj#1]
    +- Project [value#10 AS int_col_1#12]                           +- Project [value#10 AS int_col_1#12]
       +- SerializeFromObject [input[0, int, true] AS value#10]        +- SerializeFromObject [input[0, int, true] AS value#10]
          +- ExternalRDD [obj#9]                                       +- ExternalRDD [obj#9]
{noformat}

It correctly assumes that a non-null literal cannot be, well, null, and then converts the join.

BTW: set {{spark.sql.crossJoin.enabled}} to {{true}} if you want to run this. Also use {{sc.setLogLevel("TRACE")}} to see what the optimizer is doing.

was (Author: hvanhovell):
TL;DR the {{PushDownPredicate}} rule pushed the {{false}} join predicate down, into the left hand side of the join (which should have been the right hand side). This caused the {{EliminateOuterJoin}} rule to rewrite this into an inner join.
The optimized plan before disabling the {{PushDownPredicate}} rule (I had to disable the {{PruneFilters}} rule to prevent the plan from being erased):

{noformat}
Project [coalesce(int_col_1#12, int_col_6#4) AS int_col#16]
+- Join Inner
   :- Project [value#2 AS int_col_6#4]
   :  +- Filter false
   :     +- SerializeFromObject [input[0, int, true] AS value#2]
   :        +- ExternalRDD [obj#1]
   +- Project [value#10 AS int_col_1#12]
      +- SerializeFromObject [input[0, int, true] AS value#10]
         +- ExternalRDD [obj#9]
{noformat}

The optimized plan after disabling the {{PushDownPredicate}} rule:

{noformat}
== Optimized Logical Plan ==
Filter isnotnull(int_col#16)
+- Project [coalesce(int_col_1#12, int_col_6#4) AS int_col#16]
   +- Join LeftOuter, false
      :- Project [value#2 AS int_col_6#4]
      :  +- SerializeFromObject [input[0, int, true] AS value#2]
      :     +- ExternalRDD [obj#1]
      +- Project [value#10 AS int_col_1#12]
         +- SerializeFromObject [input[0, int, true] AS value#10]
            +- ExternalRDD [obj#9]
{noformat}

Btw set {{spark.sql.crossJoin.enabled}} to {{true}} if you want to run this.

> Analyzer incorrectly optimizes plan to empty LocalRelation
> --
>
>                 Key: SPARK-17120
>                 URL: https://issues.apache.org/jira/browse/SPARK-17120
>             Project: Spark
>          Issue Type: Bug
>    Affects Versions: 2.1.0
>            Reporter: Josh Rosen
>            Priority: Blocker
>
> Consider the following query:
> {code}
> sc.parallelize(Seq(97)).toDF("int_col_6").createOrReplaceTempView("table_3")
> sc.parallelize(Seq(0)).toDF("int_col_1").createOrReplaceTempView("table_4")
> println(sql("""
>   SELECT
>     *
>   FROM (
>     SELECT
>       COALESCE(t2.int_col_1, t1.int_col_6) AS int_col
>     FROM table_3 t1
>     LEFT JOIN table_4 t2 ON false
>   ) t where (t.int_col) is not null
> """).collect().toSeq)
> {code}
> In the innermost query, the LEFT JOIN's condition is {{false}} but
> nevertheless the number of rows produced should equal the number of rows in
> {{table_3}} (which is non-empty). Since no values are {{null}}, the outer
> {{where}} should retain all rows, so the overall result of this query should
> contain a single row with the value '97'.
> Instead, the current Spark master (as of
> 12a89e55cbd630fa2986da984e066cd07d3bf1f7 at least) returns no rows. Looking
> at {{explain}}, it appears that the logical plan is optimizing to
> {{LocalRelation }}, so Spark doesn't even run the query. My suspicion
> is that there's a bug in constraint propagation or filter pushdown.
> This issue doesn't seem to affect Spark 2.0, so I think it's a regression in
> master.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (SPARK-17120) Analyzer incorrectly optimizes plan to empty LocalRelation
[ https://issues.apache.org/jira/browse/SPARK-17120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15433780#comment-15433780 ]

Herman van Hovell commented on SPARK-17120:
-------------------------------------------

TL;DR the {{PushDownPredicate}} rule pushed the {{false}} join predicate down, into the left hand side of the join (which should have been the right hand side). This caused the {{EliminateOuterJoin}} rule to rewrite this into an inner join.

The optimized plan before disabling the {{PushDownPredicate}} rule (I had to disable the {{PruneFilters}} rule to prevent the plan from being erased):

{noformat}
Project [coalesce(int_col_1#12, int_col_6#4) AS int_col#16]
+- Join Inner
   :- Project [value#2 AS int_col_6#4]
   :  +- Filter false
   :     +- SerializeFromObject [input[0, int, true] AS value#2]
   :        +- ExternalRDD [obj#1]
   +- Project [value#10 AS int_col_1#12]
      +- SerializeFromObject [input[0, int, true] AS value#10]
         +- ExternalRDD [obj#9]
{noformat}

The optimized plan after disabling the {{PushDownPredicate}} rule:

{noformat}
== Optimized Logical Plan ==
Filter isnotnull(int_col#16)
+- Project [coalesce(int_col_1#12, int_col_6#4) AS int_col#16]
   +- Join LeftOuter, false
      :- Project [value#2 AS int_col_6#4]
      :  +- SerializeFromObject [input[0, int, true] AS value#2]
      :     +- ExternalRDD [obj#1]
      +- Project [value#10 AS int_col_1#12]
         +- SerializeFromObject [input[0, int, true] AS value#10]
            +- ExternalRDD [obj#9]
{noformat}

Btw set {{spark.sql.crossJoin.enabled}} to {{true}} if you want to run this.
> Analyzer incorrectly optimizes plan to empty LocalRelation > -- > > Key: SPARK-17120 > URL: https://issues.apache.org/jira/browse/SPARK-17120 > Project: Spark > Issue Type: Bug >Affects Versions: 2.1.0 >Reporter: Josh Rosen >Priority: Blocker > > Consider the following query: > {code} > sc.parallelize(Seq(97)).toDF("int_col_6").createOrReplaceTempView("table_3") > sc.parallelize(Seq(0)).toDF("int_col_1").createOrReplaceTempView("table_4") > println(sql(""" > SELECT > * > FROM ( > SELECT > COALESCE(t2.int_col_1, t1.int_col_6) AS int_col > FROM table_3 t1 > LEFT JOIN table_4 t2 ON false > ) t where (t.int_col) is not null > """).collect().toSeq) > {code} > In the innermost query, the LEFT JOIN's condition is {{false}} but > nevertheless the number of rows produced should equal the number of rows in > {{table_3}} (which is non-empty). Since no values are {{null}}, the outer > {{where}} should retain all rows, so the overall result of this query should > contain a single row with the value '97'. > Instead, the current Spark master (as of > 12a89e55cbd630fa2986da984e066cd07d3bf1f7 at least) returns no rows. Looking > at {{explain}}, it appears that the logical plan is optimizing to > {{LocalRelation }}, so Spark doesn't even run the query. My suspicion > is that there's a bug in constraint propagation or filter pushdown. > This issue doesn't seem to affect Spark 2.0, so I think it's a regression in > master. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
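The diagnosis above (the {{false}} predicate pushed into the preserved side of the outer join) can be illustrated outside Spark. A small pure-Python sketch of a left outer join (a hand-rolled toy, not Spark's implementation) shows how the two plans diverge:

```python
# Toy left outer join (illustrative only, not Spark's implementation).
def left_outer_join(lhs, rhs, cond):
    out = []
    for l in lhs:
        matches = [(l, r) for r in rhs if cond(l, r)]
        # Preserved-side rows survive even with zero matches, null-extended.
        out.extend(matches if matches else [(l, None)])
    return out

left, right = [97], [0]

# Correct: evaluate the ON clause at the join. Every left row survives.
correct = left_outer_join(left, right, lambda l, r: False)

# Buggy rewrite: push the `false` predicate into the preserved (left) side.
# That empties the left input, and the outer-to-inner rewrite that follows
# then yields no rows at all.
pushed_left = [l for l in left if False]                # -> []
buggy = [(l, r) for l in pushed_left for r in right]    # inner join

print(correct, buggy)  # [(97, None)] []
```

Pushing the same predicate into the null-supplying (right) side would have been harmless here: the left row would still survive, null-extended.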
[jira] [Comment Edited] (SPARK-17201) Investigate numerical instability for MLOR without regularization
[ https://issues.apache.org/jira/browse/SPARK-17201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15433754#comment-15433754 ]

Seth Hendrickson edited comment on SPARK-17201 at 8/23/16 10:24 PM:
--------------------------------------------------------------------

Restating some of what was said on github:

_Concern is that for softmax regression without regularization, the Hessian becomes singular and Newton methods can run into problems. Excerpt from this [link|http://ufldl.stanford.edu/wiki/index.php/Softmax_Regression]: "Thus, the minimizer of J(θ) is not unique. (Interestingly, J(θ) is still convex, and thus gradient descent will not run into local optima problems. But the Hessian is singular/non-invertible, which causes a straightforward implementation of Newton's method to run into numerical problems.)"_

I looked into this. It is true that for softmax regression the Hessian is symmetric positive _semidefinite_, not symmetric positive definite. There is a good-enough proof of this [here|http://qwone.com/~jason/writing/convexLR.pdf]. Still, consider the quote from the resource mentioned above: "... which causes a *straightforward* implementation of Newton's method to run into numerical problems." It's true that the lack of positive definiteness can be a problem for *naive* Newton methods, but LBFGS is not a straightforward implementation - it does not use the Hessian directly, but an approximation to it. In fact, there is an abundance of resources showing that as long as the initial Hessian approximation is symmetric positive definite, the subsequent recursive updates are also symmetric positive definite. From one resource: "H^(-1)_(n+1) is positive definite (psd) when H^(-1)_n is. Assuming our initial guess of H0 is psd, it follows by induction each inverse Hessian estimate is as well. Since we can choose any H^(-1)_0 we want, including the identity matrix, this is easy to ensure."

I would appreciate other opinions on this to make sure I am understanding things correctly.
Seems like LBFGS will work fine even without regularization. Have we seen this problem in practice? cc [~dbtsai] [~WeichenXu123] was (Author: sethah): Restating some of what was said on github: _Concern is that for softmax regression without regularization, the Hessian becomes singular and Newton methods can run into problems. Excerpt from this [link|http://ufldl.stanford.edu/wiki/index.php/Softmax_Regression]: "Thus, the minimizer of J(θ) is not unique. (Interestingly, J(θ) is still convex, and thus gradient descent will not run into a local optima problems. But the Hessian is singular/non-invertible, which causes a straightforward implementation of Newton's method to run into numerical problems.)"_ I looked into this. It is true that for softmax regression the Hessian is Symmetric positive _semidefinite_, not symmetric positive definite. There is a good-enough proof of such [here|http://qwone.com/~jason/writing/convexLR.pdf]. Still consider the quote from the resources mentioned above "... which causes a *straightforward* implementation of Newton's method to run into numerical problems." It's true the lack of positive definiteness can be a problem for *naive* Newton methods, but LBFGS is not a straightforward implementation - it does not use the Hessian directly, but it uses an approximation to the Hessian. In fact, there are an abundance of resources showing that as long as the initial Hessian approximation is symmetric positive definite, then the subsequent recursive updates are also symmetric positive definite. From one resource: "H(-1)_(n + 1) is positive definite (psd) when H^(-1)_n is. Assuming our initial guess of H0 is psd, it follows by induction each inverse Hessian estimate is as well. Since we can choose any H^(-1)_0 we want, including the identity matrix, this is easy to ensure." I appreciate other opinions on this to make sure I am understanding things correctly. Seems like LBFGS will work fine even without regularization. 
cc [~dbtsai] [~WeichenXu123] > Investigate numerical instability for MLOR without regularization > - > > Key: SPARK-17201 > URL: https://issues.apache.org/jira/browse/SPARK-17201 > Project: Spark > Issue Type: Sub-task > Components: ML, MLlib >Reporter: Seth Hendrickson > > As mentioned > [here|http://ufldl.stanford.edu/wiki/index.php/Softmax_Regression], when no > regularization is applied in Softmax regression, second order Newton solvers > may run into numerical instability problems. We should investigate this in > practice and find a solution, possibly by implementing pivoting when no > regularization is applied. -- This message was sent by Atlassian JIRA (v6.3.4#6332) -
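The induction step claimed in the quoted resource can be checked numerically. Below is a pure-Python sketch of one textbook BFGS inverse-Hessian update, H⁺ = (I − ρsyᵀ)H(I − ρysᵀ) + ρssᵀ with ρ = 1/(yᵀs), on a 2x2 example (hand-rolled linear algebra for illustration; this is not the actual Breeze L-BFGS code Spark uses):

```python
# 2x2 matrix helpers (illustrative only).
def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def transpose(a):
    return [[a[j][i] for j in range(2)] for i in range(2)]

def add(a, b):
    return [[a[i][j] + b[i][j] for j in range(2)] for i in range(2)]

def outer(u, v, scale=1.0):
    return [[scale * u[i] * v[j] for j in range(2)] for i in range(2)]

def is_positive_definite(m):
    # A symmetric 2x2 matrix is positive definite iff trace > 0 and det > 0.
    trace = m[0][0] + m[1][1]
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return trace > 0 and det > 0

def bfgs_inverse_update(h, s, y):
    # Requires the curvature condition y.s > 0 (enforced by line search).
    rho = 1.0 / (y[0] * s[0] + y[1] * s[1])
    identity = [[1.0, 0.0], [0.0, 1.0]]
    a = add(identity, outer(s, y, -rho))            # I - rho * s y^T
    h_new = matmul(matmul(a, h), transpose(a))      # (I - rho s y^T) H (I - rho y s^T)
    return add(h_new, outer(s, s, rho))             # ... + rho * s s^T

h0 = [[1.0, 0.0], [0.0, 1.0]]                       # pd initial guess (identity)
h1 = bfgs_inverse_update(h0, s=[1.0, 0.0], y=[2.0, 1.0])
print(h1)  # [[0.75, -0.5], [-0.5, 1.0]] -- still symmetric positive definite
```

The update is a congruence transform of H plus a rank-one positive term, which is why positive definiteness of the approximation is preserved even when the true Hessian is only semidefinite.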
[jira] [Commented] (SPARK-17201) Investigate numerical instability for MLOR without regularization
[ https://issues.apache.org/jira/browse/SPARK-17201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15433754#comment-15433754 ] Seth Hendrickson commented on SPARK-17201: -- Restating some of what was said on github: _Concern is that for softmax regression without regularization, the Hessian becomes singular and Newton methods can run into problems. Excerpt from this [link|http://ufldl.stanford.edu/wiki/index.php/Softmax_Regression]: "Thus, the minimizer of J(θ) is not unique. (Interestingly, J(θ) is still convex, and thus gradient descent will not run into a local optima problems. But the Hessian is singular/non-invertible, which causes a straightforward implementation of Newton's method to run into numerical problems.)"_ I looked into this. It is true that for softmax regression the Hessian is Symmetric positive _semidefinite_, not symmetric positive definite. There is a good-enough proof of such [here|http://qwone.com/~jason/writing/convexLR.pdf]. Still consider the quote from the resources mentioned above "... which causes a *straightforward* implementation of Newton's method to run into numerical problems." It's true the lack of positive definiteness can be a problem for *naive* Newton methods, but LBFGS is not a straightforward implementation - it does not use the Hessian directly, but it uses an approximation to the Hessian. In fact, there are an abundance of resources showing that as long as the initial Hessian approximation is symmetric positive definite, then the subsequent recursive updates are also symmetric positive definite. From one resource: "H(-1)_(n + 1) is positive definite (psd) when H^(-1)_n is. Assuming our initial guess of H0 is psd, it follows by induction each inverse Hessian estimate is as well. Since we can choose any H^(-1)_0 we want, including the identity matrix, this is easy to ensure." I appreciate other opinions on this to make sure I am understanding things correctly. 
Seems like LBFGS will work fine even without regularization. cc [~dbtsai] [~WeichenXu123] > Investigate numerical instability for MLOR without regularization > - > > Key: SPARK-17201 > URL: https://issues.apache.org/jira/browse/SPARK-17201 > Project: Spark > Issue Type: Sub-task > Components: ML, MLlib >Reporter: Seth Hendrickson > > As mentioned > [here|http://ufldl.stanford.edu/wiki/index.php/Softmax_Regression], when no > regularization is applied in Softmax regression, second order Newton solvers > may run into numerical instability problems. We should investigate this in > practice and find a solution, possibly by implementing pivoting when no > regularization is applied. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17200) Automate building and testing on Windows
[ https://issues.apache.org/jira/browse/SPARK-17200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15433739#comment-15433739 ] Hyukjin Kwon commented on SPARK-17200: -- Thank you so much! > Automate building and testing on Windows > - > > Key: SPARK-17200 > URL: https://issues.apache.org/jira/browse/SPARK-17200 > Project: Spark > Issue Type: Test > Components: Build, Project Infra >Affects Versions: 2.0.0 >Reporter: Hyukjin Kwon > > It seems there is no automated tests on Windows (I am not sure this is being > done manually before each release). > Assuming from this comment, > https://github.com/apache/spark/pull/14743#issuecomment-241473794, It seems > we have Windows infrastructure in the AMPLab Jenkins cluster. > It seems pretty much important because as far as I know we should manually > test and verify some patches related with Windows-specific problem. > For example, https://github.com/apache/spark/pull/14743#issuecomment-241473794 > I was thinking a combination with Travis CI and Docker with Windows image. > Although this might not be merged, I will try to give a shot with this (at > least for SparkR) anyway (just to verify some PRs I just linked above). > I would appreciate it if I can hear any thoughts about this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17099) Incorrect result when HAVING clause is added to group by query
[ https://issues.apache.org/jira/browse/SPARK-17099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Josh Rosen updated SPARK-17099:
-------------------------------
    Priority: Blocker  (was: Critical)

> Incorrect result when HAVING clause is added to group by query
> --
>
>                 Key: SPARK-17099
>                 URL: https://issues.apache.org/jira/browse/SPARK-17099
>             Project: Spark
>          Issue Type: Bug
>    Affects Versions: 2.1.0
>            Reporter: Josh Rosen
>            Priority: Blocker
>
> Random query generation uncovered the following query which returns incorrect
> results when run on Spark SQL. This wasn't the original query uncovered by
> the generator, since I performed a bit of minimization to try to make it more
> understandable.
> With the following tables:
> {code}
> val t1 = sc.parallelize(Seq(-234, 145, 367, 975, 298)).toDF("int_col_5")
> val t2 = sc.parallelize(
>   Seq(
>     (-769, -244),
>     (-800, -409),
>     (940, 86),
>     (-507, 304),
>     (-367, 158))
> ).toDF("int_col_2", "int_col_5")
> t1.registerTempTable("t1")
> t2.registerTempTable("t2")
> {code}
> Run
> {code}
> SELECT
>   (SUM(COALESCE(t1.int_col_5, t2.int_col_2))),
>   ((COALESCE(t1.int_col_5, t2.int_col_2)) * 2)
> FROM t1
> RIGHT JOIN t2
>   ON (t2.int_col_2) = (t1.int_col_5)
> GROUP BY GREATEST(COALESCE(t2.int_col_5, 109), COALESCE(t1.int_col_5, -449)),
>   COALESCE(t1.int_col_5, t2.int_col_2)
> HAVING (SUM(COALESCE(t1.int_col_5, t2.int_col_2))) > ((COALESCE(t1.int_col_5, t2.int_col_2)) * 2)
> {code}
> In Spark SQL, this returns an empty result set, whereas Postgres returns four
> rows. However, if I omit the {{HAVING}} clause I see that the group's rows
> are being incorrectly filtered by the {{HAVING}} clause:
> {code}
> +-------------------------------------+--------------------------------------+
> | sum(coalesce(int_col_5, int_col_2)) | (coalesce(int_col_5, int_col_2) * 2) |
> +-------------------------------------+--------------------------------------+
> | -507                                | -1014                                |
> | 940                                 | 1880                                 |
> | -769                                | -1538                                |
> | -367                                | -734                                 |
> | -800                                | -1600                                |
> +-------------------------------------+--------------------------------------+
> {code}
> Based on this, the output after adding the {{HAVING}} should contain four
> rows, not zero.
> I'm not sure how to further shrink this in a straightforward way, so I'm > opening this bug to get help in triaging further. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
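The expected four-row result can be reproduced by emulating the query's semantics in plain Python (an illustrative sketch with a hand-rolled `coalesce`, not Spark code):

```python
# Plain-Python emulation of the reported query (illustrative, not Spark code).
t1 = [-234, 145, 367, 975, 298]                        # t1.int_col_5
t2 = [(-769, -244), (-800, -409), (940, 86),
      (-507, 304), (-367, 158)]                        # (int_col_2, int_col_5)

def coalesce(*args):
    return next((a for a in args if a is not None), None)

# RIGHT JOIN ... ON t2.int_col_2 = t1.int_col_5: no values overlap, so every
# t2 row is kept with the t1 side null-extended.
rows = []
for c2, c5_t2 in t2:
    matched = [c5 for c5 in t1 if c5 == c2]
    if matched:
        rows.extend((c5, c2, c5_t2) for c5 in matched)
    else:
        rows.append((None, c2, c5_t2))

# GROUP BY GREATEST(COALESCE(t2.int_col_5, 109), COALESCE(t1.int_col_5, -449)),
#          COALESCE(t1.int_col_5, t2.int_col_2)
sums = {}
for c5_t1, c2, c5_t2 in rows:
    key = (max(coalesce(c5_t2, 109), coalesce(c5_t1, -449)),
           coalesce(c5_t1, c2))
    sums[key] = sums.get(key, 0) + coalesce(c5_t1, c2)

# HAVING SUM(...) > (COALESCE(...) * 2): x > 2x holds exactly when x < 0,
# so four of the five groups (all but 940) should survive.
having = [(s, 2 * v) for (_, v), s in sums.items() if s > 2 * v]
print(sorted(having))
```

Each group here contains a single null-extended row, so the sum equals the coalesced value itself, and the `HAVING` condition reduces to "is the value negative", matching the four rows Postgres returns.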
[jira] [Updated] (SPARK-17120) Analyzer incorrectly optimizes plan to empty LocalRelation
[ https://issues.apache.org/jira/browse/SPARK-17120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-17120: --- Target Version/s: 2.0.1, 2.1.0 (was: 2.1.0) > Analyzer incorrectly optimizes plan to empty LocalRelation > -- > > Key: SPARK-17120 > URL: https://issues.apache.org/jira/browse/SPARK-17120 > Project: Spark > Issue Type: Bug >Affects Versions: 2.1.0 >Reporter: Josh Rosen >Priority: Blocker > > Consider the following query: > {code} > sc.parallelize(Seq(97)).toDF("int_col_6").createOrReplaceTempView("table_3") > sc.parallelize(Seq(0)).toDF("int_col_1").createOrReplaceTempView("table_4") > println(sql(""" > SELECT > * > FROM ( > SELECT > COALESCE(t2.int_col_1, t1.int_col_6) AS int_col > FROM table_3 t1 > LEFT JOIN table_4 t2 ON false > ) t where (t.int_col) is not null > """).collect().toSeq) > {code} > In the innermost query, the LEFT JOIN's condition is {{false}} but > nevertheless the number of rows produced should equal the number of rows in > {{table_3}} (which is non-empty). Since no values are {{null}}, the outer > {{where}} should retain all rows, so the overall result of this query should > contain a single row with the value '97'. > Instead, the current Spark master (as of > 12a89e55cbd630fa2986da984e066cd07d3bf1f7 at least) returns no rows. Looking > at {{explain}}, it appears that the logical plan is optimizing to > {{LocalRelation }}, so Spark doesn't even run the query. My suspicion > is that there's a bug in constraint propagation or filter pushdown. > This issue doesn't seem to affect Spark 2.0, so I think it's a regression in > master. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-17194) When emitting SQL for string literals Spark should use single quotes, not double
[ https://issues.apache.org/jira/browse/SPARK-17194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell resolved SPARK-17194. --- Resolution: Fixed Fix Version/s: 2.1.0 2.0.1 > When emitting SQL for string literals Spark should use single quotes, not > double > > > Key: SPARK-17194 > URL: https://issues.apache.org/jira/browse/SPARK-17194 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Minor > Fix For: 2.0.1, 2.1.0 > > > When Spark emits SQL for a string literal, it should wrap the string in > single quotes, not double quotes. Databases which adhere more strictly to the > ANSI SQL standards, such as Postgres, allow only single-quotes to be used for > denoting string literals (see http://stackoverflow.com/a/1992331/590203). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
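The quoting rule the fix adopts is small enough to sketch. A minimal Python illustration (`sql_string_literal` is a hypothetical helper name, not the actual Spark function):

```python
# Emit an ANSI-SQL string literal: wrap in single quotes and escape any
# embedded single quote by doubling it (the standard SQL escape).
def sql_string_literal(s: str) -> str:
    return "'" + s.replace("'", "''") + "'"

print(sql_string_literal("it's"))  # 'it''s' -- accepted by Postgres and other ANSI databases
```

Double-quoted strings are reserved for identifiers in ANSI SQL, which is why databases like Postgres reject them as literals.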
[jira] [Commented] (SPARK-16581) Making JVM backend calling functions public
[ https://issues.apache.org/jira/browse/SPARK-16581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15433492#comment-15433492 ] Shivaram Venkataraman commented on SPARK-16581: --- [~yinxusen] I created a PR for this as we are trying to get the CRAN release out and it will be good to have this in it. Apologies if this resulted in duplicate work. [~felixcheung] Could you comment on the PR what kind of redesign you have in mind for the S4 class etc. ? > Making JVM backend calling functions public > --- > > Key: SPARK-16581 > URL: https://issues.apache.org/jira/browse/SPARK-16581 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Shivaram Venkataraman > > As described in the design doc in SPARK-15799, to help packages that need to > call into the JVM, it will be good to expose some of the R -> JVM functions > we have. > As a part of this we could also rename, reformat the functions to make them > more user friendly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16581) Making JVM backend calling functions public
[ https://issues.apache.org/jira/browse/SPARK-16581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15433479#comment-15433479 ] Apache Spark commented on SPARK-16581: -- User 'shivaram' has created a pull request for this issue: https://github.com/apache/spark/pull/14775 > Making JVM backend calling functions public > --- > > Key: SPARK-16581 > URL: https://issues.apache.org/jira/browse/SPARK-16581 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Shivaram Venkataraman > > As described in the design doc in SPARK-15799, to help packages that need to > call into the JVM, it will be good to expose some of the R -> JVM functions > we have. > As a part of this we could also rename, reformat the functions to make them > more user friendly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16581) Making JVM backend calling functions public
[ https://issues.apache.org/jira/browse/SPARK-16581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16581: Assignee: (was: Apache Spark) > Making JVM backend calling functions public > --- > > Key: SPARK-16581 > URL: https://issues.apache.org/jira/browse/SPARK-16581 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Shivaram Venkataraman > > As described in the design doc in SPARK-15799, to help packages that need to > call into the JVM, it will be good to expose some of the R -> JVM functions > we have. > As a part of this we could also rename, reformat the functions to make them > more user friendly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16581) Making JVM backend calling functions public
[ https://issues.apache.org/jira/browse/SPARK-16581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16581: Assignee: Apache Spark > Making JVM backend calling functions public > --- > > Key: SPARK-16581 > URL: https://issues.apache.org/jira/browse/SPARK-16581 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Shivaram Venkataraman >Assignee: Apache Spark > > As described in the design doc in SPARK-15799, to help packages that need to > call into the JVM, it will be good to expose some of the R -> JVM functions > we have. > As a part of this we could also rename, reformat the functions to make them > more user friendly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-16508) Fix documentation warnings found by R CMD check
[ https://issues.apache.org/jira/browse/SPARK-16508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman resolved SPARK-16508. --- Resolution: Fixed Assignee: Junyang Qian Fix Version/s: 2.1.0 2.0.1 Resolved by https://github.com/apache/spark/pull/14705 and https://github.com/apache/spark/pull/14734 > Fix documentation warnings found by R CMD check > --- > > Key: SPARK-16508 > URL: https://issues.apache.org/jira/browse/SPARK-16508 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Shivaram Venkataraman >Assignee: Junyang Qian > Fix For: 2.0.1, 2.1.0 > > > A full list of warnings after the fixes in SPARK-16507 is at > https://gist.github.com/shivaram/62866c4ca59c5d34b8963939cf04b5eb -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17204) Spark 2.0 off heap RDD persistence with replication factor 2 leads to data corruption
[ https://issues.apache.org/jira/browse/SPARK-17204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15433408#comment-15433408 ] Michael Allman commented on SPARK-17204: [~rxin] I'll give it a try. Thanks for the heads up. I missed that Jira/PR. > Spark 2.0 off heap RDD persistence with replication factor 2 leads to data > corruption > - > > Key: SPARK-17204 > URL: https://issues.apache.org/jira/browse/SPARK-17204 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Michael Allman > > We use the OFF_HEAP storage level extensively. We've tried off-heap storage > with replication factor 2 and have always received exceptions on the executor > side very shortly after starting the job. For example: > {code} > com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: > 9086 > at > com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:137) > at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:670) > at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:781) > at > org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229) > at > org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169) > at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificColumnarIterator.hasNext(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown > Source) > at > 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} > or > {code} > java.lang.IndexOutOfBoundsException: Index: 6, Size: 0 > at java.util.ArrayList.rangeCheck(ArrayList.java:653) > at java.util.ArrayList.get(ArrayList.java:429) > at > com.esotericsoftware.kryo.util.MapReferenceResolver.getReadObject(MapReferenceResolver.java:60) > at com.esotericsoftware.kryo.Kryo.readReferenceOrNull(Kryo.java:834) > at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:788) > at > org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229) > at > org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169) > at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461) > at 
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificColumnarIterator.hasNext(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:
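For reference, the storage level described above can be constructed explicitly. This is a hypothetical sketch of the configuration only (the reporter's actual job is not shown), using the `StorageLevel` factory available in Spark 2.0:

```scala
import org.apache.spark.storage.StorageLevel

// Hypothetical repro configuration, not the reporter's actual job:
// OFF_HEAP (disk + memory + off-heap, serialized) with replication factor 2.
val offHeapReplicated = StorageLevel(
  useDisk = true, useMemory = true, useOffHeap = true,
  deserialized = false, replication = 2)

// rdd.persist(offHeapReplicated)  // replicated blocks travel through Kryo
```

Per the report, the corruption appears only when `replication` is greater than 1; the same job with the plain `OFF_HEAP` level (replication 1) does not hit the Kryo deserialization errors shown above.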
[jira] [Commented] (SPARK-17204) Spark 2.0 off heap RDD persistence with replication factor 2 leads to data corruption
[ https://issues.apache.org/jira/browse/SPARK-17204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15433394#comment-15433394 ] Reynold Xin commented on SPARK-17204: - Does this problem still exist on today's master/branch-2.0? SPARK-16550 was merged. It might be fixed already.
[jira] [Commented] (SPARK-5928) Remote Shuffle Blocks cannot be more than 2 GB
[ https://issues.apache.org/jira/browse/SPARK-5928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1541#comment-1541 ] yeshwanth commented on SPARK-5928: -- I ran into this issue in a production job: org.apache.spark.shuffle.FetchFailedException: Too large frame: 4323231670 at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:323) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:300) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:51) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:152) at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:45) at org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:97) at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:98) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:270) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) at org.apache.spark.scheduler.Task.run(Task.scala:89) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.IllegalArgumentException: Too large frame: 4323231670 at org.spark-project.guava.base.Preconditions.checkArgument(Preconditions.java:119) at org.apache.spark.network.util.TransportFrameDecoder.decodeNext(TransportFrameDecoder.java:134) at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:82) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846) at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131) at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) ... 1 more > Remote Shuffle Blocks cannot be more than 2 GB > -- > > Key: SPARK-5928 > URL: https://issues.apache.org/jira/browse/SPARK-5928 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Imran Rashid > > If a shuffle block is over 2GB, the shuffle fails, with an uninformative > exception. 
The tasks get retried a few times and then eventually the job > fails. > Here is an example program which can cause the exception: > {code} > val rdd = sc.parallelize(1 to 1e6.toInt, 1).map{ ignore => > val n = 3e3.toInt > val arr = new Array[Byte](n) > //need to make sure the array doesn't compress to something small > scala.util.Random.nextBytes(arr) > arr > } > rdd.map { x => (1, x)}.groupByKey().count() > {code} > Note that you can't trigger this exception in local mode, it only happens on > remote f
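Until larger blocks are supported, the usual workaround is to raise the partition count so that no single shuffle block approaches the 2 GB frame limit. A rough back-of-the-envelope helper (hypothetical, not part of Spark) might look like:

```scala
// Hypothetical helper: estimate the partition count needed to keep each
// shuffle block well under the ~2 GB (Int.MaxValue bytes) frame limit.
object ShufflePartitionEstimator {
  val MaxFrameBytes: Long = Int.MaxValue.toLong // the hard per-block limit

  // safetyFactor leaves headroom for skew; 4.0 targets ~512 MB blocks.
  def partitionsNeeded(totalShuffleBytes: Long, safetyFactor: Double = 4.0): Int = {
    require(totalShuffleBytes >= 0 && safetyFactor >= 1.0)
    val targetBlockBytes = (MaxFrameBytes / safetyFactor).toLong
    math.max(1, math.ceil(totalShuffleBytes.toDouble / targetBlockBytes).toInt)
  }
}
```

For the 4323231670-byte frame in the trace above, this suggests splitting that data across at least 9 partitions (e.g. via `rdd.repartition(n)`, a larger partitioner, or `spark.sql.shuffle.partitions`), since skewed keys such as the single-key `groupByKey` in the repro cannot be split this way.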
[jira] [Comment Edited] (SPARK-14560) Cooperative Memory Management for Spillables
[ https://issues.apache.org/jira/browse/SPARK-14560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15433210#comment-15433210 ] Sean Owen edited comment on SPARK-14560 at 8/23/16 5:11 PM: I have a few somewhat-specific additional data points: More memory didn't seem to help. A job that ran comfortably with tens of gigabytes total with Java serialization would fail even with almost a terabyte of memory available. The memory fraction was at the default of 0.75, or up to 0.9. I don't think we tried less, on the theory that the shuffle memory ought to be tracked as part of the 'storage' memory? But the same thing happened with the legacy memory manager. Unhelpfully, the heap appeared full of byte[] and String. The shuffle involved user classes that were reasonably complex: nested objects involving case classes, third-party library classes, etc. None of them were registered with Kryo. I tried registering most of them, on the theory that this was causing some in-memory serialized representation to become huge. It didn't seem to help, but I still wonder if there's a lead there. When Kryo doesn't know about a class it serializes its class name first, but not the class names of everything in the graph (right?) so it can only make so much difference. Java serialization does the same. For the record, it's just this Spark app that reproduces it: https://github.com/sryza/aas/blob/1st-edition/ch08-geotime/src/main/scala/com/cloudera/datascience/geotime/RunGeoTime.scala I have not tried on Spark 2, only 1.6 (CDH 5.8 flavor). was (Author: srowen): I have a few somewhat-specific additional data points: More memory didn't seem to help. A job that ran comfortably with tens of gigabytes total with Java serialization would fail even with almost a terabyte of memory available. The memory fraction was at the default of 0.75, or up to 0.9. 
I don't think we tried less, on the theory that the shuffle memory ought to be tracked as part of the 'storage' memory? But the same thing happened with the legacy memory manager. Unhelpfully, the heap appeared full of byte[] and String. The shuffle involved user classes that were reasonably complex: nested objects involving case classes, third-party library classes, etc. None of them were registered with Kryo. I tried registering most of them, on the theory that this was causing some in-memory serialized representation to become huge. It didn't seem to help, but I still wonder if there's a lead there. When Kryo doesn't know about a class it serializes its class name first, but not the class names of everything in the graph (right?) so it can only make so much difference. Java serialization does the same. For the record, it's just this Spark app that reproduces it: https://github.com/sryza/aas/blob/master/ch08-geotime/src/main/scala/com/cloudera/datascience/geotime/RunGeoTime.scala I have not tried on Spark 2, only 1.6 (CDH 5.8 flavor). > Cooperative Memory Management for Spillables > > > Key: SPARK-14560 > URL: https://issues.apache.org/jira/browse/SPARK-14560 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.1 >Reporter: Imran Rashid >Assignee: Lianhui Wang > Fix For: 2.0.0 > > > SPARK-10432 introduced cooperative memory management for SQL operators that > can spill; however, {{Spillable}} s used by the old RDD api still do not > cooperate. 
This can lead to memory starvation, in particular on a > shuffle-to-shuffle stage, eventually resulting in errors like: > {noformat} > 16/03/28 08:59:54 INFO memory.TaskMemoryManager: Memory used in task 3081 > 16/03/28 08:59:54 INFO memory.TaskMemoryManager: Acquired by > org.apache.spark.shuffle.sort.ShuffleExternalSorter@69ab0291: 32.0 KB > 16/03/28 08:59:54 INFO memory.TaskMemoryManager: 1317230346 bytes of memory > were used by task 3081 but are not associated with specific consumers > 16/03/28 08:59:54 INFO memory.TaskMemoryManager: 1317263114 bytes of memory > are used for execution and 1710484 bytes of memory are used for storage > 16/03/28 08:59:54 ERROR executor.Executor: Managed memory leak detected; size > = 1317230346 bytes, TID = 3081 > 16/03/28 08:59:54 ERROR executor.Executor: Exception in task 533.0 in stage > 3.0 (TID 3081) > java.lang.OutOfMemoryError: Unable to acquire 75 bytes of memory, got 0 > at > org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:120) > at > org.apache.spark.shuffle.sort.ShuffleExternalSorter.acquireNewPageIfNecessary(ShuffleExternalSorter.java:346) > at > org.apache.spark.shuffle.sort.ShuffleExternalSorter.insertRecord(ShuffleExternalSorter.java:367) > at > org.apache.spark.shuffle.sort.UnsafeShuffleWriter.insertRecordIntoSorter
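The Kryo registration attempt mentioned in the comment above can be sketched as follows; the case class name and fields are hypothetical stand-ins, not the actual classes from the linked app:

```scala
import org.apache.spark.SparkConf

// Hypothetical stand-in for the user's nested case classes.
case class GeoTrip(license: String, pickupTime: Long)

// Sketch of registering user classes with Kryo, as tried in the comment above.
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrationRequired", "true") // fail fast on anything unregistered
  .registerKryoClasses(Array(classOf[GeoTrip]))
```

With `spark.kryo.registrationRequired` set, Kryo throws on any unregistered class instead of silently writing its fully qualified class name, which makes it easier to verify that registration is actually complete.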
[jira] [Commented] (SPARK-17200) Automate building and testing on Windows
[ https://issues.apache.org/jira/browse/SPARK-17200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15433209#comment-15433209 ] Dongjoon Hyun commented on SPARK-17200: --- Hi, [~hyukjin.kwon]. FYI, for Windows CI there is AppVeyor ( https://www.appveyor.com/ ), which is similar to Travis CI. Some Apache projects use Travis CI / Windows CI / Jenkins CI in parallel, like the following: https://github.com/apache/reef/pull/1099 > Automate building and testing on Windows > - > > Key: SPARK-17200 > URL: https://issues.apache.org/jira/browse/SPARK-17200 > Project: Spark > Issue Type: Test > Components: Build, Project Infra >Affects Versions: 2.0.0 >Reporter: Hyukjin Kwon > > It seems there are no automated tests on Windows (I am not sure whether this is being > done manually before each release). > Judging from this comment, > https://github.com/apache/spark/pull/14743#issuecomment-241473794, it seems > we have Windows infrastructure in the AMPLab Jenkins cluster. > This seems quite important because, as far as I know, we currently have to manually > test and verify patches related to Windows-specific problems. > For example, https://github.com/apache/spark/pull/14743#issuecomment-241473794 > I was thinking of a combination of Travis CI and Docker with a Windows image. > Although this might not be merged, I will try to give this a shot (at > least for SparkR) anyway (just to verify some PRs I just linked above). > I would appreciate it if I could hear any thoughts about this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14560) Cooperative Memory Management for Spillables
[ https://issues.apache.org/jira/browse/SPARK-14560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15433210#comment-15433210 ] Sean Owen commented on SPARK-14560: --- I have a few somewhat-specific additional data points: More memory didn't seem to help. A job that ran comfortably with tens of gigabytes total with Java serialization would fail even with almost a terabyte of memory available. The memory fraction was at the default of 0.75, or up to 0.9. I don't think we tried less, on the theory that the shuffle memory ought to be tracked as part of the 'storage' memory? But the same thing happened with the legacy memory manager. Unhelpfully, the heap appeared full of byte[] and String. The shuffle involved user classes that were reasonably complex: nested objects involving case classes, third-party library classes, etc. None of them were registered with Kryo. I tried registering most of them, on the theory that this was causing some in-memory serialized representation to become huge. It didn't seem to help, but I still wonder if there's a lead there. When Kryo doesn't know about a class it serializes its class name first, but not the class names of everything in the graph (right?) so it can only make so much difference. Java serialization does the same. For the record, it's just this Spark app that reproduces it: https://github.com/sryza/aas/blob/master/ch08-geotime/src/main/scala/com/cloudera/datascience/geotime/RunGeoTime.scala I have not tried on Spark 2, only 1.6 (CDH 5.8 flavor).
[jira] [Commented] (SPARK-14560) Cooperative Memory Management for Spillables
[ https://issues.apache.org/jira/browse/SPARK-14560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15433200#comment-15433200 ] Davies Liu commented on SPARK-14560: Even with SPARK-4452, we still can't say the OOM problem is fixed completely, because the memory used by Java objects (in RDDs) can't be measured or predicted exactly; in cases where they use more memory than we expect, the job will still OOM. The Java serializer may use less memory than Kryo, which is why it helped in this case. Also, some memory is not tracked by the memory manager, for example the buffers used by the shuffle reader; these can also grow beyond the capacity we reserve for everything else, which could be another cause of OOMs in large-scale jobs. A workaround could be decreasing the memory fraction to reserve more memory for all the other allocations. Have you tried that? > Cooperative Memory Management for Spillables > > > Key: SPARK-14560 > URL: https://issues.apache.org/jira/browse/SPARK-14560 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.1 >Reporter: Imran Rashid >Assignee: Lianhui Wang > Fix For: 2.0.0 > > > SPARK-10432 introduced cooperative memory management for SQL operators that > can spill; however, {{Spillable}} s used by the old RDD api still do not > cooperate. 
This can lead to memory starvation, in particular on a > shuffle-to-shuffle stage, eventually resulting in errors like: > {noformat} > 16/03/28 08:59:54 INFO memory.TaskMemoryManager: Memory used in task 3081 > 16/03/28 08:59:54 INFO memory.TaskMemoryManager: Acquired by > org.apache.spark.shuffle.sort.ShuffleExternalSorter@69ab0291: 32.0 KB > 16/03/28 08:59:54 INFO memory.TaskMemoryManager: 1317230346 bytes of memory > were used by task 3081 but are not associated with specific consumers > 16/03/28 08:59:54 INFO memory.TaskMemoryManager: 1317263114 bytes of memory > are used for execution and 1710484 bytes of memory are used for storage > 16/03/28 08:59:54 ERROR executor.Executor: Managed memory leak detected; size > = 1317230346 bytes, TID = 3081 > 16/03/28 08:59:54 ERROR executor.Executor: Exception in task 533.0 in stage > 3.0 (TID 3081) > java.lang.OutOfMemoryError: Unable to acquire 75 bytes of memory, got 0 > at > org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:120) > at > org.apache.spark.shuffle.sort.ShuffleExternalSorter.acquireNewPageIfNecessary(ShuffleExternalSorter.java:346) > at > org.apache.spark.shuffle.sort.ShuffleExternalSorter.insertRecord(ShuffleExternalSorter.java:367) > at > org.apache.spark.shuffle.sort.UnsafeShuffleWriter.insertRecordIntoSorter(UnsafeShuffleWriter.java:237) > at > org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:164) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:89) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {noformat} > This can happen anytime the 
shuffle read side requires more memory than what > is available for the task. Since the shuffle-read side doubles its memory > request each time, it can easily end up acquiring all of the available > memory, even if it does not use it. Eg., say that after the final spill, the > shuffle-read side requires 10 MB more memory, and there is 15 MB of memory > available. But if it starts at 2 MB, it will double to 4, 8, and then > request 16 MB of memory, and in fact get all available 15 MB. Since the 15 > MB of memory is sufficient, it will not spill, and will continue holding on > to all available memory. But this leaves *no* memory available for the > shuffle-write side. Since the shuffle-write side cannot request the > shuffle-read side to free up memory, this leads to an OOM. > The simple solution is to make {{Spillable}} implement {{MemoryConsumer}} as > well, so RDDs can benefit from the cooperative memory management introduced > by SPARK-10342. > Note that an additional improvement would be for the shuffle-read side to > simple release unused memory, without spilling, in case that would leave > enough memory, and only spill if that was inadequate. However that can come > as a later improvement. > *Workaround*: You can set > {{spark.shuffle.spill.numElementsForceSpillThreshold=N}} to force spilling to > occur
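The doubling behavior described in the issue can be sketched with a small simulation (units are MB; this is a hypothetical sketch, not Spark's actual MemoryConsumer logic). Starting from 2 MB held, needing 10 MB more, with 15 MB free, the reader ends up holding 16 MB and leaves only 1 MB for the shuffle-write side:

```scala
// Simulation of the doubling acquisition pattern (all values in MB).
// Hypothetical sketch, not Spark's actual memory-manager code.
def acquireByDoubling(held0: Long, needed: Long, free0: Long): (Long, Long) = {
  require(held0 > 0 && needed >= 0 && free0 >= 0)
  var held = held0
  var free = free0
  // Keep doubling the current holding until enough has been acquired
  // or the pool is exhausted, whichever comes first.
  while (held - held0 < needed && free > 0) {
    val granted = math.min(held, free) // request to double; get what's left
    held += granted
    free -= granted
  }
  (held, free) // (memory held by the reader, memory left for the writer)
}
```

The simulation never releases the over-acquired slack, mirroring the issue: the final doubling step grabs far more than the 10 MB actually needed, and since the reader does not spill or release it, the writer is starved.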
[jira] [Resolved] (SPARK-13286) JDBC driver doesn't report full exception
[ https://issues.apache.org/jira/browse/SPARK-13286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-13286. Resolution: Fixed Fix Version/s: 2.1.0 2.0.1 Issue resolved by pull request 14722 [https://github.com/apache/spark/pull/14722] > JDBC driver doesn't report full exception > - > > Key: SPARK-13286 > URL: https://issues.apache.org/jira/browse/SPARK-13286 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.0 >Reporter: Adrian Bridgett >Assignee: Davies Liu >Priority: Minor > Fix For: 2.0.1, 2.1.0 > > > Testing some failure scenarios (inserting data into PostgreSQL where there is > a schema mismatch), an exception is thrown (fine so far); however, it > doesn't report the actual SQL error. It refers to a getNextException call, > but this is beyond my non-existent Java skills to deal with correctly. > Supporting this would help users see the SQL error quickly and resolve the > underlying problem. > {noformat} > Caused by: java.sql.BatchUpdateException: Batch entry 0 INSERT INTO core > VALUES('5fdf5...',) was aborted. Call getNextException to see the cause. 
> at > org.postgresql.jdbc2.AbstractJdbc2Statement$BatchResultHandler.handleError(AbstractJdbc2Statement.java:2746) > at > org.postgresql.core.v3.QueryExecutorImpl$1.handleError(QueryExecutorImpl.java:457) > at > org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1887) > at > org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:405) > at > org.postgresql.jdbc2.AbstractJdbc2Statement.executeBatch(AbstractJdbc2Statement.java:2893) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:185) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:248) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:247) > at > org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$33.apply(RDD.scala:920) > at > org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$33.apply(RDD.scala:920) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) > at org.apache.spark.scheduler.Task.run(Task.scala:89) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
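The core idea behind the fix is to walk the `getNextException` chain that JDBC drivers attach to `BatchUpdateException`, so the root SQL error is surfaced instead of the unhelpful "Call getNextException to see the cause." message. A minimal standalone sketch (hypothetical helper, not Spark's exact code from the PR):

```scala
import java.sql.SQLException

// Hypothetical helper: follow getNextException links to the last (root)
// SQLException in the chain, which carries the driver's real error message.
def rootSqlException(e: SQLException): SQLException = {
  var cause = e
  while (cause.getNextException != null) {
    cause = cause.getNextException
  }
  cause
}
```

In a driver wrapper one would catch the `BatchUpdateException` (a subclass of `SQLException`), call a helper like this, and rethrow or log with the root exception attached as the cause.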
[jira] [Created] (SPARK-17204) Spark 2.0 off heap RDD persistence with replication factor 2 leads to data corruption
Michael Allman created SPARK-17204: -- Summary: Spark 2.0 off heap RDD persistence with replication factor 2 leads to data corruption Key: SPARK-17204 URL: https://issues.apache.org/jira/browse/SPARK-17204 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.0.0 Reporter: Michael Allman We use the OFF_HEAP storage level extensively. We've tried off-heap storage with replication factor 2 and have always received exceptions on the executor side very shortly after starting the job. For example: {code} com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: 9086 at com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:137) at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:670) at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:781) at org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229) at org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169) at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificColumnarIterator.hasNext(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) at org.apache.spark.scheduler.Task.run(Task.scala:85) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) {code} or {code} java.lang.IndexOutOfBoundsException: Index: 6, Size: 0 at java.util.ArrayList.rangeCheck(ArrayList.java:653) at java.util.ArrayList.get(ArrayList.java:429) at com.esotericsoftware.kryo.util.MapReferenceResolver.getReadObject(MapReferenceResolver.java:60) at com.esotericsoftware.kryo.Kryo.readReferenceOrNull(Kryo.java:834) at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:788) at org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229) at org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169) at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificColumnarIterator.hasNext(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) at org.apache.spark.scheduler.Task.run(Tas
[jira] [Updated] (SPARK-17204) Spark 2.0 off heap RDD persistence with replication factor 2 leads to data corruption
[ https://issues.apache.org/jira/browse/SPARK-17204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Allman updated SPARK-17204: --- Description: We use the OFF_HEAP storage level extensively. We've tried off-heap storage with replication factor 2 and have always received exceptions on the executor side very shortly after starting the job. For example: {code} com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: 9086 at com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:137) at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:670) at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:781) at org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229) at org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169) at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificColumnarIterator.hasNext(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125) at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) at org.apache.spark.scheduler.Task.run(Task.scala:85) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) {code} or {code} java.lang.IndexOutOfBoundsException: Index: 6, Size: 0 at java.util.ArrayList.rangeCheck(ArrayList.java:653) at java.util.ArrayList.get(ArrayList.java:429) at com.esotericsoftware.kryo.util.MapReferenceResolver.getReadObject(MapReferenceResolver.java:60) at com.esotericsoftware.kryo.Kryo.readReferenceOrNull(Kryo.java:834) at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:788) at org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229) at org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169) at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificColumnarIterator.hasNext(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) at 
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) at org.apache.spark.scheduler.Task.run(Task.scala:85) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPo
[jira] [Assigned] (SPARK-17203) data source options should always be case insensitive
[ https://issues.apache.org/jira/browse/SPARK-17203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17203: Assignee: Wenchen Fan (was: Apache Spark) > data source options should always be case insensitive > - > > Key: SPARK-17203 > URL: https://issues.apache.org/jira/browse/SPARK-17203 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan
[jira] [Assigned] (SPARK-17203) data source options should always be case insensitive
[ https://issues.apache.org/jira/browse/SPARK-17203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17203: Assignee: Apache Spark (was: Wenchen Fan) > data source options should always be case insensitive > - > > Key: SPARK-17203 > URL: https://issues.apache.org/jira/browse/SPARK-17203 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan >Assignee: Apache Spark
[jira] [Commented] (SPARK-17203) data source options should always be case insensitive
[ https://issues.apache.org/jira/browse/SPARK-17203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15433005#comment-15433005 ] Apache Spark commented on SPARK-17203: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/14773 > data source options should always be case insensitive > - > > Key: SPARK-17203 > URL: https://issues.apache.org/jira/browse/SPARK-17203 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan
[jira] [Resolved] (SPARK-17202) "Pipeline guide" link is broken in MLlib Guide main page
[ https://issues.apache.org/jira/browse/SPARK-17202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-17202. --- Resolution: Duplicate Yep, though this was already fixed in master. The next release's docs will contain the fix. > "Pipeline guide" link is broken in MLlib Guide main page > > > Key: SPARK-17202 > URL: https://issues.apache.org/jira/browse/SPARK-17202 > Project: Spark > Issue Type: Bug > Components: Documentation, MLlib >Affects Versions: 2.0.0 >Reporter: Vitalii Kotliarenko >Priority: Trivial > > Steps to reproduce: > 1) Check http://spark.apache.org/docs/latest/ml-guide.html > 2) Link in sentence "See the Pipelines guide for details" is broken, it > points to https://spark.apache.org/docs/latest/ml-pipeline.md > Expected result: "Pipeline guide" link should point to > https://spark.apache.org/docs/latest/ml-pipeline.html
[jira] [Created] (SPARK-17203) data source options should always be case insensitive
Wenchen Fan created SPARK-17203: --- Summary: data source options should always be case insensitive Key: SPARK-17203 URL: https://issues.apache.org/jira/browse/SPARK-17203 Project: Spark Issue Type: Improvement Components: SQL Reporter: Wenchen Fan Assignee: Wenchen Fan
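The behavior being asked for here can be sketched in a few lines (a hypothetical Python illustration, not Spark's actual implementation, which lives in Scala): a map that lower-cases option keys once on construction, so later lookups succeed regardless of how callers capitalize them.

```python
# Hypothetical sketch of a case-insensitive option map (not Spark's
# actual CaseInsensitiveMap): keys are normalized to lower case once,
# so lookups ignore the caller's capitalization.
class CaseInsensitiveOptions:
    def __init__(self, options):
        self._options = {k.lower(): v for k, v in options.items()}

    def get(self, key, default=None):
        return self._options.get(key.lower(), default)

opts = CaseInsensitiveOptions({"Header": "true", "inferSchema": "false"})
print(opts.get("header"))       # -> true
print(opts.get("INFERSCHEMA"))  # -> false
```

The design choice the issue title implies is that this normalization should happen uniformly for every data source, rather than each source deciding its own key-matching rule.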
[jira] [Created] (SPARK-17202) "Pipeline guide" link is broken in MLlib Guide main page
Vitalii Kotliarenko created SPARK-17202: --- Summary: "Pipeline guide" link is broken in MLlib Guide main page Key: SPARK-17202 URL: https://issues.apache.org/jira/browse/SPARK-17202 Project: Spark Issue Type: Bug Components: Documentation, MLlib Affects Versions: 2.0.0 Reporter: Vitalii Kotliarenko Priority: Trivial Steps to reproduce: 1) Check http://spark.apache.org/docs/latest/ml-guide.html 2) Link in sentence "See the Pipelines guide for details" is broken, it points to https://spark.apache.org/docs/latest/ml-pipeline.md Expected result: "Pipeline guide" link should point to https://spark.apache.org/docs/latest/ml-pipeline.html
[jira] [Created] (SPARK-17201) Investigate numerical instability for MLOR without regularization
Seth Hendrickson created SPARK-17201: Summary: Investigate numerical instability for MLOR without regularization Key: SPARK-17201 URL: https://issues.apache.org/jira/browse/SPARK-17201 Project: Spark Issue Type: Sub-task Reporter: Seth Hendrickson As mentioned [here|http://ufldl.stanford.edu/wiki/index.php/Softmax_Regression], when no regularization is applied in Softmax regression, second order Newton solvers may run into numerical instability problems. We should investigate this in practice and find a solution, possibly by implementing pivoting when no regularization is applied.
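The instability the linked Stanford note describes comes from over-parameterization: adding the same constant to every class's weight vector leaves the predicted probabilities unchanged, so without regularization the objective has no unique optimum and the Hessian a second-order solver must invert is singular. A quick numerical check of that redundancy (plain-Python sketch with made-up weights, not Spark's MLOR code):

```python
import math

def softmax(scores):
    m = max(scores)                                  # shift for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def class_scores(W, x):
    # One score per class: dot product of that class's weights with x.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

W = [[0.5, -1.0, 2.0],    # made-up weights: 3 classes, 3 features
     [1.5,  0.0, -0.5],
     [-2.0, 1.0, 0.3]]
x = [1.0, 2.0, -1.0]

p1 = softmax(class_scores(W, x))
# Shift every weight by the same constant: each class's score moves by
# c * sum(x), identically across classes, so probabilities are unchanged.
W_shifted = [[w + 10.0 for w in row] for row in W]
p2 = softmax(class_scores(W_shifted, x))
print(all(abs(a - b) < 1e-9 for a, b in zip(p1, p2)))  # -> True
```

The "pivoting" idea mentioned in the issue amounts to removing that redundant direction, e.g. by fixing one class's coefficients, so the remaining parameters are identifiable.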
[jira] [Commented] (SPARK-14560) Cooperative Memory Management for Spillables
[ https://issues.apache.org/jira/browse/SPARK-14560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15432883#comment-15432883 ] Imran Rashid commented on SPARK-14560: -- One minor clarification -- SPARK-4452 is not included in Spark 1.6, but Sean was running a version with that fix backported. So if you see this problem in Spark 1.6, (a) consider backporting SPARK-4452, and then (b) try switching to java serialization. > Cooperative Memory Management for Spillables > > > Key: SPARK-14560 > URL: https://issues.apache.org/jira/browse/SPARK-14560 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.1 >Reporter: Imran Rashid >Assignee: Lianhui Wang > Fix For: 2.0.0 > > > SPARK-10432 introduced cooperative memory management for SQL operators that > can spill; however, {{Spillable}}s used by the old RDD api still do not > cooperate. This can lead to memory starvation, in particular on a > shuffle-to-shuffle stage, eventually resulting in errors like: > {noformat} > 16/03/28 08:59:54 INFO memory.TaskMemoryManager: Memory used in task 3081 > 16/03/28 08:59:54 INFO memory.TaskMemoryManager: Acquired by > org.apache.spark.shuffle.sort.ShuffleExternalSorter@69ab0291: 32.0 KB > 16/03/28 08:59:54 INFO memory.TaskMemoryManager: 1317230346 bytes of memory > were used by task 3081 but are not associated with specific consumers > 16/03/28 08:59:54 INFO memory.TaskMemoryManager: 1317263114 bytes of memory > are used for execution and 1710484 bytes of memory are used for storage > 16/03/28 08:59:54 ERROR executor.Executor: Managed memory leak detected; size > = 1317230346 bytes, TID = 3081 > 16/03/28 08:59:54 ERROR executor.Executor: Exception in task 533.0 in stage > 3.0 (TID 3081) > java.lang.OutOfMemoryError: Unable to acquire 75 bytes of memory, got 0 > at > org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:120) > at >
org.apache.spark.shuffle.sort.ShuffleExternalSorter.acquireNewPageIfNecessary(ShuffleExternalSorter.java:346) > at > org.apache.spark.shuffle.sort.ShuffleExternalSorter.insertRecord(ShuffleExternalSorter.java:367) > at > org.apache.spark.shuffle.sort.UnsafeShuffleWriter.insertRecordIntoSorter(UnsafeShuffleWriter.java:237) > at > org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:164) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:89) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {noformat} > This can happen anytime the shuffle read side requires more memory than what > is available for the task. Since the shuffle-read side doubles its memory > request each time, it can easily end up acquiring all of the available > memory, even if it does not use it. Eg., say that after the final spill, the > shuffle-read side requires 10 MB more memory, and there is 15 MB of memory > available. But if it starts at 2 MB, it will double to 4, 8, and then > request 16 MB of memory, and in fact get all available 15 MB. Since the 15 > MB of memory is sufficient, it will not spill, and will continue holding on > to all available memory. But this leaves *no* memory available for the > shuffle-write side. Since the shuffle-write side cannot request the > shuffle-read side to free up memory, this leads to an OOM. > The simple solution is to make {{Spillable}} implement {{MemoryConsumer}} as > well, so RDDs can benefit from the cooperative memory management introduced > by SPARK-10342. 
> Note that an additional improvement would be for the shuffle-read side to > simply release unused memory, without spilling, in case that would leave > enough memory, and only spill if that was inadequate. However that can come > as a later improvement. > *Workaround*: You can set > {{spark.shuffle.spill.numElementsForceSpillThreshold=N}} to force spilling to > occur every {{N}} elements, thus preventing the shuffle-read side from ever > grabbing all of the available memory. However, this requires careful tuning > of {{N}} to specific workloads: too big, and you will still get an OOM; too > small, and there will be so much spilling that performance will suffer > drastically. Furthermore, this workaround uses an *undocumented* > configuration with *no compatibility guarantees* for fu
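The doubling arithmetic in the description (2 → 4 → 8 → 16 MB against a 15 MB pool) can be sketched as a toy model. This is a hypothetical simplification, not Spark's actual TaskMemoryManager: the reader keeps doubling its target until it covers what it needs, then the pool grants whatever is free.

```python
# Toy model of the doubling acquisition pattern described above
# (hypothetical; numbers follow the example in the issue text).
def grow_buffer(needed_mb, available_mb, start_mb=2):
    request = start_mb
    while request < needed_mb:
        request *= 2                      # 2 -> 4 -> 8 -> 16 ...
    granted = min(request, available_mb)  # the pool hands over all it has free
    return granted, available_mb - granted

held, left = grow_buffer(needed_mb=10, available_mb=15)
# The read side needed only 10 MB, but the doubled request (16 MB) drains
# the whole 15 MB pool, leaving nothing for the shuffle-write side.
print(held, left)  # -> 15 0
```

Because the 15 MB it received exceeds the 10 MB it needed, the reader never spills, and the writer, which cannot ask the reader to release memory, hits the OOM shown in the log above.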
[jira] [Commented] (SPARK-17126) Errors setting driver classpath in spark-defaults.conf on Windows 7
[ https://issues.apache.org/jira/browse/SPARK-17126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15432767#comment-15432767 ] Ozioma Ihekwoaba commented on SPARK-17126: -- It works, all my jars get listed in the Spark Web UI. I think from Java 6 upwards you can use the wildcard option to specify all jars in a classpath directory. I get your point, my scenario was a spark-shell tutorial session for Spark-SQL using a custom Hive instance. I needed a way to add the MySQL connector jar to the classpath for the Hive metastore, and also for other jars like the Spark CSV jar. Works like a charm on Linux, but failed repeatedly on Windows. Just curious do you know of any company running production Spark clusters on Windows? Cos it appears Spark is not built for Windows and all the examples point to a Linux setting. Thing is lots of up and coming young devs are totally flummoxed by the Linux command-line, and since they use Windows by default, Windows should be supported at a minimum...as a dev platform. You know, like the sbin folder scripts are all bash scripts. Ok, that was a subtle rant, maybe I should adapt the scripts myself to run on Windows. Thanks for the awesome work! > Errors setting driver classpath in spark-defaults.conf on Windows 7 > --- > > Key: SPARK-17126 > URL: https://issues.apache.org/jira/browse/SPARK-17126 > Project: Spark > Issue Type: Question > Components: Spark Shell, SQL >Affects Versions: 1.6.1 > Environment: Windows 7 >Reporter: Ozioma Ihekwoaba > > I am having issues starting up Spark shell with a local hive-site.xml on > Windows 7. > I have a local Hive 2.1.0 instance on Windows using a MySQL metastore. > The Hive instance is working fine. > I copied over the hive-site.xml to my local instance of Spark 1.6.1 conf > folder and also copied over mysql-connector-java-5.1.25-bin.jar to the lib > folder. 
> I was expecting Spark to pick up jar files in the lib folder automatically, > but found out Spark expects a spark.driver.extraClassPath and > spark.executor.extraClassPath settings to resolve jars. > Thing is this has failed on Windows for me with a > DataStoreDriverNotFoundException saying com.mysql.jdbc.Driver could not be > found. > Here are some of the different file paths I've tried: > C:/hadoop/spark/v161/lib/mysql-connector-java-5.1.25-bin.jar;C:/hadoop/spark/v161/lib/commons-csv-1.4.jar;C:/hadoop/spark/v161/lib/spark-csv_2.11-1.4.0.jar > ".;C:\hadoop\spark\v161\lib\*" > NONE has worked so far. > Please, what is the correct way to set driver classpaths on Windows? > Also, what is the correct file path format on Windows? > I have it working fine on Linux but my current engagement requires me to run > Spark on a Windows box. > Is there a way for Spark to automatically resolve jars from the lib folder in > all modes? > Thanks. > Ozzy
[jira] [Commented] (SPARK-17126) Errors setting driver classpath in spark-defaults.conf on Windows 7
[ https://issues.apache.org/jira/browse/SPARK-17126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15432715#comment-15432715 ] Sean Owen commented on SPARK-17126: --- Hm, I am not sure that "*" works on any JVM. Maybe I'm missing a reason it works for the env variable. But you would in general not specify it this way, which could be the problem. You would also not in general set app jar dependencies this way, but rather build them into your app. > Errors setting driver classpath in spark-defaults.conf on Windows 7 > --- > > Key: SPARK-17126 > URL: https://issues.apache.org/jira/browse/SPARK-17126 > Project: Spark > Issue Type: Question > Components: Spark Shell, SQL >Affects Versions: 1.6.1 > Environment: Windows 7 >Reporter: Ozioma Ihekwoaba > > I am having issues starting up Spark shell with a local hive-site.xml on > Windows 7. > I have a local Hive 2.1.0 instance on Windows using a MySQL metastore. > The Hive instance is working fine. > I copied over the hive-site.xml to my local instance of Spark 1.6.1 conf > folder and also copied over mysql-connector-java-5.1.25-bin.jar to the lib > folder. > I was expecting Spark to pick up jar files in the lib folder automatically, > but found out Spark expects a spark.driver.extraClassPath and > spark.executor.extraClassPath settings to resolve jars. > Thing is this has failed on Windows for me with a > DataStoreDriverNotFoundException saying com.mysql.jdbc.Driver could not be > found. > Here are some of the different file paths I've tried: > C:/hadoop/spark/v161/lib/mysql-connector-java-5.1.25-bin.jar;C:/hadoop/spark/v161/lib/commons-csv-1.4.jar;C:/hadoop/spark/v161/lib/spark-csv_2.11-1.4.0.jar > ".;C:\hadoop\spark\v161\lib\*" > NONE has worked so far. > Please, what is the correct way to set driver classpaths on Windows? > Also, what is the correct file path format on Windows? > I have it working fine on Linux but my current engagement requires me to run > Spark on a Windows box. 
> Is there a way for Spark to automatically resolve jars from the lib folder in > all modes? > Thanks. > Ozzy
[jira] [Comment Edited] (SPARK-17126) Errors setting driver classpath in spark-defaults.conf on Windows 7
[ https://issues.apache.org/jira/browse/SPARK-17126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15432707#comment-15432707 ] Ozioma Ihekwoaba edited comment on SPARK-17126 at 8/23/16 12:32 PM: Hi Sean, Thanks for the update. What I meant was the driver path and executor path entries make it to the Web UI. Meaning the values I set for the driver classpath and executor classpath are read by Spark during startup. However, the jars I specified in the 2 paths are not on the classpath entries list in the Web UI. They are also not loaded by Spark during startup. For example, the Spark CSV jar and other associated jars are not loaded. On Linux, the driver path jars and executor path jars are successfully added to the Spark classpath, IN ADDITION to being listed in the Spark Web UI environment tab. On Windows, the jars in the folder do not get listed in the Spark Web UI. I finally found a solution to this on Windows: I simply set SPARK_CLASSPATH. That was it. In summary, this worked when set in spark-env.cmd: set SPARK_CLASSPATH=C://hadoop//spark//v162//lib//* But none of these worked when set in spark-defaults.conf: spark.driver.extraClassPath C:\\hadoop\\spark\\v162\\lib\\* spark.driver.extraClassPath C://hadoop//spark//v162//lib//* spark.driver.extraClassPath C:\\hadoop\\spark\\v162\\lib\\mysql-connector-java-5.1.25-bin.jar; spark.driver.extraClassPath C:\\hadoop\\spark\\v162\\lib\\mysql-connector-java-5.1.25-bin.jar spark.driver.extraClassPath file:/C:/hadoop/spark/v162/lib/*jar; spark.driver.extraClassPath file:///C:/hadoop/spark/v162/lib/mysql-connector-java-5.1.25-bin.jar; What I needed was a way to add all necessary jars to the classpath during startup; I found the command-line syntax for adding packages and driver jars too cumbersome. Still wondering why just dropping jars in the lib folder (pre 2.0 versions) does not suffice as a default folder to resolve jars. 
Thanks, Ozzy
> Errors setting driver classpath in spark-defaults.conf on Windows 7 > --- > > Key: SPARK-17126 > URL: https://issues.apache.org/jira/browse/SPARK-17126 > Project: Spark > Issue Type: Question > Components: Spark Shell, SQL >Affects Versions: 1.6.1 > Environment: Windows 7 >Reporter: Ozioma Ihekwoaba > > I am having issues starting up Spark shell with a local hive-site.xml on > Windows 7. > I have a local Hive 2.1.0 instance on Windows using a MySQL metastore. > The Hive instance is working fine. > I copied over the hive-site.xml to my local instance of Spark 1.6.1 conf > folder and also copied over mysql-connector-java-5.1.25-bin.jar to the lib > folder. > I was expecting Spark to pick up jar files in the lib folder automatically, > but found out Spark expects a spark.driver.extraClassPath and > spark.executor.extraClassPath settings to resolve jars. > Thing is this has failed on Windows for me with a > DataStoreDriverNotFoundException saying com.mysql.jdbc.Driver could not be > found.
[jira] [Commented] (SPARK-17126) Errors setting driver classpath in spark-defaults.conf on Windows 7
[ https://issues.apache.org/jira/browse/SPARK-17126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15432707#comment-15432707 ] Ozioma Ihekwoaba commented on SPARK-17126: -- Hi Sean, Thanks for the update. What I meant was the driver path and executor path entries make it to the Web UI. Meaning the values I set for the driver classpath and executor classpath are read by Spark during startup. However, the jars I specified in the 2 paths are not on the classpath entries list in the Web UI. They are also not loaded by Spark during startup. For example, the Spark CSV jar and other associated jars are not loaded. On Linux, the driver path jars and executor path jars are successfully added to the Spark classpath, IN ADDITION to being listed in the Spark Web UI environment tab. On Windows, the jars in the folder do not get listed in the Spark Web UI. I finally found a solution to this on Windows: I simply set SPARK_CLASSPATH. That was it. In summary, this worked when set in spark-env.cmd: set SPARK_CLASSPATH=C://hadoop//spark//v162//lib//* But none of these worked when set in spark-defaults.conf: spark.driver.extraClassPath C:\\hadoop\\spark\\v162\\lib\\* spark.driver.extraClassPath C://hadoop//spark//v162//lib//* spark.driver.extraClassPath C:\\hadoop\\spark\\v162\\lib\\mysql-connector-java-5.1.25-bin.jar; spark.driver.extraClassPath C:\\hadoop\\spark\\v162\\lib\\mysql-connector-java-5.1.25-bin.jar spark.driver.extraClassPath file:/C:/hadoop/spark/v162/lib/*jar; spark.driver.extraClassPath file:///C:/hadoop/spark/v162/lib/mysql-connector-java-5.1.25-bin.jar; What I needed was a way to add all necessary jars to the classpath during startup; I found the command-line syntax for adding packages and driver jars too cumbersome. Still wondering why just dropping jars in the lib folder (pre 2.0 versions) does not suffice as a default folder to resolve jars. 
Thanks, Ozzy > Errors setting driver classpath in spark-defaults.conf on Windows 7 > --- > > Key: SPARK-17126 > URL: https://issues.apache.org/jira/browse/SPARK-17126 > Project: Spark > Issue Type: Question > Components: Spark Shell, SQL >Affects Versions: 1.6.1 > Environment: Windows 7 >Reporter: Ozioma Ihekwoaba > > I am having issues starting up Spark shell with a local hive-site.xml on > Windows 7. > I have a local Hive 2.1.0 instance on Windows using a MySQL metastore. > The Hive instance is working fine. > I copied over the hive-site.xml to my local instance of Spark 1.6.1 conf > folder and also copied over mysql-connector-java-5.1.25-bin.jar to the lib > folder. > I was expecting Spark to pick up jar files in the lib folder automatically, > but found out Spark expects a spark.driver.extraClassPath and > spark.executor.extraClassPath settings to resolve jars. > Thing is this has failed on Windows for me with a > DataStoreDriverNotFoundException saying com.mysql.jdbc.Driver could not be > found. > Here are some of the different file paths I've tried: > C:/hadoop/spark/v161/lib/mysql-connector-java-5.1.25-bin.jar;C:/hadoop/spark/v161/lib/commons-csv-1.4.jar;C:/hadoop/spark/v161/lib/spark-csv_2.11-1.4.0.jar > ".;C:\hadoop\spark\v161\lib\*" > NONE has worked so far. > Please, what is the correct way to set driver classpaths on Windows? > Also, what is the correct file path format on Windows? > I have it working fine on Linux but my current engagement requires me to run > Spark on a Windows box. > Is there a way for Spark to automatically resolve jars from the lib folder in > all modes? > Thanks. > Ozzy
[jira] [Resolved] (SPARK-17126) Errors setting driver classpath in spark-defaults.conf on Windows 7
[ https://issues.apache.org/jira/browse/SPARK-17126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-17126. --- Resolution: Not A Problem You said, "Yes the entries make it to the classpath, I checked up on the web UI." So I said I understood that you have successfully set the classpath to the desired value. That much is clear right? One other thing I forgot to note is that Java does not support directories of JARs on the classpath. That could be the problem. I was thinking that your problem is referring to local files on remote machines, but I'm not sure that's the issue. Normally, you build an "uber JAR" with your app and all dependencies and you do not set the classpath like this. That is another possible solution. At the moment this looks like a question about using Spark, rather than a problem. That should take place on user@ really. > Errors setting driver classpath in spark-defaults.conf on Windows 7 > --- > > Key: SPARK-17126 > URL: https://issues.apache.org/jira/browse/SPARK-17126 > Project: Spark > Issue Type: Question > Components: Spark Shell, SQL >Affects Versions: 1.6.1 > Environment: Windows 7 >Reporter: Ozioma Ihekwoaba > > I am having issues starting up Spark shell with a local hive-site.xml on > Windows 7. > I have a local Hive 2.1.0 instance on Windows using a MySQL metastore. > The Hive instance is working fine. > I copied over the hive-site.xml to my local instance of Spark 1.6.1 conf > folder and also copied over mysql-connector-java-5.1.25-bin.jar to the lib > folder. > I was expecting Spark to pick up jar files in the lib folder automatically, > but found out Spark expects a spark.driver.extraClassPath and > spark.executor.extraClassPath settings to resolve jars. > Thing is this has failed on Windows for me with a > DataStoreDriverNotFoundException saying com.mysql.jdbc.Driver could not be > found. 
> Here are some of the different file paths I've tried: > C:/hadoop/spark/v161/lib/mysql-connector-java-5.1.25-bin.jar;C:/hadoop/spark/v161/lib/commons-csv-1.4.jar;C:/hadoop/spark/v161/lib/spark-csv_2.11-1.4.0.jar > ".;C:\hadoop\spark\v161\lib\*" > NONE has worked so far. > Please, what is the correct way to set driver classpaths on Windows? > Also, what is the correct file path format on Windows? > I have it working fine on Linux but my current engagement requires me to run > Spark on a Windows box. > Is there a way for Spark to automatically resolve jars from the lib folder in > all modes? > Thanks. > Ozzy -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
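For reference, the settings under discussion live in conf/spark-defaults.conf. A hedged sketch of what such entries might look like on Windows (paths are the reporter's own; this is illustrative only, not a verified fix — the key points are the ';' classpath separator on Windows and listing each JAR explicitly rather than relying on a bare directory entry):

```properties
# Illustrative spark-defaults.conf entries only (not a verified fix).
# On Windows the JVM classpath separator is ';'; each JAR is listed explicitly.
spark.driver.extraClassPath    C:\hadoop\spark\v161\lib\mysql-connector-java-5.1.25-bin.jar;C:\hadoop\spark\v161\lib\commons-csv-1.4.jar;C:\hadoop\spark\v161\lib\spark-csv_2.11-1.4.0.jar
spark.executor.extraClassPath  C:\hadoop\spark\v161\lib\mysql-connector-java-5.1.25-bin.jar;C:\hadoop\spark\v161\lib\commons-csv-1.4.jar;C:\hadoop\spark\v161\lib\spark-csv_2.11-1.4.0.jar
```

As noted in the reply above, building an uber JAR that bundles the driver and its dependencies avoids classpath configuration entirely.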
[jira] [Updated] (SPARK-17095) Latex and Scala doc do not play nicely
[ https://issues.apache.org/jira/browse/SPARK-17095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-17095: -- Assignee: Jagadeesan A S Priority: Trivial (was: Minor) The net change here was slightly different: to fix up a few instances where LaTeX was being rendered as code but not cases involving "}}}" > Latex and Scala doc do not play nicely > -- > > Key: SPARK-17095 > URL: https://issues.apache.org/jira/browse/SPARK-17095 > Project: Spark > Issue Type: Improvement > Components: Documentation >Reporter: Seth Hendrickson >Assignee: Jagadeesan A S >Priority: Trivial > Labels: starter > Fix For: 2.1.0 > > > In Latex, it is common to find "}}}" when closing several expressions at > once. [SPARK-16822|https://issues.apache.org/jira/browse/SPARK-16822] added > Mathjax to render Latex equations in scaladoc. However, when scala doc sees > "}}}" or "{{{" it treats it as a special character for code block. This > results in some very strange output. > A poor workaround is to use "}}\,}" in latex which inserts a small > whitespace. This is not ideal, and we can hopefully find a better solution. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-17095) Latex and Scala doc do not play nicely
[ https://issues.apache.org/jira/browse/SPARK-17095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-17095. --- Resolution: Fixed Fix Version/s: 2.1.0 Issue resolved by pull request 14688 [https://github.com/apache/spark/pull/14688] > Latex and Scala doc do not play nicely > -- > > Key: SPARK-17095 > URL: https://issues.apache.org/jira/browse/SPARK-17095 > Project: Spark > Issue Type: Improvement > Components: Documentation >Reporter: Seth Hendrickson >Priority: Minor > Labels: starter > Fix For: 2.1.0 > > > In Latex, it is common to find "}}}" when closing several expressions at > once. [SPARK-16822|https://issues.apache.org/jira/browse/SPARK-16822] added > Mathjax to render Latex equations in scaladoc. However, when scala doc sees > "}}}" or "{{{" it treats it as a special character for code block. This > results in some very strange output. > A poor workaround is to use "}}\,}" in latex which inserts a small > whitespace. This is not ideal, and we can hopefully find a better solution. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
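The brace collision described above can be sketched concretely (illustrative LaTeX, not taken from the Spark codebase): any expression whose nesting legitimately ends in three closing braces is mis-read by scaladoc as a {{{/}}} code-block marker, and the workaround inserts a thin space to break up the brace run.

```latex
% Inside a scaladoc comment, this expression ends in "}}}", which scaladoc
% parses as the end of a code block rather than as LaTeX:
%   f(x) = e^{-\frac{x^2}{2\sigma^{2}}}
% The workaround splits the brace run with a thin space \, :
%   f(x) = e^{-\frac{x^2}{2\sigma^{2}}\,}
% This renders almost identically but never emits three adjacent braces.
```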
[jira] [Resolved] (SPARK-17199) Use CatalystConf.resolver for case-sensitivity comparison
[ https://issues.apache.org/jira/browse/SPARK-17199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell resolved SPARK-17199. --- Resolution: Fixed Assignee: Jacek Laskowski Fix Version/s: 2.1.0 > Use CatalystConf.resolver for case-sensitivity comparison > - > > Key: SPARK-17199 > URL: https://issues.apache.org/jira/browse/SPARK-17199 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.1, 2.1.0 >Reporter: Jacek Laskowski >Assignee: Jacek Laskowski >Priority: Trivial > Fix For: 2.1.0 > > > {{CatalystConf.resolver}} does the branching per {{caseSensitiveAnalysis}}. > There's no need to repeat the code across the codebase. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
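The point of the ticket above is to branch on case sensitivity once, in one place, instead of repeating the branch across the codebase. A minimal Python sketch of the same idea (names are illustrative; Spark's actual resolver is a Scala function of type (String, String) => Boolean):

```python
def make_resolver(case_sensitive: bool):
    """Return a name-equality function, branching once on configuration
    (the idea behind centralizing CatalystConf.resolver)."""
    if case_sensitive:
        return lambda a, b: a == b
    return lambda a, b: a.lower() == b.lower()

# Callers use the returned function without repeating the branch themselves.
resolver = make_resolver(case_sensitive=False)
print(resolver("colA", "cola"))  # → True
```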
[jira] [Comment Edited] (SPARK-17055) add labelKFold to CrossValidator
[ https://issues.apache.org/jira/browse/SPARK-17055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15432463#comment-15432463 ] Vincent edited comment on SPARK-17055 at 8/23/16 10:34 AM: --- sorry for late reply. Yes, I just knew they intend to rename it to GroupKFold, or something like that. Personally I think it's fine to keep the way it is, though, it could be still kinda confusing when someone first uses it before understanding the idea behinds it. As for application, take face recognition as an example. features are, say, eyes, nose, lips etc. training data are obtained from a number of different person, this method can create subject independent folds, so we can train the model with features from certain group of people and take the data from the rest of group of people for validation. it will enhance the generic ability of the model and avoid over-fitting. it's a useful method, seen in sklearn, and currently caret is on the way trying to add this feature. was (Author: vincexie): sorry for late reply. Yes, I just knew they intend to rename it to GroupKFold, or something like that. Though personally I think it's fine to keep the way it is, though, it could be still kinda confusing when someone first uses it before understanding the idea behinds it. As for application, take face recognition as an example. features are, say, eyes, nose, lips etc. training data are obtained from a number of different person, this method can create subject independent folds, so we can train the model with features from certain group of people and take the data from the rest of group of people for validation. it will enhance the generic ability of the model and avoid over-fitting. it's a useful method, seen in sklearn, and currently caret is on the way trying to add this feature. 
> add labelKFold to CrossValidator > > > Key: SPARK-17055 > URL: https://issues.apache.org/jira/browse/SPARK-17055 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Vincent >Priority: Minor > > Current CrossValidator only supports k-fold, which randomly divides all the > samples in k groups of samples. But in cases when data is gathered from > different subjects and we want to avoid over-fitting, we want to hold out > samples with certain labels from training data and put them into validation > fold, i.e. we want to ensure that the same label is not in both testing and > training sets. > Mainstream packages like Sklearn already supports such cross validation > method. > (http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.LabelKFold.html#sklearn.cross_validation.LabelKFold) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
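The group-aware splitting discussed above (scikit-learn's LabelKFold, since renamed GroupKFold) can be sketched in pure Python: whole groups are assigned to folds so that no group ever appears in both the training and validation sides of a split. This is an illustrative sketch only, not Spark's CrossValidator API:

```python
from collections import defaultdict

def group_k_fold(groups, k):
    """Yield (train_indices, test_indices) pairs such that every sample of a
    given group lands entirely in train or entirely in test, never both."""
    by_group = defaultdict(list)
    for i, g in enumerate(groups):
        by_group[g].append(i)
    # Greedily assign the largest groups first to keep fold sizes balanced.
    folds = [[] for _ in range(k)]
    for indices in sorted(by_group.values(), key=len, reverse=True):
        min(folds, key=len).extend(indices)
    for test in folds:
        train = [i for f in folds if f is not test for i in f]
        yield sorted(train), sorted(test)

# Example: samples 0-7 belong to four subjects; no subject straddles a split.
groups = ["a", "a", "b", "b", "c", "c", "d", "d"]
for train, test in group_k_fold(groups, k=2):
    assert not {groups[i] for i in train} & {groups[i] for i in test}
```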
[jira] [Commented] (SPARK-17200) Automate building and testing on Windows
[ https://issues.apache.org/jira/browse/SPARK-17200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15432521#comment-15432521 ] Hyukjin Kwon commented on SPARK-17200: -- Maybe, I will try to do this and I guess this will be done with a lot of messing around with flaky builds. I have tried this combination before but the combination itself is really flaky. So, I am thinking this would not be merged into codebase but I probably use this at least for SparkR (if I make it). I will try this one anyway but please leave any thoughts if you have any better idea. > Automate building and testing on Windows > - > > Key: SPARK-17200 > URL: https://issues.apache.org/jira/browse/SPARK-17200 > Project: Spark > Issue Type: Test > Components: Build, Project Infra >Affects Versions: 2.0.0 >Reporter: Hyukjin Kwon > > It seems there is no automated tests on Windows (I am not sure this is being > done manually before each release). > Assuming from this comment, > https://github.com/apache/spark/pull/14743#issuecomment-241473794, It seems > we have Windows infrastructure in the AMPLab Jenkins cluster. > It seems pretty much important because as far as I know we should manually > test and verify some patches related with Windows-specific problem. > For example, https://github.com/apache/spark/pull/14743#issuecomment-241473794 > I was thinking a combination with Travis CI and Docker with Windows image. > Although this might not be merged, I will try to give a shot with this (at > least for SparkR) anyway (just to verify some PRs I just linked above). > I would appreciate it if I can hear any thoughts about this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17200) Automate building and testing on Windows
[ https://issues.apache.org/jira/browse/SPARK-17200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15432511#comment-15432511 ] Sean Owen commented on SPARK-17200: --- Without a doubt that would be beneficial. The question is how much work it takes to set up and maintain, and that I don't know. While I think the goal is to make Spark run on Windows if possible, I'm not sure how well the dev setup / tests support Windows. > Automate building and testing on Windows > - > > Key: SPARK-17200 > URL: https://issues.apache.org/jira/browse/SPARK-17200 > Project: Spark > Issue Type: Test > Components: Build, Project Infra >Affects Versions: 2.0.0 >Reporter: Hyukjin Kwon > > It seems there is no automated tests on Windows (I am not sure this is being > done manually before each release). > Assuming from this comment, > https://github.com/apache/spark/pull/14743#issuecomment-241473794, It seems > we have Windows infrastructure in the AMPLab Jenkins cluster. > It seems pretty much important because as far as I know we should manually > test and verify some patches related with Windows-specific problem. > For example, https://github.com/apache/spark/pull/14743#issuecomment-241473794 > I was thinking a combination with Travis CI and Docker with Windows image. > Although this might not be merged, I will try to give a shot with this (at > least for SparkR) anyway (just to verify some PRs I just linked above). > I would appreciate it if I can hear any thoughts about this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17200) Automate building and testing on Windows
[ https://issues.apache.org/jira/browse/SPARK-17200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15432508#comment-15432508 ] Hyukjin Kwon commented on SPARK-17200: -- cc [~felixcheung] and [~shivaram] who might be interested in this ticket (from the linked PR above). > Automate building and testing on Windows > - > > Key: SPARK-17200 > URL: https://issues.apache.org/jira/browse/SPARK-17200 > Project: Spark > Issue Type: Test > Components: Build, Project Infra >Affects Versions: 2.0.0 >Reporter: Hyukjin Kwon > > It seems there is no automated tests on Windows (I am not sure this is being > done manually before each release). > Assuming from this comment, > https://github.com/apache/spark/pull/14743#issuecomment-241473794, It seems > we have Windows infrastructure in the AMPLab Jenkins cluster. > It seems pretty much important because as far as I know we should manually > test and verify some patches related with Windows-specific problem. > For example, https://github.com/apache/spark/pull/14743#issuecomment-241473794 > I was thinking a combination with Travis CI and Docker with Windows image. > Although this might not be merged, I will try to give a shot with this (at > least for SparkR) anyway (just to verify some PRs I just linked above). > I would appreciate it if I can hear any thoughts about this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17200) Automate building and testing on Windows
[ https://issues.apache.org/jira/browse/SPARK-17200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15432504#comment-15432504 ] Hyukjin Kwon commented on SPARK-17200: -- Could I please ask your opinion [~srowen]? I know you are an expert in this area. > Automate building and testing on Windows > - > > Key: SPARK-17200 > URL: https://issues.apache.org/jira/browse/SPARK-17200 > Project: Spark > Issue Type: Test > Components: Build, Project Infra >Affects Versions: 2.0.0 >Reporter: Hyukjin Kwon > > It seems there is no automated tests on Windows (I am not sure this is being > done manually before each release). > Assuming from this comment, > https://github.com/apache/spark/pull/14743#issuecomment-241473794, It seems > we have Windows infrastructure in the AMPLab Jenkins cluster. > It seems pretty much important because as far as I know we should manually > test and verify some patches related with Windows-specific problem. > For example, https://github.com/apache/spark/pull/14743#issuecomment-241473794 > I was thinking a combination with Travis CI and Docker with Windows image. > Although this might not be merged, I will try to give a shot with this (at > least for SparkR) anyway (just to verify some PRs I just linked above). > I would appreciate it if I can hear any thoughts about this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17174) Provide support for Timestamp type Column in add_months function to return HH:mm:ss
[ https://issues.apache.org/jira/browse/SPARK-17174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15432503#comment-15432503 ] Amit Baghel commented on SPARK-17174: - Please go ahead and submit PR. Thanks > Provide support for Timestamp type Column in add_months function to return > HH:mm:ss > --- > > Key: SPARK-17174 > URL: https://issues.apache.org/jira/browse/SPARK-17174 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 2.0.0 >Reporter: Amit Baghel >Priority: Minor > > add_months function currently supports Date types. If Column is Timestamp > type then it adds month to date but it doesn't return timestamp part > (HH:mm:ss). See the code below. > {code} > import java.util.Calendar > val now = Calendar.getInstance().getTime() > val df = sc.parallelize((0 to 3).map(i => {now.setMonth(i); (i, new > java.sql.Timestamp(now.getTime))}).toSeq).toDF("ID", "DateWithTS") > df.withColumn("NewDateWithTS", add_months(df("DateWithTS"),1)).show > {code} > Above code gives following response. See the HH:mm:ss is missing from > NewDateWithTS column. > {code} > +---++-+ > | ID| DateWithTS|NewDateWithTS| > +---++-+ > | 0|2016-01-21 09:38:...| 2016-02-21| > | 1|2016-02-21 09:38:...| 2016-03-21| > | 2|2016-03-21 09:38:...| 2016-04-21| > | 3|2016-04-21 09:38:...| 2016-05-21| > +---++-+ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
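The desired behavior — shifting by months while preserving HH:mm:ss — can be illustrated in plain Python (a hedged sketch only, not PySpark's API; the day-of-month is clamped the way add_months clamps it, e.g. Jan 31 + 1 month gives the last day of February):

```python
import calendar
import datetime

def add_months_keep_time(ts: datetime.datetime, n: int) -> datetime.datetime:
    """Shift a timestamp by n months, preserving the time-of-day and
    clamping the day to the target month's length."""
    month_index = ts.year * 12 + (ts.month - 1) + n
    year, month = divmod(month_index, 12)
    month += 1
    day = min(ts.day, calendar.monthrange(year, month)[1])
    return ts.replace(year=year, month=month, day=day)

print(add_months_keep_time(datetime.datetime(2016, 1, 21, 9, 38, 15), 1))
# → 2016-02-21 09:38:15
```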
[jira] [Updated] (SPARK-17200) Automate building and testing on Windows
[ https://issues.apache.org/jira/browse/SPARK-17200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-17200: - Description: It seems there is no automated tests on Windows (I am not sure this is being done manually before each release). Assuming from this comment, https://github.com/apache/spark/pull/14743#issuecomment-241473794, It seems we have Windows infrastructure in the AMPLab Jenkins cluster. It seems pretty much important because as far as I know we should manually test and verify some patches related with Windows-specific problem. For example, https://github.com/apache/spark/pull/14743#issuecomment-241473794 I was thinking a combination with Travis CI and Docker with Windows image. Although this might not be merged, I will try to give a shot with this (at least for SparkR) anyway (just to verify some PRs I just linked above). I would appreciate it if I can hear any thoughts about this. was: It seems there is no automated tests on Windows (I am not sure this is being done manually before each release). Assuming from this comment, https://github.com/apache/spark/pull/14743#issuecomment-241473794, It seems we have Windows infrastructure in the AMPLab Jenkins cluster. It seems pretty much important because as far as I know we should manually test and verify some patches related with Windows-specific problem. For example, https://github.com/apache/spark/pull/14743#issuecomment-241473794 I was thinking a combination with Travis CI and Docker with Windows image. Although this might not be merged, I will try to give a shot with this anyway (just to verify some PRs I just linked above). I would appreciate it if I can hear any thoughts about this. 
> Automate building and testing on Windows > - > > Key: SPARK-17200 > URL: https://issues.apache.org/jira/browse/SPARK-17200 > Project: Spark > Issue Type: Test > Components: Build, Project Infra >Affects Versions: 2.0.0 >Reporter: Hyukjin Kwon > > It seems there is no automated tests on Windows (I am not sure this is being > done manually before each release). > Assuming from this comment, > https://github.com/apache/spark/pull/14743#issuecomment-241473794, It seems > we have Windows infrastructure in the AMPLab Jenkins cluster. > It seems pretty much important because as far as I know we should manually > test and verify some patches related with Windows-specific problem. > For example, https://github.com/apache/spark/pull/14743#issuecomment-241473794 > I was thinking a combination with Travis CI and Docker with Windows image. > Although this might not be merged, I will try to give a shot with this (at > least for SparkR) anyway (just to verify some PRs I just linked above). > I would appreciate it if I can hear any thoughts about this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17200) Automate building and testing on Windows
Hyukjin Kwon created SPARK-17200: Summary: Automate building and testing on Windows Key: SPARK-17200 URL: https://issues.apache.org/jira/browse/SPARK-17200 Project: Spark Issue Type: Test Components: Build, Project Infra Affects Versions: 2.0.0 Reporter: Hyukjin Kwon It seems there is no automated tests on Windows (I am not sure this is being done manually before each release). Assuming from this comment, https://github.com/apache/spark/pull/14743#issuecomment-241473794, It seems we have Windows infrastructure in the AMPLab Jenkins cluster. It seems pretty much important because as far as I know we should manually test and verify some patches related with Windows-specific problem. For example, https://github.com/apache/spark/pull/14743#issuecomment-241473794 I was thinking a combination with Travis CI and Docker with Windows image. Although this might not be merged, I will try to give a shot with this anyway (just to verify some PRs I just linked above). I would appreciate it if I can hear any thoughts about this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-17055) add labelKFold to CrossValidator
[ https://issues.apache.org/jira/browse/SPARK-17055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15432463#comment-15432463 ] Vincent edited comment on SPARK-17055 at 8/23/16 9:18 AM: -- sorry for late reply. Yes, I just knew they intend to rename it to GroupKFold, or something like that. Though personally I think it's fine to keep the way it is, though, it could be still kinda confusing when someone first uses it before understanding the idea behinds it. As for application, take face recognition as an example. features are, say, eyes, nose, lips etc. training data are obtained from a number of different person, this method can create subject independent folds, so we can train the model with features from certain group of people and take the data from the rest of group of people for validation. it will enhance the generic ability of the model and avoid over-fitting. it's a useful method, seen in sklearn, and currently caret is on the way trying to add this feature. was (Author: vincexie): sorry for late reply. Yes, I just knew they intend to rename it to GroupKFold, or something like that. Though personally I think it's fine to keep the way it is, though, it could be still kinda confusing when someone first uses it before understanding the idea behinds it. As for application, take face recognition as an example. features are, say, eyes, nose, lips etc. training data are obtained from a number of different person, this method can create subject independent folds, so we can train the model with features from certain group of people and take the data from the rest of group of people for validation. it will enhance the generic ability of the model and avoid over-fitting. it's a useful method, seen in sklearn, and currently caret is on the way add this feature. 
> add labelKFold to CrossValidator > > > Key: SPARK-17055 > URL: https://issues.apache.org/jira/browse/SPARK-17055 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Vincent >Priority: Minor > > Current CrossValidator only supports k-fold, which randomly divides all the > samples in k groups of samples. But in cases when data is gathered from > different subjects and we want to avoid over-fitting, we want to hold out > samples with certain labels from training data and put them into validation > fold, i.e. we want to ensure that the same label is not in both testing and > training sets. > Mainstream packages like Sklearn already supports such cross validation > method. > (http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.LabelKFold.html#sklearn.cross_validation.LabelKFold) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12072) python dataframe ._jdf.schema().json() breaks on large metadata dataframes
[ https://issues.apache.org/jira/browse/SPARK-12072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15432470#comment-15432470 ] holdenk commented on SPARK-12072: - My guess is probably not a directly related issue - your OOM from there seems to be happening inside of the JVM and this probably wouldn't be the cause of that. > python dataframe ._jdf.schema().json() breaks on large metadata dataframes > -- > > Key: SPARK-12072 > URL: https://issues.apache.org/jira/browse/SPARK-12072 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.5.2 >Reporter: Rares Mirica > > When a dataframe contains a column with a large number of values in ml_attr, > schema evaluation will routinely fail on getting the schema as json, this > will, in turn, cause a bunch of problems with, eg: calling udfs on the schema > because calling columns relies on > _parse_datatype_json_string(self._jdf.schema().json()) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9004) Add s3 bytes read/written metrics
[ https://issues.apache.org/jira/browse/SPARK-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15432462#comment-15432462 ] Steve Loughran commented on SPARK-9004: --- If you know the filesystem, you can get summary stats from {{FileSystem.getStatistics()}}; they'd have to be collected across all the executors. Note these counters are per-JVM, not isolated into individual jobs. > Add s3 bytes read/written metrics > - > > Key: SPARK-9004 > URL: https://issues.apache.org/jira/browse/SPARK-9004 > Project: Spark > Issue Type: Improvement > Components: Input/Output >Reporter: Abhishek Modi >Priority: Minor > > s3 read/write metrics can be pretty useful in finding the total aggregate > data processed -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17055) add labelKFold to CrossValidator
[ https://issues.apache.org/jira/browse/SPARK-17055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15432463#comment-15432463 ] Vincent commented on SPARK-17055: - Sorry for the late reply. Yes, I just learned they intend to rename it to GroupKFold, or something like that. Personally I think it's fine to keep it the way it is, though it could still be confusing when someone first uses it before understanding the idea behind it. As for applications, take face recognition as an example: features are, say, eyes, nose, lips, etc., and training data are obtained from a number of different people. This method can create subject-independent folds, so we can train the model with features from a certain group of people and use the data from the remaining people for validation. It will enhance the generalization ability of the model and avoid over-fitting. It's a useful method, available in sklearn, and caret is currently working to add this feature. > add labelKFold to CrossValidator > > > Key: SPARK-17055 > URL: https://issues.apache.org/jira/browse/SPARK-17055 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Vincent >Priority: Minor > > Current CrossValidator only supports k-fold, which randomly divides all the > samples in k groups of samples. But in cases when data is gathered from > different subjects and we want to avoid over-fitting, we want to hold out > samples with certain labels from training data and put them into validation > fold, i.e. we want to ensure that the same label is not in both testing and > training sets. > Mainstream packages like Sklearn already supports such cross validation > method. > (http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.LabelKFold.html#sklearn.cross_validation.LabelKFold) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12072) python dataframe ._jdf.schema().json() breaks on large metadata dataframes
[ https://issues.apache.org/jira/browse/SPARK-12072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15432459#comment-15432459 ] Ben Teeuwen commented on SPARK-12072: - [~holdenk] we haven't been able to test the patch above (yet). Workarounds have been created using non-dataframe like operations. But recently I seem to have hit a wall related to the above. The discussion I've started on the spark 'user' mailinglist, topic "OOM with StringIndexer, 800m rows & 56m distinct value column", is that related to this ticket? Do you think your patch addresses it? > python dataframe ._jdf.schema().json() breaks on large metadata dataframes > -- > > Key: SPARK-12072 > URL: https://issues.apache.org/jira/browse/SPARK-12072 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.5.2 >Reporter: Rares Mirica > > When a dataframe contains a column with a large number of values in ml_attr, > schema evaluation will routinely fail on getting the schema as json, this > will, in turn, cause a bunch of problems with, eg: calling udfs on the schema > because calling columns relies on > _parse_datatype_json_string(self._jdf.schema().json()) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10297) When save data to a data source table, we should bound the size of a saved file
[ https://issues.apache.org/jira/browse/SPARK-10297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15432435#comment-15432435 ] Steve Loughran commented on SPARK-10297: FWIW, S3 and the S3A client don't have a size limit any more > When save data to a data source table, we should bound the size of a saved > file > --- > > Key: SPARK-10297 > URL: https://issues.apache.org/jira/browse/SPARK-10297 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai >Priority: Critical > > When we save a table to a data source table, it is possible that a writer is > responsible to write out a larger number of rows, which can make the > generated file very large and cause job failed if the underlying storage > system has a limit of max file size (e.g. S3's limit is 5GB). We should bound > the size of a file generated by a writer and create new writers for the same > partition if necessary. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
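The bounding idea in the description — create a new writer for the same partition once a file reaches a size budget — can be sketched as a rolling writer in plain Python (a hypothetical helper for illustration, not Spark's writer API):

```python
import os
import tempfile

class RollingWriter:
    """Write records to part files, starting a new part once the current
    file would exceed max_bytes (the per-file bound described above)."""
    def __init__(self, directory, max_bytes):
        self.directory = directory
        self.max_bytes = max_bytes
        self.part = -1
        self.written = 0
        self.fh = None
        self._roll()

    def _roll(self):
        # Close the current part (if any) and open the next one.
        if self.fh:
            self.fh.close()
        self.part += 1
        self.written = 0
        path = os.path.join(self.directory, f"part-{self.part:05d}")
        self.fh = open(path, "wb")

    def write(self, record: bytes):
        # Roll before the write that would blow the budget (never mid-record).
        if self.written + len(record) > self.max_bytes and self.written > 0:
            self._roll()
        self.fh.write(record)
        self.written += len(record)

    def close(self):
        self.fh.close()

# 250 bytes of records against a 100-byte budget yields three part files.
with tempfile.TemporaryDirectory() as d:
    w = RollingWriter(d, max_bytes=100)
    for _ in range(25):
        w.write(b"x" * 10)
    w.close()
    print(sorted(os.listdir(d)))  # → ['part-00000', 'part-00001', 'part-00002']
```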
[jira] [Commented] (SPARK-15965) No FileSystem for scheme: s3n or s3a spark-2.0.0 and spark-1.6.1
[ https://issues.apache.org/jira/browse/SPARK-15965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15432432#comment-15432432 ] Steve Loughran commented on SPARK-15965: This is being fixed, with tests, in my work on SPARK-7481; the manual workaround is Spark 2: # Get the same Hadoop version that your Spark version is built against # Add hadoop-aws and everything matching amazon-*.jar into the jars subdir Spark 1.6+: this needs my patch and a rebuild of the Spark assembly. However, once that patch is in, trying to use the assembly without the AWS JARs will stop Spark from starting, unless you move up to Hadoop 2.7.3 > No FileSystem for scheme: s3n or s3a spark-2.0.0 and spark-1.6.1 > - > > Key: SPARK-15965 > URL: https://issues.apache.org/jira/browse/SPARK-15965 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 1.6.1 > Environment: Debian GNU/Linux 8 > java version "1.7.0_79" >Reporter: thauvin damien > Original Estimate: 8h > Remaining Estimate: 8h > > The spark programming-guide explains that Spark can create distributed > datasets on Amazon S3 . > But since the pre-built "Hadoop 2.6" the S3 access doesn't work with s3n or > s3a. > sc.hadoopConfiguration.set("fs.s3a.awsAccessKeyId", "XXXZZZHHH") > sc.hadoopConfiguration.set("fs.s3a.awsSecretAccessKey", > "xxx") > val > lines=sc.textFile("s3a://poc-XXX/access/2016/02/20160201202001_xxx.log.gz") > java.lang.RuntimeException: java.lang.ClassNotFoundException: Class > org.apache.hadoop.fs.s3a.S3AFileSystem not found > Any version of spark : spark-1.3.1 ; spark-1.6.1 even spark-2.0.0 with > hadoop 2.7.2 . > I understand this is an Hadoop Issue (SPARK-7442) but can you make some > documentation to explain what jar we need to add and where ? ( for standalone > installation) . > "hadoop-aws-x.x.x.jar and aws-java-sdk-x.x.x.jar is enough ? > What env variable we need to set and what file we need to modify . 
> Is it "$CLASSPATH", or a variable in "spark-defaults.conf" such as > "spark.driver.extraClassPath" and "spark.executor.extraClassPath"? > It still works with spark-1.6.1 pre-built with hadoop2.4. > Thanks -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
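As a sketch only of the classpath wiring discussed above: the paths below are hypothetical, and the hadoop-aws and aws-java-sdk versions shown are only an example pairing that must match the Hadoop version your Spark build targets.

```
# spark-defaults.conf sketch -- paths are placeholders; pick hadoop-aws and
# aws-java-sdk versions that match the Hadoop build Spark was compiled against
spark.driver.extraClassPath    /opt/spark/extra/hadoop-aws-2.7.3.jar:/opt/spark/extra/aws-java-sdk-1.7.4.jar
spark.executor.extraClassPath  /opt/spark/extra/hadoop-aws-2.7.3.jar:/opt/spark/extra/aws-java-sdk-1.7.4.jar
```

Both properties are needed because the driver and the executors resolve the s3a:// filesystem independently.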
[jira] [Commented] (SPARK-17055) add labelKFold to CrossValidator
[ https://issues.apache.org/jira/browse/SPARK-17055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15432427#comment-15432427 ] Sean Owen commented on SPARK-17055: --- From a comment in the PR, I get it. This is not actually about labels, but about some arbitrary attribute or function of each example. The purpose is to group examples into train/test such that examples with the same attribute value always go into the same data set. So maybe you want all examples for one customer ID to go into train, or all into test, but not split across both. This needs a different name, I think, because 'label' has a specific and different meaning, and even scikit says they want to rename it. It's coherent, but I still don't know how useful it is. It would need to be reconstructed for Spark ML. > add labelKFold to CrossValidator > > > Key: SPARK-17055 > URL: https://issues.apache.org/jira/browse/SPARK-17055 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Vincent >Priority: Minor > > The current CrossValidator only supports k-fold, which randomly divides all the > samples into k groups. But in cases where data is gathered from > different subjects and we want to avoid over-fitting, we want to hold out > samples with certain labels from the training data and put them into the validation > fold, i.e. we want to ensure that the same label is not in both the testing and > training sets. > Mainstream packages like sklearn already support such a cross-validation > method. > (http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.LabelKFold.html#sklearn.cross_validation.LabelKFold) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
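To illustrate the grouping semantics under discussion, here is a minimal sketch (a hypothetical helper, not Spark or scikit-learn API): every sample sharing a group value must land in the same fold, which a greedy size-balanced assignment achieves.

```python
from collections import Counter

def group_k_fold(groups, k):
    """Assign a fold index to each sample so that all samples sharing a
    group value land in the same fold (greedy balancing by group size)."""
    sizes = Counter(groups)
    fold_sizes = [0] * k
    fold_of_group = {}
    # Place the largest groups first, always into the currently emptiest fold,
    # to keep fold sizes roughly balanced.
    for g, n in sizes.most_common():
        f = fold_sizes.index(min(fold_sizes))
        fold_of_group[g] = f
        fold_sizes[f] += n
    return [fold_of_group[g] for g in groups]
```

Fold i then serves as the validation set while the rest train, guaranteeing no group (e.g. customer ID) straddles the train/test split.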
[jira] [Commented] (SPARK-16363) Spark-submit doesn't work with IAM Roles
[ https://issues.apache.org/jira/browse/SPARK-16363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15432421#comment-15432421 ] Steve Loughran commented on SPARK-16363: S3A uses {{com.amazonaws.auth.InstanceProfileCredentialsProvider}} to talk to the Amazon EC2 Instance Metadata Service. Switch to S3A and Hadoop 2.7+, and you should be able to do this. That said, I do want to make some changes to how Spark propagates env vars, as (a) it ignores the AWS_SESSION env var and (b) it stamps on any existing id/secret. That's not going to help here. > Spark-submit doesn't work with IAM Roles > > > Key: SPARK-16363 > URL: https://issues.apache.org/jira/browse/SPARK-16363 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 1.6.2 > Environment: Spark Stand-Alone with EC2 instances configured with IAM > Roles. >Reporter: Ashic Mahtab > > When running Spark standalone on EC2 boxes, > spark-submit --master spark://master-ip:7077 --class Foo > --deploy-mode cluster --verbose s3://bucket/dir/foo/jar > fails to find the jar even if AWS IAM roles are configured to allow the EC2 > boxes (that are running the Spark master and workers) access to the file in S3. > The exception is provided below. It's asking us to set keys, etc., when the > boxes are configured via IAM roles. > 16/07/04 11:44:09 ERROR ClientEndpoint: Exception from cluster was: > java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key > must be specified as the username or password (respectively) of a s3 URL, or > by setting the fs.s3.awsAccessKeyId or fs.s3.awsSecretAccessKey properties > (respectively). > java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key > must be specified as the username or password (respectively) of a s3 URL, or > by setting the fs.s3.awsAccessKeyId or fs.s3.awsSecretAccessKey properties > (respectively). 
> at > org.apache.hadoop.fs.s3.S3Credentials.initialize(S3Credentials.java:66) > at > org.apache.hadoop.fs.s3.Jets3tFileSystemStore.initialize(Jets3tFileSystemStore.java:82) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:85) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:62) > at com.sun.proxy.$Proxy5.initialize(Unknown Source) > at > org.apache.hadoop.fs.s3.S3FileSystem.initialize(S3FileSystem.java:77) > at > org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1446) > at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:67) > at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1464) > at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:263) > at org.apache.spark.util.Utils$.getHadoopFileSystem(Utils.scala:1686) > at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:598) > at org.apache.spark.util.Utils$.fetchFile(Utils.scala:395) > at > org.apache.spark.deploy.worker.DriverRunner.org$apache$spark$deploy$worker$DriverRunner$$downloadUserJar(DriverRunner.scala:150) > at > org.apache.spark.deploy.worker.DriverRunner$$anon$1.run(DriverRunner.scala:79) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
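Concretely, the suggested workaround might look like the following sketch (illustrative only: it assumes a Hadoop 2.7+ build with hadoop-aws and the AWS SDK on the classpath, and the master, class, and bucket names are placeholders from the report):

```
# With S3A on Hadoop 2.7+, InstanceProfileCredentialsProvider is consulted
# automatically, so no access keys need to be configured on IAM-role boxes.
# Note the s3a:// scheme rather than the s3:// scheme from the report.
spark-submit --master spark://master-ip:7077 \
  --class Foo --deploy-mode cluster --verbose \
  s3a://bucket/dir/foo.jar
```

The stack trace above comes from the old {{org.apache.hadoop.fs.s3.S3FileSystem}}, which only understands static keys; S3A is the code path that knows about instance-profile credentials.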
[jira] [Commented] (SPARK-14560) Cooperative Memory Management for Spillables
[ https://issues.apache.org/jira/browse/SPARK-14560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15432406#comment-15432406 ] Sean Owen commented on SPARK-14560: --- Even with the fix for SPARK-4452, I observed a problem like the one described here: running out of memory in the shuffle read phase rather inexplicably, which was worked around with numElementsForceSpillThreshold. It turned out that enabling Java serialization made the problem go away entirely. This was in Spark 1.6. No idea why, but I'm leaving this as a note for anyone who may find it, or in case we later connect the dots elsewhere to an underlying problem. > Cooperative Memory Management for Spillables > > > Key: SPARK-14560 > URL: https://issues.apache.org/jira/browse/SPARK-14560 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.1 >Reporter: Imran Rashid >Assignee: Lianhui Wang > Fix For: 2.0.0 > > > SPARK-10432 introduced cooperative memory management for SQL operators that > can spill; however, {{Spillable}}s used by the old RDD API still do not > cooperate. 
This can lead to memory starvation, in particular on a > shuffle-to-shuffle stage, eventually resulting in errors like: > {noformat} > 16/03/28 08:59:54 INFO memory.TaskMemoryManager: Memory used in task 3081 > 16/03/28 08:59:54 INFO memory.TaskMemoryManager: Acquired by > org.apache.spark.shuffle.sort.ShuffleExternalSorter@69ab0291: 32.0 KB > 16/03/28 08:59:54 INFO memory.TaskMemoryManager: 1317230346 bytes of memory > were used by task 3081 but are not associated with specific consumers > 16/03/28 08:59:54 INFO memory.TaskMemoryManager: 1317263114 bytes of memory > are used for execution and 1710484 bytes of memory are used for storage > 16/03/28 08:59:54 ERROR executor.Executor: Managed memory leak detected; size > = 1317230346 bytes, TID = 3081 > 16/03/28 08:59:54 ERROR executor.Executor: Exception in task 533.0 in stage > 3.0 (TID 3081) > java.lang.OutOfMemoryError: Unable to acquire 75 bytes of memory, got 0 > at > org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:120) > at > org.apache.spark.shuffle.sort.ShuffleExternalSorter.acquireNewPageIfNecessary(ShuffleExternalSorter.java:346) > at > org.apache.spark.shuffle.sort.ShuffleExternalSorter.insertRecord(ShuffleExternalSorter.java:367) > at > org.apache.spark.shuffle.sort.UnsafeShuffleWriter.insertRecordIntoSorter(UnsafeShuffleWriter.java:237) > at > org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:164) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:89) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {noformat} > This can happen anytime the 
shuffle-read side requires more memory than what > is available for the task. Since the shuffle-read side doubles its memory > request each time, it can easily end up acquiring all of the available > memory, even if it does not use it. E.g., say that after the final spill, the > shuffle-read side requires 10 MB more memory, and there is 15 MB of memory > available. But if it starts at 2 MB, it will double to 4, 8, and then > request 16 MB of memory, and in fact get all available 15 MB. Since the 15 > MB of memory is sufficient, it will not spill, and will continue holding on > to all available memory. But this leaves *no* memory available for the > shuffle-write side. Since the shuffle-write side cannot request the > shuffle-read side to free up memory, this leads to an OOM. > The simple solution is to make {{Spillable}} implement {{MemoryConsumer}} as > well, so RDDs can benefit from the cooperative memory management introduced > by SPARK-10432. > Note that an additional improvement would be for the shuffle-read side to > simply release unused memory, without spilling, in case that would leave > enough memory, and only spill if that was inadequate. However, that can come > as a later improvement. > *Workaround*: You can set > {{spark.shuffle.spill.numElementsForceSpillThreshold=N}} to force spilling to > occur every {{N}} elements, thus preventing the shuffle-read side from ever > grabbing all of the available memory. However, this requires careful tuning > of {{N}} to specific workloads: too big, and you will still get an OOM; too small, and tasks will spill far more often than necessary.
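The doubling arithmetic above can be sketched in a toy simulation (illustrative only, not Spark code; the function and parameter names are invented):

```python
def doubling_acquire(held_mb, needed_mb, available_mb):
    """Model a consumer that doubles its holdings on each request until it
    holds `needed_mb` more than it started with, or the pool runs dry."""
    start = held_mb
    while held_mb - start < needed_mb and available_mb > 0:
        request = held_mb                     # doubling: ask for as much as we already hold
        granted = min(request, available_mb)  # the pool grants whatever it can
        available_mb -= granted
        held_mb += granted
        if granted < request:
            break  # pool exhausted; stop requesting
    return held_mb, available_mb
```

In this toy model, starting from 2 MB with 10 MB needed and 15 MB free, the consumer ends up holding 16 MB and leaves only 1 MB in the pool, mirroring how the shuffle-read side can strand the shuffle-write side with almost nothing.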