[jira] [Assigned] (SPARK-10348) Improve Spark ML user guide
[ https://issues.apache.org/jira/browse/SPARK-10348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-10348:

    Assignee: Apache Spark  (was: Xiangrui Meng)

> Improve Spark ML user guide
> ---
>
> Key: SPARK-10348
> URL: https://issues.apache.org/jira/browse/SPARK-10348
> Project: Spark
> Issue Type: Improvement
> Components: Documentation, ML
> Affects Versions: 1.5.0
> Reporter: Xiangrui Meng
> Assignee: Apache Spark
>
> improve ml-guide:
> * replace `ML Dataset` by `DataFrame` to simplify the abstraction
> * remove links to Scala API doc in the main guide
> * change ML algorithms to pipeline components

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10348) Improve Spark ML user guide
[ https://issues.apache.org/jira/browse/SPARK-10348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14720995#comment-14720995 ]

Apache Spark commented on SPARK-10348:

User 'mengxr' has created a pull request for this issue:
https://github.com/apache/spark/pull/8517

> Improve Spark ML user guide
> ---
>
> Key: SPARK-10348
> URL: https://issues.apache.org/jira/browse/SPARK-10348
> Project: Spark
> Issue Type: Improvement
> Components: Documentation, ML
> Affects Versions: 1.5.0
> Reporter: Xiangrui Meng
> Assignee: Xiangrui Meng
>
> improve ml-guide:
> * replace `ML Dataset` by `DataFrame` to simplify the abstraction
> * remove links to Scala API doc in the main guide
> * change ML algorithms to pipeline components
[jira] [Assigned] (SPARK-10348) Improve Spark ML user guide
[ https://issues.apache.org/jira/browse/SPARK-10348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-10348:

    Assignee: Xiangrui Meng  (was: Apache Spark)

> Improve Spark ML user guide
> ---
>
> Key: SPARK-10348
> URL: https://issues.apache.org/jira/browse/SPARK-10348
> Project: Spark
> Issue Type: Improvement
> Components: Documentation, ML
> Affects Versions: 1.5.0
> Reporter: Xiangrui Meng
> Assignee: Xiangrui Meng
>
> improve ml-guide:
> * replace `ML Dataset` by `DataFrame` to simplify the abstraction
> * remove links to Scala API doc in the main guide
> * change ML algorithms to pipeline components
[jira] [Created] (SPARK-10348) Improve Spark ML user guide
Xiangrui Meng created SPARK-10348:

Summary: Improve Spark ML user guide
Key: SPARK-10348
URL: https://issues.apache.org/jira/browse/SPARK-10348
Project: Spark
Issue Type: Improvement
Components: Documentation, ML
Affects Versions: 1.5.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng

improve ml-guide:
* replace `ML Dataset` by `DataFrame` to simplify the abstraction
* remove links to Scala API doc in the main guide
* change ML algorithms to pipeline components
[jira] [Updated] (SPARK-10331) Update user guide to address minor comments during code review
[ https://issues.apache.org/jira/browse/SPARK-10331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiangrui Meng updated SPARK-10331:

    Priority: Major  (was: Minor)

> Update user guide to address minor comments during code review
> --
>
> Key: SPARK-10331
> URL: https://issues.apache.org/jira/browse/SPARK-10331
> Project: Spark
> Issue Type: Improvement
> Components: Documentation, ML, MLlib
> Affects Versions: 1.5.0
> Reporter: Xiangrui Meng
> Assignee: Xiangrui Meng
>
> Clean-up user guides to address some minor comments in:
> https://github.com/apache/spark/pull/8304
> https://github.com/apache/spark/pull/8487
[jira] [Commented] (SPARK-9042) Spark SQL incompatibility with Apache Sentry
[ https://issues.apache.org/jira/browse/SPARK-9042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14720977#comment-14720977 ]

Nitin Kak commented on SPARK-9042:

I agree. This is a broader issue; it just does not surface unless Sentry is installed. I am not sure how much effort is going into HiveContext lately. Is there a way to highlight this issue to the people who have worked on HiveContext?

> Spark SQL incompatibility with Apache Sentry
>
> Key: SPARK-9042
> URL: https://issues.apache.org/jira/browse/SPARK-9042
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.2.0
> Reporter: Nitin Kak
>
> Hive queries executed from Spark using HiveContext use the CLI to create the
> query plan and then access the Hive table directories (under
> /user/hive/warehouse/) directly. This gives an AccessControlException if Apache
> Sentry is installed:
>
> org.apache.hadoop.security.AccessControlException: Permission denied:
> user=kakn, access=READ_EXECUTE,
> inode="/user/hive/warehouse/mastering.db/sample_table":hive:hive:drwxrwx--t
>
> With Apache Sentry, only the "hive" user (created only for Sentry) has
> permission to access the Hive warehouse directory. After Sentry is installed,
> all queries are directed to HiveServer2, which changes the invoking user to
> "hive" and then accesses the Hive warehouse directory. However, HiveContext
> does not execute the query through HiveServer2, which leads to the issue.
> Here is an example of executing a Hive query through HiveContext:
>
> val hqlContext = new HiveContext(sc) // Create context to run Hive queries
> val pairRDD = hqlContext.sql(hql) // where hql is the string with the Hive query
[jira] [Comment Edited] (SPARK-9963) ML RandomForest cleanup: replace predictNodeIndex with predictImpl
[ https://issues.apache.org/jira/browse/SPARK-9963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14720966#comment-14720966 ]

Luvsandondov Lkhamsuren edited comment on SPARK-9963 at 8/29/15 4:31 AM:

Thanks @Trevor Cai! I was kind of hoping someone would approve my direction (this is my first time) so that I can start implementing. Once we agree on the direction, I can definitely implement the part.

was (Author: lkhamsurenl):
Thanks Trevor Cai! I was kind of hoping someone would approve my direction (this is my first time) so that I can start implementing. Once we agree on the direction, I can definitely implement the part.

> ML RandomForest cleanup: replace predictNodeIndex with predictImpl
> --
>
> Key: SPARK-9963
> URL: https://issues.apache.org/jira/browse/SPARK-9963
> Project: Spark
> Issue Type: Improvement
> Components: ML
> Reporter: Joseph K. Bradley
> Priority: Trivial
> Labels: starter
>
> Replace ml.tree.impl.RandomForest.predictNodeIndex with Node.predictImpl.
> This should be straightforward, but please ping me if anything is unclear.
[jira] [Comment Edited] (SPARK-9963) ML RandomForest cleanup: replace predictNodeIndex with predictImpl
[ https://issues.apache.org/jira/browse/SPARK-9963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14720966#comment-14720966 ]

Luvsandondov Lkhamsuren edited comment on SPARK-9963 at 8/29/15 4:31 AM:

Thanks [~trevorcai]! I was kind of hoping someone would approve my direction (this is my first time) so that I can start implementing. Once we agree on the direction, I can definitely implement the part.

was (Author: lkhamsurenl):
Thanks @Trevor Cai! I was kind of hoping someone would approve my direction (this is my first time) so that I can start implementing. Once we agree on the direction, I can definitely implement the part.

> ML RandomForest cleanup: replace predictNodeIndex with predictImpl
> --
>
> Key: SPARK-9963
> URL: https://issues.apache.org/jira/browse/SPARK-9963
> Project: Spark
> Issue Type: Improvement
> Components: ML
> Reporter: Joseph K. Bradley
> Priority: Trivial
> Labels: starter
>
> Replace ml.tree.impl.RandomForest.predictNodeIndex with Node.predictImpl.
> This should be straightforward, but please ping me if anything is unclear.
[jira] [Commented] (SPARK-9963) ML RandomForest cleanup: replace predictNodeIndex with predictImpl
[ https://issues.apache.org/jira/browse/SPARK-9963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14720966#comment-14720966 ]

Luvsandondov Lkhamsuren commented on SPARK-9963:

Thanks Trevor Cai! I was kind of hoping someone would approve my direction (this is my first time) so that I can start implementing. Once we agree on the direction, I can definitely implement the part.

> ML RandomForest cleanup: replace predictNodeIndex with predictImpl
> --
>
> Key: SPARK-9963
> URL: https://issues.apache.org/jira/browse/SPARK-9963
> Project: Spark
> Issue Type: Improvement
> Components: ML
> Reporter: Joseph K. Bradley
> Priority: Trivial
> Labels: starter
>
> Replace ml.tree.impl.RandomForest.predictNodeIndex with Node.predictImpl.
> This should be straightforward, but please ping me if anything is unclear.
[jira] [Updated] (SPARK-8517) Improve the organization and style of MLlib's user guide
[ https://issues.apache.org/jira/browse/SPARK-8517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiangrui Meng updated SPARK-8517:

    Target Version/s: 1.6.0  (was: 1.5.0)

> Improve the organization and style of MLlib's user guide
>
> Key: SPARK-8517
> URL: https://issues.apache.org/jira/browse/SPARK-8517
> Project: Spark
> Issue Type: Improvement
> Components: Documentation, ML, MLlib
> Reporter: Xiangrui Meng
>
> The current MLlib user guide (and spark.ml's), especially the main page,
> doesn't have a nice style. We could update it and re-organize the content to
> make it easier to navigate.
[jira] [Updated] (SPARK-7457) Perf test for ALS.recommendAll
[ https://issues.apache.org/jira/browse/SPARK-7457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiangrui Meng updated SPARK-7457:

    Target Version/s: 1.6.0  (was: 1.5.0)

> Perf test for ALS.recommendAll
> --
>
> Key: SPARK-7457
> URL: https://issues.apache.org/jira/browse/SPARK-7457
> Project: Spark
> Issue Type: Sub-task
> Components: MLlib
> Affects Versions: 1.4.0
> Reporter: Xiangrui Meng
[jira] [Resolved] (SPARK-9910) User guide for train validation split
[ https://issues.apache.org/jira/browse/SPARK-9910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiangrui Meng resolved SPARK-9910.

    Resolution: Fixed
    Fix Version/s: 1.5.0

Issue resolved by pull request 8377
[https://github.com/apache/spark/pull/8377]

> User guide for train validation split
> -
>
> Key: SPARK-9910
> URL: https://issues.apache.org/jira/browse/SPARK-9910
> Project: Spark
> Issue Type: Documentation
> Components: ML
> Reporter: Feynman Liang
> Assignee: Martin Zapletal
> Fix For: 1.5.0
>
> SPARK-8484 adds a TrainValidationSplit transformer which needs user guide
> docs and example code.
[jira] [Updated] (SPARK-10340) Use S3 bulk listing for S3-backed Hive tables
[ https://issues.apache.org/jira/browse/SPARK-10340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

DB Tsai updated SPARK-10340:

    Assignee: Cheolsoo Park

> Use S3 bulk listing for S3-backed Hive tables
> -
>
> Key: SPARK-10340
> URL: https://issues.apache.org/jira/browse/SPARK-10340
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 1.4.1, 1.5.0
> Reporter: Cheolsoo Park
> Assignee: Cheolsoo Park
>
> AWS S3 provides a bulk listing API. It takes the common prefix of all input
> paths as a parameter and returns all the objects whose keys start with the
> common prefix, in blocks of 1000.
>
> Since SPARK-9926 allows us to list multiple partitions all together, we can
> significantly speed up input split calculation using S3 bulk listing. This
> optimization is particularly useful for queries like {{select * from
> partitioned_table limit 10}}.
>
> This is a common optimization for S3. For example, here is a [blog
> post|http://www.qubole.com/blog/product/optimizing-hadoop-for-s3-part-1/]
> from Qubole on this topic.
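The listing strategy described in this issue can be sketched in a few lines. This is a hypothetical illustration only, in plain Python rather than Spark or AWS SDK code; `common_prefix`, `list_bulk`, and the in-memory object list are stand-ins for the real S3 API, assumed for the example:

```python
import os.path

def common_prefix(paths):
    """Longest common string prefix of all input paths (the S3 list prefix)."""
    return os.path.commonprefix(paths)

def list_bulk(object_keys, paths, page_size=1000):
    """One prefix scan over an in-memory object listing, paged like S3
    (up to `page_size` keys per response), instead of one LIST per path."""
    prefix = common_prefix(paths)
    matched = sorted(k for k in object_keys if k.startswith(prefix))
    # yield results in pages, mimicking S3's paginated responses
    for i in range(0, len(matched), page_size):
        yield matched[i:i + page_size]
```

The point of the optimization is the single paginated scan: all partition directories share one prefix, so one listing replaces one request per partition.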
[jira] [Commented] (SPARK-10337) Views are broken
[ https://issues.apache.org/jira/browse/SPARK-10337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14720922#comment-14720922 ]

Yin Huai commented on SPARK-10337:

I think this will break if the view is created from a data source table that does not have a corresponding serde. I guess 100milints was created before we added the support for saving Hive columns and serdes to the metastore. I tried master with the following code.

{code}
sqlContext.range(1, 10).write.format("json").saveAsTable("yinJsonTable")

describe formatted yinJsonTable
# col_name data_type comment
col array from deserializer
...

sqlContext.range(1, 10).write.format("parquet").saveAsTable("yinParquetTable")

describe formatted yinParquetTable
# col_name data_type comment
id bigint
...
{code}

For json, we do not store column info to Hive; you can see that the metastore actually stores a column col with type array in it.

> Views are broken
>
> Key: SPARK-10337
> URL: https://issues.apache.org/jira/browse/SPARK-10337
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.5.0
> Reporter: Michael Armbrust
> Priority: Critical
>
> I haven't dug into this yet... but it seems like this should work:
>
> This works:
> {code}
> SELECT * FROM 100milints
> {code}
>
> This seems to work:
> {code}
> CREATE VIEW testView AS SELECT * FROM 100milints
> {code}
>
> This fails:
> {code}
> SELECT * FROM testView
> org.apache.spark.sql.AnalysisException: cannot resolve '100milints.col' given
> input columns id; line 1 pos 7
> at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
> at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:56)
> at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:53)
> at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293)
> at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293)
> at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
> at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:292)
> at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290)
> at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290)
> at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:249)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
> at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
> at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
> at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
> at scala.collection.AbstractIterator.to(Iterator.scala:1157)
> at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
> at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
> at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
> at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
> at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:279)
> at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:290)
> at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:108)
> at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:118)
> at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:122)
> at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at scala.collection.AbstractTraversable.map(Traversable.scala:105)
> at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPla
> {code}
[jira] [Resolved] (SPARK-9803) Add transform and subset to DataFrame
[ https://issues.apache.org/jira/browse/SPARK-9803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shivaram Venkataraman resolved SPARK-9803.

    Resolution: Fixed
    Assignee: Felix Cheung
    Fix Version/s: 1.5.1, 1.6.0

Resolved by https://github.com/apache/spark/pull/8503

> Add transform and subset to DataFrame
> --
>
> Key: SPARK-9803
> URL: https://issues.apache.org/jira/browse/SPARK-9803
> Project: Spark
> Issue Type: Sub-task
> Components: SparkR
> Reporter: Hossein Falaki
> Assignee: Felix Cheung
> Fix For: 1.5.1, 1.6.0
>
> These base functions are heavily used with R data frames. It would be
> great to have them work with Spark DataFrames:
> * transform
> * subset
[jira] [Updated] (SPARK-10346) SparkR mutate and transform should replace column with same name to match R data.frame behavior
[ https://issues.apache.org/jira/browse/SPARK-10346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Felix Cheung updated SPARK-10346:

    Description:

Spark doesn't seem to replace an existing column with the same name in mutate (i.e. mutate(df, age = df$age + 2) returns a DataFrame with 2 columns both named 'age'), so transform does not do that for now either.

Though it is clearly stated it should replace the column with the matching name:
https://stat.ethz.ch/R-manual/R-devel/library/base/html/transform.html
"The tags are matched against names(_data), and for those that match, the value replace the corresponding variable in _data, and the others are appended to _data."

Also the resulting DataFrame might be hard to work with if one is to use select with column names, or to register the table to SQL, and so on, since 2 columns would then have the same name.

    was:

Spark doesn't seem to replace existing column with the name in mutate (ie. mutate(df, age = df$age + 2) - returned DataFrame has 2 columns with the same name 'age'), so therefore not doing that for now in transform. Though it is clearly stated it should replace column with matching name: https://stat.ethz.ch/R-manual/R-devel/library/base/html/transform.html "The tags are matched against names(_data), and for those that match, the value replace the corresponding variable in _data, and the others are appended to _data." Also the resulting DataFrame might be hard to work with if one is to use select with column names and so on.

> SparkR mutate and transform should replace column with same name to match R
> data.frame behavior
> ---
>
> Key: SPARK-10346
> URL: https://issues.apache.org/jira/browse/SPARK-10346
> Project: Spark
> Issue Type: Bug
> Components: R
> Affects Versions: 1.5.0
> Reporter: Felix Cheung
>
> Spark doesn't seem to replace an existing column with the same name in mutate
> (i.e. mutate(df, age = df$age + 2) returns a DataFrame with 2 columns both
> named 'age'), so transform does not do that for now either.
>
> Though it is clearly stated it should replace the column with the matching name:
> https://stat.ethz.ch/R-manual/R-devel/library/base/html/transform.html
> "The tags are matched against names(_data), and for those that match, the
> value replace the corresponding variable in _data, and the others are
> appended to _data."
>
> Also the resulting DataFrame might be hard to work with if one is to use
> select with column names, or to register the table to SQL, and so on, since
> 2 columns would then have the same name.
[jira] [Commented] (SPARK-8952) JsonFile() of SQLContext display improper warning message for a S3 path
[ https://issues.apache.org/jira/browse/SPARK-8952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14720889#comment-14720889 ]

Sun Rui commented on SPARK-8952:

I submitted https://issues.apache.org/jira/browse/SPARK-10347 to track this issue for a long-term solution.

> JsonFile() of SQLContext display improper warning message for a S3 path
> ---
>
> Key: SPARK-8952
> URL: https://issues.apache.org/jira/browse/SPARK-8952
> Project: Spark
> Issue Type: Bug
> Components: SparkR
> Affects Versions: 1.4.0
> Reporter: Sun Rui
> Assignee: Luciano Resende
> Fix For: 1.6.0, 1.5.1
>
> This is an issue reported by Ben Spark.
> {quote}
> Spark 1.4 deployed on AWS EMR
> "jsonFile" is working though with some warning message
> Warning message:
> In normalizePath(path) :
> path[1]="s3://rea-consumer-data-dev/cbr/profiler/output/20150618/part-0":
> No such file or directory
> {quote}
[jira] [Created] (SPARK-10347) Investigate the usage of normalizePath()
Sun Rui created SPARK-10347:

Summary: Investigate the usage of normalizePath()
Key: SPARK-10347
URL: https://issues.apache.org/jira/browse/SPARK-10347
Project: Spark
Issue Type: Bug
Components: SparkR
Reporter: Sun Rui
Priority: Minor

Currently normalizePath() is used in several places to allow users to specify paths via tilde expansion, or to normalize a relative path to an absolute path. However, normalizePath() is also used on paths that are actually expected to be URIs. normalizePath() may display warning messages when it does not recognize a URI as a local file path, so suppressWarnings() is used to suppress the possible warnings.

Worse than warnings, calling normalizePath() on a URI may cause errors: it may turn a user-specified relative path into an absolute path using the local current directory, which can be wrong because the path is actually relative to the working directory of the default file system, not the local file system (depending on the Hadoop configuration of Spark).
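The long-term direction hinted at above would be to normalize only scheme-less local paths and pass URIs through untouched. A minimal sketch of that idea, in plain Python rather than R (the function name and the `cwd` parameter are illustrative assumptions, not SparkR API):

```python
import os.path
from urllib.parse import urlparse

def normalize_if_local(path, cwd="/home/user"):
    """Normalize plain local paths only; leave URIs (s3://, hdfs://, ...)
    untouched, since they are relative to the default filesystem's working
    directory, not the local current directory."""
    if urlparse(path).scheme:
        # has a scheme -> treat as a URI, do not rewrite it
        return path
    if os.path.isabs(path):
        return os.path.normpath(path)
    # relative local path: resolve against the local current directory
    return os.path.normpath(os.path.join(cwd, path))
```

This avoids both the spurious warnings and the incorrect absolutization, since anything with a URI scheme is never touched.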
[jira] [Created] (SPARK-10346) SparkR mutate and transform should replace column with same name to match R data.frame behavior
Felix Cheung created SPARK-10346:

Summary: SparkR mutate and transform should replace column with same name to match R data.frame behavior
Key: SPARK-10346
URL: https://issues.apache.org/jira/browse/SPARK-10346
Project: Spark
Issue Type: Bug
Components: R
Affects Versions: 1.5.0
Reporter: Felix Cheung

Spark doesn't seem to replace an existing column with the same name in mutate (i.e. mutate(df, age = df$age + 2) returns a DataFrame with 2 columns both named 'age'), so transform does not do that for now either.

Though it is clearly stated it should replace the column with the matching name:
https://stat.ethz.ch/R-manual/R-devel/library/base/html/transform.html
"The tags are matched against names(_data), and for those that match, the value replace the corresponding variable in _data, and the others are appended to _data."

Also the resulting DataFrame might be hard to work with if one is to use select with column names and so on.
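The replace-rather-than-append semantics the report asks for can be illustrated with a toy mutate over a dict-of-lists "data frame". This is a hypothetical plain-Python sketch of the desired behavior, not SparkR code:

```python
def mutate(df, **cols):
    """R-data.frame-style mutate: columns whose names already exist are
    replaced in place (keeping their position); new names are appended.
    `df` is a dict mapping column name -> list of values."""
    out = dict(df)            # copy; preserves the original column order
    for name, values in cols.items():
        out[name] = values    # same-name assignment replaces, never duplicates
    return out
```

With this behavior, mutate(df, age=...) yields exactly one 'age' column, so downstream select-by-name and SQL registration remain unambiguous.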
[jira] [Comment Edited] (SPARK-7454) Perf test for power iteration clustering (PIC)
[ https://issues.apache.org/jira/browse/SPARK-7454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14720860#comment-14720860 ]

Feynman Liang edited comment on SPARK-7454 at 8/29/15 12:54 AM:

[~mengxr] do you mind assigning to me? Thanks!

was (Author: fliang):
[~mengxr] do you mind assigning to me and linking my PR, thanks!

> Perf test for power iteration clustering (PIC)
> --
>
> Key: SPARK-7454
> URL: https://issues.apache.org/jira/browse/SPARK-7454
> Project: Spark
> Issue Type: Sub-task
> Components: MLlib
> Affects Versions: 1.4.0
> Reporter: Xiangrui Meng
[jira] [Commented] (SPARK-7454) Perf test for power iteration clustering (PIC)
[ https://issues.apache.org/jira/browse/SPARK-7454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14720860#comment-14720860 ]

Feynman Liang commented on SPARK-7454:

[~mengxr] do you mind assigning to me and linking my PR, thanks!

> Perf test for power iteration clustering (PIC)
> --
>
> Key: SPARK-7454
> URL: https://issues.apache.org/jira/browse/SPARK-7454
> Project: Spark
> Issue Type: Sub-task
> Components: MLlib
> Affects Versions: 1.4.0
> Reporter: Xiangrui Meng
[jira] [Commented] (SPARK-7454) Perf test for power iteration clustering (PIC)
[ https://issues.apache.org/jira/browse/SPARK-7454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14720854#comment-14720854 ]

Feynman Liang commented on SPARK-7454:

https://github.com/databricks/spark-perf/pull/86

> Perf test for power iteration clustering (PIC)
> --
>
> Key: SPARK-7454
> URL: https://issues.apache.org/jira/browse/SPARK-7454
> Project: Spark
> Issue Type: Sub-task
> Components: MLlib
> Affects Versions: 1.4.0
> Reporter: Xiangrui Meng
[jira] [Commented] (SPARK-6817) DataFrame UDFs in R
[ https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14720845#comment-14720845 ]

Shivaram Venkataraman commented on SPARK-6817:

The idea behind having `dapplyCollect` was that it might be easier to implement, since the output doesn't necessarily need to be converted to a valid Spark DataFrame on the JVM and could instead be just any R data frame. But I agree that adding more keywords is confusing for users, and we could avoid this in a couple of ways: (1) implement the type conversion from R to JVM first, so we wouldn't need this; (2) have a slightly different class on the JVM that only supports collect on it (i.e. not a DataFrame) and use that.

Regarding gapply -- SparkR (and dplyr) already have a `group_by` function that does the grouping, and in SparkR this returns a `GroupedData` object. Right now the only function available on the `GroupedData` object is `agg`, to perform aggregations on it. We could instead support `dapply` on `GroupedData` objects, and then the syntax would be something like

{code}
grouped_df <- group_by(df, df$city)
collect(dapply(grouped_df, function(group) {} ))
{code}

cc [~rxin]

> DataFrame UDFs in R
> ---
>
> Key: SPARK-6817
> URL: https://issues.apache.org/jira/browse/SPARK-6817
> Project: Spark
> Issue Type: New Feature
> Components: SparkR, SQL
> Reporter: Shivaram Venkataraman
>
> This depends on some internal interface of Spark SQL, should be done after
> merging into Spark.
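The group-then-apply pattern discussed in the comment can be mimicked outside Spark. A plain-Python sketch where `group_by` and `dapply_collect` are hypothetical stand-ins for the proposed SparkR API (rows modeled as dicts):

```python
from collections import defaultdict

def group_by(rows, key):
    """Partition rows (dicts) into groups keyed by the value of `key`,
    like group_by(df, df$city) returning a GroupedData object."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return groups

def dapply_collect(grouped, func):
    """Apply `func` to each group and collect the results locally,
    analogous to collect(dapply(grouped_df, func))."""
    return {k: func(rows) for k, rows in grouped.items()}
```

The design point mirrors the comment: grouping and per-group application are separate steps, so `dapply` on a grouped object reuses the existing `group_by` rather than adding a new keyword.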
[jira] [Created] (SPARK-10345) Flaky test: HiveCompatibilitySuite.nonblock_op_deduplicate
Davies Liu created SPARK-10345:

Summary: Flaky test: HiveCompatibilitySuite.nonblock_op_deduplicate
Key: SPARK-10345
URL: https://issues.apache.org/jira/browse/SPARK-10345
Project: Spark
Issue Type: Bug
Components: SQL
Reporter: Davies Liu

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41759/testReport/org.apache.spark.sql.hive.execution/HiveCompatibilitySuite/nonblock_op_deduplicate/
[jira] [Commented] (SPARK-9963) ML RandomForest cleanup: replace predictNodeIndex with predictImpl
[ https://issues.apache.org/jira/browse/SPARK-9963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14720824#comment-14720824 ]

Trevor Cai commented on SPARK-9963:

[~lkhamsurenl]: Are you still working on this? I took a quick look at the code, and it looks to me like it's not trivial to convert the binnedFeatures and splits arrays into a single vector mapping feature to threshold. The below section of code seems to indicate to me that some features don't have a threshold and should always move right, and using Double.MAX_VALUE as the threshold in those cases seems like it could potentially cause issues.

{code}
override private[tree] def shouldGoLeft(binnedFeature: Int, splits: Array[Split]): Boolean = {
  if (binnedFeature == splits.length) {
    // > last split, so split right
    false
  } else {
    val featureValueUpperBound = splits(binnedFeature).asInstanceOf[ContinuousSplit].threshold
    featureValueUpperBound <= threshold
  }
}
{code}

As a result, if you're still working on this, your second proposal makes more sense. If not, I can pick up the issue and implement the second option. [~josephkb] Does this seem reasonable to you?

> ML RandomForest cleanup: replace predictNodeIndex with predictImpl
> --
>
> Key: SPARK-9963
> URL: https://issues.apache.org/jira/browse/SPARK-9963
> Project: Spark
> Issue Type: Improvement
> Components: ML
> Reporter: Joseph K. Bradley
> Priority: Trivial
> Labels: starter
>
> Replace ml.tree.impl.RandomForest.predictNodeIndex with Node.predictImpl.
> This should be straightforward, but please ping me if anything is unclear.
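For readers following the discussion, the decision logic of the Scala snippet above can be restated as a simplified Python sketch. The function name and the flat list of candidate thresholds are illustrative assumptions, not the actual Spark data layout:

```python
def should_go_left(binned_feature, split_thresholds, node_threshold):
    """Simplified rendering of shouldGoLeft for a continuous feature:
    a binned feature value indexes into the array of candidate split
    thresholds; the special index len(split_thresholds) means "past the
    last split" and always goes right."""
    if binned_feature == len(split_thresholds):
        return False  # beyond the last split: always go right
    # go left iff the bin's upper bound does not exceed this node's threshold
    return split_thresholds[binned_feature] <= node_threshold
```

The "always go right" branch is exactly the case the comment worries about: it has no threshold of its own, so flattening the splits into a single feature-to-threshold vector would need a sentinel value (such as Double.MAX_VALUE), which inverts the decision here.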
[jira] [Assigned] (SPARK-10344) Add tests for extraStrategies
[ https://issues.apache.org/jira/browse/SPARK-10344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10344: Assignee: Apache Spark (was: Michael Armbrust) > Add tests for extraStrategies > - > > Key: SPARK-10344 > URL: https://issues.apache.org/jira/browse/SPARK-10344 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Michael Armbrust >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10344) Add tests for extraStrategies
[ https://issues.apache.org/jira/browse/SPARK-10344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10344: Assignee: Michael Armbrust (was: Apache Spark) > Add tests for extraStrategies > - > > Key: SPARK-10344 > URL: https://issues.apache.org/jira/browse/SPARK-10344 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Michael Armbrust >Assignee: Michael Armbrust > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10344) Add tests for extraStrategies
[ https://issues.apache.org/jira/browse/SPARK-10344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14720820#comment-14720820 ] Apache Spark commented on SPARK-10344: -- User 'marmbrus' has created a pull request for this issue: https://github.com/apache/spark/pull/8516 > Add tests for extraStrategies > - > > Key: SPARK-10344 > URL: https://issues.apache.org/jira/browse/SPARK-10344 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Michael Armbrust >Assignee: Michael Armbrust > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10344) Add tests for extraStrategies
Michael Armbrust created SPARK-10344: Summary: Add tests for extraStrategies Key: SPARK-10344 URL: https://issues.apache.org/jira/browse/SPARK-10344 Project: Spark Issue Type: Improvement Components: SQL Reporter: Michael Armbrust Assignee: Michael Armbrust -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10326) Cannot launch YARN job on Windows
[ https://issues.apache.org/jira/browse/SPARK-10326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-10326: Target Version/s: 1.5.0 (was: 1.5.1) > Cannot launch YARN job on Windows > -- > > Key: SPARK-10326 > URL: https://issues.apache.org/jira/browse/SPARK-10326 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.5.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin > Fix For: 1.5.0 > > > The fix is already in master, and it's one line out of the patch for > SPARK-5754; the bug is that a Windows file path cannot be used to create a > URI, so {{File.toURI()}} needs to be called. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10334) Partitioned table scan's query plan does not show Filter and Project on top of the table scan
[ https://issues.apache.org/jira/browse/SPARK-10334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10334: Assignee: Yin Huai (was: Apache Spark) > Partitioned table scan's query plan does not show Filter and Project on top > of the table scan > - > > Key: SPARK-10334 > URL: https://issues.apache.org/jira/browse/SPARK-10334 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Yin Huai >Assignee: Yin Huai >Priority: Critical > > {code} > Seq(Tuple2(1, 1), Tuple2(2, 2)).toDF("i", > "j").write.format("parquet").partitionBy("i").save("/tmp/testFilter_partitioned") > val df1 = > sqlContext.read.format("parquet").load("/tmp/testFilter_partitioned") > df1.selectExpr("hash(i)", "hash(j)").show > df1.filter("hash(j) = 1").explain > == Physical Plan == > Scan ParquetRelation[file:/tmp/testFilter_partitioned][j#20,i#21] > {code} > Looks like the reason is that we correctly apply the project and filter. > Then, we create an RDD for the result and then manually create a PhysicalRDD. > So, the Project and Filter on top of the original table scan disappear from > the physical plan. > See > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L138-L175 > We will not generate a wrong result, but the query plan is confusing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10339) When scanning a partitioned table having thousands of partitions, Driver has a very high memory pressure because of SQL metrics
[ https://issues.apache.org/jira/browse/SPARK-10339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10339: Assignee: Apache Spark (was: Yin Huai) > When scanning a partitioned table having thousands of partitions, Driver has > a very high memory pressure because of SQL metrics > --- > > Key: SPARK-10339 > URL: https://issues.apache.org/jira/browse/SPARK-10339 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Yin Huai >Assignee: Apache Spark >Priority: Blocker > > I have a local dataset having 5000 partitions stored in {{/tmp/partitioned}}. > When I run the following code, the free memory space in driver's old gen > gradually decreases and eventually there is pretty much no free space in > driver's old gen. Finally, all kinds of timeouts happen and the cluster > dies. > {code} > val df = sqlContext.read.format("parquet").load("/tmp/partitioned") > df.filter("a > -100").selectExpr("hash(a, b)").queryExecution.toRdd.foreach(_ > => Unit) > {code} > I did a quick test by deleting SQL metrics from the project and filter > operators, and my job works fine. > The reason is that for a partitioned table, when we scan it, the actual plan > is like > {code} >other operators >| >| > /--|--\ >/ | \ > /|\ > / | \ > project project ... project > || | > filter filter ... filter > || | > part1part2 ... part n > {code} > We create SQL metrics for every filter and project, which causes > extremely high memory pressure on the driver. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10334) Partitioned table scan's query plan does not show Filter and Project on top of the table scan
[ https://issues.apache.org/jira/browse/SPARK-10334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14720749#comment-14720749 ] Apache Spark commented on SPARK-10334: -- User 'yhuai' has created a pull request for this issue: https://github.com/apache/spark/pull/8515 > Partitioned table scan's query plan does not show Filter and Project on top > of the table scan > - > > Key: SPARK-10334 > URL: https://issues.apache.org/jira/browse/SPARK-10334 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Yin Huai >Assignee: Yin Huai >Priority: Critical > > {code} > Seq(Tuple2(1, 1), Tuple2(2, 2)).toDF("i", > "j").write.format("parquet").partitionBy("i").save("/tmp/testFilter_partitioned") > val df1 = > sqlContext.read.format("parquet").load("/tmp/testFilter_partitioned") > df1.selectExpr("hash(i)", "hash(j)").show > df1.filter("hash(j) = 1").explain > == Physical Plan == > Scan ParquetRelation[file:/tmp/testFilter_partitioned][j#20,i#21] > {code} > Looks like the reason is that we correctly apply the project and filter. > Then, we create an RDD for the result and then manually create a PhysicalRDD. > So, the Project and Filter on top of the original table scan disappear from > the physical plan. > See > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L138-L175 > We will not generate a wrong result, but the query plan is confusing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-10326) Cannot launch YARN job on Windows
[ https://issues.apache.org/jira/browse/SPARK-10326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-10326. - Resolution: Fixed Assignee: Marcelo Vanzin Fix Version/s: 1.5.0 > Cannot launch YARN job on Windows > -- > > Key: SPARK-10326 > URL: https://issues.apache.org/jira/browse/SPARK-10326 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.5.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin > Fix For: 1.5.0 > > > The fix is already in master, and it's one line out of the patch for > SPARK-5754; the bug is that a Windows file path cannot be used to create a > URI, so {{File.toURI()}} needs to be called. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
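The SPARK-10326 bug ("a Windows file path cannot be used to create a URI, so {{File.toURI()}} needs to be called") can be illustrated outside of Spark. A minimal sketch, not the actual patch: backslashes are illegal URI characters, so building a {{java.net.URI}} from a raw Windows path throws, while going through {{File.toURI()}} normalizes and percent-encodes the path first.

```scala
import java.io.File
import java.net.URI

object WindowsPathUri {
  // Constructing a URI directly from a raw Windows path fails, because
  // backslashes are not legal URI characters.
  def uriFromRawPath(path: String): Either[Throwable, URI] =
    try Right(new URI(path)) catch { case e: Exception => Left(e) }

  // File.toURI() never throws: it normalizes and percent-encodes the path.
  def uriViaFile(path: String): URI = new File(path).toURI

  val winPath = "C:\\Users\\spark\\app.jar"
}
```

On Windows the second form yields {{file:/C:/Users/...}}; on other platforms the backslashes are merely percent-encoded, but the call still succeeds.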
[jira] [Assigned] (SPARK-10334) Partitioned table scan's query plan does not show Filter and Project on top of the table scan
[ https://issues.apache.org/jira/browse/SPARK-10334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10334: Assignee: Apache Spark (was: Yin Huai) > Partitioned table scan's query plan does not show Filter and Project on top > of the table scan > - > > Key: SPARK-10334 > URL: https://issues.apache.org/jira/browse/SPARK-10334 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Yin Huai >Assignee: Apache Spark >Priority: Critical > > {code} > Seq(Tuple2(1, 1), Tuple2(2, 2)).toDF("i", > "j").write.format("parquet").partitionBy("i").save("/tmp/testFilter_partitioned") > val df1 = > sqlContext.read.format("parquet").load("/tmp/testFilter_partitioned") > df1.selectExpr("hash(i)", "hash(j)").show > df1.filter("hash(j) = 1").explain > == Physical Plan == > Scan ParquetRelation[file:/tmp/testFilter_partitioned][j#20,i#21] > {code} > Looks like the reason is that we correctly apply the project and filter. > Then, we create an RDD for the result and then manually create a PhysicalRDD. > So, the Project and Filter on top of the original table scan disappear from > the physical plan. > See > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L138-L175 > We will not generate a wrong result, but the query plan is confusing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10339) When scanning a partitioned table having thousands of partitions, Driver has a very high memory pressure because of SQL metrics
[ https://issues.apache.org/jira/browse/SPARK-10339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10339: Assignee: Yin Huai (was: Apache Spark) > When scanning a partitioned table having thousands of partitions, Driver has > a very high memory pressure because of SQL metrics > --- > > Key: SPARK-10339 > URL: https://issues.apache.org/jira/browse/SPARK-10339 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Yin Huai >Assignee: Yin Huai >Priority: Blocker > > I have a local dataset having 5000 partitions stored in {{/tmp/partitioned}}. > When I run the following code, the free memory space in driver's old gen > gradually decreases and eventually there is pretty much no free space in > driver's old gen. Finally, all kinds of timeouts happen and the cluster > dies. > {code} > val df = sqlContext.read.format("parquet").load("/tmp/partitioned") > df.filter("a > -100").selectExpr("hash(a, b)").queryExecution.toRdd.foreach(_ > => Unit) > {code} > I did a quick test by deleting SQL metrics from the project and filter > operators, and my job works fine. > The reason is that for a partitioned table, when we scan it, the actual plan > is like > {code} >other operators >| >| > /--|--\ >/ | \ > /|\ > / | \ > project project ... project > || | > filter filter ... filter > || | > part1part2 ... part n > {code} > We create SQL metrics for every filter and project, which causes > extremely high memory pressure on the driver. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10339) When scanning a partitioned table having thousands of partitions, Driver has a very high memory pressure because of SQL metrics
[ https://issues.apache.org/jira/browse/SPARK-10339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14720748#comment-14720748 ] Apache Spark commented on SPARK-10339: -- User 'yhuai' has created a pull request for this issue: https://github.com/apache/spark/pull/8515 > When scanning a partitioned table having thousands of partitions, Driver has > a very high memory pressure because of SQL metrics > --- > > Key: SPARK-10339 > URL: https://issues.apache.org/jira/browse/SPARK-10339 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Yin Huai >Assignee: Yin Huai >Priority: Blocker > > I have a local dataset having 5000 partitions stored in {{/tmp/partitioned}}. > When I run the following code, the free memory space in driver's old gen > gradually decreases and eventually there is pretty much no free space in > driver's old gen. Finally, all kinds of timeouts happen and the cluster > dies. > {code} > val df = sqlContext.read.format("parquet").load("/tmp/partitioned") > df.filter("a > -100").selectExpr("hash(a, b)").queryExecution.toRdd.foreach(_ > => Unit) > {code} > I did a quick test by deleting SQL metrics from the project and filter > operators, and my job works fine. > The reason is that for a partitioned table, when we scan it, the actual plan > is like > {code} >other operators >| >| > /--|--\ >/ | \ > /|\ > / | \ > project project ... project > || | > filter filter ... filter > || | > part1part2 ... part n > {code} > We create SQL metrics for every filter and project, which causes > extremely high memory pressure on the driver. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9078) Use of non-standard LIMIT keyword in JDBC tableExists code
[ https://issues.apache.org/jira/browse/SPARK-9078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14720747#comment-14720747 ] Suresh Thalamati commented on SPARK-9078: - @Bob, Reynold I ran into the same issue when trying to write data frames into an existing table in a DB2 database. If you are not working on the fix, I would like to give it a try. Thank you for the analysis of the issue; my understanding is the fix should do the following to address the table-exists problem with LIMIT syntax: -- Add a tableExists method to the JdbcDialect interface, and allow dialects to override the method as required for specific databases. -- The default implementation of the tableExists method should use DatabaseMetaData.getTables() to find out whether the table exists. If that particular interface is not implemented, use the query "select 1 from $table where 1=0". -- Add a tableExists method that uses a LIMIT query to the MySQL and Postgres dialects. * Enhancing registration of dialects: (I think this may have to be a separate JIRA to avoid confusion). @Reynold: I do not understand your comment on adding an option to pass through the jdbc data source. If you can give an example, that would be great. Are you referring to something like the following? df.write.option("datasource.jdbc.dialects", "org.apache.DerbyDialect").jdbc("jdbc:derby:// > Use of non-standard LIMIT keyword in JDBC tableExists code > -- > > Key: SPARK-9078 > URL: https://issues.apache.org/jira/browse/SPARK-9078 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.1, 1.4.0 >Reporter: Robert Beauchemin >Priority: Minor > > tableExists in > spark/sql/core/src/main/scala/org/apache/spark/sql/jdbc/JdbcUtils.scala uses > non-standard SQL (specifically, the LIMIT keyword) to determine whether a > table exists in a JDBC data source. This will cause an exception in many/most > JDBC databases that don't support the LIMIT keyword.
See > http://stackoverflow.com/questions/1528604/how-universal-is-the-limit-statement-in-sql > To check for table existence or an exception, it could be recrafted around > "select 1 from $table where 0 = 1", which isn't the same (it returns an empty > resultset rather than the value '1') but would support more data sources and > also support empty tables. Arguably ugly, and it possibly queries every row on > sources that don't support constant folding, but better than failing on JDBC > sources that don't support LIMIT. > Perhaps "supports LIMIT" could be a field in the JdbcDialect class for > databases that support the keyword to override. The ANSI standard is (OFFSET > and) FETCH. > The standard way to check for table existence would be to use > information_schema.tables, which is a SQL standard but may not work for other > JDBC data sources that support SQL but not the information_schema. The JDBC > DatabaseMetaData interface provides getSchemas(), which allows checking for > the information_schema in drivers that support it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
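The dialect-based approach proposed in the comment above can be sketched in a few lines of plain Scala. The names below ({{JdbcDialect}}, {{getTableExistsQuery}}, {{MySQLDialect}}, {{DB2Dialect}}) are illustrative, not the actual Spark API: the default probe avoids LIMIT entirely, and only dialects known to support the keyword override it.

```scala
// Sketch of a dialect-overridable table-existence probe (illustrative names,
// not the Spark jdbc package).
trait JdbcDialect {
  // Portable default: returns an empty result set, valid on engines
  // without LIMIT support and cheap even on large tables.
  def getTableExistsQuery(table: String): String =
    s"SELECT 1 FROM $table WHERE 1=0"
}

object MySQLDialect extends JdbcDialect {
  // MySQL and Postgres support LIMIT, so they can keep the original probe.
  override def getTableExistsQuery(table: String): String =
    s"SELECT 1 FROM $table LIMIT 1"
}

object DB2Dialect extends JdbcDialect // inherits the portable default
```

Per the comment, a real implementation could first try `DatabaseMetaData.getTables()` and only fall back to a probe query when the driver does not implement it.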
[jira] [Assigned] (SPARK-10334) Partitioned table scan's query plan does not show Filter and Project on top of the table scan
[ https://issues.apache.org/jira/browse/SPARK-10334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai reassigned SPARK-10334: Assignee: Yin Huai > Partitioned table scan's query plan does not show Filter and Project on top > of the table scan > - > > Key: SPARK-10334 > URL: https://issues.apache.org/jira/browse/SPARK-10334 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Yin Huai >Assignee: Yin Huai >Priority: Critical > > {code} > Seq(Tuple2(1, 1), Tuple2(2, 2)).toDF("i", > "j").write.format("parquet").partitionBy("i").save("/tmp/testFilter_partitioned") > val df1 = > sqlContext.read.format("parquet").load("/tmp/testFilter_partitioned") > df1.selectExpr("hash(i)", "hash(j)").show > df1.filter("hash(j) = 1").explain > == Physical Plan == > Scan ParquetRelation[file:/tmp/testFilter_partitioned][j#20,i#21] > {code} > Looks like the reason is that we correctly apply the project and filter. > Then, we create an RDD for the result and then manually create a PhysicalRDD. > So, the Project and Filter on top of the original table scan disappear from > the physical plan. > See > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L138-L175 > We will not generate a wrong result, but the query plan is confusing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-10321) OrcRelation doesn't override sizeInBytes
[ https://issues.apache.org/jira/browse/SPARK-10321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-10321. -- Resolution: Fixed Fix Version/s: 1.5.0 1.4.2 > OrcRelation doesn't override sizeInBytes > > > Key: SPARK-10321 > URL: https://issues.apache.org/jira/browse/SPARK-10321 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Cheng Lian >Assignee: Davies Liu >Priority: Critical > Fix For: 1.4.2, 1.5.0 > > > This hurts performance badly because broadcast join can never be enabled. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9957) Spark ML trees should filter out 1-category features
[ https://issues.apache.org/jira/browse/SPARK-9957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14720720#comment-14720720 ] holdenk commented on SPARK-9957: I can give this a shot :) > Spark ML trees should filter out 1-category features > > > Key: SPARK-9957 > URL: https://issues.apache.org/jira/browse/SPARK-9957 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Priority: Minor > > Spark ML trees should filter out 1-category categorical features. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10222) Deprecate, retire Bagel in favor of GraphX
[ https://issues.apache.org/jira/browse/SPARK-10222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14720709#comment-14720709 ] holdenk commented on SPARK-10222: - For what it's worth, that sounds like a good plan to me. I occasionally get people asking me which one they should use, so reducing confusion by removing it sounds like a good plan. > Deprecate, retire Bagel in favor of GraphX > -- > > Key: SPARK-10222 > URL: https://issues.apache.org/jira/browse/SPARK-10222 > Project: Spark > Issue Type: Task > Components: GraphX >Reporter: Sean Owen > > It seems like Bagel has had little or no activity since before even Spark 1.0 > (?) and is supposed to be superseded by GraphX. > Would it be reasonable to deprecate it for 1.6? and remove it in Spark 2.x? I > think it's reasonable enough that I'll assert this as a JIRA, but obviously > open to discussion. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10339) When scanning a partitioned table having thousands of partitions, Driver has a very high memory pressure because of SQL metrics
[ https://issues.apache.org/jira/browse/SPARK-10339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai reassigned SPARK-10339: Assignee: Yin Huai > When scanning a partitioned table having thousands of partitions, Driver has > a very high memory pressure because of SQL metrics > --- > > Key: SPARK-10339 > URL: https://issues.apache.org/jira/browse/SPARK-10339 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Yin Huai >Assignee: Yin Huai >Priority: Blocker > > I have a local dataset having 5000 partitions stored in {{/tmp/partitioned}}. > When I run the following code, the free memory space in driver's old gen > gradually decreases and eventually there is pretty much no free space in > driver's old gen. Finally, all kinds of timeouts happen and the cluster > dies. > {code} > val df = sqlContext.read.format("parquet").load("/tmp/partitioned") > df.filter("a > -100").selectExpr("hash(a, b)").queryExecution.toRdd.foreach(_ > => Unit) > {code} > I did a quick test by deleting SQL metrics from the project and filter > operators, and my job works fine. > The reason is that for a partitioned table, when we scan it, the actual plan > is like > {code} >other operators >| >| > /--|--\ >/ | \ > /|\ > / | \ > project project ... project > || | > filter filter ... filter > || | > part1part2 ... part n > {code} > We create SQL metrics for every filter and project, which causes > extremely high memory pressure on the driver. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10296) add preservesPartitioning parameter to RDD.map
[ https://issues.apache.org/jira/browse/SPARK-10296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14720706#comment-14720706 ] holdenk commented on SPARK-10296: - mapValues won't expose the key that it's being operated on (it's possible to have something where you aren't going to change the key, but the change to the value depends in part on the key). > add preservesPartitioning parameter to RDD.map > - > > Key: SPARK-10296 > URL: https://issues.apache.org/jira/browse/SPARK-10296 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Esteban Donato >Priority: Minor > > It would be nice to add the Boolean parameter preservesPartitioning, with > default false, to the RDD.map method just as it is in the RDD.mapPartitions method. > If you agree I can submit a pull request with this enhancement. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
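The distinction holdenk draws between mapValues and a general map can be shown with a toy model in plain Scala (not Spark code): elements belong to the partition chosen by hashing their key, so a transformation that touches only values cannot move any element, while one that rewrites keys can.

```scala
object PartitioningToy {
  // Toy stand-in for a hash partitioner (illustrative, not Spark's).
  def partitionOf(key: Int, numPartitions: Int): Int =
    Math.floorMod(key.hashCode, numPartitions)

  val data = Seq(1 -> "a", 2 -> "b", 3 -> "c")
  val before = data.map { case (k, _) => partitionOf(k, 4) }

  // Touching only the values leaves every element in its partition,
  // which is why mapValues can always keep the partitioner.
  val valuesChanged = data.map { case (k, v) => (k, v.toUpperCase) }
  val afterValues = valuesChanged.map { case (k, _) => partitionOf(k, 4) }

  // Rewriting keys can move elements between partitions, so a generic
  // RDD.map must drop the partitioner unless the caller explicitly
  // promises preservesPartitioning.
  val keysChanged = data.map { case (k, v) => (k + 1, v) }
  val afterKeys = keysChanged.map { case (k, _) => partitionOf(k, 4) }
}
```

This is also why the proposed flag would be purely a caller promise: Spark cannot verify it, just as with the existing parameter on mapPartitions.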
[jira] [Updated] (SPARK-10341) SMJ fail with unable to acquire memory
[ https://issues.apache.org/jira/browse/SPARK-10341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-10341: --- Target Version/s: 1.5.0 (was: 1.5.1) > SMJ fail with unable to acquire memory > -- > > Key: SPARK-10341 > URL: https://issues.apache.org/jira/browse/SPARK-10341 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Davies Liu >Assignee: Davies Liu >Priority: Critical > > In SMJ, the first ExternalSorter could consume all the memory before > spilling, then the second cannot even acquire the first page. > {code} > java.io.IOException: Unable to acquire 16777216 bytes of memory > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:368) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.<init>(UnsafeExternalSorter.java:138) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.create(UnsafeExternalSorter.java:106) > at > org.apache.spark.sql.execution.UnsafeExternalRowSorter.<init>(UnsafeExternalRowSorter.java:68) > at > org.apache.spark.sql.execution.TungstenSort.org$apache$spark$sql$execution$TungstenSort$$preparePartition$1(sort.scala:146) > at > org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.scala:169) > at > org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.scala:169) > at > org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:45) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) > at org.apache.spark.scheduler.Task.run(Task.scala:88) > at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10175) Enhance spark doap file
[ https://issues.apache.org/jira/browse/SPARK-10175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14720638#comment-14720638 ] Luciano Resende commented on SPARK-10175: - ping > Enhance spark doap file > --- > > Key: SPARK-10175 > URL: https://issues.apache.org/jira/browse/SPARK-10175 > Project: Spark > Issue Type: Bug > Components: Project Infra >Reporter: Luciano Resende > Attachments: SPARK-10175 > > > The Spark doap has broken links and is also missing entries related to issue > tracker and mailing lists. This affects the list in projects.apache.org and > also in the main apache website. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10299) word2vec should allow users to specify the window size
[ https://issues.apache.org/jira/browse/SPARK-10299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10299: Assignee: Apache Spark > word2vec should allow users to specify the window size > -- > > Key: SPARK-10299 > URL: https://issues.apache.org/jira/browse/SPARK-10299 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: holdenk >Assignee: Apache Spark >Priority: Minor > > Currently word2vec has the window hard-coded at 5; some users may want > different sizes (for example, if using it on n-gram input or similar). The user > request comes from > http://stackoverflow.com/questions/32231975/spark-word2vec-window-size . -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10299) word2vec should allow users to specify the window size
[ https://issues.apache.org/jira/browse/SPARK-10299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10299: Assignee: (was: Apache Spark) > word2vec should allow users to specify the window size > -- > > Key: SPARK-10299 > URL: https://issues.apache.org/jira/browse/SPARK-10299 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: holdenk >Priority: Minor > > Currently word2vec has the window hard-coded at 5; some users may want > different sizes (for example, if using it on n-gram input or similar). The user > request comes from > http://stackoverflow.com/questions/32231975/spark-word2vec-window-size . -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10299) word2vec should allow users to specify the window size
[ https://issues.apache.org/jira/browse/SPARK-10299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14720651#comment-14720651 ] Apache Spark commented on SPARK-10299: -- User 'holdenk' has created a pull request for this issue: https://github.com/apache/spark/pull/8513 > word2vec should allow users to specify the window size > -- > > Key: SPARK-10299 > URL: https://issues.apache.org/jira/browse/SPARK-10299 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: holdenk >Priority: Minor > > Currently word2vec has the window hard-coded at 5; some users may want > different sizes (for example, if using it on n-gram input or similar). The user > request comes from > http://stackoverflow.com/questions/32231975/spark-word2vec-window-size . -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
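What the hard-coded window of 5 controls can be sketched in plain Scala: the window is how many neighbors on each side of a center word count as context when training pairs are generated. This is an illustration of the concept only, not the MLlib Word2Vec implementation.

```scala
object SkipGramWindow {
  // Generate (center, context) pairs for a given window size: each word is
  // paired with every other word at most `window` positions away.
  def contextPairs(tokens: Seq[String], window: Int): Seq[(String, String)] =
    for {
      i <- tokens.indices
      j <- math.max(0, i - window) to math.min(tokens.size - 1, i + window)
      if j != i
    } yield (tokens(i), tokens(j))
}
```

With n-gram tokens, a window of 5 spans far more raw text than it does with unigrams, which is why callers such as the one in the linked StackOverflow question want it configurable.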
[jira] [Resolved] (SPARK-10323) NPE in code-gened In expression
[ https://issues.apache.org/jira/browse/SPARK-10323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-10323. Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 8492 [https://github.com/apache/spark/pull/8492] > NPE in code-gened In expression > --- > > Key: SPARK-10323 > URL: https://issues.apache.org/jira/browse/SPARK-10323 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Yin Huai >Assignee: Davies Liu >Priority: Critical > Fix For: 1.5.0 > > > To reproduce the problem, you can run {{null in ('str')}}. Let's also take a > look at InSet and other similar expressions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
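For context on why {{null in ('str')}} is tricky: under SQL's three-valued logic the correct result is NULL (unknown), not false, and certainly not an NPE. A hedged pure-Python sketch of the intended semantics, using None for SQL NULL (this models the SQL standard, not Spark's generated code):

```python
def sql_in(value, candidates):
    """SQL semantics of `value IN (candidates...)` with NULLs as None.

    Returns True, False, or None (SQL's UNKNOWN result).
    """
    if value is None:
        return None  # NULL IN (...) is NULL, never an exception
    found = False
    saw_null = False
    for c in candidates:
        if c is None:
            saw_null = True
        elif c == value:
            found = True
    if found:
        return True
    # No match: UNKNOWN if a NULL candidate could have matched.
    return None if saw_null else False
```

Code-generated evaluation has to preserve exactly these cases, which is why a missing null branch surfaces as an NPE.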
[jira] [Closed] (SPARK-10338) network/yarn is missing from sbt build
[ https://issues.apache.org/jira/browse/SPARK-10338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nilanjan Raychaudhuri closed SPARK-10338. - Resolution: Cannot Reproduce > network/yarn is missing from sbt build > -- > > Key: SPARK-10338 > URL: https://issues.apache.org/jira/browse/SPARK-10338 > Project: Spark > Issue Type: Bug > Components: Build, YARN >Affects Versions: 1.5.0 >Reporter: Nilanjan Raychaudhuri > > The following command fails with a missing LoggingFactory class compilation > error. I am running this on the current master. > build/sbt -Pyarn -Phadoop-2.3 assembly > One fix is to add the following to the respective pom.xml file: > > <dependency> > <groupId>org.slf4j</groupId> > <artifactId>slf4j-api</artifactId> > <scope>provided</scope> > </dependency> > I am not sure whether this compilation error is specific to sbt or whether > mvn also has this problem. > And for sbt to recognize the network/yarn project we might also have to add > the module to the parent pom.xml file. > wdyt? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10343) Consider nullability of expression in codegen
Davies Liu created SPARK-10343: -- Summary: Consider nullability of expression in codegen Key: SPARK-10343 URL: https://issues.apache.org/jira/browse/SPARK-10343 Project: Spark Issue Type: Improvement Components: SQL Reporter: Davies Liu Priority: Critical In codegen, we currently do not consider the nullability of expressions. Once we do, we can avoid many null checks (reducing the size of the generated code and also improving performance). Before that, we should double-check the correctness of the nullability of all expressions and schemas, or we will hit NPEs or wrong results. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
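A toy illustration of the optimization proposed above (not Spark's actual Janino-based codegen; the generator below is invented for this sketch): when an operand is statically known to be non-nullable, the generated code can omit its null check entirely, shrinking the code and the branch count.

```python
def gen_add(left_nullable, right_nullable):
    """Generate Python source for `a + b`, emitting a null check only
    for operands whose nullability is unknown or true."""
    checks = []
    if left_nullable:
        checks.append("if a is None: return None")
    if right_nullable:
        checks.append("if b is None: return None")
    body = "\n    ".join(checks + ["return a + b"])
    return "def add(a, b):\n    " + body

# Non-nullable operands: the generated code has no null checks at all.
namespace = {}
exec(gen_add(False, False), namespace)
add = namespace["add"]
```

The correctness caveat in the issue maps directly onto this sketch: if nullability metadata wrongly claims non-nullable, the generated `add` dereferences None and fails.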
[jira] [Commented] (SPARK-10338) network/yarn is missing from sbt build
[ https://issues.apache.org/jira/browse/SPARK-10338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14720600#comment-14720600 ] Nilanjan Raychaudhuri commented on SPARK-10338: --- You are right. I tried again with a clean repo. My sbt cache might have been messed up. > network/yarn is missing from sbt build > -- > > Key: SPARK-10338 > URL: https://issues.apache.org/jira/browse/SPARK-10338 > Project: Spark > Issue Type: Bug > Components: Build, YARN >Affects Versions: 1.5.0 >Reporter: Nilanjan Raychaudhuri > > The following command fails with a missing LoggingFactory class compilation > error. I am running this on the current master. > build/sbt -Pyarn -Phadoop-2.3 assembly > One fix is to add the following to the respective pom.xml file: > > <dependency> > <groupId>org.slf4j</groupId> > <artifactId>slf4j-api</artifactId> > <scope>provided</scope> > </dependency> > I am not sure whether this compilation error is specific to sbt or whether > mvn also has this problem. > And for sbt to recognize the network/yarn project we might also have to add > the module to the parent pom.xml file. > wdyt? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10340) Use S3 bulk listing for S3-backed Hive tables
[ https://issues.apache.org/jira/browse/SPARK-10340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14720581#comment-14720581 ] Apache Spark commented on SPARK-10340: -- User 'piaozhexiu' has created a pull request for this issue: https://github.com/apache/spark/pull/8512 > Use S3 bulk listing for S3-backed Hive tables > - > > Key: SPARK-10340 > URL: https://issues.apache.org/jira/browse/SPARK-10340 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.4.1, 1.5.0 >Reporter: Cheolsoo Park > > AWS S3 provides a bulk listing API. It takes the common prefix of all input > paths as a parameter and returns all the objects whose prefixes start with > the common prefix in blocks of 1000. > Since SPARK-9926 allows us to list multiple partitions all together, we can > significantly speed up input split calculation using S3 bulk listing. This > optimization is particularly useful for queries like {{select * from > partitioned_table limit 10}}. > This is a common optimization for S3. For example, here is a [blog > post|http://www.qubole.com/blog/product/optimizing-hadoop-for-s3-part-1/] > from Qubole on this topic. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
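The optimization described above relies on S3's flat keyspace: keys are plain strings, so listing by the longest common prefix of all partition paths returns every matching object in pages of up to 1000, replacing one listing request per partition directory with a handful of paged requests. A sketch against a simulated object store (the real code would go through the Hadoop/AWS client, not these hypothetical helpers):

```python
import os.path

def common_prefix(paths):
    # os.path.commonprefix is character-wise, which matches S3's
    # prefix semantics: keys are flat strings, not directories.
    return os.path.commonprefix(paths)

def bulk_list(all_keys, prefix, page_size=1000):
    """Simulate paged prefix listing: one 'request' per page of up to
    `page_size` matching keys."""
    matches = sorted(k for k in all_keys if k.startswith(prefix))
    for i in range(0, len(matches), page_size):
        yield matches[i:i + page_size]

# 30 daily partitions x 10 files: 300 keys under one table prefix.
keys = ["t/ds=2015-08-%02d/part-%05d" % (d, p)
        for d in range(1, 31) for p in range(10)]
prefix = common_prefix(["t/ds=2015-08-01", "t/ds=2015-08-30"])
pages = list(bulk_list(keys, prefix, page_size=100))
```

Here 300 keys come back in 3 paged calls instead of 30 per-partition listings, which is the speedup the issue is after.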
[jira] [Updated] (SPARK-10342) Cooperative memory management
[ https://issues.apache.org/jira/browse/SPARK-10342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-10342: --- Issue Type: Improvement (was: Story) > Cooperative memory management > - > > Key: SPARK-10342 > URL: https://issues.apache.org/jira/browse/SPARK-10342 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 1.5.0 >Reporter: Davies Liu >Priority: Critical > > We have had memory starvation problems for a long time, and they became worse > in 1.5 since we use larger pages. > In order to increase memory utilization (reduce unnecessary spilling) and > reduce the risk of OOM, we should manage memory in a cooperative way: every > memory consumer should also be responsive to requests from others to release > memory (by spilling). > Memory requests can differ: a hard requirement (crash if not allocated) or a > soft requirement (worse performance if not allocated). The costs of spilling > also differ. We could introduce some kind of priority to make consumers work > together better. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10341) SMJ fail with unable to acquire memory
[ https://issues.apache.org/jira/browse/SPARK-10341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10341: Assignee: Apache Spark (was: Davies Liu) > SMJ fail with unable to acquire memory > -- > > Key: SPARK-10341 > URL: https://issues.apache.org/jira/browse/SPARK-10341 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Davies Liu >Assignee: Apache Spark >Priority: Critical > > In SMJ, the first ExternalSorter could consume all the memory before > spilling; then the second cannot even acquire its first page. > {code} > java.io.IOException: Unable to acquire 16777216 bytes of memory > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:368) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.<init>(UnsafeExternalSorter.java:138) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.create(UnsafeExternalSorter.java:106) > at > org.apache.spark.sql.execution.UnsafeExternalRowSorter.<init>(UnsafeExternalRowSorter.java:68) > at > org.apache.spark.sql.execution.TungstenSort.org$apache$spark$sql$execution$TungstenSort$$preparePartition$1(sort.scala:146) > at > org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.scala:169) > at > org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.scala:169) > at > org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:45) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) > at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10336) fitIntercept is a command line option but not set in the LR example program.
[ https://issues.apache.org/jira/browse/SPARK-10336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-10336: -- Fix Version/s: (was: 1.5.0) 1.5.1 > fitIntercept is a command line option but not set in the LR example program. > > > Key: SPARK-10336 > URL: https://issues.apache.org/jira/browse/SPARK-10336 > Project: Spark > Issue Type: Bug >Reporter: Shuo Xiang >Assignee: Shuo Xiang > Fix For: 1.5.1 > > > the parsed parameter is not set. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10340) Use S3 bulk listing for S3-backed Hive tables
[ https://issues.apache.org/jira/browse/SPARK-10340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10340: Assignee: (was: Apache Spark) > Use S3 bulk listing for S3-backed Hive tables > - > > Key: SPARK-10340 > URL: https://issues.apache.org/jira/browse/SPARK-10340 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.4.1, 1.5.0 >Reporter: Cheolsoo Park > > AWS S3 provides a bulk listing API. It takes the common prefix of all input > paths as a parameter and returns all the objects whose prefixes start with > the common prefix in blocks of 1000. > Since SPARK-9926 allows us to list multiple partitions all together, we can > significantly speed up input split calculation using S3 bulk listing. This > optimization is particularly useful for queries like {{select * from > partitioned_table limit 10}}. > This is a common optimization for S3. For example, here is a [blog > post|http://www.qubole.com/blog/product/optimizing-hadoop-for-s3-part-1/] > from Qubole on this topic. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10340) Use S3 bulk listing for S3-backed Hive tables
[ https://issues.apache.org/jira/browse/SPARK-10340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10340: Assignee: Apache Spark > Use S3 bulk listing for S3-backed Hive tables > - > > Key: SPARK-10340 > URL: https://issues.apache.org/jira/browse/SPARK-10340 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.4.1, 1.5.0 >Reporter: Cheolsoo Park >Assignee: Apache Spark > > AWS S3 provides a bulk listing API. It takes the common prefix of all input > paths as a parameter and returns all the objects whose prefixes start with > the common prefix in blocks of 1000. > Since SPARK-9926 allows us to list multiple partitions all together, we can > significantly speed up input split calculation using S3 bulk listing. This > optimization is particularly useful for queries like {{select * from > partitioned_table limit 10}}. > This is a common optimization for S3. For example, here is a [blog > post|http://www.qubole.com/blog/product/optimizing-hadoop-for-s3-part-1/] > from Qubole on this topic. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10342) Cooperative memory management
[ https://issues.apache.org/jira/browse/SPARK-10342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-10342: Assignee: (was: Davies Liu) > Cooperative memory management > - > > Key: SPARK-10342 > URL: https://issues.apache.org/jira/browse/SPARK-10342 > Project: Spark > Issue Type: Story > Components: Spark Core, SQL >Affects Versions: 1.5.0 >Reporter: Davies Liu >Priority: Critical > > We have had memory starvation problems for a long time, and they became worse > in 1.5 since we use larger pages. > In order to increase memory utilization (reduce unnecessary spilling) and > reduce the risk of OOM, we should manage memory in a cooperative way: every > memory consumer should also be responsive to requests from others to release > memory (by spilling). > Memory requests can differ: a hard requirement (crash if not allocated) or a > soft requirement (worse performance if not allocated). The costs of spilling > also differ. We could introduce some kind of priority to make consumers work > together better. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10342) Cooperative memory management
[ https://issues.apache.org/jira/browse/SPARK-10342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-10342: Issue Type: Story (was: New Feature) > Cooperative memory management > - > > Key: SPARK-10342 > URL: https://issues.apache.org/jira/browse/SPARK-10342 > Project: Spark > Issue Type: Story > Components: Spark Core, SQL >Affects Versions: 1.5.0 >Reporter: Davies Liu >Assignee: Davies Liu >Priority: Critical > > We have had memory starvation problems for a long time, and they became worse > in 1.5 since we use larger pages. > In order to increase memory utilization (reduce unnecessary spilling) and > reduce the risk of OOM, we should manage memory in a cooperative way: every > memory consumer should also be responsive to requests from others to release > memory (by spilling). > Memory requests can differ: a hard requirement (crash if not allocated) or a > soft requirement (worse performance if not allocated). The costs of spilling > also differ. We could introduce some kind of priority to make consumers work > together better. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10342) Cooperative memory management
[ https://issues.apache.org/jira/browse/SPARK-10342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-10342: Issue Type: New Feature (was: Bug) > Cooperative memory management > - > > Key: SPARK-10342 > URL: https://issues.apache.org/jira/browse/SPARK-10342 > Project: Spark > Issue Type: New Feature > Components: Spark Core, SQL >Affects Versions: 1.5.0 >Reporter: Davies Liu >Assignee: Davies Liu >Priority: Critical > > We have had memory starvation problems for a long time, and they became worse > in 1.5 since we use larger pages. > In order to increase memory utilization (reduce unnecessary spilling) and > reduce the risk of OOM, we should manage memory in a cooperative way: every > memory consumer should also be responsive to requests from others to release > memory (by spilling). > Memory requests can differ: a hard requirement (crash if not allocated) or a > soft requirement (worse performance if not allocated). The costs of spilling > also differ. We could introduce some kind of priority to make consumers work > together better. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10336) fitIntercept is a command line option but not set in the LR example program.
[ https://issues.apache.org/jira/browse/SPARK-10336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14720573#comment-14720573 ] DB Tsai commented on SPARK-10336: - Thanks for the heads-up. > fitIntercept is a command line option but not set in the LR example program. > > > Key: SPARK-10336 > URL: https://issues.apache.org/jira/browse/SPARK-10336 > Project: Spark > Issue Type: Bug > Components: Documentation, ML >Affects Versions: 1.4.1, 1.5.0 >Reporter: Shuo Xiang >Assignee: Shuo Xiang > Fix For: 1.5.1 > > > the parsed parameter is not set. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10336) fitIntercept is a command line option but not set in the LR example program.
[ https://issues.apache.org/jira/browse/SPARK-10336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14720559#comment-14720559 ] Xiangrui Meng commented on SPARK-10336: --- [~dbtsai] Please check the JIRA fields after you merge a PR. When an RC release is in vote, we should set the fix version to 1.5.1 instead of 1.5.0. If there is a new RC, there will be a batch job to modify 1.5.1 back to 1.5.0. For bugs, affects versions should be set as well. > fitIntercept is a command line option but not set in the LR example program. > > > Key: SPARK-10336 > URL: https://issues.apache.org/jira/browse/SPARK-10336 > Project: Spark > Issue Type: Bug > Components: Documentation, ML >Affects Versions: 1.4.1, 1.5.0 >Reporter: Shuo Xiang >Assignee: Shuo Xiang > Fix For: 1.5.1 > > > the parsed parameter is not set. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10342) Cooperative memory management
Davies Liu created SPARK-10342: -- Summary: Cooperative memory management Key: SPARK-10342 URL: https://issues.apache.org/jira/browse/SPARK-10342 Project: Spark Issue Type: Bug Components: Spark Core, SQL Affects Versions: 1.5.0 Reporter: Davies Liu Assignee: Davies Liu Priority: Critical We have had memory starvation problems for a long time, and they became worse in 1.5 since we use larger pages. In order to increase memory utilization (reduce unnecessary spilling) and reduce the risk of OOM, we should manage memory in a cooperative way: every memory consumer should also be responsive to requests from others to release memory (by spilling). Memory requests can differ: a hard requirement (crash if not allocated) or a soft requirement (worse performance if not allocated). The costs of spilling also differ. We could introduce some kind of priority to make consumers work together better. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
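A minimal sketch of the cooperative scheme described in this issue (illustrative only; these class names are invented and do not match Spark's actual MemoryManager/MemoryConsumer API): each consumer registers a spill callback, and when an allocation cannot be satisfied the manager asks the other consumers to release memory before failing.

```python
class Manager:
    def __init__(self, total):
        self.total = total
        self.consumers = []  # objects with .used and .spill(need) -> freed

    def register(self, consumer):
        self.consumers.append(consumer)

    def acquire(self, who, need):
        free = self.total - sum(c.used for c in self.consumers)
        if free < need:
            # Cooperative step: ask other consumers to spill first.
            for c in self.consumers:
                if c is not who and free < need:
                    free += c.spill(need - free)
        if free < need:
            raise MemoryError("unable to acquire %d bytes" % need)
        who.used += need

class Sorter:
    def __init__(self, mgr):
        self.used = 0
        self.mgr = mgr
        mgr.register(self)

    def insert(self, nbytes):
        self.mgr.acquire(self, nbytes)

    def spill(self, need):
        # Release everything held; a real sorter would write to disk.
        freed, self.used = self.used, 0
        return freed
```

With this in place, a second consumer arriving after the first has taken everything triggers a spill instead of a hard failure; hard vs. soft requirements and spill-cost priorities would refine the order in which consumers are asked.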
[jira] [Resolved] (SPARK-9671) ML 1.5 QA: Programming guide update and migration guide
[ https://issues.apache.org/jira/browse/SPARK-9671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-9671. -- Resolution: Fixed Fix Version/s: 1.5.1 Issue resolved by pull request 8498 [https://github.com/apache/spark/pull/8498] > ML 1.5 QA: Programming guide update and migration guide > --- > > Key: SPARK-9671 > URL: https://issues.apache.org/jira/browse/SPARK-9671 > Project: Spark > Issue Type: Sub-task > Components: MLlib >Reporter: Joseph K. Bradley >Assignee: Xiangrui Meng >Priority: Critical > Fix For: 1.5.1 > > > Before the release, we need to update the MLlib Programming Guide. Updates > will include: > * Add migration guide subsection. > ** Use the results of the QA audit JIRAs. > * Check phrasing, especially in main sections (for outdated items such as "In > this release, ...") > * Possibly reorganize parts of the Pipelines guide if needed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10336) fitIntercept is a command line option but not set in the LR example program.
[ https://issues.apache.org/jira/browse/SPARK-10336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-10336: -- Affects Version/s: 1.5.0 1.4.1 Component/s: ML Documentation > fitIntercept is a command line option but not set in the LR example program. > > > Key: SPARK-10336 > URL: https://issues.apache.org/jira/browse/SPARK-10336 > Project: Spark > Issue Type: Bug > Components: Documentation, ML >Affects Versions: 1.4.1, 1.5.0 >Reporter: Shuo Xiang >Assignee: Shuo Xiang > Fix For: 1.5.1 > > > the parsed parameter is not set. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10341) SMJ fail with unable to acquire memory
[ https://issues.apache.org/jira/browse/SPARK-10341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14720552#comment-14720552 ] Apache Spark commented on SPARK-10341: -- User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/8511 > SMJ fail with unable to acquire memory > -- > > Key: SPARK-10341 > URL: https://issues.apache.org/jira/browse/SPARK-10341 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Davies Liu >Assignee: Davies Liu >Priority: Critical > > In SMJ, the first ExternalSorter could consume all the memory before > spilling; then the second cannot even acquire its first page. > {code} > java.io.IOException: Unable to acquire 16777216 bytes of memory > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:368) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.<init>(UnsafeExternalSorter.java:138) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.create(UnsafeExternalSorter.java:106) > at > org.apache.spark.sql.execution.UnsafeExternalRowSorter.<init>(UnsafeExternalRowSorter.java:68) > at > org.apache.spark.sql.execution.TungstenSort.org$apache$spark$sql$execution$TungstenSort$$preparePartition$1(sort.scala:146) > at > org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.scala:169) > at > org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.scala:169) > at > org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:45) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) > at org.apache.spark.scheduler.Task.run(Task.scala:88) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10341) SMJ fail with unable to acquire memory
[ https://issues.apache.org/jira/browse/SPARK-10341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10341: Assignee: Davies Liu (was: Apache Spark) > SMJ fail with unable to acquire memory > -- > > Key: SPARK-10341 > URL: https://issues.apache.org/jira/browse/SPARK-10341 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Davies Liu >Assignee: Davies Liu >Priority: Critical > > In SMJ, the first ExternalSorter could consume all the memory before > spilling; then the second cannot even acquire its first page. > {code} > java.io.IOException: Unable to acquire 16777216 bytes of memory > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:368) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.<init>(UnsafeExternalSorter.java:138) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.create(UnsafeExternalSorter.java:106) > at > org.apache.spark.sql.execution.UnsafeExternalRowSorter.<init>(UnsafeExternalRowSorter.java:68) > at > org.apache.spark.sql.execution.TungstenSort.org$apache$spark$sql$execution$TungstenSort$$preparePartition$1(sort.scala:146) > at > org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.scala:169) > at > org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.scala:169) > at > org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:45) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) > at org.apache.spark.scheduler.Task.run(Task.scala:88) >
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10341) SMJ fail with unable to acquire memory
Davies Liu created SPARK-10341: -- Summary: SMJ fail with unable to acquire memory Key: SPARK-10341 URL: https://issues.apache.org/jira/browse/SPARK-10341 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Reporter: Davies Liu Assignee: Davies Liu Priority: Critical In SMJ, the first ExternalSorter could consume all the memory before spilling; then the second cannot even acquire its first page. {code} java.io.IOException: Unable to acquire 16777216 bytes of memory at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:368) at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.<init>(UnsafeExternalSorter.java:138) at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.create(UnsafeExternalSorter.java:106) at org.apache.spark.sql.execution.UnsafeExternalRowSorter.<init>(UnsafeExternalRowSorter.java:68) at org.apache.spark.sql.execution.TungstenSort.org$apache$spark$sql$execution$TungstenSort$$preparePartition$1(sort.scala:146) at org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.scala:169) at org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.scala:169) at org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:45) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) at org.apache.spark.scheduler.Task.run(Task.scala:88) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
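The failure mode in this issue can be shown with a toy fixed-size page pool (illustrative only, not Spark's TaskMemoryManager): without any cooperation or reservation, the first sorter greedily takes every page, so the second sorter fails on its very first allocation.

```python
PAGE = 16 * 1024 * 1024  # 16 MB, the allocation size in the stack trace

class PagePool:
    """A fixed pool of equally sized memory pages."""
    def __init__(self, pages):
        self.free_pages = pages

    def acquire_page(self):
        if self.free_pages == 0:
            raise IOError("Unable to acquire %d bytes of memory" % PAGE)
        self.free_pages -= 1
        return PAGE

pool = PagePool(pages=4)

# In a sort-merge join, two external sorters share the same pool.
first_sorter = [pool.acquire_page() for _ in range(4)]  # takes all pages

try:
    pool.acquire_page()  # second sorter: fails on its first page
    second_failed = False
except IOError:
    second_failed = True
```

The fix directions discussed for this bug (reserving a minimum per consumer, or forcing the first sorter to spill on request) both amount to never letting one consumer drain the pool to zero.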
[jira] [Commented] (SPARK-10338) network/yarn is missing from sbt build
[ https://issues.apache.org/jira/browse/SPARK-10338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14720527#comment-14720527 ] Sean Owen commented on SPARK-10338: --- I can't reproduce any problem with the sbt build in master, using the same command. I'm not clear what your error is, or why it's related to network/yarn or slf4j. I'd have to close as CannotReproduce unless you can elaborate. > network/yarn is missing from sbt build > -- > > Key: SPARK-10338 > URL: https://issues.apache.org/jira/browse/SPARK-10338 > Project: Spark > Issue Type: Bug > Components: Build, YARN >Affects Versions: 1.5.0 >Reporter: Nilanjan Raychaudhuri > > The following command fails with a missing LoggingFactory class compilation > error. I am running this on the current master. > build/sbt -Pyarn -Phadoop-2.3 assembly > One fix is to add the following to the respective pom.xml file: > > <dependency> > <groupId>org.slf4j</groupId> > <artifactId>slf4j-api</artifactId> > <scope>provided</scope> > </dependency> > I am not sure whether this compilation error is specific to sbt or whether > mvn also has this problem. > And for sbt to recognize the network/yarn project we might also have to add > the module to the parent pom.xml file. > wdyt? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10298) PySpark can't JSON serialize a DataFrame with DecimalType columns.
[ https://issues.apache.org/jira/browse/SPARK-10298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14720515#comment-14720515 ] Dan McClary commented on SPARK-10298: - I can't seem to replicate this on master. The following scenarios work: from decimal import * import pyspark.sql.types as types from pyspark.sql import * v = [Decimal(10.0-i)/Decimal(1.0+1) for i in range(10)] d = sc.parallelize(v).map(lambda x: Row(v=x)).toDF() d.write.json("foo") v2 = [[Decimal(10.0-i)/Decimal(1.0+1)] for i in range(10)] d2 = sc.parallelize(v2) df = sqlContext.createDataFrame(d2, types.StructType([types.StructField("a", types.DecimalType())])) df.write.json("foo") > PySpark can't JSON serialize a DataFrame with DecimalType columns. > -- > > Key: SPARK-10298 > URL: https://issues.apache.org/jira/browse/SPARK-10298 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Kevin Cox > > {code} > In [8]: sc.sql.createDataFrame([[Decimal(123)]], > types.StructType([types.StructField("a", types.DecimalType())])) > Out[8]: DataFrame[a: decimal(10,0)] > In [9]: _.write.json("foo") > 15/08/26 14:26:21 ERROR DefaultWriterContainer: Aborting task.
> scala.MatchError: (DecimalType(10,0),123) (of class scala.Tuple2) > at > org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:89) > at > org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:89) > at > org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:126) > at > org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:89) > at > org.apache.spark.sql.execution.datasources.json.JacksonGenerator$.apply(JacksonGenerator.scala:133) > at > org.apache.spark.sql.execution.datasources.json.JsonOutputWriter.writeInternal(JSONRelation.scala:191) > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:224) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) > at org.apache.spark.scheduler.Task.run(Task.scala:88) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > 15/08/26 14:26:21 ERROR DefaultWriterContainer: Task attempt > 
attempt_201508261426__m_00_0 aborted. > 15/08/26 14:26:21 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) > org.apache.spark.SparkException: Task failed while writing rows. > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:232) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) > at org.apache.spark.scheduler.Task.run(Task.scala:88) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: scala.MatchError: (DecimalType(10,0),123) (of class scala.Tuple2) > at > org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:89) > at > org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$
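The scala.MatchError above comes from a JSON value writer that dispatches on (dataType, value) pairs and has no case for DecimalType, so the failure surfaces only when a Decimal value is actually written, not when the schema is created. A simplified Python sketch of that dispatch pattern (hypothetical names; this is not Spark's JacksonGenerator):

```python
from decimal import Decimal
import json

# Hypothetical value-writer dispatch table, mirroring the shape of
# JacksonGenerator's valWriter. Decimal is missing from the table, so
# serialization fails only when a Decimal value is actually emitted.
WRITERS = {
    int: lambda v: v,
    str: lambda v: v,
    float: lambda v: v,
}

def write_value(value):
    writer = WRITERS.get(type(value))
    if writer is None:
        # Analogous to scala.MatchError: no case matched the (type, value) pair.
        raise TypeError(f"no writer for {type(value).__name__}: {value!r}")
    return json.dumps(writer(value))

# Schema creation succeeds; the failure shows up at write time, as reported.
try:
    write_value(Decimal("123"))
except TypeError as e:
    print("write failed:", e)

# Adding the missing case (the shape of the eventual fix): convert the
# Decimal to a JSON-representable value before emitting it.
WRITERS[Decimal] = lambda v: float(v)
print(write_value(Decimal("123")))
```

The point of the sketch is only that a type-dispatch writer fails lazily, which is why the bug appears as a task failure during `write.json` rather than during `createDataFrame`.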
[jira] [Created] (SPARK-10340) Use S3 bulk listing for S3-backed Hive tables
Cheolsoo Park created SPARK-10340: - Summary: Use S3 bulk listing for S3-backed Hive tables Key: SPARK-10340 URL: https://issues.apache.org/jira/browse/SPARK-10340 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.4.1, 1.5.0 Reporter: Cheolsoo Park AWS S3 provides a bulk listing API. It takes the common prefix of all input paths as a parameter and returns all the objects whose keys start with that common prefix, in blocks of 1000. Since SPARK-9926 allows us to list multiple partitions all together, we can significantly speed up input split calculation using S3 bulk listing. This optimization is particularly useful for queries like {{select * from partitioned_table limit 10}}. This is a common optimization for S3. For example, here is a [blog post|http://www.qubole.com/blog/product/optimizing-hadoop-for-s3-part-1/] from Qubole on this topic. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
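The speedup comes from collapsing one list request per partition into a single paged listing under the longest common prefix of all input paths. A rough pure-Python sketch of the idea (helper names are illustrative; this is not the AWS SDK):

```python
import os

def common_listing_prefix(paths):
    """Longest common prefix of the input paths, truncated at the last '/'
    so it names a directory-like key space rather than a partial key."""
    prefix = os.path.commonprefix(paths)
    return prefix[: prefix.rfind("/") + 1]

def listing_calls(num_objects, page_size=1000):
    """List requests needed when results come back in pages of up to
    page_size objects (S3 returns at most 1000 keys per call)."""
    return max(1, -(-num_objects // page_size))  # ceiling division

partitions = [
    "s3://bucket/table/dt=2015-08-01/",
    "s3://bucket/table/dt=2015-08-02/",
    "s3://bucket/table/dt=2015-08-03/",
]
# One paged listing under the shared prefix replaces a listing per partition,
# which is where the input-split speedup for many-partition tables comes from.
print(common_listing_prefix(partitions))  # s3://bucket/table/
```

For a table with thousands of partitions this turns thousands of round trips into a handful of paged requests, which is exactly the case the {{limit 10}} example in the description exercises.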
[jira] [Updated] (SPARK-10339) When scanning a partitioned table having thousands of partitions, Driver has a very high memory pressure because of SQL metrics
[ https://issues.apache.org/jira/browse/SPARK-10339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-10339: - Description: I have a local dataset with 5000 partitions stored in {{/tmp/partitioned}}. When I run the following code, the free memory space in the driver's old gen gradually decreases until there is pretty much no free space left. Finally, all kinds of timeouts happen and the cluster dies.
{code}
val df = sqlContext.read.format("parquet").load("/tmp/partitioned")
df.filter("a > -100").selectExpr("hash(a, b)").queryExecution.toRdd.foreach(_ => Unit)
{code}
I did a quick test: after deleting the SQL metrics from the project and filter operators, my job works fine. The reason is that for a partitioned table, the actual scan plan looks like
{code}
        other operators
               |
          /----|----\
         /     |     \
  project   project ... project
     |         |          |
  filter    filter  ... filter
     |         |          |
   part1     part2  ...  part n
{code}
We create SQL metrics for every filter and project, which causes extremely high memory pressure on the driver.

was: I have a local dataset with 5000 partitions stored in {{/tmp/partitioned}}. When I run the following code, the free memory space in the driver's old gen gradually decreases until there is no free space left.
{code}
val df = sqlContext.read.format("parquet").load("/tmp/partitioned")
df.filter("a > -100").selectExpr("hash(a, b)").queryExecution.toRdd.foreach(_ => Unit)
{code}
I did a quick test: after deleting the SQL metrics from the project and filter operators, my job works fine. The reason is that for a partitioned table, the actual scan plan looks like
{code}
        other operators
               |
          /----|----\
         /     |     \
  project   project ... project
     |         |          |
  filter    filter  ... filter
     |         |          |
   part1     part2  ...  part n
{code}
We create SQL metrics for every filter and project, which causes extremely high memory pressure on the driver. 
> When scanning a partitioned table having thousands of partitions, Driver has
> a very high memory pressure because of SQL metrics
> ---
>
> Key: SPARK-10339
> URL: https://issues.apache.org/jira/browse/SPARK-10339
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.5.0
> Reporter: Yin Huai
> Priority: Blocker
>
> I have a local dataset with 5000 partitions stored in {{/tmp/partitioned}}.
> When I run the following code, the free memory space in the driver's old gen
> gradually decreases until there is pretty much no free space left. Finally,
> all kinds of timeouts happen and the cluster dies.
> {code}
> val df = sqlContext.read.format("parquet").load("/tmp/partitioned")
> df.filter("a > -100").selectExpr("hash(a, b)").queryExecution.toRdd.foreach(_ => Unit)
> {code}
> I did a quick test: after deleting the SQL metrics from the project and
> filter operators, my job works fine.
> The reason is that for a partitioned table, the actual scan plan looks like
> {code}
>         other operators
>                |
>           /----|----\
>          /     |     \
>   project   project ... project
>      |         |          |
>   filter    filter  ... filter
>      |         |          |
>    part1     part2  ...  part n
> {code}
> We create SQL metrics for every filter and project, which causes extremely
> high memory pressure on the driver.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10339) When scanning a partitioned table having thousands of partitions, Driver has a very high memory pressure because of SQL metrics
[ https://issues.apache.org/jira/browse/SPARK-10339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-10339: - Sprint: Spark 1.5 doc/QA sprint
> When scanning a partitioned table having thousands of partitions, Driver has
> a very high memory pressure because of SQL metrics
> ---
>
> Key: SPARK-10339
> URL: https://issues.apache.org/jira/browse/SPARK-10339
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.5.0
> Reporter: Yin Huai
> Priority: Blocker
>
> I have a local dataset with 5000 partitions stored in {{/tmp/partitioned}}.
> When I run the following code, the free memory space in the driver's old gen
> gradually decreases until there is no free space left.
> {code}
> val df = sqlContext.read.format("parquet").load("/tmp/partitioned")
> df.filter("a > -100").selectExpr("hash(a, b)").queryExecution.toRdd.foreach(_ => Unit)
> {code}
> I did a quick test: after deleting the SQL metrics from the project and
> filter operators, my job works fine.
> The reason is that for a partitioned table, the actual scan plan looks like
> {code}
>         other operators
>                |
>           /----|----\
>          /     |     \
>   project   project ... project
>      |         |          |
>   filter    filter  ... filter
>      |         |          |
>    part1     part2  ...  part n
> {code}
> We create SQL metrics for every filter and project, which causes extremely
> high memory pressure on the driver.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10336) fitIntercept is a command line option but not set in the LR example program.
[ https://issues.apache.org/jira/browse/SPARK-10336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai updated SPARK-10336: Assignee: Shuo Xiang > fitIntercept is a command line option but not set in the LR example program. > > > Key: SPARK-10336 > URL: https://issues.apache.org/jira/browse/SPARK-10336 > Project: Spark > Issue Type: Bug >Reporter: Shuo Xiang >Assignee: Shuo Xiang > Fix For: 1.5.0 > > > the parsed parameter is not set. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10339) When scanning a partitioned table having thousands of partitions, Driver has a very high memory pressure because of SQL metrics
Yin Huai created SPARK-10339: Summary: When scanning a partitioned table having thousands of partitions, Driver has a very high memory pressure because of SQL metrics Key: SPARK-10339 URL: https://issues.apache.org/jira/browse/SPARK-10339 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Reporter: Yin Huai Priority: Blocker
I have a local dataset with 5000 partitions stored in {{/tmp/partitioned}}. When I run the following code, the free memory space in the driver's old gen gradually decreases until there is no free space left.
{code}
val df = sqlContext.read.format("parquet").load("/tmp/partitioned")
df.filter("a > -100").selectExpr("hash(a, b)").queryExecution.toRdd.foreach(_ => Unit)
{code}
I did a quick test: after deleting the SQL metrics from the project and filter operators, my job works fine. The reason is that for a partitioned table, the actual scan plan looks like
{code}
        other operators
               |
          /----|----\
         /     |     \
  project   project ... project
     |         |          |
  filter    filter  ... filter
     |         |          |
   part1     part2  ...  part n
{code}
We create SQL metrics for every filter and project, which causes extremely high memory pressure on the driver. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
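A back-of-the-envelope sketch of why this plan shape hurts: the number of metrics objects held on the driver grows with partitions × operators, independent of the data size. Illustrative Python only; the per-operator counts below are assumptions, not Spark's actual SQLMetrics implementation:

```python
def metrics_objects(num_partitions, operators_per_partition=2,
                    metrics_per_operator=3):
    """Each scanned partition gets its own project and filter operator in the
    expanded plan, and each operator registers its own metrics on the driver.
    The defaults (2 operators, 3 metrics each) are illustrative assumptions."""
    return num_partitions * operators_per_partition * metrics_per_operator

# 5000 partitions, as in the report: the driver ends up tracking tens of
# thousands of metric accumulators for a single table scan.
print(metrics_objects(5000))  # 30000
```

This is why deleting the metrics from the project and filter operators made the job work: the driver-side object count drops from O(partitions) to O(1) per operator type.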
[jira] [Resolved] (SPARK-10336) fitIntercept is a command line option but not set in the LR example program.
[ https://issues.apache.org/jira/browse/SPARK-10336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai resolved SPARK-10336. - Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 8510 [https://github.com/apache/spark/pull/8510] > fitIntercept is a command line option but not set in the LR example program. > > > Key: SPARK-10336 > URL: https://issues.apache.org/jira/browse/SPARK-10336 > Project: Spark > Issue Type: Bug >Reporter: Shuo Xiang > Fix For: 1.5.0 > > > the parsed parameter is not set. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10337) Views are broken
[ https://issues.apache.org/jira/browse/SPARK-10337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-10337: - Description: I haven't dug into this yet... but it seems like this should work: This works: {code} SELECT * FROM 100milints {code} This seems to work: {code} CREATE VIEW testView AS SELECT * FROM 100milints {code} This fails: {code} SELECT * FROM testView org.apache.spark.sql.AnalysisException: cannot resolve '100milints.col' given input columns id; line 1 pos 7 at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:56) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:53) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51) at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:292) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:249) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) at 
scala.collection.AbstractIterator.to(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:279) at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:290) at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:108) at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:118) at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:122) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:122) at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:126) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) at scala.collection.AbstractIterator.to(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:126) at
[jira] [Created] (SPARK-10338) network/yarn is missing from sbt build
Nilanjan Raychaudhuri created SPARK-10338: - Summary: network/yarn is missing from sbt build Key: SPARK-10338 URL: https://issues.apache.org/jira/browse/SPARK-10338 Project: Spark Issue Type: Bug Components: Build, YARN Affects Versions: 1.5.0 Reporter: Nilanjan Raychaudhuri
The following command fails with a missing LoggingFactory class compilation error. I am running this against the current master:
{code}
build/sbt -Pyarn -Phadoop-2.3 assembly
{code}
One fix is to add the following to the respective pom.xml file:
{code}
<dependency>
  <groupId>org.slf4j</groupId>
  <artifactId>slf4j-api</artifactId>
  <scope>provided</scope>
</dependency>
{code}
I am not sure whether this compilation error is specific to sbt or whether mvn also has this problem. And for sbt to recognize the network/yarn project, we might also have to add the module to the parent pom.xml file. wdyt? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10337) Views are broken
[ https://issues.apache.org/jira/browse/SPARK-10337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14720450#comment-14720450 ] Ali Ghodsi commented on SPARK-10337: [~marmbrus] do you mind updating the first part of the description (the code snippet after {{This works:}})? It's blank on my screen. > Views are broken > > > Key: SPARK-10337 > URL: https://issues.apache.org/jira/browse/SPARK-10337 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Michael Armbrust >Priority: Critical > > I haven't dug into this yet... but it seems like this should work: > This works: > {code} > {code} > This seems to work: > {code} > CREATE VIEW testView AS SELECT * FROM 100milints > {code} > This fails: > {code} > SELECT * FROM testView > org.apache.spark.sql.AnalysisException: cannot resolve '100milints.col' given > input columns id; line 1 pos 7 > at > org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:56) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:53) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:292) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:249) > at 
scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) > at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) > at scala.collection.AbstractIterator.to(Iterator.scala:1157) > at > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) > at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) > at > scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) > at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:279) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:290) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:108) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:118) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:122) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at > 
org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:122) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:126) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) > at scala.collection.
[jira] [Assigned] (SPARK-10336) Setting intercept in the example code of logistic regression is not working
[ https://issues.apache.org/jira/browse/SPARK-10336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10336: Assignee: (was: Apache Spark) > Setting intercept in the example code of logistic regression is not working > --- > > Key: SPARK-10336 > URL: https://issues.apache.org/jira/browse/SPARK-10336 > Project: Spark > Issue Type: Bug >Reporter: Shuo Xiang > > the parsed parameter is not set. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10336) Setting intercept in the example code of logistic regression is not working
[ https://issues.apache.org/jira/browse/SPARK-10336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14720447#comment-14720447 ] Apache Spark commented on SPARK-10336: -- User 'coderxiang' has created a pull request for this issue: https://github.com/apache/spark/pull/8510 > Setting intercept in the example code of logistic regression is not working > --- > > Key: SPARK-10336 > URL: https://issues.apache.org/jira/browse/SPARK-10336 > Project: Spark > Issue Type: Bug >Reporter: Shuo Xiang > > the parsed parameter is not set. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10336) fitIntercept is a command line option but not set in the LR example program.
[ https://issues.apache.org/jira/browse/SPARK-10336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shuo Xiang updated SPARK-10336: --- Summary: fitIntercept is a command line option but not set in the LR example program. (was: Setting intercept in the example code of logistic regression is not working) > fitIntercept is a command line option but not set in the LR example program. > > > Key: SPARK-10336 > URL: https://issues.apache.org/jira/browse/SPARK-10336 > Project: Spark > Issue Type: Bug >Reporter: Shuo Xiang > > the parsed parameter is not set. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10336) Setting intercept in the example code of logistic regression is not working
[ https://issues.apache.org/jira/browse/SPARK-10336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10336: Assignee: Apache Spark > Setting intercept in the example code of logistic regression is not working > --- > > Key: SPARK-10336 > URL: https://issues.apache.org/jira/browse/SPARK-10336 > Project: Spark > Issue Type: Bug >Reporter: Shuo Xiang >Assignee: Apache Spark > > the parsed parameter is not set. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10337) Views are broken
Michael Armbrust created SPARK-10337: Summary: Views are broken Key: SPARK-10337 URL: https://issues.apache.org/jira/browse/SPARK-10337 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Reporter: Michael Armbrust Priority: Critical I haven't dug into this yet... but it seems like this should work: This works: {code} {code} This seems to work: {code} CREATE VIEW testView AS SELECT * FROM 100milints {code} This fails: {code} SELECT * FROM testView org.apache.spark.sql.AnalysisException: cannot resolve '100milints.col' given input columns id; line 1 pos 7 at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:56) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:53) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51) at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:292) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:249) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) at 
scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) at scala.collection.AbstractIterator.to(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:279) at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:290) at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:108) at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:118) at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:122) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:122) at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:126) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) at scala.collection.AbstractIterator.to(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) at scala.collection.Ab
[jira] [Created] (SPARK-10336) Setting intercept in the example code of logistic regression is not working
Shuo Xiang created SPARK-10336: -- Summary: Setting intercept in the example code of logistic regression is not working Key: SPARK-10336 URL: https://issues.apache.org/jira/browse/SPARK-10336 Project: Spark Issue Type: Bug Reporter: Shuo Xiang the parsed parameter is not set. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
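The underlying bug class here is an option that is parsed from the command line but never applied to the object it is meant to configure. The Spark example program is Scala; a minimal Python sketch of the same pattern (hypothetical names, not the actual example code):

```python
import argparse

class LogisticRegressionExample:
    """Stand-in for an estimator with a configurable intercept (hypothetical)."""
    def __init__(self):
        self.fit_intercept = True  # library default

def build_model(argv, apply_option=True):
    parser = argparse.ArgumentParser()
    parser.add_argument("--fitIntercept", type=lambda s: s.lower() == "true",
                        default=True)
    args = parser.parse_args(argv)
    lr = LogisticRegressionExample()
    if apply_option:
        # The one-line fix: forward the parsed flag to the estimator.
        lr.fit_intercept = args.fitIntercept
    # With apply_option=False the flag is parsed but silently dropped,
    # which is the reported bug: the option exists but has no effect.
    return lr

buggy = build_model(["--fitIntercept", "false"], apply_option=False)
fixed = build_model(["--fitIntercept", "false"], apply_option=True)
print(buggy.fit_intercept, fixed.fit_intercept)  # True False
```

The failure mode is silent: the program accepts `--fitIntercept false` without error, so only inspecting the trained model reveals that the setting was ignored.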
[jira] [Resolved] (SPARK-9284) Remove tests' dependency on the assembly
[ https://issues.apache.org/jira/browse/SPARK-9284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-9284. --- Resolution: Fixed Assignee: Marcelo Vanzin > Remove tests' dependency on the assembly > > > Key: SPARK-9284 > URL: https://issues.apache.org/jira/browse/SPARK-9284 > Project: Spark > Issue Type: Improvement > Components: Tests >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin >Priority: Minor > > Some tests - in particular tests that have to spawn child processes - > currently rely on the generated Spark assembly to run properly. > This is sub-optimal for a few reasons: > - Users have to use an unnatural "package everything first, then run tests" > approach > - Sometimes tests are run using old code because the user forgot to rebuild > the assembly > The latter is particularly annoying in {{YarnClusterSuite}}. If you modify > some code outside of the {{yarn/}} module, you have to rebuild the whole > assembly before that test picks it up. > We should make all tests run without the need to have an assembly around, > making sure that they always pick up the latest code compiled by the user. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10325) Public Row no longer overrides hashCode / violates hashCode + equals contract
[ https://issues.apache.org/jira/browse/SPARK-10325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-10325: - Assignee: Josh Rosen > Public Row no longer overrides hashCode / violates hashCode + equals contract > - > > Key: SPARK-10325 > URL: https://issues.apache.org/jira/browse/SPARK-10325 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Critical > Fix For: 1.5.0 > > > It appears that the public {{Row}}'s hashCode is no longer overridden as of > Spark 1.5.0: > {code} > val x = Row("Hello") > val y = Row("Hello") > println(x == y) > println(x.hashCode) > println(y.hashCode) > {code} > outputs > {code} > true > 1032103993 > 1346393532 > {code} > This violates the hashCode/equals contract. > I discovered this because it broke tests in the {{spark-avro}} library. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-10325) Public Row no longer overrides hashCode / violates hashCode + equals contract
[ https://issues.apache.org/jira/browse/SPARK-10325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-10325. -- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 8500 [https://github.com/apache/spark/pull/8500] > Public Row no longer overrides hashCode / violates hashCode + equals contract > - > > Key: SPARK-10325 > URL: https://issues.apache.org/jira/browse/SPARK-10325 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Josh Rosen >Priority: Critical > Fix For: 1.5.0 > > > It appears that the public {{Row}}'s hashCode is no longer overridden as of > Spark 1.5.0: > {code} > val x = Row("Hello") > val y = Row("Hello") > println(x == y) > println(x.hashCode) > println(y.hashCode) > {code} > outputs > {code} > true > 1032103993 > 1346393532 > {code} > This violates the hashCode/equals contract. > I discovered this because it broke tests in the {{spark-avro}} library. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
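For readers unfamiliar with why the missing {{hashCode}} override above is critical: any class that overrides {{equals}} without keeping {{hashCode}} consistent with it silently breaks hash-based collections. A minimal, Spark-free sketch of the failure mode ({{BadRow}} is an illustrative stand-in, not Spark's {{Row}}):

```scala
// Spark-free sketch of the contract violation reported for the public Row:
// equal objects that report different hash codes. BadRow is illustrative.
object BadRow { private var n = 0; def nextId(): Int = { n += 1; n } }

class BadRow(val value: String) {
  private val id = BadRow.nextId()
  override def equals(other: Any): Boolean = other match {
    case r: BadRow => r.value == value
    case _         => false
  }
  // WRONG: equal objects get different hash codes, violating the contract
  // that x == y implies x.hashCode == y.hashCode, so HashSet/HashMap lookups
  // and Parquet/Avro test comparisons built on them misbehave.
  override def hashCode: Int = id
}

val x = new BadRow("Hello")
val y = new BadRow("Hello")
println(x == y)     // true
println(x.hashCode) // 1
println(y.hashCode) // 2 -- different, mirroring the ticket's output
```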
[jira] [Commented] (SPARK-10335) GraphX Connected Components fail with large number of iterations
[ https://issues.apache.org/jira/browse/SPARK-10335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14720385#comment-14720385 ] JJ Zhang commented on SPARK-10335: -- Note that I've changed the return signature of the algorithm to lower the amount of data that needs to be checkpointed; you can always join back to the original graph to get additional info. I also changed the pregel signature to return the number of iterations. I think this may be a good idea for the general Pregel API as well, since this is a general issue for all iterative graph algorithms. For a simple test, I used the code below:
{code}
import org.apache.spark.graphx._
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

/**
 * @author jj.zhang
 */
object SampleGraph extends java.io.Serializable {

  /**
   * Create a graph with some long chains of connections to test ConnectedComponents.
   */
  def createGraph(sc: SparkContext): Graph[VertexId, String] = {
    val v: RDD[(VertexId, VertexId)] = sc.parallelize(0 until 50, 10).map(r => (r.toLong, r.toLong))
    val v2: RDD[(VertexId, VertexId)] = sc.parallelize(1001 until 1020, 10).map(r => (r.toLong, r.toLong))
    val v3: RDD[(VertexId, VertexId)] = sc.parallelize(2000 until 2200, 10).map(r => (r.toLong, r.toLong))
    val e: RDD[Edge[String]] = sc.parallelize(0 until 49, 10).map(r => Edge(r, r + 1, ""))
    val e2: RDD[Edge[String]] = sc.parallelize(1001 until 1019, 10).map(r => Edge(r, r + 1, ""))
    val e3: RDD[Edge[String]] = sc.parallelize(2000 until 2199, 10).map(r => Edge(r, r + 1, ""))
    val g: Graph[VertexId, String] = Graph(v ++ v2 ++ v3, e ++ e2 ++ e3)
    g.partitionBy(PartitionStrategy.EdgePartition2D)
  }
}

val g = SampleGraph.createGraph(sc)
val dir = "hdfs://${myhost}/user/me/run1"
val (g2, count) = RobustConnectedComponents.run(g, 20, dir)
{code}
You should see convergence after 199 iterations, and all intermediate files on your HDFS.
> GraphX Connected Components fail with large number of iterations > > > Key: SPARK-10335 > URL: https://issues.apache.org/jira/browse/SPARK-10335 > Project: Spark > Issue Type: Improvement > Components: GraphX >Affects Versions: 1.4.0 > Environment: Tested with Yarn client mode >Reporter: JJ Zhang > > For graphs with long chains of connected vertices, the algorithm fails in > practice to converge. > The driver always runs out of memory prior to final convergence. In my test, > for 1.2B vertices/1.2B edges with the longest path requiring more than 90 > iterations, it invariably fails around iteration 80 with driver memory set at > 50G. On top of that, each iteration takes longer than the previous one, and it > took an overnight run on a 50-node cluster to reach the failure point. > This is presumably due to keeping track of the RDD lineage and the DAG > computation, so truncating the RDD lineage is the most straightforward > solution. I tried using checkpoint, but apparently it is not working as > expected in version 1.4.0. I will file another ticket for that issue, but > hopefully it has been resolved by 1.5. > The proposed solution is below. It was tested and converged after 99 > iterations on the same graph mentioned above, in less than an hour. 
> {code: title=RobustConnectedComponents.scala|borderStyle=solid} > import org.apache.spark.Logging > import scala.reflect.ClassTag > import org.apache.spark.graphx._ > object RobustConnectedComponents extends Logging with java.io.Serializable { > def run[VD: ClassTag, ED: ClassTag](graph: Graph[VD, ED], interval: Int = > 50, dir : String): (Graph[VertexId, String], Int) = { > val ccGraph = graph.mapVertices { case (vid, _) => vid }.mapEdges { x > =>""} > def sendMessage(edge: EdgeTriplet[VertexId, String]): Iterator[(VertexId, > Long)] = { > if (edge.srcAttr < edge.dstAttr) { > Iterator((edge.dstId, edge.srcAttr)) > } else if (edge.srcAttr > edge.dstAttr) { > Iterator((edge.srcId, edge.dstAttr)) > } else { > Iterator.empty > } > } > val initialMessage = Long.MaxValue > > > var g: Graph[VertexId, String] = ccGraph > var i = interval > var count = 0 > while (i == interval) { > g = refreshGraph(g, dir, count) > g.cache() > val (g1, i1) = pregel(g, initialMessage, interval, activeDirection = > EdgeDirection.Either)( > vprog = (id, attr, msg) => math.min(attr, msg), > sendMsg = sendMessage, > mergeMsg = (a, b) => math.min(a, b)) > g.unpersist() > g = g1 > i = i1 > count = count + i > logInfo("Checkpoint reached. iteration so far: " + count) > } > logInfo("Final Converge: Total Iteration:" + count) > (g, count) > } // end of connectedComponents > > def refreshGraph(g
[jira] [Created] (SPARK-10335) GraphX Connected Components fail with large number of iterations
JJ Zhang created SPARK-10335: Summary: GraphX Connected Components fail with large number of iterations Key: SPARK-10335 URL: https://issues.apache.org/jira/browse/SPARK-10335 Project: Spark Issue Type: Improvement Components: GraphX Affects Versions: 1.4.0 Environment: Tested with Yarn client mode Reporter: JJ Zhang For graphs with long chains of connected vertices, the algorithm fails in practice to converge. The driver always runs out of memory prior to final convergence. In my test, for 1.2B vertices/1.2B edges with the longest path requiring more than 90 iterations, it invariably fails around iteration 80 with driver memory set at 50G. On top of that, each iteration takes longer than the previous one, and it took an overnight run on a 50-node cluster to reach the failure point. This is presumably due to keeping track of the RDD lineage and the DAG computation, so truncating the RDD lineage is the most straightforward solution. I tried using checkpoint, but apparently it is not working as expected in version 1.4.0. I will file another ticket for that issue, but hopefully it has been resolved by 1.5. The proposed solution is below. It was tested and converged after 99 iterations on the same graph mentioned above, in less than an hour. 
{code:title=RobustConnectedComponents.scala|borderStyle=solid}
import scala.reflect.ClassTag

import org.apache.spark.Logging
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

object RobustConnectedComponents extends Logging with java.io.Serializable {

  def run[VD: ClassTag, ED: ClassTag](graph: Graph[VD, ED], interval: Int = 50,
      dir: String): (Graph[VertexId, String], Int) = {
    val ccGraph = graph.mapVertices { case (vid, _) => vid }.mapEdges { x => "" }

    def sendMessage(edge: EdgeTriplet[VertexId, String]): Iterator[(VertexId, Long)] = {
      if (edge.srcAttr < edge.dstAttr) {
        Iterator((edge.dstId, edge.srcAttr))
      } else if (edge.srcAttr > edge.dstAttr) {
        Iterator((edge.srcId, edge.dstAttr))
      } else {
        Iterator.empty
      }
    }

    val initialMessage = Long.MaxValue

    var g: Graph[VertexId, String] = ccGraph
    var i = interval
    var count = 0
    // Run Pregel in chunks of `interval` iterations, truncating the lineage
    // before each chunk. A chunk that finishes early (i < interval) means the
    // algorithm converged inside that chunk.
    while (i == interval) {
      g = refreshGraph(g, dir, count)
      g.cache()
      val (g1, i1) = pregel(g, initialMessage, interval, activeDirection = EdgeDirection.Either)(
        vprog = (id, attr, msg) => math.min(attr, msg),
        sendMsg = sendMessage,
        mergeMsg = (a, b) => math.min(a, b))
      g.unpersist()
      g = g1
      i = i1
      count = count + i
      logInfo("Checkpoint reached. Iterations so far: " + count)
    }
    logInfo("Final convergence. Total iterations: " + count)
    (g, count)
  } // end of connectedComponents

  // Truncate the RDD lineage by saving the graph to HDFS and reading it back.
  def refreshGraph(g: Graph[VertexId, String], dir: String, count: Int): Graph[VertexId, String] = {
    val vertFile = dir + "/iter-" + count + "/vertices"
    val edgeFile = dir + "/iter-" + count + "/edges"
    g.vertices.saveAsObjectFile(vertFile)
    g.edges.saveAsObjectFile(edgeFile)
    // load back
    val v: RDD[(VertexId, VertexId)] = g.vertices.sparkContext.objectFile(vertFile)
    val e: RDD[Edge[String]] = g.vertices.sparkContext.objectFile(edgeFile)
    val newGraph = Graph(v, e)
    newGraph
  }

  // Same as GraphX's Pregel, but bounded by maxIterations and also returning
  // the number of iterations actually run.
  def pregel[VD: ClassTag, ED: ClassTag, A: ClassTag](graph: Graph[VD, ED],
      initialMsg: A,
      maxIterations: Int = Int.MaxValue,
      activeDirection: EdgeDirection = EdgeDirection.Either)(
      vprog: (VertexId, VD, A) => VD,
      sendMsg: EdgeTriplet[VD, ED] => Iterator[(VertexId, A)],
      mergeMsg: (A, A) => A): (Graph[VD, ED], Int) = {
    var g = graph.mapVertices((vid, vdata) => vprog(vid, vdata, initialMsg)).cache()
    // compute the messages
    var messages = g.mapReduceTriplets(sendMsg, mergeMsg)
    var activeMessages = messages.count()
    // Loop
    var prevG: Graph[VD, ED] = null
    var i = 0
    while (activeMessages > 0 && i < maxIterations) {
      // Receive the messages and update the vertices.
      prevG = g
      g = g.joinVertices(messages)(vprog).cache()
      val oldMessages = messages
      // Send new messages, skipping edges where neither side received a message. We must cache
      // messages so it can be materialized on the next line, allowing us to uncache the previous
      // iteration.
      messages = g.mapReduceTriplets(
        sendMsg, mergeMsg, Some((oldMessages, activeDirection))).cache()
      // The call to count() materializes `messages` and the vertices of `g`. This hides
      // oldMessages (depended on by the vertices of g) and the vertices of prevG (depended on
      // by oldMessages and the vertices of g).
      activeMessages = messages.count()
      // Unpersist the RDDs hidden by newly-materialized RDDs
      oldMessages.unpersist(blocking = false)
      prevG.unpersistVertices(blocking = false)
      prevG.edges.unpersist(blocking = false)
      i += 1
    }
    (g, i)
  }
}
{code}
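The driver loop's termination test above ({{while (i == interval)}}) is easy to misread: Pregel is re-run in fixed-size chunks, and the loop stops as soon as a chunk completes in fewer than {{interval}} iterations, meaning the algorithm converged inside that chunk. A Spark-free sketch of just that control flow, with a stubbed {{runChunk}} standing in for the bounded Pregel call (the stub and its names are illustrative, not from the ticket):

```scala
// Spark-free sketch of the chunked-convergence driver loop used above.
// runChunk stands in for one bounded Pregel run of at most maxIter
// iterations; it returns (remainingWork, iterationsActuallyRun).
def runChunk(work: Int, maxIter: Int): (Int, Int) = {
  val iters = math.min(work, maxIter) // pretend convergence needs `work` iterations total
  (work - iters, iters)
}

def runToConvergence(totalWork: Int, interval: Int): Int = {
  var work = totalWork
  var i = interval // seed with a "full" chunk so the loop is entered
  var count = 0
  while (i == interval) { // a full chunk means we may not be done yet
    val (w, i1) = runChunk(work, interval)
    work = w
    i = i1
    count += i
  }
  count
}

// 199 total iterations in chunks of 20: nine full chunks plus a final
// partial chunk of 19, which signals convergence.
println(runToConvergence(199, 20)) // 199
```

Note that a total that is an exact multiple of the interval costs one extra zero-iteration chunk, matching the real algorithm, where a run with no active messages finishes immediately.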
[jira] [Commented] (SPARK-10298) PySpark can't JSON serialize a DataFrame with DecimalType columns.
[ https://issues.apache.org/jira/browse/SPARK-10298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14720333#comment-14720333 ] Dan McClary commented on SPARK-10298: - I'll take a look at this. > PySpark can't JSON serialize a DataFrame with DecimalType columns. > -- > > Key: SPARK-10298 > URL: https://issues.apache.org/jira/browse/SPARK-10298 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Kevin Cox > > {code} > In [8]: sc.sql.createDataFrame([[Decimal(123)]], > types.StructType([types.StructField("a", types.DecimalType())])) > Out[8]: DataFrame[a: decimal(10,0)] > In [9]: _.write.json("foo") > 15/08/26 14:26:21 ERROR DefaultWriterContainer: Aborting task. > scala.MatchError: (DecimalType(10,0),123) (of class scala.Tuple2) > at > org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:89) > at > org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:89) > at > org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:126) > at > org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:89) > at > org.apache.spark.sql.execution.datasources.json.JacksonGenerator$.apply(JacksonGenerator.scala:133) > at > org.apache.spark.sql.execution.datasources.json.JsonOutputWriter.writeInternal(JSONRelation.scala:191) > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:224) > at > 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) > at org.apache.spark.scheduler.Task.run(Task.scala:88) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > 15/08/26 14:26:21 ERROR DefaultWriterContainer: Task attempt > attempt_201508261426__m_00_0 aborted. > 15/08/26 14:26:21 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) > org.apache.spark.SparkException: Task failed while writing rows. > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:232) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) > at org.apache.spark.scheduler.Task.run(Task.scala:88) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: scala.MatchError: (DecimalType(10,0),123) (of class scala.Tuple2) > at > 
org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:89) > at > org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:89) > at > org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:126) > at > org.apache.spark.sql.execution.datasources.json.JacksonGenerator$$anonfun$org$apache$spark$sql$execution$datasources$json$JacksonGenerator$$valWriter$2$1.apply(JacksonGenerator.scala:89) > at > org.apache.spa
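The {{scala.MatchError: (DecimalType(10,0),123)}} above comes from a non-exhaustive pattern match in {{JacksonGenerator}}'s value writer: the (dataType, value) pair for decimal columns has no matching case, so the match falls through at runtime. A stripped-down, Spark-free illustration (the type and function names are stand-ins, not Spark's actual classes):

```scala
// Spark-free illustration of the failure: a writer that pattern-matches on
// (dataType, value) pairs but has no case for decimals. DataType, IntType,
// and DecimalTypeLike are stand-in names, not Spark classes.
sealed trait DataType
case object IntType extends DataType
case class DecimalTypeLike(precision: Int, scale: Int) extends DataType

def writeValue(pair: (DataType, Any)): String = pair match {
  case (IntType, v: Int) => v.toString
  // No case for DecimalTypeLike: matching it throws scala.MatchError,
  // just as JacksonGenerator did for (DecimalType(10,0), 123).
}

println(writeValue((IntType, 42))) // "42"
try {
  writeValue((DecimalTypeLike(10, 0), BigDecimal(123)))
} catch {
  case e: MatchError => println("MatchError: " + e.getMessage)
}
```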
[jira] [Created] (SPARK-10334) Partitioned table scan's query plan does not show Filter and Project on top of the table scan
Yin Huai created SPARK-10334: Summary: Partitioned table scan's query plan does not show Filter and Project on top of the table scan Key: SPARK-10334 URL: https://issues.apache.org/jira/browse/SPARK-10334 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Reporter: Yin Huai
{code}
Seq(Tuple2(1, 1), Tuple2(2, 2)).toDF("i", "j")
  .write.format("parquet").partitionBy("i").save("/tmp/testFilter_partitioned")
val df1 = sqlContext.read.format("parquet").load("/tmp/testFilter_partitioned")
df1.selectExpr("hash(i)", "hash(j)").show
df1.filter("hash(j) = 1").explain

== Physical Plan ==
Scan ParquetRelation[file:/tmp/testFilter_partitioned][j#20,i#21]
{code}
It looks like we correctly apply the Project and Filter, then create an RDD for the result and manually wrap it in a PhysicalRDD, so the Project and Filter on top of the original table scan disappear from the physical plan. See https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L138-L175 We will not generate wrong results, but the query plan is confusing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10334) Partitioned table scan's query plan does not show Filter and Project on top of the table scan
[ https://issues.apache.org/jira/browse/SPARK-10334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-10334: - Target Version/s: 1.6.0, 1.5.1 Priority: Critical (was: Major) > Partitioned table scan's query plan does not show Filter and Project on top > of the table scan > - > > Key: SPARK-10334 > URL: https://issues.apache.org/jira/browse/SPARK-10334 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Yin Huai >Priority: Critical > > {code} > Seq(Tuple2(1, 1), Tuple2(2, 2)).toDF("i", > "j").write.format("parquet").partitionBy("i").save("/tmp/testFilter_partitioned") > val df1 = > sqlContext.read.format("parquet").load("/tmp/testFilter_partitioned") > df1.selectExpr("hash(i)", "hash(j)").show > df1.filter("hash(j) = 1").explain > == Physical Plan == > Scan ParquetRelation[file:/tmp/testFilter_partitioned][j#20,i#21] > {code} > Looks like the reason is that we correctly apply the project and filter. > Then, we create an RDD for the result and then manually create a PhysicalRDD. > So, the Project and Filter on top of the original table scan disappears from > the physical plan. > See > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L138-L175 > We will not generate wrong result. But, the query plan is confusing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10199) Avoid using reflections for parquet model save
[ https://issues.apache.org/jira/browse/SPARK-10199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14720277#comment-14720277 ] Feynman Liang commented on SPARK-10199: --- [~vinodkc] would it be possible to get some microbenchmarks? You can surround the call to [createDataFrame|https://github.com/apache/spark/pull/8507/files#diff-13d1de98ab7ae677f9b345eb90a8b8e8R237] with some timing code before and after the change. > Avoid using reflections for parquet model save > -- > > Key: SPARK-10199 > URL: https://issues.apache.org/jira/browse/SPARK-10199 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Reporter: Feynman Liang >Priority: Minor > > These items are not high priority since the overhead of writing to Parquet is > much greater than that of runtime reflection. > Multiple model save/load paths in MLlib use case classes to infer a schema for the > data frame saved to Parquet. However, inferring a schema from case classes or > tuples uses [runtime > reflection|https://github.com/apache/spark/blob/d7b4c095271c36fcc7f9ded267ecf5ec66fac803/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala#L361] > which is unnecessary since the types are already known at the time `save` is > called. > It would be better to just specify the schema for the data frame directly > using {{sqlContext.createDataFrame(dataRDD, schema)}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
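The "surround the call with some timing code" suggestion above can be as simple as a nanosecond-clock wrapper. A minimal helper one could wrap around the {{createDataFrame}} call under test (the {{time}} function is illustrative, not Spark code; a real microbenchmark would also warm up the JVM and average several runs):

```scala
// Minimal timing helper for the kind of microbenchmark suggested above.
// Wrap `time` around the call under test; the label is arbitrary.
def time[T](label: String)(body: => T): T = {
  val start = System.nanoTime()
  val result = body
  val elapsedMs = (System.nanoTime() - start) / 1e6
  println(f"$label: $elapsedMs%.2f ms")
  result
}

// Example usage with a cheap stand-in workload:
val sum = time("sum 1..1000000") { (1 to 1000000).sum }
```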
[jira] [Commented] (SPARK-10332) spark-submit to yarn doesn't fail if executors is 0
[ https://issues.apache.org/jira/browse/SPARK-10332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14720276#comment-14720276 ] Marcelo Vanzin commented on SPARK-10332: You can call {{SparkContext::requestExecutors}} even with dynamic allocation off, right? Still, I agree that it's a little weird to not at least print a nasty warning to the user in that case. > spark-submit to yarn doesn't fail if executors is 0 > --- > > Key: SPARK-10332 > URL: https://issues.apache.org/jira/browse/SPARK-10332 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.5.0 >Reporter: Thomas Graves > > Running spark-submit on YARN with num-executors equal to 0 when not > using dynamic allocation should error out. > In Spark 1.5.0 it continues and ends up hanging. > yarn.ClientArguments still has the check, so something else must have changed. > spark-submit --master yarn --deploy-mode cluster --class > org.apache.spark.examples.SparkPi --num-executors 0 > Spark 1.4.1 errors with: > java.lang.IllegalArgumentException: > Number of executors was 0, but must be at least 1 > (or 0 if dynamic executor allocation is enabled). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
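The 1.4.1 behavior the report wants restored is a plain argument check. A Spark-free sketch of that validation, reusing the error message quoted above (the function name is illustrative, not Spark's exact code):

```scala
// Spark-free sketch of the num-executors validation Spark 1.4.1 enforced.
// The function name is illustrative; the message mirrors the one quoted
// in the ticket.
def validateNumExecutors(numExecutors: Int, dynamicAllocationEnabled: Boolean): Unit = {
  require(numExecutors > 0 || (numExecutors == 0 && dynamicAllocationEnabled),
    s"Number of executors was $numExecutors, but must be at least 1 " +
      "(or 0 if dynamic executor allocation is enabled).")
}

validateNumExecutors(2, dynamicAllocationEnabled = false) // ok
validateNumExecutors(0, dynamicAllocationEnabled = true)  // ok
try {
  validateNumExecutors(0, dynamicAllocationEnabled = false) // should fail fast
} catch {
  case e: IllegalArgumentException => println(e.getMessage)
}
```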
[jira] [Created] (SPARK-10333) Add user guide for linear-methods.md columns
Feynman Liang created SPARK-10333: - Summary: Add user guide for linear-methods.md columns Key: SPARK-10333 URL: https://issues.apache.org/jira/browse/SPARK-10333 Project: Spark Issue Type: Documentation Components: ML Reporter: Feynman Liang Priority: Minor Add example code documenting input and output columns, based on feedback from https://github.com/apache/spark/pull/8491 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-10055) San Francisco Crime Classification
[ https://issues.apache.org/jira/browse/SPARK-10055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14720245#comment-14720245 ] Xiangrui Meng edited comment on SPARK-10055 at 8/28/15 5:05 PM: Thanks for posting the feedback! ~~For spark-csv, in the master branch there is an option to infer schema automatically.~~ Do you mind checking `RFormula`? It handles the string indexers and one-hot encoding automatically. was (Author: mengxr): Thanks for posting the feedback! For spark-csv, in the master branch there is an option to infer schema automatically. Do you mind checking `RFormula`? It handles the string indexers and one-hot encoding automatically. > San Francisco Crime Classification > -- > > Key: SPARK-10055 > URL: https://issues.apache.org/jira/browse/SPARK-10055 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Xiangrui Meng >Assignee: Kai Sasaki > > Apply ML pipeline API to San Francisco Crime Classification > (https://www.kaggle.com/c/sf-crime). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-10055) San Francisco Crime Classification
[ https://issues.apache.org/jira/browse/SPARK-10055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14720245#comment-14720245 ] Xiangrui Meng edited comment on SPARK-10055 at 8/28/15 5:05 PM: Thanks for posting the feedback! -For spark-csv, in the master branch there is an option to infer schema automatically.- Do you mind checking `RFormula`? It handles the string indexers and one-hot encoding automatically. was (Author: mengxr): Thanks for posting the feedback! ~~For spark-csv, in the master branch there is an option to infer schema automatically.~~ Do you mind checking `RFormula`? It handles the string indexers and one-hot encoding automatically. > San Francisco Crime Classification > -- > > Key: SPARK-10055 > URL: https://issues.apache.org/jira/browse/SPARK-10055 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Xiangrui Meng >Assignee: Kai Sasaki > > Apply ML pipeline API to San Francisco Crime Classification > (https://www.kaggle.com/c/sf-crime). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org