[jira] [Commented] (SPARK-12394) Support writing out pre-hash-partitioned data and exploit that in join optimizations to avoid shuffle (i.e. bucketing in Hive)
[ https://issues.apache.org/jira/browse/SPARK-12394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15438461#comment-15438461 ]

Alexander Tronchin-James commented on SPARK-12394:
--------------------------------------------------

Awesome news, Tejas! The filter feature is secondary AFAIK, and I'd prioritize the sorted-merge bucketed-map (SMB) join if I had the choice. Strong preference for supporting all of inner and left/right/full outer joins between tables whose bucket counts differ by an integer multiple, for selecting the number of executors (a rational multiple or fraction of the number of buckets), and for selecting the number of emitted buckets. Bonus points for an implementation that automatically applies SMB joins and avoids re-sorts where possible. Maybe a tall order, but we know it can be done. ;-) If we don't get it all in the first pull request we can always iterate. Thanks for pushing on this!

> Support writing out pre-hash-partitioned data and exploit that in join
> optimizations to avoid shuffle (i.e. bucketing in Hive)
> ----------------------------------------------------------------------
>
>                 Key: SPARK-12394
>                 URL: https://issues.apache.org/jira/browse/SPARK-12394
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>            Reporter: Reynold Xin
>            Assignee: Nong Li
>             Fix For: 2.0.0
>
>         Attachments: BucketedTables.pdf
>
> In many cases users know ahead of time the columns that they will be joining
> or aggregating on. Ideally they should be able to leverage this information
> and pre-shuffle the data so that subsequent queries do not require a shuffle.
> Hive supports this functionality by allowing the user to define buckets,
> which are hash partitioning of the data based on some key.
>
> - Allow the user to specify a set of columns when caching or writing out data
> - Allow the user to specify some parallelism
> - Shuffle the data when writing / caching such that it is distributed by these
> columns
> - When planning/executing a query, use this distribution to avoid another
> shuffle when reading, assuming the join or aggregation is compatible with the
> columns specified
> - Should work with existing save modes: append, overwrite, etc.
> - Should work at least with all Hadoop FS data sources
> - Should work with any data source when caching

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
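The mechanics the requirements above rely on can be sketched outside Spark: rows are assigned to buckets by hashing the join key modulo the bucket count, so two tables bucketed the same way on the same key can be joined bucket-by-bucket with no shuffle. A minimal Python illustration (the hash function and bucket count here are stand-ins; Spark's bucketing actually uses Murmur3 over the bucket columns):

```python
def bucket_of(key, num_buckets):
    # Stand-in hash; Spark actually uses Murmur3 over the bucket columns.
    return hash(key) % num_buckets

def bucketize(rows, key_fn, num_buckets):
    buckets = [[] for _ in range(num_buckets)]
    for row in rows:
        buckets[bucket_of(key_fn(row), num_buckets)].append(row)
    return buckets

# Two "tables" bucketed identically on the user id column.
users = [(u, "user%d" % u) for u in range(10)]
orders = [(o % 5, "order%d" % o) for o in range(20)]
user_buckets = bucketize(users, lambda r: r[0], 4)
order_buckets = bucketize(orders, lambda r: r[0], 4)

# Because both sides were bucketed with the same function and bucket count,
# equal keys are guaranteed to land in the same bucket index, so a join only
# ever needs to pair bucket i with bucket i -- no shuffle.
joined = [
    (u_id, name, item)
    for i in range(4)
    for (u_id, name) in user_buckets[i]
    for (o_id, item) in order_buckets[i]
    if u_id == o_id
]
```

The integer-multiple bucket-count case from the comment above follows the same idea: bucket i of the smaller table pairs with buckets i, i+n, i+2n, ... of the larger one.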
[jira] [Comment Edited] (SPARK-14927) DataFrame.saveAsTable creates RDD partitions but not Hive partitions
[ https://issues.apache.org/jira/browse/SPARK-14927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15438457#comment-15438457 ]

Xin Wu edited comment on SPARK-14927 at 8/26/16 4:46 AM:
---------------------------------------------------------

[~smilegator] By the way, do you think what you are working on will fix this issue? This is to allow Hive to see the partitions created by Spark SQL from a DataFrame.

was (Author: xwu0226): [~smilegator] Do you think what you are working on will fix this issue? This is to allow Hive to see the partitions created by Spark SQL from a DataFrame.

> DataFrame.saveAsTable creates RDD partitions but not Hive partitions
> ---------------------------------------------------------------------
>
>                 Key: SPARK-14927
>                 URL: https://issues.apache.org/jira/browse/SPARK-14927
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.5.2, 1.6.1
>         Environment: Mac OS X 10.11.4 local
>            Reporter: Sasha Ovsankin
>
> This is a followup to
> http://stackoverflow.com/questions/31341498/save-spark-dataframe-as-dynamic-partitioned-table-in-hive .
> I tried to use the suggestions in the answers but couldn't make it work in
> Spark 1.6.1.
> I am trying to create partitions programmatically from a `DataFrame`. Here is
> the relevant code (adapted from a Spark test):
>
> hc.setConf("hive.metastore.warehouse.dir", "tmp/tests")
> //hc.setConf("hive.exec.dynamic.partition", "true")
> //hc.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
> hc.sql("create database if not exists tmp")
> hc.sql("drop table if exists tmp.partitiontest1")
> Seq(2012 -> "a").toDF("year", "val")
>   .write
>   .partitionBy("year")
>   .mode(SaveMode.Append)
>   .saveAsTable("tmp.partitiontest1")
> hc.sql("show partitions tmp.partitiontest1").show
>
> Full file is here:
> https://gist.github.com/SashaOv/7c65f03a51c7e8f9c9e018cd42aa4c4a
> I get the error that the table is not partitioned:
>
> ==
> HIVE FAILURE OUTPUT
> ==
> SET hive.support.sql11.reserved.keywords=false
> SET hive.metastore.warehouse.dir=tmp/tests
> OK
> OK
> FAILED: Execution Error, return code 1 from
> org.apache.hadoop.hive.ql.exec.DDLTask. Table tmp.partitiontest1 is not a
> partitioned table
> ==
>
> It looks like the root cause is that
> `org.apache.spark.sql.hive.HiveMetastoreCatalog.newSparkSQLSpecificMetastoreTable`
> always creates a table with empty partitions.
> Any help to move this forward is appreciated.
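The mismatch in the report above is between the on-disk layout and the metastore: `partitionBy("year")` writes Hive-style `year=2012/...` directories, but Hive only "sees" a partition if it is also registered against a table the metastore knows is partitioned. A small Python sketch of the directory convention involved (the path is illustrative):

```python
def partition_spec_from_path(path):
    """Recover the partition column/value pairs a Hive-style path encodes.

    Illustrative helper, not Spark or Hive code: Hive encodes partition
    values in directory names like 'year=2012'.
    """
    spec = {}
    for segment in path.split("/"):
        if "=" in segment:
            col, _, val = segment.partition("=")
            spec[col] = val
    return spec

spec = partition_spec_from_path("tmp.db/partitiontest1/year=2012/part-00000")
# This spec is what a metastore-level ADD PARTITION registration would need;
# per the report, the Spark 1.6 writer produced the directories but left the
# metastore table with empty partition metadata.
```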
[jira] [Commented] (SPARK-14927) DataFrame.saveAsTable creates RDD partitions but not Hive partitions
[ https://issues.apache.org/jira/browse/SPARK-14927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15438457#comment-15438457 ]

Xin Wu commented on SPARK-14927:
--------------------------------

[~smilegator] Do you think what you are working on will fix this issue? This is to allow Hive to see the partitions created by Spark SQL from a DataFrame.

> DataFrame.saveAsTable creates RDD partitions but not Hive partitions
> ---------------------------------------------------------------------
>
>                 Key: SPARK-14927
>                 URL: https://issues.apache.org/jira/browse/SPARK-14927
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.5.2, 1.6.1
>         Environment: Mac OS X 10.11.4 local
>            Reporter: Sasha Ovsankin
[jira] [Resolved] (SPARK-17242) Update links of external dstream projects
[ https://issues.apache.org/jira/browse/SPARK-17242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin resolved SPARK-17242.
---------------------------------
       Resolution: Fixed
    Fix Version/s: 2.1.0
                   2.0.1

> Update links of external dstream projects
> ------------------------------------------
>
>                 Key: SPARK-17242
>                 URL: https://issues.apache.org/jira/browse/SPARK-17242
>             Project: Spark
>          Issue Type: Documentation
>          Components: Documentation
>            Reporter: Shixiong Zhu
>            Assignee: Shixiong Zhu
>             Fix For: 2.0.1, 2.1.0
[jira] [Commented] (SPARK-16998) select($"column1", explode($"column2")) is extremely slow
[ https://issues.apache.org/jira/browse/SPARK-16998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15438429#comment-15438429 ]

Takeshi Yamamuro commented on SPARK-16998:
------------------------------------------

If there is no problem, I'll pick up the PR.

> select($"column1", explode($"column2")) is extremely slow
> ----------------------------------------------------------
>
>                 Key: SPARK-16998
>                 URL: https://issues.apache.org/jira/browse/SPARK-16998
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>            Reporter: TobiasP
>
> Using a Dataset containing 10,000 rows, each containing null and an array of
> 5,000 Ints, I observe the following performance (in local mode):
> {noformat}
> scala> time(ds.select(explode($"value")).sample(false, 0.001, 1).collect)
> 1.219052 seconds
> res9: Array[org.apache.spark.sql.Row] = Array([3761], [3766], [3196])
>
> scala> time(ds.select($"dummy", explode($"value")).sample(false, 0.001, 1).collect)
> 20.219447 seconds
> res5: Array[org.apache.spark.sql.Row] = Array([null,3761], [null,3766], [null,3196])
> {noformat}
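What the slower query in the report computes can be sketched in plain Python: explode with an extra selected column must repeat that column next to every array element, so the dummy column is carried 5,000 times per input row. This only illustrates the semantics being benchmarked, not why Spark's generated code is ~16x slower:

```python
def explode_with(dummy, values):
    # select($"dummy", explode($"value")): one output row per array element,
    # with the scalar column copied into every one of those rows.
    return [(dummy, v) for v in values]

# Scaled-down version of the reported dataset: rows of (null, array of Ints).
rows = [(None, list(range(5000))) for _ in range(3)]
out = [pair for (dummy, arr) in rows for pair in explode_with(dummy, arr)]
```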
[jira] [Commented] (SPARK-12394) Support writing out pre-hash-partitioned data and exploit that in join optimizations to avoid shuffle (i.e. bucketing in Hive)
[ https://issues.apache.org/jira/browse/SPARK-12394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15438423#comment-15438423 ]

Tejas Patil commented on SPARK-12394:
-------------------------------------

[~alex.n.ja...@gmail.com] [~kotim...@gmail.com]: The doc lists two future items at the end. I am working on the last one: https://issues.apache.org/jira/browse/SPARK-15453. I am not sure whether the "Filter on sorted data" one is already being worked on, but I can work on that as well (I just created a JIRA for it: https://issues.apache.org/jira/browse/SPARK-17254).

> Support writing out pre-hash-partitioned data and exploit that in join
> optimizations to avoid shuffle (i.e. bucketing in Hive)
> ----------------------------------------------------------------------
>
>                 Key: SPARK-12394
>                 URL: https://issues.apache.org/jira/browse/SPARK-12394
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>            Reporter: Reynold Xin
>            Assignee: Nong Li
>             Fix For: 2.0.0
>
>         Attachments: BucketedTables.pdf
[jira] [Created] (SPARK-17254) Filter operator should have “stop if false” semantics for sorted data
Tejas Patil created SPARK-17254:
-----------------------------------

             Summary: Filter operator should have “stop if false” semantics for sorted data
                 Key: SPARK-17254
                 URL: https://issues.apache.org/jira/browse/SPARK-17254
             Project: Spark
          Issue Type: Improvement
          Components: SQL
            Reporter: Tejas Patil
            Priority: Minor

From https://issues.apache.org/jira/secure/attachment/12778890/BucketedTables.pdf:

Filter on sorted data
If the data is sorted by a key, filters on the key could stop as soon as the data is out of range. For example, WHERE ticker_id < “F” should stop as soon as the first row starting with “F” is seen. This can be done by adding a Filter operator that has “stop if false” semantics. This is generally useful.
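The proposed "stop if false" semantics is an early-exit scan: on input sorted by the filter key, once the predicate fails it can never succeed again, so the rest of the data need not be read. A Python sketch using `itertools.takewhile` as a stand-in for the proposed physical operator:

```python
from itertools import takewhile

tickers = ["AAPL", "AMZN", "CSCO", "FB", "GOOG", "MSFT"]  # sorted by ticker

scanned = []
def lt_f(ticker):
    scanned.append(ticker)   # record how many rows the scan actually touches
    return ticker < "F"

# WHERE ticker_id < "F" on sorted data: stop at the first failing row.
result = list(takewhile(lt_f, tickers))
# A plain filter would evaluate the predicate on all 6 rows; the stop-if-false
# scan stops at the first row >= "F" ("FB") and never reads "GOOG" or "MSFT".
```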
[jira] [Created] (SPARK-17253) Left join where ON clause does not reference the right table produces analysis error
Josh Rosen created SPARK-17253:
----------------------------------

             Summary: Left join where ON clause does not reference the right table produces analysis error
                 Key: SPARK-17253
                 URL: https://issues.apache.org/jira/browse/SPARK-17253
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.0.0
            Reporter: Josh Rosen
            Priority: Minor

The following query produces an AnalysisException:

{code}
CREATE TABLE currency (
  cur CHAR(3)
);
CREATE TABLE exchange (
  cur1 CHAR(3),
  cur2 CHAR(3),
  rate double
);
INSERT INTO currency VALUES ('EUR');
INSERT INTO currency VALUES ('GBP');
INSERT INTO currency VALUES ('USD');
INSERT INTO exchange VALUES ('EUR', 'GBP', 0.85);
INSERT INTO exchange VALUES ('GBP', 'EUR', 1.0/0.85);

SELECT c1.cur cur1, c2.cur cur2, COALESCE(self.rate, x.rate) rate
FROM currency c1
CROSS JOIN currency c2
LEFT JOIN exchange x ON x.cur1=c1.cur AND x.cur2=c2.cur
LEFT JOIN (SELECT 1 rate) self ON c1.cur=c2.cur;
{code}

{code}
AnalysisException: cannot resolve '`c1.cur`' given input columns: [cur, cur1, cur2, rate]; line 5 pos 13
{code}

However, this query is runnable in sqlite3 and postgres. This example query was adapted from https://www.sqlite.org/src/tktview?name=ebdbadade5, a sqlite bug report in which this query gave a wrong answer.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
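Since the ticket notes the query runs in sqlite3, the expected results can be cross-checked with Python's bundled sqlite3 module (this verifies the reference behavior, not Spark's):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE currency ( cur CHAR(3) );
CREATE TABLE exchange ( cur1 CHAR(3), cur2 CHAR(3), rate double );
INSERT INTO currency VALUES ('EUR'), ('GBP'), ('USD');
INSERT INTO exchange VALUES ('EUR', 'GBP', 0.85), ('GBP', 'EUR', 1.0/0.85);
""")
rows = conn.execute("""
SELECT c1.cur cur1, c2.cur cur2, COALESCE(self.rate, x.rate) rate
FROM currency c1
CROSS JOIN currency c2
LEFT JOIN exchange x ON x.cur1 = c1.cur AND x.cur2 = c2.cur
LEFT JOIN (SELECT 1 rate) self ON c1.cur = c2.cur
""").fetchall()
rates = {(a, b): r for (a, b, r) in rows}
# 3x3 currency pairs: same-currency pairs get rate 1 from `self`, the two
# known pairs come from `exchange`, and the rest stay NULL. No analysis
# error is raised, which is the behavior the ticket expects from Spark.
```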
[jira] [Created] (SPARK-17252) Performing arithmetic in VALUES can lead to ClassCastException / MatchErrors during query parsing
Josh Rosen created SPARK-17252:
----------------------------------

             Summary: Performing arithmetic in VALUES can lead to ClassCastException / MatchErrors during query parsing
                 Key: SPARK-17252
                 URL: https://issues.apache.org/jira/browse/SPARK-17252
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.0.0
            Reporter: Josh Rosen

The following example fails with a ClassCastException:

{code}
create table t(d double);
insert into t VALUES (1 * 1.0);
{code}

Here's the error:

{code}
java.lang.ClassCastException: org.apache.spark.sql.types.Decimal cannot be cast to java.lang.Integer at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:106) at scala.math.Numeric$IntIsIntegral$.times(Numeric.scala:57) at org.apache.spark.sql.catalyst.expressions.Multiply.nullSafeEval(arithmetic.scala:207) at org.apache.spark.sql.catalyst.expressions.BinaryExpression.eval(Expression.scala:416) at org.apache.spark.sql.catalyst.expressions.CreateStruct$$anonfun$eval$2.apply(complexTypeCreator.scala:198) at org.apache.spark.sql.catalyst.expressions.CreateStruct$$anonfun$eval$2.apply(complexTypeCreator.scala:198) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.immutable.List.foreach(List.scala:318) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at org.apache.spark.sql.catalyst.expressions.CreateStruct.eval(complexTypeCreator.scala:198) at org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:320) at org.apache.spark.sql.catalyst.parser.AstBuilder$$anonfun$visitInlineTable$1$$anonfun$39.apply(AstBuilder.scala:677) at org.apache.spark.sql.catalyst.parser.AstBuilder$$anonfun$visitInlineTable$1$$anonfun$39.apply(AstBuilder.scala:674) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at org.apache.spark.sql.catalyst.parser.AstBuilder$$anonfun$visitInlineTable$1.apply(AstBuilder.scala:674) at org.apache.spark.sql.catalyst.parser.AstBuilder$$anonfun$visitInlineTable$1.apply(AstBuilder.scala:658) at org.apache.spark.sql.catalyst.parser.ParserUtils$.withOrigin(ParserUtils.scala:96) at org.apache.spark.sql.catalyst.parser.AstBuilder.visitInlineTable(AstBuilder.scala:658) at org.apache.spark.sql.catalyst.parser.AstBuilder.visitInlineTable(AstBuilder.scala:43) at org.apache.spark.sql.catalyst.parser.SqlBaseParser$InlineTableContext.accept(SqlBaseParser.java:9358) at org.apache.spark.sql.catalyst.parser.AstBuilder.visitChildren(AstBuilder.scala:57) at org.apache.spark.sql.catalyst.parser.SqlBaseBaseVisitor.visitInlineTableDefault1(SqlBaseBaseVisitor.java:608) at org.apache.spark.sql.catalyst.parser.SqlBaseParser$InlineTableDefault1Context.accept(SqlBaseParser.java:7073) at org.apache.spark.sql.catalyst.parser.AstBuilder.visitChildren(AstBuilder.scala:57) at org.apache.spark.sql.catalyst.parser.SqlBaseBaseVisitor.visitQueryTermDefault(SqlBaseBaseVisitor.java:580) at org.apache.spark.sql.catalyst.parser.SqlBaseParser$QueryTermDefaultContext.accept(SqlBaseParser.java:6895) at org.apache.spark.sql.catalyst.parser.AstBuilder.typedVisit(AstBuilder.scala:47) at org.apache.spark.sql.catalyst.parser.AstBuilder.plan(AstBuilder.scala:83) at org.apache.spark.sql.catalyst.parser.AstBuilder$$anonfun$visitSingleInsertQuery$1.apply(AstBuilder.scala:158) at org.apache.spark.sql.catalyst.parser.AstBuilder$$anonfun$visitSingleInsertQuery$1.apply(AstBuilder.scala:162) at 
org.apache.spark.sql.catalyst.parser.ParserUtils$.withOrigin(ParserUtils.scala:96) at org.apache.spark.sql.catalyst.parser.AstBuilder.visitSingleInsertQuery(AstBuilder.scala:157) at org.apache.spark.sql.catalyst.parser.AstBuilder.visitSingleInsertQuery(AstBuilder.scala:43) at org.apache.spark.sql.catalyst.parser.SqlBaseParser$SingleInsertQueryContext.accept(SqlBaseParser.java:6500) at org.apache.spark.sql.catalyst.parser.AstBuilder.typedVisit(AstBuilder.scala:47) at org.apache.spark.sql.catalyst.parser.AstBuilder.plan(AstBuilder.scala:83) at
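The stack trace above is truncated in the archive. For comparison, the same statement is accepted by sqlite3, where the arithmetic in VALUES is simply evaluated before insertion; this sketches the expected behavior, not Spark's parser:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t (d double);
INSERT INTO t VALUES (1 * 1.0);
""")
(value,) = conn.execute("SELECT d FROM t").fetchone()
# 1 * 1.0 is evaluated as a floating-point multiplication, so the table ends
# up holding 1.0 rather than the statement failing with a cast error during
# parsing, as reported for Spark 2.0.0 above.
```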
[jira] [Created] (SPARK-17251) "ClassCastException: OuterReference cannot be cast to NamedExpression" for correlated subquery on the RHS of an IN operator
Josh Rosen created SPARK-17251:
----------------------------------

             Summary: "ClassCastException: OuterReference cannot be cast to NamedExpression" for correlated subquery on the RHS of an IN operator
                 Key: SPARK-17251
                 URL: https://issues.apache.org/jira/browse/SPARK-17251
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.0.0
            Reporter: Josh Rosen

The following test case produces a ClassCastException in the analyzer:

{code}
CREATE TABLE t1(a INTEGER);
INSERT INTO t1 VALUES(1),(2);
CREATE TABLE t2(b INTEGER);
INSERT INTO t2 VALUES(1);

SELECT a FROM t1 WHERE a NOT IN (SELECT a FROM t2);
{code}

Here's the exception:

{code}
java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.OuterReference cannot be cast to org.apache.spark.sql.catalyst.expressions.NamedExpression at org.apache.spark.sql.catalyst.plans.logical.Project$$anonfun$1.apply(basicLogicalOperators.scala:48) at scala.collection.LinearSeqOptimized$class.exists(LinearSeqOptimized.scala:80) at scala.collection.immutable.List.exists(List.scala:84) at org.apache.spark.sql.catalyst.plans.logical.Project.resolved$lzycompute(basicLogicalOperators.scala:44) at org.apache.spark.sql.catalyst.plans.logical.Project.resolved(basicLogicalOperators.scala:43) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveSubquery$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveSubquery$$resolveSubQuery(Analyzer.scala:1091) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveSubquery$$anonfun$org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveSubquery$$resolveSubQueries$1.applyOrElse(Analyzer.scala:1130) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveSubquery$$anonfun$org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveSubquery$$resolveSubQueries$1.applyOrElse(Analyzer.scala:1116) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279) at
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:278) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321) at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179) at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284) at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionDown$1(QueryPlan.scala:156) at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:166) at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$4.apply(QueryPlan.scala:175) at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179) at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDown(QueryPlan.scala:175) at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressions(QueryPlan.scala:144) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveSubquery$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveSubquery$$resolveSubQueries(Analyzer.scala:1116) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveSubquery$$anonfun$apply$16.applyOrElse(Analyzer.scala:1148) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveSubquery$$anonfun$apply$16.applyOrElse(Analyzer.scala:1141) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61) at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:60) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$1.apply(LogicalPlan.scala:58) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$1.apply(LogicalPlan.scala:58) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321) at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179) at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319) at
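The stack trace above is truncated in the archive. Note why the subquery is correlated at all: `t2` has no column `a`, so standard SQL name resolution binds the inner `a` to the outer `t1.a`. sqlite3 accepts the query under that resolution, and the result is empty because `a NOT IN (SELECT a FROM t2)` then compares `a` against itself for every non-empty `t2`:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t1(a INTEGER);
INSERT INTO t1 VALUES(1),(2);
CREATE TABLE t2(b INTEGER);
INSERT INTO t2 VALUES(1);
""")
# `a` inside the subquery resolves to the outer t1.a (a correlated outer
# reference -- the thing Spark's analyzer trips over), so the subquery yields
# {t1.a} for every outer row and the NOT IN predicate can never be true.
rows = conn.execute(
    "SELECT a FROM t1 WHERE a NOT IN (SELECT a FROM t2)"
).fetchall()
```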
[jira] [Commented] (SPARK-13525) SparkR: java.net.SocketTimeoutException: Accept timed out when running any dataframe function
[ https://issues.apache.org/jira/browse/SPARK-13525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15438331#comment-15438331 ]

Sun Rui commented on SPARK-13525:
---------------------------------

Sorry, with the log information missing at that point, it is hard to determine the root cause. It seems that you are using an old SparkR version. I am wondering if it is possible for you to modify the R code to collect more information? Also, you can try setting "spark.sparkr.use.daemon" to false to see whether the issue goes away.

> SparkR: java.net.SocketTimeoutException: Accept timed out when running any
> dataframe function
> --------------------------------------------------------------------------
>
>                 Key: SPARK-13525
>                 URL: https://issues.apache.org/jira/browse/SPARK-13525
>             Project: Spark
>          Issue Type: Bug
>          Components: SparkR
>            Reporter: Shubhanshu Mishra
>              Labels: sparkr
>
> I am following the code steps from this example:
> https://spark.apache.org/docs/1.6.0/sparkr.html
> There are multiple issues:
> 1. The head, summary, and filter methods are not overridden by Spark. Hence
> I need to call them using the `SparkR::` namespace.
> 2. When I try to execute the following, I get errors:
> {code}
> $> $R_HOME/bin/R
> R version 3.2.3 (2015-12-10) -- "Wooden Christmas-Tree"
> Copyright (C) 2015 The R Foundation for Statistical Computing
> Platform: x86_64-pc-linux-gnu (64-bit)
>
> R is free software and comes with ABSOLUTELY NO WARRANTY.
> You are welcome to redistribute it under certain conditions.
> Type 'license()' or 'licence()' for distribution details.
>
> Natural language support but running in an English locale
>
> R is a collaborative project with many contributors.
> Type 'contributors()' for more information and
> 'citation()' on how to cite R or R packages in publications.
>
> Type 'demo()' for some demos, 'help()' for on-line help, or
> 'help.start()' for an HTML browser interface to help.
> Type 'q()' to quit R.
>
> Welcome at Fri Feb 26 16:19:35 2016
>
> Attaching package: ‘SparkR’
>
> The following objects are masked from ‘package:base’:
>
>     colnames, colnames<-, drop, intersect, rank, rbind, sample, subset,
>     summary, transform
>
> Launching java with spark-submit command
> /content/smishra8/SOFTWARE/spark/bin/spark-submit --driver-memory "50g" sparkr-shell /tmp/RtmpfBQRg6/backend_portc3bc16f09b1b
>
> > df <- createDataFrame(sqlContext, iris)
> Warning messages:
> 1: In FUN(X[[i]], ...) :
>   Use Sepal_Length instead of Sepal.Length as column name
> 2: In FUN(X[[i]], ...) :
>   Use Sepal_Width instead of Sepal.Width as column name
> 3: In FUN(X[[i]], ...) :
>   Use Petal_Length instead of Petal.Length as column name
> 4: In FUN(X[[i]], ...) :
>   Use Petal_Width instead of Petal.Width as column name
> > training <- filter(df, df$Species != "setosa")
> Error in filter(df, df$Species != "setosa") :
>   no method for coercing this S4 class to a vector
> > training <- SparkR::filter(df, df$Species != "setosa")
> > model <- SparkR::glm(Species ~ Sepal_Length + Sepal_Width, data = training, family = "binomial")
> 16/02/26 16:26:46 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.net.SocketTimeoutException: Accept timed out
>         at java.net.PlainSocketImpl.socketAccept(Native Method)
>         at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:398)
>         at java.net.ServerSocket.implAccept(ServerSocket.java:530)
>         at java.net.ServerSocket.accept(ServerSocket.java:498)
>         at org.apache.spark.api.r.RRDD$.createRWorker(RRDD.scala:431)
>         at org.apache.spark.api.r.BaseRRDD.compute(RRDD.scala:62)
>         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
>         at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
>         at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
>         at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
>         at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
>         at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>         at
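The failing call in the trace is `ServerSocket.accept` in `RRDD.createRWorker`: the JVM listens and waits for the launched R worker to connect back, and times out when it never does. The mechanism can be sketched with Python sockets (the port and timeout here are illustrative, not Spark's values):

```python
import socket

# Driver-side pattern behind the error: listen on a local port, then block
# in accept() with a timeout while the worker process starts up.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))   # ephemeral port; Spark passes its port to the R worker
server.listen(1)
server.settimeout(0.2)          # illustrative; Spark waits much longer

try:
    # No client ever connects in this sketch, mimicking an R worker that
    # failed to start (e.g. R not installed, or the worker crashed early).
    conn, addr = server.accept()
    timed_out = False
except socket.timeout:
    timed_out = True            # the Java analogue is SocketTimeoutException
finally:
    server.close()
```

This is why the exception by itself says little about the root cause: it only proves the worker never phoned home, which motivates the logging improvements proposed in the next comment.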
[jira] [Updated] (SPARK-16283) Implement percentile_approx SQL function
[ https://issues.apache.org/jira/browse/SPARK-16283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cheng Lian updated SPARK-16283:
-------------------------------
    Assignee: (was: Sean Zhong)

> Implement percentile_approx SQL function
> -----------------------------------------
>
>                 Key: SPARK-16283
>                 URL: https://issues.apache.org/jira/browse/SPARK-16283
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>            Reporter: Reynold Xin
[jira] [Commented] (SPARK-12394) Support writing out pre-hash-partitioned data and exploit that in join optimizations to avoid shuffle (i.e. bucketing in Hive)
[ https://issues.apache.org/jira/browse/SPARK-12394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15438317#comment-15438317 ]

Darren Fu commented on SPARK-12394:
-----------------------------------

Any luck seeing this feature implemented in v2.0?

> Support writing out pre-hash-partitioned data and exploit that in join
> optimizations to avoid shuffle (i.e. bucketing in Hive)
> ----------------------------------------------------------------------
>
>                 Key: SPARK-12394
>                 URL: https://issues.apache.org/jira/browse/SPARK-12394
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>            Reporter: Reynold Xin
>            Assignee: Nong Li
>             Fix For: 2.0.0
>
>         Attachments: BucketedTables.pdf
[jira] [Commented] (SPARK-13525) SparkR: java.net.SocketTimeoutException: Accept timed out when running any dataframe function
[ https://issues.apache.org/jira/browse/SPARK-13525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15438316#comment-15438316 ]

Sun Rui commented on SPARK-13525:
---------------------------------

I checked the code and realized that a SocketTimeoutException means the R worker process should have started; otherwise, a different exception would be thrown from ProcessBuilder.start() if there were any problem launching the R worker process. We don't know the root cause of this kind of issue; it may be a bug or just a problem in the system runtime environment. But at least we can:
1. Update the SparkR setup documentation to state clearly that R must be installed on each node and that the R worker executable can be configured to a proper path.
2. Update RRunner, enlarging the scope of the try block to cover connection establishment, so that the stderr output of the R process is printed when a SocketTimeoutException is thrown. That will help us find the root cause.

> SparkR: java.net.SocketTimeoutException: Accept timed out when running any
> dataframe function
> --------------------------------------------------------------------------
>
>                 Key: SPARK-13525
>                 URL: https://issues.apache.org/jira/browse/SPARK-13525
>             Project: Spark
>          Issue Type: Bug
>          Components: SparkR
>            Reporter: Shubhanshu Mishra
>              Labels: sparkr
[jira] [Updated] (SPARK-16283) Implement percentile_approx SQL function
[ https://issues.apache.org/jira/browse/SPARK-16283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-16283: --- Assignee: Sean Zhong > Implement percentile_approx SQL function > > > Key: SPARK-16283 > URL: https://issues.apache.org/jira/browse/SPARK-16283 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Sean Zhong > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16948) Support empty orc table when converting hive serde table to data source table
[ https://issues.apache.org/jira/browse/SPARK-16948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajesh Balamohan updated SPARK-16948: - Summary: Support empty orc table when converting hive serde table to data source table (was: Querying empty partitioned orc tables throws exception) > Support empty orc table when converting hive serde table to data source table > - > > Key: SPARK-16948 > URL: https://issues.apache.org/jira/browse/SPARK-16948 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Rajesh Balamohan >Priority: Minor > > Querying empty partitioned ORC tables from spark-sql throws exception with > "spark.sql.hive.convertMetastoreOrc=true". > {noformat} > java.util.NoSuchElementException: None.get > at scala.None$.get(Option.scala:347) > at scala.None$.get(Option.scala:345) > at > org.apache.spark.sql.hive.HiveMetastoreCatalog$$anonfun$12.apply(HiveMetastoreCatalog.scala:297) > at > org.apache.spark.sql.hive.HiveMetastoreCatalog$$anonfun$12.apply(HiveMetastoreCatalog.scala:284) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.hive.HiveMetastoreCatalog.org$apache$spark$sql$hive$HiveMetastoreCatalog$$convertToLogicalRelation(HiveMetastoreCatalog.scala:284) > at > org.apache.spark.sql.hive.HiveMetastoreCatalog$OrcConversions$.org$apache$spark$sql$hive$HiveMetastoreCatalog$OrcConversions$$convertToOrcRelation(HiveMetastoreCatalo) > at > org.apache.spark.sql.hive.HiveMetastoreCatalog$OrcConversions$$anonfun$apply$2.applyOrElse(HiveMetastoreCatalog.scala:423) > at > org.apache.spark.sql.hive.HiveMetastoreCatalog$OrcConversions$$anonfun$apply$2.applyOrElse(HiveMetastoreCatalog.scala:414) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69) > at > 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:300) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:322) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
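The {{None.get}} failure above is the classic pitfall of calling Scala's {{Option.get}} when the option is empty (here: a partitioned ORC table with no data files). A minimal Python stand-in (hypothetical names, not Spark's actual code) contrasting the failing {{None.get}} pattern with a {{getOrElse}}-style fallback:

```python
def first_data_file(files):
    """Mirrors Scala's `Option.get`: throws when the table has no files."""
    if not files:
        # mirrors java.util.NoSuchElementException: None.get
        raise LookupError("None.get")
    return files[0]

def first_data_file_or_default(files, default="use-metastore-schema"):
    """Mirrors `Option.getOrElse`: empty tables fall back to a default."""
    return files[0] if files else default
```

The proposed fix amounts to using the second pattern at the conversion site so empty hive serde tables can still be converted to data source tables.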
[jira] [Commented] (SPARK-17250) Remove HiveClient and setCurrentDatabase from HiveSessionCatalog
[ https://issues.apache.org/jira/browse/SPARK-17250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15438233#comment-15438233 ] Apache Spark commented on SPARK-17250: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/14821 > Remove HiveClient and setCurrentDatabase from HiveSessionCatalog > > > Key: SPARK-17250 > URL: https://issues.apache.org/jira/browse/SPARK-17250 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Xiao Li > > This is the first step to clean `HiveClient` from `HiveSessionState`. In the > metastore interaction, we always set fully qualified names when > accessing/operating a table. That means, we always specify the database. > Thus, it is not necessary to use `HiveClient` to change the active database > in Hive metastore. > In `HiveSessionCatalog `, `setCurrentDatabase` is the only function that uses > `HiveClient`. Thus, we can remove it after removing `setCurrentDatabase` -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17250) Remove HiveClient and setCurrentDatabase from HiveSessionCatalog
[ https://issues.apache.org/jira/browse/SPARK-17250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17250: Assignee: Apache Spark > Remove HiveClient and setCurrentDatabase from HiveSessionCatalog > > > Key: SPARK-17250 > URL: https://issues.apache.org/jira/browse/SPARK-17250 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Xiao Li >Assignee: Apache Spark > > This is the first step to clean `HiveClient` from `HiveSessionState`. In the > metastore interaction, we always set fully qualified names when > accessing/operating a table. That means, we always specify the database. > Thus, it is not necessary to use `HiveClient` to change the active database > in Hive metastore. > In `HiveSessionCatalog `, `setCurrentDatabase` is the only function that uses > `HiveClient`. Thus, we can remove it after removing `setCurrentDatabase` -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17250) Remove HiveClient and setCurrentDatabase from HiveSessionCatalog
Xiao Li created SPARK-17250: --- Summary: Remove HiveClient and setCurrentDatabase from HiveSessionCatalog Key: SPARK-17250 URL: https://issues.apache.org/jira/browse/SPARK-17250 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.1.0 Reporter: Xiao Li This is the first step to clean `HiveClient` from `HiveSessionState`. In the metastore interaction, we always set fully qualified names when accessing/operating a table. That means, we always specify the database. Thus, it is not necessary to use `HiveClient` to change the active database in Hive metastore. In `HiveSessionCatalog `, `setCurrentDatabase` is the only function that uses `HiveClient`. Thus, we can remove it after removing `setCurrentDatabase` -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17212) TypeCoercion support widening conversion between DateType and TimestampType
[ https://issues.apache.org/jira/browse/SPARK-17212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-17212: Assignee: Hyukjin Kwon > TypeCoercion support widening conversion between DateType and TimestampType > --- > > Key: SPARK-17212 > URL: https://issues.apache.org/jira/browse/SPARK-17212 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon > Fix For: 2.1.0 > > > Currently, type-widening does not work between {{TimestampType}} and > {{DateType}}. > This applies to {{SetOperation}}, {{Union}}, {{In}}, {{CaseWhen}}, > {{Greatest}}, {{Least}}, {{CreateArray}}, {{CreateMap}} and {{Coalesce}}. > For a simple example, > {code} > Seq(Tuple2(new Timestamp(0), new Date(0))).toDF("a", > "b").selectExpr("greatest(a, b)").show() > {code} > {code} > cannot resolve 'greatest(`a`, `b`)' due to data type mismatch: The > expressions should all have the same type, got GREATEST(timestamp, date) > {code} > or Union as below: > {code} > val a = Seq(Tuple1(new Timestamp(0))).toDF() > val b = Seq(Tuple1(new Date(0))).toDF() > a.union(b).show() > {code} > {code} > Union can only be performed on tables with the compatible column types. > DateType <> TimestampType at the first column of the second table; > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-17212) TypeCoercion support widening conversion between DateType and TimestampType
[ https://issues.apache.org/jira/browse/SPARK-17212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-17212. - Resolution: Fixed Fix Version/s: 2.1.0 Issue resolved by pull request 14786 [https://github.com/apache/spark/pull/14786] > TypeCoercion support widening conversion between DateType and TimestampType > --- > > Key: SPARK-17212 > URL: https://issues.apache.org/jira/browse/SPARK-17212 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Hyukjin Kwon > Fix For: 2.1.0 > > > Currently, type-widening does not work between {{TimestampType}} and > {{DateType}}. > This applies to {{SetOperation}}, {{Union}}, {{In}}, {{CaseWhen}}, > {{Greatest}}, {{Least}}, {{CreateArray}}, {{CreateMap}} and {{Coalesce}}. > For a simple example, > {code} > Seq(Tuple2(new Timestamp(0), new Date(0))).toDF("a", > "b").selectExpr("greatest(a, b)").show() > {code} > {code} > cannot resolve 'greatest(`a`, `b`)' due to data type mismatch: The > expressions should all have the same type, got GREATEST(timestamp, date) > {code} > or Union as below: > {code} > val a = Seq(Tuple1(new Timestamp(0))).toDF() > val b = Seq(Tuple1(new Date(0))).toDF() > a.union(b).show() > {code} > {code} > Union can only be performed on tables with the compatible column types. > DateType <> TimestampType at the first column of the second table; > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
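The type-widening rule discussed above (promote {{DateType}} to {{TimestampType}} at midnight before comparing) can be illustrated with a small Python sketch, since Python's `date`/`datetime` have the same problem: comparing them directly raises a `TypeError`, much like the analysis errors quoted in the report. This is an illustrative stand-in, not Spark's implementation:

```python
from datetime import date, datetime

def widen(value):
    """Widen a DateType-like value to a TimestampType-like value (midnight)."""
    # datetime is a subclass of date, so the datetime check must come first
    if isinstance(value, datetime):
        return value
    if isinstance(value, date):
        return datetime(value.year, value.month, value.day)
    return value

def greatest(*values):
    """A greatest() that widens all operands to a common type before comparing."""
    return max(widen(v) for v in values)
```

With widening in place, `greatest(datetime(1970, 1, 1, 12, 0), date(1970, 1, 2))` resolves cleanly instead of failing on a type mismatch.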
[jira] [Created] (SPARK-17249) java.lang.IllegalStateException: Did not find registered driver with class org.apache.spark.sql.execution.datasources.jdbc.DriverWrapper
Graeme Edwards created SPARK-17249:
--
 Summary: java.lang.IllegalStateException: Did not find registered driver with class org.apache.spark.sql.execution.datasources.jdbc.DriverWrapper
 Key: SPARK-17249
 URL: https://issues.apache.org/jira/browse/SPARK-17249
 Project: Spark
 Issue Type: Bug
 Components: SQL
Affects Versions: 1.6.0
Reporter: Graeme Edwards
Priority: Minor

This issue is a corner case related to SPARK-14162 that isn't fixed by that change. It occurs when:
- we are using Oracle's ojdbc driver;
- the driver is wrapped in a DriverWrapper because it was added via the Spark class loader;
- we don't specify an explicit "driver" property.

Then in org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala (createConnectionFactory), the driver class is resolved as:

{code}
val driverClass: String = userSpecifiedDriverClass.getOrElse {
  DriverManager.getDriver(url).getClass.getCanonicalName
}
{code}

Since the driver is wrapped by a DriverWrapper, this yields "org.apache.spark.sql.execution.datasources.jdbc.DriverWrapper". That name is passed to the Executor, which then tries to find a registered driver named "org.apache.spark.sql.execution.datasources.jdbc.DriverWrapper". However, the Executor is aware of the wrapping and compares against the wrapped class name instead:

{code}
case d: DriverWrapper if d.wrapped.getClass.getCanonicalName == driverClass => d
{code}

I think the fix is to make the initialization of driverClass aware that there might be a wrapper and, if so, pass the wrapped class name. The problem can be worked around by setting the "driver" property for the JDBC call:

{code}
val props = new java.util.Properties()
props.put("driver", "oracle.jdbc.OracleDriver")
val result = sqlContext.read.jdbc(connectionString, query, props)
{code}
 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
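The mismatch described above (driver side reports the wrapper's class name, executor side compares against the wrapped class name) can be sketched in a few lines of Python. This is a hypothetical illustration of the proposed fix, not Spark's code:

```python
class DriverWrapper:
    """Stand-in for Spark's JDBC DriverWrapper around a class-loader-added driver."""
    def __init__(self, wrapped):
        self.wrapped = wrapped

class OracleDriver:
    """Stand-in for oracle.jdbc.OracleDriver."""
    pass

def canonical_driver_name(driver):
    # Proposed fix: unwrap before taking the class name, so the name sent to
    # the executor matches the executor's wrapped-class comparison.
    if isinstance(driver, DriverWrapper):
        return type(driver.wrapped).__qualname__
    return type(driver).__qualname__
```

With this, both a bare driver and a wrapped one resolve to "OracleDriver", matching the executor-side `d.wrapped.getClass.getCanonicalName == driverClass` check.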
[jira] [Created] (SPARK-17248) Add native Scala enum support to Dataset Encoders
Silvio Fiorito created SPARK-17248: -- Summary: Add native Scala enum support to Dataset Encoders Key: SPARK-17248 URL: https://issues.apache.org/jira/browse/SPARK-17248 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.0.0 Reporter: Silvio Fiorito Enable support for Scala enums in Encoders. Ideally, users should be able to use enums as part of case classes automatically. Currently, this code... {code} object MyEnum extends Enumeration { type MyEnum = Value val EnumVal1, EnumVal2 = Value } case class MyClass(col: MyEnum.MyEnum) val data = Seq(MyClass(MyEnum.EnumVal1), MyClass(MyEnum.EnumVal2)).toDS() {code} ...results in this stacktrace: {code} java.lang.UnsupportedOperationException: No Encoder found for MyEnum.MyEnum - field (class: "scala.Enumeration.Value", name: "col") - root class: "line550c9f34c5144aa1a1e76bcac863244717.$read.$iwC.$iwC.$iwC.$iwC.MyClass" at org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:598) at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$9.apply(ScalaReflection.scala:592) at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$9.apply(ScalaReflection.scala:583) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) at scala.collection.immutable.List.foreach(List.scala:318) at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251) at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105) at org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:583) at org.apache.spark.sql.catalyst.ScalaReflection$.serializerFor(ScalaReflection.scala:425) at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:61) at org.apache.spark.sql.Encoders$.product(Encoders.scala:274) at 
org.apache.spark.sql.SQLImplicits.newProductEncoder(SQLImplicits.scala:47) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
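An enum encoder of the kind requested above typically round-trips each member through a stable primitive representation (its name or id). A hedged Python sketch of that idea, using Python's `enum` as a stand-in for Scala's `Enumeration` (the names `encode`/`decode` are illustrative, not a real Spark API):

```python
from enum import Enum

class MyEnum(Enum):
    """Stand-in for the Scala Enumeration in the report."""
    EnumVal1 = 1
    EnumVal2 = 2

def encode(value: MyEnum) -> str:
    # Serialize the enum as its name, as a string-backed encoder would
    return value.name

def decode(name: str) -> MyEnum:
    # Rebuild the enum member from its stored name on read
    return MyEnum[name]
```

The key property an encoder needs is that `decode(encode(v))` returns the same member for every value of the enum.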
[jira] [Assigned] (SPARK-17246) Support BigDecimal literal parsing
[ https://issues.apache.org/jira/browse/SPARK-17246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17246: Assignee: Herman van Hovell (was: Apache Spark) > Support BigDecimal literal parsing > -- > > Key: SPARK-17246 > URL: https://issues.apache.org/jira/browse/SPARK-17246 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Herman van Hovell >Assignee: Herman van Hovell >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17246) Support BigDecimal literal parsing
[ https://issues.apache.org/jira/browse/SPARK-17246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17246: Assignee: Apache Spark (was: Herman van Hovell) > Support BigDecimal literal parsing > -- > > Key: SPARK-17246 > URL: https://issues.apache.org/jira/browse/SPARK-17246 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Herman van Hovell >Assignee: Apache Spark >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17246) Support BigDecimal literal parsing
[ https://issues.apache.org/jira/browse/SPARK-17246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15438152#comment-15438152 ] Apache Spark commented on SPARK-17246: -- User 'hvanhovell' has created a pull request for this issue: https://github.com/apache/spark/pull/14819 > Support BigDecimal literal parsing > -- > > Key: SPARK-17246 > URL: https://issues.apache.org/jira/browse/SPARK-17246 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Herman van Hovell >Assignee: Herman van Hovell >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17187) Support using arbitrary Java object as internal aggregation buffer object
[ https://issues.apache.org/jira/browse/SPARK-17187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-17187: - Assignee: Sean Zhong > Support using arbitrary Java object as internal aggregation buffer object > - > > Key: SPARK-17187 > URL: https://issues.apache.org/jira/browse/SPARK-17187 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Sean Zhong >Assignee: Sean Zhong > Fix For: 2.1.0 > > > *Background* > For aggregation functions like sum and count, Spark-Sql internally use an > aggregation buffer to store the intermediate aggregation result for all > aggregation functions. Each aggregation function will occupy a section of > aggregation buffer. > *Problem* > Currently, Spark-sql only allows a small set of Spark-Sql supported storage > data types stored in aggregation buffer, which is not very convenient or > performant, there are several typical cases like: > 1. If the aggregation has a complex model CountMinSketch, it is not very easy > to convert the complex model so that it can be stored with limited Spark-sql > supported data types. > 2. It is hard to reuse aggregation class definition defined in existing > libraries like algebird. > 3. It may introduces heavy serialization/deserialization cost when converting > a domain model to Spark sql supported data type. For example, the current > implementation of `TypedAggregateExpression` requires > serialization/de-serialization for each call of update or merge. > *Proposal* > We propose: > 1. Introduces a TypedImperativeAggregate which allows using arbitrary java > object as aggregation buffer, with requirements like: > - It is flexible enough that the API allows using any java object as > aggregation buffer, so that it is easier to integrate with existing Monoid > libraries like algebird. > - We don't need to call serialize/deserialize for each call of > update/merge. Instead, only a few serialization/deserialization operations > are needed. 
This is to guarantee the performance. > 2. Refactors `TypedAggregateExpression` to use this new interface, to get > higher performance. > 3. Implements approximate percentile and other aggregation functions that have a > complex aggregation object with this new interface. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-17187) Support using arbitrary Java object as internal aggregation buffer object
[ https://issues.apache.org/jira/browse/SPARK-17187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-17187. -- Resolution: Fixed Fix Version/s: 2.1.0 Issue resolved by pull request 14753 [https://github.com/apache/spark/pull/14753] > Support using arbitrary Java object as internal aggregation buffer object > - > > Key: SPARK-17187 > URL: https://issues.apache.org/jira/browse/SPARK-17187 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Sean Zhong > Fix For: 2.1.0 > > > *Background* > For aggregation functions like sum and count, Spark-Sql internally use an > aggregation buffer to store the intermediate aggregation result for all > aggregation functions. Each aggregation function will occupy a section of > aggregation buffer. > *Problem* > Currently, Spark-sql only allows a small set of Spark-Sql supported storage > data types stored in aggregation buffer, which is not very convenient or > performant, there are several typical cases like: > 1. If the aggregation has a complex model CountMinSketch, it is not very easy > to convert the complex model so that it can be stored with limited Spark-sql > supported data types. > 2. It is hard to reuse aggregation class definition defined in existing > libraries like algebird. > 3. It may introduces heavy serialization/deserialization cost when converting > a domain model to Spark sql supported data type. For example, the current > implementation of `TypedAggregateExpression` requires > serialization/de-serialization for each call of update or merge. > *Proposal* > We propose: > 1. Introduces a TypedImperativeAggregate which allows using arbitrary java > object as aggregation buffer, with requirements like: > - It is flexible enough that the API allows using any java object as > aggregation buffer, so that it is easier to integrate with existing Monoid > libraries like algebird. > - We don't need to call serialize/deserialize for each call of > update/merge. 
Instead, only a few serialization/deserialization operations > are needed. This is to guarantee the performance. > 2. Refactors `TypedAggregateExpression` to use this new interface, to get > higher performance. > 3. Implements approximate percentile and other aggregation functions that have a > complex aggregation object with this new interface. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
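The TypedImperativeAggregate proposal above can be sketched as follows: the buffer is an arbitrary in-memory object, `update` touches it without any serialization, and serialization happens only when partial results cross a partition boundary. A minimal Python stand-in (hypothetical class, not Spark's interface):

```python
import pickle

class CountByKeyAgg:
    """Sketch of an aggregate whose buffer is an arbitrary in-memory object."""

    def create_buffer(self):
        return {}  # any object may serve as the buffer

    def update(self, buf, key):
        # Per-row path: mutate the object directly, no serialization cost
        buf[key] = buf.get(key, 0) + 1
        return buf

    def serialize(self, buf):
        # Only invoked when shipping partial results between partitions
        return pickle.dumps(buf)

    def merge(self, buf, serialized_other):
        # Deserialize once per incoming partial result, then combine
        for key, count in pickle.loads(serialized_other).items():
            buf[key] = buf.get(key, 0) + count
        return buf
```

Compared with serializing on every `update`/`merge` call (the cost the proposal calls out for `TypedAggregateExpression`), serialization here happens once per partial result.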
[jira] [Resolved] (SPARK-17231) Avoid building debug or trace log messages unless the respective log level is enabled
[ https://issues.apache.org/jira/browse/SPARK-17231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-17231. -- Resolution: Fixed Assignee: Michael Allman Fix Version/s: 2.1.0 2.0.1 > Avoid building debug or trace log messages unless the respective log level is > enabled > - > > Key: SPARK-17231 > URL: https://issues.apache.org/jira/browse/SPARK-17231 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.0 > Environment: Spark cluster with 8 r3.8xl EC2 worker instances >Reporter: Michael Allman >Assignee: Michael Allman >Priority: Minor > Fix For: 2.0.1, 2.1.0 > > Attachments: logging_perf_improvements 2.jpg, > logging_perf_improvements.jpg, master 2.jpg, master.jpg > > > While debugging the performance of a large GraphX connected components > computation, I found several places in the {{network-common}} and > {{network-shuffle}} code bases where trace or debug log messages are > constructed even if the respective log level is disabled. Refactoring the > respective code to avoid these constructions except where necessary led to a > modest but measurable reduction in task time, GC time and the ratio thereof. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
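The optimization above is a general logging pattern: never pay for building a debug message unless debug logging is actually enabled. A small Python illustration of the same guard (the code below is an analogy, not the Spark patch itself):

```python
import logging

logger = logging.getLogger("shuffle")
logger.setLevel(logging.INFO)  # debug disabled, as in a production run

calls = {"expensive": 0}

def describe_blocks():
    """Stand-in for an expensive message-building step (e.g. joining block IDs)."""
    calls["expensive"] += 1
    return "block details"

# Unguarded version (commented out): the argument is evaluated eagerly,
# so the expensive work happens even though the message is discarded.
# logger.debug("state: " + describe_blocks())

# Guarded version, matching the patch: build the message only when needed.
if logger.isEnabledFor(logging.DEBUG):
    logger.debug("state: %s", describe_blocks())
```

With the guard, `describe_blocks()` is never called at INFO level, which is exactly the task-time and GC saving the ticket measured.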
[jira] [Comment Edited] (SPARK-7007) Add metrics source for ExecutorAllocationManager to expose internal status
[ https://issues.apache.org/jira/browse/SPARK-7007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15437325#comment-15437325 ] Erik Erlandson edited comment on SPARK-7007 at 8/25/16 11:25 PM: - Are there instructions for how to enable these metrics? Is it an incantation in the `metrics.properties` file? Update: my cluster had been mistakenly configured without dynamic executor allocation. When that is turned on, these metrics are published under driver metrics, without any special configuration. was (Author: eje): Are there instructions for how to enable these metrics? Is it an incantation in the `metrics.properties` file? > Add metrics source for ExecutorAllocationManager to expose internal status > -- > > Key: SPARK-7007 > URL: https://issues.apache.org/jira/browse/SPARK-7007 > Project: Spark > Issue Type: Improvement > Components: Spark Core, YARN >Affects Versions: 1.3.0 >Reporter: Saisai Shao >Priority: Minor > > Add a metrics source to expose the internal status of > ExecutorAllocationManager, to better monitor executor allocation when > running on Yarn. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17157) Add multiclass logistic regression SparkR Wrapper
[ https://issues.apache.org/jira/browse/SPARK-17157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15438054#comment-15438054 ] Apache Spark commented on SPARK-17157: -- User 'wangmiao1981' has created a pull request for this issue: https://github.com/apache/spark/pull/14818 > Add multiclass logistic regression SparkR Wrapper > - > > Key: SPARK-17157 > URL: https://issues.apache.org/jira/browse/SPARK-17157 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Miao Wang > > [SPARK-7159][ML] Add multiclass logistic regression to Spark ML has been > merged to Master. I open this JIRA for discussion of adding SparkR wrapper > for multiclass logistic regression. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17157) Add multiclass logistic regression SparkR Wrapper
[ https://issues.apache.org/jira/browse/SPARK-17157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17157: Assignee: (was: Apache Spark) > Add multiclass logistic regression SparkR Wrapper > - > > Key: SPARK-17157 > URL: https://issues.apache.org/jira/browse/SPARK-17157 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Miao Wang > > [SPARK-7159][ML] Add multiclass logistic regression to Spark ML has been > merged to Master. I open this JIRA for discussion of adding SparkR wrapper > for multiclass logistic regression. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-17240) SparkConf is Serializable but contains a non-serializable field
[ https://issues.apache.org/jira/browse/SPARK-17240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-17240. Resolution: Fixed Assignee: Marcelo Vanzin Fix Version/s: 2.1.0 > SparkConf is Serializable but contains a non-serializable field > --- > > Key: SPARK-17240 > URL: https://issues.apache.org/jira/browse/SPARK-17240 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Michael Gummelt >Assignee: Marcelo Vanzin > Fix For: 2.1.0 > > > This commit: > https://github.com/apache/spark/commit/5da6c4b24f512b63cd4e6ba7dd8968066a9396f5 > Added ConfigReader to SparkConf. SparkConf is Serializable, but ConfigReader > is not, which results in the following exception: > {code} > java.io.NotSerializableException: > org.apache.spark.internal.config.ConfigReader > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1184) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348) > at org.apache.spark.util.Utils$.serialize(Utils.scala:134) > at > org.apache.spark.scheduler.cluster.mesos.ZookeeperMesosClusterPersistenceEngine.persist(MesosClusterPersistenceEngine.scala:111) > at > org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.submitDriver(MesosClusterScheduler.scala:170) > at > 
org.apache.spark.deploy.rest.mesos.MesosSubmitRequestServlet.handleSubmit(MesosRestServer.scala:126) > at > org.apache.spark.deploy.rest.SubmitRequestServlet.doPost(RestSubmissionServer.scala:265) > at javax.servlet.http.HttpServlet.service(HttpServlet.java:707) > at javax.servlet.http.HttpServlet.service(HttpServlet.java:790) > at > org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:812) > at > org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:587) > at > org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127) > at > org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515) > at > org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061) > at > org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) > at > org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97) > at org.spark_project.jetty.server.Server.handle(Server.java:499) > at > org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:311) > at > org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:257) > at > org.spark_project.jetty.io.AbstractConnection$2.run(AbstractConnection.java:544) > at > org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635) > at > org.spark_project.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
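The fix pattern for this class of bug (a Serializable object holding a non-serializable helper) is to exclude the helper from the serialized state and rebuild it after deserialization, which is what marking a field `@transient` accomplishes in Scala. A Python analogy using pickle's state hooks (illustrative, not Spark's code; a `threading.Lock` stands in for the non-serializable `ConfigReader`):

```python
import pickle
import threading

class Conf:
    """Sketch: keep the non-serializable helper out of the pickled state."""

    def __init__(self):
        self.settings = {"spark.app.name": "demo"}
        # Stand-in for ConfigReader: locks cannot be pickled
        self._reader = threading.Lock()

    def __getstate__(self):
        # Drop the transient field before serialization, like `@transient`
        state = self.__dict__.copy()
        del state["_reader"]
        return state

    def __setstate__(self, state):
        # Recreate the helper on the receiving side
        self.__dict__.update(state)
        self._reader = threading.Lock()
```

Without the two hooks, `pickle.dumps(Conf())` fails on the lock, the same shape of failure as the `NotSerializableException` above; with them, the round trip succeeds and the helper is rebuilt fresh.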
[jira] [Assigned] (SPARK-17157) Add multiclass logistic regression SparkR Wrapper
[ https://issues.apache.org/jira/browse/SPARK-17157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17157: Assignee: Apache Spark > Add multiclass logistic regression SparkR Wrapper > - > > Key: SPARK-17157 > URL: https://issues.apache.org/jira/browse/SPARK-17157 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Miao Wang >Assignee: Apache Spark > > [SPARK-7159][ML] Add multiclass logistic regression to Spark ML has been > merged to master. I am opening this JIRA to discuss adding a SparkR wrapper > for multiclass logistic regression.
[jira] [Closed] (SPARK-16627) --jars doesn't work in Mesos mode
[ https://issues.apache.org/jira/browse/SPARK-16627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Gummelt closed SPARK-16627. --- Resolution: Won't Fix > --jars doesn't work in Mesos mode > - > > Key: SPARK-16627 > URL: https://issues.apache.org/jira/browse/SPARK-16627 > Project: Spark > Issue Type: Bug > Components: Mesos >Reporter: Michael Gummelt > > Definitely doesn't work in cluster mode. Might not work in client mode > either.
[jira] [Commented] (SPARK-14927) DataFrame.saveAsTable creates RDD partitions but not Hive partitions
[ https://issues.apache.org/jira/browse/SPARK-14927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15437861#comment-15437861 ] Corentin Kerisit commented on SPARK-14927: -- Any guideline on how we could help get this resolved? > DataFrame.saveAsTable creates RDD partitions but not Hive partitions > - > > Key: SPARK-14927 > URL: https://issues.apache.org/jira/browse/SPARK-14927 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2, 1.6.1 > Environment: Mac OS X 10.11.4 local >Reporter: Sasha Ovsankin > > This is a followup to > http://stackoverflow.com/questions/31341498/save-spark-dataframe-as-dynamic-partitioned-table-in-hive > . I tried to use the suggestions in the answers but couldn't make it work in > Spark 1.6.1 > I am trying to create partitions programmatically from a `DataFrame`. Here is > the relevant code (adapted from a Spark test): > hc.setConf("hive.metastore.warehouse.dir", "tmp/tests") > //hc.setConf("hive.exec.dynamic.partition", "true") > //hc.setConf("hive.exec.dynamic.partition.mode", "nonstrict") > hc.sql("create database if not exists tmp") > hc.sql("drop table if exists tmp.partitiontest1") > Seq(2012 -> "a").toDF("year", "val") > .write > .partitionBy("year") > .mode(SaveMode.Append) > .saveAsTable("tmp.partitiontest1") > hc.sql("show partitions tmp.partitiontest1").show > Full file is here: > https://gist.github.com/SashaOv/7c65f03a51c7e8f9c9e018cd42aa4c4a > I get the error that the table is not partitioned: > == > HIVE FAILURE OUTPUT > == > SET hive.support.sql11.reserved.keywords=false > SET hive.metastore.warehouse.dir=tmp/tests > OK > OK > FAILED: Execution Error, return code 1 from > org.apache.hadoop.hive.ql.exec.DDLTask. Table tmp.partitiontest1 is not a > partitioned table > == > It looks like the root cause is that > `org.apache.spark.sql.hive.HiveMetastoreCatalog.newSparkSQLSpecificMetastoreTable` > always creates the table with empty partitions. 
> Any help to move this forward is appreciated.
[jira] [Assigned] (SPARK-17247) when fall back to hdfs is enabled for stats calculation, the hdfs listing and size calculation should be terminated as soon as total size > broadcast threshold
[ https://issues.apache.org/jira/browse/SPARK-17247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17247: Assignee: Apache Spark > when fall back to hdfs is enabled for stats calculation, the hdfs listing and > size calculation should be terminated as soon as total size > broadcast > threshold > -- > > Key: SPARK-17247 > URL: https://issues.apache.org/jira/browse/SPARK-17247 > Project: Spark > Issue Type: Bug >Reporter: Parth Brahmbhatt >Assignee: Apache Spark > > Currently, when a user enables spark.sql.statistics.fallBackToHdfs and no stats > are available from the metastore, we fall back to HDFS. This is a useful join > optimization; however, it can slow things down. To speed up the operation we > could stop the size calculation as soon as we hit the broadcast threshold, as > exact size accuracy is not important.
[jira] [Assigned] (SPARK-17247) when fall back to hdfs is enabled for stats calculation, the hdfs listing and size calculation should be terminated as soon as total size > broadcast threshold
[ https://issues.apache.org/jira/browse/SPARK-17247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17247: Assignee: (was: Apache Spark) > when fall back to hdfs is enabled for stats calculation, the hdfs listing and > size calculation should be terminated as soon as total size > broadcast > threshold > -- > > Key: SPARK-17247 > URL: https://issues.apache.org/jira/browse/SPARK-17247 > Project: Spark > Issue Type: Bug >Reporter: Parth Brahmbhatt > > Currently, when a user enables spark.sql.statistics.fallBackToHdfs and no stats > are available from the metastore, we fall back to HDFS. This is a useful join > optimization; however, it can slow things down. To speed up the operation we > could stop the size calculation as soon as we hit the broadcast threshold, as > exact size accuracy is not important.
[jira] [Commented] (SPARK-17247) when fall back to hdfs is enabled for stats calculation, the hdfs listing and size calculation should be terminated as soon as total size > broadcast threshold
[ https://issues.apache.org/jira/browse/SPARK-17247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15437860#comment-15437860 ] Apache Spark commented on SPARK-17247: -- User 'Parth-Brahmbhatt' has created a pull request for this issue: https://github.com/apache/spark/pull/14817 > when fall back to hdfs is enabled for stats calculation, the hdfs listing and > size calculation should be terminated as soon as total size > broadcast > threshold > -- > > Key: SPARK-17247 > URL: https://issues.apache.org/jira/browse/SPARK-17247 > Project: Spark > Issue Type: Bug >Reporter: Parth Brahmbhatt > > Currently, when a user enables spark.sql.statistics.fallBackToHdfs and no stats > are available from the metastore, we fall back to HDFS. This is a useful join > optimization; however, it can slow things down. To speed up the operation we > could stop the size calculation as soon as we hit the broadcast threshold, as > exact size accuracy is not important.
[jira] [Commented] (SPARK-13525) SparkR: java.net.SocketTimeoutException: Accept timed out when running any dataframe function
[ https://issues.apache.org/jira/browse/SPARK-13525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15437856#comment-15437856 ] Shivaram Venkataraman commented on SPARK-13525: --- Yeah, this is related but a slightly different error - this means that the R daemons were started but the workers they forked didn't connect back to the JVM. I think this could happen if the machine runs out of memory / file descriptors etc., causing a fork to fail? > SparkR: java.net.SocketTimeoutException: Accept timed out when running any > dataframe function > - > > Key: SPARK-13525 > URL: https://issues.apache.org/jira/browse/SPARK-13525 > Project: Spark > Issue Type: Bug > Components: SparkR >Reporter: Shubhanshu Mishra > Labels: sparkr > > I am following the code steps from this example: > https://spark.apache.org/docs/1.6.0/sparkr.html > There are multiple issues: > 1. The head and summary and filter methods are not overridden by Spark. Hence > I need to call them using the `SparkR::` namespace. > 2. When I try to execute the following, I get errors: > {code} > $> $R_HOME/bin/R > R version 3.2.3 (2015-12-10) -- "Wooden Christmas-Tree" > Copyright (C) 2015 The R Foundation for Statistical Computing > Platform: x86_64-pc-linux-gnu (64-bit) > R is free software and comes with ABSOLUTELY NO WARRANTY. > You are welcome to redistribute it under certain conditions. > Type 'license()' or 'licence()' for distribution details. > Natural language support but running in an English locale > R is a collaborative project with many contributors. > Type 'contributors()' for more information and > 'citation()' on how to cite R or R packages in publications. > Type 'demo()' for some demos, 'help()' for on-line help, or > 'help.start()' for an HTML browser interface to help. > Type 'q()' to quit R. 
> Welcome at Fri Feb 26 16:19:35 2016 > Attaching package: ‘SparkR’ > The following objects are masked from ‘package:base’: > colnames, colnames<-, drop, intersect, rank, rbind, sample, subset, > summary, transform > Launching java with spark-submit command > /content/smishra8/SOFTWARE/spark/bin/spark-submit --driver-memory "50g" > sparkr-shell /tmp/RtmpfBQRg6/backend_portc3bc16f09b1b > > df <- createDataFrame(sqlContext, iris) > Warning messages: > 1: In FUN(X[[i]], ...) : > Use Sepal_Length instead of Sepal.Length as column name > 2: In FUN(X[[i]], ...) : > Use Sepal_Width instead of Sepal.Width as column name > 3: In FUN(X[[i]], ...) : > Use Petal_Length instead of Petal.Length as column name > 4: In FUN(X[[i]], ...) : > Use Petal_Width instead of Petal.Width as column name > > training <- filter(df, df$Species != "setosa") > Error in filter(df, df$Species != "setosa") : > no method for coercing this S4 class to a vector > > training <- SparkR::filter(df, df$Species != "setosa") > > model <- SparkR::glm(Species ~ Sepal_Length + Sepal_Width, data = training, > > family = "binomial") > 16/02/26 16:26:46 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1) > java.net.SocketTimeoutException: Accept timed out > at java.net.PlainSocketImpl.socketAccept(Native Method) > at > java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:398) > at java.net.ServerSocket.implAccept(ServerSocket.java:530) > at java.net.ServerSocket.accept(ServerSocket.java:498) > at org.apache.spark.api.r.RRDD$.createRWorker(RRDD.scala:431) > at org.apache.spark.api.r.BaseRRDD.compute(RRDD.scala:62) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:277) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:277) > at > 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:277) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:277) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:277) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313) > at
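As a quick sanity check for the resource-exhaustion hypothesis above, the machine's file-descriptor limits can be inspected before blaming SparkR. A small sketch using the standard-library `resource` module (Unix-only; this is a diagnostic aid, not part of Spark):

```python
import resource

def descriptor_limits():
    """Return (soft, hard) limits for open file descriptors.

    A soft limit near the common default of 1024 is a frequent cause of
    failed forks/accepts under load; raising it is an OS-level fix.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft, hard

soft, hard = descriptor_limits()
print(f"open-file soft limit: {soft}, hard limit: {hard}")
```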
[jira] [Created] (SPARK-17247) when fall back to hdfs is enabled for stats calculation, the hdfs listing and size calculation should be terminated as soon as total size > broadcast threshold
Parth Brahmbhatt created SPARK-17247: Summary: when fall back to hdfs is enabled for stats calculation, the hdfs listing and size calculation should be terminated as soon as total size > broadcast threshold Key: SPARK-17247 URL: https://issues.apache.org/jira/browse/SPARK-17247 Project: Spark Issue Type: Bug Reporter: Parth Brahmbhatt Currently, when a user enables spark.sql.statistics.fallBackToHdfs and no stats are available from the metastore, we fall back to HDFS. This is a useful join optimization; however, it can slow things down. To speed up the operation we could stop the size calculation as soon as we hit the broadcast threshold, as exact size accuracy is not important.
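The optimization proposed in SPARK-17247 is an early exit from the size scan: once the running total passes the broadcast threshold, the exact total no longer matters. A minimal sketch of that short-circuit (pure Python, with the HDFS listing replaced by an iterable of file sizes; the function names are ours, not Spark's):

```python
def exceeds_threshold(sizes, threshold):
    """Return True as soon as the running total passes the threshold.

    Stops consuming `sizes` early instead of summing the full listing:
    accuracy beyond the threshold is irrelevant to the broadcast-join
    decision, which only needs to know whether the total is under it.
    """
    total = 0
    for size in sizes:
        total += size
        if total > threshold:
            return True
    return False

def entries_scanned(sizes, threshold):
    """Count how many entries are inspected before the scan stops."""
    total = scanned = 0
    for size in sizes:
        scanned += 1
        total += size
        if total > threshold:
            break
    return scanned

print(exceeds_threshold([10, 20, 30], 100))     # False: full scan was needed
print(exceeds_threshold([60, 60, 10**9], 100))  # True after two entries
print(entries_scanned([60, 60, 10**9], 100))    # 2: the huge file is never inspected
```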
[jira] [Resolved] (SPARK-17205) Literal.sql does not properly convert NaN and Infinity literals
[ https://issues.apache.org/jira/browse/SPARK-17205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell resolved SPARK-17205. --- Resolution: Fixed Fix Version/s: 2.1.0 2.0.1 > Literal.sql does not properly convert NaN and Infinity literals > --- > > Key: SPARK-17205 > URL: https://issues.apache.org/jira/browse/SPARK-17205 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Minor > Fix For: 2.0.1, 2.1.0 > > > {{Literal.sql}} mishandles NaN and Infinity literals: the handling of these > needs to be special-cased instead of simply appending a suffix to the string > representation of the value
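The bug pattern in SPARK-17205: rendering a double literal as `str(value)` plus a suffix produces `NaND` / `InfinityD` tokens that SQL parsers reject, so the special values need explicit handling. A hedged Python sketch of the special-casing (not Spark's actual `Literal.sql` code; the `CAST('NaN' AS DOUBLE)` spelling is one way a parser-friendly form can be written):

```python
import math

def double_literal_sql(value: float) -> str:
    """Render a float as a SQL double literal, special-casing NaN/Infinity.

    Naively appending a 'D' suffix to str(value) would yield 'nanD' or
    'infD', which no SQL parser accepts.
    """
    if math.isnan(value):
        return "CAST('NaN' AS DOUBLE)"
    if math.isinf(value):
        sign = "-" if value < 0 else ""
        return f"CAST('{sign}Infinity' AS DOUBLE)"
    return f"{value!r}D"

print(double_literal_sql(1.5))            # 1.5D
print(double_literal_sql(float("nan")))   # CAST('NaN' AS DOUBLE)
print(double_literal_sql(float("-inf")))  # CAST('-Infinity' AS DOUBLE)
```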
[jira] [Created] (SPARK-17246) Support BigDecimal literal parsing
Herman van Hovell created SPARK-17246: - Summary: Support BigDecimal literal parsing Key: SPARK-17246 URL: https://issues.apache.org/jira/browse/SPARK-17246 Project: Spark Issue Type: Improvement Components: SQL Reporter: Herman van Hovell Assignee: Herman van Hovell Priority: Minor
[jira] [Commented] (SPARK-17245) NPE thrown by ClientWrapper.conf
[ https://issues.apache.org/jira/browse/SPARK-17245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15437809#comment-15437809 ] Apache Spark commented on SPARK-17245: -- User 'yhuai' has created a pull request for this issue: https://github.com/apache/spark/pull/14816 > NPE thrown by ClientWrapper.conf > > > Key: SPARK-17245 > URL: https://issues.apache.org/jira/browse/SPARK-17245 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.2 >Reporter: Yin Huai > > This issue has been fixed in Spark 2.0. Seems ClientWrapper.conf is trying to > access the ThreadLocal SessionState, which has not been set. > {code} > java.lang.NullPointerException > at > org.apache.spark.sql.hive.client.ClientWrapper.conf(ClientWrapper.scala:225) > at > org.apache.spark.sql.hive.client.ClientWrapper.client(ClientWrapper.scala:279) > > at > org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$withHiveState$1.apply(ClientWrapper.scala:291) > > at > org.apache.spark.sql.hive.client.ClientWrapper.liftedTree1$1(ClientWrapper.scala:246) > > at > org.apache.spark.sql.hive.client.ClientWrapper.retryLocked(ClientWrapper.scala:245) > > at > org.apache.spark.sql.hive.client.ClientWrapper.withHiveState(ClientWrapper.scala:288) > > at > org.apache.spark.sql.hive.client.ClientWrapper.runHive(ClientWrapper.scala:493) > > at > org.apache.spark.sql.hive.client.ClientWrapper.runSqlHive(ClientWrapper.scala:483) > > at > org.apache.spark.sql.hive.client.ClientWrapper.addJar(ClientWrapper.scala:603) > > at org.apache.spark.sql.hive.HiveContext.addJar(HiveContext.scala:654) > at org.apache.spark.sql.hive.execution.AddJar.run(commands.scala:105) > at > org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58) > > at > org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56) > > at > org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70) > at > 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132) > > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130) > > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) > > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130) > at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55) > > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55) > at org.apache.spark.sql.DataFrame.(DataFrame.scala:145) > at org.apache.spark.sql.DataFrame.(DataFrame.scala:130) > at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52) > at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:816) > {code}
[jira] [Assigned] (SPARK-17245) NPE thrown by ClientWrapper.conf
[ https://issues.apache.org/jira/browse/SPARK-17245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17245: Assignee: (was: Apache Spark) > NPE thrown by ClientWrapper.conf > > > Key: SPARK-17245 > URL: https://issues.apache.org/jira/browse/SPARK-17245 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.2 >Reporter: Yin Huai > > This issue has been fixed in Spark 2.0. Seems ClientWrapper.conf is trying to > access the ThreadLocal SessionState, which has not been set. > {code} > java.lang.NullPointerException > at > org.apache.spark.sql.hive.client.ClientWrapper.conf(ClientWrapper.scala:225) > at > org.apache.spark.sql.hive.client.ClientWrapper.client(ClientWrapper.scala:279) > > at > org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$withHiveState$1.apply(ClientWrapper.scala:291) > > at > org.apache.spark.sql.hive.client.ClientWrapper.liftedTree1$1(ClientWrapper.scala:246) > > at > org.apache.spark.sql.hive.client.ClientWrapper.retryLocked(ClientWrapper.scala:245) > > at > org.apache.spark.sql.hive.client.ClientWrapper.withHiveState(ClientWrapper.scala:288) > > at > org.apache.spark.sql.hive.client.ClientWrapper.runHive(ClientWrapper.scala:493) > > at > org.apache.spark.sql.hive.client.ClientWrapper.runSqlHive(ClientWrapper.scala:483) > > at > org.apache.spark.sql.hive.client.ClientWrapper.addJar(ClientWrapper.scala:603) > > at org.apache.spark.sql.hive.HiveContext.addJar(HiveContext.scala:654) > at org.apache.spark.sql.hive.execution.AddJar.run(commands.scala:105) > at > org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58) > > at > org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56) > > at > org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132) > > at > 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130) > > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) > > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130) > at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55) > > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55) > at org.apache.spark.sql.DataFrame.(DataFrame.scala:145) > at org.apache.spark.sql.DataFrame.(DataFrame.scala:130) > at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52) > at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:816) > {code}
[jira] [Assigned] (SPARK-17245) NPE thrown by ClientWrapper.conf
[ https://issues.apache.org/jira/browse/SPARK-17245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17245: Assignee: Apache Spark > NPE thrown by ClientWrapper.conf > > > Key: SPARK-17245 > URL: https://issues.apache.org/jira/browse/SPARK-17245 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.2 >Reporter: Yin Huai >Assignee: Apache Spark > > This issue has been fixed in Spark 2.0. Seems ClientWrapper.conf is trying to > access the ThreadLocal SessionState, which has not been set. > {code} > java.lang.NullPointerException > at > org.apache.spark.sql.hive.client.ClientWrapper.conf(ClientWrapper.scala:225) > at > org.apache.spark.sql.hive.client.ClientWrapper.client(ClientWrapper.scala:279) > > at > org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$withHiveState$1.apply(ClientWrapper.scala:291) > > at > org.apache.spark.sql.hive.client.ClientWrapper.liftedTree1$1(ClientWrapper.scala:246) > > at > org.apache.spark.sql.hive.client.ClientWrapper.retryLocked(ClientWrapper.scala:245) > > at > org.apache.spark.sql.hive.client.ClientWrapper.withHiveState(ClientWrapper.scala:288) > > at > org.apache.spark.sql.hive.client.ClientWrapper.runHive(ClientWrapper.scala:493) > > at > org.apache.spark.sql.hive.client.ClientWrapper.runSqlHive(ClientWrapper.scala:483) > > at > org.apache.spark.sql.hive.client.ClientWrapper.addJar(ClientWrapper.scala:603) > > at org.apache.spark.sql.hive.HiveContext.addJar(HiveContext.scala:654) > at org.apache.spark.sql.hive.execution.AddJar.run(commands.scala:105) > at > org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58) > > at > org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56) > > at > org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132) > > at > 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130) > > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) > > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130) > at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55) > > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55) > at org.apache.spark.sql.DataFrame.(DataFrame.scala:145) > at org.apache.spark.sql.DataFrame.(DataFrame.scala:130) > at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52) > at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:816) > {code}
[jira] [Updated] (SPARK-16700) StructType doesn't accept Python dicts anymore
[ https://issues.apache.org/jira/browse/SPARK-16700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-16700: - Fix Version/s: 2.0.1 > StructType doesn't accept Python dicts anymore > -- > > Key: SPARK-16700 > URL: https://issues.apache.org/jira/browse/SPARK-16700 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.0 >Reporter: Sylvain Zimmer >Assignee: Davies Liu > Labels: releasenotes > Fix For: 2.0.1, 2.1.0 > > > Hello, > I found this issue while testing my codebase with 2.0.0-rc5 > StructType in Spark 1.6.2 accepts the Python dict type, which is very > handy. 2.0.0-rc5 does not and throws an error. > I don't know if this was intended but I'd advocate for this behaviour to > remain the same. MapType is probably wasteful when your key names never > change and switching to Python tuples would be cumbersome. > Here is a minimal script to reproduce the issue: > {code} > from pyspark import SparkContext > from pyspark.sql import types as SparkTypes > from pyspark.sql import SQLContext > sc = SparkContext() > sqlc = SQLContext(sc) > struct_schema = SparkTypes.StructType([ > SparkTypes.StructField("id", SparkTypes.LongType()) > ]) > rdd = sc.parallelize([{"id": 0}, {"id": 1}]) > df = sqlc.createDataFrame(rdd, struct_schema) > print df.collect() > # 1.6.2 prints [Row(id=0), Row(id=1)] > # 2.0.0-rc5 raises TypeError: StructType can not accept object {'id': 0} in > type > {code} > Thanks!
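One workaround, until dict support is restored, is to convert each dict to a tuple ordered by the schema's field names before calling `createDataFrame`. A Spark-free sketch of that conversion (a plain list of field names stands in for the StructType; the helper name is ours, not PySpark's):

```python
def dicts_to_rows(records, field_names):
    """Convert dicts to schema-ordered tuples.

    StructType-based verification expects positional values rather than
    mappings; missing keys become None, matching nullable columns.
    """
    return [tuple(rec.get(name) for name in field_names) for rec in records]

schema_fields = ["id", "name"]
rows = dicts_to_rows([{"id": 0, "name": "a"}, {"id": 1}], schema_fields)
print(rows)  # [(0, 'a'), (1, None)]
```

The resulting tuples can then be passed to `createDataFrame` with the original schema.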
[jira] [Commented] (SPARK-17243) Spark 2.0 history server summary page gets stuck at "loading history summary" with 10K+ application history
[ https://issues.apache.org/jira/browse/SPARK-17243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15437764#comment-15437764 ] Alex Bozarth commented on SPARK-17243: -- Thanks, that'll help when I look into it. > Spark 2.0 history server summary page gets stuck at "loading history summary" > with 10K+ application history > --- > > Key: SPARK-17243 > URL: https://issues.apache.org/jira/browse/SPARK-17243 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.0.0 > Environment: Linux >Reporter: Gang Wu > > The summary page of the Spark 2.0 history server web UI keeps displaying "Loading > history summary..." all the time and crashes the browser when there are more > than 10K application history event logs on HDFS. > I did some investigation: the "historypage.js" file sends a REST request to > the /api/v1/applications endpoint of the history server and gets back > a JSON response. When there are more than 10K applications inside the event log > directory it takes forever to parse them and render the page. When there are > only hundreds or thousands of application histories it runs fine.
[jira] [Commented] (SPARK-17243) Spark 2.0 history server summary page gets stuck at "loading history summary" with 10K+ application history
[ https://issues.apache.org/jira/browse/SPARK-17243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15437768#comment-15437768 ] Alex Bozarth commented on SPARK-17243: -- Sorry, I misunderstood your problem; I will keep this in mind once I start my work. > Spark 2.0 history server summary page gets stuck at "loading history summary" > with 10K+ application history > --- > > Key: SPARK-17243 > URL: https://issues.apache.org/jira/browse/SPARK-17243 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.0.0 > Environment: Linux >Reporter: Gang Wu > > The summary page of the Spark 2.0 history server web UI keeps displaying "Loading > history summary..." all the time and crashes the browser when there are more > than 10K application history event logs on HDFS. > I did some investigation: the "historypage.js" file sends a REST request to > the /api/v1/applications endpoint of the history server and gets back > a JSON response. When there are more than 10K applications inside the event log > directory it takes forever to parse them and render the page. When there are > only hundreds or thousands of application histories it runs fine.
[jira] [Commented] (SPARK-11085) Add support for HTTP proxy
[ https://issues.apache.org/jira/browse/SPARK-11085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15437727#comment-15437727 ] SURESH CHAGANTI commented on SPARK-11085: - Hi all, I have made the code changes to accept the HTTP proxy as a run-time argument and use it for outbound calls; below is the pull request: https://github.com/SureshChaganti/spark-ec2/commit/cfd4bf727bdf46b9456f8f4d89221d1377d9c221 > Add support for HTTP proxy > --- > > Key: SPARK-11085 > URL: https://issues.apache.org/jira/browse/SPARK-11085 > Project: Spark > Issue Type: Improvement > Components: Spark Shell, Spark Submit >Reporter: Dustin Cote >Priority: Minor > > Add a way to update ivysettings.xml for the spark-shell and spark-submit to > support proxy settings for clusters that need to access a remote repository > through an http proxy. Typically this would be done like: > JAVA_OPTS="$JAVA_OPTS -Dhttp.proxyHost=proxy.host -Dhttp.proxyPort=8080 > -Dhttps.proxyHost=proxy.host.secure -Dhttps.proxyPort=8080" > Directly in the ivysettings.xml would look like: > > proxyport="8080" > nonproxyhosts="nonproxy.host"/> > > Even better would be a way to customize the ivysettings.xml with command > options.
[jira] [Commented] (SPARK-17241) SparkR spark.glm should have configurable regularization parameter
[ https://issues.apache.org/jira/browse/SPARK-17241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15437717#comment-15437717 ] Junyang Qian commented on SPARK-17241: -- I'll take a closer look and see if we can add it easily. > SparkR spark.glm should have configurable regularization parameter > -- > > Key: SPARK-17241 > URL: https://issues.apache.org/jira/browse/SPARK-17241 > Project: Spark > Issue Type: Improvement >Reporter: Junyang Qian > > Spark has a configurable L2 regularization parameter for generalized linear > regression. It is important to expose it in SparkR so that users can run > ridge regression.
[jira] [Updated] (SPARK-17245) NPE thrown by ClientWrapper.conf
[ https://issues.apache.org/jira/browse/SPARK-17245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-17245: - Summary: NPE thrown by ClientWrapper.conf (was: NPE thrown by ClientWrapper ) > NPE thrown by ClientWrapper.conf > > > Key: SPARK-17245 > URL: https://issues.apache.org/jira/browse/SPARK-17245 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.2 >Reporter: Yin Huai > > This issue has been fixed in Spark 2.0. Seems ClientWrapper.conf is trying to > access the ThreadLocal SessionState, which has not been set. > {code} > java.lang.NullPointerException > at > org.apache.spark.sql.hive.client.ClientWrapper.conf(ClientWrapper.scala:225) > at > org.apache.spark.sql.hive.client.ClientWrapper.client(ClientWrapper.scala:279) > > at > org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$withHiveState$1.apply(ClientWrapper.scala:291) > > at > org.apache.spark.sql.hive.client.ClientWrapper.liftedTree1$1(ClientWrapper.scala:246) > > at > org.apache.spark.sql.hive.client.ClientWrapper.retryLocked(ClientWrapper.scala:245) > > at > org.apache.spark.sql.hive.client.ClientWrapper.withHiveState(ClientWrapper.scala:288) > > at > org.apache.spark.sql.hive.client.ClientWrapper.runHive(ClientWrapper.scala:493) > > at > org.apache.spark.sql.hive.client.ClientWrapper.runSqlHive(ClientWrapper.scala:483) > > at > org.apache.spark.sql.hive.client.ClientWrapper.addJar(ClientWrapper.scala:603) > > at org.apache.spark.sql.hive.HiveContext.addJar(HiveContext.scala:654) > at org.apache.spark.sql.hive.execution.AddJar.run(commands.scala:105) > at > org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58) > > at > org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56) > > at > org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132) > > at > 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130) > > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) > > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130) > at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55) > > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55) > at org.apache.spark.sql.DataFrame.(DataFrame.scala:145) > at org.apache.spark.sql.DataFrame.(DataFrame.scala:130) > at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52) > at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:816) > {code}
[jira] [Created] (SPARK-17245) NPE thrown by ClientWrapper
Yin Huai created SPARK-17245: Summary: NPE thrown by ClientWrapper Key: SPARK-17245 URL: https://issues.apache.org/jira/browse/SPARK-17245 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.6.2 Reporter: Yin Huai This issue has been fixed in Spark 2.0. It seems ClientWrapper.conf is trying to access the ThreadLocal SessionState, which has not been set. (The stack trace is the same one quoted in the update above.)
[jira] [Resolved] (SPARK-17229) Postgres JDBC dialect should not widen float and short types during reads
[ https://issues.apache.org/jira/browse/SPARK-17229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell resolved SPARK-17229. --- Resolution: Fixed Fix Version/s: 2.1.0 > Postgres JDBC dialect should not widen float and short types during reads > - > > Key: SPARK-17229 > URL: https://issues.apache.org/jira/browse/SPARK-17229 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Minor > Fix For: 2.1.0 > > > When reading {{float4}} and {{smallint}} columns from PostgreSQL, Spark's > Postgres dialect widens these types to Decimal and Integer rather than using > the narrower Float and Short types.
[jira] [Commented] (SPARK-17241) SparkR spark.glm should have configurable regularization parameter
[ https://issues.apache.org/jira/browse/SPARK-17241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15437692#comment-15437692 ] Junyang Qian commented on SPARK-17241: -- [~shivaram] It seems that Spark has it for linear regression but not for glm. > SparkR spark.glm should have configurable regularization parameter > -- > > Key: SPARK-17241 > URL: https://issues.apache.org/jira/browse/SPARK-17241 > Project: Spark > Issue Type: Improvement >Reporter: Junyang Qian > > Spark has a configurable L2 regularization parameter for generalized linear > regression. It is very important to have it in SparkR so that users can run > ridge regression.
[jira] [Commented] (SPARK-17243) Spark 2.0 history server summary page gets stuck at "loading history summary" with 10K+ application history
[ https://issues.apache.org/jira/browse/SPARK-17243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15437691#comment-15437691 ] Gang Wu commented on SPARK-17243: - This doesn't work. It is for the cache of web UIs, not for the application metadata. The default value is 50, which is small enough. > Spark 2.0 history server summary page gets stuck at "loading history summary" > with 10K+ application history > --- > > Key: SPARK-17243 > URL: https://issues.apache.org/jira/browse/SPARK-17243 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.0.0 > Environment: Linux >Reporter: Gang Wu > > The summary page of the Spark 2.0 history server web UI keeps displaying "Loading > history summary..." all the time and crashes the browser when there are more > than 10K application history event logs on HDFS. > I did some investigation: "historypage.js" sends a REST request to the > /api/v1/applications endpoint of the history server and gets back a JSON > response. When there are more than 10K applications inside the event log > directory it takes forever to parse them and render the page. When there are > only hundreds or thousands of application histories it runs fine.
[jira] [Commented] (SPARK-17243) Spark 2.0 history server summary page gets stuck at "loading history summary" with 10K+ application history
[ https://issues.apache.org/jira/browse/SPARK-17243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15437688#comment-15437688 ] Gang Wu commented on SPARK-17243: - Hi Alex, I think in Spark 1.5 the history server obtains all application summary metadata directly from the FsHistoryProvider class; you can check HistoryPage.scala. In Spark 2.0 it instead deals with a JSON string (in historypage.js), which is MUCH slower than before. Would it make sense to use the old way? > Key: SPARK-17243 > URL: https://issues.apache.org/jira/browse/SPARK-17243
[jira] [Comment Edited] (SPARK-17243) Spark 2.0 history server summary page gets stuck at "loading history summary" with 10K+ application history
[ https://issues.apache.org/jira/browse/SPARK-17243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15437682#comment-15437682 ] Alex Bozarth edited comment on SPARK-17243 at 8/25/16 9:13 PM: --- [~wgtmac] until this is fixed you can limit the number of applications available by setting {{spark.history.retainedApplications}} It limits the apps the history server loads was (Author: ajbozarth): [~wgtmac] until this is fixed you can limit the number of applications available by setting {spark.history.retainedApplications} It limits the apps the history server loads > Key: SPARK-17243 > URL: https://issues.apache.org/jira/browse/SPARK-17243
[jira] [Comment Edited] (SPARK-17243) Spark 2.0 history server summary page gets stuck at "loading history summary" with 10K+ application history
[ https://issues.apache.org/jira/browse/SPARK-17243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15437682#comment-15437682 ] Alex Bozarth edited comment on SPARK-17243 at 8/25/16 9:13 PM: --- [~wgtmac] until this is fixed you can limit the number of applications available by setting {spark.history.retainedApplications} It limits the apps the history server loads was (Author: ajbozarth): [~wgtmac] until this is fixed you can limit the number of applications available by setting `spark.history.retainedApplications` It limits the apps the history server loads > Key: SPARK-17243 > URL: https://issues.apache.org/jira/browse/SPARK-17243
[jira] [Commented] (SPARK-17243) Spark 2.0 history server summary page gets stuck at "loading history summary" with 10K+ application history
[ https://issues.apache.org/jira/browse/SPARK-17243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15437682#comment-15437682 ] Alex Bozarth commented on SPARK-17243: -- [~wgtmac] until this is fixed you can limit the number of applications available by setting `spark.history.retainedApplications` It limits the apps the history server loads > Key: SPARK-17243 > URL: https://issues.apache.org/jira/browse/SPARK-17243
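For reference, the suggested workaround is a history-server-side setting, e.g. in spark-defaults.conf (the value below is illustrative). Note the caveat raised elsewhere in this thread: the property caps the cache of reconstructed application UIs, so it may not shrink the summary listing itself.

```properties
# spark-defaults.conf on the history server (illustrative value).
# Caps how many application UIs the history server keeps cached in memory.
spark.history.retainedApplications   50
```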
[jira] [Commented] (SPARK-17147) Spark Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets
[ https://issues.apache.org/jira/browse/SPARK-17147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15437672#comment-15437672 ] Sean McKibben commented on SPARK-17147: --- I tried Robert's changes, but the performance for any sizable number of reads is really bad. At least the way I understand it, whenever there is a discontiguous offset, it forces Kafka to do a seek, which is extremely slow. > Spark Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets > > > Key: SPARK-17147 > URL: https://issues.apache.org/jira/browse/SPARK-17147 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 2.0.0 >Reporter: Robert Conrad > > When Kafka does log compaction, offsets often end up with gaps, meaning the > next requested offset will frequently not be offset+1. The logic in > KafkaRDD & CachedKafkaConsumer has a baked-in assumption that the next offset > will always be just an increment of 1 above the previous offset. > I have worked around this problem by changing CachedKafkaConsumer to use the > returned record's offset, from: > {{nextOffset = offset + 1}} > to: > {{nextOffset = record.offset + 1}} > and changed KafkaRDD from: > {{requestOffset += 1}} > to: > {{requestOffset = r.offset() + 1}} > (I also had to change some assert logic in CachedKafkaConsumer). > There's a strong possibility that I have misconstrued how to use the > streaming kafka consumer, and I'm happy to close this out if that's the case. > If, however, it is supposed to support non-consecutive offsets (e.g. due to > log compaction) I am also happy to contribute a PR.
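A language-neutral sketch of the workaround's core idea (plain Python, not Spark's actual CachedKafkaConsumer code): after compaction the log has offset gaps, so the next fetch position must be derived from the record the broker actually returned, not from `previous + 1`.

```python
# Plain-Python sketch of the fix described in the ticket (illustrative only).
# A compacted partition can have offsets like 0, 1, 5, 6, 9.

def consume(records):
    """Yield (offset, value, next_offset) while tracking the next fetch position."""
    for offset, value in records:
        # Buggy assumption from the ticket: next offset == previous offset + 1,
        # which breaks at every compaction gap.
        # Fix: derive the position from the returned record, i.e. record.offset + 1.
        next_offset = offset + 1
        yield offset, value, next_offset

log = [(0, "a"), (1, "b"), (5, "c"), (6, "d"), (9, "e")]  # gaps from compaction
positions = [nxt for (_, _, nxt) in consume(log)]
print(positions)  # [1, 2, 6, 7, 10]
```

The same reasoning applies to `requestOffset` in KafkaRDD: incrementing blindly would request offset 2 after record 1 and never line up with the record at offset 5.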
[jira] [Comment Edited] (SPARK-17243) Spark 2.0 history server summary page gets stuck at "loading history summary" with 10K+ application history
[ https://issues.apache.org/jira/browse/SPARK-17243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15437668#comment-15437668 ] Alex Bozarth edited comment on SPARK-17243 at 8/25/16 9:05 PM: --- I'm not sure I agree that this should be a blocker, but I was actually planning on filing a JIRA and starting work on a PR next month (September) that will switch the history server to load application data only when an application UI is opened, loading only application metadata on the initial load of the history server. This is just one of many problems that would be fixed by such a change. I won't have the bandwidth to start working on it for another week or two though. tl;dr I plan to fix this but not until next month was (Author: ajbozarth): I'm not sure I agree that this should be a blocker, but I was actually planning on filing a JIRA and starting work on a pr next month (September) that will switch the history server to only load application data when an application ui is opened and only loading application metadata on the initial load of the history server. This is just one of many problems that would be fixed by such a change. I won't have the bandwidth to start working on it for another week or two though. > Key: SPARK-17243 > URL: https://issues.apache.org/jira/browse/SPARK-17243
[jira] [Updated] (SPARK-17243) Spark 2.0 history server summary page gets stuck at "loading history summary" with 10K+ application history
[ https://issues.apache.org/jira/browse/SPARK-17243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-17243: -- Priority: Major (was: Blocker) Yes, should not be assigned as a Blocker. > Key: SPARK-17243 > URL: https://issues.apache.org/jira/browse/SPARK-17243
[jira] [Commented] (SPARK-17243) Spark 2.0 history server summary page gets stuck at "loading history summary" with 10K+ application history
[ https://issues.apache.org/jira/browse/SPARK-17243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15437668#comment-15437668 ] Alex Bozarth commented on SPARK-17243: -- I'm not sure I agree that this should be a blocker, but I was actually planning on filing a JIRA and starting work on a pr next month (September) that will switch the history server to only load application data when an application ui is opened and only loading application metadata on the initial load of the history server. This is just one of many problems that would be fixed by such a change. I won't have the bandwidth to start working on it for another week or two though. > Key: SPARK-17243 > URL: https://issues.apache.org/jira/browse/SPARK-17243
[jira] [Commented] (SPARK-17243) Spark 2.0 history server summary page gets stuck at "loading history summary" with 10K+ application history
[ https://issues.apache.org/jira/browse/SPARK-17243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15437653#comment-15437653 ] Sean Owen commented on SPARK-17243: --- Related, but not identical: https://issues.apache.org/jira/browse/SPARK-15083 > Key: SPARK-17243 > URL: https://issues.apache.org/jira/browse/SPARK-17243
[jira] [Assigned] (SPARK-17244) Joins should not pushdown non-deterministic conditions
[ https://issues.apache.org/jira/browse/SPARK-17244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17244: Assignee: Apache Spark > Joins should not pushdown non-deterministic conditions > -- > > Key: SPARK-17244 > URL: https://issues.apache.org/jira/browse/SPARK-17244 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Sameer Agarwal >Assignee: Apache Spark >
[jira] [Commented] (SPARK-17244) Joins should not pushdown non-deterministic conditions
[ https://issues.apache.org/jira/browse/SPARK-17244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15437649#comment-15437649 ] Apache Spark commented on SPARK-17244: -- User 'sameeragarwal' has created a pull request for this issue: https://github.com/apache/spark/pull/14815 > Joins should not pushdown non-deterministic conditions > -- > > Key: SPARK-17244 > URL: https://issues.apache.org/jira/browse/SPARK-17244 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Sameer Agarwal >
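A small self-contained illustration of why this matters (plain Python standing in for the optimizer's behavior, not Spark internals): pushing a condition below a join changes how many times it is evaluated, so a non-deterministic expression like rand() would see a different stream of evaluations and filter different rows.

```python
# Toy demonstration: with duplicate join keys, a filter evaluated after the
# join runs once per *output* row, while a pushed-down filter runs once per
# *input* row -- different evaluation counts, hence different semantics for
# non-deterministic predicates. Illustrative only.

def join(left, right):
    return [(l, r) for l in left for k, r in right if l == k]

def run(pushdown):
    calls = []                      # records each predicate evaluation
    def pred(x):                    # stand-in for a non-deterministic condition
        calls.append(x)
        return True
    left, right = [1, 2], [(1, "x"), (1, "y"), (2, "z")]
    if pushdown:                    # evaluated once per input row
        join([l for l in left if pred(l)], right)
    else:                           # evaluated once per joined row
        [(l, r) for l, r in join(left, right) if pred(l)]
    return len(calls)

print(run(pushdown=False), run(pushdown=True))  # 3 2
```

With a deterministic predicate the two plans agree on the result, which is exactly why pushdown is normally safe; only the non-deterministic case breaks.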
[jira] [Assigned] (SPARK-17244) Joins should not pushdown non-deterministic conditions
[ https://issues.apache.org/jira/browse/SPARK-17244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17244: Assignee: (was: Apache Spark) > Joins should not pushdown non-deterministic conditions > -- > > Key: SPARK-17244 > URL: https://issues.apache.org/jira/browse/SPARK-17244 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Sameer Agarwal >
[jira] [Commented] (SPARK-17163) Merge MLOR into a single LOR interface
[ https://issues.apache.org/jira/browse/SPARK-17163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15437642#comment-15437642 ] DB Tsai commented on SPARK-17163: - Maybe we can store them in the same format and, for scoring, convert into the pivoted version for BLOR. That way, at least the storage will be unified. > Merge MLOR into a single LOR interface > -- > > Key: SPARK-17163 > URL: https://issues.apache.org/jira/browse/SPARK-17163 > Project: Spark > Issue Type: Sub-task > Components: ML, MLlib >Reporter: Seth Hendrickson > > Before the 2.1 release, we should finalize the API for logistic regression. > After SPARK-7159, we have both LogisticRegression and > MultinomialLogisticRegression models. This may be confusing to users and is > a bit superfluous, since MLOR can do basically all of what BLOR does. We > should decide if it needs to be changed and implement those changes before 2.1. > *Update*: Seems we have decided to merge the two estimators. I changed the > title to reflect that.
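For two classes, the multinomial (MLOR) coefficient matrix can be "pivoted" into a single binary (BLOR) coefficient vector by taking the difference of the two class rows; softmax over two margins is exactly a sigmoid over their difference. A quick sanity check in plain Python (illustrative of the conversion described above, not Spark's implementation; all numbers are made up):

```python
import math

# MLOR stores one coefficient row (plus intercept) per class; with K = 2,
# the equivalent BLOR coefficients are the row difference (class 1 - class 0).
coef = [[0.5, -1.0], [1.5, 2.0]]     # rows: class 0, class 1 (made-up values)
intercept = [0.1, 0.7]

pivoted_coef = [c1 - c0 for c0, c1 in zip(*coef)]   # [1.0, 3.0]
pivoted_intercept = intercept[1] - intercept[0]     # 0.6

x = [0.3, -0.2]                      # an arbitrary feature vector

# MLOR scoring: softmax probability of class 1
margins = [sum(w * xi for w, xi in zip(row, x)) + b
           for row, b in zip(coef, intercept)]
p_softmax = math.exp(margins[1]) / (math.exp(margins[0]) + math.exp(margins[1]))

# BLOR scoring with the pivoted coefficients: plain sigmoid
m = sum(w * xi for w, xi in zip(pivoted_coef, x)) + pivoted_intercept
p_sigmoid = 1.0 / (1.0 + math.exp(-m))

print(abs(p_softmax - p_sigmoid) < 1e-12)  # True: the two scorings agree
```

This is why unified (multinomial-shaped) storage costs nothing for the binary case: the pivoted form can always be recovered at scoring time.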
[jira] [Created] (SPARK-17244) Joins should not pushdown non-deterministic conditions
Sameer Agarwal created SPARK-17244: -- Summary: Joins should not pushdown non-deterministic conditions Key: SPARK-17244 URL: https://issues.apache.org/jira/browse/SPARK-17244 Project: Spark Issue Type: Bug Components: SQL Reporter: Sameer Agarwal
[jira] [Updated] (SPARK-17243) Spark 2.0 history server summary page gets stuck at "loading history summary" with 10K+ application history
[ https://issues.apache.org/jira/browse/SPARK-17243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gang Wu updated SPARK-17243: Summary: Spark 2.0 history server summary page gets stuck at "loading history summary" with 10K+ application history (was: Spark history server summary page gets stuck at "loading history summary" with 10K+ application history) > Key: SPARK-17243 > URL: https://issues.apache.org/jira/browse/SPARK-17243
[jira] [Created] (SPARK-17243) Spark history server summary page gets stuck at "loading history summary" with 10K+ application history
Gang Wu created SPARK-17243: --- Summary: Spark history server summary page gets stuck at "loading history summary" with 10K+ application history Key: SPARK-17243 URL: https://issues.apache.org/jira/browse/SPARK-17243 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 2.0.0 Environment: Linux Reporter: Gang Wu Priority: Blocker The summary page of the Spark history server web UI keeps displaying "Loading history summary..." all the time and crashes the browser when there are more than 10K application history event logs on HDFS. I did some investigation: "historypage.js" sends a REST request to the /api/v1/applications endpoint of the history server and gets back a JSON response. When there are more than 10K applications inside the event log directory it takes forever to parse them and render the page. When there are only hundreds or thousands of application histories it runs fine.
[jira] [Updated] (SPARK-17243) Spark history server summary page gets stuck at "loading history summary" with 10K+ application history
[ https://issues.apache.org/jira/browse/SPARK-17243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gang Wu updated SPARK-17243: Description: The summary page of Spark 2.0 history server web UI keep displaying "Loading history summary..." all the time and crashes the browser when there are more than 10K application history event logs on HDFS. I did some investigation, "historypage.js" file sends a REST request to /api/v1/applications endpoint of history server REST endpoint and gets back json response. When there are more than 10K applications inside the event log directory it takes forever to parse them and render the page. When there are only hundreds or thousands of application history it is running fine. was: The summary page of Spark history server web UI keep displaying "Loading history summary..." all the time and crashes the browser when there are more than 10K application history event logs on HDFS. I did some investigation, "historypage.js" file sends a REST request to /api/v1/applications endpoint of history server REST endpoint and gets back json response. When there are more than 10K applications inside the event log directory it takes forever to parse them and render the page. When there are only hundreds or thousands of application history it is running fine. > Spark history server summary page gets stuck at "loading history summary" > with 10K+ application history > --- > > Key: SPARK-17243 > URL: https://issues.apache.org/jira/browse/SPARK-17243 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.0.0 > Environment: Linux >Reporter: Gang Wu >Priority: Blocker > > The summary page of Spark 2.0 history server web UI keep displaying "Loading > history summary..." all the time and crashes the browser when there are more > than 10K application history event logs on HDFS. 
> I did some investigation: the "historypage.js" file sends a REST request to > the /api/v1/applications endpoint of the history server and gets back a JSON > response. When there are more than 10K applications in the event log > directory, it takes forever to parse them and render the page. With only > hundreds or thousands of applications, it runs fine. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
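A natural mitigation for a summary page that chokes on 10K+ entries is to render the application list one page at a time instead of handing the whole JSON response to the table at once. The sketch below is hypothetical Python, not the actual historypage.js change; `paginate` and its parameters are invented for illustration:

```python
# Hypothetical sketch: render only one page of the /api/v1/applications
# response at a time instead of all 10K+ rows.

def paginate(apps, page, page_size=100):
    """Return the slice of `apps` for the 1-indexed `page`."""
    start = (page - 1) * page_size
    return apps[start:start + page_size]

# 10K application entries, but only 100 reach the renderer per page:
apps = [{"id": "app-%05d" % i} for i in range(10000)]
first_page = paginate(apps, 1)
assert len(first_page) == 100
assert first_page[0]["id"] == "app-00000"
```

The browser then only parses and lays out a bounded number of rows regardless of how many event logs sit on HDFS.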
[jira] [Commented] (SPARK-14241) Output of monotonically_increasing_id lacks stable relation with rows of DataFrame
[ https://issues.apache.org/jira/browse/SPARK-14241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15437616#comment-15437616 ] Nicholas Chammas commented on SPARK-14241: -- [~marmbrus] - Would it be tough to make this function deterministic, or somehow "stable"? The linked Stack Overflow question shows some pretty surprising behavior from an end-user perspective. If this would be tough to change, what are some alternatives you would recommend? Do you think, for example, it would be possible to make a window function that _is_ deterministic and does effectively the same thing? Maybe something like {{row_number()}}, except the {{WindowSpec}} would not need to specify any partitioning or ordering. (Required ordering would be the main downside of using {{row_number()}} instead of {{monotonically_increasing_id()}}.) > Output of monotonically_increasing_id lacks stable relation with rows of > DataFrame > -- > > Key: SPARK-14241 > URL: https://issues.apache.org/jira/browse/SPARK-14241 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core >Affects Versions: 1.6.0, 1.6.1 >Reporter: Paul Shearer > > If you use monotonically_increasing_id() to append a column of IDs to a > DataFrame, the IDs do not have a stable, deterministic relationship to the > rows they are appended to. A given ID value can land on different rows > depending on what happens in the task graph: > http://stackoverflow.com/questions/35705038/how-do-i-add-an-persistent-column-of-row-ids-to-spark-dataframe/35706321#35706321 > From a user perspective this behavior is very unexpected, and many things one > would normally like to do with an ID column are in fact only possible under > very narrow circumstances. The function should either be made deterministic, > or there should be a prominent warning note in the API docs regarding its > behavior. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
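For readers wondering why the IDs move between runs: per the function's API docs, the generated 64-bit ID packs the partition ID into the upper 31 bits and the per-partition record number into the lower 33 bits. A small model of that scheme (plain Python, not Spark code) shows how a different partitioning of the same rows yields different IDs:

```python
# Rough model of monotonically_increasing_id(): id = (partitionId << 33) + rowOffset.
# The same row gets a different ID if it lands in a different partition or
# at a different offset, which is the instability discussed above.

def assign_ids(partitions):
    """partitions: list of lists of rows -> list of (id, row) pairs."""
    out = []
    for pid, rows in enumerate(partitions):
        for offset, row in enumerate(rows):
            out.append(((pid << 33) + offset, row))
    return out

# Two partitionings of the same four rows:
layout_a = assign_ids([["w", "x"], ["y", "z"]])
layout_b = assign_ids([["w"], ["x", "y", "z"]])
# "y" is id (1 << 33) + 0 in the first layout but (1 << 33) + 1 in the second.
```

The IDs are unique and monotonically increasing within each run, but nothing ties a given ID to a given row across task-graph changes.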
[jira] [Commented] (SPARK-17163) Merge MLOR into a single LOR interface
[ https://issues.apache.org/jira/browse/SPARK-17163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15437606#comment-15437606 ] Seth Hendrickson commented on SPARK-17163: -- If we store the binomial case with 2 x numFeatures coefficients, I wonder how much it will affect prediction, since it doubles the number of operations for each one. From a code perspective, unifying the representation is much nicer, but we may see a regression in performance. Thoughts? > Merge MLOR into a single LOR interface > -- > > Key: SPARK-17163 > URL: https://issues.apache.org/jira/browse/SPARK-17163 > Project: Spark > Issue Type: Sub-task > Components: ML, MLlib >Reporter: Seth Hendrickson > > Before the 2.1 release, we should finalize the API for logistic regression. > After SPARK-7159, we have both LogisticRegression and > MultinomialLogisticRegression models. This may be confusing to users and is > a bit superfluous, since MLOR can do basically all of what BLOR does. We > should decide whether it needs to be changed and implement those changes before 2.1. > *Update*: Seems we have decided to merge the two estimators. I changed the > title to reflect that. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
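On the prediction-cost concern: for two classes, the softmax over the two margins equals the sigmoid of their difference, so an implementation could still score with a single dot product against the difference of the two coefficient rows even if it stores 2 x numFeatures. A numeric check of that identity (plain Python, not Spark code):

```python
# Identity: softmax([z0, z1])[1] == sigmoid(z1 - z0), so the 2-row storage
# need not double per-prediction work if scoring uses the difference row.

import math

def softmax(zs):
    m = max(zs)                      # subtract max for numerical stability
    exps = [math.exp(z - m) for z in zs]
    s = sum(exps)
    return [e / s for e in exps]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

z0, z1 = 0.3, 1.7                    # the two class margins
p_softmax = softmax([z0, z1])[1]
p_sigmoid = sigmoid(z1 - z0)
assert abs(p_softmax - p_sigmoid) < 1e-12
```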
[jira] [Commented] (SPARK-13525) SparkR: java.net.SocketTimeoutException: Accept timed out when running any dataframe function
[ https://issues.apache.org/jira/browse/SPARK-13525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15437605#comment-15437605 ] Arihanth Jain commented on SPARK-13525: --- [~shivaram] I have a similar trace; please see below: ERROR RBackendHandler: fitRModelFormula on org.apache.spark.ml.api.r.SparkRWrappers failed Error in invokeJava(isStatic = TRUE, className, methodName, ...) : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 4, test.jiffybox.net): java.net.SocketTimeoutException: Accept timed out at java.net.PlainSocketImpl.socketAccept(Native Method) at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:404) at java.net.ServerSocket.implAccept(ServerSocket.java:545) at java.net.ServerSocket.accept(ServerSocket.java:513) at org.apache.spark.api.r.BaseRRDD.compute(RRDD.scala:71) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadChec - the only difference compared with [~vmenda]'s trace is: "at org.apache.spark.api.r.RRDD$.createRWorker(RRDD.scala:426)" Does this indicate that R workers were actually started on the worker machines? > SparkR: java.net.SocketTimeoutException: Accept timed out when running any > dataframe function > - > > Key: SPARK-13525 > URL: https://issues.apache.org/jira/browse/SPARK-13525 > Project: Spark > Issue Type: Bug > Components: SparkR >Reporter: Shubhanshu Mishra > Labels: sparkr > > I am following the code steps from this example: > https://spark.apache.org/docs/1.6.0/sparkr.html > There are multiple issues: > 1. 
The head and summary and filter methods are not overridden by spark. Hence > I need to call them using `SparkR::` namespace. > 2. When I try to execute the following, I get errors: > {code} > $> $R_HOME/bin/R > R version 3.2.3 (2015-12-10) -- "Wooden Christmas-Tree" > Copyright (C) 2015 The R Foundation for Statistical Computing > Platform: x86_64-pc-linux-gnu (64-bit) > R is free software and comes with ABSOLUTELY NO WARRANTY. > You are welcome to redistribute it under certain conditions. > Type 'license()' or 'licence()' for distribution details. > Natural language support but running in an English locale > R is a collaborative project with many contributors. > Type 'contributors()' for more information and > 'citation()' on how to cite R or R packages in publications. > Type 'demo()' for some demos, 'help()' for on-line help, or > 'help.start()' for an HTML browser interface to help. > Type 'q()' to quit R. > Welcome at Fri Feb 26 16:19:35 2016 > Attaching package: ‘SparkR’ > The following objects are masked from ‘package:base’: > colnames, colnames<-, drop, intersect, rank, rbind, sample, subset, > summary, transform > Launching java with spark-submit command > /content/smishra8/SOFTWARE/spark/bin/spark-submit --driver-memory "50g" > sparkr-shell /tmp/RtmpfBQRg6/backend_portc3bc16f09b1b > > df <- createDataFrame(sqlContext, iris) > Warning messages: > 1: In FUN(X[[i]], ...) : > Use Sepal_Length instead of Sepal.Length as column name > 2: In FUN(X[[i]], ...) : > Use Sepal_Width instead of Sepal.Width as column name > 3: In FUN(X[[i]], ...) : > Use Petal_Length instead of Petal.Length as column name > 4: In FUN(X[[i]], ...) 
: > Use Petal_Width instead of Petal.Width as column name > > training <- filter(df, df$Species != "setosa") > Error in filter(df, df$Species != "setosa") : > no method for coercing this S4 class to a vector > > training <- SparkR::filter(df, df$Species != "setosa") > > model <- SparkR::glm(Species ~ Sepal_Length + Sepal_Width, data = training, > > family = "binomial") > 16/02/26 16:26:46 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1) > java.net.SocketTimeoutException: Accept timed out > at java.net.PlainSocketImpl.socketAccept(Native Method) > at > java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:398) > at java.net.ServerSocket.implAccept(ServerSocket.java:530) > at java.net.ServerSocket.accept(ServerSocket.java:498) > at org.apache.spark.api.r.RRDD$.createRWorker(RRDD.scala:431) > at org.apache.spark.api.r.BaseRRDD.compute(RRDD.scala:62) > at
[jira] [Comment Edited] (SPARK-16998) select($"column1", explode($"column2")) is extremely slow
[ https://issues.apache.org/jira/browse/SPARK-16998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15437598#comment-15437598 ] Herman van Hovell edited comment on SPARK-16998 at 8/25/16 8:27 PM: I still have a code generation PR lying around: https://github.com/apache/spark/pull/13065 That should fix a lot of the performance issues. I could bring it up to date, if there are any takers. was (Author: hvanhovell): I still have a code generation PR lying around: https://github.com/apache/spark/pull/13065 I could bring it up to date. > select($"column1", explode($"column2")) is extremely slow > - > > Key: SPARK-16998 > URL: https://issues.apache.org/jira/browse/SPARK-16998 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: TobiasP > > Using a Dataset containing 10.000 rows, each containing null and an array of > 5.000 Ints, I observe the following performance (in local mode): > {noformat} > scala> time(ds.select(explode($"value")).sample(false, 0.001, 1).collect) > 1.219052 seconds > > res9: Array[org.apache.spark.sql.Row] = Array([3761], [3766], [3196]) > scala> time(ds.select($"dummy", explode($"value")).sample(false, 0.001, > 1).collect) > 20.219447 seconds > > res5: Array[org.apache.spark.sql.Row] = Array([null,3761], [null,3766], > [null,3196]) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16998) select($"column1", explode($"column2")) is extremely slow
[ https://issues.apache.org/jira/browse/SPARK-16998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15437598#comment-15437598 ] Herman van Hovell commented on SPARK-16998: --- I still have a code generation PR lying around: https://github.com/apache/spark/pull/13065 I could bring it up to date. > select($"column1", explode($"column2")) is extremely slow > - > > Key: SPARK-16998 > URL: https://issues.apache.org/jira/browse/SPARK-16998 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: TobiasP > > Using a Dataset containing 10.000 rows, each containing null and an array of > 5.000 Ints, I observe the following performance (in local mode): > {noformat} > scala> time(ds.select(explode($"value")).sample(false, 0.001, 1).collect) > 1.219052 seconds > > res9: Array[org.apache.spark.sql.Row] = Array([3761], [3766], [3196]) > scala> time(ds.select($"dummy", explode($"value")).sample(false, 0.001, > 1).collect) > 20.219447 seconds > > res5: Array[org.apache.spark.sql.Row] = Array([null,3761], [null,3766], > [null,3196]) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
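The `time(...)` helper used in the benchmark above is not shown in the report. A minimal equivalent (sketched here in Python; the name `timeit` and its shape are invented, the original was presumably a Scala snippet) is just a wall-clock wrapper around a thunk:

```python
# Hypothetical stand-in for the benchmark's time(...) helper: run a thunk,
# print the elapsed wall-clock seconds, and return the thunk's result.

import time

def timeit(thunk):
    start = time.perf_counter()
    result = thunk()
    elapsed = time.perf_counter() - start
    print("%.6f seconds" % elapsed)
    return result

timeit(lambda: sum(range(1000)))   # prints the elapsed time
```

Measuring the full `select` + `sample` + `collect` pipeline this way is what surfaces the ~16x slowdown when the extra column is added.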
[jira] [Issue Comment Deleted] (SPARK-11085) Add support for HTTP proxy
[ https://issues.apache.org/jira/browse/SPARK-11085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SURESH CHAGANTI updated SPARK-11085: Comment: was deleted (was: The following Script accepts the "--proxy_host_port" argument from __future__ import division, print_function, with_statement import codecs import hashlib import itertools import logging import os import os.path import pipes import random import shutil import string from stat import S_IRUSR import subprocess import sys import tarfile import tempfile import textwrap import time import warnings from datetime import datetime from optparse import OptionParser from sys import stderr if sys.version < "3": from urllib2 import urlopen, Request, HTTPError else: from urllib.request import urlopen, Request from urllib.error import HTTPError raw_input = input xrange = range SPARK_EC2_VERSION = "1.6.2" SPARK_EC2_DIR = os.path.dirname(os.path.realpath(__file__)) VALID_SPARK_VERSIONS = set([ "0.7.3", "0.8.0", "0.8.1", "0.9.0", "0.9.1", "0.9.2", "1.0.0", "1.0.1", "1.0.2", "1.1.0", "1.1.1", "1.2.0", "1.2.1", "1.3.0", "1.3.1", "1.4.0", "1.4.1", "1.5.0", "1.5.1", "1.5.2", "1.6.0", "1.6.1", "1.6.2", ]) SPARK_TACHYON_MAP = { "1.0.0": "0.4.1", "1.0.1": "0.4.1", "1.0.2": "0.4.1", "1.1.0": "0.5.0", "1.1.1": "0.5.0", "1.2.0": "0.5.0", "1.2.1": "0.5.0", "1.3.0": "0.5.0", "1.3.1": "0.5.0", "1.4.0": "0.6.4", "1.4.1": "0.6.4", "1.5.0": "0.7.1", "1.5.1": "0.7.1", "1.5.2": "0.7.1", "1.6.0": "0.8.2", "1.6.1": "0.8.2", "1.6.2": "0.8.2", } DEFAULT_SPARK_VERSION = SPARK_EC2_VERSION DEFAULT_SPARK_GITHUB_REPO = "https://github.com/apache/spark; # Default location to get the spark-ec2 scripts (and ami-list) from DEFAULT_SPARK_EC2_GITHUB_REPO = "https://github.com/amplab/spark-ec2; DEFAULT_SPARK_EC2_BRANCH = "branch-1.6" def setup_external_libs(libs): """ Download external libraries from PyPI to SPARK_EC2_DIR/lib/ and prepend them to our PATH. 
""" PYPI_URL_PREFIX = "https://pypi.python.org/packages/source; SPARK_EC2_LIB_DIR = os.path.join(SPARK_EC2_DIR, "lib") if not os.path.exists(SPARK_EC2_LIB_DIR): print("Downloading external libraries that spark-ec2 needs from PyPI to {path}...".format( path=SPARK_EC2_LIB_DIR )) print("This should be a one-time operation.") os.mkdir(SPARK_EC2_LIB_DIR) for lib in libs: versioned_lib_name = "{n}-{v}".format(n=lib["name"], v=lib["version"]) lib_dir = os.path.join(SPARK_EC2_LIB_DIR, versioned_lib_name) if not os.path.isdir(lib_dir): tgz_file_path = os.path.join(SPARK_EC2_LIB_DIR, versioned_lib_name + ".tar.gz") print(" - Downloading {lib}...".format(lib=lib["name"])) download_stream = urlopen( "{prefix}/{first_letter}/{lib_name}/{lib_name}-{lib_version}.tar.gz".format( prefix=PYPI_URL_PREFIX, first_letter=lib["name"][:1], lib_name=lib["name"], lib_version=lib["version"] ) ) with open(tgz_file_path, "wb") as tgz_file: tgz_file.write(download_stream.read()) with open(tgz_file_path, "rb") as tar: if hashlib.md5(tar.read()).hexdigest() != lib["md5"]: print("ERROR: Got wrong md5sum for {lib}.".format(lib=lib["name"]), file=stderr) sys.exit(1) tar = tarfile.open(tgz_file_path) tar.extractall(path=SPARK_EC2_LIB_DIR) tar.close() os.remove(tgz_file_path) print(" - Finished downloading {lib}.".format(lib=lib["name"])) sys.path.insert(1, lib_dir) # Only PyPI libraries are supported. 
external_libs = [ { "name": "boto", "version": "2.34.0", "md5": "5556223d2d0cc4d06dd4829e671dcecd" } ] setup_external_libs(external_libs) import boto from boto.ec2.blockdevicemapping import BlockDeviceMapping, BlockDeviceType, EBSBlockDeviceType from boto import ec2 class UsageError(Exception): pass # Configure and parse our command-line arguments def parse_args(): parser = OptionParser( prog="spark-ec2", version="%prog {v}".format(v=SPARK_EC2_VERSION), usage="%prog [options] \n\n" + " can be: launch, destroy, login, stop, start, get-master, reboot-slaves") parser.add_option( "-s", "--slaves", type="int", default=1, help="Number of slaves to launch (default: %default)") parser.add_option( "-w", "--wait", type="int", help="DEPRECATED (no longer necessary) - Seconds to wait for nodes to start") parser.add_option( "-k", "--key-pair", help="Key pair to use on
[jira] [Commented] (SPARK-17241) SparkR spark.glm should have configurable regularization parameter
[ https://issues.apache.org/jira/browse/SPARK-17241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15437588#comment-15437588 ] Xin Ren commented on SPARK-17241: - I can work on this one :) > SparkR spark.glm should have configurable regularization parameter > -- > > Key: SPARK-17241 > URL: https://issues.apache.org/jira/browse/SPARK-17241 > Project: Spark > Issue Type: Improvement >Reporter: Junyang Qian > > Spark has a configurable L2 regularization parameter for generalized linear > regression. It is very important to have it in SparkR so that users can run > ridge regression. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17242) Update links of external dstream projects
[ https://issues.apache.org/jira/browse/SPARK-17242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17242: Assignee: Apache Spark (was: Shixiong Zhu) > Update links of external dstream projects > - > > Key: SPARK-17242 > URL: https://issues.apache.org/jira/browse/SPARK-17242 > Project: Spark > Issue Type: Documentation > Components: Documentation >Reporter: Shixiong Zhu >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17242) Update links of external dstream projects
[ https://issues.apache.org/jira/browse/SPARK-17242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15437584#comment-15437584 ] Apache Spark commented on SPARK-17242: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/14814 > Update links of external dstream projects > - > > Key: SPARK-17242 > URL: https://issues.apache.org/jira/browse/SPARK-17242 > Project: Spark > Issue Type: Documentation > Components: Documentation >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17242) Update links of external dstream projects
[ https://issues.apache.org/jira/browse/SPARK-17242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17242: Assignee: Shixiong Zhu (was: Apache Spark) > Update links of external dstream projects > - > > Key: SPARK-17242 > URL: https://issues.apache.org/jira/browse/SPARK-17242 > Project: Spark > Issue Type: Documentation > Components: Documentation >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17242) Update links of external dstream projects
Shixiong Zhu created SPARK-17242: Summary: Update links of external dstream projects Key: SPARK-17242 URL: https://issues.apache.org/jira/browse/SPARK-17242 Project: Spark Issue Type: Documentation Components: Documentation Reporter: Shixiong Zhu Assignee: Shixiong Zhu -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14586) SparkSQL doesn't parse decimal like Hive
[ https://issues.apache.org/jira/browse/SPARK-14586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15437548#comment-15437548 ] Dongjoon Hyun commented on SPARK-14586: --- Hi, [~stephane.maa...@gmail.com] and [~tsuresh]. FYI, Spark 2.0.0 now supports this, as shown below. {code} scala> sql("create table csv_t using csv options(path '/csv')") scala> sql("select * from csv_t").show +---+----+ |_c0| _c1| +---+----+ | a| 2.0| | | 3.0| +---+----+ scala> spark.version res2: String = 2.0.0 {code} > SparkSQL doesn't parse decimal like Hive > > > Key: SPARK-14586 > URL: https://issues.apache.org/jira/browse/SPARK-14586 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.1 >Reporter: Stephane Maarek > > create a test_data.csv with the following > {code:none} > a, 2.0 > ,3.0 > {code} > (the space is intended before the 2) > copy the test_data.csv to hdfs:///spark_testing_2 > go in hive, run the following statements > {code:sql} > CREATE SCHEMA IF NOT EXISTS spark_testing; > DROP TABLE IF EXISTS spark_testing.test_csv_2; > CREATE EXTERNAL TABLE `spark_testing.test_csv_2`( > column_1 varchar(10), > column_2 decimal(4,2)) > ROW FORMAT DELIMITED > FIELDS TERMINATED BY ',' > STORED AS TEXTFILE LOCATION '/spark_testing_2' > TBLPROPERTIES('serialization.null.format'=''); > select * from spark_testing.test_csv_2; > OK > a 2 > NULL 3 > {code} > As you can see, the value " 2" gets parsed correctly to 2. > Now onto the Spark shell: > {code:java} > val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc) > sqlContext.sql("select * from spark_testing.test_csv_2").show() > +--------+--------+ > |column_1|column_2| > +--------+--------+ > | a| null| > | null| 3.00| > +--------+--------+ > {code} > As you can see, the " 2" got parsed to null. Therefore Hive and Spark don't > have the same parsing behavior for decimals. I wouldn't say it is a bug per > se, but it looks like a necessary improvement for the two engines to > converge. 
Hive version is 1.5.1 > Not sure if relevant, but Scala does parse numbers with leading space > correctly > {code} > scala> "2.0".toDouble > res21: Double = 2.0 > scala> " 2.0".toDouble > res22: Double = 2.0 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
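The behavioral difference boils down to whether the text is trimmed before the decimal conversion: Hive's text serde accepts " 2" while Spark 1.5's path turned it into NULL. A lenient parse helper (hypothetical Python, not Spark's or Hive's actual parser) captures the Hive-like behavior:

```python
# Sketch of Hive-like lenient decimal parsing: trim whitespace first, and
# map anything unparseable to None (i.e. SQL NULL) instead of failing.

from decimal import Decimal, InvalidOperation

def parse_decimal_lenient(text):
    try:
        return Decimal(text.strip())
    except (InvalidOperation, AttributeError):
        return None   # unparseable or missing value -> NULL

assert parse_decimal_lenient(" 2.0") == Decimal("2.0")   # leading space OK
assert parse_decimal_lenient("abc") is None              # garbage -> NULL
```

A strict parser that skips the `strip()` step is exactly what turns " 2" into NULL, which is the divergence reported above.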
[jira] [Commented] (SPARK-16501) spark.mesos.secret exposed on UI and command line
[ https://issues.apache.org/jira/browse/SPARK-16501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15437539#comment-15437539 ] Alex Bozarth commented on SPARK-16501: -- The first problem (web ui) was fixed by https://github.com/apache/spark/pull/14484 as a follow up to SPARK-16796 > spark.mesos.secret exposed on UI and command line > - > > Key: SPARK-16501 > URL: https://issues.apache.org/jira/browse/SPARK-16501 > Project: Spark > Issue Type: Improvement > Components: Spark Submit, Web UI >Affects Versions: 1.6.2 >Reporter: Eric Daniel > Labels: security > > There are two related problems with spark.mesos.secret: > 1) The web UI shows its value in the "environment" tab > 2) Passing it as a command-line option to spark-submit (or creating a > SparkContext from python, with the effect of launching spark-submit) exposes > it to "ps" > I'll be happy to submit a patch but I could use some advice first. > The first problem is easy enough, just don't show that value in the UI > For the second problem, I'm not sure what the best solution is. A > "spark.mesos.secret-file" parameter would let the user store the secret in a > non-world-readable file. Alternatively, the mesos secret could be obtained > from the environment, which other users don't have access to. Either > solution would work in client mode, but I don't know if they're workable in > cluster mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
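The web-UI side of the fix amounts to masking any configuration entry whose key looks secret-like before it is displayed. A sketch of that idea (hypothetical Python; the key pattern here is illustrative, not the actual regex Spark uses):

```python
# Sketch: redact secret-like config values before rendering the
# "environment" tab. The pattern is an illustrative guess.

import re

SECRET_KEY_RE = re.compile(r"(?i)secret|password|token")

def redact(conf):
    """Return a copy of `conf` with secret-like values masked."""
    return {k: ("*********(redacted)" if SECRET_KEY_RE.search(k) else v)
            for k, v in conf.items()}

conf = {"spark.mesos.secret": "hunter2", "spark.app.name": "demo"}
shown = redact(conf)
assert shown["spark.mesos.secret"] == "*********(redacted)"
assert shown["spark.app.name"] == "demo"
```

The command-line exposure is harder, as the comment notes: anything passed as an argument is visible to `ps`, so the value has to come from a file or the environment instead.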
[jira] [Comment Edited] (SPARK-17163) Merge MLOR into a single LOR interface
[ https://issues.apache.org/jira/browse/SPARK-17163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15437535#comment-15437535 ] DB Tsai edited comment on SPARK-17163 at 8/25/16 7:58 PM: -- BTW, having the name as `coefficientMatrix` doesn't look as good as `coefficients`, but for backward compatibility it seems we don't have a choice. Also, we may need to handle loading old BLOR models, and have the new model written as a matrix and vector for both MLOR and BLOR. was (Author: dbtsai): BTW, having the name as `coefficientMatrix` doesn't look good as `coefficients`, but for backward compatibility, seems we don't have a choice. > Merge MLOR into a single LOR interface > -- > > Key: SPARK-17163 > URL: https://issues.apache.org/jira/browse/SPARK-17163 > Project: Spark > Issue Type: Sub-task > Components: ML, MLlib >Reporter: Seth Hendrickson > > Before the 2.1 release, we should finalize the API for logistic regression. > After SPARK-7159, we have both LogisticRegression and > MultinomialLogisticRegression models. This may be confusing to users and is > a bit superfluous, since MLOR can do basically all of what BLOR does. We > should decide whether it needs to be changed and implement those changes before 2.1. > *Update*: Seems we have decided to merge the two estimators. I changed the > title to reflect that. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17163) Merge MLOR into a single LOR interface
[ https://issues.apache.org/jira/browse/SPARK-17163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15437535#comment-15437535 ] DB Tsai commented on SPARK-17163: - BTW, having the name as `coefficientMatrix` doesn't look as good as `coefficients`, but for backward compatibility it seems we don't have a choice. > Merge MLOR into a single LOR interface > -- > > Key: SPARK-17163 > URL: https://issues.apache.org/jira/browse/SPARK-17163 > Project: Spark > Issue Type: Sub-task > Components: ML, MLlib >Reporter: Seth Hendrickson > > Before the 2.1 release, we should finalize the API for logistic regression. > After SPARK-7159, we have both LogisticRegression and > MultinomialLogisticRegression models. This may be confusing to users and is > a bit superfluous, since MLOR can do basically all of what BLOR does. We > should decide whether it needs to be changed and implement those changes before 2.1. > *Update*: Seems we have decided to merge the two estimators. I changed the > title to reflect that. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17241) SparkR spark.glm should have configurable regularization parameter
[ https://issues.apache.org/jira/browse/SPARK-17241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15437536#comment-15437536 ] Shivaram Venkataraman commented on SPARK-17241: --- +1 - This would be good to have. Also, on a related note, is it hard to get elasticnet working in spark.glm? We can create a new JIRA for it if all we need is a new wrapper. > SparkR spark.glm should have configurable regularization parameter > -- > > Key: SPARK-17241 > URL: https://issues.apache.org/jira/browse/SPARK-17241 > Project: Spark > Issue Type: Improvement >Reporter: Junyang Qian > > Spark has a configurable L2 regularization parameter for generalized linear > regression. It is very important to have it in SparkR so that users can run > ridge regression. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
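For context on the parameters being requested: Spark's linear models combine L1 and L2 penalties through an elastic-net mixing parameter alpha, where alpha = 0 gives ridge (pure L2) and alpha = 1 gives lasso (pure L1). A direct transcription of that penalty term (plain Python, not the SparkR wrapper):

```python
# Elastic-net penalty: regParam * (alpha * ||w||_1 + (1 - alpha)/2 * ||w||_2^2).
# alpha = 0 -> ridge regression, alpha = 1 -> lasso.

def elastic_net_penalty(weights, reg_param, alpha):
    l1 = sum(abs(w) for w in weights)
    l2 = sum(w * w for w in weights)
    return reg_param * (alpha * l1 + (1.0 - alpha) * 0.5 * l2)

w = [1.0, -2.0]
assert abs(elastic_net_penalty(w, 0.1, 0.0) - 0.25) < 1e-12   # pure ridge
assert abs(elastic_net_penalty(w, 0.1, 1.0) - 0.30) < 1e-12   # pure lasso
```

Exposing `reg_param` alone covers the ridge request in this ticket; exposing alpha as well would cover the elastic-net follow-up.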
[jira] [Commented] (SPARK-17163) Merge MLOR into a single LOR interface
[ https://issues.apache.org/jira/browse/SPARK-17163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15437531#comment-15437531 ] DB Tsai commented on SPARK-17163: - Why do we need `_intercept`? Also, for intercept, we may need to do the following so the MLOR and BLOR have the same model format, and thus only single implementation of score is required. {code} def intercept: Double = { if (isMultinomial) { throw new Exception } intercepts(1) - intercepts(0) } def coefficients: Vector = if (!isMultinomial) { val temp = coefficientMatrix(1) - coefficientMatrix(0) Vectors.dense() } else { throw new Exception } {code} > Merge MLOR into a single LOR interface > -- > > Key: SPARK-17163 > URL: https://issues.apache.org/jira/browse/SPARK-17163 > Project: Spark > Issue Type: Sub-task > Components: ML, MLlib >Reporter: Seth Hendrickson > > Before the 2.1 release, we should finalize the API for logistic regression. > After SPARK-7159, we have both LogisticRegression and > MultinomialLogisticRegression models. This may be confusing to users and, is > a bit superfluous since MLOR can do basically all of what BLOR does. We > should decide if it needs to be changed and implement those changes before 2.1 > *Update*: Seems we have decided to merge the two estimators. I changed the > title to reflect that. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
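The pseudocode in the comment above is incomplete (e.g. the empty `Vectors.dense()`), so here is a runnable version of the same idea as a hedged Python sketch, not the Scala API: store a 2 x numFeatures matrix plus two intercepts even for the binomial case, and recover the classic single coefficient vector and intercept as row/entry differences.

```python
# Sketch of the "difference view" for a binomial model stored in the
# unified 2-row (multinomial) representation. Names are hypothetical.

def blor_view(coef_matrix, intercepts, is_multinomial):
    """coef_matrix: list of per-class rows. Returns (coefficients, intercept)."""
    if is_multinomial:
        raise ValueError("binomial view undefined for a multinomial model")
    row0, row1 = coef_matrix
    coefficients = [b - a for a, b in zip(row0, row1)]
    intercept = intercepts[1] - intercepts[0]
    return coefficients, intercept

coefs, b = blor_view([[0.5, -1.0], [1.5, 1.0]], [0.2, 0.7], False)
assert coefs == [1.0, 2.0]
assert abs(b - 0.5) < 1e-12
```

Because only the differences enter the sigmoid, this view scores identically to a classic single-vector BLOR model.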
[jira] [Commented] (SPARK-17240) SparkConf is Serializable but contains a non-serializable field
[ https://issues.apache.org/jira/browse/SPARK-17240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15437524#comment-15437524 ] Apache Spark commented on SPARK-17240: -- User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/14813 > SparkConf is Serializable but contains a non-serializable field > --- > > Key: SPARK-17240 > URL: https://issues.apache.org/jira/browse/SPARK-17240 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Michael Gummelt > > This commit: > https://github.com/apache/spark/commit/5da6c4b24f512b63cd4e6ba7dd8968066a9396f5 > Added ConfigReader to SparkConf. SparkConf is Serializable, but ConfigReader > is not, which results in the following exception: > {code} > java.io.NotSerializableException: > org.apache.spark.internal.config.ConfigReader > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1184) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348) > at org.apache.spark.util.Utils$.serialize(Utils.scala:134) > at > org.apache.spark.scheduler.cluster.mesos.ZookeeperMesosClusterPersistenceEngine.persist(MesosClusterPersistenceEngine.scala:111) > at > org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.submitDriver(MesosClusterScheduler.scala:170) > at > 
org.apache.spark.deploy.rest.mesos.MesosSubmitRequestServlet.handleSubmit(MesosRestServer.scala:126) > at > org.apache.spark.deploy.rest.SubmitRequestServlet.doPost(RestSubmissionServer.scala:265) > at javax.servlet.http.HttpServlet.service(HttpServlet.java:707) > at javax.servlet.http.HttpServlet.service(HttpServlet.java:790) > at > org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:812) > at > org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:587) > at > org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127) > at > org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515) > at > org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061) > at > org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) > at > org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97) > at org.spark_project.jetty.server.Server.handle(Server.java:499) > at > org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:311) > at > org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:257) > at > org.spark_project.jetty.io.AbstractConnection$2.run(AbstractConnection.java:544) > at > org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635) > at > org.spark_project.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
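A common Java remedy for this class of bug is to mark the offending field `transient` (or rebuild it on deserialization); whether that is what the linked PR does is not shown here. The analogous Python pattern, as a sketch with invented class names: drop the non-serializable member in `__getstate__` and rebuild it in `__setstate__`.

```python
# Sketch of excluding a non-serializable field from serialization, analogous
# to a transient field in Java. ConfigReader/SparkConfLike are stand-ins.

import pickle
import threading

class ConfigReader:                      # stand-in for the non-serializable helper
    def __init__(self, settings):
        self.settings = settings
        self.lock = threading.Lock()     # a Lock is not picklable

class SparkConfLike:
    def __init__(self):
        self.settings = {}
        self._reader = ConfigReader(self.settings)

    def __getstate__(self):
        state = self.__dict__.copy()
        del state["_reader"]             # exclude the non-serializable field
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self._reader = ConfigReader(self.settings)   # rebuild after deserialization

conf = SparkConfLike()
conf.settings["spark.app.name"] = "demo"
clone = pickle.loads(pickle.dumps(conf))         # would raise without __getstate__
assert clone.settings == {"spark.app.name": "demo"}
```

Without the `__getstate__`/`__setstate__` pair, pickling fails on the embedded lock, which mirrors the `NotSerializableException` in the stack trace above.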
[jira] [Assigned] (SPARK-17240) SparkConf is Serializable but contains a non-serializable field
[ https://issues.apache.org/jira/browse/SPARK-17240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-17240:
------------------------------------

Assignee: Apache Spark

> SparkConf is Serializable but contains a non-serializable field
[jira] [Assigned] (SPARK-17240) SparkConf is Serializable but contains a non-serializable field
[ https://issues.apache.org/jira/browse/SPARK-17240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-17240:
------------------------------------

Assignee: (was: Apache Spark)

> SparkConf is Serializable but contains a non-serializable field
[jira] [Comment Edited] (SPARK-17163) Merge MLOR into a single LOR interface
[ https://issues.apache.org/jira/browse/SPARK-17163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15437400#comment-15437400 ]

DB Tsai edited comment on SPARK-17163 at 8/25/16 7:43 PM:
----------------------------------------------------------

Why will it break the API if we throw the exception when we call coefficients and intercept when the models are MLOR? Also, for BLOR, how will you store the actual representation? Store them in coefficientsMatrix as a 2 x nfeature matrix and intercepts as an array of 2? I think this is nice since we have a single representation for BLOR and MLOR. Thanks.

was (Author: dbtsai):
Why will it break the API if we throw the exception when we call coefficients and intercept when the models are BLOR? Also, for BLOR, how will you store the actual representation? Store them in coefficientsMatrix as a 2 x nfeature matrix and intercepts as an array of 2? I think this is nice since we have a single representation for BLOR and MLOR. Thanks.

> Merge MLOR into a single LOR interface
> --------------------------------------
>
> Key: SPARK-17163
> URL: https://issues.apache.org/jira/browse/SPARK-17163
> Project: Spark
> Issue Type: Sub-task
> Components: ML, MLlib
> Reporter: Seth Hendrickson
>
> Before the 2.1 release, we should finalize the API for logistic regression.
> After SPARK-7159, we have both LogisticRegression and
> MultinomialLogisticRegression models. This may be confusing to users and is
> a bit superfluous since MLOR can do basically all of what BLOR does. We
> should decide if it needs to be changed and implement those changes before 2.1.
> *Update*: Seems we have decided to merge the two estimators. I changed the
> title to reflect that.
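The representation discussed above — keeping a binomial model in the same 2 x nfeatures coefficient matrix used for multinomial models — works because a two-class softmax collapses to a sigmoid over the row difference. A small numeric sketch of that equivalence (plain arrays stand in for ML vectors; this is not Spark code):

```java
public class SoftmaxVsSigmoid {
    // P(class 1) under a two-row multinomial (softmax) model.
    static double softmaxP1(double[] w0, double b0, double[] w1, double b1, double[] x) {
        double m0 = b0, m1 = b1;
        for (int i = 0; i < x.length; i++) { m0 += w0[i] * x[i]; m1 += w1[i] * x[i]; }
        double e0 = Math.exp(m0), e1 = Math.exp(m1);
        return e1 / (e0 + e1);
    }

    // P(class 1) under the binary model whose coefficients are the row difference w1 - w0.
    static double sigmoidP1(double[] w0, double b0, double[] w1, double b1, double[] x) {
        double m = b1 - b0;
        for (int i = 0; i < x.length; i++) m += (w1[i] - w0[i]) * x[i];
        return 1.0 / (1.0 + Math.exp(-m));
    }

    public static void main(String[] args) {
        double[] w0 = {0.2, -1.0}, w1 = {0.5, 0.3}, x = {1.5, -2.0};
        double p = softmaxP1(w0, 0.1, w1, -0.4, x);
        double q = sigmoidP1(w0, 0.1, w1, -0.4, x);
        System.out.println(Math.abs(p - q) < 1e-12); // the two parameterizations agree
    }
}
```

This is why a single coefficientsMatrix representation can serve both BLOR and MLOR without changing predictions.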
[jira] [Updated] (SPARK-17241) SparkR spark.glm should have configurable regularization parameter
[ https://issues.apache.org/jira/browse/SPARK-17241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Junyang Qian updated SPARK-17241:
---------------------------------

Summary: SparkR spark.glm should have configurable regularization parameter
(was: SparkR spark.glm should have configurable regularization parameter(s))

> SparkR spark.glm should have configurable regularization parameter
> ------------------------------------------------------------------
>
> Key: SPARK-17241
> URL: https://issues.apache.org/jira/browse/SPARK-17241
> Project: Spark
> Issue Type: Improvement
> Reporter: Junyang Qian
>
> Spark has configurable L2 regularization parameter for linear regression and
> an additional elastic-net parameter for generalized linear model. It is very
> important to have them in SparkR so that users can run ridge regression and
> elastic-net.
[jira] [Updated] (SPARK-17241) SparkR spark.glm should have configurable regularization parameter
[ https://issues.apache.org/jira/browse/SPARK-17241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Junyang Qian updated SPARK-17241:
---------------------------------

Description:
Spark has configurable L2 regularization parameter for generalized linear regression. It is very important to have them in SparkR so that users can run ridge regression.

(was: Spark has configurable L2 regularization parameter for linear regression and an additional elastic-net parameter for generalized linear model. It is very important to have them in SparkR so that users can run ridge regression and elastic-net.)

> SparkR spark.glm should have configurable regularization parameter
> ------------------------------------------------------------------
>
> Key: SPARK-17241
> URL: https://issues.apache.org/jira/browse/SPARK-17241
> Project: Spark
> Issue Type: Improvement
> Reporter: Junyang Qian
>
> Spark has configurable L2 regularization parameter for generalized linear
> regression. It is very important to have them in SparkR so that users can run
> ridge regression.
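For context on what the requested parameter does: L2 (ridge) regularization adds a lambda-weighted penalty on the squared coefficients to the least-squares objective, shrinking coefficients toward zero. A one-feature, no-intercept sketch using the closed-form solution w = (x'y) / (x'x + lambda) — illustrative only, unrelated to SparkR's actual implementation:

```java
public class RidgeSketch {
    // Closed-form ridge estimate for a single feature with no intercept:
    // w = (x'y) / (x'x + lambda)
    static double ridge(double[] x, double[] y, double lambda) {
        double xy = 0, xx = 0;
        for (int i = 0; i < x.length; i++) { xy += x[i] * y[i]; xx += x[i] * x[i]; }
        return xy / (xx + lambda);
    }

    public static void main(String[] args) {
        double[] x = {1, 2, 3, 4}, y = {2, 4, 6, 8}; // exact linear fit y = 2x
        System.out.println(ridge(x, y, 0.0));        // 2.0 (unregularized)
        System.out.println(ridge(x, y, 10.0) < 2.0); // true: lambda shrinks the coefficient
    }
}
```

Exposing lambda (and, for elastic-net, a mixing parameter) in SparkR's spark.glm is what lets users trade bias for variance the same way Scala/Python users already can.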
[jira] [Commented] (SPARK-11085) Add support for HTTP proxy
[ https://issues.apache.org/jira/browse/SPARK-11085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15437505#comment-15437505 ]

SURESH CHAGANTI commented on SPARK-11085:
-----------------------------------------

The following script accepts the "--proxy_host_port" argument:

{code}
from __future__ import division, print_function, with_statement

import codecs
import hashlib
import itertools
import logging
import os
import os.path
import pipes
import random
import shutil
import string
from stat import S_IRUSR
import subprocess
import sys
import tarfile
import tempfile
import textwrap
import time
import warnings
from datetime import datetime
from optparse import OptionParser
from sys import stderr

if sys.version < "3":
    from urllib2 import urlopen, Request, HTTPError
else:
    from urllib.request import urlopen, Request
    from urllib.error import HTTPError
    raw_input = input
    xrange = range

SPARK_EC2_VERSION = "1.6.2"
SPARK_EC2_DIR = os.path.dirname(os.path.realpath(__file__))

VALID_SPARK_VERSIONS = set([
    "0.7.3", "0.8.0", "0.8.1", "0.9.0", "0.9.1", "0.9.2",
    "1.0.0", "1.0.1", "1.0.2", "1.1.0", "1.1.1", "1.2.0", "1.2.1",
    "1.3.0", "1.3.1", "1.4.0", "1.4.1", "1.5.0", "1.5.1", "1.5.2",
    "1.6.0", "1.6.1", "1.6.2",
])

SPARK_TACHYON_MAP = {
    "1.0.0": "0.4.1", "1.0.1": "0.4.1", "1.0.2": "0.4.1",
    "1.1.0": "0.5.0", "1.1.1": "0.5.0", "1.2.0": "0.5.0", "1.2.1": "0.5.0",
    "1.3.0": "0.5.0", "1.3.1": "0.5.0", "1.4.0": "0.6.4", "1.4.1": "0.6.4",
    "1.5.0": "0.7.1", "1.5.1": "0.7.1", "1.5.2": "0.7.1",
    "1.6.0": "0.8.2", "1.6.1": "0.8.2", "1.6.2": "0.8.2",
}

DEFAULT_SPARK_VERSION = SPARK_EC2_VERSION
DEFAULT_SPARK_GITHUB_REPO = "https://github.com/apache/spark"

# Default location to get the spark-ec2 scripts (and ami-list) from
DEFAULT_SPARK_EC2_GITHUB_REPO = "https://github.com/amplab/spark-ec2"
DEFAULT_SPARK_EC2_BRANCH = "branch-1.6"


def setup_external_libs(libs):
    """
    Download external libraries from PyPI to SPARK_EC2_DIR/lib/ and prepend them to our PATH.
    """
    PYPI_URL_PREFIX = "https://pypi.python.org/packages/source"
    SPARK_EC2_LIB_DIR = os.path.join(SPARK_EC2_DIR, "lib")

    if not os.path.exists(SPARK_EC2_LIB_DIR):
        print("Downloading external libraries that spark-ec2 needs from PyPI to {path}...".format(
            path=SPARK_EC2_LIB_DIR
        ))
        print("This should be a one-time operation.")
        os.mkdir(SPARK_EC2_LIB_DIR)

    for lib in libs:
        versioned_lib_name = "{n}-{v}".format(n=lib["name"], v=lib["version"])
        lib_dir = os.path.join(SPARK_EC2_LIB_DIR, versioned_lib_name)

        if not os.path.isdir(lib_dir):
            tgz_file_path = os.path.join(SPARK_EC2_LIB_DIR, versioned_lib_name + ".tar.gz")
            print(" - Downloading {lib}...".format(lib=lib["name"]))
            download_stream = urlopen(
                "{prefix}/{first_letter}/{lib_name}/{lib_name}-{lib_version}.tar.gz".format(
                    prefix=PYPI_URL_PREFIX,
                    first_letter=lib["name"][:1],
                    lib_name=lib["name"],
                    lib_version=lib["version"]
                )
            )
            with open(tgz_file_path, "wb") as tgz_file:
                tgz_file.write(download_stream.read())
            with open(tgz_file_path, "rb") as tar:
                if hashlib.md5(tar.read()).hexdigest() != lib["md5"]:
                    print("ERROR: Got wrong md5sum for {lib}.".format(lib=lib["name"]), file=stderr)
                    sys.exit(1)
            tar = tarfile.open(tgz_file_path)
            tar.extractall(path=SPARK_EC2_LIB_DIR)
            tar.close()
            os.remove(tgz_file_path)
            print(" - Finished downloading {lib}.".format(lib=lib["name"]))
        sys.path.insert(1, lib_dir)


# Only PyPI libraries are supported.
external_libs = [
    {
        "name": "boto",
        "version": "2.34.0",
        "md5": "5556223d2d0cc4d06dd4829e671dcecd"
    }
]

setup_external_libs(external_libs)

import boto
from boto.ec2.blockdevicemapping import BlockDeviceMapping, BlockDeviceType, EBSBlockDeviceType
from boto import ec2


class UsageError(Exception):
    pass


# Configure and parse our command-line arguments
def parse_args():
    parser = OptionParser(
        prog="spark-ec2",
        version="%prog {v}".format(v=SPARK_EC2_VERSION),
        usage="%prog [options] <action> <cluster_name>\n\n"
              + "<action> can be: launch, destroy, login, stop, start, get-master, reboot-slaves")
    parser.add_option(
        "-s", "--slaves", type="int", default=1,
        help="Number of slaves to launch (default: %default)")
    parser.add_option(
        "-w", "--wait", type="int",
        help="DEPRECATED (no longer necessary) - Seconds to wait for nodes to start")
    parser.add_option(
        "-k", "--key-pair",
        help="Key pair to use
{code}
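The md5 step in the script above — refusing to unpack a download whose digest does not match a pinned value — can be sketched independently of spark-ec2. The payload and digest below are sample values chosen for illustration, not boto's:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Md5Check {
    // Hex-encoded MD5 digest of a byte payload.
    static String md5Hex(byte[] data) {
        try {
            byte[] d = MessageDigest.getInstance("MD5").digest(data);
            StringBuilder sb = new StringBuilder();
            for (byte b : d) sb.append(String.format("%02x", b));
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new RuntimeException(e); // MD5 is always available in the JDK
        }
    }

    // The download is accepted only if its digest matches the pinned value.
    static boolean verify(byte[] payload, String expectedHex) {
        return md5Hex(payload).equals(expectedHex);
    }

    public static void main(String[] args) {
        byte[] payload = "hello".getBytes(StandardCharsets.UTF_8);
        // md5("hello") is the well-known digest below
        System.out.println(verify(payload, "5d41402abc4b2a76b9719d911017c592")); // true
    }
}
```

Note MD5 only guards against corrupted downloads; it is not collision-resistant, so modern pinning schemes prefer SHA-256.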
[jira] [Commented] (SPARK-17163) Merge MLOR into a single LOR interface
[ https://issues.apache.org/jira/browse/SPARK-17163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15437503#comment-15437503 ]

Seth Hendrickson commented on SPARK-17163:
------------------------------------------

I misunderstood your suggestion. I think that's best - to throw an error when those methods are called on a multinomial model, and to return the normal values in the binomial case.

> Merge MLOR into a single LOR interface
[jira] [Assigned] (SPARK-17237) DataFrame fill after pivot causing org.apache.spark.sql.AnalysisException
[ https://issues.apache.org/jira/browse/SPARK-17237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-17237:
------------------------------------

Assignee: (was: Apache Spark)

> DataFrame fill after pivot causing org.apache.spark.sql.AnalysisException
> -------------------------------------------------------------------------
>
> Key: SPARK-17237
> URL: https://issues.apache.org/jira/browse/SPARK-17237
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.0
> Reporter: Jiang Qiqi
> Labels: newbie
>
> I am trying to run a pivot transformation which I ran on a spark1.6 cluster,
> namely
> {code}
> sc.parallelize(Seq((2,3,4), (3,4,5))).toDF("a", "b", "c")
> res1: org.apache.spark.sql.DataFrame = [a: int, b: int, c: int]
>
> scala> res1.groupBy("a").pivot("b").agg(count("c"), avg("c")).na.fill(0)
> res2: org.apache.spark.sql.DataFrame = [a: int, 3_count(c): bigint, 3_avg(c): double, 4_count(c): bigint, 4_avg(c): double]
>
> scala> res1.groupBy("a").pivot("b").agg(count("c"), avg("c")).na.fill(0).show
> +---+----------+--------+----------+--------+
> |  a|3_count(c)|3_avg(c)|4_count(c)|4_avg(c)|
> +---+----------+--------+----------+--------+
> |  2|         1|     4.0|         0|     0.0|
> |  3|         0|     0.0|         1|     5.0|
> +---+----------+--------+----------+--------+
> {code}
> After upgrading the environment to spark2.0, got an error while executing the
> .na.fill method:
> {code}
> scala> sc.parallelize(Seq((2,3,4), (3,4,5))).toDF("a", "b", "c")
> res3: org.apache.spark.sql.DataFrame = [a: int, b: int ... 1 more field]
>
> scala> res3.groupBy("a").pivot("b").agg(count("c"), avg("c")).na.fill(0)
> org.apache.spark.sql.AnalysisException: syntax error in attribute name: `3_count(`c`)`;
>   at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute$.e$1(unresolved.scala:103)
>   at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute$.parseAttributeName(unresolved.scala:113)
>   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:168)
>   at org.apache.spark.sql.Dataset.resolve(Dataset.scala:218)
>   at org.apache.spark.sql.Dataset.col(Dataset.scala:921)
>   at org.apache.spark.sql.DataFrameNaFunctions.org$apache$spark$sql$DataFrameNaFunctions$$fillCol(DataFrameNaFunctions.scala:411)
>   at org.apache.spark.sql.DataFrameNaFunctions$$anonfun$2.apply(DataFrameNaFunctions.scala:162)
>   at org.apache.spark.sql.DataFrameNaFunctions$$anonfun$2.apply(DataFrameNaFunctions.scala:159)
>   at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
>   at org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:159)
>   at org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:149)
>   at org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:134)
> {code}
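The failure is that the pivot-generated column names, such as 3_count(`c`), embed backticks that the attribute-name parser rejects. Until Spark generates backtick-free names, a common workaround is to rename the pivoted columns before calling na.fill; the renaming itself is plain string sanitation, sketched below (the sanitation rule is illustrative, not Spark's own logic):

```java
public class ColumnNameSanitizer {
    // Strip the characters that Spark's attribute-name parser rejects, e.g. backticks.
    static String sanitize(String name) {
        return name.replace("`", "");
    }

    public static void main(String[] args) {
        System.out.println(sanitize("3_count(`c`)")); // 3_count(c)
    }
}
```

In a Spark session this rule would be applied via withColumnRenamed (or a select with aliases) over every pivoted column before the fill.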
[jira] [Commented] (SPARK-17237) DataFrame fill after pivot causing org.apache.spark.sql.AnalysisException
[ https://issues.apache.org/jira/browse/SPARK-17237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15437500#comment-15437500 ]

Apache Spark commented on SPARK-17237:
--------------------------------------

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/14812

> DataFrame fill after pivot causing org.apache.spark.sql.AnalysisException