[jira] [Commented] (SPARK-20334) Return a better error message when correlated predicates contain aggregate expression that has mixture of outer and local references
[ https://issues.apache.org/jira/browse/SPARK-20334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15968650#comment-15968650 ] Apache Spark commented on SPARK-20334: -- User 'dilipbiswal' has created a pull request for this issue: https://github.com/apache/spark/pull/17636 > Return a better error message when correlated predicates contain aggregate > expression that has mixture of outer and local references > > > Key: SPARK-20334 > URL: https://issues.apache.org/jira/browse/SPARK-20334 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.1.0 >Reporter: Dilip Biswal >Priority: Minor > > Currently subqueries with correlated predicates containing aggregate > expression having mixture of outer references and local references generate a > code gen error like following : > {code:java} > Cannot evaluate expression: min((input[0, int, false] + input[4, int, false])) > at > org.apache.spark.sql.catalyst.expressions.Unevaluable$class.doGenCode(Expression.scala:226) > at > org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression.doGenCode(interfaces.scala:87) > at > org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:106) > at > org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:103) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:103) > at > org.apache.spark.sql.catalyst.expressions.BinaryExpression.nullSafeCodeGen(Expression.scala:461) > at > org.apache.spark.sql.catalyst.expressions.BinaryExpression.defineCodeGen(Expression.scala:443) > at > org.apache.spark.sql.catalyst.expressions.BinaryComparison.doGenCode(predicates.scala:431) > at > org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:106) > at > org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:103) > 
at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:103) > {code} > We should catch this situation and return a better error message to the user. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
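For readers unfamiliar with the failing shape: the plan contains an aggregate such as min(outer.a + local.b), whose argument mixes a column from the outer query with one from the subquery, and codegen only fails much later with the unhelpful trace above. The proposed analysis-time check can be sketched as a walk over the aggregate's argument; all names and the tuple encoding below are purely illustrative, not Spark's actual Catalyst API:

```python
# Illustrative sketch only -- not Spark's Catalyst API. Expressions are
# modeled as tuples: ("outer", name), ("local", name), or (op, child, ...).

def collect_refs(expr):
    """Collect (kind, name) pairs for every column reference in the tree."""
    kind, *rest = expr
    if kind in ("outer", "local"):
        return [(kind, rest[0])]
    return [ref for child in rest for ref in collect_refs(child)]

def check_aggregate(agg_expr):
    """Raise a readable analysis error instead of failing later in codegen."""
    kinds = {kind for kind, _ in collect_refs(agg_expr)}
    if kinds == {"outer", "local"}:
        raise ValueError(
            "Found an aggregate function with both outer and local "
            "references: %r" % (agg_expr,))

# min(outer.a + local.b): the shape behind the codegen failure above
bad = ("min", ("+", ("outer", "a"), ("local", "b")))
ok = ("min", ("+", ("local", "a"), ("local", "b")))

check_aggregate(ok)  # fine: all references are local
try:
    check_aggregate(bad)
except ValueError as err:
    print("analysis error:", err)
```

The point is only the placement of the check: rejecting the mixed-reference aggregate during analysis yields a user-facing message instead of the internal Unevaluable codegen exception.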
[jira] [Assigned] (SPARK-20334) Return a better error message when correlated predicates contain aggregate expression that has mixture of outer and local references
[ https://issues.apache.org/jira/browse/SPARK-20334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20334: Assignee: Apache Spark > Return a better error message when correlated predicates contain aggregate > expression that has mixture of outer and local references > > > Key: SPARK-20334 > URL: https://issues.apache.org/jira/browse/SPARK-20334 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.1.0 >Reporter: Dilip Biswal >Assignee: Apache Spark >Priority: Minor > > Currently subqueries with correlated predicates containing aggregate > expression having mixture of outer references and local references generate a > code gen error like following : > {code:java} > Cannot evaluate expression: min((input[0, int, false] + input[4, int, false])) > at > org.apache.spark.sql.catalyst.expressions.Unevaluable$class.doGenCode(Expression.scala:226) > at > org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression.doGenCode(interfaces.scala:87) > at > org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:106) > at > org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:103) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:103) > at > org.apache.spark.sql.catalyst.expressions.BinaryExpression.nullSafeCodeGen(Expression.scala:461) > at > org.apache.spark.sql.catalyst.expressions.BinaryExpression.defineCodeGen(Expression.scala:443) > at > org.apache.spark.sql.catalyst.expressions.BinaryComparison.doGenCode(predicates.scala:431) > at > org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:106) > at > org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:103) > at scala.Option.getOrElse(Option.scala:121) > at > 
org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:103) > {code} > We should catch this situation and return a better error message to the user. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20334) Return a better error message when correlated predicates contain aggregate expression that has mixture of outer and local references
[ https://issues.apache.org/jira/browse/SPARK-20334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20334: Assignee: (was: Apache Spark) > Return a better error message when correlated predicates contain aggregate > expression that has mixture of outer and local references > > > Key: SPARK-20334 > URL: https://issues.apache.org/jira/browse/SPARK-20334 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.1.0 >Reporter: Dilip Biswal >Priority: Minor > > Currently subqueries with correlated predicates containing aggregate > expression having mixture of outer references and local references generate a > code gen error like following : > {code:java} > Cannot evaluate expression: min((input[0, int, false] + input[4, int, false])) > at > org.apache.spark.sql.catalyst.expressions.Unevaluable$class.doGenCode(Expression.scala:226) > at > org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression.doGenCode(interfaces.scala:87) > at > org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:106) > at > org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:103) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:103) > at > org.apache.spark.sql.catalyst.expressions.BinaryExpression.nullSafeCodeGen(Expression.scala:461) > at > org.apache.spark.sql.catalyst.expressions.BinaryExpression.defineCodeGen(Expression.scala:443) > at > org.apache.spark.sql.catalyst.expressions.BinaryComparison.doGenCode(predicates.scala:431) > at > org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:106) > at > org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:103) > at scala.Option.getOrElse(Option.scala:121) > at > 
org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:103) > {code} > We should catch this situation and return a better error message to the user. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20335) Children expressions of Hive UDF impact the determinism of Hive UDF
[ https://issues.apache.org/jira/browse/SPARK-20335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20335: Assignee: Apache Spark (was: Xiao Li) > Children expressions of Hive UDF impact the determinism of Hive UDF > > > Key: SPARK-20335 > URL: https://issues.apache.org/jira/browse/SPARK-20335 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.0 >Reporter: Xiao Li >Assignee: Apache Spark > > {noformat} > /** >* Certain optimizations should not be applied if UDF is not deterministic. >* Deterministic UDF returns same result each time it is invoked with a >* particular input. This determinism just needs to hold within the context > of >* a query. >* >* @return true if the UDF is deterministic >*/ > boolean deterministic() default true; > {noformat} > Based on the definition of UDFType, when a Hive UDF's children are > non-deterministic, the Hive UDF is also non-deterministic. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20335) Children expressions of Hive UDF impact the determinism of Hive UDF
[ https://issues.apache.org/jira/browse/SPARK-20335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20335: Assignee: Xiao Li (was: Apache Spark) > Children expressions of Hive UDF impact the determinism of Hive UDF > > > Key: SPARK-20335 > URL: https://issues.apache.org/jira/browse/SPARK-20335 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.0 >Reporter: Xiao Li >Assignee: Xiao Li > > {noformat} > /** >* Certain optimizations should not be applied if UDF is not deterministic. >* Deterministic UDF returns same result each time it is invoked with a >* particular input. This determinism just needs to hold within the context > of >* a query. >* >* @return true if the UDF is deterministic >*/ > boolean deterministic() default true; > {noformat} > Based on the definition of UDFType, when a Hive UDF's children are > non-deterministic, the Hive UDF is also non-deterministic. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20335) Children expressions of Hive UDF impact the determinism of Hive UDF
[ https://issues.apache.org/jira/browse/SPARK-20335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15968639#comment-15968639 ] Apache Spark commented on SPARK-20335: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/17635 > Children expressions of Hive UDF impact the determinism of Hive UDF > > > Key: SPARK-20335 > URL: https://issues.apache.org/jira/browse/SPARK-20335 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.0 >Reporter: Xiao Li >Assignee: Xiao Li > > {noformat} > /** >* Certain optimizations should not be applied if UDF is not deterministic. >* Deterministic UDF returns same result each time it is invoked with a >* particular input. This determinism just needs to hold within the context > of >* a query. >* >* @return true if the UDF is deterministic >*/ > boolean deterministic() default true; > {noformat} > Based on the definition of UDFType, when a Hive UDF's children are > non-deterministic, the Hive UDF is also non-deterministic. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20335) Children expressions of Hive UDF impact the determinism of Hive UDF
Xiao Li created SPARK-20335: --- Summary: Children expressions of Hive UDF impact the determinism of Hive UDF Key: SPARK-20335 URL: https://issues.apache.org/jira/browse/SPARK-20335 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.0, 2.0.2 Reporter: Xiao Li Assignee: Xiao Li {noformat} /** * Certain optimizations should not be applied if UDF is not deterministic. * Deterministic UDF returns same result each time it is invoked with a * particular input. This determinism just needs to hold within the context of * a query. * * @return true if the UDF is deterministic */ boolean deterministic() default true; {noformat} Based on the definition of UDFType, when a Hive UDF's children are non-deterministic, the Hive UDF is also non-deterministic. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
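The rule this issue asks for can be stated as a recursive property: an expression is deterministic only if its own flag is set and every child expression is deterministic, so even a UDF declared deterministic in UDFType becomes non-deterministic over a non-deterministic argument. A minimal sketch (class names are illustrative, not Hive's or Spark's actual classes):

```python
# Sketch of determinism propagation through an expression tree (illustrative
# only): a node is deterministic only if its own flag is set AND every child
# is deterministic.

class Expr:
    def __init__(self, self_deterministic=True, children=()):
        self.self_deterministic = self_deterministic
        self.children = list(children)

    @property
    def deterministic(self):
        return self.self_deterministic and all(
            c.deterministic for c in self.children)

rand = Expr(self_deterministic=False)   # e.g. rand()
col = Expr()                            # plain column reference
udf_over_col = Expr(children=[col])     # deterministic UDF, stable input
udf_over_rand = Expr(children=[rand])   # deterministic UDF, random input

print(udf_over_col.deterministic)   # True
print(udf_over_rand.deterministic)  # False
```

Without the child check, an optimizer would be free to collapse or reorder `udf_over_rand` as if repeated evaluation gave the same result, which is exactly the class of wrong optimizations the UDFType javadoc warns about.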
[jira] [Created] (SPARK-20334) Return a better error message when correlated predicates contain aggregate expression that has mixture of outer and local references
Dilip Biswal created SPARK-20334: Summary: Return a better error message when correlated predicates contain aggregate expression that has mixture of outer and local references Key: SPARK-20334 URL: https://issues.apache.org/jira/browse/SPARK-20334 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.1.0 Reporter: Dilip Biswal Priority: Minor Currently subqueries with correlated predicates containing aggregate expression having mixture of outer references and local references generate a code gen error like following : {code:java} Cannot evaluate expression: min((input[0, int, false] + input[4, int, false])) at org.apache.spark.sql.catalyst.expressions.Unevaluable$class.doGenCode(Expression.scala:226) at org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression.doGenCode(interfaces.scala:87) at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:106) at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:103) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:103) at org.apache.spark.sql.catalyst.expressions.BinaryExpression.nullSafeCodeGen(Expression.scala:461) at org.apache.spark.sql.catalyst.expressions.BinaryExpression.defineCodeGen(Expression.scala:443) at org.apache.spark.sql.catalyst.expressions.BinaryComparison.doGenCode(predicates.scala:431) at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:106) at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:103) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:103) {code} We should catch this situation and return a better error message to the user. 
-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20333) Fix HashPartitioner in DAGSchedulerSuite
[ https://issues.apache.org/jira/browse/SPARK-20333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jin xing updated SPARK-20333: - Description: In test "don't submit stage until its dependencies map outputs are registered (SPARK-5259)" "run trivial shuffle with out-of-band executor failure and retry" "reduce tasks should be placed locally with map output" HashPartitioner should be compatible with num of child RDD's partitions. was:In test "don't submit stage until its dependencies map outputs are registered (SPARK-5259)", HashPartitioner should be compatible with num of child RDD's partitions. > Fix HashPartitioner in DAGSchedulerSuite > > > Key: SPARK-20333 > URL: https://issues.apache.org/jira/browse/SPARK-20333 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: jin xing >Priority: Minor > > In test > "don't submit stage until its dependencies map outputs are registered > (SPARK-5259)" > "run trivial shuffle with out-of-band executor failure and retry" > "reduce tasks should be placed locally with map output" > HashPartitioner should be compatible with num of child RDD's partitions. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20333) Fix HashPartitioner in DAGSchedulerSuite
[ https://issues.apache.org/jira/browse/SPARK-20333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20333: Assignee: Apache Spark > Fix HashPartitioner in DAGSchedulerSuite > > > Key: SPARK-20333 > URL: https://issues.apache.org/jira/browse/SPARK-20333 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: jin xing >Assignee: Apache Spark >Priority: Minor > > In test "don't submit stage until its dependencies map outputs are registered > (SPARK-5259)", HashPartitioner should be compatible with num of child RDD's > partitions. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20333) Fix HashPartitioner in DAGSchedulerSuite
[ https://issues.apache.org/jira/browse/SPARK-20333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20333: Assignee: (was: Apache Spark) > Fix HashPartitioner in DAGSchedulerSuite > > > Key: SPARK-20333 > URL: https://issues.apache.org/jira/browse/SPARK-20333 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: jin xing >Priority: Minor > > In test "don't submit stage until its dependencies map outputs are registered > (SPARK-5259)", HashPartitioner should be compatible with num of child RDD's > partitions. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20333) Fix HashPartitioner in DAGSchedulerSuite
[ https://issues.apache.org/jira/browse/SPARK-20333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15968576#comment-15968576 ] Apache Spark commented on SPARK-20333: -- User 'jinxing64' has created a pull request for this issue: https://github.com/apache/spark/pull/17634 > Fix HashPartitioner in DAGSchedulerSuite > > > Key: SPARK-20333 > URL: https://issues.apache.org/jira/browse/SPARK-20333 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: jin xing >Priority: Minor > > In test "don't submit stage until its dependencies map outputs are registered > (SPARK-5259)", HashPartitioner should be compatible with num of child RDD's > partitions. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20333) Fix HashPartitioner in DAGSchedulerSuite
[ https://issues.apache.org/jira/browse/SPARK-20333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jin xing updated SPARK-20333: - Description: In test "don't submit stage until its dependencies map outputs are registered (SPARK-5259)", HashPartitioner should be compatible with num of child RDD's partitions. > Fix HashPartitioner in DAGSchedulerSuite > > > Key: SPARK-20333 > URL: https://issues.apache.org/jira/browse/SPARK-20333 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: jin xing >Priority: Minor > > In test "don't submit stage until its dependencies map outputs are registered > (SPARK-5259)", HashPartitioner should be compatible with num of child RDD's > partitions. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20333) Fix HashPartitioner in DAGSchedulerSuite
jin xing created SPARK-20333: Summary: Fix HashPartitioner in DAGSchedulerSuite Key: SPARK-20333 URL: https://issues.apache.org/jira/browse/SPARK-20333 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.1.0 Reporter: jin xing Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
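The invariant behind the fix: a hash partitioner routes each key to a partition id in [0, numPartitions), so a partitioner built in a test must use the same partition count as the child RDD, or some partitions never receive the records the test expects. A rough sketch of the placement rule in Python (mirroring, not reproducing, Spark's Scala HashPartitioner):

```python
# Sketch of a HashPartitioner-style placement rule (illustrative, not the
# Scala implementation): partition = nonNegativeMod(hash(key), numPartitions).

def non_negative_mod(x, mod):
    # Mirrors the spirit of Spark's Utils.nonNegativeMod: result in [0, mod).
    raw = x % mod
    return raw + mod if raw < 0 else raw

def partition_for(key, num_partitions):
    return non_negative_mod(hash(key), num_partitions)

# A partitioner built for 2 partitions can only emit ids 0..1. If the child
# RDD actually has 3 partitions, partition 2 never receives data -- the
# mismatch the DAGSchedulerSuite fix guards against.
ids = {partition_for(k, 2) for k in ("a", "b", "c", "d", "e")}
print(ids.issubset({0, 1}))  # True
```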
[jira] [Commented] (SPARK-16742) Kerberos support for Spark on Mesos
[ https://issues.apache.org/jira/browse/SPARK-16742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15968545#comment-15968545 ] Saisai Shao commented on SPARK-16742: - [~mgummelt], do you have a design doc of the kerberos support for Spark on Mesos, so that my work of SPARK-19143 could be based on yours. > Kerberos support for Spark on Mesos > --- > > Key: SPARK-16742 > URL: https://issues.apache.org/jira/browse/SPARK-16742 > Project: Spark > Issue Type: New Feature > Components: Mesos >Reporter: Michael Gummelt > > We at Mesosphere have written Kerberos support for Spark on Mesos. We'll be > contributing it to Apache Spark soon. > Mesosphere design doc: > https://docs.google.com/document/d/1xyzICg7SIaugCEcB4w1vBWp24UDkyJ1Pyt2jtnREFqc/edit#heading=h.tdnq7wilqrj6 > Mesosphere code: > https://github.com/mesosphere/spark/commit/73ba2ab8d97510d5475ef9a48c673ce34f7173fa -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20332) Avro/Parquet GenericFixed decimal is not read into Spark correctly
Justin Pihony created SPARK-20332: - Summary: Avro/Parquet GenericFixed decimal is not read into Spark correctly Key: SPARK-20332 URL: https://issues.apache.org/jira/browse/SPARK-20332 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.6.0 Reporter: Justin Pihony Priority: Minor Take the following code: spark-shell --packages org.apache.avro:avro:1.8.1 import org.apache.avro.{Conversions, LogicalTypes, Schema} import java.math.BigDecimal val dc = new Conversions.DecimalConversion() val javaBD = BigDecimal.valueOf(643.85924958) val schema = Schema.parse("{\"type\":\"record\",\"name\":\"Header\",\"namespace\":\"org.apache.avro.file\",\"fields\":["+ "{\"name\":\"COLUMN\",\"type\":[\"null\",{\"type\":\"fixed\",\"name\":\"COLUMN\","+ "\"size\":19,\"precision\":17,\"scale\":8,\"logicalType\":\"decimal\"}]}]}") val schemaDec = schema.getField("COLUMN").schema() val fieldSchema = if(schemaDec.getType() == Schema.Type.UNION) schemaDec.getTypes.get(1) else schemaDec val converted = dc.toFixed(javaBD, fieldSchema, LogicalTypes.decimal(javaBD.precision, javaBD.scale)) sqlContext.createDataFrame(List(("value",converted))) and you'll get this error: java.lang.UnsupportedOperationException: Schema for type org.apache.avro.generic.GenericFixed is not supported However if you write out a parquet file using the AvroParquetWriter and the above GenericFixed value (converted), then read it in via the DataFrameReader the decimal value that is retrieved is not accurate (ie. 643... above is listed as -0.5...) Even if not supported, is there any way to at least have it throw an UnsupportedOperationException as it does when you try to do it directly (as compared to read in from a file) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
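For context on the byte layout involved: an Avro fixed decimal stores the unscaled value as a big-endian two's-complement integer padded to the fixed size, and a reader that gets the sign extension or padding wrong produces garbage of the kind reported above (643... read back as -0.5...). A small sketch of the round trip in plain Python, independent of the Avro and Spark readers:

```python
# Round-trip of a fixed-size decimal as big-endian two's complement,
# matching the schema in the report (size 19, precision 17, scale 8).
# Illustrative only -- not the Avro library or Spark's Parquet reader.
from decimal import Decimal

def encode_fixed_decimal(value: Decimal, size: int, scale: int) -> bytes:
    unscaled = int(value.scaleb(scale))
    return unscaled.to_bytes(size, byteorder="big", signed=True)

def decode_fixed_decimal(raw: bytes, scale: int) -> Decimal:
    # Misreading these bytes as unsigned, or mis-handling the padding,
    # is the kind of error that yields a wrong value like -0.5...
    unscaled = int.from_bytes(raw, byteorder="big", signed=True)
    return Decimal(unscaled).scaleb(-scale)

original = Decimal("643.85924958")
raw = encode_fixed_decimal(original, size=19, scale=8)
print(decode_fixed_decimal(raw, scale=8))  # 643.85924958
```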
[jira] [Resolved] (SPARK-20293) In the page of 'jobs' or 'stages' of history server web ui, click the 'Go' button, query paging data, the page error
[ https://issues.apache.org/jira/browse/SPARK-20293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Bozarth resolved SPARK-20293. -- Resolution: Duplicate Already fixed in master and branch-2.1 > In the page of 'jobs' or 'stages' of history server web ui,,click the 'Go' > button, query paging data, the page error > - > > Key: SPARK-20293 > URL: https://issues.apache.org/jira/browse/SPARK-20293 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.1.0 >Reporter: guoxiaolongzte > Attachments: error1.png, error2.png, jobs.png, stages.png > > > In the page of 'jobs' or 'stages' of history server web ui, > Click on the 'Go' button, query paging data, the page error, function can not > be used. > The reasons are as follows: > '#' Was escaped by the browser as% 23. > & CompletedStage.desc = true% 23completed, the parameter value desc becomes = > true% 23, causing the page to report an error. The error is as follows: > HTTP ERROR 400 > Problem Access / history / app-20170411132432-0004 / stages /. Reason: > For input string: "true # completed" > Powered by Jetty: // > The amendments are as follows: > The URL of the accessed URL is escaped to ensure that the URL is not escaped > by the browser. > please see attachment. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
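The mechanics of the bug are plain URL encoding: a literal '#' ends the query string and starts the fragment, so it must be percent-encoded as %23 before being placed in a parameter value. Illustrated with Python's urllib (the parameter name and value are taken from the report; the host is a placeholder):

```python
# '#' inside a query value must be percent-encoded; left raw, everything
# after it parses as a URL fragment and the server sees a truncated value.
from urllib.parse import quote, urlsplit

value = "true#completed"

# Percent-encoding keeps the '#' inside the query value:
print(quote(value, safe=""))  # true%23completed

# Unescaped, the value is split at '#' -- the "true # completed" error above.
url = "http://host/history/app/stages/?completedStage.desc=" + value
parts = urlsplit(url)
print(parts.query)     # completedStage.desc=true
print(parts.fragment)  # completed
```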
[jira] [Commented] (SPARK-20328) HadoopRDDs create a MapReduce JobConf, but are not MapReduce jobs
[ https://issues.apache.org/jira/browse/SPARK-20328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15968474#comment-15968474 ] Marcelo Vanzin commented on SPARK-20328: bq. Since the driver is authenticated, it can request further delegation tokens No. To create a delegation token you need a TGT. You can't create a delegation token just with an existing delegation token. If that were possible, all the shenanigans to distribute the user's keytab for long running applications wouldn't be needed. > HadoopRDDs create a MapReduce JobConf, but are not MapReduce jobs > - > > Key: SPARK-20328 > URL: https://issues.apache.org/jira/browse/SPARK-20328 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0, 2.1.1, 2.1.2 >Reporter: Michael Gummelt > > In order to obtain {{InputSplit}} information, {{HadoopRDD}} creates a > MapReduce {{JobConf}} out of the Hadoop {{Configuration}}: > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L138 > Semantically, this is a problem because a HadoopRDD does not represent a > Hadoop MapReduce job. Practically, this is a problem because this line: > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L194 > results in this MapReduce-specific security code being called: > https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/security/TokenCache.java#L130, > which assumes the MapReduce master is configured (e.g. via > {{yarn.resourcemanager.*}}). If it isn't, an exception is thrown. 
> So I'm seeing this exception thrown as I'm trying to add Kerberos support for > the Spark Mesos scheduler: > {code} > Exception in thread "main" java.io.IOException: Can't get Master Kerberos > principal for use as renewer > at > org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:116) > at > org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:100) > at > org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodes(TokenCache.java:80) > at > org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:205) > at > org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313) > at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202) > {code} > I have a workaround where I set a YARN-specific configuration variable to > trick {{TokenCache}} into thinking YARN is configured, but this is obviously > suboptimal. > The proper fix to this would likely require significant {{hadoop}} > refactoring to make split information available without going through > {{JobConf}}, so I'm not yet sure what the best course of action is. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20293) In the page of 'jobs' or 'stages' of history server web ui, click the 'Go' button, query paging data, the page error
[ https://issues.apache.org/jira/browse/SPARK-20293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15968473#comment-15968473 ] Hyukjin Kwon commented on SPARK-20293: -- I can't reproduce in the current master. See - https://github.com/apache/spark/pull/17608#issuecomment-294059200 > In the page of 'jobs' or 'stages' of history server web ui,,click the 'Go' > button, query paging data, the page error > - > > Key: SPARK-20293 > URL: https://issues.apache.org/jira/browse/SPARK-20293 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.1.0 >Reporter: guoxiaolongzte > Attachments: error1.png, error2.png, jobs.png, stages.png > > > In the page of 'jobs' or 'stages' of history server web ui, > Click on the 'Go' button, query paging data, the page error, function can not > be used. > The reasons are as follows: > '#' Was escaped by the browser as% 23. > & CompletedStage.desc = true% 23completed, the parameter value desc becomes = > true% 23, causing the page to report an error. The error is as follows: > HTTP ERROR 400 > Problem Access / history / app-20170411132432-0004 / stages /. Reason: > For input string: "true # completed" > Powered by Jetty: // > The amendments are as follows: > The URL of the accessed URL is escaped to ensure that the URL is not escaped > by the browser. > please see attachment. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20328) HadoopRDDs create a MapReduce JobConf, but are not MapReduce jobs
[ https://issues.apache.org/jira/browse/SPARK-20328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15968469#comment-15968469 ] Michael Gummelt commented on SPARK-20328: - bq. I have no idea what that means. I'm pretty sure a delegation token is just another way for a subject to authenticate. So the driver uses the delegation token provided to it by {{spark-submit}} to authenticate. This is what I mean by "driver is already logged in via the delegation token". Since the driver is authenticated, it can request further delegation tokens. But my point is that it shouldn't need to, because that code is not "delegating" the tokens to any other process, which is the only thing delegation tokens are needed for. But this is neither here nor there. I think I know what I have to do. > HadoopRDDs create a MapReduce JobConf, but are not MapReduce jobs > - > > Key: SPARK-20328 > URL: https://issues.apache.org/jira/browse/SPARK-20328 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0, 2.1.1, 2.1.2 >Reporter: Michael Gummelt > > In order to obtain {{InputSplit}} information, {{HadoopRDD}} creates a > MapReduce {{JobConf}} out of the Hadoop {{Configuration}}: > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L138 > Semantically, this is a problem because a HadoopRDD does not represent a > Hadoop MapReduce job. Practically, this is a problem because this line: > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L194 > results in this MapReduce-specific security code being called: > https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/security/TokenCache.java#L130, > which assumes the MapReduce master is configured (e.g. via > {{yarn.resourcemanager.*}}). If it isn't, an exception is thrown. 
> So I'm seeing this exception thrown as I'm trying to add Kerberos support for > the Spark Mesos scheduler: > {code} > Exception in thread "main" java.io.IOException: Can't get Master Kerberos > principal for use as renewer > at > org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:116) > at > org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:100) > at > org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodes(TokenCache.java:80) > at > org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:205) > at > org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313) > at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202) > {code} > I have a workaround where I set a YARN-specific configuration variable to > trick {{TokenCache}} into thinking YARN is configured, but this is obviously > suboptimal. > The proper fix to this would likely require significant {{hadoop}} > refactoring to make split information available without going through > {{JobConf}}, so I'm not yet sure what the best course of action is. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20328) HadoopRDDs create a MapReduce JobConf, but are not MapReduce jobs
[ https://issues.apache.org/jira/browse/SPARK-20328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15968440#comment-15968440 ] Marcelo Vanzin commented on SPARK-20328: bq. that the driver is already logged in via the delegation token I have no idea what that means. Yes, "spark-submit" has a TGT. It uses it to login to e.g. HDFS and generate delegation tokens. But in this situation, the *driver* has no TGT; it only has the delegation token generated by the spark-submit process, which has no communication with the driver. So when the driver calls into that code you linked that tries to fetch delegation tokens, it should fail. But it doesn't. Which tells me the code detects whether there is already a valid delegation token and, in that case, doesn't try to create a new one. So what I'm saying is that if the above is correct, all you have to do is create delegation tokens yourself when the Mesos backend initializes (i.e. before any HadoopRDD code is run), and you'll avoid the issue with setting the configuration options. That's something you'd have to do anyway, because delegation tokens are needed by the executors to talk to the data nodes.
[jira] [Assigned] (SPARK-20331) Broaden support for Hive partition pruning predicate pushdown
[ https://issues.apache.org/jira/browse/SPARK-20331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20331: Assignee: Apache Spark > Broaden support for Hive partition pruning predicate pushdown > - > > Key: SPARK-20331 > URL: https://issues.apache.org/jira/browse/SPARK-20331 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Michael Allman >Assignee: Apache Spark > > Spark 2.1 introduced scalable support for Hive tables with huge numbers of > partitions. Key to leveraging this support is the ability to prune > unnecessary table partitions to answer queries. Spark supports a subset of > the class of partition pruning predicates that the Hive metastore supports. > If a user writes a query with a partition pruning predicate that is *not* > supported by Spark, Spark falls back to loading all partitions and pruning > client-side. We want to broaden Spark's current partition pruning predicate > pushdown capabilities. > One of the key missing capabilities is support for disjunctions. For example, > for a table partitioned by date, specifying a predicate like > {code}date = 20161011 or date = 20161014{code} > will result in Spark fetching all partitions. For a table partitioned by date > and hour, querying a range of hours across dates can be quite difficult to > accomplish without fetching all partition metadata. > The current partition pruning support covers only comparisons against > literals. We can expand that to foldable expressions by evaluating them at > planning time. > We can also implement support for the "IN" comparison by expanding it to a > sequence of "OR"s. > This ticket covers those enhancements.
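The proposed "IN" rewrite can be sketched with a tiny stand-in expression tree. This is a hypothetical mini-AST built only for illustration; it is not Catalyst's actual {{Expression}} hierarchy, and it assumes a non-empty value list.

```scala
// Hypothetical mini-AST standing in for Catalyst expressions, used only to
// illustrate the proposed rewrite of IN into a chain of ORed equalities.
object InExpansion {
  sealed trait Expr
  case class Attr(name: String) extends Expr
  case class Lit(value: String) extends Expr
  case class EqualTo(left: Expr, right: Expr) extends Expr
  case class Or(left: Expr, right: Expr) extends Expr
  case class In(attr: Attr, values: Seq[Lit]) extends Expr

  // attr IN (v1, v2, ...) becomes attr = v1 OR attr = v2 OR ...
  // (assumes values is non-empty; reduceLeft throws on an empty list)
  def expandIn(in: In): Expr =
    in.values.map(v => EqualTo(in.attr, v): Expr).reduceLeft(Or(_, _))

  // Render the tree in a metastore-style filter string, as in the example
  // above; the exact filter syntax accepted by Hive is not modeled here.
  def render(e: Expr): String = e match {
    case Attr(n)       => n
    case Lit(v)        => v
    case EqualTo(l, r) => s"${render(l)} = ${render(r)}"
    case Or(l, r)      => s"${render(l)} or ${render(r)}"
    case i: In         => render(expandIn(i))
  }
}
```

For the date example from the description, {{render(expandIn(In(Attr("date"), Seq(Lit("20161011"), Lit("20161014")))))}} produces the disjunction form that could then be pushed to the metastore instead of fetching all partitions.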
[jira] [Assigned] (SPARK-20331) Broaden support for Hive partition pruning predicate pushdown
[ https://issues.apache.org/jira/browse/SPARK-20331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20331: Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-20331) Broaden support for Hive partition pruning predicate pushdown
[ https://issues.apache.org/jira/browse/SPARK-20331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15968439#comment-15968439 ] Apache Spark commented on SPARK-20331: -- User 'mallman' has created a pull request for this issue: https://github.com/apache/spark/pull/17633
[jira] [Updated] (SPARK-20331) Broaden support for Hive partition pruning predicate pushdown
[ https://issues.apache.org/jira/browse/SPARK-20331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Allman updated SPARK-20331: --- Description: Spark 2.1 introduced scalable support for Hive tables with huge numbers of partitions. Key to leveraging this support is the ability to prune unnecessary table partitions to answer queries. Spark supports a subset of the class of partition pruning predicates that the Hive metastore supports. If a user writes a query with a partition pruning predicate that is *not* supported by Spark, Spark falls back to loading all partitions and pruning client-side. We want to broaden Spark's current partition pruning predicate pushdown capabilities. One of the key missing capabilities is support for disjunctions. For example, for a table partitioned by date, specifying a predicate like {code}date = 20161011 or date = 20161014{code} will result in Spark fetching all partitions. For a table partitioned by date and hour, querying a range of hours across dates can be quite difficult to accomplish without fetching all partition metadata. The current partition pruning support covers only comparisons against literals. We can expand that to foldable expressions by evaluating them at planning time. We can also implement support for the "IN" comparison by expanding it to a sequence of "OR"s. This ticket covers those enhancements. was: Spark 2.1 introduced scalable support for Hive tables with huge numbers of partitions. Key to leveraging this support is the ability to prune unnecessary table partitions to answer queries. Spark supports a subset of the class of partition pruning predicates that the Hive metastore supports. If a user writes a query with a partition pruning predicate that is *not* supported by Spark, Spark falls back to loading all partitions and pruning client-side. We want to broaden Spark's current partition pruning predicate pushdown capabilities.
One of the key missing capabilities is support for disjunctions. For example, for a table partitioned by date, specifying a predicate like {code}date = 20161011 or date = 20161014{code} will result in Spark fetching all partitions. For a table partitioned by date and hour, querying a range of hours across dates can be quite difficult to accomplish without fetching all partition metadata. The current partition pruning support covers only comparisons against literals. We can expand that to foldable expressions by evaluating them at planning time. We can also implement support for the "IN" comparison by expanding it to a sequence of "OR"s. This ticket covers those enhancements. A PR will follow.
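The other extension proposed for this ticket, evaluating foldable expressions at planning time so they reduce to the literals the existing pushdown path already handles, can be sketched the same way. Again this is a hypothetical mini-AST for illustration, not Catalyst's actual foldable/ConstantFolding machinery.

```scala
// Hypothetical mini-AST: an expression is "foldable" when it references no
// partition column, and folding such a subtree yields a plain literal that
// the literal-only pushdown path can then consume.
object PlanningTimeFold {
  sealed trait Expr
  case class Num(v: Long) extends Expr
  case class Col(name: String) extends Expr
  case class Add(l: Expr, r: Expr) extends Expr

  // Foldable = composed entirely of literals (no column references).
  def foldable(e: Expr): Boolean = e match {
    case Num(_)    => true
    case Col(_)    => false
    case Add(l, r) => foldable(l) && foldable(r)
  }

  // Replace every fully-literal subtree with the literal it evaluates to;
  // subtrees containing a column reference are left intact.
  def fold(e: Expr): Expr = e match {
    case Add(l, r) =>
      (fold(l), fold(r)) match {
        case (Num(a), Num(b)) => Num(a + b)
        case (fl, fr)         => Add(fl, fr)
      }
    case other => other
  }
}
```

Under this model a predicate such as {{date = 20161010 + 1}} folds to {{date = 20161011}} at planning time, which the current literal-comparison pushdown already supports.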
[jira] [Commented] (SPARK-20328) HadoopRDDs create a MapReduce JobConf, but are not MapReduce jobs
[ https://issues.apache.org/jira/browse/SPARK-20328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15968432#comment-15968432 ] Michael Gummelt commented on SPARK-20328: - bq. It depends. e.g. on YARN, when you submit in cluster mode, the driver is running in the cluster and all it has are delegation tokens. (The TGT is only available to the launcher process.) Right, but my understanding is that the driver is already logged in via the delegation token provided to it by the {{spark-submit}} process (via {{amContainer.setTokens}}), so it wouldn't need to then fetch further delegation tokens.
[jira] [Commented] (SPARK-20328) HadoopRDDs create a MapReduce JobConf, but are not MapReduce jobs
[ https://issues.apache.org/jira/browse/SPARK-20328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15968423#comment-15968423 ] Marcelo Vanzin commented on SPARK-20328: bq. But it shouldn't need delegation tokens at all, right? It depends. e.g. on YARN, when you submit in cluster mode, the driver is running in the cluster and all it has are delegation tokens. (The TGT is only available to the launcher process.) Actually it would be interesting to understand how that case works internally; because if that code is trying to generate delegation tokens, it should theoretically fail in the above scenario. So maybe it doesn't generate tokens if they're already there, and that could be a workaround for your case too.
[jira] [Comment Edited] (SPARK-20328) HadoopRDDs create a MapReduce JobConf, but are not MapReduce jobs
[ https://issues.apache.org/jira/browse/SPARK-20328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15968416#comment-15968416 ] Michael Gummelt edited comment on SPARK-20328 at 4/14/17 12:02 AM: --- bq. It shouldn't need to do it not for the reasons you mention, but because Spark already has the necessary credentials available (either a TGT, or a valid delegation token for HDFS). But it shouldn't need delegation tokens at all, right? The authentication of the currently logged in user, whether it be through the OS or through Kerberos, should be sufficient. was (Author: mgummelt): bq. It shouldn't need to do it not for the reasons you mention, but because Spark already has the necessary credentials available (either a TGT, or a valid delegation token for HDFS). But it shouldn't need delegation tokens at all, right?
[jira] [Comment Edited] (SPARK-20328) HadoopRDDs create a MapReduce JobConf, but are not MapReduce jobs
[ https://issues.apache.org/jira/browse/SPARK-20328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15968411#comment-15968411 ] Michael Gummelt edited comment on SPARK-20328 at 4/13/17 11:59 PM: --- bq. The Mesos backend (I mean the code in Spark, not the Mesos service) can set the configs in the SparkContext's "hadoopConfiguration" object, can't it? I suppose this would work. It would rely on the assumption that the Mesos scheduler backend is started before the HadoopRDD is created, which happens to be true, but ideally we wouldn't have to rely on that ordering. Right now I'm just setting it in {{SparkSubmit}}, but that's not great either. I filed a Hadoop ticket for the {{FileInputFormat}} issue and linked it here. was (Author: mgummelt): > The Mesos backend (I mean the code in Spark, not the Mesos service) can set > the configs in the SparkContext's "hadoopConfiguration" object, can't it? I suppose this would work. It would rely on the assumption that the Mesos scheduler backend is started before the HadoopRDD is created, which happens to be true, but ideally we wouldn't have to rely on that ordering. Right now I'm just setting it in {{SparkSubmit}}, but that's not great either. I filed a Hadoop ticket for the {{FileInputFormat}} issue and linked it here.
[jira] [Comment Edited] (SPARK-20328) HadoopRDDs create a MapReduce JobConf, but are not MapReduce jobs
[ https://issues.apache.org/jira/browse/SPARK-20328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15968416#comment-15968416 ] Michael Gummelt edited comment on SPARK-20328 at 4/13/17 11:59 PM: --- bq. It shouldn't need to do it not for the reasons you mention, but because Spark already has the necessary credentials available (either a TGT, or a valid delegation token for HDFS). But it shouldn't need delegation tokens at all, right? was (Author: mgummelt): > It shouldn't need to do it not for the reasons you mention, but because Spark > already has the necessary credentials available (either a TGT, or a valid > delegation token for HDFS). But it shouldn't need delegation tokens at all, right?
[jira] [Commented] (SPARK-20328) HadoopRDDs create a MapReduce JobConf, but are not MapReduce jobs
[ https://issues.apache.org/jira/browse/SPARK-20328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15968416#comment-15968416 ] Michael Gummelt commented on SPARK-20328: - > It shouldn't need to do it not for the reasons you mention, but because Spark > already has the necessary credentials available (either a TGT, or a valid > delegation token for HDFS). But it shouldn't need delegation tokens at all, right?
[jira] [Commented] (SPARK-20328) HadoopRDDs create a MapReduce JobConf, but are not MapReduce jobs
[ https://issues.apache.org/jira/browse/SPARK-20328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15968411#comment-15968411 ] Michael Gummelt commented on SPARK-20328: - > The Mesos backend (I mean the code in Spark, not the Mesos service) can set > the configs in the SparkContext's "hadoopConfiguration" object, can't it? I suppose this would work. It would rely on the assumption that the Mesos scheduler backend is started before the HadoopRDD is created, which happens to be true, but ideally we wouldn't have to rely on that ordering. Right now I'm just setting it in {{SparkSubmit}}, but that's not great either. I filed a Hadoop ticket for the {{FileInputFormat}} issue and linked it here. 
[jira] [Commented] (SPARK-20328) HadoopRDDs create a MapReduce JobConf, but are not MapReduce jobs
[ https://issues.apache.org/jira/browse/SPARK-20328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15968396#comment-15968396 ] Marcelo Vanzin commented on SPARK-20328: bq. The problem can't be solved in the Mesos backend I meant setting the configs. The Mesos backend (I mean the code in Spark, not the Mesos service) can set the configs in the SparkContext's "hadoopConfiguration" object, can't it? Otherwise you'd be putting a burden on the user to have a proper Hadoop config around with those properties set. bq. is why in the world is FileInputFormat fetching delegation tokens That's actually a good question. It shouldn't need to do it, not for the reasons you mention, but because Spark already has the necessary credentials available (either a TGT, or a valid delegation token for HDFS). 
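The workaround discussed above hinges on how {{TokenCache}} picks a renewer: it looks up the configured master's Kerberos principal and fails fast when none is set. The sketch below is a self-contained approximation of that lookup, not Hadoop's actual code: a plain {{Map}} stands in for Hadoop's {{Configuration}}, and only the {{yarn.resourcemanager.principal}} key (the setting the workaround relies on) is consulted.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Approximates the renewer lookup behind TokenCache.obtainTokensForNamenodes:
// token fetching fails fast when no master principal is configured.
public class RenewerLookup {
    // A plain Map stands in for Hadoop's Configuration (simplification).
    static String getRenewer(Map<String, String> conf) throws IOException {
        String principal = conf.get("yarn.resourcemanager.principal");
        if (principal == null) {
            throw new IOException("Can't get Master Kerberos principal for use as renewer");
        }
        return principal;
    }

    public static void main(String[] args) throws IOException {
        Map<String, String> conf = new HashMap<>();
        try {
            getRenewer(conf); // no master configured -> the exception from this ticket
        } catch (IOException e) {
            System.out.println("without principal: " + e.getMessage());
        }
        // The workaround: set the YARN principal so the lookup succeeds
        // even though no ResourceManager is actually involved.
        conf.put("yarn.resourcemanager.principal", "spark/host@REALM");
        System.out.println("with principal: " + getRenewer(conf));
    }
}
```

Setting that one property (whether in {{SparkSubmit}} or on the context's {{hadoopConfiguration}}) is what "tricks" the lookup into succeeding.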
[jira] [Comment Edited] (SPARK-20328) HadoopRDDs create a MapReduce JobConf, but are not MapReduce jobs
[ https://issues.apache.org/jira/browse/SPARK-20328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15968388#comment-15968388 ] Michael Gummelt edited comment on SPARK-20328 at 4/13/17 11:27 PM: --- Hey [~vanzin], thanks for the response. Everything you said is correct, but I want to clarify one thing: > You just need to make the Mesos backend in Spark do that automatically for > the submitting user. The problem can't be solved in the Mesos backend. When I fetch delegation tokens for transmission to Executors in the Mesos backend, there's no problem. I can set whatever renewer I want. The problem is that there's a second location where delegation tokens are fetched: {{HadoopRDD}}. This is entirely separate from the fetching that the scheduler backends do (either Mesos or YARN). {{HadoopRDD}} tries to fetch split data, and ultimately calls into {{TokenCache}} in the hadoop library, which fetches delegation tokens with the renewer set to the YARN ResourceManager's principal: https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/FileInputFormat.java#L213. I'm currently solving this by setting that config var in {{SparkSubmit}}. The big question I have, which I suppose is more for the {{hadoop}} team, is why in the world is {{FileInputFormat}} fetching delegation tokens? AFAICT, they're not sending those tokens to any other process. They're just fetching split data directly from the Name Nodes, and there should be no delegation required. was (Author: mgummelt): Hey [~vanzin], thanks for the response. Everything you said is correct, but I want to clarify one thing: > You just need to make the Mesos backend in Spark do that automatically for > the submitting user. The problem can't be solved in the Mesos backend. When I fetch delegation tokens for transmission to Executors in the Mesos backend, there's no problem. 
I can set whatever renewer I want. The problem is that there's a second location where delegation tokens are fetched: {{HadoopRDD}}. This is entirely separate from the fetching that the scheduler backends do (either Mesos or YARN). {{HadoopRDD}} tries to fetch split data, and ultimately calls into {{TokenCache}} in the hadoop library, which fetches delegation tokens with the renewer set to the YARN ResourceManager's principal: https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/FileInputFormat.java#L213. I'm currently solving this by setting that config var in {{SparkSubmit}}. The big question I have, which I suppose is more for the {{hadoop}} team, is why in the world is {{FileInputFormat}} fetching delegation tokens. AFAICT, they're not sending those tokens to any other process. They're just fetching split data directly from the Name Nodes, and there should be no delegation required. 
[jira] [Closed] (SPARK-20330) CLONE - SparkContext.localProperties leaked
[ https://issues.apache.org/jira/browse/SPARK-20330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Oleg White closed SPARK-20330. -- Resolution: Duplicate > CLONE - SparkContext.localProperties leaked > --- > > Key: SPARK-20330 > URL: https://issues.apache.org/jira/browse/SPARK-20330 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: Oleg White >Priority: Minor > > I have a non-deterministic but quite reliable reproduction for a case where > {{spark.sql.execution.id}} is leaked. Operations then die with > {{spark.sql.execution.id is already set}}. These threads never recover > because there is nothing to unset {{spark.sql.execution.id}}. (It's not a > case of nested {{withNewExecutionId}} calls.) > I have figured out why this happens. We are within a {{withNewExecutionId}} > block. At some point we call back to user code. (In our case this is a custom > data source's {{buildScan}} method.) The user code calls > {{scala.concurrent.Await.result}}. Because our thread is a member of a > {{ForkJoinPool}} (this is a Play HTTP serving thread) {{Await.result}} starts > a new thread. {{SparkContext.localProperties}} is cloned for this thread and > then it's ready to serve an HTTP request. > The first thread then finishes waiting, finishes {{buildScan}}, and leaves > {{withNewExecutionId}}, clearing {{spark.sql.execution.id}} in the > {{finally}} block. All good. But some time later another HTTP request will be > served by the second thread. This thread is "poisoned" with a > {{spark.sql.execution.id}}. When it tries to use {{withNewExecutionId}} it > fails. > > I don't know who's at fault here. > - I don't like the {{ThreadLocal}} properties anyway. Why not create an > Execution object and let it wrap the operation? Then you could have two > executions in parallel on the same thread, and other stuff like that. It > would be much clearer than storing the execution ID in a kind-of-global > variable. 
> - Why do we have to inherit the {{ThreadLocal}} properties? I'm sure there > is a good reason, but this is essentially a bug-generator in my view. (It has > already generated https://issues.apache.org/jira/browse/SPARK-10563.) > - {{Await.result}} --- I never would have thought it starts threads. > - We probably shouldn't be calling {{Await.result}} inside {{buildScan}}. > - We probably shouldn't call Spark things from HTTP serving threads. > I'm not sure what could be done on the Spark side, but I thought I should > mention this interesting issue. For supporting evidence here is the stack > trace when {{localProperties}} is getting cloned. Its contents at that point > are: > {noformat} > {spark.sql.execution.id=0, spark.rdd.scope.noOverride=true, > spark.rdd.scope={"id":"4","name":"ExecutedCommand"}} > {noformat} > {noformat} > at org.apache.spark.SparkContext$$anon$2.childValue(SparkContext.scala:364) > [spark-assembly-1.6.0-hadoop2.4.0.jar:1.6.0] > at org.apache.spark.SparkContext$$anon$2.childValue(SparkContext.scala:362) > [spark-assembly-1.6.0-hadoop2.4.0.jar:1.6.0] > at java.lang.ThreadLocal$ThreadLocalMap.<init>(ThreadLocal.java:353) > [na:1.7.0_91] > at java.lang.ThreadLocal$ThreadLocalMap.<init>(ThreadLocal.java:261) > [na:1.7.0_91] > at java.lang.ThreadLocal.createInheritedMap(ThreadLocal.java:236) > [na:1.7.0_91] > at java.lang.Thread.init(Thread.java:416) [na:1.7.0_91] > at java.lang.Thread.init(Thread.java:349) [na:1.7.0_91] > at java.lang.Thread.<init>(Thread.java:508) [na:1.7.0_91] > at > scala.concurrent.forkjoin.ForkJoinWorkerThread.<init>(ForkJoinWorkerThread.java:48) > [org.scala-lang.scala-library-2.10.5.jar:na] > at > scala.concurrent.impl.ExecutionContextImpl$DefaultThreadFactory$$anon$2.<init>(ExecutionContextImpl.scala:42) > [org.scala-lang.scala-library-2.10.5.jar:na] > at > scala.concurrent.impl.ExecutionContextImpl$DefaultThreadFactory.newThread(ExecutionContextImpl.scala:42) > [org.scala-lang.scala-library-2.10.5.jar:na] > at >
scala.concurrent.forkjoin.ForkJoinPool.tryCompensate(ForkJoinPool.java:2341) > [org.scala-lang.scala-library-2.10.5.jar:na] > at > scala.concurrent.forkjoin.ForkJoinPool.managedBlock(ForkJoinPool.java:3638) > [org.scala-lang.scala-library-2.10.5.jar:na] > at > scala.concurrent.impl.ExecutionContextImpl$DefaultThreadFactory$$anon$2.blockOn(ExecutionContextImpl.scala:45) > [org.scala-lang.scala-library-2.10.5.jar:na] > at scala.concurrent.Await$.result(package.scala:107) > [org.scala-lang.scala-library-2.10.5.jar:na] > at >
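The cloning visible in the stack trace above is standard {{InheritableThreadLocal}} behavior: a child thread snapshots the parent's inheritable values at {{Thread}} construction time (via {{ThreadLocal.createInheritedMap}}). A minimal sketch of the "poisoning" mechanism follows; it is not Spark's actual {{SparkContext.localProperties}} code, just the same JDK mechanism with a hypothetical property name borrowed from the report.

```java
import java.util.Properties;

// A child thread snapshots inheritable thread-locals when it is constructed.
// If the parent later clears a property, the child keeps its stale copy --
// the same mechanism that leaves threads stuck with spark.sql.execution.id.
public class InheritedLocalLeak {
    static final InheritableThreadLocal<Properties> props =
        new InheritableThreadLocal<Properties>() {
            @Override
            protected Properties childValue(Properties parent) {
                return (Properties) parent.clone(); // clone on inheritance
            }
        };

    public static void main(String[] args) throws InterruptedException {
        props.set(new Properties());
        props.get().setProperty("spark.sql.execution.id", "0");

        final String[] seenInChild = new String[1];
        // Snapshot happens HERE, at construction, while the id is still set.
        Thread child = new Thread(() ->
            seenInChild[0] = props.get().getProperty("spark.sql.execution.id"));

        // Parent finishes its execution and clears the id...
        props.get().remove("spark.sql.execution.id");
        child.start(); // ...but the child still carries the inherited snapshot.
        child.join();
        System.out.println("child saw: " + seenInChild[0]); // prints "child saw: 0"
    }
}
```

This is why a ForkJoin pool thread spawned inside {{Await.result}} can come back later with a {{spark.sql.execution.id}} nobody will ever unset.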
[jira] [Created] (SPARK-20331) Broaden support for Hive partition pruning predicate pushdown
Michael Allman created SPARK-20331: -- Summary: Broaden support for Hive partition pruning predicate pushdown Key: SPARK-20331 URL: https://issues.apache.org/jira/browse/SPARK-20331 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.1.0 Reporter: Michael Allman Spark 2.1 introduced scalable support for Hive tables with huge numbers of partitions. Key to leveraging this support is the ability to prune unnecessary table partitions to answer queries. Spark supports a subset of the class of partition pruning predicates that the Hive metastore supports. If a user writes a query with a partition pruning predicate that is *not* supported by Spark, Spark falls back to loading all partitions and pruning client-side. We want to broaden Spark's current partition pruning predicate pushdown capabilities. One of the key missing capabilities is support for disjunctions. For example, for a table partitioned by date, specifying a predicate like {code}date = 20161011 or date = 20161014{code} will result in Spark fetching all partitions. For a table partitioned by date and hour, querying a range of hours across dates can be quite difficult to accomplish without fetching all partition metadata. Partition pruning currently supports only comparisons against literals. We can expand that to foldable expressions by evaluating them at planning time. We can also implement support for the "IN" comparison by expanding it to a sequence of "OR"s. This ticket covers those enhancements. A PR will follow. 
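The proposed IN-to-OR expansion can be sketched with a toy string-based rewriter. This is a hypothetical helper, not Catalyst's actual expression classes: the point is only that {{col IN (a, b)}} becomes a disjunction of equality comparisons, which fits a pushdown grammar limited to literal comparisons joined by AND/OR.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Toy predicate rewriter: expand an IN list into a chain of ORs so the
// predicate can be pushed to a metastore that only understands
// comparisons against literals and disjunctions.
public class InToOrRewrite {
    static String expandIn(String column, List<String> literals) {
        return literals.stream()
                .map(lit -> column + " = " + lit)
                .collect(Collectors.joining(" or ", "(", ")"));
    }

    public static void main(String[] args) {
        // date IN (20161011, 20161014) -> (date = 20161011 or date = 20161014)
        System.out.println(expandIn("date", Arrays.asList("20161011", "20161014")));
    }
}
```

Folding constant subexpressions at planning time would feed this same rewriter, since a foldable expression reduces to a literal before pushdown.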
[jira] [Commented] (SPARK-20328) HadoopRDDs create a MapReduce JobConf, but are not MapReduce jobs
[ https://issues.apache.org/jira/browse/SPARK-20328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15968343#comment-15968343 ] Marcelo Vanzin commented on SPARK-20328: Hmm... that seems related to delegation token support. Delegation tokens need a "renewer", and generally in YARN applications the renewer is the YARN service (IIRC), since it will take care of renewing delegation tokens submitted with your application (and cancel them after the application is done). In your case Mesos doesn't know about kerberos, so the user submitting the app needs to be the renewer; and, aside from this particular issue, you may need to add code to actually renew those tokens periodically (this is different from creating new tokens after their max life time). I don't think you'll find a different way around this from the one you have (setting the YARN configs). You just need to make the Mesos backend in Spark do that automatically for the submitting user. As far as the Hadoop library goes, you could try to open a bug so they can add an explicit option so that non-MR, non-YARN applications can set the renewer more easily. 
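On the point that a non-YARN deployment may have to renew tokens itself, a renewal loop can be sketched as below. This is a minimal, hedged sketch: the body of the scheduled task is a hypothetical placeholder where real code would call {{renew()}} on each cached token, and the interval would be derived from the token's actual expiry rather than hard-coded.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Sketch of periodic delegation-token renewal for a deployment where no
// resource manager renews tokens on the application's behalf.
public class TokenRenewalLoop {
    public static void main(String[] args) throws InterruptedException {
        long renewalIntervalMs = 50; // real code: a fraction of the token lifetime
        CountDownLatch renewedTwice = new CountDownLatch(2);
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(() -> {
            // hypothetical placeholder: iterate cached tokens and renew each
            renewedTwice.countDown();
        }, renewalIntervalMs, renewalIntervalMs, TimeUnit.MILLISECONDS);
        boolean ok = renewedTwice.await(5, TimeUnit.SECONDS);
        scheduler.shutdownNow();
        System.out.println("renewed at least twice: " + ok);
    }
}
```

Note this only extends a token's current validity; once a token hits its max lifetime, a new token must be fetched with fresh credentials, which is a separate mechanism as the comment above points out.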
[jira] [Updated] (SPARK-20328) HadoopRDDs create a MapReduce JobConf, but are not MapReduce jobs
[ https://issues.apache.org/jira/browse/SPARK-20328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Gummelt updated SPARK-20328: Description: In order to obtain {{InputSplit}} information, {{HadoopRDD}} creates a MapReduce {{JobConf}} out of the Hadoop {{Configuration}}: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L138 Semantically, this is a problem because a HadoopRDD does not represent a Hadoop MapReduce job. Practically, this is a problem because this line: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L194 results in this MapReduce-specific security code being called: https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/security/TokenCache.java#L130, which assumes the MapReduce master is configured (e.g. via {{yarn.resourcemanager.*}}). If it isn't, an exception is thrown. So I'm seeing this exception thrown as I'm trying to add Kerberos support for the Spark Mesos scheduler: {code} Exception in thread "main" java.io.IOException: Can't get Master Kerberos principal for use as renewer at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:116) at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:100) at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodes(TokenCache.java:80) at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:205) at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202) {code} I have a workaround where I set a YARN-specific configuration variable to trick {{TokenCache}} into thinking YARN is configured, but this is obviously suboptimal. 
The proper fix to this would likely require significant {{hadoop}} refactoring to make split information available without going through {{JobConf}}, so I'm not yet sure what the best course of action is. was: In order to obtain {{InputSplit}} information, {{HadoopRDD}} creates a MapReduce {{JobConf}} out of the Hadoop {{Configuration}}: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L138 Semantically, this is a problem because a HadoopRDD does not represent a Hadoop MapReduce job. Practically, this is a problem because this line: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L194 results in this MapReduce-specific security code being called: https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/security/TokenCache.java#L130, which assumes the MapReduce master is configured. If it isn't, an exception is thrown. So I'm seeing this exception thrown as I'm trying to add Kerberos support for the Spark Mesos scheduler: {code} Exception in thread "main" java.io.IOException: Can't get Master Kerberos principal for use as renewer at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:116) at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:100) at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodes(TokenCache.java:80) at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:205) at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202) {code} I have a workaround where I set a YARN-specific configuration variable to trick {{TokenCache}} into thinking YARN is configured, but this is obviously suboptimal. 
The proper fix to this would likely require significant {{hadoop}} refactoring to make split information available without going through {{JobConf}}, so I'm not yet sure what the best course of action is. > HadoopRDDs create a MapReduce JobConf, but are not MapReduce jobs > - > > Key: SPARK-20328 > URL: https://issues.apache.org/jira/browse/SPARK-20328 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0, 2.1.1, 2.1.2 >Reporter: Michael Gummelt > > In order to obtain {{InputSplit}} information, {{HadoopRDD}} creates a > MapReduce {{JobConf}} out of the Hadoop {{Configuration}}: > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L138 >
[jira] [Created] (SPARK-20330) CLONE - SparkContext.localProperties leaked
Oleg White created SPARK-20330: -- Summary: CLONE - SparkContext.localProperties leaked Key: SPARK-20330 URL: https://issues.apache.org/jira/browse/SPARK-20330 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.6.0 Reporter: Oleg White Priority: Minor I have a non-deterministic but quite reliable reproduction for a case where {{spark.sql.execution.id}} is leaked. Operations then die with {{spark.sql.execution.id is already set}}. These threads never recover because there is nothing to unset {{spark.sql.execution.id}}. (It's not a case of nested {{withNewExecutionId}} calls.) I have figured out why this happens. We are within a {{withNewExecutionId}} block. At some point we call back to user code. (In our case this is a custom data source's {{buildScan}} method.) The user code calls {{scala.concurrent.Await.result}}. Because our thread is a member of a {{ForkJoinPool}} (this is a Play HTTP serving thread) {{Await.result}} starts a new thread. {{SparkContext.localProperties}} is cloned for this thread and then it's ready to serve an HTTP request. The first thread then finishes waiting, finishes {{buildScan}}, and leaves {{withNewExecutionId}}, clearing {{spark.sql.execution.id}} in the {{finally}} block. All good. But some time later another HTTP request will be served by the second thread. This thread is "poisoned" with a {{spark.sql.execution.id}}. When it tries to use {{withNewExecutionId}} it fails. I don't know who's at fault here. - I don't like the {{ThreadLocal}} properties anyway. Why not create an Execution object and let it wrap the operation? Then you could have two executions in parallel on the same thread, and other stuff like that. It would be much clearer than storing the execution ID in a kind-of-global variable. - Why do we have to inherit the {{ThreadLocal}} properties? I'm sure there is a good reason, but this is essentially a bug-generator in my view. 
(It has already generated https://issues.apache.org/jira/browse/SPARK-10563.) - {{Await.result}} --- I never would have thought it starts threads. - We probably shouldn't be calling {{Await.result}} inside {{buildScan}}. - We probably shouldn't call Spark things from HTTP serving threads. I'm not sure what could be done on the Spark side, but I thought I should mention this interesting issue. For supporting evidence, here is the stack trace when {{localProperties}} is getting cloned. Its contents at that point are: {noformat} {spark.sql.execution.id=0, spark.rdd.scope.noOverride=true, spark.rdd.scope={"id":"4","name":"ExecutedCommand"}} {noformat} {noformat} at org.apache.spark.SparkContext$$anon$2.childValue(SparkContext.scala:364) [spark-assembly-1.6.0-hadoop2.4.0.jar:1.6.0] at org.apache.spark.SparkContext$$anon$2.childValue(SparkContext.scala:362) [spark-assembly-1.6.0-hadoop2.4.0.jar:1.6.0] at java.lang.ThreadLocal$ThreadLocalMap.(ThreadLocal.java:353) [na:1.7.0_91] at java.lang.ThreadLocal$ThreadLocalMap.(ThreadLocal.java:261) [na:1.7.0_91] at java.lang.ThreadLocal.createInheritedMap(ThreadLocal.java:236) [na:1.7.0_91] at java.lang.Thread.init(Thread.java:416) [na:1.7.0_91] at java.lang.Thread.init(Thread.java:349) [na:1.7.0_91] at java.lang.Thread.(Thread.java:508) [na:1.7.0_91] at scala.concurrent.forkjoin.ForkJoinWorkerThread.(ForkJoinWorkerThread.java:48) [org.scala-lang.scala-library-2.10.5.jar:na] at scala.concurrent.impl.ExecutionContextImpl$DefaultThreadFactory$$anon$2.(ExecutionContextImpl.scala:42) [org.scala-lang.scala-library-2.10.5.jar:na] at scala.concurrent.impl.ExecutionContextImpl$DefaultThreadFactory.newThread(ExecutionContextImpl.scala:42) [org.scala-lang.scala-library-2.10.5.jar:na] at scala.concurrent.forkjoin.ForkJoinPool.tryCompensate(ForkJoinPool.java:2341) [org.scala-lang.scala-library-2.10.5.jar:na] at scala.concurrent.forkjoin.ForkJoinPool.managedBlock(ForkJoinPool.java:3638) [org.scala-lang.scala-library-2.10.5.jar:na] at 
scala.concurrent.impl.ExecutionContextImpl$DefaultThreadFactory$$anon$2.blockOn(ExecutionContextImpl.scala:45) [org.scala-lang.scala-library-2.10.5.jar:na] at scala.concurrent.Await$.result(package.scala:107) [org.scala-lang.scala-library-2.10.5.jar:na] at com.lynxanalytics.biggraph.graph_api.SafeFuture.awaitResult(SafeFuture.scala:50) [biggraph.jar] at com.lynxanalytics.biggraph.graph_api.DataManager.get(DataManager.scala:315) [biggraph.jar] at com.lynxanalytics.biggraph.graph_api.Scripting$.getData(Scripting.scala:87) [biggraph.jar] at com.lynxanalytics.biggraph.table.TableRelation$$anonfun$1.apply(DefaultSource.scala:46)
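The inheritance mechanics described above can be simulated without Spark or the JVM. This is a hypothetical pure-Python sketch (the class and the spawn helper are illustrative, not Spark APIs): a child thread clones the parent's local properties at creation time, so clearing them in the parent's {{finally}} block does not un-poison the child.

```python
import threading

# Simulate SparkContext.localProperties: an InheritableThreadLocal whose
# value is *cloned* into any thread created while it is set.
class InheritableProps:
    def __init__(self):
        self._local = threading.local()

    @property
    def value(self):
        return getattr(self._local, "props", {})

    @value.setter
    def value(self, props):
        self._local.props = props

    def spawn(self, target):
        # Clone the parent's properties into the child at creation time,
        # analogous to ThreadLocal.createInheritedMap on the JVM.
        snapshot = dict(self.value)
        def wrapper():
            self.value = snapshot
            target()
        t = threading.Thread(target=wrapper)
        t.start()
        return t

props = InheritableProps()
seen_by_child = {}

props.value = {"spark.sql.execution.id": "0"}               # inside withNewExecutionId
t = props.spawn(lambda: seen_by_child.update(props.value))  # Await.result spawns a thread
props.value = {}                                            # finally block clears the id
t.join()

# The child is "poisoned": it still carries the execution id the parent cleared.
print(seen_by_child)  # {'spark.sql.execution.id': '0'}
```

This is why the second HTTP-serving thread later fails in {{withNewExecutionId}}: nothing ever clears the cloned property on that thread.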
[jira] [Created] (SPARK-20329) Resolution error when HAVING clause uses GROUP BY expression that involves implicit type coercion
Josh Rosen created SPARK-20329: -- Summary: Resolution error when HAVING clause uses GROUP BY expression that involves implicit type coercion Key: SPARK-20329 URL: https://issues.apache.org/jira/browse/SPARK-20329 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: Josh Rosen Priority: Blocker The following example runs without error on Spark 2.0.x and 2.1.x but fails in the current Spark master: {code} create temporary view foo (a, b) as values (cast(1 as bigint), 2), (cast(3 as bigint), 4); select a + b from foo group by a + b having (a + b) > 1 {code} The error is {code} Error in SQL statement: AnalysisException: cannot resolve '`a`' given input columns: [(a + CAST(b AS BIGINT))]; line 1 pos 45; 'Filter (('a + 'b) > 1) +- Aggregate [(a#249243L + cast(b#249244 as bigint))], [(a#249243L + cast(b#249244 as bigint)) AS (a + CAST(b AS BIGINT))#249246L] +- SubqueryAlias foo +- Project [col1#249241L AS a#249243L, col2#249242 AS b#249244] +- LocalRelation [col1#249241L, col2#249242] {code} I think what's happening here is that the implicit cast is breaking things: if we change the types so that both columns are integers then the analysis error disappears. Similarly, adding explicit casts, as in {code} select a + cast(b as bigint) from foo group by a + cast(b as bigint) having (a + cast(b as bigint)) > 1 {code} works so I'm pretty sure that the resolution problem is being introduced when the casts are automatically added by the type coercion rule. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16900) Complete-mode output for file sinks
[ https://issues.apache.org/jira/browse/SPARK-16900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-16900: - Component/s: (was: DStreams) Structured Streaming > Complete-mode output for file sinks > --- > > Key: SPARK-16900 > URL: https://issues.apache.org/jira/browse/SPARK-16900 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Reporter: Vladimir Feinberg > > Currently there is no way to checkpoint aggregations (see SPARK-16899), > except by using a custom foreach-based sink, which is pretty difficult and > requires that the user deal with ensuring idempotency, versioning, etc. 
[jira] [Updated] (SPARK-20312) query optimizer calls udf with null values when it doesn't expect them
[ https://issues.apache.org/jira/browse/SPARK-20312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-20312: - Component/s: (was: Spark Core) SQL > query optimizer calls udf with null values when it doesn't expect them > -- > > Key: SPARK-20312 > URL: https://issues.apache.org/jira/browse/SPARK-20312 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Albert Meltzer > > When optimizing an outer join, Spark passes an empty row to both sides to see > if nulls would be ignored (side comment: for half-outer joins it subsequently > ignores the assessment on the dominant side). > For some reason, a condition such as {{xx IS NOT NULL && udf(xx) IS NOT > NULL}} might result in the right side being checked first, causing an exception if the > udf doesn't expect a null input (even though the left-side null check appears first). > An example is similar to the following (see actual query plans separately): > {noformat} > def func(value: Any): Int = ... // returns AnyVal, which probably causes an IS > NOT NULL filter to be added on the result > val df1 = sparkSession > .table(...) > .select("col1", "col2") // LongType both > val df11 = df1 > .filter(df1("col1").isNotNull) > .withColumn("col3", functions.udf(func)(df1("col1"))) > .repartition(df1("col2")) > .sortWithinPartitions(df1("col2")) > val df2 = ... // load other data containing col2, similarly repartition and > sort > val df3 = > df1.join(df2, Seq("col2"), "left_outer") > {noformat} 
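One way to sidestep the ordering problem described above is to make the udf itself null-tolerant instead of relying on the {{IS NOT NULL}} conjunct being evaluated first. A hypothetical pure-Python sketch of the wrapper idea (the names are illustrative; this is not Spark's udf API):

```python
def null_safe(f):
    """Wrap a function so that a None input yields None instead of raising."""
    def wrapped(value):
        return None if value is None else f(value)
    return wrapped

def func(value):
    # Null-intolerant body, like the udf in the report: crashes on None.
    return value + 1

safe_func = null_safe(func)

# The optimizer may probe the expression with a null row first; the
# wrapped version survives that regardless of predicate ordering.
print(safe_func(None))  # None
print(safe_func(41))    # 42
```

The same pattern applies to a Scala udf body: guard against the null input inside the function rather than in a separate filter.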
[jira] [Assigned] (SPARK-18127) Add hooks and extension points to Spark
[ https://issues.apache.org/jira/browse/SPARK-18127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell reassigned SPARK-18127: - Assignee: Sameer Agarwal (was: Herman van Hovell) > Add hooks and extension points to Spark > --- > > Key: SPARK-18127 > URL: https://issues.apache.org/jira/browse/SPARK-18127 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Srinath >Assignee: Sameer Agarwal > > As a Spark user I want to be able to customize my spark session. I currently > want to be able to do the following things: > # I want to be able to add custom analyzer rules. This allows me to implement > my own logical constructs; an example of this could be a recursive operator. > # I want to be able to add my own analysis checks. This allows me to catch > problems with spark plans early on. An example of this can be some datasource > specific checks. > # I want to be able to add my own optimizations. This allows me to optimize > plans in different ways, for instance when you use a very different cluster > (for example a one-node X1 instance). This supersedes the current > {{spark.experimental}} methods. > # I want to be able to add my own planning strategies. This supersedes the > current {{spark.experimental}} methods. This allows me to plan my own > physical plan; an example of this would be to plan my own heavily integrated > data source (CarbonData for example). > # I want to be able to use my own customized SQL constructs. An example of > this would be supporting my own dialect, or being able to add constructs to the > current SQL language. I should not have to implement a complete parser, and > should be able to delegate to an underlying parser. > # I want to be able to track modifications and calls to the external catalog. > I want this API to be stable. This allows me to synchronize with other > systems. 
> This API should modify the SparkSession when the session gets started, and it > should NOT change the session in flight. 
[jira] [Commented] (SPARK-20321) Spark UI cannot be shutdown in spark streaming app
[ https://issues.apache.org/jira/browse/SPARK-20321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15968290#comment-15968290 ] Shixiong Zhu commented on SPARK-20321: -- You cannot stop a StreamingContext in foreachRDD. For safety, I would only stop SparkContext/StreamingContext in the main thread, or always launch a new thread to stop it. Stopping SparkContext/StreamingContext from Spark's threads is not supported and can easily cause a deadlock, because these methods usually wait until Spark's threads die. > Spark UI cannot be shutdown in spark streaming app > -- > > Key: SPARK-20321 > URL: https://issues.apache.org/jira/browse/SPARK-20321 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Andrei Taleanu > > When an exception thrown in the transform stage is handled in foreachRDD and the > streaming context is forced to stop, the SparkUI appears to hang and > continually dump the following logs in an infinite loop: > {noformat} > ... > 2017-04-12 14:11:13,470 > [SparkUI-50-selector-ServerConnectorManager@512d4583/1] DEBUG > org.spark_project.jetty.io.SelectorManager - Selector loop woken up from > select, 0/0 selected > 2017-04-12 14:11:13,470 > [SparkUI-50-selector-ServerConnectorManager@512d4583/1] DEBUG > org.spark_project.jetty.io.SelectorManager - Selector loop waiting on select > 2017-04-12 14:11:13,470 > [SparkUI-50-selector-ServerConnectorManager@512d4583/1] DEBUG > org.spark_project.jetty.io.SelectorManager - Selector loop woken up from > select, 0/0 selected > 2017-04-12 14:11:13,470 > [SparkUI-50-selector-ServerConnectorManager@512d4583/1] DEBUG > org.spark_project.jetty.io.SelectorManager - Selector loop waiting on select > ... 
> {noformat} > Unfortunately I don't have a minimal example that reproduces this issue but > here is what I can share: > {noformat} > val dstream = ... // pull data from kafka > val mapped = dstream.transform { rdd => > val data = getData // Perform a call that potentially throws an exception > // broadcast the data > // flatMap the RDD using the data > } > mapped.foreachRDD { rdd => > try { > // write some data in a DB > } catch { > case t: Throwable => > dstream.context.stop(stopSparkContext = true, stopGracefully = false) > } > } > mapped.foreachRDD { rdd => > try { > // write data to Kafka > // manually checkpoint the Kafka offsets (because I need them in JSON > format) > } catch { > case t: Throwable => > dstream.context.stop(stopSparkContext = true, stopGracefully = false) > } > } > {noformat} > The issue appears when stop is invoked. At the point when SparkUI is stopped, > it enters that infinite loop. Initially I thought it related to Jetty, as the > version used in SparkUI had some bugs (e.g. [this > one|https://bugs.eclipse.org/bugs/show_bug.cgi?id=452465]). I bumped Jetty to > a more recent version (March 2017) and built Spark 2.1.0 with that one but > still got the error. > I encountered this issue with Spark 2.1.0 built with Hadoop 2.6 on top of > Mesos. 
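The advice in the comment above — never stop the context from one of Spark's own threads; hand the stop off to a dedicated thread — can be sketched generically. A hypothetical Python sketch (the Service class is illustrative, not a Spark API) where stop() joins the worker thread and therefore must not be invoked from the worker itself:

```python
import threading

class Service:
    """Stand-in for a StreamingContext whose stop() waits for its worker threads."""
    def __init__(self):
        self._stop_requested = threading.Event()
        self._worker = threading.Thread(target=self._run)
        self._worker.start()

    def _run(self):
        # Stand-in for the streaming loop (or foreachRDD user code).
        self._stop_requested.wait()

    def stop(self):
        self._stop_requested.set()
        # stop() waits for the worker to die. Calling stop() *from* the
        # worker would make the thread wait for itself (a deadlock on the
        # JVM; Python's join() refuses with RuntimeError).
        self._worker.join()

service = Service()

# Safe pattern: the error handler launches a dedicated stopper thread
# rather than calling stop() inline on a Spark-owned thread.
stopper = threading.Thread(target=service.stop)
stopper.start()
stopper.join()
print("stopped cleanly")
```

In the reporter's code, the catch blocks inside foreachRDD would hand dstream.context.stop(...) to such a separate thread instead of calling it directly.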
[jira] [Updated] (SPARK-20328) HadoopRDDs create a MapReduce JobConf, but are not MapReduce jobs
[ https://issues.apache.org/jira/browse/SPARK-20328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Gummelt updated SPARK-20328: Description: In order to obtain {{InputSplit}} information, {{HadoopRDD}} creates a MapReduce {{JobConf}} out of the Hadoop {{Configuration}}: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L138 Semantically, this is a problem because a HadoopRDD does not represent a Hadoop MapReduce job. Practically, this is a problem because this line: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L194 results in this MapReduce-specific security code being called: https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/security/TokenCache.java#L130, which assumes the MapReduce master is configured. If it isn't, an exception is thrown. So I'm seeing this exception thrown as I'm trying to add Kerberos support for the Spark Mesos scheduler: {code} Exception in thread "main" java.io.IOException: Can't get Master Kerberos principal for use as renewer at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:116) at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:100) at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodes(TokenCache.java:80) at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:205) at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202) {code} I have a workaround where I set a YARN-specific configuration variable to trick {{TokenCache}} into thinking YARN is configured, but this is obviously suboptimal. 
The proper fix to this would likely require significant {{hadoop}} refactoring to make split information available without going through {{JobConf}}, so I'm not yet sure what the best course of action is. was: In order to obtain {{InputSplit}} information, {{HadoopRDD}} creates a MapReduce {{JobConf}} out of the Hadoop {{Configuration}}: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L138 Semantically, this is a problem because a HadoopRDD does not represent a Hadoop MapReduce job. Practically, this is a problem because this line: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L194 results in this MapReduce-specific security code being called: https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/security/TokenCache.java#L130, which assumes the MapReduce master is configured. If it isn't, an exception is thrown. So I'm seeing this exception thrown as I'm trying to add Kerberos support for the Spark Mesos scheduler. I have a workaround where I set a YARN-specific configuration variable to trick {{TokenCache}} into thinking YARN is configured, but this is obviously suboptimal. The proper fix to this would likely require significant {{hadoop}} refactoring to make split information available without going through {{JobConf}}, so I'm not yet sure what the best course of action is. 
> HadoopRDDs create a MapReduce JobConf, but are not MapReduce jobs > - > > Key: SPARK-20328 > URL: https://issues.apache.org/jira/browse/SPARK-20328 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0, 2.1.1, 2.1.2 >Reporter: Michael Gummelt > > In order to obtain {{InputSplit}} information, {{HadoopRDD}} creates a > MapReduce {{JobConf}} out of the Hadoop {{Configuration}}: > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L138 > Semantically, this is a problem because a HadoopRDD does not represent a > Hadoop MapReduce job. Practically, this is a problem because this line: > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L194 > results in this MapReduce-specific security code being called: > https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/security/TokenCache.java#L130, > which assumes the MapReduce master is configured. If it isn't, an exception > is thrown. > So I'm seeing this exception thrown as I'm trying to add Kerberos support for > the Spark Mesos scheduler: > {code} >
[jira] [Commented] (SPARK-20328) HadoopRDDs create a MapReduce JobConf, but are not MapReduce jobs
[ https://issues.apache.org/jira/browse/SPARK-20328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15968232#comment-15968232 ] Michael Gummelt commented on SPARK-20328: - cc [~colorant] [~hfeng] [~vanzin] > HadoopRDDs create a MapReduce JobConf, but are not MapReduce jobs > - > > Key: SPARK-20328 > URL: https://issues.apache.org/jira/browse/SPARK-20328 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0, 2.1.1, 2.1.2 >Reporter: Michael Gummelt > > In order to obtain {{InputSplit}} information, {{HadoopRDD}} creates a > MapReduce {{JobConf}} out of the Hadoop {{Configuration}}: > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L138 > Semantically, this is a problem because a HadoopRDD does not represent a > Hadoop MapReduce job. Practically, this is a problem because this line: > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L194 > results in this MapReduce-specific security code being called: > https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/security/TokenCache.java#L130, > which assumes the MapReduce master is configured. If it isn't, an exception > is thrown. > So I'm seeing this exception thrown as I'm trying to add Kerberos support for > the Spark Mesos scheduler. I have a workaround where I set a YARN-specific > configuration variable to trick {{TokenCache}} into thinking YARN is > configured, but this is obviously suboptimal. > The proper fix to this would likely require significant {{hadoop}} > refactoring to make split information available without going through > {{JobConf}}, so I'm not yet sure what the best course of action is. 
[jira] [Created] (SPARK-20328) HadoopRDDs create a MapReduce JobConf, but are not MapReduce jobs
Michael Gummelt created SPARK-20328: --- Summary: HadoopRDDs create a MapReduce JobConf, but are not MapReduce jobs Key: SPARK-20328 URL: https://issues.apache.org/jira/browse/SPARK-20328 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.1.0, 2.1.1, 2.1.2 Reporter: Michael Gummelt In order to obtain {{InputSplit}} information, {{HadoopRDD}} creates a MapReduce {{JobConf}} out of the Hadoop {{Configuration}}: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L138 Semantically, this is a problem because a HadoopRDD does not represent a Hadoop MapReduce job. Practically, this is a problem because this line: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L194 results in this MapReduce-specific security code being called: https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/security/TokenCache.java#L130, which assumes the MapReduce master is configured. If it isn't, an exception is thrown. So I'm seeing this exception thrown as I'm trying to add Kerberos support for the Spark Mesos scheduler. I have a workaround where I set a YARN-specific configuration variable to trick {{TokenCache}} into thinking YARN is configured, but this is obviously suboptimal. The proper fix to this would likely require significant {{hadoop}} refactoring to make split information available without going through {{JobConf}}, so I'm not yet sure what the best course of action is. 
[jira] [Assigned] (SPARK-20038) FileFormatWriter.ExecuteWriteTask.releaseResources() implementations to be re-entrant
[ https://issues.apache.org/jira/browse/SPARK-20038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid reassigned SPARK-20038: Assignee: Steve Loughran > FileFormatWriter.ExecuteWriteTask.releaseResources() implementations to be > re-entrant > - > > Key: SPARK-20038 > URL: https://issues.apache.org/jira/browse/SPARK-20038 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Steve Loughran >Assignee: Steve Loughran >Priority: Minor > Fix For: 2.2.0 > > > Both {{FileFormatWriter.ExecuteWriteTask.releaseResources()}} implementations > {{close()}} any non-null {{currentWriter}}, then set the field to null. > However, if the close() call throws an exception in the execution of > {{FileFormatWriter.executeTask}}, the exception handler will attempt to > clean up by calling {{releaseResources()}} again. Looking at the codepath, > this may cause {{committer.abortTask()}} to get skipped on failure. > This surfaces in SPARK-10109, and I've just seen it in HADOOP-14204; Parquet > seems to be in the trace as it NPEs the second time its {{close()}} method > is called. > Fix: always set {{currentWriter}} to null, 
[jira] [Resolved] (SPARK-20038) FileFormatWriter.ExecuteWriteTask.releaseResources() implementations to be re-entrant
[ https://issues.apache.org/jira/browse/SPARK-20038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid resolved SPARK-20038. -- Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 17364 [https://github.com/apache/spark/pull/17364] > FileFormatWriter.ExecuteWriteTask.releaseResources() implementations to be > re-entrant > - > > Key: SPARK-20038 > URL: https://issues.apache.org/jira/browse/SPARK-20038 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Steve Loughran >Priority: Minor > Fix For: 2.2.0 > > > Both {{FileFormatWriter.ExecuteWriteTask.releaseResources()}} implementations > {{close()}} any non-null {{currentWriter}}, then set the field to null. > However, if the close() call throws an exception in the execution of > {{FileFormatWriter.executeTask}}, the exception handler will attempt to > clean up by calling {{releaseResources()}} again. Looking at the codepath, > this may cause {{committer.abortTask()}} to get skipped on failure. > This surfaces in SPARK-10109, and I've just seen it in HADOOP-14204; Parquet > seems to be in the trace as it NPEs the second time its {{close()}} method > is called. > Fix: always set {{currentWriter}} to null, 
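The re-entrancy fix described in this issue — clear the field before closing, so a second {{releaseResources()}} call during exception handling is a harmless no-op — can be sketched outside Spark. A hypothetical Python sketch of the pattern (class names are illustrative):

```python
class WriteTask:
    def __init__(self, writer):
        self.current_writer = writer

    def release_resources(self):
        # Null the field *first*: if close() raises and the exception
        # handler calls release_resources() again, the second call is a
        # no-op instead of a double-close (the NPE in the report).
        writer, self.current_writer = self.current_writer, None
        if writer is not None:
            writer.close()

class FlakyWriter:
    def __init__(self):
        self.close_calls = 0

    def close(self):
        self.close_calls += 1
        raise IOError("flush failed")

writer = FlakyWriter()
task = WriteTask(writer)
try:
    task.release_resources()   # first call: close() throws
except IOError:
    task.release_resources()   # cleanup path: safe re-entry, no double-close
print(writer.close_calls)  # 1 -- closed exactly once
```

The key design choice is ordering: transfer ownership out of the field before performing the fallible close, so cleanup code can call the method unconditionally.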
[jira] [Commented] (SPARK-20327) Add CLI support for YARN-3926
[ https://issues.apache.org/jira/browse/SPARK-20327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15968149#comment-15968149 ] Mark Grover commented on SPARK-20327: - Daniel, we don't assign JIRAs in Spark. Folks issue a PR and once the PR gets merged, the committer will assign the JIRA to the contributor. > Add CLI support for YARN-3926 > - > > Key: SPARK-20327 > URL: https://issues.apache.org/jira/browse/SPARK-20327 > Project: Spark > Issue Type: Improvement > Components: Spark Shell, Spark Submit >Affects Versions: 2.1.0 >Reporter: Daniel Templeton > Labels: newbie > > YARN-3926 adds the ability for administrators to configure custom resources, > like GPUs. This JIRA is to add support to Spark for requesting resources > other than CPU virtual cores and memory. See YARN-3926. 
[jira] [Assigned] (SPARK-20232) Better combineByKey documentation: clarify memory allocation, better example
[ https://issues.apache.org/jira/browse/SPARK-20232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] holdenk reassigned SPARK-20232: --- Assignee: David Gingrich > Better combineByKey documentation: clarify memory allocation, better example > > > Key: SPARK-20232 > URL: https://issues.apache.org/jira/browse/SPARK-20232 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 2.1.0 > Environment: macOS Sierra 10.12.4 > Spark 2.1.0 installed via Homebrew >Reporter: David Gingrich >Assignee: David Gingrich >Priority: Trivial > Fix For: 2.2.0 > > > The combineByKey docs have a few flaws: > - Don't include a note about memory allocation (which aggregateByKey has) > - Example doesn't show the difference between mergeValue and mergeCombiners (both > are add) > I have a trivial patch, will attach momentarily. 
[jira] [Resolved] (SPARK-20232) Better combineByKey documentation: clarify memory allocation, better example
[ https://issues.apache.org/jira/browse/SPARK-20232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] holdenk resolved SPARK-20232. - Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 17545 [https://github.com/apache/spark/pull/17545] > Better combineByKey documentation: clarify memory allocation, better example > > > Key: SPARK-20232 > URL: https://issues.apache.org/jira/browse/SPARK-20232 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 2.1.0 > Environment: macOS Sierra 10.12.4 > Spark 2.1.0 installed via Homebrew >Reporter: David Gingrich >Priority: Trivial > Fix For: 2.2.0 > > > The combineByKey docs have a few flaws: > - Don't include a note about memory allocation (which aggregateByKey has) > - Example doesn't show the difference between mergeValue and mergeCombiners (both > are add) > I have a trivial patch, will attach momentarily. 
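The second documentation flaw above is easiest to fix with an example where mergeValue and mergeCombiners genuinely differ. A hypothetical pure-Python simulation of combineByKey semantics (not Spark code; the helper function is illustrative) that computes per-key averages: createCombiner turns a value into a (sum, count) pair, mergeValue folds a raw value into a pair within a partition, and mergeCombiners merges pairs across partitions:

```python
def combine_by_key(pairs, create_combiner, merge_value, merge_combiners, partitions=2):
    # Phase 1: combine within each simulated partition (mergeValue).
    per_partition = [dict() for _ in range(partitions)]
    for i, (k, v) in enumerate(pairs):
        acc = per_partition[i % partitions]
        acc[k] = merge_value(acc[k], v) if k in acc else create_combiner(v)
    # Phase 2: merge combiners across partitions (mergeCombiners).
    out = {}
    for acc in per_partition:
        for k, c in acc.items():
            out[k] = merge_combiners(out[k], c) if k in out else c
    return out

data = [("a", 1), ("b", 10), ("a", 3), ("b", 20), ("a", 5)]
sums_counts = combine_by_key(
    data,
    create_combiner=lambda v: (v, 1),               # value -> (sum, count)
    merge_value=lambda c, v: (c[0] + v, c[1] + 1),  # fold a raw value into a combiner
    merge_combiners=lambda c1, c2: (c1[0] + c2[0], c1[1] + c2[1]),  # merge two combiners
)
averages = {k: s / n for k, (s, n) in sums_counts.items()}
print(averages)  # {'a': 3.0, 'b': 15.0}
```

Because the combiner type (a pair) differs from the value type (a number), mergeValue and mergeCombiners cannot be the same function here, which is exactly the distinction the docs' add/add example hides.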
[jira] [Commented] (SPARK-19909) Batches will fail in case that temporary checkpoint dir is on local file system while metadata dir is on HDFS
[ https://issues.apache.org/jira/browse/SPARK-19909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15968092#comment-15968092 ] Rob commented on SPARK-19909: - Is there an easy way to avoid this issue while waiting for it to be resolved? Perhaps tweaking a setting? > Batches will fail in case that temporary checkpoint dir is on local file > system while metadata dir is on HDFS > - > > Key: SPARK-19909 > URL: https://issues.apache.org/jira/browse/SPARK-19909 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Kousuke Saruta >Priority: Minor > > When we try to run Structured Streaming in local mode but use HDFS for the > storage, batches will fail because of an error like the following. > {code} > val handle = stream.writeStream.format("console").start() > 17/03/09 16:54:45 ERROR StreamMetadata: Error writing stream metadata > StreamMetadata(fc07a0b1-5423-483e-a59d-b2206a49491e) to > /private/var/folders/4y/tmspvv353y59p3w4lknrf7ccgn/T/temporary-79d4fe05-4301-4b6d-a902-dff642d0ddca/metadata > org.apache.hadoop.security.AccessControlException: Permission denied: > user=kou, access=WRITE, > inode="/private/var/folders/4y/tmspvv353y59p3w4lknrf7ccgn/T/temporary-79d4fe05-4301-4b6d-a902-dff642d0ddca/metadata":hdfs:supergroup:drwxr-xr-x > {code} > This is because the temporary checkpoint directory is created on the local file > system, but the metadata, whose path is based on the checkpoint directory, will be > created on HDFS. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
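One workaround implied by the description is to avoid the temporary checkpoint directory entirely by giving the query an explicit checkpointLocation on the same filesystem as the metadata. A hedged PySpark sketch, shown as a configuration fragment only (the HDFS path and the `stream` DataFrame are placeholders, and a running Spark session is assumed):

```python
# Placeholder names: `stream` is a streaming DataFrame from an active session.
# Pointing the checkpoint at an explicit HDFS path keeps checkpoint and
# metadata on the same filesystem, sidestepping the local-vs-HDFS mismatch.
query = (stream.writeStream
         .format("console")
         .option("checkpointLocation", "hdfs:///user/me/checkpoints/console-query")
         .start())
```

This does not fix the underlying bug, but each query then writes its metadata under a directory the user controls rather than a local temp dir.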
[jira] [Updated] (SPARK-19946) DebugFilesystem.assertNoOpenStreams should report the open streams to help debugging
[ https://issues.apache.org/jira/browse/SPARK-19946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell updated SPARK-19946: -- Fix Version/s: 2.1.1 > DebugFilesystem.assertNoOpenStreams should report the open streams to help > debugging > > > Key: SPARK-19946 > URL: https://issues.apache.org/jira/browse/SPARK-19946 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 2.1.0 >Reporter: Bogdan Raducanu >Assignee: Bogdan Raducanu >Priority: Minor > Fix For: 2.1.1, 2.2.0 > > > In DebugFilesystem.assertNoOpenStreams if there are open streams an exception > is thrown showing the number of open streams. This doesn't help much to debug > where the open streams were leaked. > The exception should also report where the stream was leaked. This can be > done through a cause exception. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
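The proposal in the description, reporting *where* a leaked stream was opened through a cause exception, can be sketched outside Spark. This is a hypothetical illustration of the idea, not the actual DebugFilesystem code:

```python
import io
import traceback

# Map each open stream to the stack trace captured when it was opened.
_open_streams = {}

def open_tracked(factory=io.StringIO):
    stream = factory()
    # Record the open site now, so a later leak report can point at it.
    _open_streams[stream] = "".join(traceback.format_stack())
    return stream

def close_tracked(stream):
    _open_streams.pop(stream, None)
    stream.close()

def assert_no_open_streams():
    if _open_streams:
        opened_at = next(iter(_open_streams.values()))
        # Surface the leak site as the *cause*, as the issue proposes,
        # instead of only reporting a count.
        cause = RuntimeError("stream opened at:\n" + opened_at)
        raise AssertionError(f"{len(_open_streams)} stream(s) still open") from cause
```

The error message still states how many streams leaked, but a debugger now also sees the open-site stack trace in the exception chain.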
[jira] [Commented] (SPARK-20327) Add CLI support for YARN-3926
[ https://issues.apache.org/jira/browse/SPARK-20327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15967913#comment-15967913 ] Daniel Templeton commented on SPARK-20327: -- YARN-3926 won't go in until Hadoop 3.0.0, so this one is definitely forward-looking. > Add CLI support for YARN-3926 > - > > Key: SPARK-20327 > URL: https://issues.apache.org/jira/browse/SPARK-20327 > Project: Spark > Issue Type: Improvement > Components: Spark Shell, Spark Submit >Affects Versions: 2.1.0 >Reporter: Daniel Templeton > Labels: newbie > > YARN-3926 adds the ability for administrators to configure custom resources, > like GPUs. This JIRA is to add support to Spark for requesting resources > other than CPU virtual cores and memory. See YARN-3926. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-14519) Cross-publish Kafka for Scala 2.12
[ https://issues.apache.org/jira/browse/SPARK-14519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15967897#comment-15967897 ] Fabio Pinheiro edited comment on SPARK-14519 at 4/13/17 5:12 PM: - Kafka officially supports Scala 2.12 since '0.10.1.1' and it is the recommended version of Scala for the latest release '0.10.2.0'. (KAFKA-4376 - Add scala 2.12 support) I have experience (as a user) with Scala, Spark and Kafka, but I'm new to the community. Can I help with this ticket? was (Author: fmgp): Kafka official supports Scala 2.12 since '0.10.1.1' and it is the recommend version of Scala for the latest release '0.10.2.0'. I have experience (as a user) with Scala, Spark and Kafka, but I new in the community. Can I help with this ticket? > Cross-publish Kafka for Scala 2.12 > -- > > Key: SPARK-14519 > URL: https://issues.apache.org/jira/browse/SPARK-14519 > Project: Spark > Issue Type: Sub-task > Components: Build, Project Infra >Reporter: Josh Rosen > > In order to build the streaming Kafka connector, we need to publish Kafka for > Scala 2.12.0-M4. Someone should file an issue against the Kafka project and > work with their developers to figure out what will block their upgrade / > release. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14519) Cross-publish Kafka for Scala 2.12
[ https://issues.apache.org/jira/browse/SPARK-14519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15967897#comment-15967897 ] Fabio Pinheiro commented on SPARK-14519: Kafka officially supports Scala 2.12 since '0.10.1.1' and it is the recommended version of Scala for the latest release '0.10.2.0'. I have experience (as a user) with Scala, Spark and Kafka, but I'm new to the community. Can I help with this ticket? > Cross-publish Kafka for Scala 2.12 > -- > > Key: SPARK-14519 > URL: https://issues.apache.org/jira/browse/SPARK-14519 > Project: Spark > Issue Type: Sub-task > Components: Build, Project Infra >Reporter: Josh Rosen > > In order to build the streaming Kafka connector, we need to publish Kafka for > Scala 2.12.0-M4. Someone should file an issue against the Kafka project and > work with their developers to figure out what will block their upgrade / > release. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20327) Add CLI support for YARN-3926
[ https://issues.apache.org/jira/browse/SPARK-20327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15967886#comment-15967886 ] Sean Owen commented on SPARK-20327: --- What version of YARN would this require? Spark is still supporting 2.6. > Add CLI support for YARN-3926 > - > > Key: SPARK-20327 > URL: https://issues.apache.org/jira/browse/SPARK-20327 > Project: Spark > Issue Type: Improvement > Components: Spark Shell, Spark Submit >Affects Versions: 2.1.0 >Reporter: Daniel Templeton > Labels: newbie > > YARN-3926 adds the ability for administrators to configure custom resources, > like GPUs. This JIRA is to add support to Spark for requesting resources > other than CPU virtual cores and memory. See YARN-3926. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20327) Add CLI support for YARN-3926
[ https://issues.apache.org/jira/browse/SPARK-20327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Templeton updated SPARK-20327: - Labels: newbie (was: ) > Add CLI support for YARN-3926 > - > > Key: SPARK-20327 > URL: https://issues.apache.org/jira/browse/SPARK-20327 > Project: Spark > Issue Type: Improvement > Components: Spark Shell, Spark Submit >Affects Versions: 2.1.0 >Reporter: Daniel Templeton > Labels: newbie > > YARN-3926 adds the ability for administrators to configure custom resources, > like GPUs. This JIRA is to add support to Spark for requesting resources > other than CPU virtual cores and memory. See YARN-3926. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20327) Add CLI support for YARN-3926
[ https://issues.apache.org/jira/browse/SPARK-20327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15967855#comment-15967855 ] Daniel Templeton commented on SPARK-20327: -- [~mgrover], would you mind assigning this one to me? > Add CLI support for YARN-3926 > - > > Key: SPARK-20327 > URL: https://issues.apache.org/jira/browse/SPARK-20327 > Project: Spark > Issue Type: Improvement > Components: Spark Shell, Spark Submit >Affects Versions: 2.1.0 >Reporter: Daniel Templeton > > YARN-3926 adds the ability for administrators to configure custom resources, > like GPUs. This JIRA is to add support to Spark for requesting resources > other than CPU virtual cores and memory. See YARN-3926. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20327) Add CLI support for YARN-3926
[ https://issues.apache.org/jira/browse/SPARK-20327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Templeton updated SPARK-20327: - Description: YARN-3926 adds the ability for administrators to configure custom resources, like GPUs. This JIRA is to add support to Spark for requesting resources other than CPU virtual cores and memory. See YARN-3926. > Add CLI support for YARN-3926 > - > > Key: SPARK-20327 > URL: https://issues.apache.org/jira/browse/SPARK-20327 > Project: Spark > Issue Type: Improvement > Components: Spark Shell, Spark Submit >Affects Versions: 2.1.0 >Reporter: Daniel Templeton > > YARN-3926 adds the ability for administrators to configure custom resources, > like GPUs. This JIRA is to add support to Spark for requesting resources > other than CPU virtual cores and memory. See YARN-3926. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20327) Add CLI support for YARN-3926
Daniel Templeton created SPARK-20327: Summary: Add CLI support for YARN-3926 Key: SPARK-20327 URL: https://issues.apache.org/jira/browse/SPARK-20327 Project: Spark Issue Type: Improvement Components: Spark Shell, Spark Submit Affects Versions: 2.1.0 Reporter: Daniel Templeton -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20326) Run Spark in local mode from IntelliJ (or other IDE)
[ https://issues.apache.org/jira/browse/SPARK-20326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-20326. --- Resolution: Invalid Questions belong on the mailing list u...@spark.apache.org > Run Spark in local mode from IntelliJ (or other IDE) > > > Key: SPARK-20326 > URL: https://issues.apache.org/jira/browse/SPARK-20326 > Project: Spark > Issue Type: Question > Components: Documentation >Affects Versions: 2.1.0 >Reporter: Marcello Oliva > > I wrote a simple Spark test in IntelliJ and it fails when creating the > SparkSession: > val spark = SparkSession.builder().master("local[2]").getOrCreate() > Error: > java.lang.IllegalArgumentException: Can't get Kerberos realm > ... > Caused by: java.lang.reflect.InvocationTargetException > ... > Caused by: KrbException: Cannot locate default realm > How do I run tests from an IDE? -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20326) Run Spark in local mode from IntelliJ (or other IDE)
Marcello Oliva created SPARK-20326: -- Summary: Run Spark in local mode from IntelliJ (or other IDE) Key: SPARK-20326 URL: https://issues.apache.org/jira/browse/SPARK-20326 Project: Spark Issue Type: Question Components: Documentation Affects Versions: 2.1.0 Reporter: Marcello Oliva I wrote a simple Spark test in IntelliJ and it fails when creating the SparkSession: val spark = SparkSession.builder().master("local[2]").getOrCreate() Error: java.lang.IllegalArgumentException: Can't get Kerberos realm ... Caused by: java.lang.reflect.InvocationTargetException ... Caused by: KrbException: Cannot locate default realm How do I run tests from an IDE? -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8586) SQL add jar command does not work well with Scala REPL
[ https://issues.apache.org/jira/browse/SPARK-8586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15967689#comment-15967689 ] wasif masood commented on SPARK-8586: - I am using spark in Zeppelin and facing the same error. I am trying to add the jar using the following: sqlContext.sql("ADD JAR /user/hive/udf/myLib.jar") Please advise. > SQL add jar command does not work well with Scala REPL > -- > > Key: SPARK-8586 > URL: https://issues.apache.org/jira/browse/SPARK-8586 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.0, 2.1.0 >Reporter: Yin Huai >Assignee: Yin Huai >Priority: Critical > > Seems SparkIMain always resets the context class loader in {{loadAndRunReq}}. > So, SerDe added through add jar command may not be loaded in the context > class loader when we lookup the table. > For example, the following code will fail when we try to show the table. > {code} > hive.sql("add jar sql/hive/src/test/resources/hive-hcatalog-core-0.13.1.jar") > hive.sql("drop table if exists jsonTable") > hive.sql("CREATE TABLE jsonTable(key int, val string) ROW FORMAT SERDE > 'org.apache.hive.hcatalog.data.JsonSerDe'") > hive.createDataFrame((1 to 100).map(i => (i, s"str$i"))).toDF("key", > "val").insertInto("jsonTable") > hive.table("jsonTable").show > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20287) Kafka Consumer should be able to subscribe to more than one topic partition
[ https://issues.apache.org/jira/browse/SPARK-20287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cody Koeninger updated SPARK-20287: --- What you're describing is closer to the receiver-based implementation, which had a number of issues. What I tried to achieve with the direct stream implementation was to have the driver figure out offset ranges for the next batch, then have executors deterministically consume exactly those messages with a 1:1 mapping between kafka partition and spark partition. If you have a single consumer subscribed to multiple topicpartitions, you'll get intermingled messages for all of those partitions. With the new consumer api subscribed to multiple partitions, there isn't a way to say "get topicpartition A until offset 1234", which is what we need. -- Cody Koeninger > Kafka Consumer should be able to subscribe to more than one topic partition > --- > > Key: SPARK-20287 > URL: https://issues.apache.org/jira/browse/SPARK-20287 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Stephane Maarek > > As I understand it and as it stands, one Kafka Consumer is created for each > topic partition in the source Kafka topics, and they're cached. > cf > https://github.com/apache/spark/blob/master/external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/CachedKafkaConsumer.scala#L48 > In my opinion, that makes the design an anti-pattern for Kafka and highly > inefficient: > - Each Kafka consumer creates a connection to Kafka > - Spark doesn't leverage the power of the Kafka consumers, which is that it > automatically assigns and balances partitions amongst all the consumers that > share the same group.id > - You can still cache your Kafka consumer even if it has multiple partitions. 
> I'm not sure how that translates to Spark's underlying RDD > architecture, but from a Kafka standpoint, I believe creating one consumer > per partition is a big overhead, and a risk as the user may have to increase > the spark.streaming.kafka.consumer.cache.maxCapacity parameter. > Happy to discuss to understand the rationale. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
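Cody's driver/executor split can be illustrated with a small, Spark-free simulation: the driver plans an offset range per topicpartition for the next batch, and each "executor" then deterministically reads exactly that range ("get topicpartition A until offset 1234"). All names and data are illustrative:

```python
# Simulated partition logs: (topic, partition) -> list of messages, index == offset.
logs = {
    ("t", 0): ["a0", "a1", "a2", "a3"],
    ("t", 1): ["b0", "b1"],
}

def plan_batch(committed, latest):
    # Driver side: decide an exact [start, end) offset range per topicpartition.
    return {tp: (committed.get(tp, 0), latest[tp]) for tp in latest}

def consume(tp, start, end):
    # Executor side: read exactly [start, end) from one partition.
    # One Kafka partition maps 1:1 to one Spark partition, so messages from
    # different topicpartitions are never intermingled.
    return logs[tp][start:end]

latest = {tp: len(msgs) for tp, msgs in logs.items()}
ranges = plan_batch({("t", 0): 1}, latest)          # ("t", 0) already consumed to offset 1
batch = {tp: consume(tp, s, e) for tp, (s, e) in ranges.items()}
```

A single consumer subscribed to both partitions would instead return an interleaved stream with no "stop at offset N" guarantee per partition, which is the mismatch Cody describes.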
[jira] [Assigned] (SPARK-20233) Apply star-join filter heuristics to dynamic programming join enumeration
[ https://issues.apache.org/jira/browse/SPARK-20233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-20233: --- Assignee: Ioana Delaney > Apply star-join filter heuristics to dynamic programming join enumeration > - > > Key: SPARK-20233 > URL: https://issues.apache.org/jira/browse/SPARK-20233 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.2.0 >Reporter: Ioana Delaney >Assignee: Ioana Delaney >Priority: Critical > Fix For: 2.2.0 > > > This JIRA integrates star-join detection with the cost-based optimizer. > The join enumeration using dynamic programming generates a set of feasible > joins. The sub-optimal plans can be eliminated by a sequence of independent, > optional filters. The optional filters include heuristics for reducing the > search space. For example, > # Star-join: Tables in a star schema relationship are planned together since > they are assumed to have an optimal execution. > # Cartesian products: Cartesian products are deferred as late as possible to > avoid large intermediate results (expanding joins, in general). > # Composite inners: “Bushy tree” plans are not generated to avoid > materializing intermediate result. > For reference, see “Measuring the Complexity of Join Enumeration in Query > Optimization” by Ono et al. > This JIRA implements the star join filter. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20233) Apply star-join filter heuristics to dynamic programming join enumeration
[ https://issues.apache.org/jira/browse/SPARK-20233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-20233. - Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 17546 [https://github.com/apache/spark/pull/17546] > Apply star-join filter heuristics to dynamic programming join enumeration > - > > Key: SPARK-20233 > URL: https://issues.apache.org/jira/browse/SPARK-20233 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.2.0 >Reporter: Ioana Delaney >Priority: Critical > Fix For: 2.2.0 > > > This JIRA integrates star-join detection with the cost-based optimizer. > The join enumeration using dynamic programming generates a set of feasible > joins. The sub-optimal plans can be eliminated by a sequence of independent, > optional filters. The optional filters include heuristics for reducing the > search space. For example, > # Star-join: Tables in a star schema relationship are planned together since > they are assumed to have an optimal execution. > # Cartesian products: Cartesian products are deferred as late as possible to > avoid large intermediate results (expanding joins, in general). > # Composite inners: “Bushy tree” plans are not generated to avoid > materializing intermediate result. > For reference, see “Measuring the Complexity of Join Enumeration in Query > Optimization” by Ono et al. > This JIRA implements the star join filter. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
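The "dynamic programming enumeration plus optional pruning filters" design in the description can be sketched as follows. This is a toy illustration of the control flow only, not Spark's actual cost-based optimizer code; a filter here is any predicate that may veto a candidate (left, right) pair, the way the star-join filter does:

```python
from itertools import combinations

def enumerate_joins(relations, connected, filters=()):
    # Bottom-up DP over subsets of relations. `connected(left, right)` says
    # whether a join predicate links the two sides (its absence would mean a
    # cartesian product, which we skip). Each optional filter may veto a
    # candidate pair, pruning the search space.
    best = {frozenset([r]): r for r in relations}
    for size in range(2, len(relations) + 1):
        for subset in map(frozenset, combinations(relations, size)):
            for k in range(1, size):
                for left in map(frozenset, combinations(sorted(subset), k)):
                    right = subset - left
                    if (left in best and right in best
                            and connected(left, right)
                            and all(f(left, right) for f in filters)):
                        best[subset] = (best[left], best[right])  # nested join tree
    return best.get(frozenset(relations))
```

With a veto-everything filter no multi-relation plan survives, which shows how a filter sequence shrinks the feasible set before costing.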
[jira] [Commented] (SPARK-19946) DebugFilesystem.assertNoOpenStreams should report the open streams to help debugging
[ https://issues.apache.org/jira/browse/SPARK-19946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15967599#comment-15967599 ] Apache Spark commented on SPARK-19946: -- User 'bogdanrdc' has created a pull request for this issue: https://github.com/apache/spark/pull/17632 > DebugFilesystem.assertNoOpenStreams should report the open streams to help > debugging > > > Key: SPARK-19946 > URL: https://issues.apache.org/jira/browse/SPARK-19946 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 2.1.0 >Reporter: Bogdan Raducanu >Assignee: Bogdan Raducanu >Priority: Minor > Fix For: 2.2.0 > > > In DebugFilesystem.assertNoOpenStreams if there are open streams an exception > is thrown showing the number of open streams. This doesn't help much to debug > where the open streams were leaked. > The exception should also report where the stream was leaked. This can be > done through a cause exception. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20199) GradientBoostedTreesModel doesn't have Column Sampling Rate Parameter
[ https://issues.apache.org/jira/browse/SPARK-20199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15967574#comment-15967574 ] pralabhkumar commented on SPARK-20199: -- Shouldn't there be a parameter in GBMParameters() val gbmParams = new GBMParameters() gbmParams._featuresubsetStrategy="auto" and then in private[ml] def train(data: RDD[LabeledPoint], oldStrategy: OldStrategy): DecisionTreeRegressionModel = { we should have oldStrategy.getStrategy() and set it in val trees = RandomForest.run(data, oldStrategy, numTrees = 1, featureSubsetStrategy = oldStrategy.getStrategy(), seed = $(seed), instr = Some(instr), parentUID = Some(uid)) > GradientBoostedTreesModel doesn't have Column Sampling Rate Parameter > --- > > Key: SPARK-20199 > URL: https://issues.apache.org/jira/browse/SPARK-20199 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 2.1.0 >Reporter: pralabhkumar > > Spark GradientBoostedTreesModel doesn't have a column sampling rate parameter. > This parameter is available in H2O and XGBoost. > Sample from H2O.ai > gbmParams._col_sample_rate > Please provide the parameter. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
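For context, a featureSubsetStrategy-style parameter simply controls how many columns each tree is allowed to see. A hedged, Spark-free sketch (the strategy names are illustrative, chosen to resemble common random-forest conventions):

```python
import math
import random

def feature_subset(num_features, strategy, rng=None):
    # Illustrative only: pick the column indices one tree will train on,
    # according to a featureSubsetStrategy-style setting.
    rng = rng or random.Random(42)  # fixed seed for reproducibility here
    if strategy == "all":
        k = num_features
    elif strategy == "sqrt":
        k = max(1, int(math.sqrt(num_features)))
    elif strategy == "onethird":
        k = max(1, num_features // 3)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return sorted(rng.sample(range(num_features), k))
```

Each boosting iteration would call this once per tree, so different trees see different column subsets, which is the regularization effect the H2O `_col_sample_rate` parameter provides.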
[jira] [Created] (SPARK-20325) Spark Structured Streaming documentation Update: checkpoint configuration
Kate Eri created SPARK-20325: Summary: Spark Structured Streaming documentation Update: checkpoint configuration Key: SPARK-20325 URL: https://issues.apache.org/jira/browse/SPARK-20325 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 2.1.0 Reporter: Kate Eri Priority: Minor I have configured the following stream outputting to Kafka: {code} map.foreach(metric => { streamToProcess .groupBy(metric) .agg(count(metric)) .writeStream .outputMode("complete") .option("checkpointLocation", checkpointDir) .foreach(kafkaWriter) .start() }) {code} And configured the checkpoint Dir for each of output sinks like: .option("checkpointLocation", checkpointDir) according to the documentation => http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#recovering-from-failures-with-checkpointing As a result I've got the following exception: Cannot start query with id bf6a1003-6252-4c62-8249-c6a189701255 as another query with same id is already active. Perhaps you are attempting to restart a query from checkpoint that is already active. java.lang.IllegalStateException: Cannot start query with id bf6a1003-6252-4c62-8249-c6a189701255 as another query with same id is already active. Perhaps you are attempting to restart a query from checkpoint that is already active. 
at org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:291) So according to current Spark logic for the “foreach” sink the checkpoint configuration is loaded in the following way: {code:title=StreamingQueryManager.scala} val checkpointLocation = userSpecifiedCheckpointLocation.map { userSpecified => new Path(userSpecified).toUri.toString }.orElse { df.sparkSession.sessionState.conf.checkpointLocation.map { location => new Path(location, userSpecifiedName.getOrElse(UUID.randomUUID().toString)).toUri.toString } }.getOrElse { if (useTempCheckpointLocation) { Utils.createTempDir(namePrefix = s"temporary").getCanonicalPath } else { throw new AnalysisException( "checkpointLocation must be specified either " + """through option("checkpointLocation", ...) or """ + s"""SparkSession.conf.set("${SQLConf.CHECKPOINT_LOCATION.key}", ...)""") } } {code} So Spark first takes the checkpoint directory from the query, then from the SparkSession (spark.sql.streaming.checkpointLocation), and so on. But this behavior is not documented, which raises two questions: 1) Could we update the Structured Streaming documentation to describe this behavior? 2) Do we really need to specify the checkpoint dir per query? What is the reason for this? Otherwise we will be forced to write some checkpointDir name generator, for example associating it with a particular named query. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
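The resolution order quoted from StreamingQueryManager can be restated as a small Python sketch (illustrative names only; real Spark throws AnalysisException at the final step):

```python
import uuid

def resolve_checkpoint_location(query_option=None, session_conf=None,
                                query_name=None, use_temp=False):
    # 1. Per-query option("checkpointLocation", ...) wins outright.
    if query_option is not None:
        return query_option
    # 2. Session-wide spark.sql.streaming.checkpointLocation, suffixed with
    #    the query name (or a random UUID when the query is unnamed).
    if session_conf is not None:
        return f"{session_conf}/{query_name or uuid.uuid4()}"
    # 3. A temporary directory, only for sinks that allow it.
    if use_temp:
        return f"/tmp/temporary-{uuid.uuid4()}"
    # 4. Otherwise an error (AnalysisException in Spark).
    raise ValueError("checkpointLocation must be specified")
```

Note the answer to question 2 implied by step 2: setting only the session-wide conf and giving each query a distinct name already yields per-query checkpoint directories without a hand-written name generator.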
[jira] [Commented] (SPARK-650) Add a "setup hook" API for running initialization code on each executor
[ https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15967554#comment-15967554 ] Mathieu Boespflug commented on SPARK-650: - [~Skamandros] how did you manage to hook `JavaSerializer`? I tried doing so myself, by defining a new subclass, but then I need to make sure that new class is installed on all executors. Meaning I have to copy a .jar on all my nodes manually. For some reason Spark won't try looking for the serializer inside my application JAR. > Add a "setup hook" API for running initialization code on each executor > --- > > Key: SPARK-650 > URL: https://issues.apache.org/jira/browse/SPARK-650 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Reporter: Matei Zaharia >Priority: Minor > > Would be useful to configure things like reporting libraries -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20323) Calling stop in a transform stage causes the app to hang
[ https://issues.apache.org/jira/browse/SPARK-20323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15967535#comment-15967535 ] Sean Owen commented on SPARK-20323: --- Yeah I don't think it would do anything because that's not even the same copy of the context. If it happens to execute locally it might work. It might cause undefined behavior. If there's an easy way to explicitly error in that case, great, but otherwise would just avoid it. > Calling stop in a transform stage causes the app to hang > > > Key: SPARK-20323 > URL: https://issues.apache.org/jira/browse/SPARK-20323 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Andrei Taleanu > > I'm not sure if this is a bug or just the way it needs to happen but I've run > in this issue with the following code: > {noformat} > object ImmortalStreamingJob extends App { > val conf = new SparkConf().setAppName("fun-spark").setMaster("local[*]") > val ssc = new StreamingContext(conf, Seconds(1)) > val elems = (1 to 1000).grouped(10) > .map(seq => ssc.sparkContext.parallelize(seq)) > .toSeq > val stream = ssc.queueStream(mutable.Queue[RDD[Int]](elems: _*)) > val transformed = stream.transform { rdd => > try { > if (Random.nextInt(6) == 5) throw new RuntimeException("boom") > else println("lucky bastard") > rdd > } catch { > case e: Throwable => > println("stopping streaming context", e) > ssc.stop(stopSparkContext = true, stopGracefully = false) > throw e > } > } > transformed.foreachRDD { rdd => > println(rdd.collect().mkString(",")) > } > ssc.start() > ssc.awaitTermination() > } > {noformat} > There are two things I can note here: > * if the exception is thrown in the first transformation (when the first RDD > is processed), the spark context is stopped and the app dies > * if the exception is thrown after at least one RDD has been processed, the > app hangs after printing the error message and never stops > I think there's some sort of deadlock in the 
second case, is that normal? I > also asked this > [here|http://stackoverflow.com/questions/43273783/immortal-spark-streaming-job/43373624#43373624] > but up to this point there's no answer pointing exactly to what happens, > only guidelines. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
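One pattern that avoids calling stop() inside a stage, consistent with the advice in the comment above, can be sketched as a Spark-free simulation: the transformation only records the failure in a driver-visible flag and re-raises, and code outside any stage (for example a monitor thread on the driver) performs the actual ssc.stop(). This is an illustrative sketch, not a drop-in fix:

```python
import threading

# Driver-visible flag; in a real job this would live on the driver, not in
# a closure shipped to executors.
stop_requested = threading.Event()

def transform_stage(batch):
    # Don't call ssc.stop() here: the closure may run with a different
    # (serialized) copy of the context. Record the failure and re-raise;
    # a driver-side loop checks the flag and stops the context cleanly.
    try:
        if any(x < 0 for x in batch):
            raise RuntimeError("boom")
        return [x * 2 for x in batch]
    except Exception:
        stop_requested.set()
        raise
```

The driver-side monitor would then do something like `if stop_requested.is_set(): ssc.stop(stopSparkContext=True, stopGracefully=False)` from its own thread, outside any DStream operation.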
[jira] [Commented] (SPARK-20323) Calling stop in a transform stage causes the app to hang
[ https://issues.apache.org/jira/browse/SPARK-20323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15967516#comment-15967516 ] Andrei Taleanu commented on SPARK-20323: All right, so the behaviour is undefined in that case? > Calling stop in a transform stage causes the app to hang > > > Key: SPARK-20323 > URL: https://issues.apache.org/jira/browse/SPARK-20323 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Andrei Taleanu > > I'm not sure if this is a bug or just the way it needs to happen but I've run > in this issue with the following code: > {noformat} > object ImmortalStreamingJob extends App { > val conf = new SparkConf().setAppName("fun-spark").setMaster("local[*]") > val ssc = new StreamingContext(conf, Seconds(1)) > val elems = (1 to 1000).grouped(10) > .map(seq => ssc.sparkContext.parallelize(seq)) > .toSeq > val stream = ssc.queueStream(mutable.Queue[RDD[Int]](elems: _*)) > val transformed = stream.transform { rdd => > try { > if (Random.nextInt(6) == 5) throw new RuntimeException("boom") > else println("lucky bastard") > rdd > } catch { > case e: Throwable => > println("stopping streaming context", e) > ssc.stop(stopSparkContext = true, stopGracefully = false) > throw e > } > } > transformed.foreachRDD { rdd => > println(rdd.collect().mkString(",")) > } > ssc.start() > ssc.awaitTermination() > } > {noformat} > There are two things I can note here: > * if the exception is thrown in the first transformation (when the first RDD > is processed), the spark context is stopped and the app dies > * if the exception is thrown after at least one RDD has been processed, the > app hangs after printing the error message and never stops > I think there's some sort of deadlock in the second case, is that normal? 
I > also asked this > [here|http://stackoverflow.com/questions/43273783/immortal-spark-streaming-job/43373624#43373624] > but up to this point there's no answer pointing exactly to what happens, > only guidelines. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20321) Spark UI cannot be shutdown in spark streaming app
[ https://issues.apache.org/jira/browse/SPARK-20321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15967514#comment-15967514 ] Andrei Taleanu commented on SPARK-20321: Not sure what it would help. I could provide more logs, but I'm not sure how that helps as they're typical stopping logs. Also, the SparkUI remains accessible when this happens and you can see a single job that runs forever. I've tried recreating a minimal example but somehow I can't reproduce it in another minimal streaming app. In the app where I've encountered this issue it happens every time so it's really easy to reproduce. I'm not sure what I should be looking at so I can provide more details. > Spark UI cannot be shutdown in spark streaming app > -- > > Key: SPARK-20321 > URL: https://issues.apache.org/jira/browse/SPARK-20321 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Andrei Taleanu > > When an exception thrown the transform stage is handled in foreachRDD and the > streaming context is forced to stop, the SparkUI appears to hang and > continually dump the following logs in an infinite loop: > {noformat} > ... > 2017-04-12 14:11:13,470 > [SparkUI-50-selector-ServerConnectorManager@512d4583/1] DEBUG > org.spark_project.jetty.io.SelectorManager - Selector loop woken up from > select, 0/0 selected > 2017-04-12 14:11:13,470 > [SparkUI-50-selector-ServerConnectorManager@512d4583/1] DEBUG > org.spark_project.jetty.io.SelectorManager - Selector loop waiting on select > 2017-04-12 14:11:13,470 > [SparkUI-50-selector-ServerConnectorManager@512d4583/1] DEBUG > org.spark_project.jetty.io.SelectorManager - Selector loop woken up from > select, 0/0 selected > 2017-04-12 14:11:13,470 > [SparkUI-50-selector-ServerConnectorManager@512d4583/1] DEBUG > org.spark_project.jetty.io.SelectorManager - Selector loop waiting on select > ... 
> {noformat} > Unfortunately I don't have a minimal example that reproduces this issue but > here is what I can share: > {noformat} > val dstream = pull data from kafka > val mapped = dstream.transform { rdd => > val data = getData // Perform a call that potentially throws an exception > // broadcast the data > // flatMap the RDD using the data > } > mapped.foreachRDD { > try { > // write some data in a DB > } catch { > case t: Throwable => > dstream.context.stop(stopSparkContext = true, stopGracefully = false) > } > } > mapped.foreachRDD { > try { > // write data to Kafka > // manually checkpoint the Kafka offsets (because I need them in JSON > format) > } catch { > case t: Throwable => > dstream.context.stop(stopSparkContext = true, stopGracefully = false) > } > } > {noformat} > The issue appears when stop is invoked. At the point when SparkUI is stopped, > it enters that infinite loop. Initially I thought it relates to Jetty, as the > version used in SparkUI had some bugs (e.g. [this > one|https://bugs.eclipse.org/bugs/show_bug.cgi?id=452465]). I bumped Jetty to > a more recent version (March 2017) and built Spark 2.1.0 with that one but > still got the error. > I encountered this issue with Spark 2.1.0 built with Hadoop 2.6 on top of > Mesos. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20321) Spark UI cannot be shutdown in spark streaming app
[ https://issues.apache.org/jira/browse/SPARK-20321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15967502#comment-15967502 ] Sean Owen commented on SPARK-20321: --- I don't recall any other reports like this. Yeah, it's hard to action this without a reproduction. Unless you have more detail about what's executing, what's stuck where, I am not sure how this would proceed. > Spark UI cannot be shutdown in spark streaming app > -- > > Key: SPARK-20321 > URL: https://issues.apache.org/jira/browse/SPARK-20321 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Andrei Taleanu > > When an exception thrown the transform stage is handled in foreachRDD and the > streaming context is forced to stop, the SparkUI appears to hang and > continually dump the following logs in an infinite loop: > {noformat} > ... > 2017-04-12 14:11:13,470 > [SparkUI-50-selector-ServerConnectorManager@512d4583/1] DEBUG > org.spark_project.jetty.io.SelectorManager - Selector loop woken up from > select, 0/0 selected > 2017-04-12 14:11:13,470 > [SparkUI-50-selector-ServerConnectorManager@512d4583/1] DEBUG > org.spark_project.jetty.io.SelectorManager - Selector loop waiting on select > 2017-04-12 14:11:13,470 > [SparkUI-50-selector-ServerConnectorManager@512d4583/1] DEBUG > org.spark_project.jetty.io.SelectorManager - Selector loop woken up from > select, 0/0 selected > 2017-04-12 14:11:13,470 > [SparkUI-50-selector-ServerConnectorManager@512d4583/1] DEBUG > org.spark_project.jetty.io.SelectorManager - Selector loop waiting on select > ... 
> {noformat} > Unfortunately I don't have a minimal example that reproduces this issue but > here is what I can share: > {noformat} > val dstream = pull data from kafka > val mapped = dstream.transform { rdd => > val data = getData // Perform a call that potentially throws an exception > // broadcast the data > // flatMap the RDD using the data > } > mapped.foreachRDD { > try { > // write some data in a DB > } catch { > case t: Throwable => > dstream.context.stop(stopSparkContext = true, stopGracefully = false) > } > } > mapped.foreachRDD { > try { > // write data to Kafka > // manually checkpoint the Kafka offsets (because I need them in JSON > format) > } catch { > case t: Throwable => > dstream.context.stop(stopSparkContext = true, stopGracefully = false) > } > } > {noformat} > The issue appears when stop is invoked. At the point when SparkUI is stopped, > it enters that infinite loop. Initially I thought it relates to Jetty, as the > version used in SparkUI had some bugs (e.g. [this > one|https://bugs.eclipse.org/bugs/show_bug.cgi?id=452465]). I bumped Jetty to > a more recent version (March 2017) and built Spark 2.1.0 with that one but > still got the error. > I encountered this issue with Spark 2.1.0 built with Hadoop 2.6 on top of > Mesos. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20323) Calling stop in a transform stage causes the app to hang
[ https://issues.apache.org/jira/browse/SPARK-20323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15967500#comment-15967500 ] Sean Owen commented on SPARK-20323: --- You can't manipulate the context in a transform. > Calling stop in a transform stage causes the app to hang > > > Key: SPARK-20323 > URL: https://issues.apache.org/jira/browse/SPARK-20323 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Andrei Taleanu > > I'm not sure if this is a bug or just the way it needs to happen but I've run > in this issue with the following code: > {noformat} > object ImmortalStreamingJob extends App { > val conf = new SparkConf().setAppName("fun-spark").setMaster("local[*]") > val ssc = new StreamingContext(conf, Seconds(1)) > val elems = (1 to 1000).grouped(10) > .map(seq => ssc.sparkContext.parallelize(seq)) > .toSeq > val stream = ssc.queueStream(mutable.Queue[RDD[Int]](elems: _*)) > val transformed = stream.transform { rdd => > try { > if (Random.nextInt(6) == 5) throw new RuntimeException("boom") > else println("lucky bastard") > rdd > } catch { > case e: Throwable => > println("stopping streaming context", e) > ssc.stop(stopSparkContext = true, stopGracefully = false) > throw e > } > } > transformed.foreachRDD { rdd => > println(rdd.collect().mkString(",")) > } > ssc.start() > ssc.awaitTermination() > } > {noformat} > There are two things I can note here: > * if the exception is thrown in the first transformation (when the first RDD > is processed), the spark context is stopped and the app dies > * if the exception is thrown after at least one RDD has been processed, the > app hangs after printing the error message and never stops > I think there's some sort of deadlock in the second case, is that normal? 
I > also asked this > [here|http://stackoverflow.com/questions/43273783/immortal-spark-streaming-job/43373624#43373624] > but up to this point there's no answer pointing exactly to what happens, > only guidelines.
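Sean Owen's point above ("you can't manipulate the context in a transform") can be sketched in plain Scala: instead of calling stop inside the transform closure, record the failure in a shared slot and let a driver-side monitor stop the context from outside the DStream graph. This is a minimal sketch of the pattern only; the names `stopContext`, `riskyTransform`, and `monitorOnce` are hypothetical stand-ins, not Spark APIs (in a real job `stopContext` would be `ssc.stop(stopSparkContext = true, stopGracefully = false)` called from its own thread).

```scala
import java.util.concurrent.atomic.AtomicReference

// Hypothetical stand-in for StreamingContext.stop(); in a real job this would
// be ssc.stop(stopSparkContext = true, stopGracefully = false).
var stopped = false
def stopContext(): Unit = { stopped = true }

// Shared slot the transform closure writes to instead of calling stop itself.
val fatalError = new AtomicReference[Throwable](null)

// Inside transform: record the error and rethrow; never touch the context here.
def riskyTransform(batch: Seq[Int]): Seq[Int] =
  try {
    if (batch.isEmpty) throw new RuntimeException("boom")
    batch
  } catch {
    case e: Throwable =>
      fatalError.compareAndSet(null, e) // remember only the first failure
      throw e                           // still fail the batch itself
  }

// Driver-side monitor (would run on its own thread alongside awaitTermination):
// it sees the recorded error and stops the context from outside the DAG.
def monitorOnce(): Unit =
  if (fatalError.get() != null) stopContext()
```

The key design point is that the thread that runs the batch job never blocks on its own shutdown, which is the deadlock the reporter appears to be hitting.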
[jira] [Updated] (SPARK-20324) Control itemSets length in PrefixSpan
[ https://issues.apache.org/jira/browse/SPARK-20324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cyril de Vogelaere updated SPARK-20324: --- Description: The idea behind this improvement would be to allow better control over the size of itemSets in solution patterns. For example, assuming you posses a huge dataset of series product bought together, one sequence per client. And you want to find item frequently bough in pairs, as to make interesting promotions to your client or boost certains sales. In the current implementation, all solutions would have to be calculated, before the user can sort through them and select only interesting ones. What i'm proposing here, is the addition of two parameters : First, a maxItemPerItemset parameter which would limit the maximum number of item per itemset to a certain size X. Allowing potential important reduction in the search space, hastening the process of finding theses specific solutions. Second a tandem minItemPerItemset parameter that would limit the minimum number of item per itemset. Discarding solution that do not fit this constraint. Although this wouldn't entail a reduction of the constraint, this should still allow interested user to reduce the number of solutions collected by the driver. If this improvement proposition seems interesting to the community, I will implement a solution along with test to guarantee the correcteness of it's implementation. was: The idea behind this improvement would be to allow better control over the size of itemSets in solution patterns. For example, assuming you posses a huge dataset of series product bought together, one sequence per client. And you want to find item frequently bough in pairs, as to make interesting promotions to your client or boost certains sales. In the current implementation, all solutions would have to be calculated, before the user can sort through them and select only interesting ones. 
What i'm proposing here, is the addition of two parameters : First, a maxItemPerItemset parameter which would limit the maximum number of item per itemset to a certain size X. Allowing potential important reduction in the search space, hastening the process of finding theses specific solutions. Second a tandem minItemPerItemset parameter that would limit the minimum number of item per itemset. Discarding solution that do not fit this constraint. Although this wouldn't entail a reduction of the constraint, this should still allow interested user to reduce the number of solutions collected by the driver. If this solution seems interesting to the community, I will implement a solution along with test to guarantee the correcteness of it's implementation. > Control itemSets length in PrefixSpan > - > > Key: SPARK-20324 > URL: https://issues.apache.org/jira/browse/SPARK-20324 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 2.1.0 >Reporter: Cyril de Vogelaere >Priority: Minor > Original Estimate: 0h > Remaining Estimate: 0h > > The idea behind this improvement would be to allow better control over the > size of itemSets in solution patterns. > For example, assuming you posses a huge dataset of series product bought > together, one sequence per client. And you want to find item frequently bough > in pairs, as to make interesting promotions to your client or boost certains > sales. > In the current implementation, all solutions would have to be calculated, > before the user can sort through them and select only interesting ones. > What i'm proposing here, is the addition of two parameters : > First, a maxItemPerItemset parameter which would limit the maximum number of > item per itemset to a certain size X. Allowing potential important reduction > in the search space, hastening the process of finding theses specific > solutions. > Second a tandem minItemPerItemset parameter that would limit the minimum > number of item per itemset. 
Discarding solutions that do not fit this > constraint. Although this wouldn't reduce the search space, it > should still allow interested users to reduce the number of solutions > collected by the driver. > If this improvement proposition seems interesting to the community, I will > implement a solution along with tests to guarantee the correctness of its > implementation.
[jira] [Created] (SPARK-20324) Control itemSets length in PrefixSpan
Cyril de Vogelaere created SPARK-20324: -- Summary: Control itemSets length in PrefixSpan Key: SPARK-20324 URL: https://issues.apache.org/jira/browse/SPARK-20324 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 2.1.0 Reporter: Cyril de Vogelaere Priority: Minor The idea behind this improvement is to allow better control over the size of itemsets in solution patterns. For example, assume you possess a huge dataset of series of products bought together, one sequence per client, and you want to find items frequently bought in pairs, so as to make interesting promotions to your clients or boost certain sales. In the current implementation, all solutions would have to be calculated before the user can sort through them and select only the interesting ones. What I'm proposing here is the addition of two parameters: First, a maxItemPerItemset parameter which would limit the maximum number of items per itemset to a certain size X, allowing a potentially important reduction of the search space and hastening the process of finding these specific solutions. Second, a tandem minItemPerItemset parameter that would limit the minimum number of items per itemset, discarding solutions that do not fit this constraint. Although this wouldn't reduce the search space, it should still allow interested users to reduce the number of solutions collected by the driver. If this proposal seems interesting to the community, I will implement a solution along with tests to guarantee the correctness of its implementation.
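Until such parameters exist, the effect of minItemPerItemset/maxItemPerItemset on the result set can be approximated by filtering mined patterns after the fact (this does not give the proposal's search-space pruning, which is its main benefit). A minimal sketch on plain Scala collections, with made-up example patterns shaped the way PrefixSpan reports them (a pattern is an ordered sequence of itemsets):

```scala
// Patterns as PrefixSpan would return them: a sequence of itemsets.
// These example patterns are made up for illustration.
val patterns: Seq[Seq[Seq[String]]] = Seq(
  Seq(Seq("a")),                  // single 1-item itemset
  Seq(Seq("a", "b")),             // one pair
  Seq(Seq("a", "b"), Seq("c")),   // a pair followed by a single item
  Seq(Seq("a", "b", "c"))         // one triple
)

// The proposed parameters applied as a post-hoc filter: keep only patterns
// whose every itemset has between minItems and maxItems elements.
def filterByItemsetSize(ps: Seq[Seq[Seq[String]]],
                        minItems: Int, maxItems: Int): Seq[Seq[Seq[String]]] =
  ps.filter(_.forall(is => is.size >= minItems && is.size <= maxItems))
```

With `minItems = maxItems = 2` only the "bought in pairs" pattern from the example survives.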
[jira] [Updated] (SPARK-20321) Spark UI cannot be shutdown in spark streaming app
[ https://issues.apache.org/jira/browse/SPARK-20321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrei Taleanu updated SPARK-20321: --- Description: When an exception thrown the transform stage is handled in foreachRDD and the streaming context is forced to stop, the SparkUI appears to hang and continually dump the following logs in an infinite loop: {noformat} ... 2017-04-12 14:11:13,470 [SparkUI-50-selector-ServerConnectorManager@512d4583/1] DEBUG org.spark_project.jetty.io.SelectorManager - Selector loop woken up from select, 0/0 selected 2017-04-12 14:11:13,470 [SparkUI-50-selector-ServerConnectorManager@512d4583/1] DEBUG org.spark_project.jetty.io.SelectorManager - Selector loop waiting on select 2017-04-12 14:11:13,470 [SparkUI-50-selector-ServerConnectorManager@512d4583/1] DEBUG org.spark_project.jetty.io.SelectorManager - Selector loop woken up from select, 0/0 selected 2017-04-12 14:11:13,470 [SparkUI-50-selector-ServerConnectorManager@512d4583/1] DEBUG org.spark_project.jetty.io.SelectorManager - Selector loop waiting on select ... {noformat} Unfortunately I don't have a minimal example that reproduces this issue but here is what I can share: {noformat} val dstream = pull data from kafka val mapped = dstream.transform { rdd => val data = getData // Perform a call that potentially throws an exception // broadcast the data // flatMap the RDD using the data } mapped.foreachRDD { try { // write some data in a DB } catch { case t: Throwable => dstream.context.stop(stopSparkContext = true, stopGracefully = false) } } mapped.foreachRDD { try { // write data to Kafka // manually checkpoint the Kafka offsets (because I need them in JSON format) } catch { case t: Throwable => dstream.context.stop(stopSparkContext = true, stopGracefully = false) } } {noformat} The issue appears when stop is invoked. At the point when SparkUI is stopped, it enters that infinite loop. 
Initially I thought it relates to Jetty, as the version used in SparkUI had some bugs (e.g. [this one|https://bugs.eclipse.org/bugs/show_bug.cgi?id=452465]). I bumped Jetty to a more recent version (March 2017) and built Spark 2.1.0 with that one but still got the error. I encountered this issue with Spark 2.1.0 built with Hadoop 2.6 on top of Mesos. was: When an exception thrown the transform stage is handled in foreachRDD and the streaming context is forced to stop, the SparkUI appears to hang and continually dump the following logs in an infinite loop: {noformat} ... 2017-04-12 14:11:13,470 [SparkUI-50-selector-ServerConnectorManager@512d4583/1] DEBUG org.spark_project.jetty.io.SelectorManager - Selector loop woken up from select, 0/0 selected 2017-04-12 14:11:13,470 [SparkUI-50-selector-ServerConnectorManager@512d4583/1] DEBUG org.spark_project.jetty.io.SelectorManager - Selector loop waiting on select 2017-04-12 14:11:13,470 [SparkUI-50-selector-ServerConnectorManager@512d4583/1] DEBUG org.spark_project.jetty.io.SelectorManager - Selector loop woken up from select, 0/0 selected 2017-04-12 14:11:13,470 [SparkUI-50-selector-ServerConnectorManager@512d4583/1] DEBUG org.spark_project.jetty.io.SelectorManager - Selector loop waiting on select ... 
{noformat} Unfortunately I don't have a minimal example that reproduces this issue but here is what I can share: {noformat} val dstream = pull data from kafka val mapped = dstream transform { rdd => val data = getData // Perform a call that potentially throws an exception // broadcast the data // flatMap the RDD using the data } mapped.foreachRDD { try { // write some data in a DB } catch { case t: Throwable => dstream.context.stop(stopSparkContext = true, stopGracefully = false) } } mapped.foreachRDD { try { // write data to Kafka // manually checkpoint the Kafka offsets (because I need them in JSON format) } catch { case t: Throwable => dstream.context.stop(stopSparkContext = true, stopGracefully = false) } } {noformat} The issue appears when stop is invoked. At the point when SparkUI is stopped, it enters that infinite loop. Initially I thought it relates to Jetty, as the version used in SparkUI had some bugs (e.g. [this one|https://bugs.eclipse.org/bugs/show_bug.cgi?id=452465]). I bumped Jetty to a more recent version (March 2017) but still got the error. I encountered this issue with Spark 2.1.0 built with Hadoop 2.6 on top of Mesos. > Spark UI cannot be shutdown in spark streaming app > -- > > Key: SPARK-20321 > URL: https://issues.apache.org/jira/browse/SPARK-20321 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Andrei Taleanu > > When an exception thrown the transform stage is handled in foreachRDD and the >
[jira] [Created] (SPARK-20323) Calling stop in a transform stage causes the app to hang
Andrei Taleanu created SPARK-20323: -- Summary: Calling stop in a transform stage causes the app to hang Key: SPARK-20323 URL: https://issues.apache.org/jira/browse/SPARK-20323 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.1.0 Reporter: Andrei Taleanu I'm not sure if this is a bug or just the way it needs to happen but I've run in this issue with the following code: {noformat} object ImmortalStreamingJob extends App { val conf = new SparkConf().setAppName("fun-spark").setMaster("local[*]") val ssc = new StreamingContext(conf, Seconds(1)) val elems = (1 to 1000).grouped(10) .map(seq => ssc.sparkContext.parallelize(seq)) .toSeq val stream = ssc.queueStream(mutable.Queue[RDD[Int]](elems: _*)) val transformed = stream.transform { rdd => try { if (Random.nextInt(6) == 5) throw new RuntimeException("boom") else println("lucky bastard") rdd } catch { case e: Throwable => println("stopping streaming context", e) ssc.stop(stopSparkContext = true, stopGracefully = false) throw e } } transformed.foreachRDD { rdd => println(rdd.collect().mkString(",")) } ssc.start() ssc.awaitTermination() } {noformat} There are two things I can note here: * if the exception is thrown in the first transformation (when the first RDD is processed), the spark context is stopped and the app dies * if the exception is thrown after at least one RDD has been processed, the app hangs after printing the error message and never stops I think there's some sort of deadlock in the second case, is that normal? I also asked this [here|http://stackoverflow.com/questions/43273783/immortal-spark-streaming-job/43373624#43373624] but up two this point there's no answer pointing exactly to what happens, only guidelines. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20322) MinPattern length in PrefixSpan
Cyril de Vogelaere created SPARK-20322: -- Summary: MinPattern length in PrefixSpan Key: SPARK-20322 URL: https://issues.apache.org/jira/browse/SPARK-20322 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 2.1.0 Reporter: Cyril de Vogelaere Priority: Trivial The current implementation already provides a maxPatternLength parameter. It might be nice to add a similar minPatternLength parameter. This would allow users to better control the solutions received, and lower the amount of memory needed to store the solutions in the driver. This can be implemented as a simple check before adding a solution to the solution list; no difference in performance should be observable. I'm proposing this functionality here; if it looks interesting to other people, I will implement it along with tests to verify its implementation.
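The proposed check can be sketched on plain collections: count a pattern's length as its total number of items (the same way maxPatternLength counts it) and discard anything shorter before it reaches the driver-side result list. The example data below is illustrative only:

```scala
// Frequent sequential patterns with their frequencies, shaped as PrefixSpan
// reports them: (sequence of itemsets, support count). Made-up data.
val found: Seq[(Seq[Seq[Int]], Long)] = Seq(
  (Seq(Seq(1)), 10L),              // length 1
  (Seq(Seq(1), Seq(2)), 7L),       // length 2
  (Seq(Seq(1, 2), Seq(3)), 5L)     // length 3
)

// Pattern length = total number of items across all itemsets; the proposed
// minPatternLength would simply discard shorter patterns on collection.
def keepAtLeast(minPatternLength: Int): Seq[(Seq[Seq[Int]], Long)] =
  found.filter { case (pattern, _) => pattern.map(_.size).sum >= minPatternLength }
```

Because the filter runs only at collection time, mining cost is unchanged, which matches the reporter's claim that no performance difference should be observable.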
[jira] [Updated] (SPARK-20322) MinPattern length in PrefixSpan
[ https://issues.apache.org/jira/browse/SPARK-20322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cyril de Vogelaere updated SPARK-20322: --- Description: The current implementation already diposes of a maxPatternLength parameter. It might be nice to add a similar minPatternLength parameter. This would allow user to better control the solutions received, and lower the amount of memory needed to store the solutions in the driver. This can be implemented as a simple check before adding a solution to the solution list, no difference in performance should be observable. I'm proposing this fonctionality here, if it looks interesting to other people, I will implement it along with tests to verify it's implementation. was: The current implementation already diposes of a maxPatternLength parameter. It might be nice to add a similar minPatternLength parameter. This would allow user to better control the solutions received, and lower the amount of memory needed to store the solutions in the driver. This can be implemented as a simple check before adding a solution to the solution list, no difference in performance should be observable. I'm proposing this fonctionality here, if it looks interesting to other people, I will implement it along with tests to verify it's implementation. > MinPattern length in PrefixSpan > --- > > Key: SPARK-20322 > URL: https://issues.apache.org/jira/browse/SPARK-20322 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 2.1.0 >Reporter: Cyril de Vogelaere >Priority: Trivial > Original Estimate: 0h > Remaining Estimate: 0h > > The current implementation already diposes of a maxPatternLength parameter. > It might be nice to add a similar minPatternLength parameter. > This would allow user to better control the solutions received, and lower the > amount of memory needed to store the solutions in the driver. 
> This can be implemented as a simple check before adding a solution to the > solution list; no difference in performance should be observable. > I'm proposing this functionality here; if it looks interesting to other > people, I will implement it along with tests to verify its implementation.
[jira] [Created] (SPARK-20321) Spark UI cannot be shutdown in spark streaming app
Andrei Taleanu created SPARK-20321: -- Summary: Spark UI cannot be shutdown in spark streaming app Key: SPARK-20321 URL: https://issues.apache.org/jira/browse/SPARK-20321 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.1.0 Reporter: Andrei Taleanu When an exception thrown the transform stage is handled in foreachRDD and the streaming context is forced to stop, the SparkUI appears to hang and continually dump the following logs in an infinite loop: {noformat} ... 2017-04-12 14:11:13,470 [SparkUI-50-selector-ServerConnectorManager@512d4583/1] DEBUG org.spark_project.jetty.io.SelectorManager - Selector loop woken up from select, 0/0 selected 2017-04-12 14:11:13,470 [SparkUI-50-selector-ServerConnectorManager@512d4583/1] DEBUG org.spark_project.jetty.io.SelectorManager - Selector loop waiting on select 2017-04-12 14:11:13,470 [SparkUI-50-selector-ServerConnectorManager@512d4583/1] DEBUG org.spark_project.jetty.io.SelectorManager - Selector loop woken up from select, 0/0 selected 2017-04-12 14:11:13,470 [SparkUI-50-selector-ServerConnectorManager@512d4583/1] DEBUG org.spark_project.jetty.io.SelectorManager - Selector loop waiting on select ... {noformat} Unfortunately I don't have a minimal example that reproduces this issue but here is what I can share: {noformat} val dstream = pull data from kafka val mapped = dstream transform { rdd => val data = getData // Perform a call that potentially throws an exception // broadcast the data // flatMap the RDD using the data } mapped.foreachRDD { try { // write some data in a DB } catch { case t: Throwable => dstream.context.stop(stopSparkContext = true, stopGracefully = false) } } mapped.foreachRDD { try { // write data to Kafka // manually checkpoint the Kafka offsets (because I need them in JSON format) } catch { case t: Throwable => dstream.context.stop(stopSparkContext = true, stopGracefully = false) } } {noformat} The issue appears when stop is invoked. 
At the point when SparkUI is stopped, it enters that infinite loop. Initially I thought it relates to Jetty, as the version used in SparkUI had some bugs (e.g. [this one|https://bugs.eclipse.org/bugs/show_bug.cgi?id=452465]). I bumped Jetty to a more recent version (March 2017) but still got the error. I encountered this issue with Spark 2.1.0 built with Hadoop 2.6 on top of Mesos.
[jira] [Updated] (SPARK-19924) Handle InvocationTargetException for all Hive Shim
[ https://issues.apache.org/jira/browse/SPARK-19924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-19924: Fix Version/s: 2.1.1 > Handle InvocationTargetException for all Hive Shim > -- > > Key: SPARK-19924 > URL: https://issues.apache.org/jira/browse/SPARK-19924 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.0 >Reporter: Xiao Li >Assignee: Xiao Li > Fix For: 2.1.1, 2.2.0 > > > Since we are using a shim for most Hive metastore APIs, the exceptions thrown > by the underlying method of Method.invoke() are wrapped in > `InvocationTargetException`. Instead of handling them one by one, we should handle > all of them in `withClient`. If any of them is missing, the error message > could look unfriendly.
[jira] [Commented] (SPARK-20081) RandomForestClassifier doesn't seem to support more than 100 labels
[ https://issues.apache.org/jira/browse/SPARK-20081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15967403#comment-15967403 ] Christian Reiniger commented on SPARK-20081: Thanks for the information, but as you already suspected it doesn't work. I'll look into using StringIndexer, but that change takes more time and the performance costs might be too high. > RandomForestClassifier doesn't seem to support more than 100 labels > --- > > Key: SPARK-20081 > URL: https://issues.apache.org/jira/browse/SPARK-20081 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 2.1.0 > Environment: Java >Reporter: Christian Reiniger > > When feeding data with more than 100 labels into RanfomForestClassifer#fit() > (from java code), I get the following error message: > {code} > Classifier inferred 143 from label values in column > rfc_df0e968db9df__labelCol, but this exceeded the max numClasses (100) > allowed to be inferred from values. > To avoid this error for labels with > 100 classes, specify numClasses > explicitly in the metadata; this can be done by applying StringIndexer to the > label column. > {code} > Setting "numClasses" in the metadata for the label column doesn't make a > difference. Looking at the code, this is not surprising, since > MetadataUtils.getNumClasses() ignores this setting: > {code:language=scala} > def getNumClasses(labelSchema: StructField): Option[Int] = { > Attribute.fromStructField(labelSchema) match { > case binAttr: BinaryAttribute => Some(2) > case nomAttr: NominalAttribute => nomAttr.getNumValues > case _: NumericAttribute | UnresolvedAttribute => None > } > } > {code} > The alternative would be to pass a proper "maxNumClasses" parameter to the > classifier, so that Classifier#getNumClasses() allows a larger number of > auto-detected labels. 
However, RandomForestClassifier#train() calls > #getNumClasses without the "maxNumClasses" parameter, causing it to use the > default of 100: > {code:language=scala} > override protected def train(dataset: Dataset[_]): > RandomForestClassificationModel = { > val categoricalFeatures: Map[Int, Int] = > MetadataUtils.getCategoricalFeatures(dataset.schema($(featuresCol))) > val numClasses: Int = getNumClasses(dataset) > // ... > {code} > My Scala skills are pretty sketchy, so please correct me if I misinterpreted > something. But as it seems right now, there is no way to learn from data with > more than 100 labels via RandomForestClassifier.
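For context, here is what applying StringIndexer to the label column effectively achieves: distinct labels are mapped to dense indices 0..k-1, so numClasses is known exactly from metadata instead of being inferred (and capped at 100) from raw label values. A minimal plain-Scala sketch of that mapping; the frequency-descending order mimics StringIndexer's default, while the alphabetical tie-break here is an assumption of this sketch:

```scala
// Raw labels as they might appear in a label column. Made-up example data.
val rawLabels: Seq[String] = Seq("cat", "dog", "cat", "bird", "dog", "cat")

// Map each distinct label to a dense index 0..k-1, most frequent first
// (ties broken alphabetically in this sketch).
val index: Map[String, Double] = rawLabels
  .groupBy(identity).toSeq
  .sortBy { case (label, occurrences) => (-occurrences.size, label) }
  .map(_._1)
  .zipWithIndex
  .map { case (label, i) => label -> i.toDouble }
  .toMap

// The re-encoded label column, plus the now-explicit class count.
val indexed: Seq[Double] = rawLabels.map(index)
val numClasses: Int = index.size
```

With the count carried this way, a classifier never has to guess the number of classes from the values themselves.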
[jira] [Updated] (SPARK-20320) AnalysisException: Columns of grouping_id (count(value#17L)) does not match grouping columns (count(value#17L))
[ https://issues.apache.org/jira/browse/SPARK-20320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jacek Laskowski updated SPARK-20320:

Shepherd: Davies Liu (was: Herman van Hovell)

Description: I'm not questioning the {{AnalysisException}} itself (I don't know whether it should be reported or not), but the exception message, which tells... nothing helpful.

{code}
val records = spark.range(5).flatMap(n => Seq.fill(n.toInt)(n))

scala> records.cube(count("value")).agg(grouping_id(count("value"))).queryExecution.logical
org.apache.spark.sql.AnalysisException: Columns of grouping_id (count(value#17L)) does not match grouping columns (count(value#17L));
  at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGroupingAnalytics$$anonfun$org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveGroupingAnalytics$$replaceGroupingFunc$1.applyOrElse(Analyzer.scala:313)
  at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGroupingAnalytics$$anonfun$org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveGroupingAnalytics$$replaceGroupingFunc$1.applyOrElse(Analyzer.scala:308)
{code}

> AnalysisException: Columns of grouping_id (count(value#17L)) does not match
> grouping columns (count(value#17L))
> ---
>
> Key: SPARK-20320
> URL: https://issues.apache.org/jira/browse/SPARK-20320
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.2.0
> Reporter: Jacek Laskowski
> Priority: Minor
[jira] [Created] (SPARK-20320) AnalysisException: Columns of grouping_id (count(value#17L)) does not match grouping columns (count(value#17L))
Jacek Laskowski created SPARK-20320:
---
Summary: AnalysisException: Columns of grouping_id (count(value#17L)) does not match grouping columns (count(value#17L))
Key: SPARK-20320
URL: https://issues.apache.org/jira/browse/SPARK-20320
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 2.2.0
Reporter: Jacek Laskowski
Priority: Minor
[jira] [Comment Edited] (SPARK-20081) RandomForestClassifier doesn't seem to support more than 100 labels
[ https://issues.apache.org/jira/browse/SPARK-20081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15967347#comment-15967347 ] Yan Facai (颜发才) edited comment on SPARK-20081 at 4/13/17 9:42 AM:

Yes, you should use `builder.putLong("num_vals", numClasses).putString("type", "nominal")`. A little hacky, and it might not work. I am not familiar with the Metadata and Attribute classes at present; some experts may have a better solution, but unfortunately I have no idea. If you'd like to dig deeper, see:
org.apache.spark.ml.attribute.Attribute
org.apache.spark.ml.attribute.NominalAttribute

Using StringIndexer on your label column should work well, I guess, since it takes care of the metadata itself.

was (Author: facai): Yes, you should use `builder.putLong("num_vals", numClasses).putString("type", "nominal")`. A little hacky, and it might not work. I am not familiar with the Metadata and Attribute classes at present; some experts may have a better solution, but unfortunately I have no idea. Using StringIndexer on your label column should work well, I guess, since it takes care of the metadata itself.
[jira] [Comment Edited] (SPARK-20081) RandomForestClassifier doesn't seem to support more than 100 labels
[ https://issues.apache.org/jira/browse/SPARK-20081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15967347#comment-15967347 ] Yan Facai (颜发才) edited comment on SPARK-20081 at 4/13/17 9:40 AM:

Yes, you should use `builder.putLong("num_vals", numClasses).putString("type", "nominal")`. A little hacky, and it might not work. I am not familiar with the Metadata and Attribute classes at present; some experts may have a better solution, but unfortunately I have no idea. Using StringIndexer on your label column should work well, I guess, since it takes care of the metadata itself.

was (Author: facai): Yes, you should use `builder.putLong("num_vals", numClasses)`. A little hacky, and it might not work. I am not familiar with the Metadata and Attribute classes at present; some experts may have a better solution, but unfortunately I have no idea. Using StringIndexer on your label column should work well, I guess, since it takes care of the metadata itself.
[jira] [Commented] (SPARK-20081) RandomForestClassifier doesn't seem to support more than 100 labels
[ https://issues.apache.org/jira/browse/SPARK-20081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15967347#comment-15967347 ] Yan Facai (颜发才) commented on SPARK-20081:

Yes, you should use `builder.putLong("num_vals", numClasses)`. A little hacky, and it might not work. I am not familiar with the Metadata and Attribute classes at present; some experts may have a better solution, but unfortunately I have no idea. Using StringIndexer on your label column should work well, I guess, since it takes care of the metadata itself.
[jira] [Commented] (SPARK-20081) RandomForestClassifier doesn't seem to support more than 100 labels
[ https://issues.apache.org/jira/browse/SPARK-20081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15967328#comment-15967328 ] Christian Reiniger commented on SPARK-20081:

Thanks for the feedback. We had to implement indexing ourselves, since StringIndexer only operates on a single input column and we have potentially hundreds of columns that need to be indexed. Applying StringIndexer sequentially to them would be (and was) a performance disaster. We could of course use StringIndexer for the label column only, but if we can keep our all-in-one indexer, that would be preferable. I'm also suspicious of whether using StringIndexer would help at all -- it may set numClasses in the metadata, but RandomForestClassifier doesn't seem to even look at it (as described above).

On the other hand -- rereading your comment, I get the suspicion that I've been misled by the error message. Do you mean that "specify numClasses explicitly in the metadata" does *not* mean setting the key "numClasses" in the column metadata, as in this snippet:

{code:language=java}
MetadataBuilder builder = new MetadataBuilder();
builder.putLong("numClasses", numClasses); // <--
StructField labelField =
    DataTypes.createStructField("label", DataTypes.DoubleType, true, builder.build()); // <--
StructField featuresField =
    DataTypes.createStructField("features", new VectorUDT(), true);
StructField[] fields = new StructField[] { labelField, featuresField };
StructType schema = DataTypes.createStructType(fields);
{code}

... and that instead {{nomAttr.getNumValues}} is what's meant? And that this value can only be set inside some internal mllib classes?
[jira] [Assigned] (SPARK-20319) Already quoted identifiers are getting wrapped with additional quotes
[ https://issues.apache.org/jira/browse/SPARK-20319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20319:

Assignee: Apache Spark

> Already quoted identifiers are getting wrapped with additional quotes
> -
>
> Key: SPARK-20319
> URL: https://issues.apache.org/jira/browse/SPARK-20319
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.0, 2.1.0
> Reporter: Umesh Chaudhary
> Assignee: Apache Spark
>
> The issue was caused by
> [SPARK-16387|https://issues.apache.org/jira/browse/SPARK-16387], where
> reserved SQL words are honored by wrapping quotes around column names.
> In our test we found that when quotes are explicitly wrapped around column
> names, the Oracle JDBC driver throws:
> java.sql.BatchUpdateException: ORA-01741: illegal zero-length identifier
>   at oracle.jdbc.driver.OraclePreparedStatement.executeBatch(OraclePreparedStatement.java:12296)
>   at oracle.jdbc.driver.OracleStatementWrapper.executeBatch(OracleStatementWrapper.java:246)
>   at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:597)
> and the Cassandra JDBC driver throws:
> 17/04/12 19:03:48 ERROR executor.Executor: Exception in task 0.0 in stage 5.0 (TID 6)
> java.sql.SQLSyntaxErrorException: [FMWGEN][Cassandra JDBC Driver][Cassandra]syntax error or access rule violation: base table or view not found:
>   at weblogic.jdbc.cassandrabase.ddcl.b(Unknown Source)
>   at weblogic.jdbc.cassandrabase.ddt.a(Unknown Source)
>   at weblogic.jdbc.cassandrabase.BaseConnection.prepareStatement(Unknown Source)
>   at weblogic.jdbc.cassandrabase.BaseConnection.prepareStatement(Unknown Source)
>   at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.insertStatement(JdbcUtils.scala:118)
>   at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:571)
> CC: [~rxin] , [~joshrosen]
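The double-quoting failure described in SPARK-20319 is easy to reproduce in isolation. The sketch below is illustrative Java, not Spark's JdbcUtils code; the method names are hypothetical. A naive quoting rule blindly wraps every column name in double quotes, so an identifier the user already quoted becomes `""name""`, which Oracle parses as containing a zero-length identifier (ORA-01741). A guarded, idempotent variant avoids this.

```java
public class QuoteSketch {
    // Naive quoting as described in the report: always wrap the name in quotes,
    // even if the caller already quoted it themselves.
    static String naiveQuote(String colName) {
        return "\"" + colName + "\"";
    }

    // Hypothetical fix: quoting is made idempotent, so an already-quoted
    // identifier is passed through unchanged.
    static String idempotentQuote(String colName) {
        if (colName.length() >= 2
                && colName.startsWith("\"") && colName.endsWith("\"")) {
            return colName;
        }
        return naiveQuote(colName);
    }

    public static void main(String[] args) {
        // The user quoted a reserved word themselves.
        String alreadyQuoted = "\"order\"";
        System.out.println(naiveQuote(alreadyQuoted));      // ""order"" -> illegal zero-length identifier
        System.out.println(idempotentQuote(alreadyQuoted)); // "order"
        System.out.println(idempotentQuote("price"));       // "price"
    }
}
```

A real dialect implementation would also need to handle embedded quote characters, but the idempotence check is the part this issue is about.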
[jira] [Commented] (SPARK-20319) Already quoted identifiers are getting wrapped with additional quotes
[ https://issues.apache.org/jira/browse/SPARK-20319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15967318#comment-15967318 ] Apache Spark commented on SPARK-20319:

User 'umesh9794' has created a pull request for this issue: https://github.com/apache/spark/pull/17631
[jira] [Assigned] (SPARK-20319) Already quoted identifiers are getting wrapped with additional quotes
[ https://issues.apache.org/jira/browse/SPARK-20319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20319:

Assignee: (was: Apache Spark)
[jira] [Resolved] (SPARK-20284) Make SerializationStream and DeserializationStream extend Closeable
[ https://issues.apache.org/jira/browse/SPARK-20284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-20284.
---
Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 17598
[https://github.com/apache/spark/pull/17598]

> Make SerializationStream and DeserializationStream extend Closeable
> ---
>
> Key: SPARK-20284
> URL: https://issues.apache.org/jira/browse/SPARK-20284
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 1.6.3, 2.1.0
> Reporter: Sergei Lebedev
> Priority: Trivial
> Fix For: 2.2.0
>
> Both {{SerializationStream}} and {{DeserializationStream}} implement
> {{close}} but do not extend {{Closeable}}. As a result, these streams cannot
> be used in try-with-resources.
> Was this intentional, or has nobody ever needed it?
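The improvement in SPARK-20284 is mechanical: a class can define a `close()` method, but try-with-resources only accepts types that declare `java.io.Closeable` (or `AutoCloseable`). A minimal standalone sketch, using a hypothetical stand-in class rather than Spark's actual stream types:

```java
import java.io.Closeable;

public class CloseableSketch {
    // Hypothetical stand-in for SerializationStream: it already had close(),
    // and declaring Closeable is what makes try-with-resources accept it.
    static class FakeSerializationStream implements Closeable {
        boolean closed = false;
        void writeObject(Object o) { /* would serialize o somewhere */ }
        @Override public void close() { closed = true; }
    }

    public static void main(String[] args) {
        FakeSerializationStream ref = null;
        // This compiles only because FakeSerializationStream implements
        // Closeable; close() is invoked automatically when the block exits.
        try (FakeSerializationStream s = new FakeSerializationStream()) {
            ref = s;
            s.writeObject("payload");
        }
        System.out.println(ref.closed); // prints true
    }
}
```

Since `close()` was already implemented on both stream types, adding `extends Closeable` changes no behavior; it only lets callers use the resource-management syntax.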
[jira] [Assigned] (SPARK-20284) Make SerializationStream and DeserializationStream extend Closeable
[ https://issues.apache.org/jira/browse/SPARK-20284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-20284:
-
Assignee: Sergei Lebedev